One, Two, Some, Many: Search Results & Meaning
When is a number not meaningful—even though it may be correct?
Many times. That’s a whole series of commentaries, some of which I’ve written in the past. How bad is the numeracy problem? Bad enough and, I believe, getting worse. Far too many survey-based statistics use more precision than can be justified; far too many comparisons use percentages in ways that are true but misleading.
This essay isn’t about the general problem. It’s about reported search result counts and their meaning. More specifically, it’s about reported result counts from open web search engines (Google, Yahoo!, Live Search and others) and the uses of such result counts.
Here’s the full title for this perspective:
One, two, three, ten, some; one hundred, nine hundred, one thousand, many: The useful numbers for web search results.
That’s partly a reference to languages that lack extended counting systems, because those languages sometimes get it right. For many purposes, “one, two, three…ten…lots” is in line with what we perceive. At higher magnitudes that’s fairly clear. Can you visualize the difference between ten and 100? Probably. The difference between 100 and 1,000? Less clearly, I suspect: If I ask you how many houses are in your neighborhood, can you say confidently whether it’s closer to 100 or 1,000?
How about the difference between 1,000 and 10,000? Or 10,000 and 100,000? Or 100,000 and one million? Do you have any way to visualize those differences in real-world terms? Really? Orders of magnitude are hard once you get past “some.”
I digress. My focus is on open web search engine result counts—because such counts are so often used as though they were meaningful and as though comparisons among counts had meaning.
That’s the problem. Search engine result counts are stated as though they had absolute and comparative meaning, even when they frequently don’t—or at least don’t in any verifiable way.
The problem arises because of laziness, but also because LexisNexis result size counts were frequently used as indicators of current popularity. You know the drill: If “Britney Crenshaw” shows 1,500 hits for the last six months in LexisNexis and “Florence Casaba” shows 2,500 hits for the same period, Florence is a lot more popular in the media than Britney.
That was also lazy journalism, but based on a truth of sorts. There almost certainly were 1,500 items mentioning Britney Crenshaw and 2,500 items mentioning Florence Casaba during the six-month period—even if the items might be repetitive. A dedicated reporter could look at all 1,500 and 2,500 items. You could reasonably assert the following for the universe as covered in LexisNexis:
· For the period in question, Florence Casaba was mentioned in about 67% more articles than Britney Crenshaw.
· For the period in question, Florence Casaba was mentioned in 2,500 articles.
· For the period in question, Britney Crenshaw was mentioned in 1,500 articles.
· Florence Casaba was a higher-profile personality during that period than Britney Crenshaw.
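The arithmetic behind the first of those statements is simple enough to check. Here is a minimal sketch, using the made-up LexisNexis counts from this example; the function name percent_more is my own, purely for illustration.

```python
def percent_more(larger: int, smaller: int) -> float:
    """Return by how many percent `larger` exceeds `smaller`."""
    return (larger - smaller) / smaller * 100


# Hypothetical six-month LexisNexis hit counts from the example above.
florence_hits = 2_500
britney_hits = 1_500

print(f"Florence: {percent_more(florence_hits, britney_hits):.1f}% more articles")
```

With those counts the difference works out to two-thirds more, which is a claim you could actually verify by reading all 4,000 items.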
As it happens, both those searches yield Google results—“about 116,000” for Britney Crenshaw and “about 3,880” for Florence Casaba. (I made up the two names, but you know how it is…) Which of these statements is likely to be true for the universe of Google-indexed web pages?
· Britney Crenshaw is mentioned in 30 times as many web pages as Florence Casaba.
· Britney Crenshaw is mentioned in roughly 116,000 web pages.
· Florence Casaba is mentioned in roughly 3,880 web pages.
· Britney Crenshaw has a much higher profile than Florence Casaba.
My answer? At most one of those four statements is likely to be true—and none is verifiable. The fourth statement may be true, or it may not. The first three? The nature of web search engine result counts and the character of web search engine results combine to make it impossible to verify those statements, although it is sometimes possible to suggest that they’re false.
Let’s go through this example in more depth before moving on to “real” examples.
First admission: I used word searches, as do most politicians and demagogues trying to make claims based on web searches (e.g., “there are forty-six million free porn web sites” based on searching the words “free” and “porn” in Google). I didn’t surround the names with quotes.
Never mind. That’s what most people do, isn’t it?
Let’s look at those 116,000 web sites. I have Google set to show 100 results per page. I’ll click on the “10” at the bottom, to get to the end of the first 1,000 results.
Whoops. It isn’t showing results 901-1,000; it’s showing results 501-591 of “about 125,000.” Suddenly there are 9,000 more results—but I can only see 591 of them. Why is that?
You know Google’s first explanation:
In order to show you the most relevant results, we have omitted some entries very similar to the 591 already displayed.
If you like, you can repeat the search with the omitted results included.
Sounds good to me. Let’s do that.
Click on “10” at the bottom. I get “Results 601-604 of about 115,000.” In other words, by including the “omitted results” I can see 13 more items—but Google seems to think the total number has gone down. And, oddly, the “omitted some entries” footer still appears—but clicking the link again produces the same 604 (of 115,000) results.
What of the other 114,400 or so? Do they exist? You can’t prove it by me (filtering was off). Why does Google claim such a high number? Your guess is as good as mine. Maybe there’s an index node for the join of Britney and Crenshaw that yields 115,000 hits, which could include many occurrences in the same documents. Maybe something else is happening. Maybe the number has no relation to reality.
There is a clear, verifiable, clean result for the phrase “Britney Crenshaw”—because it’s an obscure name. It yields eight results at first, 13 with “very similar” results included—and you can see all 13 results. (Or not: Checking a few, there were parked pages, 404s, duplicates, redirects…but that’s the reality of the web. Two identical pages were attempts to find this person, most of the rest appeared to be video pages where “Britney” and “Crenshaw” only incidentally appear together in a set of tags.)
So: There may be anywhere from zero to 13 Google-indexed web pages that mention Britney Crenshaw…a slight reduction from 115,000.
Note this: If there were more than 1,000 distinct Google-indexed web pages mentioning, say, a different Britney, you wouldn’t be able to verify how many there are. Google won’t show you more than 1,000 results. Period. End of discussion. You can try to get more by refining the results in various ways, but that only goes so far. (Incidentally, unless you include very similar pages, that other Britney only shows 733 pages out of, oh, about 115 million.)
Remember poor Florence? She had a mere 3,880 results, one-thirtieth as many as Britney. Ah, but I had to break the writing process for this article over several days—and on the next day, most of the results magically vanished, leaving a mere 928.
Or did they? Here’s another interesting aspect of web search engines, especially Google:
· Repeating a search doesn’t necessarily yield the same results even within the same hour.
The first time, I used the Firefox search box (with Google as the selected engine) to search for Florence Casaba. The second time, I searched directly within Google. Where did the other 2,952 items go? Into some phase warp between direct and indirect Google searching: Searching via the Firefox search box, even when you’re on Google, yields 3,880.
At this point, I’m reminded that the Rev. Charles Lutwidge Dodgson was a mathematician and logician. I’m sure he’d appreciate the extent to which I feel as though I’m looking at a table with little cakes saying “search me” and wondering whether I’ll disappear into the woodwork or grow suddenly larger… I couldn’t recreate the problem (but I’ve seen it before), so I’ll stick with direct searches from now on. Which, done again, go back to 3,880.
If you start with 3,880, you wind up with 617—more than for Britney Crenshaw, not one-thirtieth as many. Redo the search with similar results included, and you get 877 of about 935—nearly half again as many as Crenshaw. Mysterious enough?
But what of a phrase search? That’s a disappointment. At this writing, there are no results for the phrase “florence casaba” in Google—although that will presumably cease to be true shortly after this issue of Cites & Insights appears.
It may be that there are no web pages mentioning either of these two names except indirectly or as search terms. There certainly aren’t many. As for the two-word combinations, I’m not sure what you can conclude from a situation where Google says one word pair appears thirty times as often as another one—but when you ask to see the records, Google shows you nearly half again as many for the second word pair as for the first. Which word pair actually appears more often?
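To put the two ratios side by side, here is a rough sketch using the Google figures quoted above for the word searches (similar results included). The SearchCount class and its field names are my own invention for illustration; nothing here queries a search engine.

```python
from dataclasses import dataclass


@dataclass
class SearchCount:
    """One search: the count the engine claims vs. the records you can page through."""
    query: str
    claimed: int   # the "about N" figure on the first results page
    viewable: int  # last record actually reachable, similar results included


# Google figures quoted in this essay (word searches, no quotation marks).
britney = SearchCount("britney crenshaw", claimed=116_000, viewable=604)
florence = SearchCount("florence casaba", claimed=3_880, viewable=877)

claimed_ratio = britney.claimed / florence.claimed
viewable_ratio = florence.viewable / britney.viewable

print(f"Claimed:  Britney has {claimed_ratio:.1f}x Florence's count")
print(f"Viewable: Florence has {viewable_ratio:.2f}x Britney's count")
```

The claimed counts favor Britney by roughly 30 to 1; the records you can actually look at favor Florence by nearly half again as many. Same searches, opposite conclusions.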
Let’s not forget the other search engines. For this article I’ll limit that to the two major competitors.
Yahoo! shows 139,000 (no “about”) for Britney Crenshaw (as words), and shows its maximum of 1,000 actual pages. Repeating the search with similar results included yields 140,000—but, of course, you still can’t show more than 1,000. The phrase “Britney Crenshaw” shows “4 of 7” or “4 of 15” with similar results included (not clear why I can’t see the other three or 11 records)—and it appears that one page, from ESPN, actually does have that name, a misspelled version of Brittney Crenshaw, a basketball player at Florida Atlantic University. As for poor Florence, there are 8,990 records for the two words, turning into 545 of 1,870 when you try to show all of them—and with omitted results, that starts out as 8,640 and becomes 1,820, hitting the 1,000-viewable limit. For the phrase? For once, the two engines agree: Zero.
Live Search says there are 81,000 results for Britney Crenshaw (as words), and hits its 1,000-page limit, at that point saying there are 1,000 results (but only when you attempt to go beyond record 999; otherwise, on the 20th page of 50 results each, it shows 80,800 results). Oddly, Live Search shows higher page numbers to click on (e.g., up to 24 50-result pages), but won’t go past 1,000 (page 20). The phrase “Britney Crenshaw” yields five results, not including the legit ESPN typo. As for Florence, you get an astonishing 909,000 results with an odd note: “Results are included for other related terms. Show just the results for florence casaba.” That yields 3,350. Your guess as to what’s happening is as good as mine (or that of any other naïve searcher). The smaller set quickly turns into 653 viewable records. The larger set hits the 1,000-record limit, showing 655,000 results at that point. (It appears that Live Search is doing some slightly crazed stemming in the initial search, as results show “Canary” as a highlighted word in many of the results. Maybe you can find a relationship between Canary and Casaba; damned if I can.) As a phrase, “Florence Casaba” yields nothing: Once again, agreement!
I’ve thought about doing more numeracy columns for a while—and a number of occasions have arisen where Google’s results have caused me and others some confusion. But a specific incident pushed this article from idea to epaper.
Tim Spalding posted “Getting real: Libraries are missing books” on March 26, 2008 at the Thingology blog (www.librarything.com/thingology). He takes libraries, especially academic libraries, to task for not buying Jason Fried’s Getting Real: The smarter, faster, easier way to build a successful web application. Getting Real is published through Lulu and was, at the time, the sixth best seller there. That could mean several hundred sales, but probably means a few thousand. The earlier PDF version supposedly sold 30,000 copies. Worldcat.org showed three libraries holding the book in March 2008; there are five or six in mid-June 2008.
One of Spalding’s arguments for the significance of the book: “Google records 166,000 mentions.”
His more general argument is that libraries are ignoring Lulu books (generally true) and Lulu isn’t all crap (also true)—and he can’t see how libraries could miss a book as important as Getting Real.
A number of comments offered reasons libraries don’t and can’t cope with Lulu and its flood of unreviewed, unedited books. I added this comment:
Speaking as a Lulu user (five books) and library person, two notes:
1. Lulu isn't a publisher. It's a service supplier/fulfillment agency. (Technically, if you buy a Lulu ISBN, then it's the publisher of record, but that's all.) What that means--apart from Lulu books being self-published, not vanity press books (a distinction I'd like to see more people make!), is that it's entirely up to the author to do publicity, send out review copies, etc., etc.
2. With more than 150,000 new titles published each year, it's hardly surprising that libraries don't pick up most titles that aren't reviewed in some form--and most Lulu titles don't show up in review media.
My top Lulu title shows 44 library copies at Worldcat. That's without print reviews, but with fairly strong name recognition among librarians. Still, I'd bet that roughly zero of those sales came from people browsing Lulu...
I'm a little surprised that a book was sufficiently well-publicized to sell 30,000 copies through Lulu and only make it into three libraries. That sounds unlikely, frankly, but I certainly won't argue with the author's reporting on sales... Still, with that many sales, you'd expect (a) that the author would spring for an ISBN, which would improve visibility and orderability, (b) that the author might even spring for Ingram distribution, which would make it easier to buy...and probably get it into Amazon.
Well, and maybe (c) choose a primary title that doesn't duplicate so many other books!
I misread Spalding’s post on one item: He didn’t claim 30,000 Lulu sales. But I was certainly not ready for Spalding’s response, which began: “What you're missing here is that someone like Fried doesn't care about your ISBNs and he doesn't care about your Ingram. They are irrelevant to him and to much of his audience,” continued by touting 37signals’ blog’s Technorati rating and noting “I doubt if there’s a librarian blog in the world with half that,” and tossing in another comment about not needing ISBNs.
I responded once more, before it became clear that Spalding wanted to lecture more than discuss and that, for someone trying to sell to libraries, he’s pretty dismissive of librarians if they disagree with him.
I noted that they weren’t “my” ISBNs or Ingram; I was just offering reasons. I also noted that being mentioned lots of times on a publisher’s own blog doesn’t offer any indication of quality. Then I did a different search for the book on Google, getting 654 hits (115 records) rather than Spalding’s claimed result. And I noted that one library blog does have one-third the Technorati rating of 37signals’ blog (counting links rather than authority, that’s true). Spalding seemed hostile about my questioning of his Google count, saying he could write a Perl script to get beyond the 1,000 record limit and finishing with “Your point is that Google is lying?” He threw in some other questionable statements (e.g., that 37signals’ blog is read more than all but a few magazines), but after noting a couple of them I decided to stop responding.
That left the Google issue open. Is Google lying? I’m not sure that’s a meaningful question. I am sure that, for real-world purposes, most large Google search result numbers are meaningless. Spalding calls the number of hits on the book’s short title (to which he added the author’s name) “astounding.” So let’s play with this book title a little more…
(Others have been discussing Spalding’s underlying point, that at least some libraries should be paying more attention to self-published and other non-mainstream materials. It’s an interesting discussion. I’m not directly addressing the underlying issue in this essay.)
First, another digression having to do with numbers and reality. Spalding claims that 37signals’ blog, which he calls Signal to Noise (the name is actually A design and usability blog: Signal vs. Noise and it’s at www.37signals.com/svn), has a readership “certainly higher than all but a few magazines.”
As for readership, 37signals’ blog states that up front: Feedburner shows 90,652 readers (as of June 12, 2008). Let’s say 100,000 readers overall, which might be low, but probably isn’t all that low. Of course, you can define “a few” any way you want, but the top 101 magazines each had at least 925,000 circulation in 2006 (according to Wikipedia, based on New York Job Source reporting). Circulation isn’t dropping drastically, at least not at this point. MPA’s report for total paid and verified circulation for 2007, based on Audit Bureau of Circulations figures, shows the hundredth highest title (Food & Wine) with 940,983 circulation. How many other magazines circulate more paid copies than the free readership of 37signals’ blog? An initial inspection of paid & audited numbers suggests at least 300 more, not including controlled circulation, freebies and newspapers. In any case, 400+ is more than “a few” by my standards.
Spalding seems to be saying jobbers or libraries should be paying attention to books mentioned in 37signals’ blog because it’s such a prominent source, but do jobbers or libraries pay attention to books mentioned in, say, Backpacker (349,598 paid & verified 2007 circulation) or Super Chevy (164,535) or Model Railroader (154,244) or Easy Home Cooking (331,727) or North American Whitetail (140,863) or Garden Design (258,805)?
Back to that issue: Just how astonishing is the web presence of 37signals’ book?
I just did the same search Spalding had done: the phrase “getting real” and the word “37signals.” It still shows “about 166,000,” interesting given that it’s now more than ten weeks later. And guess what? That search result doesn’t hit the 1,000-record limit: Google shows 684 pages. To what extent is that self-promotion? Not much: Seven of the first 100 results were from 37signals.com and subdomains, and I didn’t see that many elsewhere.
Ah, but the phrase “walt crawford” yields 619 pages (of “about 27,800”). So my personal web presence is almost as astonishing as that of 37signals’ book! I consider myself a non-celebrity, appropriately not in the English-language Wikipedia because I’m not notable, although I’m reasonably well known within library circles. (Incidentally, “cites & insights” as a phrase shows “about 20,200” but goes to “401-437 of 437.”)
As a more interesting comparison, consider one of those obscure librarians in her blog persona. The phrase “shifted librarian” yields “about 187,000” in Google—more than the 37signals book—although slightly fewer records (582) are visible.
I wouldn’t call any of these web presences astonishing. I’d call them all significant and quite similar: 582, 619, and 684 records respectively. Anything else is pure conjecture.
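For anyone who wants to see the comparison laid out, here is a small sketch that ranks the three Google phrase searches two ways: by claimed count and by viewable records. The figures are the ones reported above; the dictionary layout and the rank helper are my own assumptions, not anything Google provides.

```python
# (claimed "about N", viewable records) on Google for three phrase searches,
# as reported in this essay.
google_counts = {
    "getting real + 37signals": (166_000, 684),
    "walt crawford": (27_800, 619),
    "shifted librarian": (187_000, 582),
}


def rank(counts, index):
    """Return queries sorted from largest to smallest on the chosen column."""
    return sorted(counts, key=lambda q: counts[q][index], reverse=True)


by_claimed = rank(google_counts, 0)
by_viewable = rank(google_counts, 1)

for query, (claimed, viewable) in google_counts.items():
    share = viewable / claimed * 100
    print(f"{query:26} claimed {claimed:>7,}  viewable {viewable}  ({share:.1f}% viewable)")

print("Ranked by claimed count: ", by_claimed)
print("Ranked by viewable count:", by_viewable)
```

The search with the largest claimed count has the fewest viewable records, and in every case only a small percentage of the claimed total can actually be seen.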
37signals’ book is far more astonishing on Yahoo!: 418,000 (for the same search), and it does hit the 1,000-record limit. But then, so am I; “walt crawford” yields 236,000 and also hits the 1,000-record limit. “cites & insights” gets 271,000 (hitting the limit) and “shifted librarian” yields 1,520,000 (hitting the limit). In other words, according to Yahoo!, “shifted librarian” is more than three times as astonishing as 37signals’ book. So much for those obscure librarians?
Then there’s Live Search. It shows a mere 95,100 for the 37signals book and hits the limit, but that’s with the zooey default search. Using its real-search click, there are 82,100 results, and the viewing limit is there once more. The others? “walt crawford” yields 27,200 (and hits the viewing limit); “cites & insights” shows 109,000, but that’s down to 43,000 when the viewing limit is reached. “shifted librarian” shows 386,000 (again, around four times as many as the 37signals book), and by the time you hit the viewing limit, that’s only down to 220,000.
“Google Fight” (www.googlefight.com) is presumably just for fun—the site invites you to enter two phrases and “make a fight,” returning the “results” for each phrase. Let’s look at the raw numbers and whatever reality seems plausible for some of those fights:
· “Usain Bolt” and “Asafa Powell” (two runners): Reported as 672,000 for Bolt, 599,000 for Powell. Viewable results for word pairs: 587 for Bolt, 632 for Powell. As phrases, 702,000 (!) for Bolt (588 viewable), 599,000 for Powell (635 viewable). Google Fight says Bolt wins—but Powell has more viewable records.
· “Anna Nicole Smith” and “Pamela Anderson.” The reported “fight” has Anderson a clear winner, with 22.3 million to Smith’s 7.36 million. You can view 739 records for Smith (as three words) and 729 for Anderson—once again, the “winner” seems to be reversed (and it’s a difference of just over 1%, not more than 3 to 1). As phrases, Anderson moves up to 26.3 million (740 viewable) and Smith moves up to 8.2 million (742 viewable). Clear as mud.
· Here’s one that’s topical rather than a personal name: Christmas tree vs. Christmas pudding. Google Fight shows 15.6 million to 846,000 (tree vs. pudding). Google itself? Tree: 29.4 million (words), 19.9 million (phrase), still not hitting the 1,000-record viewing limit (889). Pudding: 803,000 (words), 555,000 (phrase), and nearly as many viewable records (822).
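Here is one way to tabulate those three fights: a sketch that names the winner by reported count and by viewable records, then flags any fight where the two disagree. The Contender type and helper functions are hypothetical, and the viewable figures for the Christmas searches are the single numbers given above (I’m not certain whether they belong to the word or the phrase search).

```python
from typing import NamedTuple


class Contender(NamedTuple):
    name: str
    claimed: int   # count reported by Google Fight
    viewable: int  # records actually viewable in Google (word searches)


def winner(a: Contender, b: Contender, field: str) -> str:
    """Name of the contender with the larger value for `field`."""
    return max((a, b), key=lambda c: getattr(c, field)).name


def verdict_flips(a: Contender, b: Contender) -> bool:
    """True when the claimed-count winner differs from the viewable-count winner."""
    return winner(a, b, "claimed") != winner(a, b, "viewable")


fights = [
    (Contender("Usain Bolt", 672_000, 587), Contender("Asafa Powell", 599_000, 632)),
    (Contender("Pamela Anderson", 22_300_000, 729), Contender("Anna Nicole Smith", 7_360_000, 739)),
    (Contender("Christmas tree", 15_600_000, 889), Contender("Christmas pudding", 846_000, 822)),
]

for a, b in fights:
    status = "REVERSED" if verdict_flips(a, b) else "same winner"
    print(f"{a.name} vs. {b.name}: claimed -> {winner(a, b, 'claimed')}, "
          f"viewable -> {winner(a, b, 'viewable')} ({status})")
```

Two of the three fights reverse when you count what you can see. In the third the winner holds, but an 18-to-1 claimed margin shrinks to 889 versus 822.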
I could go through many more examples, and so could you. Try a few on your own—both as words and as phrases. I find it interesting that you’ll frequently see more records in Yahoo and Live Search, even though Google may show larger result sizes.
The Spalding incident didn’t involve the words-vs.-phrase issue, which arises in so many public claims about what’s on the web. If you’re interested, the words walt crawford yield “about 661,000” in Google; shifted librarian, “about 361,000,” and cites & insights “about 339,000” (oddly enough, fewer, about 336,000, without the ampersand). Oh, and the words getting real 37signals yield about 644,000.
This isn’t really about the prominence of 37signals’ book. If you wanted web prominence, you could try “cory doctorow,” which yields about 963,000, and even then only 769 are visible. I think the conclusions are the same:
· The initial result count for a Google (or competitor’s) search is essentially meaningless, either on its own or (especially) in comparison with result counts for other searches.
· In a surprisingly large number of cases, the viewable results of a Google search fall well within the 1,000-record viewing limit imposed by all three major web search engines.
· Viewable results may be better measures of web prominence—but they may not.
· For most uses and most users, the key phrases are one, two, ten, some; one hundred, nine hundred, one thousand, many. Most people won’t go beyond the first ten—and maybe not beyond the first two. Anything larger than that is “some.” Dogged users may go to the first hundred or even the first nine hundred—but beyond 1,000, all you know is that there are “many” records, regardless of the stated count.
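If you wanted to turn that last point into something mechanical, it might look like the sketch below. This is only my reading of the “one, two, ten, some, many” scale, with cutoffs I chose for illustration; the function name useful_count and the 1,000-record constant reflect the viewing limit described in this essay, not any documented search engine interface.

```python
VIEWING_LIMIT = 1_000  # the most records any of the three engines would display


def useful_count(viewable: int) -> str:
    """Map a count of viewable records to the only labels that carry meaning."""
    if viewable < 0:
        raise ValueError("a record count cannot be negative")
    if viewable == 0:
        return "none"
    if viewable == 1:
        return "one"
    if viewable == 2:
        return "two"
    if viewable <= 10:
        return "ten or fewer"
    if viewable < VIEWING_LIMIT:
        return "some"
    # At the limit, the true total is unknowable, whatever count is claimed.
    return "many"


for n in (2, 13, 684, 1_000):
    print(n, "->", useful_count(n))
```

Note that the function takes the viewable count, not the “about” figure: by this scale, the 684 records for the 37signals book are “some,” regardless of the 166,000 claimed.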
I think we tend to bring our traditional search experience with us to open web search results, and that doesn’t work. If you get a large result in most traditional search engines, you can—at least theoretically—plow through the whole thing.
I’m not criticizing Google or Yahoo! or Live Search here. (Well, maybe Live Search—it seems more blatantly mysterious than either of the others.) I’m just saying these are bad numbers, particularly when making comparisons. “About” means they can’t be taken seriously as indicators.
Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.
Opinions herein may not represent those of PALINET or YBP Library Services.
Comments should be sent to firstname.lastname@example.org. Cites & Insights: Crawford at Large is copyright © 2008 by Walt Crawford: Some rights reserved.
All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.