Google, Wikis and Media Hacks
Before proceeding to semi-organized chunks of current items about Google and wikis, a couple of standalone columns caught my eye—both “Media hack” pieces by Adam L. Penenberg (an assistant professor of journalism at NYU), both appearing at Wired News. Not that I agree with or accept what Penenberg says, but he’s interesting and thought provoking.
His April 28, 2005 piece, “The new old journalism,” includes the unfortunate assumption that [all?] “younger people will undoubtedly choose the web” over newspapers and “Ultimately, the printed word will die off.” Although he’s talking about print newspapers, he never explicitly draws that limit. “It’s inevitable since it will be more cost-effective…to distribute news over the web and via cell phones and PDAs…” Interesting to have cost-effectiveness as the basis for inevitability; unfortunate that Penenberg sees no loss in moving from the broad, socializing, local-business-serving role of the print newspaper to the “tell me only what I want to hear about” role that web news plays in most lives (I believe).
The survey he’s basing this on says that 19% of Americans 18 to 34 do read print newspapers, but universalisms are always tempting for columnists. What I found most noteworthy is his assertion that people aren’t abandoning newspapers: They’re abandoning the print medium. Oddly, he includes “magazines” in this assertion—and there’s no indication that people are abandoning print magazines or avidly adopting digital versions.
He thinks it makes sense to keep teaching the skills of journalism: We’ll still need reporters even if they’re working entirely in net media. “[W]hen all is said and done, I still expect that each student will know how to craft a hard news lede on a tight deadline. Because whether we’re talking today or 10 years ago, it’s not the medium, it’s the reporter.” I agree with the conclusion, even if I disagree with much of the column. (“Lede” is newspaper jargon. What’s a profession without jargon?)
The July 21 piece, “Web publishers eye your wallet,” is a discussion of the “Balkanization of online media”—the idea that we will pay for internet content in the future, with all the good stuff locked behind subscription and pay-per-article doors. The source is Pat Kenealy of International Data Group. His analogy is TV, where it was free in 1955 “and two generations later most people pay for it.” That’s a tricky analogy, since the most frequently watched TV is still free, even if most of us pay someone so we don’t have to fiddle with an antenna. Kenealy uses another truly odd analogy: “We got used to paying $1.50 or so at some ATMs—and that’s to withdraw our own money.” Maybe you got used to it, Pat, but millions of us at Washington Mutual and some other big banks don’t intend to pay to withdraw our own money.
The big holdup, of course, is the ever-elusive micro-transaction software: Kenealy thinks we’re all just waiting to pay, say, $0.50 or $2 or whatever for econtent if the transaction’s as easy as buying a magazine at a newsstand. Kenealy’s fine with the idea that you lose most of your readers when you require registration and, in the future, a subscription: The remaining readers will be more attractive to advertisers, who will then pay higher rates. That’s right: Paying for your econtent won’t avoid ads, but may make them more pervasive—even though nobody’s quite figured out how to make anything other than tiny text ads as acceptable in net media as they are in print magazines.
As for weblogs? Kenealy’s analogy, both overbroad and nicely dismissive for an oldline print publisher: “Every blogger is a rock band without a record contract.” So they’ll still be free. Kenealy should learn something about what “rock bands without record contracts” are doing these days with downloads, create-to-order CD-Rs, short-run CDs, and other ways of doing without the star-making machinery. They may not be able to quit their day jobs, but they also don’t sell their souls to The Man.
Lorcan Dempsey offered some early thoughts about “the G5” (the combination of Google and five major libraries involved in the huge digitization project) in an April 16, 2005 blog post. “If large amounts of the G5 library collections are digitized, indexed and searchable then we have an index to books in all library collections. This initiative potentially improves access to all library collections, provided we have good ways of moving from the Google results into those collections.” He also discusses “coverage moving forward” and the implications of the under-copyright portion of the project for the distinction between libraries’ “bought” and “licensed” collections. I won’t quote more and the third discussion may be somewhat moot; go to orweblog.oclc.org/archives/000632.html for the whole story.
Bill Drew, Baby boomer librarian, posted this entry on April 19, discussing a report of an ACRL program—and later that day posted a longer response from Steven Bell, one of the people involved in the program. Here’s Bell’s statement, which Bill Drew used as the basis for his “I disagree” initial essay:
If you care about helping your users get to the highest quality results, it’s difficult to say that Google is a good model for searching in an academic context.
Now, before you shout “I LOVE GOOGLE!” or “It’s good enough” or whatever, read that sentence carefully. Bell is not saying Google is worthless. He is saying that it may not be “a good model” for academic searching when you’re looking for the highest-quality results. Drew’s response is that library databases are too difficult to use; that “quite often ‘good enough is good enough’”; freshman papers don’t need the “highest quality results”; and Google/Google Scholar results may be all that’s needed. But that’s not what Bell said.
Drew goes on to “imagine the world where Google Scholar is the interface to all of our databases and our online catalog as well as to web pages”—and, frankly, I can’t imagine that working out all that well, quite apart from the logical stretches involved. Even then, Drew says, “Those with greater needs could use the separate databases”—such as those needing “the highest quality results”? The close of Drew’s original post strikes me as truly odd: “All librarians that like and use Google, do not be afraid of standing up and saying so.” But very few people—Bell certainly not among them—are saying you shouldn’t like or use Google. The critics are saying that Google is not the be-all and end-all. Bell responded to Drew’s oddly off-center attack with a long, thoughtful email (which Bell was willing to have posted) that’s better read directly. Drew precedes Bell’s response by saying he and Bell are “not that far apart after all.” You’ll find both posts in the April archives at babyboomerlibrarian.blogspot.com.
Speaking of Google Scholar, the California Digital Library released “UC libraries use of Google Scholar” on August 10, 2005. It summarizes the results of a quick survey on librarian and library staff use of Google Scholar within the University of California.
The replies indicate a core of respondents do not use Google Scholar at all. Others use it rarely, instead strongly preferring licensed article databases purchased by the libraries for use in specific disciplines. Some are reluctant to use it because they are unsure of what it actually covers.
This isn’t a statistical study. Boxed sets of bullets cite uses for Google Scholar, reasons for usefulness, “I don’t use it because,” uses of Google Scholar at public service desks and in teaching, and other comments. Seven pages provide all responses grouped by the survey questions, followed by a brief essay from UCLA, which has elevated Google Scholar to their home page. An interesting direct look at how some academic librarians are dealing with this resource.
Here’s an odd one: Another Penenberg “Media hack” column dated April 21. It’s a snide little piece, drawing plausible and stretched parallels between Google and Wal-Mart. For example, “Alternative slogans: Wal-Mart: ‘Always low wages.’ Google: ‘Maybe not evil, but after the IPO not so good either.’” Penenberg says Google “accounts for almost four out of five internet searches,” which doesn’t agree with any other reports I’ve seen, and also claims Google pays less than other Silicon Valley companies. (I’m reporting what Penenberg says; I’m not convinced of any of this.) The most troubling parallel, to be sure: Just as Wal-Mart insists on censored versions of some CDs and DVDs, Google blocks some sites in some countries (of necessity) and bars AdSense affiliates from criticizing Google. (Does Google really pay systems administrators $35K? In Mountain View? I find that a little hard to believe except as an entry salary for an entry-level position—but I have zero inside information.)
Speaking of “four out of five internet searches,” here’s an article that flatly disagrees with that number, by Charles H. Ferguson, posted in January 2005 at TechnologyReview.com (www.techreview.com/articles/05/01/issue/ferguson0105.asp?p=0). This long article (eight pages of very small type) discusses likely competition between Microsoft and Google. A pie chart on the second page says Google’s own sites perform 38 percent of web searches, while other sites that license its technology (some of which are moving to other technologies) account for another 10 to 15 percent. Maybe 48 to 53% equals “almost four out of five” to Penenberg, but not to me. The 38% figure agrees with other metrics I’ve seen.
The article’s interesting and challenging, reminding us of the days when Netscape’s Jim Barksdale assured us Microsoft could never catch up with Netscape in the browser market. It’s a detailed article, arguing that Google needs to establish itself as a platform (by promoting APIs)—and that it should avoid going after MS in the browser (and OS) arena. There seems to be an implication that either MS or Google will “win,” as opposed to expanding the current Google/Yahoo! duopoly to a broader, more competitive three- or even four-way search market. Still, worth reading.
Then there’s the copyright flap over the library portion of Google Print. The publishers attacked and Google retreated, at least temporarily. An oddly-titled August 11, 2005 post on the Google blog, “Making books easier to find,” notes the “two new features” for publishers—one to “give us a list of the books that, if we scan them at a library, you’d like to have added immediately to your account” (which gets publishers ad revenue and directed buyers) and one to allow publishers to “tell us which books they’d prefer that we not scan if we find them in a library. To allow plenty of time to review these new options, we won’t scan any in-copyright books from now until this November.” That second option is the partial retreat. Not too surprisingly, publishers assailed it because it’s the wrong way around: Copyright holders don’t have to provide would-be infringers with lists of “things we’d like you to not infringe.” This, of course, assumes that scanning entire books into a database constitutes copying even if those books aren’t made available except in snippets—and, for a commercial entity, even that nominal level of copying may infringe copyright. Google’s lawyers apparently didn’t believe that was true, at least initially. Maybe they’ve talked to other lawyers. (One Harvard law professor believes Google would win a court fight over fair use based on the “social worth” of their scanning.)
A fight erupted at Copyfight (see the August 2005 archives at www.corante.com/copyfight). Aaron Swartz thought Google had every right to keep on scanning. Siva Vaidhyanathan disagreed based on Google’s commercial status and current law.
If copyright is to mean anything at all, then corporations may not copy entire works that they have never purchased without permission for commercial gain. I can’t imagine what sort of argument—short of copyright nihilism—would justify such a radical change in copyright law.
Vaidhyanathan is no copyright maximalist. He goes on to claim that the University of Michigan, for example, could do such copying for its own patrons. “I wish more libraries would push their rights under copyright.” As I read the library exceptions to copyright they’re quite limited, but there’s no question that nonprofit libraries have more leeway than corporations do. As the multipart discussion went on, we had an astonishing suggestion that Google really was sort of a library, or at least close enough (“good enough”?) for jazz—to which Vaidhyanathan had some ripe responses that suggest he knows something about what libraries and librarians are and do.
Among the many voices in this ongoing discussion, I found four particularly interesting.
• The always-thoughtful Seth Finkelstein posted “Google Print: Copyright vs. innovation vs. commercial value” at Infothought on August 12, 2005, noting that Google surely isn’t mounting the expensive digitization effort just because it’s cool but because they anticipate commercial gains. The argument brings up one of the intrinsic conflicts in copyright law: protection almost automatically narrows some forms of innovation. Letting Google digitize all the in-copyright books and display only search results is an innovation, but “clearly very dubious under copyright.” Since Google stands to gain, it is a balance issue: “The technology company can’t be right every time, almost by definition.”
• Paul Miller posted “Google Print on hold” that same day at Common Information Environment. Maybe British copyright law’s different, but his take is simple: “I was sad to see that Google has bowed to the whinging of publishers… I had been impressed by the breadth of their vision…and saw plenty of ways in which access to in-copyright material could have been managed to the benefit of all (including the publishers). We give in to the whinging of those with no vision all too often.” Well, yes, Google Print could be managed to the benefit of all—but as long as published material isn’t immediately and automatically part of the commons, Google doesn’t get to decide that on its own. Otherwise, copyright effectively ceases to exist; I don’t consider that a desirable outcome.
• Tim O’Reilly (the publisher) “defend[s] Google’s approach, arguing that this is another case where old line publishers are being dragged kicking and screaming towards a future that is actually going to be good for them.” Sure, Tim, but again: You don’t get to tell other publishers that they must submit to being “dragged kicking and screaming.” You can try to persuade and Google will try to do that—but their original position (that scanning to create the search index is fair use) was probably wrong, and that leaves the choice with the publishers. (Copyright spills over into so many other areas. As a balanced-copyright advocate, I’m as frustrated with those who say “Trust us, it’ll be good for you” as with those who insist on 100% control over uses they never had control over before.)
• Jenn Riley of Inquiring Librarian discussed Google Print and related issues on August 28 and 29 (inquiringlibrarian.blogspot.com). The August 29 post is a non-lawyer’s attempt to judge Google Print against the four factors of fair use as stated in section 107 of the copyright act. It’s a good analysis that concludes that the fair use claim is “far from a slam dunk in either direction.” The August 28 post is even more interesting: Riley wonders whether cached web pages could also be considered copyright violations and whether indexing and abstracting, and for that matter cataloging, could be considered infringement. I would argue that the latter questions are simple: Preparing a description of a copyright item is an act of intellectual creation that results in a new (copyrightable) work; it is not a derivative work. (Otherwise, every book and movie review could be considered infringement.) As to the caching question—one answer is the one Google’s trying to use with Google Print: “Tell us not to, and we won’t.” That is, if a web search engine caches pages that have no-spider specifications or retains those caches after a site owner objects, they could be in trouble—and they don’t do either one of those. Whether you can apply opt-out logic to printed books: That’s another issue.
You must have heard the claim by now: The Yahoo! index now provides access to over 20 billion items. The claim was apparently first made August 8 on the Yahoo! search blog by Tim Mayer. Two days later, John Battelle reported that Google “refuted” this claim saying, “[Their] scientists are not seeing the increase claimed in the Yahoo! index.” Researchers who work at the National Center for Supercomputing Applications decided to study the situation—and released the results on August 16, only six working days later. (“A comparison of the size of the Yahoo! and Google indices,” vburton.ncsa.uiuc.edu/indexsize.html)
How did they check it out? They assume there’s no filtering going on and that if Yahoo’s claim is true, “a series of random searches to both search engines should return more than twice as many results from Yahoo! than Google.” Ah, but they’re not willing to take raw numbers, and they know both search engines refuse to return more than 1,000 results. “Any search result found to have more than 1,000 returned results on either search engine was disregarded from our sample.” So how did they get lots of queries returning relatively small results? By taking a list of English words and randomly selecting two words at a time—in all, 10,012 searches.
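The sampling idea can be sketched in a few lines of Python (my language choice, not the study’s; the `search_count` callable and the toy stand-in engines below are hypothetical placeholders for querying each engine’s reported result counts):

```python
import random

def compare_indices(search_count, wordlist, n_queries=1000, cap=1000):
    """Sketch of the NCSA sampling idea: pair random words, ask each
    engine for a result count, discard any query either engine caps
    at 1,000, and average Yahoo!'s count as a share of Google's."""
    shares = []
    for _ in range(n_queries):
        query = " ".join(random.sample(wordlist, 2))
        g = search_count("google", query)
        y = search_count("yahoo", query)
        if g > cap or y > cap or g == 0:
            continue  # capped or empty result counts are thrown out
        shares.append(y / g)
    return sum(shares) / len(shares) if shares else None

# Toy deterministic stand-ins; a real run would query the live engines.
def fake_counts(engine, query):
    base = random.Random(query).randint(1, 60)  # same count per query
    return base if engine == "google" else int(base * 0.4)

words = ["quark", "zither", "ombud", "sepal", "lumen", "gnarl", "tor", "apse"]
print(compare_indices(fake_counts, words, n_queries=200))
```

With the toy engines rigged so Yahoo! reports roughly 40% of Google’s count, the averaged share comes out near that figure, which is the shape of the comparison the study reports.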
Note that they weren’t actually examining the results, which makes me wonder why result counts of more than 1,000 were unacceptable. The results were striking: “On average Yahoo! only returns 37.4% of the results that Google does and, in many cases, returns significantly less.” In fact, these word combinations were so obscure that Google only returned an average of 38 results (“excluding duplicate results,” by which I assume they mean “similar to these” results), where Yahoo! returned only 14. How many searches have you done on either engine in the last six months that returned results that small?
They also assert that the actual number of results returned was about half the estimate on Google, only one-fifth the estimate on Yahoo! Their conclusion: a user can expect, on average, to receive 166.9% more results using the Google search engine than the Yahoo! search engine… “It is the opinion of this study that Yahoo!’s claim to have a web index of over twice as many documents as Google’s index is suspicious.” (Who knew that studies could have opinions?)
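For what it’s worth, the 166.9% figure is close to what you get by simply inverting the 37.4% share; the small discrepancy presumably comes from the study averaging per-query ratios rather than inverting the overall average (an assumption on my part):

```python
# Illustrative arithmetic only, not the study's raw data: if Yahoo!
# returns 37.4% of Google's results on average, Google returns about
# 1/0.374, or roughly 2.67 times as many -- i.e., about 167% more.
yahoo_share = 0.374
pct_more = (1 / yahoo_share - 1) * 100
print(f"Google returns about {pct_more:.1f}% more results")  # ~167.4%
```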
The speedy study, syntax and all, flew around the blogosphere—Google’s still #1! Seth Finkelstein took a careful look at how the study was done and posted his comments at Infothought on August 16, using the study’s name as the entry name. His conclusion: “The methodology is severely flawed, with a sampling-error bias… By sampling random words, they biased the samples to files of large words lists. And this effect applies, to a greater or lesser extent, to every sample.” He offers examples—and, later, realizes that the methodology returns not only large word lists but also “gibberish spam pages.” In the results he looked at—cases where Yahoo! returned no results and Google returned significant numbers—“Every page is either a gibberish spam page or a wordlist.” Unfortunately, and as usual, Finkelstein’s cogent criticism of the study received nowhere near the publicity of the study itself.
Matthew Cheney and Mike Perry replaced the earlier study noted above with “A comparison of the size of the Yahoo! and Google indices” (same URL) a few days later. This time, the searches involved two random words and a third word preceded by “-” (that is, the first two words and NOT the third word), which the researchers claim has the effect of “excluding dictionaries and wordlists.” They also threw out any query yielding fewer than 26 “actual results” on both engines.
There’s a huge difference from the original study: Now Yahoo! “only returns 65% of the results that Google does” as opposed to 37.4%—but the researchers continue to characterize Google results as “overwhelmingly larger,” an adjective that does little to convince me that the researchers have no prior agendas going into this hurried and hurriedly redone project. (One of the tables is clearly mislabeled, which really makes me wonder about the rush to publish: As stated, it shows Yahoo! returning more than Google—73% more, to be precise.) The conclusions are pretty much identical to the first version. Seth Finkelstein points out that their method of excluding wordlists doesn’t really work very well and why that is so.
Fact is, the study could not conclusively prove that Yahoo! is lying (which is certainly the implication). No study could, short of actual access to both companies’ server farms. There’s no reason to assume that Google and Yahoo! index documents identically (e.g., how deeply they index very long documents) and every reason to assume that they do not. There’s no reason to assume that they define “document” identically. There’s no reason to assume that the algorithms for blocking spam pages, eliminating near-duplicates, and otherwise making results semi-manageable are identical—and every reason to assume they’re not.
It would be interesting to see strong anecdotal studies using real search terms—understanding that even 50,000 such searches would still be anecdotal. Then again, if you can’t get beyond the first 1,000 on either engine, does it matter all that much which one is larger? What might matter is which engine returns a higher percentage of highly relevant results within the pages that a typical user would scan.
Here’s an odd one: An August 16 post at Nexgen librarian by Fritz “Ian” Herrick. He believes Google is “threatening the public library” and calls that evil.
If you needed a list of dry-cleaners in Syracuse, you used to call the library. If you needed the zip code of an address in Tallahassee, you used to call the library. If you needed to know the capital of Mozambique, you used to call the library. Now, everybody uses Google.
Have you ever called your library for a list of local businesses—or do you use the yellow pages? Herrick thinks taxpayers will say, “Everything’s in Google. Why are we paying for a library?” and be happy enough when the city cuts the library budget. I wonder how many taxpayers think ready reference is the primary benefit they receive from public libraries? Last time I looked at this situation, healthy public libraries averaged about 12 circulations per person in their service area—and considerably less than two reference transactions.
Herrick’s list of what would be missed if library budgets get cut is reasonable, although he ignores one huge thing Google doesn’t do: Circulate books, DVDs, and other materials. For free! Nearly every survey shows that the public wants books in their libraries. Google won’t change that.
Simon at VALISblog responded two days later, with “Librarians to Google: stop being evil (our buggy whip sales are down).” His response (in part):
If Google is good at answering people’s factual reference questions, then let it continue to do that. Criticizing Google from the assumption that we have a divine right to continue to perform this role is arrogant.
Either we need to do what we do better, or we need to stop doing it, and let Google do it. And then re-focus what we mean by ‘library’…the library as place…the library as entertainment source (books on paper are still better and easier to read than books on screen); the library as source of serious scholarly information… We can do things that Google will never be able to—so let’s use it as a resource and an ally, and concentrate on marketing our strengths.
If you’ve followed some of the discussions regarding Wikipedia, you may already know about the two-part Early history of Nupedia and Wikipedia, written by Larry Sanger and posted on slashdot April 18 and 19, 2005. (The essay will also appear or has appeared this summer in Open Sources 2.0, an O’Reilly publication.) Yes, this is the same Larry Sanger who posted “Why Wikipedia must jettison its anti-elitism” at Kuro5hin, discussed in Cites & Insights 5:3 (February 2005).
This is a long essay, particularly by slashdot standards: Part 1 runs 26 pages (admittedly fairly narrow pages), with another 27 pages in Part 2. By April 20, when I printed off the posts and first-level comments, they already added 18 and 12 pages respectively.
Sanger is not anti-Wikipedia: “Wikipedia as it stands is a fantastic project…” He considers himself one of its strongest supporters, is partly responsible for founding it, “and I still love it and want only the best for it.” He’d like to see it better, though, and that seems to disturb lots of readers. His memoir starts with Nupedia, an earlier and very different project:
Nupedia was to be a highly reliable, peer-reviewed resource that fully appreciated and employed the efforts of subject area experts, as well as the general public. When the more free-wheeling Wikipedia took off, Nupedia was left to wither…
He believes that was unnecessary, and that a redesigned Nupedia could have worked together with Wikipedia to “be not only the world’s largest but also the world’s most reliable encyclopedia.” He offers a brief history of that earlier project (and makes it clear that both ideas came from Jimmy Wales).
If you care about Wikipedia, it makes sense to read this memoir, since you’ve doubtless read some of the ecstatic writeups of Wales’ genius. Sanger does not try to detract from Wales; he does offer additional perspectives.
Meredith Farkas at Information wants to be free set up a wiki for the ALA Annual Conference in Chicago. A post on July 5, 2005 offers observations about that wiki and what it means for future conference wikis. For example:
1. A wiki must have a specific purpose.
2. You can’t just offer a wiki to the public as a blank slate and expect people to add to it…
3. It’s good to add some content to the wiki before making it public…
4. You need to make it very clear that people can add whatever they want to the wiki or they’ll ask you to do it instead of doing it themselves…
5. If your name is on the wiki, some people will email you assuming that you wrote everything on it…
6. Yes, spam is a problem, but a manageable one if you have enough loyal users…
7. It is amazing to watch what a wiki has become…
The ALA Wiki did succeed. It’s still available (meredith.wolfwater.com/wiki/index.php?title=Main_Page) and includes what must be the most impressive set of conference reports I’ve ever seen—103 in all, many (most?) consisting of links to reports in blogs and elsewhere. The wiki still provides an enormous gathering and organizing service. As Farkas says, “It’s great to have a single place to read all of the reports people have written about the conference.”
Based on that success, Farkas has established another wiki, “Library success: a best practices wiki”—“a one-stop shop for inspiration.” It has its own domain: www.libsuccess.org. Take a look. If you think something’s lacking or you disagree with something—well, it’s a wiki. You can contribute. (I may not be a wiki contributor at this point; that doesn’t mean I regard them as bad or useless. Quite the contrary.)
Here, then, three pieces discussing both Google and Wikipedia. Stephen Manes’ August 15, 2005 “Digital tools” column at Forbes.com is “Google isn’t everything.” Here’s the first paragraph (after a tease that’s pro-library, but apparently only for virtual services):
In the age of Google, when we wonder about stuff we want instant answers. I happened to wonder about the first recorded use of the term “personal computer,” so I Googled around and ended up at Wikipedia, the hit-or-miss user-developed encyclopedia, whose “personal computer” entry declared authoritatively that “The earliest known use of the term was in New Scientist magazine in 1964, in a series of articles called ‘The World in 1984.’”
Manes goes on to say that he still doesn’t know the answer. But he knows Wikipedia got it wrong, thanks to “an even older purveyor of information: my public library”—where he found a November 3, 1962 New York Times article (in an online database) quoting John W. Mauchly saying “There is no reason to suppose the average boy or girl cannot be master of a personal computer.” Manes goes on to discuss all the stuff you can get for free online from your library, stuff that would cost you elsewhere. Good column; too bad Manes limits his praise for libraries to online offerings.
Laura at Lis.dom (lisdom.blogspot.com) posted “what for and for what,” noting the need to ask “for what?” when discussing whether tools are good.
The answer to “Is Wikipedia a good source of information?” is not “Yes” or “No”—it’s “A good source of information for what?”
That’s a sensible distinction. As Laura notes (again, I don’t believe I’ve met her, but she signs her posts with one name), Wikipedia’s probably a great place to find out about podcasting, but might not be the ideal source for an “analysis of gender roles in A Winter’s Tale.” There’s more here and it’s good: Like it or not, every “objective” source is objective with a viewpoint. Google Print and Google itself are good for some things, not for other things. “There’s no such thing as a ‘good source of information’ or a ‘good technology’—there are only sources of information and technologies that are good for certain things.” This is a fairly long post (four print pages plus comments), worth reading in the original: It was posted August 3. Jane at A wandering eyre (wanderingeyre.blogspot.com) wrote a followup post on August 4 pointing to the Lis.dom post and expanding on it a bit, and I would never disagree with this sentence: “We should learn to not only harness the technology around us, but learn to examine it critically.”
Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.
Hosting provided by Boise State University Libraries.
Opinions herein may not represent those of RLG, YBP Library Services, or Boise State University Libraries.
Send comments to firstname.lastname@example.org. Cites & Insights: Crawford at Large is copyright © 2005 by Walt Crawford: Some rights reserved.
All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.