Book Searching: OCA/GBS Update
What’s happened since the last OCA/GBS perspective (C&I 6:6, Spring 2006)? Less than might have been expected. It seems unlikely that we’ll ever run out of commentaries based on the notion that Open Content Alliance and Google Library Project somehow mean either the death of print books or the death of library circulating collections.
For those in a hurry, here’s a quick summary:
• Google continues to scan books at unknown rates and Google Book Search now includes enough of those books that we can see both the uses and limits of GBS. Google is making public-domain books downloadable, if you don’t mind PDFs with “Scanned by Google” on every page. GBS now makes WorldCat and other library searching available more often.
• The big October Open Content Alliance spectacular didn’t happen. The OCA website shows signs of inattention. If there’s an OCA site searching scanned books, it’s well hidden.
• Despite its early public lead, Yahoo! doesn’t have any visible presence as a source of book-related information or scans. Microsoft has introduced a beta version of Live Search Books, part of the rebranding of MSN Search and based on Microsoft’s OCA scans. Those books are also available as downloadable PDFs—if you don’t mind a “Digitized by Microsoft” watermark on each page. So far, the interface only offers the books themselves, with no “Find in a library” or “Buy this book” links.
• The Internet Archive includes 35,000 books scanned as part of OCA (as of early December), including some—but apparently not all—of those at Live Search Books. These are also downloadable as PDFs—the exact same PDFs as on Live Search Books, for those books scanned thanks to Microsoft.
• The Google copyright suits are still active and not yet in court. Google is attempting to subpoena information from Yahoo! and others regarding their book digitization efforts.
That’s the gist. Detailed comments follow.
First, however, there’s Barbara Fister’s December 9, 2006 ACRLog post, “The big book has missing pages.” This charmer references Kevin Kelly’s silly manifesto, notes that Kelly’s “Big Book o’Everything” is “a long way from reality,” and gives some reasons why—e.g., even if Google and OCA complete their projects, roughly 80% of the books out there would not be available in anything more than snippets. “Even if Google can convince the courts what they’re doing is legal, the user will only be able to view scraps, and certainly won’t be able to do any of the interactive remixing that Kelly envisions.” Fister notes the “school of thought” (based on limited real-world experience) that full-text online access to book content “is not going to destroy the industry—it might just save it.” I’m not sure that a $55 billion industry (U.S., which suggests around $110 billion worldwide) growing at 3.4% a year (U.S.) needs “saving,” but it’s also far from destruction.
An April 18, 2006 item at OptimizationWeek.com offers notes from John Wilkin’s April 3 talk on the University of Michigan and Google, held at Ann Arbor’s public library. Wilkin estimated that the UM portion of Google’s project, digitizing seven million bound volumes, would be completed by July 2011—and noted that UM had been digitizing books at a rate of 5,000 to 8,000 volumes per year until Google came along.
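To put Wilkin’s figures in perspective, here is a minimal back-of-the-envelope sketch. It assumes a constant yearly scanning rate and uses only the numbers quoted above; the function name and the constant-rate assumption are mine, not Wilkin’s.

```python
# Figures quoted from Wilkin's talk as reported above.
TOTAL_VOLUMES = 7_000_000
PRE_GOOGLE_RATES = (5_000, 8_000)  # volumes per year: low and high estimates


def years_to_finish(total, per_year):
    """Years needed to digitize `total` volumes at a constant yearly rate.

    Hypothetical helper: a constant rate is an assumption for illustration.
    """
    return total / per_year


slowest = years_to_finish(TOTAL_VOLUMES, PRE_GOOGLE_RATES[0])
fastest = years_to_finish(TOTAL_VOLUMES, PRE_GOOGLE_RATES[1])
print(f"{fastest:.0f} to {slowest:.0f} years")
```

At UM’s pre-Google pace, in other words, the same seven million volumes would have taken on the order of a millennium rather than roughly five years.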
Google issued a short series of Google Librarian Newsletters, the final one appearing in June 2006. That issue included an introduction to GBS by Jen Grant (product marketing manager), noting that founders Page and Brin asked this question early on: “What if every book in the world could be scanned and sorted for relevance by analyzing the number and quality of citations from other books?” Apart from the usual Googlish simplification as to what “relevance” means, it’s an interesting way to lead into GBS. Discussing problems inherent in the fact (credited to OCLC) that only 20% of extant books are in the public domain, Grant cites an estimate that only 5% are in print—which seems likely. “That leaves 75 percent or more of the world’s book in [a twilight zone].” Given the GBS goal “to build a comprehensive index that enables people to discover all books,” Google needed a way to handle the “twilight zone” books—thus the snippet approach.
Ben Bunnell (another Google manager) offers “Find a page from your past” in the same issue, beginning “The idea that within our lifetimes, people everywhere will be able to search all the world’s books from their desktops thrills me.” Bunnell notes examples of “interesting uses” of GBS for family research; it’s an interesting commentary that stresses GBS as a way of locating books that might be of interest, not primarily a way of reading them.
I contributed “Libraries and Google/Google Book Search: No competition!” to the same issue. I focused on locality, expertise, community, and resources—four “reasons libraries don’t need to fear Google Book Search or Google itself.” Briefly (since the article’s readily available):
• Every good library is a local library—and libraries do local better than Google.
• GBS “will be a fine way to discover the more obscure portions of books, and obscure books in general. But librarians and library catalogs offer expertise—professional education and knowledge to guide users whose needs are out of the ordinary, and classification methods to support comprehensive retrieval and guide people to the materials they need.”
• “Good libraries aren't just local libraries. They're places that serve their communities in that regard. Good libraries build and preserve communities. ‘Cybercommunities’ can be fascinating—but the physical community continues to be vital.” I note that Google can strengthen a library’s role in the community.
• “Need I state the obvious? Google Book Search helps people discover books. Libraries help them read books.”
I also took Google to task somewhat—which delayed publication of the article and resulted in a Google response from the editor. My grumps:
• Many Google Book Search books published prior to 1923, necessarily in the public domain, show only snippets when they should show the whole book. The same is true for quite a few government publications almost certainly in the public domain within the U.S.
• There should be a “Find this book in a library” link for every book that originates in the Google Library Project and for every book in the public domain. That wasn't the case the last time I tried date-limited searching.
• Ideally, every result in Google Book Search should include a “Find this book in a library” link—after all, even books supplied by publishers show purchase links for sources other than the publisher. If Google Book Search is to be a great way to discover books, it should include all the great ways to get the books.
Summarizing the responses, the editor said Google was digitizing quickly and would change some books from “snippet view” to “full view” later on—and Google agreed on the second and third points. Google Book Search does now show either “Find this book in a library” or “Find libraries” on all or almost all book results, and that’s a significant improvement.
John Dupuis noted my article in a June 27, 2006 post at Confessions of a science librarian, “Google Book Search @ your reference desk.” He recounted an incident in which a young woman was writing a paper on space elevators and needed a book reference. The catalog didn’t help.
Well, I immediately went into Google Book Search and searched on “space elevator.” Lo and behold, we immediately found a few books which seemed to have significant sections on space elevators. Checking our catalogue, we figured out which ones are in our collection. The student went away very happy…. I also immediately ordered a bunch of the books that we discovered that aren’t in our collection.
With “Find in a library” fully active, Dupuis should be able to handle both pieces of that transaction from the Google interface—showing the university’s online catalog as books are found. That’s a win-win situation.
Bob Thompson’s August 13, 2006 Washington Post story is a long one (nine print pages) that begins and ends with This is our land, a slim blue 1950 family travelogue by Lillian Dean found in Stanford’s stacks at E169 D3. Thompson discusses the journey that book will eventually make to an “undisclosed location” to be scanned. He considers the copyright controversies—and Andrew Herkovic (Stanford) notes this “Vantage Press” book as a “great example,” since it’s highly probable (say 90%) that the copyright was never renewed—but “if you were the corporate counsel for Stanford, Google or anybody else, is 10 to 1 good enough?”
The story covers a lot of ground, including Google’s semi-humble beginnings (it wasn’t just a garage, and the owner who rented the garage, three bedrooms and two bathrooms to Google is now Google’s VP for product management) and the founding of GBS. Stanford’s Michael Keller was enthusiastic. He notes reasons—one of which, preservation, seems a bit iffy given the apparent quality of GBS scans. Currently, Stanford only provides out-of-print materials, but Keller believes Google’s scanning is fair use.
Thompson talks to Allan Adler (AAP) and Paul Aiken (Authors Guild), both of whom make questionable claims about GBS. Adler says the Google database “in essence would be the world’s largest digital library” and Aiken says “it’s an attempt to avoid licensing. Without the ability to say no, a rights holder really has nothing to license.” It would be interesting to poke at Aiken about fair use, but I suspect the answers would be unsatisfactory. As Thompson summarizes, “Permission, permission is their refrain.”
There’s more—Google’s analogy between web searching and GBS, publishers’ denial that the analogy works, and so on. It’s a good piece, worth reading.
In August, UC announced it would join the Google Library Project. One early commentary struck me as extreme: “Google ‘Showtimes’ the UC library system,” posted August 13, 2006 by Jeff Ubois at Television archiving. Immediately noting that this was a “secret agreement,” Ubois presumes the agreement “may enrich Google’s shareholders at public expense.” After quoting Brewster Kahle about providing “universal access to all human knowledge, within our lifetime,” Ubois says “[I]t’s troubling to see public institutions transfer cultural assets, accumulated with public funds, into private hands without disclosing the terms of the transaction.” [Emphasis added.]
How is UC transferring assets? It’s lending books, which will be returned (they never leave the building in most cases). That’s (part of) what libraries do. As for “without disclosing,” it doesn’t take much research to find out that California is (like Michigan) a state in which that “secret” contract was only secret until someone filed a formal request to see it, since it involved a public agency. “UC should expect and welcome public comment if its inventory is effectively being privatized”—but that’s not what’s happening.
Ubois presumes that Google’s contract must be like Showtime’s offensive contract with the Smithsonian, which did provide exclusive access for some length of time—thus the neoverb in the post title.
UC’s agreement is probably not explicitly exclusive. But as a practical matter, scanning doesn’t happen twice… This deal will be costly for UC in staff time and other resources, and the chances that another vendor will come through and duplicate the work are slim.
This discussion is based on pure speculation—and happens to be false, since UC was already an OCA partner and Microsoft was already scanning UC books and documents!
Ubois makes things worse: Assuming Google’s efficient, it won’t scan a Berkeley copy of something it’s scanned at Harvard, and restrictions may make it difficult for Berkeley to borrow Harvard’s digital copy. “The student of 2012 will have a choice: go to the complete digital library, owned by Google, or go to the partial digital library of his or her own university.”
That’s nonsense. The student of 2012 won’t be able to get the book from Google’s so-called digital library anyway if the book’s not in the public domain, which means the student can do exactly what he or she can do now: Go read the actual, honest-to-trees, printed book, either UC’s copy (if there is one) or one loaned from another library.
Then Ubois asks a series of questions, at least some of which make the same assumptions. For example: “Is it reasonable to ask the public to pay a second time…for material already purchased, simply because it’s now necessary to convert the format in which it is stored?” But UC is not “converting the format” in which books are stored. It’s adding new search capabilities to find print books, which still exist as print books.
Ubois concludes, “By acquiescing to Google’s demands for secrecy, UC has compromised the public interest, and set a dangerous precedent for the rest of the academic community.” Which is truly strange, given that UC is by no means the first academic institution to sign a confidential Google contract, unless we assume that Stanford, Harvard, and Oxford aren’t prestigious enough to set precedent. And given that UC knew the “secret agreement” could not be kept secret. As with Michigan, both UC and Google must have known that the confidentiality clause was not enforceable and the contract would be secret only until someone asked to see it. (UM says it always planned to post its contract.)
The contract was posted later in August. A Computerworld story notes that the contract grants Google sole discretion over use of the scanned material in Google’s services, which is scarcely surprising—and that it explicitly prevents charging end-user fees for searching and viewing search results or for access to the full text of public domain works. UC also agrees not to charge for services using the scanned material (excluding value-added services) and that it won’t license or sell the digital material provided by Google to a third party, or distribute more than 10% of it to other libraries and educational institutions. Finally, Google promises to return the books in the same condition (or pay for or replace them) and has 15 business days (three weeks) to scan a given book.
Karen Coyle compared Michigan and UC contracts carefully. She notes that UC’s contract is silent about quality control for the scans (probably a good thing, given GLP’s early results)—and that UC managed to get “image coordinates” so they can highlight searched words on displayed pages (not in Michigan’s contract). There’s a lot more to Coyle’s analysis, posted August 29, 2006 at Coyle’s InFormation.
Phil Bradley spent some time with GBS and commented in an August 31, 2006 post on his blog, “Google Book Search—to download or not download?” You’ll get the tone from the beginning:
In theory Google Book Search now allows users to download out of copyright books for nothing. In practice, it’s the usual Google botched disaster that we’re getting used to.
Bradley notes that it’s difficult to find books you can download—and when you do, “they’re often either so old [as] to be illegible, or they’ve been badly scanned so it’s almost impossible to read.” Bradley tried some Shakespeare, to compare the results “with the Google disaster that is Google’s Shakespeare Collection.” He found 14 (of 23 searched) that he could immediately download, although “most of the editions would have been difficult to read, to say the very least”—but that’s better than the three at the special collection.
An August 31, 2006 press release from the University of Michigan notes that digital works from the Google project are now enhancing UM’s online catalog via MBooks, a system “intended to support scholarly research.” MBooks provides a page-turning function, the ability to change resolution and change format, updated bibliographic information, and persistent URLs. Users may determine the number of times a search term appears on each page of any scanned book, but apparently even UM researchers won’t be able to view the entirety of books still in copyright.
Finding a downloadable book at Google, I noted the special page that comes along. It’s an interesting document and includes usage guidelines, fortunately after saying “Public domain books belong to the public and we are merely their custodians.” One interesting guideline: “Maintain attribution”—specifically, don’t remove the Google watermark from each page. That’s not an entirely unreasonable request, and it’s stated as a request, not a demand. There’s another: “Make non-commercial use of the files.” The books themselves are in the public domain, which means you’re perfectly free to make any use of them—but Google’s asserting a right in the scanned version. A September 4, 2006 post by Bill McCoy on his Adobe blog questions Google’s “pseudo-license” and repeats Ubois’ assertion, in a different manner: “Just because you’ve got a huge pile of cash and were first in line with a cozy no-bid deal to do this scanning—a deal that cannot even be repeated given the wear and tear on collection items—doesn’t create a special exemption to [public domain].” [Emphasis added.] But Google and OCA both assert that their scanning methods create no more wear and tear than reading a book. McCoy’s assertion doesn’t work for books that are ever circulated, and certainly doesn’t work for UC (as one example). McCoy’s counter-examples are flawed. Google is not claiming ownership of public domain works, only of its scans. Google isn’t preventing libraries from lending the books that Google scanned and anyone (Microsoft, Yahoo, me) is free to scan a borrowed book and, if it’s in the public domain, do anything we want with our scan.
Christina Pikas responded to some of the negative posts on GBS in a September 4, 2006 post at Christina’s LIS rant. “In my world, I’ve found [GBS] to be pretty helpful.” She deals with scientific information, where “you go from less reliable but close to the research to nailed down but far from the cutting edge.” She’s used GBS to improve access to her library’s collection, e.g., searching the scientific name of an uncommon bacterium, which pointed to a molecular biology textbook the library owned. As she concludes, “YMMV,” a basic principle for GBS.
By October, some publishers were beginning to admit that GBS is helping sales, as reported by Jeffrey Goldfarb in an October 6, 2006 Reuters story. Oxford University Press estimates that a million customers have viewed 12,000 OUP titles (from the Google Publisher segment of GBS). Springer Science + Business reports growth in backlist sales based on GBS. Penguin finds more success from Amazon—and specialized publisher Osprey found healthy growth from both sources.
Karen Coyle posts an important lesson from early GBS scanning in an October 24, 2006 post at Coyle’s InFormation: “Google Book Search is NOT a library backup.” GBS uses uncorrected OCR, which “means that there are many errors that remain in the extracted text” (including all line-break hyphenation). Also, it’s not digitizing everything: Some books are too delicate, some will be problematic. “Quality control is generally low” (she provides egregious examples). None of this came as a surprise to most digital librarians, according to a comment from Dorothea Salo.
Péter Jacsó reviewed GBS for Péter’s digital reference shelf (downloaded November 3, 2006); it’s an extensive and negative review, well worth reading. He notes the “ignorance, illiteracy and innumeracy” of the software—“OR” searches yielding fewer results than one of the two terms (or more results than the sum of the two terms!), limits that don’t work, inconsistent handling of full-view books, confusing hit counts. Google doesn’t say how many books are in GBS (or in the full-view portion), always problematic for a database. There’s a lot more here, and although some of it seems based on using GBS as a source for actual reference information rather than a way to find books, it’s nonetheless a good, tough review.
Mick O’Leary wasn’t thrilled with GBS either, as he recounts in a November 2006 Information Today review. I’m not sure why O’Leary believes that GBS and Amazon’s Search Inside! “promise to affect the future of library book collections profoundly.” (O’Leary repeats the claim that you can get past three-page and five-page limitations on in-copyright views by searching for distinctive words on the last page of the excerpt. I’ve never seen that work, at least not in Google, and would love to see repeatable examples.) He says correctly that GBS, if completed, “will be useful primarily as a library finding tool”—and seems to dismiss the importance of that, saying “these books have already lost much of their value” because knowledge advances so rapidly. O’Leary dismisses public domain books as being “of interest only to scholars and other specialized researchers.” I’m not sure what to make of this review, but the synopsis is flat-out wrong: “Google Book Search is Google’s grand project to create a universal full-text e-book library.” That’s simply not true, according to everything Google’s said, unless by “library” you mean “collection whose contents you can determine but not see.”
In October, the University of Wisconsin at Madison became the eighth library in the Google project, focusing on public domain materials, following the Complutense University of Madrid (which announced its participation on September 26). The University of Virginia Library announced its participation on November 14, 2006, focusing on American history, literature, and humanities.
Finally, for now, November news coverage indicates that Google has subpoenaed information on the book digitization efforts of Yahoo! and Amazon—and that both have denied access to the information.
There’s not a lot to say about OCA since this spring other than the summary notes at the top of this piece. The promised October rollout didn’t happen. Sixty-odd people attended an OCA workshop in October 2006—but as of mid-December, the OCA website shows the October 20 event as being in the future. The website for the OCA workshops has a faulty digital certificate; the “discussion area” has eight discussion sections, only one of which has any topics (that topic consisting of one anonymous post with no responses). On the home site, the “press page” shows stories through November 2005. The “Next Steps” page claims a November 2006 update date but appears to date from late 2005. The FAQ says “All content in the OCA archive will be available through the website. In addition, Yahoo! will index all content stored by the OCA to make it available to the broadest set of Internet users”—but there’s no search function on the OCA site. (A recent note: the Sloan Foundation’s kicking in $1 million, directly to Internet Archive, to support OCA digitizing.)
Fortunately, while the OCA level seems moribund, there’s some action within the ranks—although not, as far as I can tell, by Yahoo!, the partner with the highest initial profile.
Microsoft made good on its October 2005 promise to join OCA and to release a book search service. Books.live.com went live (in beta) on December 6, 2006. “Microsoft Live Search Books” (LSB) may be awkward, but it’s part of Microsoft’s general rebranding from MSN to Windows Live. A December 6 post at ResourceShelf offers an excellent brief history of LSB, including links to earlier stories. Gary Price focuses less on competition than on choices: “The more options and tools information professionals have the better. Even Google’s CEO, Eric Schmidt, has said that search is NOT a zero-sum game.”
Microsoft plans to integrate book content with the rest of Windows Live Search, presumably with an available limit for books only. The beta release includes “noncopyright” books from UC, Toronto and the British Library, with books from NYPL, Cornell, and the American Museum of Veterinary Medicine coming soon. (NYPL is also involved in both OCA and Google Library Project.) Price notes some features of LSB and that “Scanning looks nice from what we’ve seen.” (I put “noncopyright” in quotes because LSB includes quite a few oral histories from Bancroft’s Regional Oral History project that are much more recent than 1923, and those don’t appear to be in the public domain.)
CDLINFO Newsletter for December 14, 2006 offers an update on UC’s participation in OCA, noting LSB as a “new portal to access UC libraries books scanned by the Internet Archive for the Open Content Alliance.” The discussion calls LSB “serendipitously fruitful” and notes some interesting local searches. The scanning facilities for UC books are hosted at the two UC regional storage facilities. The article identifies the original focus as Americana, says books provided are identified based on catalog searches (they’re not just taking a shelf at a time), and says the non-damaging nature of Internet Archive’s scanning was affirmed by a test of 800 Berkeley mathematics books. It’s an interesting article.
Tom Peters comments on LSB in a December 12, 2006 post at ALA TechSource. “After playing around for an hour or so…I have to admit—against some vague sense that my better judgment is failing me—that I like it.” Unfortunately, Peters follows that by repeating a report that “LSB does not work well—or at all—when using browsing software other than Internet Explorer.” That’s generally not the case; most users of other browsers (certainly including Firefox) have used LSB without difficulty. Peters does interesting searches—and offers interesting comments. He doesn’t like the name of the service, but that’s really an issue with Microsoft’s online services in general. He wonders why there’s no overall count for the collection—as do I, although the same can be said of GBS and Amazon. (Internet Archive does provide a count for its American Libraries text collection, just over 35,000 at this writing—but that collection does not include everything on LSB.)
After reading Peters’ post, I did a little experimenting using his favorite search terms (“phrenology” and “spontaneous combustion”). Here’s what I found:
• LSB yielded 687 items for “phrenology” and was only willing to show the first 250 of them. It yielded 219 for “spontaneous combustion” (as a phrase; Peters’ 660 must be the two words, which yield 887 on December 15, 2006), and would show all 219 of those. (There appears to be a firm limit of 250 viewable results in the current LSB, as the 887-book result also stops at 250.)
• Neither of those searches yielded any results in Internet Archive’s text collection or American Libraries collection, even though the LSB PDF downloads come from IA servers; the two are clearly out of synch.
• Google Book Search yielded 2,618 for “phrenology”—but would show only 139 books, indicating a typically wifty total result count. For the phrase “spontaneous combustion,” GBS showed 1,041, of which 512 were actually available.
• Restricting GBS to full-view books reduced the first result to 1,603 and the actual result to a mere 63, either one-quarter or one-tenth of LSB’s result. The second search came down to 699 claimed, 489 actual.
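The “one-quarter or one-tenth” comparison depends on which LSB number you use as the denominator: the 250 results LSB would actually show, or the 687 it claimed. A small sketch of that arithmetic, using only the “phrenology” counts reported above (the variable names are mine):

```python
# "phrenology" counts from the December 2006 experiment described above.
lsb_claimed = 687          # total LSB said it found
lsb_shown = 250            # results LSB would actually display
gbs_full_view_actual = 63  # full-view GBS books actually viewable

# Compare the GBS full-view result against each LSB figure.
ratio_vs_shown = gbs_full_view_actual / lsb_shown
ratio_vs_claimed = gbs_full_view_actual / lsb_claimed

print(f"vs. shown:   {ratio_vs_shown:.2f}")
print(f"vs. claimed: {ratio_vs_claimed:.2f}")
```

Roughly 0.25 against the viewable set and 0.09 against the claimed total, which is where the two fractions come from.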
Rick Roche discussed experiments using LSB as a genealogy tool in a December 18, 2006 post at ricklibrarian. Some searches came up empty, others did better. He urges Microsoft to add a proximity search. I suspect a California genealogist might do better at this point, given the source of most early material in the database—and it’s clear that the database has just begun. Roche suggests LSB as a tool even in its current state, since it’s free and can yield surprising results.
Microsoft has posted a significant (and presumably growing) collection of public domain materials in Live Search Books. The scans appear to be more carefully done than some at GBS, although Karen Coyle indicates that the OCR is still pretty poor. As with GBS, the PDF downloads include watermarks on each page (Microsoft’s watermark is light and small).
Otherwise, OCA seems to be missing in action. That may change over the next few months.
The reality of Google Book Search is much less enchanting than the promise; many of the scans seem pretty poor. None of this should be terribly surprising, although it may be disappointing.
Both projects can enhance discoverability for library collections, although LSB must first add “Find a library” functionality. Enhanced discoverability should mean increased use of print collections. Neither project, as far as I can tell, has any serious potential to disrupt libraries or make their print collections less valuable. Neither project will yield a universal digital library. Nor should they be expected to.
Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.
Opinions herein do not represent those of OCLC or YBP Library Services.
Comments should be sent to firstname.lastname@example.org. Comments specifically intended for publication should go to email@example.com. Cites & Insights: Crawford at Large is copyright © 2007 by Walt Crawford: Some rights reserved.
All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.