Cites & Insights: Crawford at Large
ISSN 1534-0937
Libraries · Policy · Technology · Media

Selection from Cites & Insights 6, Number 6: Spring 2006

Perspective

Discovering Books: The OCA/GBS Saga Continues

The short version could be one paragraph. New members continue to join the Open Content Alliance, with affiliated projects such as Alouette, involving 27 major Canadian academic research libraries, and a group of committees has formed to plan OCA’s future. The Google Library Project keeps scanning, the lawsuits haven’t been settled, Google continues to be more opaque than seems necessary—and Google Book Search generates lots of articles and discussions.

Open Content Alliance

Jeffrey Young wrote about OCA in “Scribes of the digital era,” Chronicle of Higher Education 52:21 (January 27, 2006). Young says “many see” the project as “primarily a response to the controversial book-scanning project led by Google,” and that’s unfortunate.

“Although the Open Content Alliance has pledged not to scan copyrighted works without permission, thereby avoiding that thorny legal issue, the project could do as much to shake up the library world as Google’s effort has.” I would question whether either project does all that much to “shake up the library world,” but maybe I’m dense. Young seems to suggest this is the first time libraries have worked together toward digital archives and quotes Brewster Kahle on it being “a vision for an open library.” As of January 27, some 34 libraries had joined OCA.

It’s an interesting article that includes a description of the Scribe, the document scanner used in OCA, which involves manual page turning by employees who can “scan about 500 pages per hour.”

The article says online users will be able to order bound reproductions of OCA books “by paying a small fee to a company that does the printing and binding.” I don’t see such a function yet at openlibrary.org, the site at which sample OCA books can be read, but the “book” about Open Library indicates this as a possibility.

The story reasonably contrasts OCA’s open model with Google’s continuing opacity about GBS and the Google Library Project. Google’s comment on OCA, as cited in the article: “We welcome efforts to make information accessible to the world. The OCA is focused on collecting out-of-copyright works which constitute a minority of the world’s books—a valuable minority, but certainly not complete.”

Next steps

OCA has published its 2006 work agenda at www.opencontentalliance.org/nextsteps.html.

The OCA will initially concentrate on digitally reformatted monographs and serials which represent diverse times, regions and subjects which are in the public domain or available under a Creative Commons license. In other words, the OCA is initially interested in the broad range of digitized documents that are in our libraries and archives.

For an October 2006 event, we would like to focus on materials that reflect the history, people, culture, and ecology of North America. This decision is in part a practical one. It establishes essential priorities for the OCA while emphasizing collection depth as a means of encouraging the development of value-added services. It also reflects the general orientation of the initial collections that have been offered to the OCA (at this stage, OCA is not harvesting metadata).

The coalition has also established six working groups to “advise on key operational issues and help establish essential policies and practices:” Metadata and collaborative collection development, digital preservation, contribution, book format, scanning protocol, and data transfer protocol. The groups are chaired by key people at the California Digital Library, Internet Archive, RLG and OCLC; each group is supported by an RLG staff member with appropriate expertise (for example, Robin Dale serves as RLG program officer support for the digital preservation working group).

Jim Michalko of RLG commented on the 2006 agenda in a January 12, 2006 hangingtogether.org post. He notes “a few key things” in early organizational steps “that please me”:

• keeping the OCA a project of the Internet Archive sidesteps the kinks that long, premature conversations about governance would engender

• declaring a collection focus—Americana, specifically North Americana—provides an essential filter for quick progress and priority setting

• stating a target of October 2006 to unveil a significant digital collection allows all the contributors to focus their efforts.

Related projects and new participants

What about the Million Book Project? The stated goal of the project was to scan one million books by 2005. That goal was clearly not reached. Notably, 10,532 scanned books from this project were available at the Internet Archive two years ago—and the number has increased to 10,556 as of March 22, 2006, despite Brewster Kahle’s assurance in December 2004 that “tens of thousands” were on the way. According to MBP’s FAQ, some 600,000 books have been scanned (primarily in India), but these are not all available online—and, indeed, I can’t find any indication of how many are online.

Note this assertion at the Indian center: “The technological advances today make it possible to think in terms of storing all the knowledge of the human race in digital form by the year 2008.” I find that a trifle optimistic. It appears that the project is becoming affiliated with OCA, to some extent. It clearly can’t be accused of being Anglocentric: Of the 600,000 books scanned, roughly 135,000 are in English.

A December 29, 2005 note at CBC Arts (www.cbc.ca) adds a larger Canadian perspective to the early involvement of the University of Toronto in OCA: 27 major Canadian academic research libraries have joined the Alouette Canada project, a digitization alliance with a substantial scope. According to the release, Alouette Canada “is working with” OCA and also focuses on works already in the public domain.

Google Book Search: Brief Items

Some smaller items about GBS and the Google Library Project (GLP), in chronological order:

• A December 12, 2005 Library Journal item notes the difficulty of finding the “Find it in a library” link, mostly because it only appears on GLP books, not the “much larger (for now)” collection from publishers. That’s an issue Google needs to address; in my informal testing, the library link wasn’t showing even for some items clearly in the public domain. (When James Jacobs at diglet asked Google about this, he received a reply stating the facts, with no explanation. As Jacobs notes, following Google’s reasoning, GLP books should not have links to online booksellers.)

• Mary Sue Coleman, President of the University of Michigan, spoke on “Google, the Khmer Rouge and the public good” to AAP’s Professional/Scholarly Publishing Division on February 6, 2006. She strongly defends GLP and Michigan’s role, explaining why Michigan considers it “a legal, ethical, and noble endeavor that will transform our society.” I won’t go into details of the talk, which is readily available online, but would note that Coleman stresses the preservation aspect of GLP—and that turns out to be a tricky topic (see below). Apart from that issue, I believe Coleman gets it right.

• Siva Vaidhyanathan seems to have moved from an argument that GLP is a bad test case for fair use to a more general condemnation of Google. He now denounces GLP on several grounds—and concludes, apparently, that he knows more about librarianship than the directors of the Michigan, Stanford, Oxford, Harvard, and New York Public Libraries. He calls Coleman’s speech “disingenuous,” says that GBS offers “stunningly bad results” and offers libraries the arcane advice “Don’t throw away that card catalog just yet.” He calls the deal with Google “horrible,” and says “it is stupid and counterproductive” for librarians to “sign over control to an unaccountable private entity.” He says “libraries that are giving away the treasure have abrogated their responsibility to defend the very values that librarianship supports.” Vaidhyanathan claims to be pro-librarian/pro-library. Gale Norton claims to be an environmentalist. I, for one, was not aware that librarians were “signing over control” through participation in GLP or that librarians were “giving away the treasure” by lending copies of books (which is, after all, one of the things libraries do). Michael Madison at madisonian.net has been arguing some of these issues with Vaidhyanathan; Madison doesn’t seem to think we need to “stop Google to save librarians,” and I agree. Apparently, one of Vaidhyanathan’s arguments is that he had trouble finding Cory Doctorow’s Down and out in the Magic Kingdom using GBS—but searching for “science fiction magic kingdom” in Google yields the book right away. The problem here is that Doctorow’s novel apparently isn’t in GBS—so you can’t find it there, although it’s readily available through Google itself.

• James Jacobs posted something much more significant at diglet on February 16, 2006, in “Thoughts on Google Book Search,” after hearing Daniel Clancy, engineering director for GBS, speak at Stanford: “Clancy mentioned that Google was not going for archival quality (indeed could not) in their scans and were ok with skipped pages, missing content and less than perfect OCR—he mentioned that the OCR process averaged one word error per page of every book scanned! The key point that I took away from this is that Google book project is not an alternative to library/archive/archival/preservation scans. Libraries will still have an important role to play (as we already know!) because a certain percentage of the digitized content owned by StanMichOxYork will be basically unusable as archival, preservation-level digital content. Google's ok with that, but libraries shouldn't be!” For a book search engine, one word error per page isn’t bad (that’s roughly 99.7% perfect OCR)—but it appears that Mary Sue Coleman may have received a scrambled message about preservation.

• Cory Doctorow thinks publishers “should send fruit-baskets to Google” and explains why in a February 14, 2006 essay at boing boing. I disagree with Doctorow on huge chunks of his argument (print books are going away, people now get all their info online, yada yada), but he makes excellent points on some of the publisher and author complaints against Google, specifically the idea that because Google intends to make money (indirectly) from GBS, authors and publishers should get a cut of the action. “No one comes after carpenters for a slice of bookshelf revenue. Ford doesn’t get money from Nokia every time they sell a cigarette-lighter phone-charger. The mere fact of making money isn’t enough to warrant owing something to the company that made the product you’re improving.” It’s a long essay, particularly for boing boing—4,096 words, the equivalent of more than five C&I pages. (Commenting on Doctorow’s essay, Vaidhyanathan says “the case law on fair use is totally hostile to Google,” despite Doctorow’s citation of case law that favors Google. Lawyer Jonathan Band, cited below, also believes that there’s significant case law favoring Google. Apparently, Siva Vaidhyanathan is not only a better librarian than five major library directors, he’s a better copyright lawyer than Jonathan Band or others who believe Google has a good case—since he says “totally hostile,” it must be overwhelming. I’m impressed by the multifaceted genius and authority of Prof. Vaidhyanathan!)

• A February 23, 2006 Chronicle of Higher Education piece by Andrea L. Foster notes Google’s new “fact-checking brigade” to cope with “misperceptions” about GBS. One such misperception is Susan Cheever’s Newsday assault on Google. Among other things, Cheever says, “The amount of words that constitute fair use varies according to court case. At present, it is 400 words.” As any librarian should know, that’s nonsense. The U.S. Copyright Office fact sheet does not provide a word limit. Even the conservative guidelines from the office suggest “1,000 words or 10 percent of a work of prose, whichever is less” for republication—and those are guidelines, not legal findings.

• Rob Capriccioso wrote “Google’s not-so-simple side” on February 27, 2006 at Inside higher ed (www.insidehighered.com). He reports on a “lively discussion” at the American Enterprise Institute-Brookings Joint Center for Regulatory Studies. One audience member made the claim that “it would be relatively easy…to quickly piece together snippets of…books until entire chapters or texts were available online,” a claim that’s almost certainly nonsense. Edward Timberlake, “who said he works at the U.S. Copyright Office,” made a startling statement about the copies of scans that Google returns to the owning libraries: “He said that the libraries are doing ‘a lot of stuff’ with those electronic versions that authors and publishers don’t believe they have permission to do.” But authors and publishers chose not to include the libraries in their suits against Google, and there is absolutely no indication that any library involved plans to do anything other than use the scans as dark archives. Capriccioso doesn’t cite any example from Timberlake of this “stuff” libraries are doing.
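The OCR figure in the Jacobs item above is simple arithmetic, and a minimal sketch makes the hidden assumption visible. The words-per-page values below are hypothetical; neither Clancy's reported remark nor Jacobs's post gives one. The only number taken from the text is "one word error per page."

```python
def word_accuracy(errors_per_page: float, words_per_page: int) -> float:
    """Fraction of words recognized correctly, given an average error count per page."""
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    return 1.0 - errors_per_page / words_per_page


# Hypothetical page densities; the source gives only "one word error per page".
ASSUMED_WORDS_PER_PAGE = (250, 333, 400)

if __name__ == "__main__":
    for words in ASSUMED_WORDS_PER_PAGE:
        accuracy = word_accuracy(1, words)
        print(f"{words} words/page -> {accuracy:.2%} of words correct")
```

Under these assumptions, a page of about 333 words is what makes one error per page come out near 99.7%; sparser pages give a somewhat lower figure and denser pages a somewhat higher one.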

Google Book Search: Longer items

Congressional Research Service

Robin Jeweler of the Congressional Research Service prepared “The Google Book Search Project: Is online indexing a fair use under copyright law?”, issued December 28, 2005 (Order Code RS22356, available at fpc.state.gov/documents/organization/59028.pdf). The six-page report notes the situation and that, “Once again, new technology and traditional principles of copyright law appear to be in conflict.”

Because of the unique facts and issues presented, there is scant legal precedent to legitimize Google’s claim that its project is protected by copyright law’s fair use exception to liability for infringement. Thus, questions presented may be ones of first impression for the courts.

Jeweler concludes that Google’s “opt out” option “contributes to the content holders’ claim that Google is engaged in massive copyright infringement.” Summarizing the positions, Jeweler says plaintiffs consider Google’s project strictly commercial “because it ‘pays’ for the libraries’ collections by delivering digital copies back to them” and because Google will gain advertising revenues. Google “essentially contends that its opt out program negates any infringement liability” and that, in any case, the activity is fair use, citing Kelly v. Arriba Soft.

Jeweler notes that fair use is not strictly a matter of evaluating the four factors encoded in law; “Because fair use is an ‘equitable rule of reason’ to be applied in light of the overall purposes of the Copyright Act, other relevant factors may also be considered.” Without attempting to predict how courts would rule, the CRS report offers some observations on the issues at hand. A few examples:

With respect to the first factor, the purpose and character of use, the searching and indexing goal appears to be a highly transformative use of the copied text. There is little question that indexing basic information about any book alone, absent copying, would not constitute copyright infringement. While displaying “snippets” of text is closer to infringing activity, the prospective display, as described by Google, does not appear to usurp or negate the value of the underlying work.

The second factor is the nature of the copyrighted work. Digitizing the collections of the named libraries will encompass both factual and creative works, the latter being entitled to the highest level of copyright protection. How the court views the third factor—amount of the portion used—will be significant. In order to create its megadatabase, Google will scan the entire copyrighted work, a major consideration weighing against fair use. But it intends to display, i.e., use, at any given time, only brief excerpts of the searchable text. Hence, is the digital reproduction incidental to an otherwise fair use or is it impermissibly infringing?

Finally, what will be the Library Project’s effect on the potential market for or value of the copyrighted works? Here, Google makes a strong argument that its indexing and text searching capability has the potential to greatly enhance the market for sales for books that might otherwise be relegated to obscurity. Its “sampling” of text permits members of the public to determine whether they wish to acquire the book.

Jeweler notes publishers’ claim that copyright owners routinely receive license fees for authorized sampling (but not, as far as I know, for indexing). There’s also the speculative claim that publishers could potentially participate in, and derive revenue from, a similar project. And, of course, publishers “expressed concern” that the library copy “may facilitate piracy and/or additional unauthorized uses”—although publishers didn’t sue the libraries.

How about case law? “Google asserts that Kelly v. Arriba Soft Corp. supports its claim of fair use, and in many respects it does.” Google’s snippets represent “far more limited reproduction and display” than Arriba Soft’s thumbnail images of full-sized pictures. A distinction is that the images in question were voluntarily uploaded to the internet.

There’s more. Sony Corporation of America v. Universal City Studios—the Betamax case—held that, in some cases, apparently infringing activity that facilitates an arguably legitimate use is fair use. Other cases have failed to expand that category—but neither have they overruled it. The report concludes:

How the court (or courts) that consider this case define the issues presented will ultimately determine whether the suit against Google sets an important precedent in copyright law. Viewed expansively, the court may find that copying to promote online searching and indexing of literary works is a fair use. To many observers, such a holding could be the jurisprudential equivalent of Sony’s sanctioning of “time shifting.” If the court adopts a more narrow view of fair use that precludes Google’s digitization project, searchable literary databases are likely to evolve in a less comprehensive manner but with the input and control of rights holders who view them as desirable and participate accordingly.

Jonathan Band via ALA OITP and Plagiary

Jonathan Band continues to write some of the most lucid analyses of GLP. The Google Library Project: The copyright debate, issued in January 2006, is available as an OITP Technology Brief from ALA at www.ala.org/ala/washoff/oitp/googlepaprfnl.pdf. A related analysis appears in the new ejournal Plagiary (www.plagiary.org) as “The Google Library Project: Both sides of the story.”

Both sixteen-page publications provide detailed discussion of the issues at play. Unlike far too many commentators, Band is very clear about the limited visibility of copyrighted works: “This is a critical fact that bears repeating: for books still under copyright, users will be able to see only a few sentences on either side of the search term—what Google calls a ‘snippet’ of text… Indeed, users will never even see a single page of an in-copyright book scanned as part of the Library Project.” Here’s one I hadn’t realized: “Google will not display any snippets for certain reference works, such as dictionaries, where the display of even snippets could harm the market for the work.”

Band finds Kelly v. Arriba Soft applicable, and goes a little further than the CRS report: “[I]t is hard to imagine how the Library Project could actually harm the market for books, given the limited amount of text a user will be able to view… Moreover, the Library Project may actually benefit the market for books…”

Publishers claim Google’s storage of the full text of each book makes it different from Arriba’s storage of compressed low-rez versions of images. Band: “This seems to be a distinction without a difference, because Arriba had to make a high resolution copy before compressing it.” Publishers also attempt to deny the applicability of Kelly because it involved the copying of digital images already on the internet (thus providing an implied license to copy), while Google is digitizing analog works.

Google has three possible responses to this argument. One, the Kelly decision makes no reference to an implied license, nor has any other copyright decision relating to the Internet. Two, this argument suggests that works uploaded onto the Internet are entitled to less protection than analog works. This runs contrary to the entertainment industry’s repeated assertion that copyright law applies to the Internet in precisely the same manner as it applies to the analog environment.

Three, Google can argue that its opt-out feature constitutes a similar form of implied license…

As you’d expect, copyright holders have a third argument against applying Kelly: It was wrongly decided. Plaintiffs would much prefer that UMG Recordings v. MP3.com be used as precedent. But, Band says, Google will contend that MP3.com is easily distinguishable: Google’s use is far more transformative and Google’s use will not harm any likely market for the books. Band says “there is no market for licensing books for inclusion in digital indices of the sort envisioned by Google.”

There’s a lot more here, to be sure. I strongly recommend reading one or both of Band’s pieces. He has something to say about Siva Vaidhyanathan (quoting from the Plagiary article, where there’s a direct endnote to Vaidhyanathan):

While in theory it might be preferable from a societal point of view for the Library Project to be conducted by libraries rather than a private corporation, libraries simply do not have the resources to do so. Thus, as practical matter, only a large search engine such as Google has both the resources and the incentive to perform this activity.

Band concludes “A court correctly applying the fair use doctrine as an equitable rule of reason should permit Google’s Library Project to proceed.”

EContent and Online

Jessica Dye’s “Scanning the stacks” appears in the January/February 2006 EContent; the March/April 2006 Online includes a ten-page cluster of four brief articles on GBS. Both are worth reading. Jessica Dye offers a reasonable quick overview of the situation, perhaps favoring anti-Google voices somewhat.

The cluster in Online is curious. Marydee Ojala begins with a clear commentary on how GBS actually works, at least in its current form—and hopes that searchability improves as it evolves. K. Matthew Dames argues that library organizations should support GBS—but says that “the library community’s only public comments on Google Book Search come from an ALA president who seems more concerned with the possibility that his copyright could be ‘flaunted’ than the possibilities that someone could find, use, or buy his work.” I don’t understand this: Cites & Insights is most certainly part of the library community, as are many blogs and periodicals that have had very public statements in favor of GBS. Or does Dames only consider statements by officers of library organizations? David Dillard, speaking from a reference librarian’s perspective, thinks GBS can be very helpful when looking for books with relatively obscure content, offers some examples, and concludes that “revenue brought in by books should invariably increase as more people learn of books containing answers to their information needs.” As with other librarians (whose opinions I’ve read) who have actually looked at GBS and its potential, Dillard expects it to be a good thing both for book publishing and for libraries.

Then there’s Michael A. Banks and “An author looks at Google Book Search.” It’s the same-old, same-old. The illustrations show only books provided through the Google Publisher Project, with no snippets at all. Banks claims GBS “can actually discourage some users from buying books” because it “displays the very information being sought” in certain kinds of nonfiction books. “Having seen the information, there’s little chance the searcher will buy the books.” That might be true, if snippets were more than a sentence or two and if GBS didn’t suppress snippets in reference works. He speaks of “pillaged” books that are “intellectual property with value, created by people who anticipate being paid for the time, effort, and expense that go into them.” Great, except for the preface: “[M]any, many readers buy reference, tutorial, and how-to books to get at specific information. Now they can go to Google Book Search and get the information for nothing.” Since that’s simply not true, the rest does not follow.

Other Google Cases

While the Google Library Project suits have not yet been heard in court, other cases have been. Perfect 10 won a lawsuit regarding thumbnail images; counsel for plaintiffs in the GLP suits claimed this finding was bad news for Google’s stance on GLP, while Google and EFF didn’t see any precedential similarity.

Blake Field sued Google for caching an article Field had posted on his website; a Nevada district court ruled against Field, saying he had “attempted to manufacture a claim for copyright infringement against Google in hopes of making money from Google’s standard practice.” The court granted summary judgment on four bases: Since Field did not allege that the Googlebot’s initial copy was an infringement, using the cache could not be considered direct infringement; Field didn’t opt out (there was no “no archive” metatag and there was an explicit “allow all” robots.txt header); Google’s cache is fair use; and that cache qualifies as a DMCA “safe harbor.” EFF’s Fred von Lohmann says the decision is “replete with interesting findings that could have important consequences for the search engine industry, the Internet Archive, the Google Library Project lawsuit, RSS republishing, and a host of other online activities.”
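For readers unfamiliar with the two opt-out signals the court mentions, here is a minimal sketch of how a crawler might check them, using only Python's standard library. The sample robots.txt text, the HTML pages, the URL and the helper names are all hypothetical illustrations, not material from the court record or from Google's crawler. The sketch assumes the usual spelling of the meta directive, `noarchive`.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def crawl_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def caching_opted_out(html: str) -> bool:
    """True if the page carries a robots meta tag with the noarchive directive."""
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noarchive" in meta.directives


# Hypothetical inputs for illustration only.
ALLOW_ALL = "User-agent: *\nAllow: /\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
PAGE_WITHOUT_TAG = "<html><head><title>Story</title></head><body>...</body></html>"
PAGE_WITH_TAG = '<html><head><meta name="robots" content="noarchive"></head></html>'
URL = "http://www.example.com/story.html"

if __name__ == "__main__":
    print("crawl allowed:", crawl_allowed(ALLOW_ALL, "Googlebot", URL))
    print("opted out of caching:", caching_opted_out(PAGE_WITHOUT_TAG))
```

In this sketch, a page served with an "allow all" robots.txt and no such meta tag reports no opt-out on either check, which is the situation the court describes; whether Google's own crawler evaluates these signals in the same way is not something the ruling as summarized here spells out.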

Another district court—this one the Eastern District of Pennsylvania—rejected a civil complaint (for copyright infringement and other activities) against Google by Gordon Roy Parker, “an online publisher of sexual seduction guides” who also offers racetrack betting tips. In this case, the complaint (filed by Parker, a former paralegal) was termed “rambling” but the judge was clear that Google’s caching does not constitute infringement.

The saga will continue. OCA’s benefits are clear; the alliance’s choice to avoid copyright issues is cautious but clears the way for more expansive uses of material. GBS is a muddier situation, not aided by Google’s lack of transparency—but there seems little doubt that GBS and the Google Library Project will serve the aims of copyright, at least as stated in the Constitution: “To promote the progress of science and useful arts.” Being able to discover books based on obscure content within those books doesn’t substitute for library catalogs and doesn’t seem to have any chance of substituting for the books themselves—but it can promote progress by making it easier to find work on which to build. How can that be a bad thing?

Cites & Insights: Crawford at Large, Volume 6, Number 6, Whole Issue 76, ISSN 1534-0937, a journal of libraries, policy, technology and media, is written and produced by Walt Crawford, a senior analyst at RLG.

Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.

Hosting provided by Boise State University Libraries.

Opinions herein may not represent those of RLG, YBP Library Services, or Boise State University Libraries.

Comments should be sent to waltcrawford@gmail.com. Comments specifically intended for publication should go to citesandinsights@gmail.com. Cites & Insights: Crawford at Large is copyright © 2006 by Walt Crawford: Some rights reserved.

All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

URL: citesandinsights.info/civ6i6.pdf
