Cites & Insights: Crawford at Large
ISSN 1534-0937
Libraries · Policy · Technology · Media


Selection from Cites & Insights 5, Number 14: December 2005


Perspective

OCA and GLP 2: Steps on the Digitization Road

Many voices have offered many opinions since the last time I discussed Google Print (and the Google Library Project) at any length (Cites & Insights 5:11, October 2005). Lawsuits have been filed. Scholars, lawyers and pundits have weighed in on the merits of the suits and the nature of fair use. Google ended its scanning moratorium and opened a chunk of Google Print for preliminary use—and a big, new, multipartner complementary project began, the Open Content Alliance (OCA). Finally, Google changed the misleading “Google Print” name to the much better “Google Book Search”—making it clear that the primary aim of the project is to help people locate books of interest, not print them (or, in most cases, read them online).

This essay notes some of the things that have been said since the last roundup, injecting commentary along the way. A separate Perspective, “OCA and GLP 1: Ebooks, Etext, Libraries and the Commons,” summarizes some of my thoughts on the possibilities and issues involved—including definitional issues such as the difference between making the text of a book available online and making the book (or at least its pages) available online.

I’ll start with OCA even though it’s the newer of the two; some commentaries address both OCA and Google Book Search. When commentaries refer to Google Library, I assume they mean the Google [Print] Library Project, GLP; when they refer to Google Print, I assume they mean what’s now Google Book Search, which encompasses both the established publisher-based program and GLP.

Open Content Alliance

Posts and pieces about the Open Content Alliance began around October 2. By the end of October, the new coalition had several major partners, a range of comments, and what appears to be a bright future.

FAQ

The best description I’ve seen of OCA is embedded within the FAQ (www.opencontentalliance.org/faq.html). Here’s quite a bit of it, leaving out most questions, with a couple of comments interjected:

The Open Content Alliance (OCA) represents the collaborative efforts of a group of cultural, technology, nonprofit, and governmental organizations from around the world that will help build a permanent archive of multilingual digitized text and multimedia content. The OCA was conceived by the Internet Archive and Yahoo! in early 2005 as a way to offer broad, public access to a rich panorama of world culture.

The OCA archive will contain globally sourced digital collections, including multimedia content, representing the creative output of humankind.

All content in the OCA archive will be available through the [OCA] website. In addition, Yahoo! will index all content stored by the OCA to make it available to the broadest set of Internet users. Finally, the OCA supports efforts by others to create and offer tools such as finding aids, catalogs, and indexes that will enhance the usability of the materials in the archive.

Worth noting: Yahoo! does not plan to be the sole source for web searching.

Contributors to the OCA include individuals or institutions who donate collections, services, facilities, tools, or funding to the OCA… The OCA will continue to solicit the participation of organizations from around the world.

The OCA will seed the archive with collections from the following organizations: European Archive, Internet Archive, National Archives (UK), O'Reilly Media, Prelinger Archives, University of California, University of Toronto.

An international effort from the start: the European Archive, the UK’s National Archives and Toronto.

The OCA will encourage the greatest possible degree of access to and reuse of collections in the archive, while respecting the rights of content owners and contributors. Generally, textual material will be free to read, and in most cases, available for saving or printing using formats such as PDF. Contributors to the OCA will determine the appropriate level of access to their content…

“Formats such as PDF” is not the same as “only available in PDF.”

Metadata for all content in the OCA will be freely exposed to the public through formats such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and RSS.

The OCA is committed to respecting the copyrights of content owners…

Will copyrighted content be digitized or placed in the OCA archive without explicit permission from rights-holders?

No…[explained at some length]

The OCA is committed to working with all types of content providers to grow its archive. The OCA has been in discussions with major publishers and the organizations that represent them in order to explore legal, sustainable business models through which more copyrighted content can be made widely available…

There’s the starting point: Something a little like GLP—but a lot different, with a broader range of partners, a commitment to openness (including open access where feasible) and interoperability, a strong archival bent, and—on the downside—no single massive source of funding. As Eli Edwards put it in the first blog post I encountered regarding OCA (at Confessions of a mad librarian, edwards.orcas.net/~misseli/blog/, October 2, 2005): “It is not as ambitious as the Google Print project, but it has the potential to be a very useful supplement, as well as a way to promote open standards and collaboration.”

University of California press release

The California Digital Library issued a press release on October 3, “UC libraries partner with technology companies and non-profits to provide free public access to digitized books.” The release notes that the Internet Archive will do the scanning “using a new technology that scans books at the cost of 10 cents per page”—contrasting that with the costs of scanning archival photographs and documents, which “typically begin at $20.00 per page.”

The release quotes UC Santa Cruz literature professor Richard Terdiman on the virtues of having “public domain literary texts available online”—“This will be a wonderful boon to students and scholars, and a great service to the public.” It also notes two other sponsors: Adobe and Hewlett Packard.

“The UC libraries will contribute books and resources in order to build a collection of out-of-copyright American literature that will include works by many great American authors.” Contribute, in this context, means “lend for non-destructive scanning.”

Ann Wolpert, ARL president, offers the final and longest quote:

“This is an exciting step in the ongoing development of open access solutions for citizens, students, scholars, and researchers worldwide… Working with the OCA, academic and research libraries can provide greater access to an untold wealth of high quality, high value materials, contribute expertise in developing reliable and authoritative collections, and help shape the structure of online services. Libraries, publishers, educational institutions, and others must collaborate around initiatives like the OCA to effectively serve their communities in the 21st century.”

I was surprised to read that scanning archival documents still starts at $20 per page; perhaps I should not have been. Details of how Internet Archive does high-quality non-destructive book scanning and OCR for $0.10 per page are starting to emerge; it’s not and probably can’t be a wholly automated operation.

While neither the FAQ nor this press release clarifies this point, “public domain” does not mean “published before 1923.” A great many items published between 1923 and 1976 are in the public domain, but research is required to identify those items. My understanding is that the Internet Archive plans to carry out that research as needed.

In challenge to Google, Yahoo will scan books

That’s the headline on an October 3 New York Times story by Katie Hafner. The competitive thrust may be necessary to make a newspaper story exciting, but I find it a bit unfortunate. OCA is and should be less and more than a “challenge to Google.” Within the story, that’s clear: books will be “accessible to any search engine, including Google’s.” An odd way to mount a challenge!

The story mentions “hundreds of thousands of books” and “specialized technical papers” as well as “historical works of fiction.” Brewster Kahle snipes at Google, if indirectly: “Other projects talk about snippets. We don’t talk about snippets. We talk about books.” So, to be sure, does the Google Library Project for public domain books, the only kind OCA currently plans to scan. By the end of the article, Kahle’s changed from sniping to recruiting: “The thing I want to have happen out of all this is have Google join in.”

UC’s contribution is cited as “as much as $500,000…in the first year” along with volumes to be scanned. The article estimates Yahoo!’s contribution as between $300,000 and $500,000 for the first year. HP and Adobe are contributing hardware and software.

Other items: OCA and OCA-Google comparisons

Most of these are second-hand, based on Peter Suber’s excerpts in Open access news (www.earlham.edu/~peters/fos/).

Andrew Orlowski of The Register (October 4) claims the grievance of publishers and authors against Google is that it “has got stuff, if not for free, then at a bargain price” and quotes Seth Finkelstein’s view that Google is trying to supplant publishers as the middleman between authors and readers. “So what at first looks like a copyright issue on closer examination is really a compensation issue. Just as we’ve seen with music. There is no copyright crisis.” He says at some point “would-be gatekeepers such as Google and Yahoo! will do the decent thing and pay for licenses to use the content…” I’m impressed that Orlowski knows better than most everyone else—“there is no copyright crisis”—and that he believes licenses should be required in order to index material.

A Daily Californian article (October 4) provides more detail on UC’s selection: “the university’s prized American literature collection…works written from the 1800s until the 1920s,” all from UC’s libraries (which, taken as a whole, represent the largest university library collection), all available for free download. If this includes portions of Berkeley’s archival collections, it will be a groundbreaking contribution.

Preston Gralla gets it wrong in an October 5 piece at Networking Pipeline, “Yahoo gets book-scanning right…almost.” His problems with the project?

First is that the material will be made available in Adobe Acrobat format, rather than as text. Acrobat is a notoriously finicky format, and the Acrobat reader has probably crashed more computers than anything this side of Windows. It’s big, it’s ugly, and it’s a resource hog. People should have the option of viewing in plain text. Second is that all the work in the archive, regardless of copyright, will be made fully available as Acrobat files, so it can be easily printed out. This is great for public domain works, but not so great for copyrighted works.

I refer you back to the FAQ. PDF (there is no “Acrobat format”) is one format in which works will be made available—and it’s by far the best established and most robust format in which to display actual book pages. Nothing in the FAQ says that works won’t, at some point, also be offered as plain text. Gralla’s slam at Adobe Reader (which Suber also finds to be an “annoying format”) is over the top, but hey, he’s a commentator. His second point is just wrong: OCA makes it extremely clear that copyright works won’t be posted for full downloading without the express consent of the copyright holders.

Microsoft (via MSN) joined OCA in late October. According to an October 25 press release, MSN Search plans to launch MSN Book Search—and MSN committed to digitizing 150,000 books in 2006 (or Microsoft contributed $5 million for 2006 digitization, roughly the same thing). An October 25 story at Search Engine Watch says Microsoft is making separate deals with libraries and will contribute some scanned material to the OCA database.
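Why would $5 million and 150,000 books be “roughly the same thing”? A back-of-envelope check (my arithmetic, not anything from the press releases; the implied average book length is an inference) shows the two figures line up at the Internet Archive’s quoted scanning cost of 10 cents per page:

```python
# Back-of-envelope check: does $5 million buy ~150,000 scanned books
# at $0.10/page? The implied pages-per-book figure is an inference,
# not a number reported by Microsoft or the Internet Archive.
cost_per_page = 0.10       # Internet Archive's quoted scanning cost
budget = 5_000_000         # Microsoft's reported 2006 contribution
books = 150_000            # MSN's stated digitization commitment

pages_per_book = budget / (books * cost_per_page)
print(round(pages_per_book))  # ~333 pages per book, a plausible average
```

At roughly 333 pages per average book, the dollar figure and the book count are indeed interchangeable, which supports reading the two announcements as describing one commitment.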

An October 26 Reuters story says OCA has added “more than a dozen major libraries in North America, Britain and Europe.” Lulu (called a “publisher of out-of-print books,” but I think of Lulu as a publish-on-demand service) is also working with OCA. Google and OCA are talking, and it’s probably only a matter of time before they find common ground.

Just to add a little heat to the light, Tim O’Reilly (whose O’Reilly Media is an early OCA member) grumped on his blog about Microsoft—saying the group was “being hijacked by Microsoft as a way of undermining Google” (according to an October 31 Seattle Post-Intelligencer story). When interviewed, O’Reilly backed down, saying “hijacking” was a little strong and that “it’s good that Microsoft is participating in the group.”

Still, he said he considers it inaccurate to portray Google as the “bad guy” for its initiative and Microsoft as the “good guy” for joining the alliance. In reality, O’Reilly said, the fundamental aims of the alliance and Google aren’t opposed.

I haven’t seen many commentaries (other than those from AAP and other litigants) calling Google a “bad guy” or Microsoft a “good guy” in this case. Rick Prelinger of OCA and the Internet Archive said, “From the beginning, there was a hope that (Microsoft) would join” and said of its $5 million: “That doesn’t seem like undermining to me.”

A November 5 Washington Post story notes a Microsoft deal with the British Library that appears to be OCA-related: Scanning 100,000 books (25 million pages) and making them available on Microsoft’s book search service next year. “Microsoft says that it will seek permission from publishers before scanning any books protected by copyright.”

The November 9 Wall Street Journal has an article about the Internet Archive/OCA scanning project at the University of Toronto. I’ve only seen quoted portions of the article. “In the little more than a year since the group started scanning books, it has digitized just 2,800 books, at a cost of about $108,250.”
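The Toronto figures are consistent with the 10-cents-per-page claim. A quick sanity check (my arithmetic; the per-book page count is an inference, not a figure from the article) runs as follows:

```python
# Sanity check on the Wall Street Journal's Toronto figures.
# Pages-per-book is inferred from the quoted $0.10/page cost,
# not reported in the article itself.
books = 2_800
total_cost = 108_250.0

cost_per_book = total_cost / books      # ≈ $38.66 per book
pages_per_book = cost_per_book / 0.10   # ≈ 387 pages at $0.10/page
print(round(cost_per_book, 2), round(pages_per_book))
```

About $38.66 per book implies an average of roughly 387 pages per volume—entirely plausible for a research library collection, and in line with the per-page cost quoted in the UC press release.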

RLG, my employer, announced on October 27 that it will be a contributor to and partner with OCA. RLG will supply bibliographic information from the RLG Union Catalog to aid in materials selection and description. The RLG Union Catalog includes records for more than 48 million titles in almost 400 languages. One detail in RLG’s press release: Public domain books in the Open Library (OCA’s current name for its online collection), which “can be downloaded, shared, and printed for free…can also be printed for a nominal fee by a third party, who will bind and mail the book to customers.”

That’s where it stands as of this writing. An ambitious plan with a wide (and growing) range of partners; not as ambitious as the Google Library Project (at least initially), but with a strong open access bent and considerable potential for growth. As with Google Book Search, even if it never achieves everything initially suggested, it should be beneficial. And as with Google Book Search, it seems unlikely to replace print libraries—and isn’t intended to.

Google Book Search and Other Google Stuff

When last we left Google and what was then Google Print (including GLP), publishers had attacked Google and Google temporarily suspended the scanning project. Since then, there have been two lawsuits, a resumption of scanning, the first substantial addition of GLP books to Google Book Search (and the new name)—and many commentaries.

There are far too many commentaries to note individually. Charles W. Bailey, Jr. has a good starter list: “The Google print controversy: A bibliography,” www.escholarlypub.com/digitalkoans/2005/10/25/the-google-print-controversy-a-bibliography/

I’ve tried to point out interesting arguments, odd statements, and some of the people on various sides of various GLP-related issues. I’m not even attempting to provide citations; for those items not in Bailey’s bibliography, web searching should get you there.

The Authors Guild Suit

This suit was filed September 20 as a class action suit with jury trial demanded. The complaint itself runs to 14 double-spaced pages. It claims that Google’s reproduction “has infringed, and continues to infringe, the electronic rights of the copyright holders…” In the next paragraph, the suit makes a questionable factual assertion:

4. Google has announced plans to reproduce the Works for use on its website in order to attract visitors to its web site and generate advertising revenue thereby.

Google has explicitly said that only snippets of in-copyright books, no more than three of them, each containing no more than a paragraph, would be displayed. Calling up to three paragraphs of a book “reproduc[ing] the Work” is outlandish and appears to deny the existence of fair use. The same claim is repeated in the following paragraph.

After a page of claims, the suit identifies three named plaintiffs (Herbert Mitgang, Betty Miles and Daniel Hoffman), each of whom has at least one book with registered copyright held at the University of Michigan (presumably chosen because it’s one of two Google 5 libraries that has agreed to complete digitization). It then describes the Authors Guild and Google and asserts a class definition and allegations. Paragraph 34 is worth quoting in full:

34. Google’s acts have caused, and unless restrained, will continue to cause damages and irreparable injury to the Named Plaintiffs and the Class through:

a. continued copyright infringement of the Works and/or the effectuation of new and further infringements;

b. depreciation in the value and ability to license and sell their Works;

c. lost profits and/or opportunities; and

d. damage to their goodwill and reputation.

I’m no lawyer, but it’s hard to imagine how points b, c, and d could be demonstrated without showing that Google planned to show a lot more than three snippets from a copyright work—or inventing a new “licensing for indexing” revenue stream that authors have never had in the past.

Is GLP Fair Use?

My original non-lawyer’s opinion was that GLP couldn’t pass a fair use test since it involves making and retaining copies of entire copyright works for commercial gain—even though the copies themselves won’t be visible to any Google user. Since then, I’ve concluded that I don’t know what to think…

Even intellectual property lawyers can change sides. William Patry initially called the project “fantastic” but could “see no way for it to be considered fair use… what they have done so far is, in my opinion, already infringing.” He revisits the situation later, analyzing based on market impact, and concludes that GLP is fair use. Jonathan Zittrain (Harvard Law) thinks it’s a tossup (or at least that the outcome of a trial will be a tossup).

Yes—or at least probably

Timothy B. Lee of the Cato Institute says GLP has a strong case based on transformative use and the nearly-certain positive market impact. William Fisher (Harvard Law) and Jessica Litman (Wayne State Law) agree. Julie Hilden says yes based on market share but offers a note that seems to confuse justice and law: “But the point of copyright law isn’t to protect against copying, it’s to protect against harm to the value of intellectual property.” (Actually, according to the Constitution, it’s to promote progress in science and useful arts, but never mind.)

Susan Crawford offers a multipoint discussion and says:

All computers do is copy. Copyright law has this idea of strict liability—no matter what your intent is, if you make a copy without authorization, you’re an infringer. So computers are natural-born automatic infringers. Copyright law and computers are always running into conflict—we really need to rewrite copyright law. But even without rewriting copyright law, what Google plans to do is lawful.

She uses fair use as the basis for that claim. Her first sentence is unfortunate. As anyone who’s ever used a spreadsheet or database, edited a photograph, spell-checked, or used Word stylesheets should know, computers do a whole lot more than copy—but it’s true that most of what they do involves copying. (Sigh. In another later posting, she repeats this claim: “All computers do is make copies.” [Emphasis added])

Lawrence Lessig’s “Google sued” post asserts “Google’s use is fair use” with little argument: “It would be in any case, but the total disaster of a property system that the Copyright Office has produced reinforces the conclusion that Google’s use is fair use.” Much as I admire Lessig, my reaction is “Huh?”

Eric Schmidt (Google’s CEO) claims fair use in a Wall Street Journal piece. I find Google’s full vision improbable and a bit too grandiose—“Imagine sitting at your computer and, in less than a second, searching the full text of every book ever written”—but that’s another issue. Schmidt says Google will not place ads on GLP result pages, weakening the “commercial gain” argument. I wonder about his refutation of the notion that “making a full copy of a given work, even just to index it, can never constitute fair use. If this were so, you wouldn’t be able to record a TV show to watch it later or use a search engine that indexes billions of Web pages.” Maybe, but the second part is stronger than the first. I’m impressed that “Google Print will allow [backlist titles] to live forever.” Few corporations predict that they’ll always be around—particularly corporations as young as Google.

Tim Lee has a charming article (well worth reading) at Reason, “What’s so eminent about public domain?” He notes the efforts of copyright extremists to take advantage of the backlash against the Kelo decision (the recent eminent domain case). You get a newly-formed “Property Rights Alliance” talking about “recent Supreme Court decisions gutting physical and intellectual property rights”—but, as Lee says, “there haven’t been any recent Supreme Court decisions ‘gutting’ intellectual property rights.” Quite the opposite, in Grokster, Eldred v. Ashcroft and others. Apparently Authors Guild spokespeople are claiming that GLP “seizes private property” and making an analogy with eminent domain. Lee’s note:

Yet in reality, the excerpts of copyrighted books shown by the service would be far too short to be of use to anyone looking for a free copy. And under copyright law, the use of short excerpts has traditionally qualified as fair use. If the Authors’ Guild prevails, it will leave copyright owners with much greater control over how their content is used than they have traditionally enjoyed in the pre-Internet world.

No—or probably not

David Donahue cites the Texaco case (no fair-use right for a private corporation to photocopy entire articles for its research staff) and Williams & Wilkins (fair-use right for a nonprofit library to do similar photocopying) and thinks Google falls in between. Eric Goldman says his heart finds GLP “great and therefore we should interpret copyright law in a way to permit it. Unfortunately, my head says that this is highly suspicious under most readings of copyright law.”

Karen Christensen of Berkshire Publishing Group doesn’t like GLP—and includes an odd attack on Berkshire’s primary customer base:

Librarians, unfortunately, don’t understand the rights of the creators and producers of books. Most librarians do not understand the work and expense, the expertise and talent, involved in creating the publications they buy. And quite a few believe that information should be free…

Pat Schroeder and Bob Barr go beyond saying GLP isn’t fair use. “Not only is Google trying to rewrite copyright law, it is also crushing creativity…. Google’s position essentially amounts to a license to steal…”

Preston Gralla seems consistent in misreading or misunderstanding. A November 3 post at Networking Pipeline titled “Google retreats in book scanning project” refers to Google’s “plan to make available for free countless thousands of copyrighted books without the copyright holders’ permissions.” He notes Google is now “not showing the contents of copyrighted books.” But that’s not a retreat; it’s been Google’s consistent plan to show snippets of copyright works unless publishers explicitly agree to allow pages to be displayed. Gralla claims the Authors Guild and AAP suits are “no doubt…why no copyrighted books have been made available today” and expresses his clear belief that Google should give up: “Here’s hoping that Google is having second thoughts about the program, and will ultimately back down…”

ALPSP says no

ALPSP issued a formal statement asserting its firm belief that “in cases where the works digitised are still in copyright, the law does not permit making a complete digital copy for [Google’s] purposes.” The group opposes Google’s opt-out solution and advises its members “that if they are not sure about the program, they should exclude all their works for the time being.” On the other hand, ALPSP does suggest publishers “protect both in- and out-of-copyright print and electronic works by placing them in the Google Print for Publishers program instead.” One wonders how publishers protect out-of-copyright works; surely public domain means public domain? Peter Suber notes that this and an earlier ALPSP statement assert “an abstract property right without claiming injury.” The second statement also threatens legal action. His note:

If the ALPSP believes that the absence of publisher injury and the possibility of publisher gain needn’t be mentioned because they are irrelevant to its case, then it is mistaken. Apart from their relevance to policy, they will be relevant to any court asked to decide whether the Google copying constitutes fair use under U.S. copyright law.

ALPSP takes the same dogged approach to GLP that it does to open access. Sally Morris (CEO of ALPSP) was quoted as commenting that endorsing GLP is to say “it’s OK to break into my house because you’re going to clean my kitchen,” further noting: “Just because you do something that’s not harmful or (is) beneficial doesn’t make it legal.”

Morris has firm principles. When interviewed by Danny Sullivan for SearchDay, she says Google should, in principle, also “seek opt-in permission before indexing freely available web pages.” That attitude, if made law, could indeed lead to the shutdown of internet search engines.

Jonathan Band’s analysis

Band, who “represents Internet companies and library associations with respect to intellectual property matters in Washington, D.C.,” prepared what may be the most widely-referenced copyright analysis of GLP, “The Google Print Library Project: A copyright analysis.” One version appears in E-Commerce Law & Policy 7:8 (August 2005); that version also appears in ARL Bimonthly Report 242 (October 2005), www.arl.org/newsltr/242/google.html. A related article with a different title (“The Authors Guild v. the Google Print Library Project”) appears at LLRX.com (www.llrx.com/features/googleprint.htm), published October 15. His concise analysis is clearly written and well worth reading in its entirety.

Band notes the need to consider exactly what Google intends to do in each aspect of Google Book Search. As regards AAP’s attack on Google (and the Authors Guild suit), Band asserts that both the full-text copy and the snippets shown in response to queries fall within fair use. Band relies on Arriba Soft as a precedent—a case in which the defendant compiled a database of images from web sites, showing thumbnails in response to queries and linking back to the original website from thumbnails. (One difference: Arriba Soft did not retain the full-size images after preparing thumbnails.) The court found for Arriba Soft, saying its use of a given photographer’s images “was not highly exploitative,” that the thumbnails served an entirely different purpose than the original images (making them transformative), and that the use benefits the public. “Everything the Ninth Circuit stated with respect to Arriba applies with equal force to the Print Library Project.”

Band’s analysis of Arriba Soft and comparison with GLP issues is detailed and fairly convincing. Certainly the market effect seems to favor GLP. Does any rational author or publisher really believe that increased findability will decrease their market? “It is hard to imagine how the Library Project could actually harm the market for certain books, given the limited amount of text a user will be able to view.” Band also concludes that GLP is “similar to the everyday activities of Internet search engines” and explains the fair use analogies. Concluding (the LLRX version):

The Google Print Library Project will make it easier than ever before for users to locate the wealth of information buried in books. By limiting the search results to a few sentences before and after the search term, the program will not diminish demand for books. To the contrary, it will often increase demand for copyrighted works by helping users identify them. Publishers and authors should embrace the Print Library Project rather than reject it.

Fred von Lohmann’s analysis

Here’s how Fred von Lohmann (EFF) sees Google’s case for the four elements of fair use as it applies to the Authors Guild suit:

Nature of the Use: Favors Google. Although Google's use is commercial, it is highly transformative. Google is effectively scanning the books and turning them into the world's most advanced card catalog. That makes Google a whole lot more like Arriba Soft than MP3.com.

Nature of the Works: Favors Neither Side. The books will be a mix of creative and factual, comprised of published works. The works cited in the complaint include "The Fiery Trial: A Life of Lincoln" (largely factual history) and "Just Think" (described elsewhere as: "pictures, poems, words, and sayings for the reader to ponder").

Amount and Substantiality of the Portion Used: Favors Google. Google appears to be copying only as much as necessary (if you are enabling full-text searching, you need the full text), and only tiny snippets are made publicly accessible. Once again, Google looks a lot more like Arriba Soft than MP3.com.

Effect of the Use on the Market: Favors Google. It is easy to see how Google Print can stimulate demand for books that otherwise would lay undiscovered in library stacks. On the other hand, it is hard to imagine how it could hurt the market for the books--getting a couple sentences surrounding a search term is unlikely to serve as a replacement for the book. Copyright owners may argue that they would prefer Google and other search engines pay them for the privilege of creating a search mechanism for their books. In other words, you've hurt my "licensing market" because I could have charged you. Let's hope the court recognizes that for the circular reasoning it is.

I believe von Lohmann’s off base on the second point: biographies and other “factual” works are also protected by copyright unless they’re purely listings of facts. As a library person, I could also do without “world’s most advanced card catalog.” Quite apart from being a bit like the world’s best jet-powered buggy whip (how many card catalogs have you seen lately?), that description asserts that full-text search is inherently more “advanced” than cataloging, an assertion I disagree with. It’s different and complementary.

Siva Vaidhyanathan disagrees with von Lohmann for other reasons, as noted in a September 21 post at Sivacracy.net:

Fred has oversimplified this terribly.

He does not consider the fact that the copying in question is complete and total--100 percent of the work. The authors care about the first complete copy, not how it is later presented in commercial form.

He does not consider that the "nature of the work" is set by the most protected works, not the least. For each suit, there is a particular nature of the work. Novelists and poets are among those suing. That's where the test will be.

Lastly, he mistakenly forgets the most powerful and troublesome word in the fourth factor: "potential." The issue is the effect on the "potential" markets, not the established markets. Because a market exists (and a greater potential market lurks) for licensed digital images of published books, the library project is about that market (see Amazon and Google Print) rather than the market for the physical book.

Vaidhyanathan would rather be on von Lohmann’s side (as he notes), but I question that final paragraph. GLP won’t offer digital images of copyright books. Authors may or may not have anything to gain from “licensed digital images of published books,” depending on their book contracts, and so far there’s really not an established market of any size. In any case, wouldn’t full-text searchability inherently increase the market for digital images, if that’s what people want?

Should Authors and Publishers be Suing Google?

Xeni Jardin says no, in a September 25 article in the Los Angeles Times: “You authors are saps to resist Googling.” Jardin’s another one who calls the outcome of GLP a “digital card catalog” (do lawyers and writers live in a time warp?). She notes the distinction with the war on file sharing: “Google isn’t pirating books. They’re giving away previews.” Internet history has shown that “any product that is more easily found online can be more easily sold.” She notes that the Authors Guild squabbled with Amazon over its “look inside” feature as well. She goes on to suggest that such “paranoid myopia” could lead to a total shutdown of search engines: “What’s the difference, after all, between a copyrighted Web page and a copyrighted book?” (Seth Finkelstein’s answered that one: Web pages are freely available for anybody to read and download; books aren’t.)

Lawrence Solum (University of San Diego Law) also says no and objects to class-action certification: “That class [copyright-holders for books in Michigan’s library] includes many authors who would be injured if the plaintiffs were to prevail—including, for example, me!” Solum (a prolific writer) knows he’ll benefit from wider dissemination of his works. Jack Balkin (Yale Law) feels the same way: “As an author who is always trying to get people interested in my books…I have to agree…the Author’s Guild suit against Google is counterproductive and just plain silly.” Peter Suber also notes that he’s one who falls into the class and doesn’t want to be included—unless Google prefers to fight a class-action suit. Put me in the same category: Michigan owns several of my books, to which I hold copyright, and I believe a successful suit will harm me indirectly if not directly.

While Peter Suber admits to seeing plausible cases that GLP infringes copyright, “I haven’t yet seen a plausible case that the authors or publishers will be injured.” He believes that Authors Guild may be looking for a cut of Google’s ad revenue: “If so, then…we’re watching a shakedown.”

Tim Wu (University of Virginia Law) wrote “Leggo my ego: GooglePrint and the other culture war” at Salon (October 17). I guess “GooglePrint” is Salon’s neologism; Wu doesn’t leave out the space consistently. He thinks sensible authors should favor GLP as part of “the exposure culture,” in which “getting noticed is everything.” “The big sin in exposure culture is not copying, but instead, failure to properly attribute authorship.” He makes the point that authors really can’t reconcile a desire for exposure with total authorial control and makes an analogy between indexes and maps. “[B]ooks, as a medium, face competition. If books are too hard to find relative to other media, all authors of books lose out, and authors of searchable media like the Web, win.” Well, maybe…

You could think of OCA as “competing” with GLP, and OCA has deliberately avoided any copyright questions—but Brewster Kahle calls AAP’s suit “counterproductive” and notes that it “could get really messy in a way that will damage progress.”

Nick Taylor (president of the Authors Guild) thinks it’s necessary to sue—because otherwise Google is getting rich at the expense of authors. He goes on, “It’s been tradition in this country to believe in property rights. When did we decide that socialism was the way to run the Internet?” Going from talking about Google having cofounders “ranking among the 20 richest people in the world” to cries of “socialism” for a Google project: Now there’s a creative leap that marks a true Author, as opposed to a hack like me. He brings in “people who cry that information wants to be free” a bit later. Peter Suber comments that Taylor is “as clueless as I feared”—and that’s a fair comment.

You know where Fred von Lohmann stands—but maybe not his analysis of economic harm.

[W]ith the Google Print situation, it’s a completely one-sided debate. Google is right, and the publishers have no argument. What’s their argument that this harms the value of their books? They don’t have one. Google helps you find books, and if you want to read it, you have to buy the book. So how can that hurt them? (From a November 9 Salon article by Farhad Manjoo, which Peter Suber calls “the most detailed and careful article I’ve seen on the controversy over Google Library.”)

Playing devil’s advocate (reluctantly, because I do agree with von Lohmann in this case): To the extent that Google shows library links as well as purchase links for GLP books, it encourages use of libraries—which some publishers could see as harming sales. But boy, is that a stretch…unless they’re planning to attack the First Sale doctrine next.

Should Google Settle?

Some commentators believe Google should settle (by ceasing the copyright portion of GLP or agreeing to some form of license) either because they’re convinced Google’s in the wrong or because they’re afraid courts might make fair use matters even worse. Others believe Google should fight the suits, including some who feel that way even if Google’s likely to lose.

No; it should fight the suits

Timothy Lee (Cato Institute): “Given the tremendous benefit Google Print would bring to library users everywhere, Google should stick to its guns. The rest of us should demand that publishers not stand in the way.” Michael Madison looks to Google as a “public domain proxy” and thinks Google should fight the case—even though “I’m not convinced that Google is in the right.” He makes a good point: If nobody ever litigates fair-use cases, what’s left of fair use?

The group weblog Conglomerate:

Should Google fight the case? Absolutely. From a litigator’s and trial lawyer’s point of view, this is a case worth fighting… It isn’t very often when a fair use argument gets raised by a big-time, well-financed corporate entity.

Wired News shows its professionalism in “Let Google copy!” (September 22) when it calls the Authors Guild of America the “Writers Guild of America” in the lead sentence. Here’s the somewhat utopian stance on the likely outcome: “The courts should take this opportunity to loosen unnecessary restrictions that are limiting innovation with no clear benefit to the public or rights holders.” The final paragraph, on what should happen if courts fail to recognize whole-work copying for index creation as non-infringing: “If courts refuse to recognize this distinction, Congress should authorize a limited compulsory license to allow unilateral digitization of works for inclusion in a commercial database, provided, of course, that the database doesn’t strip content creators of their ability to profit from their efforts.” But once it’s a license, licensing fees are at issue and fair use is out the window.

Derek Slater says, “When I look at the Google Print case, I say ‘game on’—I see a chance for a legitimate defendant to take a real shot at making some good law. There’s broad and even unexpected support for what Google’s doing.”

Lawrence Lessig hopes Google doesn’t settle:

A rich and rational (and publicly traded) company may be tempted to compromise—to pay for the “right” that it and others should get for free, just to avoid the insane cost of defending that right. Such a company is driven to do what’s best for its shareholders. But if Google gives in, the loss to the Internet will be far more than the amount it will pay publishers. It will be a bad compromise for everyone working to make the Internet more useful—and for everyone who will ultimately use it.

Yes—or at least the suits are dangerous

Siva Vaidhyanathan thinks it’s the wrong fight: “It’s not just Google betting the company. It’s Google gambling with all of our rights under copyright—both as copyright producers and users.”

Peter Suber notes that the merits of GLP’s case for fair use are important to settle. “But I admit that I’m not very comfortable having any important copyright question settled in today’s legal climate of piracy hysteria and maximalist protection.” He notes that Google’s wealth is a wildcard: It enables Google to defend itself—but it makes Google an extremely attractive target for a class action suit.

The Second Suit

The Association of American Publishers (AAP) announced this suit on October 19. While the suit has five plaintiffs (McGraw-Hill, Pearson Education, Penguin, Simon & Schuster and John Wiley & Sons), it’s “coordinated and funded by AAP.” Pat Schroeder’s take in the press release announcing the suit: “[T]he bottom line is that under its current plan Google is seeking to make millions of dollars by freeloading on the talent and property of authors and publishers.” (In later commentaries, Schroeder displays a remarkable incuriosity about facts. She says, “The creators and owners of these copyrighted works will not be compensated, nor has Google defined what a ‘snippet’ is: a paragraph? A page? A chapter? A whole book?” For anyone willing to make the effort of clicking “About Google Print” on the Google Print home page, the answer’s clear: Less than a paragraph.)

AAP told Google it should use ISBNs to “identify works under copyright and secure permission from publishers and authors to scan these works.” That does nothing for works published between 1923 and 1966, of course, and the PR explanation glosses over two inconvenient facts: The ISBN links to the publisher at time of publication, which may since have merged, changed names, or folded—and publishers don’t always control copyright.

The suit itself, filed in U.S. District Court for the Southern District of New York, looks longer than it is. The body runs just 14 double-spaced pages (plus a cover page), with many more pages listing copyright titles published by the five plaintiffs and known to be held in the University of Michigan Libraries (along with three Google illustrations that undercut some of the claims, since they show the tiny displayed snippets of copyright books).

Publishers bring this action to prevent the continuing, irreparable and imminent harm that Publishers are suffering, will continue to suffer and expect to suffer due to Google’s willful infringement, to further its own commercial purposes, of the exclusive rights of copyright that Publishers enjoy in various books housed in, among others, the collection of the University Library of the University of Michigan in Ann Arbor, Michigan (“Michigan”).

That’s the second paragraph in “Nature of the action.” The fourth paragraph provides AAP’s assertion of Google’s motive: “All of these steps [in GLP] are taken by Google for the purpose of increasing the number of visitors to the google.com website and, in turn, Google’s already substantial advertising revenue.” But Google doesn’t run ads on its home page and says it won’t show ads on GLP pages.

Later, we learn that GLP “completely ignores [publishers’] rights,” which is simply false (else GLP would show pages from all books), and we get interesting language on GLP: “When Google makes still other digital copies available to the public for what it touts as research purposes, it does so in order to increase user traffic to its site, which then enables it to increase the price it charges to advertisers.” [Emphasis added] Quite apart from the questionable nature of the last clause, GLP will not make “other digital copies available to the public” (unless AAP seriously claims that the snippets constitute infringement).

There’s a lot of text describing the five publishers and Google, including one paragraph that appears to dismiss fair use and other restrictions on copyright:

It has long been the case that, due to the exclusive rights enjoyed by Publishers under the Copyright Act, both for-profit and non-profit entities provide royalties or other considerations to Publishers in exchange for permission to copy, even in part, Publishers’ copyrighted books.

A bit later, we learn that Google is “one of the world’s largest media companies”—in a context that makes it appear that AAP equates “media company” and “ad delivery mechanism.” That’s odd, given how few books (currently) deliver ads.

As the suit goes into detail about GLP, we are informed once again that each copyright work listed in the exhibits is “at imminent risk of being copied in its entirety and made available for search, retrieval and display, without permission”—never mind that you can’t search a single book by itself or that “display” consists of no more than three paragraphs, each surrounding an occurrence of a word or term. The suit dismisses any analogy with indexing and caching web pages, partly because “books in libraries can be researched in a variety of ways without unauthorized copying. There is, therefore, no ‘need,’ as Google would have it, to scan copyrighted books.” Read that carefully: It appears to say that the existence of online catalogs negates any usefulness of full-text indexing.

Consider the first sentence of paragraph 31:

There is no principled distinction between the Google Print Program for Publishers and the Google Library Program, with respect to the types of works that are copied, the digital technology used to copy and store the books, the amount of a book that is copied by Google and the public accessibility and display of the copied works.

The idea that the “pages around your text” display of a Google Publishers Project text is no different than the “snippets” display of a copyright GLP work is, in my opinion, ludicrous.

That “displaying copies of” claim appears again in paragraph 38. Paragraph 40 repeats the claim that GLP “has greatly and irreparably damaged Publishers…” The “prayer for relief” shows AAP’s attitude regarding fair use: It asks for a permanent injunction to keep Google from “in any manner, reproducing, publicly distributing and/or publicly displaying all or any part of any Publisher’s copyrighted works…”

Once I saw the first claim of irreparable harm, I read the suit carefully for any claim of actual economic harm. The closest I see is paragraph 35:

Google’s continuing and future infringements are likely to usurp Publishers’ present and future business relationships and opportunities for the digital copying, archiving, searching and public display of their works. The Google Library Project, and similar unrestricted and widespread conduct of the sort engaged in by Google, whether by Google or others, will result in a substantially adverse impact on the potential market for Publishers’ books.

That’s it. AAP is claiming that making book text searchable and showing a sentence or two around searched words, together with information about the book so that an interested party can borrow or purchase it, “will result in a substantially adverse impact on the potential market for Publishers’ books.” You have to wonder why AAP members have cooperated in the Google Publishers’ Program if enhanced exposure is such a terrible thing. Pat Schroeder may claim that Google’s opt-out provisions turn copyright law on its head—but this claim turns reality on its head. More exposure yields fewer sales: What a notion!

GLP and Libraries

Richard Leiter gets it right in an October 18 post at The life of books, after clarifying the aims of Google Book Search: “[L]ibrarians need to be prepared for a renaissance; free online services like this will mean better access to libraries and greater demand for books. Not only will libraries’ collections grow, but our numbers of patrons will too.”

Thom Hickey also gets it right in a November 15 post at Outgoing, “Impact of Google print.” “Here’s my prediction: seeing the page images online will result in more requests for the physical object, not less.” He thinks there may also be more use of “equivalent items” (that is, other manifestations of the same work) at other libraries. I’d guess that’s also likely.

I do wonder about one statement in Hickey’s post: “More people will look at a particular page online than will ever look at that physical page in all the copies in all the libraries in the world. That’s clear…” Is it really? If GLP manages to digitize and make available all of the public-domain books in the five participating libraries, that’s roughly 6.5 million books (see the article at the end of this essay). At an average of 250 pages each, that’s 1.6 billion pages. I think it’s hard to make the case that Google Book Search usage will be so high and have searches so varied that any given page will be viewed more via Google Book Search than it has ever been in the sum of all its physical copies in all the libraries in the world. (Take one of the interior pages from, say, a best-selling pre-1924 edition of Alice in Wonderland.) I doubt that we’ll ever know: That’s the kind of argument that’s nearly impossible to settle.
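The back-of-the-envelope arithmetic above is easy to check; a minimal sketch, using the 6.5 million-book estimate and the assumed 250-page average from the text:

```python
# Rough page count for the public-domain portion of the Google 5 collections
books = 6_500_000          # estimated public-domain books across the five libraries
pages_per_book = 250       # assumed average length
total_pages = books * pages_per_book
print(f"{total_pages:,}")  # 1,625,000,000, i.e. roughly 1.6 billion pages
```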

Some observers have looked for mass digitization and online book availability to replace physical libraries, whether they regard that outcome as desirable or undesirable. That hasn’t changed, and those observers tend to be the ones who miss what GLP is actually doing (and how unlikely it is that many people will read full books on-screen as page images). You’ve probably heard the names before. Realistically, I can see no way that GLP can be used as an argument against continued use of library collections of print books.

Barbara Fister posed the libraries-and-GLP question in a strikingly different way at ACRLog on October 20: “I can’t help wondering—if lending libraries were invented today, would publishers lobby to delete the ‘first sale’ doctrine from copyright law, arguing it enables a harmful form of organized piracy?”

Ben Vershbow may be reacting too soon in his November 3 if:book post, “google print’s not-so-public domain,” where he complains that the initial showing of books doesn’t amount to much other than snippets. “The idea of a text being in the public domain really doesn’t amount to much if you’re only talking about antique manuscripts, and these are the only books that they’ve made fully accessible… This is not an online library. It’s a marketing program. Google Print will undoubtedly have its uses, but we shouldn’t confuse it with a library.” If GLP succeeds, it will be far more than a “marketing program”—but Google itself has been clear that it’s not out to replace libraries, so Vershbow’s right in that last sentence. (Vershbow says Google’s been getting “a lot of bad press for its supposedly cavalier attitude toward copyright”; given the balance of what I’ve seen, Vershbow must read different sources than I do.)

One comment on a Lessig post is unfortunate as an example, but there it is: Dan Jacobsen, a college student, says that GLP’s availability has caused him to order four books from nearby universities that he found on Google Print. “I have never before used the school library for research material, and were it not for Google Print, I would never have found these books.” It’s sad that a college student seems proud of never using the library for research material.

Other Google Matters

Siva Vaidhyanathan would prefer to see libraries themselves carrying out mass scanning projects. As heard in “On the Media,” he says this about “outsourcing” digitization to Google: “Their technology is proprietary. Their algorithms for search are completely secret. We don’t actually know what’s going to generate a certain list of results. They don’t work for us.” Seth Finkelstein quoted this passage in an October 17 Infothought post, adding:

Again—“They don’t work for us.” Whatever their cool geek-dream origin (and I share the fantasy!), Google is now a very large corporation, accountable only [to] the shareholders. It may seem overly critical to emphasize it, but that’s reality.

Here’s a truly strange one, caught by Peter Suber: a press release from the National Consumer League attacking GLP not only for “threats to the principle of copyrights” but also “cultural selectivity, exclusion, and censorship.” Why? Because “any database which represents itself as being a ‘full’ or ‘complete’ record of American culture…must, in fact, be complete”—and Google might be forced to be incomplete. “To the extent that Google finds itself drawing lines for inclusion or exclusion based even indirectly on content…it makes itself a censor of our history and culture.” But when did Google say that GLP would create a “full or complete record of American culture”? And on what basis can incompleteness (as Suber notes, none of the Google 5 libraries has a “complete record of American culture”) be used as the basis for condemning the project? Suber: “The NCL objection not only starts from a false premise, but would abort any project that cannot reach completeness in one step.” I concur with Suber that NCL seems to be saying no literature should be easy to access until all of it is. This condemnation—which went to the Senate and House—also seems wildly out of character with NCL’s history.

Longer Article

Lavoie, Brian, Lynn Silipigni Connaway and Lorcan Dempsey, “Anatomy of aggregate collections: The example of Google Print for Libraries,” D-Lib Magazine 11:9 (September 2005). www.dlib.org/dlib/september05/lavoie/09lavoie.html

After discussing the (possibly changing) role of books in libraries and the desirability of inter-institutional projects, the article considers the Google 5 collections in terms of coverage, language, copyright, works, and convergence. The paper is interesting not only for its direct answers but also its secondary objective, “to lay some groundwork for a general set of questions that could be used to explore the implications of any mass digitization initiative.”

Some of the findings:

•    As of January 2005, WorldCat includes some 32 million records for books among its 55 million records; books thus represent slightly less than 60% of the database.

•    The Google 5 have more than 18 million book holdings (in WorldCat) in all. If there were no overlap, that could represent 57% of the print book total. Including overlap, the Google 5 appear to hold 33% of the book titles in WorldCat—10.5 million. Of that 10.5 million total, 61% are reported from only one Google 5 library; 20% show up in two; 10% in three; 6% in four; and 3% (0.4 million) in all five. You can expect that universal digitization of all five libraries would result in about 40% redundancy—redundancy at the edition level, not the works level. (If you’re digitizing books, the edition level is an appropriate measure, in my opinion; if you only care about text, then the works level might be more appropriate.) That 61% figure is fairly startling: At least among these five large institutions, research library collections are far more diverse than you might expect.

•    Unsurprisingly, multiple holdings show up more frequently among newer publications. For example, 74% of books published between 1801 and 1825 are uniquely held by one of the five, while only 55% of those published between 1951 and 1975 are unique (the same rate holds for 1976-1985). But then, 55% is still a high level of uniqueness.

•    Also unsurprising: English language books don’t make up the majority of titles in the Google 5 collections, but it’s close (49%).

•    Many people were surprised by the raw copyright finding: Roughly half of the combined Google 5 collections were published after 1974, thus definitely under copyright unless published by government agencies or otherwise explicitly placed in the public domain. Only about 20% of the collections were published prior to 1923 and can be presumed in the public domain. The cutoff date for clear copyright protection is 1963; the actual percentage of public domain works (omitting government publications) is somewhere between 20% and 37%. Noting a claim elsewhere that only 7% of possible renewals actually took place, the figure might be closer to 36%—that is, 20% plus 93% of 17%.

•    For those more interested in unique works than unique books, OCLC’s algorithm for “FRBRizing” records yields 26.1 million works out of the 32 million book titles—and 9.1 million of these (35%) have at least one manifestation in a Google 5 library. (As noted in a footnote, even those who don’t care about typography wouldn’t want a pure “works” focus, since a French translation of Macbeth is considered the same work as the original English version; “expressions” would be a better target. I’d argue that, particularly for out-of-copyright materials where pages can be viewed and PoD could be provided, the title is in fact the best target.)

•    Some analysis of “convergence” says you’d need to digitize many library collections to achieve a “complete” digital database: For example, adding five more institutions holding eight million titles would only add 1.8 million new titles to those in Google 5.
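Two of the percentages in the list above are easy to reconstruct; a minimal sketch, using the figures quoted from the Lavoie, Connaway and Dempsey article:

```python
# Edition-level redundancy among the Google 5 holdings
holdings = 18_000_000        # total book holdings across the five libraries
unique_titles = 10_500_000   # distinct titles after removing overlap
redundancy = 1 - unique_titles / holdings
print(f"redundancy: {redundancy:.0%}")          # 42%, i.e. "about 40%"

# Public-domain share, adjusting for unrenewed copyrights
pre_1923 = 0.20              # clearly public domain
renewal_era = 0.37 - 0.20    # published 1923-1963: 17% of the collections
unrenewed = 0.93             # the claim that only 7% of renewals were filed
pd_share = pre_1923 + unrenewed * renewal_era
print(f"public domain share: {pd_share:.0%}")   # 36%
```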

Well worth reading, both for the direct study and for its consideration of implications for other multi-institutional projects.

Cites & Insights: Crawford at Large, Volume 5, Number 14, Whole Issue 70, ISSN 1534-0937, a journal of libraries, policy, technology and media, is written and produced by Walt Crawford, a senior analyst at RLG.

Cites & Insights is sponsored by YBP Library Services, http://www.ybp.com.

Hosting provided by Boise State University Libraries.

Opinions herein may not represent those of RLG, YBP Library Services, or Boise State University Libraries.

Comments should be sent to waltcrawford@gmail.com. Comments specifically intended for publication should go to citesandinsights@gmail.com. Cites & Insights: Crawford at Large is copyright © 2005 by Walt Crawford: Some rights reserved.

All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

URL: citesandinsights.info/civ5i14.pdf