Cites & Insights: Crawford at Large
ISSN 1534-0937
Libraries · Policy · Technology · Media

Selection from Cites & Insights 6, Number 14: December 2006

The Library Stuff

Crook, Edgar, “For the record: Assessing the impact of archiving on the archived,” RLG DigiNews 10:4 (August 15, 2006).

Full disclosure: As before, it’s worth noting that RLG DigiNews isn’t edited by (former) RLG (now OCLC) staff. It continues to be produced by the IRIS Research Department of Cornell University Libraries. Crook is at the National Library of Australia, and this fascinating article recounts experience with PANDORA, Australia’s Web Archive: an NLA project that’s been archiving web-based publications for a decade. PANDORA includes some 12,000 titles, ranging from one document to “a whole government website containing thousands of pages.”

“This study examines publisher behaviour and attitudes in relation to Internet archiving.” “Publisher” means “document producer” in this case, not inherently a commercial publisher. NLA used an online survey and examined archived material—and also compared “knowingly-archived” material in PANDORA with some “unknowingly-archived” material in other archives. (PANDORA explicitly seeks permission before archiving; that’s not true of all internet archives, most specifically not the Internet Archive.)

Internet publications archived in PANDORA may get more respect, since the request to archive involves explicit selection criteria: “One publisher of an online novel has even used [an] excerpted sentence [from the form letter] to make it seem like a positive review.” Almost all the producers thought PANDORA archiving was worthwhile—but only 35% used PANDORA to view any other website, and roughly a third didn’t believe PANDORA would actually provide long-term preservation. Most publishers don’t rely on PANDORA as a backup method.

PANDORA is a “light archive” (materials are openly available). It gets reasonably high usage, more than five million pages in 2004-2005 (the most frequently accessed sites being those no longer available on the ‘live web’). But 92% of publishers thought PANDORA archiving resulted in more hits for their publications.

PANDORA includes some blogs, and the study asked whether inclusion might cause bloggers to self-censor. Apparently not, based on survey results and a comparison of archived and unarchived blogs. As for pure ejournals, mostly from government and academic sites, most publishers believe PANDORA archiving increases citation rates but doesn’t have much effect on submission rates.

A fascinating report, well worth reading.

Dowling, Thomas, “UTF-8 and Latin-1: The leaning tower of Babel,” posted June 24, 2006.

Dowling offers a pithy explanation of the relationship between Latin-1 (the expanded ASCII character set you most often encounter) and UTF-8, the most common way of transmitting Unicode. ASCII’s a 7-bit code. Latin-1 is the most common way of using the extra 128 slots (or, actually, extra usable 96 slots) available with 8 bits.

Unicode provides for universal character encoding: It allows “several million slots, currently with a little under 100,000 of them assigned to characters.” But you can’t fit several million bit patterns into an 8-bit character, so Unicode requires multibyte character encoding. And, although making every character 24 bits (three bytes) long would support more than 16 million patterns, that won’t work in the real world: It breaks existing ASCII and Latin-1 text and uses three times as much data all those times when what you need is a basic Roman character set. That’s where UTF-8 comes in. UTF-8 is identical to ASCII for slots 0 through 127, but once you reach slot 160 (128 through 159 left blank), the initial byte determines how many bytes constitute a character. “Which means that UTF-8 requires all supporting code and applications to break the ‘one byte=one character’ assumption and UTF-8 is not compatible with Latin-1.”
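A minimal Python sketch of that difference (my illustration, not from Dowling's post; the helper name `byte_lengths` and the sample characters are assumptions chosen for the example). It shows how many bytes each character takes in Latin-1 and in UTF-8, using Python's built-in codecs:

```python
# Illustrative sketch: compare Latin-1 and UTF-8 byte lengths per character.
# byte_lengths is a hypothetical helper name, not part of any library.

def byte_lengths(ch):
    """Return (latin1_len or None, utf8_len) for a single character."""
    try:
        latin1_len = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1_len = None  # the character has no slot in Latin-1
    return latin1_len, len(ch.encode("utf-8"))

# Sample characters: plain ASCII, a Latin-1 accented letter,
# a Turkish letter outside Latin-1, and the euro sign.
samples = ["A", "\u00e9", "\u015f", "\u20ac"]

for ch in samples:
    print(ascii(ch), byte_lengths(ch))

# ASCII text is byte-for-byte identical in both encodings...
assert "A".encode("latin-1") == "A".encode("utf-8")
# ...but an accented Latin-1 letter is one byte in Latin-1 and two in UTF-8.
assert "\u00e9".encode("latin-1") != "\u00e9".encode("utf-8")
```

In Python's UTF-8 codec, every character above slot 127 takes two or more bytes, which is why code that assumes one byte per character stops working as soon as accented letters appear.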

The rest of Dowling’s commentary is why I recommend this item. He deals with real-world problems in an era where most browsers and PC programs do recognize UTF-8: It’s not universal in the computer and communications fields. He’s looking at a journal article about thermal springs in Turkey—and names of the springs and authors are coming out garbled.

The problem, if my diagnosis is anywhere close, is that we have multibyte UTF-8 characters being passed through code that treats them as single byte characters (probably Latin-1), and then being passed through code that converts each of those into multibyte UTF-8 characters. Repeat a few times and the result is gibberish.
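The failure Dowling describes can be reproduced in a few lines. This is only a sketch under the assumption that his diagnosis is right; the function names `mangle` and `unmangle` and the sample word are mine, not taken from the post or the article he examined:

```python
# Sketch of the suspected failure: UTF-8 bytes are misread as Latin-1
# characters, then each of those characters is encoded as UTF-8 again.
# mangle/unmangle are hypothetical names used only for this illustration.

def mangle(text, rounds=1):
    """Misinterpret the UTF-8 bytes of text as Latin-1, `rounds` times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text

def unmangle(text, rounds=1):
    """Reverse mangle(), assuming nothing else touched the text."""
    for _ in range(rounds):
        text = text.encode("latin-1").decode("utf-8")
    return text

sample = "kapl\u0131ca"  # a sample word containing a Turkish dotless i

once = mangle(sample)
twice = mangle(sample, rounds=2)

# Each pass makes the damaged text longer.
print(len(sample), len(once), len(twice))

# The damage is reversible only if the number of passes is known.
assert unmangle(twice, rounds=2) == sample
```

The reversal works here only because every step is known and lossless. In a real pipeline, where one stage may also drop or substitute bytes, the original text may not be recoverable.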

I won’t attempt to replicate that gibberish here; it might come out as different gibberish. I’d guess the problem is as Dowling diagnoses it—although it’s still possible that some EBCDIC-ASCII translations are mixed in as well.

Dowling discusses the problem and its ramifications. Support for UTF-8 is still hit or miss, and all it takes is one misstep along the way to destroy the character set integrity. As Dowling concludes (prior to one more cute example), interoperability is still tricky.

Final report of the field test of the Playaway self-contained portable digital audio book player conducted by the Mid-Illinois Talking Book Center, March 31, 2006.

This 18-page report offers detailed analysis of this four-month project, which involved 50 blind and visually impaired volunteers and “several copies of 25 titles” on Playaways. There were 140 circulations during the period; 55 feedback forms were returned.

What’s a Playaway? The title describes it well: A little device looking like a miniature book, containing one digital audiobook, a battery (and space for a spare battery), playback controls, and earbuds. You don’t load it with ebook content: The content and the player are a single purchased unit. That makes it a self-contained circulating library item if you can deal with earbud hygiene, and a particularly interesting possibility where ebooks are needed.

The devices were “fairly rugged,” although the LCD readout on one unit malfunctioned. Because the earbuds didn’t have comfort pads, they were easy to clean but somewhat uncomfortable. Playaways support bookmarks, but it’s apparently easy to erase them and reset the book to its beginning.

There weren’t enough feedback forms to be statistically significant and no such significance is claimed: This was a field test. That said, 50 of 55 responses rated the Playaway experience at least somewhat satisfactory; that’s an excellent result. The devices eliminate some of the overhead of digital audiobooks—no installation, no downloads. The task force concluded, “The overall response to this new type of pre-loaded self-contained digital audio book playback device was very positive.”

Which is not to say all was peaches and cream. The volume button is one way: Volume keeps going up until it hits maximum, then wraps around to minimum. Six responses found that problematic. The buttons (which don’t provide audible feedback) gave some people trouble. Behavior was sometimes inconsistent. Variable speed playback was popular—but it’s only faster or fastest and you have to reset the speed each time you start the unit. Replacing batteries proved difficult for some—and some users weren’t happy with sound quality (but found it much better when they switched to their own headphones).

Most respondents still prefer audiocassettes, but some are ready to shift to the Playaway (some prefer CD audiobooks, mostly for sound quality). As you’d expect based on other audiobook experience, most people prefer unabridged versions.

There aren’t many Playaway titles yet (45 as of early June 2006) and they’re pricey ($35 to $50), but they could still be useful—if the medium survives. (Thanks to the Mid-Illinois Talking Book Center for a clear and comprehensive report.)

Footnote: This commentary was delayed several months. Since then, a number of other libraries have reported informally on experience with Playaway—mostly favorable, a few unfavorable (usually related to battery life). Playaway audiobooks are purchased physical devices that combine content and carrier (like print books): First sale and fair use rights apply. The collection is growing: Playaway’s website shows 165 titles as of early November 2006.

Huwe, Terence K., “From librarian to digital communicator,” Online 30:5 (September/October 2006): 21-26.

A fascinating story of a specialized academic library becoming more integral to its parent organization. Huwe runs the library at UC Berkeley’s Institute of Industrial Relations and has used his librarian skills to solve problems for the institute as a whole—everything from building community with email to becoming a paper publisher. Huwe offers some cogent thoughts in answer to the natural response to his article, “But that’s not ‘library’ work.” Part of the long answer is, I think, worth hearing for most any library: “An academic library that sees itself as a passive repository is a library at risk.” Here’s the short answer:

Does a library exist to serve its user community? If so, then any and all work that serves those users—and advances the library’s role—is “library work.” Are we just a tad busy? Yes. Is it worth it? You bet.

Definitely worth reading.

Murray, Peter, “Defining ‘Service Oriented Architecture’ by analogy,” “Services in a Service Oriented Architecture,” “The dis-integration of the ILS into a SOA environment,” Disruptive technology library jester, posted September 18, 19, and 20, 2006.

I’m not going to summarize or comment on this trio of posts (which begin a continuing series on SOA). I’m going to recommend it—and, when you’re at the blog, look at the other posts in the “library SOA” category. Murray helped me understand what service-oriented architecture is all about and what it can or should mean for library systems. I found the second post particularly interesting, as it proceeds from how you could add local holdings directly into, to “is that what the user really wanted?” to possible layers of services. It’s enlightening—and if you’re trying to make the case for disintegrating the ILS, I think Murray’s methodology is more convincing than deriding OPACs or upholding Google as the model of all that’s good in searching. The series is most definitely part of a conversation; be sure to read the comments and linked posts.

Murray, Peter, “Just in time acquisitions versus just in case acquisitions,” Disruptive technology library jester, posted August 2, 2006.

What [i]f a service existed where the patrons selected an item they needed out of our library catalog and that item was delivered to the patron even when the library did not yet own the item? Would that be useful?

That’s the starting point for a thoughtful discussion of possible ways to make libraries more competitive with Amazon, assuming that such competition is a reasonable role. Murray isn’t explicitly advocating for such a system, but he’s trying to expose the factors that would be required to make it work. Briefly, he sees four factors:

• The local catalog would need to display records for items not yet held that could be acquired rapidly.

• You’d need “a highly automated process to get the requested book to the library.”

• Fast copy cataloging and shelf preparation would be essential, although some (or all) of that could take place after the initial circulation (indeed, as the first commenter suggests, the book might be shipped directly from the publisher/distributor to the patron, who would then return it to the library).

• The roles of librarians would change in somewhat disruptive ways.

The summary of how those roles would change is clear and provides a good starting point for discussion of feasibility and desirability. I’m not prepared to come down on either side of whether this is a desirable use of library resources or whether it’s desirable for libraries to compete with booksellers rather than complementing them: I don’t know. I’m a firm believer that some level of just-in-case acquisition is appropriate and perhaps vital for most libraries, but that’s another issue, and I don’t think Murray’s saying otherwise. I’ll close with Murray’s final paragraphs, after summarizing steps in the processing stream:

Can we do this as fast as it would take the patron to get the item directly from the online bookseller? Maybe not—we do have some necessary processing steps that a direct patron purchase does not have. Can we make that delay short enough so that the patron considers it acceptable as compared to the direct price premium of ordering it themselves?

Do we want to?

Sierra, Tito, “Snippets,” The horseless library, posted July 10, 2006.

“Text snippets are different from abstracts and summaries because they are algorithmically extracted from the source text, rather than editorially created to function as a summary or teaser.” Sierra suggests comparing news blurbs at Google News (snippets) with the New York Times online (teasers): Google News uses the first n words of the source article, while the Times blurb is a form of abstract.

Sierra discusses other ways to derive snippets—for example, the snippets within Google results and Google Book Search, and an experiment that prepared blurbs for New York Times pieces by combining the headline and the last paragraph of the article. Sierra’s brief discussion of other possibilities is worth thinking about. Sierra concludes:

Can you think of other methods for generating snippets? Are snippets evil?

I’d say “not inherently, but—as with calculated metadata as compared to cataloging—they’re not as good as summaries.” As for the first, consider this:

What’s a Playaway? The title describes it well: A little digital device looking like a miniature book, containing one digital audiobook, a battery (and space for a spare battery), playback controls, and earbuds. Once a library buys a Playaway, it owns it. Period.

That’s a four-sentence, 43-word summary of my 514-word commentary earlier in this section. It’s a snippet of sorts, prepared by Word’s AutoSummarize function set to 10%. It could be a whole lot worse!
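For readers who want to try the two simplest methods from Sierra's post, the first n words and the headline plus last paragraph, here is a rough Python sketch. The function names, the default of 25 words, and the blank-line paragraph rule are my assumptions; this is not how Google News, the experiment Sierra cites, or Word's AutoSummarize actually works:

```python
# Two naive snippet generators, sketched for illustration only.
# Function names and defaults are hypothetical.

def first_n_words(text, n=25):
    """Return the first n whitespace-separated words, marking truncation."""
    words = text.split()
    snippet = " ".join(words[:n])
    return snippet + ("..." if len(words) > n else "")

def headline_plus_last_paragraph(headline, body):
    """Combine a headline with the final paragraph of the body.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    if not paragraphs:
        return headline
    return f"{headline}: {paragraphs[-1]}"

article = (
    "The devices were fairly rugged.\n\n"
    "Most respondents still prefer audiocassettes."
)
print(first_n_words(article, n=4))
print(headline_plus_last_paragraph("Playaway field test", article))
```

Neither function looks at meaning, which is the difference between a snippet and an editorially written summary.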

Sullivan, Danny, “Hello natural language search, my old over-hyped search friend,” SearchEngineWatch, October 5, 2006.

Not directly library-related but worth reading if you’re a librarian waiting for natural language search to finally work and make a real difference. Sullivan notes a new (and so far unavailable) search engine, Powerset, that apparently claims to use natural language analysis to give better search results.

Sullivan’s been seeing similar claims for a decade or more. He notes how the hype works (when you have a working engine—Powerset’s still in “stealth mode”) and why it’s unrealistic, particularly given that most people use very short searches or very specific searches that don’t require much analysis. (“Memorial Hospital Modesto,” to take an example that mattered personally to me recently, doesn’t require much fancy analysis to yield a great result.) Sullivan also notes the kind of “conceptual expansion” that Clusty and Ask already do—and, to be sure, that does and RedLightGreen used to do, in a narrower context.

He offers a brief history of “natural language analysis” in search engines over the past decade, based only on the articles he’s written, including Excite, Electric Monk (!), FAST, BrainBoost, MeaningMaster, Stochasto, and Kozoru. All of which have set the world on fire and revolutionized web searching, right?

Sullivan notes that one of Powerset’s people is busily blogging about how Powerset’s going to change users’ “two or three word” habits and how, in five to ten years, we’ll look back at those bad old days when we did keyword searching. “If Powerset’s going to change those habits, good luck.” He’s not enthusiastic about the probability. Interesting, cogent discussion.

Cites & Insights: Crawford at Large, Volume 6, Number 14, Whole Issue 84, ISSN 1534-0937, a journal of libraries, policy, technology and media, is written and produced by Walt Crawford, a senior analyst at OCLC.

Cites & Insights is sponsored by YBP Library Services.

Opinions herein may not represent those of OCLC or YBP Library Services.

Cites & Insights: Crawford at Large is copyright © 2006 by Walt Crawford: Some rights reserved.

All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.