Cites & Insights: Crawford at Large
ISSN 1534-0937
Libraries · Policy · Technology · Media

Selection from Cites & Insights 4, Number 6: May 2004

The Library Stuff

Walt Crawford

A few clusters of library stuff this time around plus some miscellany. Assume these are all recommended, sometimes with caveats.

OpenURL

Bekaert, Jeroen, and Lyudmila Balakireva, Patrick Hochstenbach, and Herbert Van de Sompel, “Using MPEG-21 DIP and NISO OpenURL for the dynamic dissemination of complex digital objects in the Los Alamos National Laboratory Digital Library,” D-Lib Magazine 10:2 (February 2004). www.dlib.org

This 20-page article could equally well be grouped under “digital repositories” or “digital archives,” or for that matter “open access” or “digital publishing.” It is, to some extent, a “how we do it good” article—but one with broader implications. LANL operates a huge collection of digital resources (five terabytes as of the article), including licensed resources and locally generated material. There’s not one central repository; instead, LANL has “a multitude of autonomous OAI-PMH repositories,” with a central Repository Index. Features of the repository system that appear noteworthy include use of the MPEG-21 Digital Item Declaration Language to represent complex objects and MPEG-21 Digital Item Processing to execute services, but also fairly sophisticated use of NISO Z39.88, OpenURL 1.0, to handle a variety of information and processing requests. OpenURL 1.0 supports a much wider variety of possible uses than the original OpenURL (“OpenURL 0.1”)—one reason the Z39.88 documents are so formidable.

This is one in a series of papers describing LANL’s system and its implications. My guess is that most others will also appear in D-Lib. If you’re interested in complex digital resource collections, they’re worth tracking. Note that Van de Sompel was largely responsible for creating OpenURL, then SFX, when he was at the University of Ghent.

McDonald, John, and Eric F. Van de Velde, “The lure of linking,” Library Journal (April 1, 2004).

This brief article (three single-spaced pages with sidebars on a fourth page) discusses the first production implementation of OpenURL (as a commercially available service) in a U.S. academic library, at the California Institute of Technology (Caltech). Caltech went live in April 2001 and its students took to it rapidly. The article offers lists of currently available OpenURL resolvers and comments from those who are building their own or adding new services. A good quick read that encourages wider use of OpenURL within libraries—and, in the new broader standard, possibly outside as well.

Samuels, Harry E., “OpenURL: A tutorial,” Endeavor Information Systems.

This one-sheet PDF offers a quick introduction to OpenURLs (with the LinkFinderPlus slant you’d expect), using the technique of going through an OpenURL piece by piece. I would take issue with the implications of one comment—that is, that ISSN, volume, issue, date, and starting page is “usually all of the metadata that LinkFinderPlus or any link resolver needs to link to a full-text article.” That’s literally true—if the full-text article is in an aggregation that’s directly addressable by ISSN, volume, issue, and starting page. That’s not always the case and failing to provide the lead author, article title, and journal title handicaps the resolver and the user. It’s a minor objection, since the next example adds article title and journal title (but omits author).
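To make the piece-by-piece approach concrete, here is a minimal sketch that splits two OpenURL 0.1-style links into their parts and reports which descriptive fields are absent. The resolver address, the ISSN, and the metadata values are invented for illustration, and the helper names `describe` and `missing_descriptive` are hypothetical; only the key names (issn, volume, issue, spage, date, atitle, title, aulast) follow the OpenURL 0.1 convention.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical resolver address and made-up metadata values.
minimal = ("http://resolver.example.org/menu?genre=article"
           "&issn=1234-5679&volume=12&issue=3&spage=45&date=2004")
# The same citation with article title, journal title, and lead author added.
fuller = minimal + "&atitle=An+example+article&title=Journal+of+Examples&aulast=Doe"

# Fields a resolver (and a user) can fall back on when ISSN-based linking fails.
DESCRIPTIVE_KEYS = ("aulast", "atitle", "title")


def describe(openurl):
    """Split an OpenURL into its resolver base and a flat dict of metadata."""
    parts = urlsplit(openurl)
    fields = {key: values[0] for key, values in parse_qs(parts.query).items()}
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    return base, fields


def missing_descriptive(fields):
    """List the descriptive keys that the OpenURL does not carry."""
    return [key for key in DESCRIPTIVE_KEYS if key not in fields]


for label, url in (("minimal", minimal), ("fuller", fuller)):
    base, fields = describe(url)
    print(label, base, sorted(fields), "missing:", missing_descriptive(fields))
```

The minimal link is enough when the target is addressable by ISSN, volume, issue, and starting page; the list of missing keys shows what a resolver would have nothing to work with otherwise.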

Sutherland, Alison, and Peter Green, “An OpenURL resolver (SFX) in action: The answer to a librarian’s prayer or a burden for technical services?”

This article comes from a presentation at a recent VALA (Victorian Association for Library Automation) conference and should be readily findable. Both authors are at Curtin University of Technology (Australia). They discuss Curtin’s implementation of the SFX OpenURL resolver, OpenURL itself, staffing effects, advantages and disadvantages, and some ongoing issues.

I won’t say the presentation is without issues, as almost any OpenURL presentation is likely to be (fair warning to OSU librarians: Don’t expect perfection from me in late May!). For example, an excellent graphic showing a CSA article search result, the SFX window, and the resulting full-text window, is confusing when compared to the claimed text of the OpenURL and comments on what SFX requires:

• The claimed OpenURL text includes only the journal title, date, volume, issue, starting page, and (surprisingly) character set—but the SFX window shows the article title as well. While SFX could generate the (also shown) ISSN from the journal title, it can’t possibly generate the article title at that point. I would expect that CSA’s OpenURL also includes “atitle” (article title), and probably also ISSN and author fields.

• Two paragraphs down is this sentence: “The minimum requirement by the SFX resolver for generating a link is the presence of an ISSN and Year when a threshold is specified in the KB or only an ISSN when there is no specified threshold.” Without getting into “threshold” (described a little later), this suggests that SFX won’t handle an OpenURL that lacks an ISSN, which is (fortunately) not the case and would invalidate the example OpenURL.

The article also mentions Z39.88-2003 as being released in April 2003, but while the draft NISO Z39.88 (OpenURL 1.0) was indeed released for a trial period at that point, I don’t believe it carried the “2003” suffix and it was most decidedly not a NISO standard at that point. The version of Z39.88 recently balloted, the first version to reach ballot stage (in February 2004), includes small but significant changes from the April 2003 version. (Trust me on this one. I specified RLG’s trial implementation of OpenURL 1.0, the only OpenURL 1.0 source implementation to report completed interoperability testing protocols with resolvers during the several-month trial period. I’m now specifying changes to that implementation to bring it in line with the final balloted standard.)

Those are minor points. What makes this discussion particularly worthwhile is the commentary on real issues in making OpenURL work within an institution. For example, Curtin doesn’t typically subscribe to all of the journals in a collection, so they can’t activate all those full-text targets with one click: They must consider date ranges and availability in each journal. Curtin also does a fair amount of semi-random testing, informing Ex Libris of problems encountered along the way. The article discusses some reasons for “dead links,” OpenURLs that appear to promise full text but don’t yield it.

Do not think that SFX will save you work. The beauty of SFX is that it provides seamless access to information for the client. But SFX is only ever going to be as good as the attention you give it.

All in all, a fascinating look into real-world OpenURL use from the library’s perspective—something that’s hard to get from those of us who write either from an OpenURL vendor’s perspective or, as in my case, from that of an OpenURL source.

Google Matters

These aren’t all library-related items and they’re not all articles. But then, Google searches turn up things you weren’t expecting, for all the usual reasons and some that might not be apparent.

Edward W. Felten talked about “Googlocracy” in a February 3, 2004 posting at Freedom to Tinker. He noted the “conventional wisdom” that Google is becoming less useful because people are manipulating its rankings. He thinks the wisdom is wrong. “It ignores the most important fact about how Google works: Google is a voting scheme. Google is not a mysterious Oracle of Truth but a numerical scheme for aggregating the preferences expressed by web authors. It’s a form of democracy—call it Googlocracy.” He goes on about this for a bit, ending: “Like democracy, Googlocracy won’t always get the very best answer. Perfection is far too much to ask. Realistically, all we can hope for is that Googlocracy gets a pretty good answer, almost always. By that standard, it succeeds. Googlocracy is the worst form of page ranking, except for all of the others that have been tried.” I see what Felten is saying—but my grumps that Google isn’t as useful as it used to be are not based on the idea that people are trying to manipulate its rankings. Rather, I believe the sheer effect of weblogs (and blogrolls) and Google’s own changes in its ranking systems have tended to confuse Google results. That’s purely a personal observation and I could be entirely wrong.

Gary Price had a February 17 ResourceShelf comment about a Washington Post article on “life in the age of Google,” an article that quotes him (in “a 45 minute conversation boiled down to a few words”). “I wish journalists would stop making it an either libraries OR Google thing. Having a variety of resources and using the right one at the right time are what matters most.” He notes some of the problems with the article and, as usual, some of the reasons that libraries can’t be replaced by Google—and that Peter Lyman’s quote about Google winning the “war” was, well, simplistic at best.

Beehner, Lionel, “Lies, damned lies, and Google,” Mediabistro, downloaded February 18, 2004.

This brief piece discusses the tendency of (lazy) writers to prove points by using Google results, and the essential fact that such result sizes are inherently meaningless. “What’s a simpler, or faster, way of quantifying a trend than typing a key word or phrase into Google? Type in almost any person, place, or thing, and Google will bounce back to you a neat numerical value that calculates that person, place, or thing’s importance to this world.” He cites a February 2 New Yorker article in which a TV critic demonstrates that the female body is more interesting than the male because “naked men” yields around 600,000 Google results while “naked women” yields more than a million.

Worse examples follow. An LA Times story says that Frank Deford is a “distinguished writer” and uses as evidence: “A Google search of his name produces more than 21,000 hits.” (Hmm. That makes me either six times as famous as Deford or one-quarter as famous, depending on whether the Times person used quotes. Either result is meaningless.)

The point of this piece is that Google result counts are not very accurate ways to gauge the popularity of a person or an idea. I tend to agree with the conclusion: “Plugging Google in a story has become almost a telltale sign of sloppy reporting, a hack’s version of a Rolodex.” He also notes that one of two New Yorker writers mentioned appears to be twice as popular as the other—partly because the one shares a name with a prolific porn star.

Seth Finkelstein noted the article and some errors within it. He notes the lack of any indication as to whether searches mentioned were surrounded by quotes. Without them, “many of the number reported are utterly and completely meaningless. They don’t even do the silly measure of the phrase the journalist thinks they measure.” His example: the words ‘hot’ and ‘dog’ keyed as a two-word Google search would yield pages about hot days on which dogs are unhappy, where “hot dog” is at least more likely to yield frankfurter-related stories (or stories about surfing, or…). He verifies that the Mediabistro article gets it wrong in at least one case, when it passes on a report that the phrase “permanent resident cards CA” yields 92,200 sites on the subject:

NO. The phrases return zero or a few hits. The words return that many hits, but having a lot of pages with the four words “permanent” “resident” “cards” “CA” somewhere on them is not “staggering.”

Sigh, Flash—journalists write nonsense. Not news at 11.
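The difference between a quoted phrase and a handful of unquoted words is easy to show with a toy matcher. This is only a sketch: the two sample "pages" are invented, the function names are hypothetical, and real search engines do far more than this, but the counting logic is the point.

```python
import re


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_all_words(query, page):
    """Unquoted search: every query word appears somewhere on the page."""
    page_words = set(tokens(page))
    return all(word in page_words for word in tokens(query))


def matches_phrase(query, page):
    """Quoted search: the query words appear adjacent and in order."""
    q, p = tokens(query), tokens(page)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


# Invented sample pages: one about weather and pets, one about frankfurters.
pages = [
    "On a hot day the dog stayed in the shade.",
    "The stand sells a hot dog with mustard.",
]

word_hits = sum(matches_all_words("hot dog", page) for page in pages)
phrase_hits = sum(matches_phrase("hot dog", page) for page in pages)
print("unquoted hits:", word_hits, "quoted hits:", phrase_hits)
```

Both pages count as hits for the unquoted words, but only the second contains the phrase, which is why a reported count means little unless the writer says which kind of search produced it.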

Bell, Steven J., “The infodiet: How libraries can offer an appetizing alternative to Google,” Chronicle of Higher Education 50:24 (February 20, 2004).

This is a good brief commentary on ways that libraries can respond to the “competition” from Google, both by explaining what they have and (in some cases) offering simple search interfaces and offering clear added value. While recommending the piece, I take issue with one of Bell’s definitions: He refers to commercially produced databases as “aggregators.” While there are full-text aggregations, databases are not inherently aggregations just as databases aren’t all full text (and shouldn’t be).

Marylaine Block’s Ex Libris

Block, Marylaine, “How about ignorance management?” Ex Libris 203.

I could do without another claim that Amazon’s finding system to sell certain books is so much better than library catalogs, but the idea of managing internal ignorance is interesting. Her suggestions for libraries include professional training, or at least multiple subscriptions to the professional journals (so staff don’t have to wait six months to see them), mentoring, training during staff meetings, internal weblogs, and cataloging local expertise. “Above all, I think that continuously developing professional competence should be part of the job description.” Well worth thinking about, as usual.

Block, Marylaine, “In need of a better business model,” Ex Libris 207.

Block argues that “the information place” doesn’t work very well as the business model for today’s libraries. Since I’ve been making the same argument in speeches for more than a decade and devoted a couple of pages of Being Analog to refuting that slogan (pp. 75+), it’s hard to disagree. “The problem with the information place model is that most people are convinced they don’t need libraries for information now that they have the internet.” I think it goes further than that. Libraries were never the place that people met their primary information needs—but I’m repeating myself. Go read the book.

Block offers several alternatives: The community place, the self-improvement place, the idea place, a culture place, an education place, a readers’ place, and the kids’ place. All interesting suggestions, although “the” always makes me nervous. Maybe libraries continue to be complex organisms with complex services, which is a tougher story to tell.

Block, Marylaine, “Natural partners,” Ex Libris 208. (marylaine.com/exlibris/xlibnnn.html, where “nnn” is the issue.)

“What do libraries share with museums, historical societies, schools, colleges, orchestras, and arts organizations? Well, yes, they are underfunded public agencies, true enough. But they are also the cultural infrastructure of the community they serve.” Block goes on to suggest various ways that these “logical allies” can work together. Many libraries and museums already cooperate, almost always to mutual benefit. For that matter, the current motto of my employer (RLG) is “Where museums, libraries, and archives intersect,” a reasonable statement given the nature of its membership. Broadening coalitions to other cultural organizations makes good sense; her suggestions offer good starting points.

Other Items

“A dozen primers on standards,” Computers in Libraries 24:2 (February 2004). (www.infotoday.com/cilmag/feb04/primers.shtml)

An interesting assemblage of brief “primers” on items such as ARK (Archival Resource Key), METS (Metadata Encoding & Transmission Standard), OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), and Shibboleth (not an acronym). In most cases, the primers are prepared by people involved in development of the “standard” (these aren’t all true standards) or reasonably expert users. The only thing that gives me pause is careful reading of one primer in an area where I have personal expertise, OpenURL. The essay includes one misleading implication (that OpenURL always uses HTTP GET, when in fact HTTP POST is both valid and preferable for secure transmission of long metadata strings) and one somewhat questionable note (the suggestion that the draft OpenURL standard is “in use by many information providers and library software vendors,” when what’s in use is almost always a precursor version). Those are minor issues in a generally good commentary, and I’ll assume that other primers are equally good with minor problems.
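For readers who want to see the GET/POST distinction rather than take it on faith, here is a small sketch that builds (but does not send) the same metadata both ways using Python's standard library. The resolver address and the metadata values are invented for illustration; the point is only where the metadata ends up.

```python
from urllib.parse import urlencode
from urllib.request import Request

RESOLVER = "http://resolver.example.org/menu"  # hypothetical resolver address

# Made-up metadata; the repeated title stands in for a long metadata string.
metadata = {
    "genre": "article",
    "issn": "1234-5679",
    "date": "2004",
    "atitle": "A deliberately long article title " * 5,
}
encoded = urlencode(metadata)

# GET: the metadata travels in the URL itself, so the URL grows with it.
get_request = Request(f"{RESOLVER}?{encoded}")

# POST: the URL stays short and the metadata goes in the request body.
post_request = Request(
    RESOLVER,
    data=encoded.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

for request in (get_request, post_request):
    print(request.get_method(), len(request.full_url), "characters in URL")
```

Nothing here contacts a server; the requests are only constructed so their method and URL length can be compared.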

Etches-Johnson, Amanda, “Look Mom, I got my name in print! Lessons learned by a publishing neophyte,” Liscareer.com, March 2004.

I have trouble thinking of Etches-Johnson as unpublished, given the quality and extent of her web writing. But she says her first formal publication will be a chapter in a book coming out this fall. Meanwhile, she offers some lessons. Or, as she says, “a few of the things I’m glad I did, but mostly the things I should have done.”

Her six lessons, without her excellent commentary: 1. Talk about yourself. 2. If you want to write, you’d better write. 3. It’s easier to write about something you’re interested in than to pretend to be interested in something you’re writing about. 4. Know thy editors and talk to them. 5. Don’t sweat the grammar. 6. Two heads are better than one and three heads are better than two.

While these are all excellent suggestions (when amplified and refined with her comments, in particular #5), I’ll point particularly to #2 and #3. Maybe you should read #3 and her comments three or four times. I’ve read too many articles whose writers seem to have no interest in their own topics. I’ve probably written one or two, but I surely hope to avoid such exercises in futility in the future. Maybe there are “those gifted writers who can make any topic sound engaging.” Etches-Johnson admits that she’s not one of them. Neither am I, and I have certainly read enough writers who aren’t. More’s the pity. (In case this commentary is too vague, I highly recommend reading this article.)

Frost, William J., “Do we want or need metasearching?” Library Journal, April 1, 2004.

The item itself is a “Backtalk” commentary—less than 1.5 pages as printed, with a contrarian view of metasearching (or federated search or distributed search). Frost argues that students need to learn to select research tools, and that selecting databases is part of that process. He questions the acceptability of “good enough” results, notes that metasearching is costly, and doubts that it’s as good a use of time and money as adding more content and doing good bibliographic instruction. I wouldn’t have cited it; as a brief opinion piece, it doesn’t offer enough detail or background to deserve additional study.

Then there are the Web4Lib responses—at least 20 of them (to Frost’s posting linking to his commentary and to each other) over five days. I was impressed by the level of the discussion (lots of thought, no flaming) and the expertise of the participants (including Roy Tennant, Thomas Dowling, Karen Coyle and Karen Schneider, Eric Lease Morgan and others).

A quick summary of the stream, with few direct quotes and without contributors’ names (the ideas stand on their own), may suggest things to think further about—both regarding the concept of metasearching and the systems out there. I’m not offering my opinion here, partly because I still don’t know enough, partly because database interfaces are a big part of my day job.

The first post asserted that Frost had “some unfortunate and incorrect assumptions” and confused metasearch possibilities and current implementations. It noted that one-stop searching with intermixed results is not the only approach, that we’re at an early stage of metasearch, that metasearch is not intended to replace direct access, and that we shouldn’t dismiss metasearch possibilities just yet. One interesting commentary was on Frost’s push for bibliographic instruction as a solution: “I wish you great success in this endeavor, but in the end you must surely realize it is futile. At least it is at my university.” This poster noted the need to put “the intelligence of a reference librarian into our systems” and to build user-oriented systems. The post ended by welcoming a “ripping good debate,” and made a good start to such a debate.

The next response, noting that Frost had commended general-purpose databases as serving most undergrad needs, commented that (some) general-purpose databases are themselves effectively metasearch systems, offering the lowest common denominator searchability to their component parts—and, further, that many vendors have offered combined database searching for years.

Another post asserted that library services are no different than other web services and that libraries really only serve expert users. Another participant argued that “We need to accept the fact that most students could care less about retrieving high quality research” and that metasearching tools were improving. Another questioned one poster’s attempt to generalize from Google user experience to research user experience. “Whatever the merits of metasearching, if we implicitly sell it to our users as being Google for Library Databases, they’ll expect it to be just that and will justifiably char-broil us when they find that it isn’t”—which it almost surely won’t be. (Off-list, I responded with my only comment: Effective relevance ranking across disparate databases including large results is pretty nearly impossible, unless every database uses the same relevance ranking methodology—which won’t happen.) Another poster believed it could be “Google for Library Databases” a bit down the road, using the Google engine. (A later response suggested that Googling existing databases was unlikely if only because information aggregators almost certainly won’t allow their content to be spidered.)

The next post discussed techniques that could be used to make search interfaces “smarter,” such as “did you mean” possibilities, suggesting alternative search strategies (which some of us already do), and so on. After that, one librarian discussed their current experience with metasearch, noting the enormous potential to give the target audience simple access to high-quality information but also noting the extent to which it isn’t Google: “We are happy to have displayed responses from at least one data source in 5-30 seconds,” as opposed to Google’s typical 3-4 second response time. That comment drew a sigh: “If only we had Google’s resources!... We have to be realistic about what we can do with our resources. Although Google’s technology may look like magic, it’s based on a huge amount of computing power.” (That’s true: Google has thousands of homebrew PC-class Unix-based computers and software that can take advantage of massive parallelism; Google is also a much bigger business than any library systems vendor.)

A delayed response to “why not Google?” noted that Google runs against full text, where most library databases have only metadata. No one has shown that the same (ranking) techniques will work well for both. And, of course, databases don’t have intersite links. The question, then, is what is it about Google that people like, and how much of it can be done in libraries?

By now (the second day of discussion), the thread had taken on the leapfrog effect that you get with large lists: Post 15 is more likely to respond to Post 11 or 12 than to Post 14, because (even without moderation) there are propagation delays and it takes time to compose a thoughtful response—and “thoughtful” was the order of the day. The next post noted response time as a major design problem, particularly because some usability testing shows that undergrads won’t wait 20 seconds for results—and that, on the database side, metasearching yields a lot more searches against each database. This post suggested OAI harvesting as a solution, building a local combined index—essentially doing a Google and substituting a form of crawling for multi-database searching. A response noted that database owners would argue that this gave away their assets as well as obliterating branding.

Another response agreed that such harvesting would be ideal, but didn’t think it was likely, if only for branding reasons. This respondent noted that their institution has set a four-second timeout for each resource, but caches previous results so temporary slowness doesn’t knock out a resource. (That raises interesting questions about heterogeneity of searches: How often does the same search come from more than one user within two days? Are “common searches” really “common” in university libraries? My own sample log analysis would suggest otherwise, at least for our databases.)
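The timeout-plus-cache arrangement that respondent describes can be sketched in a few lines. Everything here is an assumption made for illustration: the two fake sources, the function and variable names, and the shortened 0.2-second limit standing in for the four-second one. Nothing is known about how that institution actually built its system.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT_SECONDS = 0.2  # stand-in for the four-second limit, shortened for the demo
cache = {}  # (source name, query) -> most recent successful results


def fast_source(query):
    """Hypothetical database that answers immediately."""
    return [f"fast hit for {query}"]


def slow_source(query):
    """Hypothetical database that is temporarily slow."""
    time.sleep(0.5)
    return [f"slow hit for {query}"]


SOURCES = {"fast": fast_source, "slow": slow_source}


def metasearch(query, sources=SOURCES, timeout=TIMEOUT_SECONDS):
    """Query all sources in parallel; fall back to cached results on timeout."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=len(sources))
    futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
    # All searches start together, so one shared deadline gives each source
    # the same time allowance.
    deadline = time.monotonic() + timeout
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
            cache[(name, query)] = results[name]
        except TimeoutError:
            # Source too slow this time: reuse earlier results if any exist.
            results[name] = cache.get((name, query), [])
    pool.shutdown(wait=False)  # do not make the user wait for stragglers
    return results
```

Because the cache is keyed on the exact query, it only helps when the same search recurs, which is precisely the question raised in the parenthetical above.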

With regard to branding, one library firebrand (sorry, couldn’t resist that) found it ironic that database vendors complained about diminished branding while making it difficult for libraries to put their own brand on those paid-for resources. A partial response raised a library/consortium branding issue for metasearching—and seconded the first person’s complaint about difficulties in cobranding a database, at least as I read it. (Doesn’t most cobranding occur at the library menu level?)

A subthread suggested that library catalogs would work better if they searched the full text of the books. A response questioned that assertion (as I would and have), noting that full text searches don’t work well for abstract topics or topics composed of common words—and that full-text searching in a large collection is problematic. If studies suggesting that people tend to approach a search using fairly broad terms are correct (as seems likely), then problems with full-text searching are even greater. The first person suggested that a search should cast as wide a net as possible, with retrieval mechanisms doing the appropriate ranking—to which a third party said this “emphasize[d] the parts that are hardest in a metasearch environment.” It’s easier to go find lots of stuff than it is to determine which of that stuff is relevant for a search. “How far can we develop our ability to filter, rank, sort and so on in an environment where those capabilities have to be implemented as third-party services bolted on to a bunch of native interfaces?”

That’s as far as I captured the stream, and I think it dwindled after that. If you’re a Web4Lib member, you might consider going to the “Metasearching” thread in a month or two and reviewing what was said (the posts begin on April 1, 2004). Do you agree? Do you have more information that would shed more light? Would this discussion improve real-time discussions at appropriate LITA interest groups and in other venues? I found it fascinating and enlightening, and was sorely tempted to participate more than I did. Think of this citation as a recommendation to review the thread as a thoughtful multipart conversation on real-world issues in library metasearch development and implementation.

Cites & Insights: Crawford at Large, Volume 4, Number 6, Whole Issue 49, ISSN 1534-0937, is written and produced at least monthly by Walt Crawford, a senior analyst at RLG. Opinions herein do not reflect those of RLG. Comments should be sent to wcc@notes.rlg.org. Cites & Insights: Crawford at Large is copyright © 2004 by Walt Crawford: Some rights reserved.

All original material in this work is licensed under the Creative Commons Attribution-NonCommercial License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/1.0 or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

URL: citesandinsights.info/civ4i6.pdf
