Tuesday, 5 May 2009

Why card-based records aren't good enough

Card catalogs have a long tradition in librarianship, dating back, I'm told, to the book stock-take during the French Revolution. Librarians understand card catalogs in a deep way that comes from generations of librarians having used them as a core professional tool throughout their working lives. They understand card catalogs in ways that I, as a computer scientist, never will. I still recall that on one of my first visits to a university library, I asked a librarian where I might find books by a particular author, and they found the works for me arguably as fast as I can now find them with the new whizzy electronic catalog.

It is natural, when faced with something new, to understand it in terms of what we already know. Unfortunately, understanding the new by analogy to the old can lead to the form of the old being assumed in the new. When libraries digitized their card catalogs in the 1970s and 1980s, the results were more or less exact digital versions of their card catalog predecessors: their content was limited to old data from the cards plus new data from cataloging processes (which were unchanged from the card catalog era), and librarians and users had come to equate a library catalog with a card catalog---it was what they expected.

MARC is a perfect example of this kind of thing. As a data format to directly replace a card catalog of printed books, it can hardly be faulted.

Unfortunately, digital metadata has capabilities undreamt of at the time of the French Revolution, and card catalogs and MARC do a poor job of handling these capabilities.

A whole range of people have criticized MARC over materials and methodologies that libraries did not routinely hold at the time of the French Revolution (digital journal subscriptions and music, for example), but since these postdate card catalogs, I view such criticism as unfair.

So what was held in libraries in 1789 that MARC struggles with? Here's a list:
  • Systematically linking discussion of particular works with instances of those works
  • Systematically linking discussion of particular instances with those instances ("Was person X the transcriber of manuscript Y?")
  • Handling ambiguity ("This play may have been written by Shakespeare. It might also have been a later forgery by Francis Bacon, Christopher Marlowe or Edward de Vere")
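To make the third point concrete, here is a minimal sketch (in Python, with an invented record shape---none of these field names come from MARC or any real standard) of recording attribution as a set of hedged assertions rather than a single mandatory author field:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribution:
    """One hedged claim about who may have created a work."""
    agent: str       # candidate creator
    status: str      # e.g. "possible" or "disputed"
    note: str = ""   # free-text justification for the claim

# A work whose authorship is genuinely uncertain: instead of forcing a
# single author (as a card would), we keep every live hypothesis.
play_attributions = [
    Attribution("William Shakespeare", "possible"),
    Attribution("Francis Bacon", "disputed", "later forgery hypothesis"),
    Attribution("Christopher Marlowe", "disputed", "later forgery hypothesis"),
    Attribution("Edward de Vere", "disputed", "later forgery hypothesis"),
]

candidates = [a.agent for a in play_attributions]
```

The point is not this particular data structure, but that ambiguity becomes data you can query rather than a note scrawled in a corner of the card.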

All of these relate to core questions that have been studied in libraries for centuries. They are well-understood issues, and they changed little in the hundred years before the invention of the computer (which is when all the usually-cited issues with MARC began).

The real question is why we still expect an approach that didn't solve these problems two hundred years ago to solve them now. Computers are not magic in this area; they just seem to be helping us do the wrong things faster, more reliably, and for larger collections.

We need a new approach to bibliographic metadata, one which is not ontologically bound to little slips of paper. There are a whole range of alternatives out there (including a bevy of RDF vocabularies), but I've yet to run into one which both allows clear representation of existing data (because let's face it, I'm not going to re-enter WorldCat, and neither are you, not in our lifetimes) and admits non-card-based metadata as first-class elements.
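One way to see what first-class links buy you is a toy triple store in plain Python. The identifiers and predicate names below are invented for this sketch and follow no particular RDF vocabulary; the point is only that links between works, instances, and commentary become directly queryable:

```python
# A toy triple store: just enough to express the work/instance/commentary
# links from the list above, which card-shaped records cannot represent.
triples = {
    ("hamlet-work", "hasInstance", "hamlet-first-folio"),
    ("hamlet-work", "hasInstance", "hamlet-second-quarto"),
    ("essay-123", "discusses", "hamlet-work"),        # about the work itself
    ("note-456", "discusses", "hamlet-first-folio"),  # about one instance
}

def objects(subject, predicate):
    """All o such that (subject, predicate, o) is asserted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

def discussed_by(target):
    """Everything asserted to discuss a given work or instance."""
    return sorted(s for s, p, o in triples if p == "discusses" and o == target)
```

With this shape, "what has been written about this particular copy?" and "what has been written about the work in general?" are distinct, answerable queries---exactly the distinction the card catalog flattens.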


Friday, 1 May 2009

LoC gets semantic

This morning, the Library of Congress launched http://id.loc.gov/authorities/, its first serious entry into the semantic web.

The site makes the Library of Congress Subject Headings available as dereferenceable URLs; for example, http://id.loc.gov/authorities/sh90005545.
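Dereferenceable URIs like this typically serve either a human-readable page or machine-readable RDF depending on the request's Accept header. A sketch of asking for RDF with Python's standard library (whether the server honours this particular media type is up to its content-negotiation support, not guaranteed by this sketch):

```python
from urllib.request import Request, urlopen

# Build a request that asks for a machine-readable (RDF/XML) representation
# of the subject heading, rather than the HTML page a browser would get.
uri = "http://id.loc.gov/authorities/sh90005545"
req = Request(uri, headers={"Accept": "application/rdf+xml"})

# Uncomment to actually dereference (requires network access):
# with urlopen(req) as resp:
#     rdf_xml = resp.read()
```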