Wednesday, 4 February 2009

NDHA demo and the National Library

This morning I went to the NDHA demonstration, where a National Library techie talked us through the NDHA ingest tools. The tools are the most visible piece of the NDHA infrastructure and are designed to unify the ingest of digital documents, whether they are born-digital documents physically submitted (i.e. producers mail in CDs/DVDs etc.); born-digital documents electronically submitted (i.e. producers upload content via whizzy web tools); or digital scans of current holdings produced as part of the ongoing digitisation efforts. The tools form a unified system with different workflows for unpublished material (=archive) and published material (=library). The unification of library and archival functionality seemed like fertile ground for miscommunication.

The infrastructure is (correctly) embedded across the library, and uses all the current tools for collection maintenance, searching and access.

As a whole it looks like the system is going to save time and money for large content producers and capture better metadata for small donors of content, which is great. By moving the capture of metadata closer to the source (while still allowing professional curatorial processes for selection and cataloguing), it looks like more context is going to be captured, which is fabulous.

A couple of things struck me as odd:

  1. The first feedback to the producer/uploader is via a human. Despite the elaborate validation suite, the user wasn't told immediately "that .doc file looks like a PDF, would you like to try again?" or "Yep, that *.xml file is valid and conforming XML". Time and again, studies have shown that immediate feedback, which lets people correct their mistakes on the spot, is important and effective.
  2. The range of metadata fields available for tagging content was very limited. For example there was no Iwi/Hapu list, no Maori Subject Headings, no Gazetteer of Official Geographic Names. When I asked about these I was told "that's the CMS's role" (=Collection Management Software, i.e. those would be added by professional cataloguers later), but if you're going to move the metadata collection as close as possible to content generation, it makes sense to at least have the option of proper authority control over it.
Or that's my take, anyway. Maybe I'm missing something.
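
To make the first point concrete, here is a minimal sketch of the kind of instant check I had in mind. It is not how the NDHA tools work (I haven't seen their code); the function names, the messages and the tiny table of file signatures are all my own assumptions, and it only tests XML for well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET

# Leading "magic" bytes for a couple of common formats.
# Illustrative only: a real validation suite would know far more formats.
MAGIC = {
    "pdf": b"%PDF-",
    "doc": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",  # OLE2 container used by legacy .doc
}


def sniff_format(data: bytes):
    """Return the format whose signature the data starts with, or None."""
    for name, magic in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def instant_feedback(filename: str, data: bytes) -> str:
    """Hypothetical helper: build a message to show the uploader straight away."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext == "xml":
        try:
            ET.fromstring(data)  # well-formedness check only
        except ET.ParseError as err:
            return f"{filename} is not well-formed XML ({err}); would you like to try again?"
        return f"Yep, {filename} is well-formed XML."
    detected = sniff_format(data)
    if detected and ext in MAGIC and detected != ext:
        return f"That .{ext} file looks like a {detected.upper()}; would you like to try again?"
    return f"{filename} accepted for curatorial review."


print(instant_feedback("report.doc", b"%PDF-1.4 rest of file"))
```

The point is only that the cheap checks can run at upload time, with the human review still happening afterwards.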

Monday, 2 February 2009

Report from the NDHA's International Perspectives on Digital Preservation

NOTE: I'm a computer scientist by training and this was largely a librarian/archivist gig, so it's entirely possible I've got the wrong end of the stick on one or more points in the summary below. It's also my own summary, and not the position of my employer, even though I was on work time during the event.

The NDHA is about to announce that the NDHA project has been completed on time and under budget. This is particularly pleasing in light of the long history of government IT failures over the last 30 years, and a tribute to all concerned. Indeed, when I was taking undergraduate courses in software engineering, a contemporary national library project was used as a textbook example of how not to run a software development undertaking. It's good to see how far they've come.

The event itself was a one-day event in the national library auditorium, with a handful of overseas speakers. I'm not entirely certain that a handful of foreigners counts as "international," but maybe that's just me being a snob. Certainly there was a fine turn-out of locals, including many from the National Library, the Ministry of Culture and Heritage and from VUW, including a number of students, who couldn't possibly have been there for the free food.

There seemed to be an underlying tension between librarianship and archivistship running through the event. Personally, I see this as a really crazy turf war, since the chances of libraries and archives existing as separate entities and disciplines in fifty years seem pretty slim. The separation between the two, the "uniqueness" of objects in an archive, seems to me to be obliterated by the free duplication of digital objects. I've heard people say that archives also work with access controls and embargoes for their depositors, but then so can libraries, particularly those in the military and those working with classified documents.

It seemed to me that the word "reliability" was used in a confusing number of different ways by different people. Without naming the guilty parties:
  1. reliability as the truthfulness of the documents in the library/archive. This is the old problem of ingestors having to determine the absolute veracity of documents
  2. reliability as getting the same metadata every time. This seems odd to me, since systems with audit control give _different_ results every time, because information on the previous accesses is included in the metadata of subsequent accesses
  3. reliability as the degree to which the system conformed to a standard/specification
On reflection this may have been a symptom of the different vocabulary used by librarians and archivists. Whatever the cause, if we're wanting to spend public money, we have to be able to explain to the public what we're doing, and this isn't helping.

The organisers told us the presentations would be up by tonight (the evening of the presentation), but you won't find them on google if you go looking, because they tell google to please f**k off. I guess this is what someone was referring to when they said we had to work to make content accessible to google. The link is and most were up at the time of writing.
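
For anyone wondering what "telling google to go away" looks like in practice, here is a small sketch using Python's standard robots.txt parser. I'm assuming a blanket disallow rule and a made-up example.org URL; I haven't reproduced the actual robots.txt from the site in question.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents: a blanket "keep out" for every crawler.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URL, purely for illustration.
url = "http://example.org/presentations/talk.pdf"
blocked = not parser.can_fetch("Googlebot", url)
print("Googlebot blocked:", blocked)
```

A well-behaved crawler that reads a file like this will simply never index the presentations, however good they are.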

I was hugely encouraged by the number of pieces of software that seemed to be being open sourced. I see this as a much better economic model than paying vendors for custom software, particularly since it's potentially scalable from the national and top-tier libraries/archives/museums out to the second- and third-tier libraries/archives/museums, which by dint of their much larger numbers actually serve the most users and have the most content. It was unfortunate that the National Library hasn't looked beyond proprietary software for non-specialist software but continues to use Adobe Photoshop / Microsoft Windows, which are available only for limited periods of time on certain platforms (which will inevitably become obsolete), rather than OpenOffice, GIMP, etc., which are cross-platform and licensed under perpetual licences that include the right to port the software from one platform to another. I guess Photoshop / Windows is what their clients and funders know and use.

With a number of participants I had conversations about preservation. Andrew Wilson in his presentation used the quote:

“traditionally, preserving things meant keeping them unchanged; however our digital environment has fundamentally changed our concept of preservation requirements. If we hold on to digital information without modifications, accessing the information will become increasingly difficult, if not impossible” Su-Shing Chen, “The Paradox of Digital Preservation”, Computer, March 2001, 24-28

If you think about what intellectual objects we have from the Greeks (which is where we Westerners traditionally trace our intellectual history from), the majority fall into two main classes: (a) art works, which have survived primarily through Roman copies, and (b) texts, which have survived by copying, including a body of mathematics which was kept alive in Arabic translation during a period when we Westerners were burning the works in Latin and Greek and claiming that the Bible was the only book we needed. I'll grant you that a high-quality book will last maybe 500 years in a controlled environment, maybe even 1000, but for real permanence, you just can't get past physical ubiquity. If we have things truly worthy of long-term preservation, we should be striking deals with the Warehouse to get them into every home in the country, and setting them as translation exercises in our language learning courses.

I had some excellent conversations with other participants at the event, including Phillipa Tocker from Museums Aotearoa / Te Tari o Nga Whare Taonga o te Motu who told me about the site they put together for their members.

Looking at the site I'm struck by how similar the search functionality is to I'm not sure whether their relative similarity is a good thing (because it enables non-experts to search the holdings) or a bad thing (because by lowering themselves to the lowest common denominator they've devalued their uniqueness). While I'm certain that these websites have vital roles in the museums and archives communities respectively, I can't help but feel that from an end-user's perspective having two sites rather than one seems redundant, and the fact that they don't seem to reference/suggest any other information sources doesn't help. I can't imagine a librarian/archivist not being forthcoming with a suggestion of where to look next if they've run out of local relevant content---why should our websites be any different?

I recently changed the NZETC to point to likely-relevant memory institutions when a search returns no results (or when a user pages through to the end of any list of results).
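
The logic is simple enough to sketch. This is not the NZETC code, just an illustration of the idea: the institution list, the `ResultPage` type and the `build_page` function are all hypothetical names of mine.

```python
from dataclasses import dataclass, field

# Placeholder list of places to look next; not the list the NZETC actually uses.
SUGGESTED_INSTITUTIONS = [
    "National Library of New Zealand",
    "Te Papa",
    "Museums Aotearoa",
]


@dataclass
class ResultPage:
    hits: list
    page: int
    total_pages: int
    suggestions: list = field(default_factory=list)


def build_page(all_hits: list, page: int, page_size: int = 10) -> ResultPage:
    """Slice out one page of hits and decide whether to suggest other sources."""
    total_pages = max(1, -(-len(all_hits) // page_size))  # ceiling division
    start = (page - 1) * page_size
    hits = all_hits[start:start + page_size]
    # Suggest elsewhere when there is nothing to show, or nothing more to show.
    out_of_content = not all_hits or page >= total_pages
    suggestions = list(SUGGESTED_INSTITUTIONS) if out_of_content else []
    return ResultPage(hits, page, total_pages, suggestions)


empty = build_page([], page=1)
print(empty.suggestions)
```

A user in the middle of a long result list sees no suggestions; one who has run out of local content gets pointed somewhere else, much as a librarian would do at the desk.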

I also talked to some chaps from Te Papa about the metadata they're using to represent place names (Getty Thesaurus of Geographic Names) and species names (ad-hoc). At the NZETC we have many place names marked up (in NZ, Europe and the Pacific), but are not currently syncing with an external authority. Doing so would hugely improve interoperability. Ideally we'd be using the shiny new New Zealand Gazetteer of Official Geographic Names, but it doesn't yet have enough of the places we need (it basically only covers places mentioned in legislation or treaty settlements). It does have macrons in all the right places though, which is an excellent start. We currently don't mark up species names, but would like to, and again an external authority would be great.

It might have been useful if the day had included an overview of what the NDHA actually was and what had been achieved (maybe I missed this?).

Sunday, 1 February 2009

flickr promoting the commons / creative commons

flickr is promoting photos in what it calls "The Commons", but only to logged in users. Normal users (who can't comment / tag the photos in the commons anyway) don't get an obvious link to them (except via about a billion third party sites such as blogs and google search). The page also shows how the national library's choice not to include text in their logo has come up trumps.

The confusingly similarly named "The Commons" and "Creative Commons" parts of the website apparently don't reference each other. Odd.