Sunday, 8 November 2009

ePubs and quality

You may have heard news about the release of "bookserver" by the good folks at the Internet Archive. This is a DRM-free ePub ecosystem, initially stocked with the prodigious output of Google's book scanning project and the Internet Archive's own book scanning project.

To see how the NZETC stacked up against the much larger (and better funded) collection, I picked one of our Maori language dictionaries. Our Maori and Pacifica dictionaries month after month make up the bulk of our top five most used resources, so they're in-demand resources. They're also an appropriate choice because when they were encoded into TEI by the NZETC, the decision was made not to use full dictionary encoding, but a cheaper/easier tradeoff which didn't capture the linguistic semantics of the underlying entries, but treated them as typeset text. I was interested in how well this tradeoff was wearing.

I did my comparison using the new Firefox ePub plugin; things will be slightly different if you're reading these ePubs on an iPhone or Kindle.

The ePub I looked at was A Dictionary of the Maori Language by Herbert W. Williams. The NZETC has the 1957 sixth edition. There are two versions of the work on bookserver: an 1852 second edition scanned by Google Books (original at the New York Public Library) and an 1871 third edition scanned by the Internet Archive in association with Microsoft (original in the University of California library system). All the processing of both works appears to have been done in the U.S. The original print used macrons (NZETC), acutes (Google) and breves (Internet Archive) to mark long vowels. Find them here.

Let's take a look at some entries from each, starting at 'kapukapu'. First, the NZETC:


kapukapu. 1. n. Sole of the foot.

2. Apparently a synonym for kaunoti, the firestick which was kept steady with the foot. Tena ka riro, i runga i nga hanga a Taikomako, i te kapukapu, i te kaunoti (M. 351).

3. v.i. Curl (as a wave). Ka kapukapu mai te ngaru.

4. Gush.

5. Gleam, glisten. Katahi ki te huka o Huiarau, kapukapu ana tera.

Kapua, n. 1. Cloud, bank of clouds. E tutakitaki ana nga kapua o te rangi, kei runga te Mangoroa e kopae pu ana (P.).

2. A flinty stone. = kapuarangi.

3. Polyprion oxygeneios, a fish. = hapuku.

4. An edible species of fungus.

5. Part of the titi pattern of tattooing.

Kapuarangi, n. A variety of matā, or cutting stone, of inferior quality. = kapua, 2.

Kāpuhi, kāpuhipuhi, n. Cluster of branches at the top of a tree.

Kāpui, v.t. 1. Gather up in a bunch. Ka kapuitia nga rau o te kiekie, ka herea.

2. Lace up or draw in the mouth of a bag.

3. Earth up crops, or cover up embers with ashes to keep them alight.

kāpuipui, v.t. Gather up litter, etc.

Kāpuka, n. Griselinia littoralis, a tree. = papauma.

Kapukiore, n. Coprosma australis, a shrub. = kanono.

Kāpuku = kōpuku, n. Gunwale.

Google Books:

Kapukapu, s. Sole of the foot,

Eldpukdpu, v. To curl* as a


Ka kapukapu mai te ngaru; The wave curls over.

Kapunga, v. To take up with both hands held together,

Kapungatia he kai i te omu; Take up food from the oven.

(B. C,

Kapura, s. Fire, -' Tahuna he kapura ; Kindle a fire.

Kapurangi, s. Rubbish; weeds,

Kara, s. An old man,

Tena korua ko kara ? How are you and the old man ?

Kara, s> Basaltic stone.

He kara te kamaka nei; This stone is kara.

Karaha, s. A calabash. ♦Kardhi, *. Glass,

Internet Archive:

kapukapu, n. sole of the foot.

kapukapu, v. i. 1. curl (as a wave). Ka kapukapu mai te ngaru. 2. gush.

kakapii, small basket for cooked food.

Kapua, n. cloud; hank of clouds,

Kapunga, n. palm of the hand.

kapunga, \. t. take up in both hands together.

Kapiira, n. fire.

Kapiiranga, n. handful.

kapuranga, v. t. take up by hand-fuls. Kapurangatia nga otaota na e ia. v. i. dawn. Ka kapuranga te ata.

Kapur&ngi, n. rubbish; uveds.

I. K&r&, n. old man. Tena korua ko kara.

II. K&r&, n. secret plan; conspiracy. Kei te whakatakoto kara mo Te Horo kia patua.

k&k&r&, D. scent; smell.

k&k&r&, a. savoury; odoriferous.

k^ar&, n. a shell-iish.

Unlike the other two, the NZETC version has accents, bold and italics in the right places. It's the only one with a workable and useful table of contents. It is also the edition which has been extensively revised and expanded. Google's second edition has many character errors, while the Internet Archive's third edition has many 'á' mis-recognised as '&'. The Google and Internet Archive versions are also available as PDFs, but of course, without fancy tables of contents these PDFs are pretty challenging to navigate, and because they're built from page images, they're huge.

It's tempting to say that the NZETC version is better than either of the others, and from a naïve point of view it is, but it's more accurate to say that it's different. It's a digitised version of a book revised more than a hundred years after the 1852 second edition scanned by Google Books. People who're interested in the history of the language are likely to pick the 1852 edition over the 1957 edition nine times out of ten.

Technical work is currently underway to enable third parties like the Internet Archive's bookserver to more easily redistribute our ePubs. For some semi-arcane reasons it's linked to upcoming new search functionality.

What LibraryThing metadata can the NZETC reasonably stuff inside its CC'd ePubs?

This is the second blog post following on from an excellent talk about LibraryThing by LibraryThing's Tim, given at VUW in Wellington after his trip to LIANZA.

The NZETC publishes all of its works as ePubs (a file format primarily aimed at mobile devices), which are literally processed crawls of its website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, because many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata as possible into the ePubs makes sense, because these are commonly designed for offline use; call-back hooks are unlikely to be available.
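As a sketch of what that bundling can look like: ePub packages carry their metadata as Dublin Core elements inside an OPF file, so a hypothetical exporter might build the block like this (the field values and function are invented for illustration; this is not the NZETC's actual pipeline):

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"     # Dublin Core namespace
OPF = "http://www.idpf.org/2007/opf"        # ePub OPF package namespace

def build_opf_metadata(fields):
    """Build the <metadata> element of an ePub OPF package
    from a dict of Dublin Core field names to values."""
    ET.register_namespace("dc", DC)
    ET.register_namespace("", OPF)
    metadata = ET.Element(f"{{{OPF}}}metadata")
    for name, value in fields.items():
        el = ET.SubElement(metadata, f"{{{DC}}}{name}")
        el.text = value
    return ET.tostring(metadata, encoding="unicode")

# Hypothetical record for one of the works mentioned above
xml = build_opf_metadata({
    "title": "Erewhon",
    "creator": "Samuel Butler",
    "language": "en",
})
```

The richer LibraryThing data (tags, cover art references, work identifiers) could ride along in the same block as additional elements.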

So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc.). LT does very well at this, NZETC very poorly.
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has work as the smallest unit. NZETC reproduces original document tables of contents and indexes, cross references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.

Most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things: primarily picking the best copy (most local / highest quality / most appropriate format / cheapest) of a work to display to a user, and browsing works by genre (LibraryThing does genre very well, via tags).
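The "best copy" choice could be as simple as a scoring function over the candidate copies; here's a minimal sketch of the most-local / highest-quality / cheapest trade-off, with made-up fields and weights (a real resolver would negotiate this dynamically):

```python
def pick_best_copy(copies):
    """Rank candidate copies of a work: prefer local copies, then
    higher quality, then cheaper. The fields and weights below are
    invented for illustration only."""
    def score(c):
        return (
            (2 if c["local"] else 0)   # on-device beats remote
            + c["quality"]             # e.g. 0..1 scan/OCR quality
            - c["price"]               # free copies win ties
        )
    return max(copies, key=score)

# Two hypothetical copies of the Williams dictionary discussed above
copies = [
    {"id": "google-1852", "local": False, "quality": 0.6, "price": 0.0},
    {"id": "nzetc-1957",  "local": True,  "quality": 0.9, "price": 0.0},
]
best = pick_best_copy(copies)
```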

Thursday, 15 October 2009

Interlinking of collections: the quest continues

After an excellent talk today about LibraryThing by LibraryThing's Tim, I got enthused to see how LibraryThing stacks up against other libraries for having matches in its authority control system for entities we (the NZETC) care about.
The answer is averagely.
For copies of printed books less than a hundred years old (or reprinted in the last hundred years), and their authors, LibraryThing seems to do very well. These are the books likely to be in active circulation in personal libraries, so it stands to reason that these would be well covered.
I tried half a dozen books from our Nineteenth-Century Novels Collection, and most were missing; Erewhon, of course, was well represented. LibraryThing doesn't have the "Treaty of Waitangi" (a set of manuscripts) but it does have "Facsimiles of the Treaty of Waitangi." It's not clear to me whether these would be merged under their cataloguing rules.
Coverage of non-core bibliographic entities was lacking. Places get a little odd. Sydney is ",%20New%20South%20Wales,%20Australia" but Wellington is "" and Anzac Cove appears to be missing altogether. This doesn't seem like a sane authority control system for places, as far as I can see. People who are the subjects rather than the authors of books didn't come out so well either. I couldn't find Abel Janszoon Tasman, Pōtatau Te Wherowhero or Charles Frederick Goldie, all of whom are near and dear to our hearts.

Here is the spreadsheet of how different web-enabled systems map entities we care about.

Correction: It seems that the correct URL for Wellington is,%20New%20Zealand which brings sanity back.

Saturday, 19 September 2009

eBook readers need OpenURL resolvers

Everyone's talking about the next generation of eBook readers having a larger reading area, more battery life and a more readable screen. I'd give up all of those, however, for an eBook reader that had an internal OpenURL resolver.

OpenURL is the nifty protocol that libraries use to find the closest copy of an electronic resource and direct patrons to copies that the library might have already licensed from commercial parties. It's all about dynamically finding the version of a resource that is most accessible to the user.

Say I've loaded 500 eBooks into my eBook reader: a couple of encyclopaedias and dictionaries; a stack of books I was meant to read in school but only skimmed and have been meaning to get back to; current blockbusters; guidebooks to the half-dozen countries I'm planning on visiting over the next couple of years; classics I've always meant to read (Tolstoy, Chaucer, Cervantes, Plato, Descartes, Nietzsche); and local writers (Baxter, Duff, Ihimaera, Hulme, ...). My eBooks by Nietzsche are going to refer to books by Descartes and Plato; my eBooks by Descartes are going to refer to books by Plato; my encyclopaedias are going to refer to pretty much everything; most of the works in translation are going to contain terms which I'm going to need help with (help which the encyclopaedias and dictionaries can provide).

Ask yourself, though, whether you'd want to flick between works on the current generation of readers---very painful, since these devices are not designed for efficient navigation between eBooks, but linear reading of them. You can't follow links between them, of course, because on current systems links must point either within the same eBook or out onto the internet---pointing to other eBooks on the same device is verboten. OpenURL can solve this by catching those URLs and making them point to local copies of works (and thus available for free even when the internet is unavailable) where possible, while still retaining their original targets as a fallback.
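The catching-and-redirecting step might look something like this sketch (the identifiers and index layout are invented; a real OpenURL resolver works with much richer citation metadata than a bare URL):

```python
def resolve(url, local_index):
    """Resolve a citation URL to the best available copy: a copy
    already on the device if the catalogue knows of one, otherwise
    the original internet URL. local_index maps citation URLs to
    work identifiers, and work identifiers to on-device paths."""
    work_id = local_index.get("by_url", {}).get(url)
    if work_id and work_id in local_index.get("paths", {}):
        return "file://" + local_index["paths"][work_id]
    return url  # fall back to the internet copy

# A toy on-device catalogue
index = {
    "by_url": {"http://example.org/plato/republic": "plato-republic"},
    "paths": {"plato-republic": "/books/republic.epub"},
}
```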

Until eBook readers have a mechanism like this, eBooks will be at best a replacement for paperback novels---not personal libraries.

Tuesday, 15 September 2009

Thoughts on koha

The Koha community is currently undergoing a spasm, with a company apparently forking the code.
As a result a bunch of people are looking at where the community should go from here and how it should be led. In particular the idea of a not-for-profit foundation has been floated and is to be discussed at a meeting early tomorrow morning.
My thoughts on this issue are pretty simple:
  • A not-for-profit is a fabulous idea
  • Reusing one of the existing software not-for-profits (Apache, Software in the Public Interest, etc.) introduces a layer of non-library complexity. Libraries have a long history with consortia, but tend to very much flock together with their own kind; I can see them being leery of a non-library entity.
  • A clear description of a forward-looking plan written in plain language that everyone can understand is vital to communicate the vision of the community, particularly to those currently on the fringes

Tuesday, 1 September 2009

Data and data modelling and underlying assumptions

I feel that there was a huge disconnect between some groups of participants at #opengovt in Wellington last weekend. This is my attempt to illuminate the gaps.

The gaps were about data and data modelling and underlying assumptions that the way one person / group / institution viewed a kind of data was the same as the way others viewed it.

This gap is probably most pronounced in geo-location.

There's a whole bunch of very bright people doing wonderful mashups in geo-location using a put-points-on-a-map model. Typically using Google Maps (or one of a small number of competitors), they give insights into all manner of things by throwing points onto maps, street views, etc. It's a relatively new field and every time I look they seem to have a whizzy new toy. Unfortunately, the very success of the 'data as points' model encourages the view that a location is a lat/long pair and that the important metric is the number of significant digits in the lat/long.

In the GLAM (Galleries, Libraries, Archives and Museums) sector, we have a tradition of using thesauri such as the Getty Thesaurus of Geographic Names. Take a look at the entry for the Wellington region:

Yes, it has a lat and a long (with laughable precision), but the lat and long are arguably the least important information on the page. There's a faceted hierarchy, synonyms, linked references and type data. Te Papa have just moved to Getty for place names in their new site, and frankly, I'm jealous. They paid a few thousand dollars for a licence to the thesaurus and it's a joy to use.
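To make the contrast concrete, here's a toy model of a thesaurus-style place record, where the lat/long is just one attribute alongside the hierarchy and synonyms (the field names are mine, not the Getty TGN schema):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Place:
    """A place modelled thesaurus-style: position is one attribute
    among many, not the whole record."""
    name: str
    place_type: str                      # e.g. nation, region, inhabited place
    parent: Optional["Place"] = None     # link into the faceted hierarchy
    synonyms: List[str] = field(default_factory=list)
    lat: Optional[float] = None
    lon: Optional[float] = None

    def hierarchy(self):
        """Walk up the hierarchy to the root, most specific first."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

nz = Place("New Zealand", "nation")
wellington = Place("Wellington", "region", parent=nz,
                   synonyms=["Te Whanganui-a-Tara"], lat=-41.3, lon=174.8)
```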

The idea of #opengovt is predicated on institutions and individuals speaking the same languages, being able to communicate effectively, and this is clearly a case where we're not. Learning to speak each other's languages seems like it's going to be key to this whole venture.

As something of a worked example, here's something that I'm working on at the moment. It's a page from The Manual of the New Zealand Flora by Thomas Frederick Cheeseman, a core text in New Zealand botany. The text is live on our website, but it's not yet fully marked up. I've chosen it because it illustrates two separate kinds of languages and their disparities.

What are the geographic locations on that page?

* Nelson-Mountains flanking the Clarence Valley
* Marlborough—Kaikoura Mountains
* Canterbury—Kowai River
* Canterbury—Coleridge Pass
* Otago—Mount St. Bathan's

The qualifier "2000–5000 ft" (which I believe is the elevation range at which these flourish) applies across all of these. Clearly we're going to struggle to represent these with a finite number of lat/long points, no matter how accurate. In all likelihood I'll not actually mark up these locations: because no one's working with complex locations, the cost-benefit isn't within sight of being worth it.
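For what it's worth, a structure that could carry those locations without flattening them to points might look like this (purely illustrative; the field names are invented):

```python
# One record per locality, with the shared elevation qualifier kept
# as a range rather than collapsed into a point.
localities = [
    {"district": "Nelson",      "feature": "Mountains flanking the Clarence Valley"},
    {"district": "Marlborough", "feature": "Kaikoura Mountains"},
    {"district": "Canterbury",  "feature": "Kowai River"},
    {"district": "Canterbury",  "feature": "Coleridge Pass"},
    {"district": "Otago",       "feature": "Mount St. Bathan's"},
]

# The "2000-5000 ft" qualifier applies across every locality above,
# so it lives on the record, not on any single point.
record = {"elevation_ft": (2000, 5000), "localities": localities}
```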

Te Papa and the NZETC have a small-scale binomial name exercise underway, and for that I'll be scripting the extraction of the following names from that page:

* Notospartium carmichœliœ (synonym Notospartium carmichaeliae)
* Notospartium torulosum

There were a bunch of folks at the #opengovt barcamp who're involved in the "New Zealand Organisms Register" project. As I understand it, they want me to expose the following names from that page:

* Notospartium carmichœliœ, Hook. f.
* Notospartium torulosum, Hook. f.

Of course the name the public want is:

* New Zealand Pink Broom
* ? (Notospartium torulosum appears not to have a common name)

Note that none of these taxonomic names actually appear in full on the page...

This is, clearly, an area where the best can be the enemy of the good and vice versa, but the good needs to at least be aware of the best.

Monday, 27 July 2009

Learning XSLT 2.0 Part 1; Finding Names

We mark up a lot of names, so one of the first things I decided to do was to build an XSLT stylesheet that takes a list of names and tags those names when they occur in a separate XML file. To make things easier and clearer, I've ignored little things like namespaces, conformant TEI, etc.

First up, the list of names; these are multi-word names. Notice the simple structure: this could easily be built from a comma-separated list or similar:

<?xml version="1.0" encoding="UTF-8"?>
<names>
<name>Papaver argemone</name>
<name>Papaver dubium</name>
<name>Papaver Rhceas</name>
<name>Zanthoxylum novæ-zealandiæ</name>
</names>

Next, some sample text:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
<p> There are several names Papaver argemone in this document Papaver argemone</p>
<p> Some of them are the same as others (Papaver Rhceas Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like Zanthoxylum novæ-zealandiæ AKA Zanthoxylum novae-zealandiae</p>
</doc>

Finally, the stylesheet. It consists of three parts: a regexp variable that builds a regexp from the names in the file; a default template for everything but text(); and a template for text() nodes that applies the regexp.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- build a regexp of the names -->
<xsl:variable name="regexp">
<xsl:value-of select="concat('(', string-join(document('name-list.xml')//name/text(), '|'), ')')"/>
</xsl:variable>

<!-- generic copy-everything-except-text template -->
<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
</xsl:copy>
</xsl:template>

<!-- tag any occurrence of a listed name in a text node -->
<xsl:template match="text()">
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<name type="taxonomic" subtype="matched">
<xsl:value-of select="regex-group(1)"/>
</name>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>

The output looks like:

<?xml version="1.0" encoding="UTF-8"?><doc>
<p> There are several names <name type="taxonomic" subtype="matched">Papaver argemone</name> in this document <name type="taxonomic" subtype="matched">Papaver argemone</name></p>
<p> Some of them are the same as others (<name type="taxonomic" subtype="matched">Papaver Rhceas</name> Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like <name type="taxonomic" subtype="matched">Zanthoxylum novæ-zealandiæ</name> AKA Zanthoxylum novae-zealandiae</p>
</doc>

As you may notice, I've not yet worked out the best way to handle the 'æ'.
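One possible approach, sketched here in Python rather than XSLT for brevity, is to make the regexp itself tolerant of both spellings when it's built, so that 'æ' and 'ae' match each other:

```python
import re

def ae_tolerant(pattern_text):
    """Rewrite a name so that 'æ' and 'ae' are interchangeable in
    matching. Just one possible approach -- the post above leaves
    the question open."""
    return re.sub(r"æ|ae", "(?:æ|ae)", pattern_text)

# 'Zanthoxylum novæ-zealandiæ' should now also match its ASCII spelling
rx = re.compile(ae_tolerant("Zanthoxylum novæ-zealandiæ"))
```

The same substitution could be applied inside the XSLT `$regexp` variable with `replace()` before the alternation is assembled.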

Saturday, 6 June 2009

Legal Māori Archive

Now that the Legal Māori Archive is live, I thought I'd highlight a couple of my favourite texts from the corpus.

The first is a great example of reinforcing cultural confusion.
"The Laws of England, Compiled and translated into the Māori language" by judge Francis Dart Fenton is a bi-lingual compendium of the laws of England, but extraordinarily uses bible quotes as examples.

The second example is actually a collection of texts: the works of Rev. Henry Hanson Turton, who compiled thousands of pages of land deeds and associated documents into six volumes. I can see these seeing a lot of use by Treaty researchers.

Tuesday, 5 May 2009

Why card-based records aren't good enough

Card catalogs have a long tradition in librarianship, dating back, I'm told, to the book stock-take of the French Revolution. Librarians understand card catalogs in a deep way that comes from generations of librarians having used them as a core professional tool all their professional lives. Librarians understand card catalogs in ways that I, as a computer scientist, never will. I still recall, on one of my first visits to a university library, asking a librarian where I might find books by a particular author; they found the work for me arguably as fast as I can now find works with the new wizzy electronic catalog.

It is natural, when faced with something new, to understand it in terms of what we already know and already understand. Unfortunately, understanding the new by analogy to the old can lead to the form of the old being assumed in the new. When libraries digitized their card catalogs in the 1970s and 1980s, the results were more or less exactly digital versions of their card catalog predecessors, partly because their content was limited to old data from the cards and new data from cataloging processes (which were unchanged from the card catalog era), and partly because librarians and users had come to equate a library catalog with a card catalog---it was what they expected.

MARC is a perfect example of this kind of thing. As a data format to directly replace a card catalog of printed books, it can hardly be faulted.

Unfortunately, digital metadata has capabilities undreamt of at the time of the French revolution, and card catalogs and MARC do a poor job of handling these capabilities.

A whole range of people have come up with criticisms of MARC that involve materials and methodologies not routinely held in libraries at the time of the French revolution (digital journal subscriptions and music, for example), but I view these as postdating card catalogs and thus the criticism as unfair.

So what was held in libraries in 1789 that MARC struggles with? Here's a list:
  • Systematically linking discussion of particular works with instances of those works
  • Systematically linking discussion of particular instances with those instances ("Was person X the transcriber of manuscript Y?")
  • Handling ambiguity ("This play may have been written by Shakespeare. It might also have been a later forgery by Francis Bacon, Christopher Marlowe or Edward de Vere")
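The third item, ambiguity, is perhaps the clearest: a record needs to hold several candidate values for one relationship, something a single fixed "author" field cannot express. A sketch of one invented structure (not a real vocabulary):

```python
# A claim with several candidate values, each carrying its own
# status, rather than one authoritative author field.
attribution = {
    "work": "The Play in Question",
    "relation": "author",
    "candidates": [
        {"agent": "William Shakespeare", "status": "attributed"},
        {"agent": "Francis Bacon",       "status": "conjectured forgery"},
        {"agent": "Christopher Marlowe", "status": "conjectured forgery"},
        {"agent": "Edward de Vere",      "status": "conjectured forgery"},
    ],
}
```

The point isn't this particular shape, but that the ambiguity is first-class data rather than a cataloguer's note.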

All of these relate to core questions which have been studied in libraries for centuries. They're well understood issues, which changed little in the hundred years until the invention of the computer (which is when all the usually-cited issues with MARC began).

The real question is why we're still expecting an approach that didn't solve these problems two hundred years ago to solve our problems now. Computers are not magic in this area; they just seem to be helping us do the wrong things faster, more reliably and for larger collections.

We need a new approach to bibliographic metadata, one which is not ontologically bound to little slips of paper. There are a whole range of different alternatives out there (including a bevy of RDF vocabularies), but I've yet to run into one which both allows clear representation of existing data (because let's face it, I'm not going to re-enter WorldCat, and neither are you, not in our lifetimes) and admits non-card-based metadata as first-class elements.


Friday, 1 May 2009

LoC gets semantic

This morning, the Library of Congress launched id.loc.gov, their first serious entry into the semantic web.

The site makes the Library of Congress Subject Headings available as dereferenceable URIs.

Wednesday, 4 February 2009

NDHA demo and the National Library

This morning I went to the NDHA demonstration, where a National Library techie talked us through the NDHA ingest tools. The tools are the most visible piece of the NDHA infrastructure, and are designed to unify the ingest of digital documents, whether they are born-digital documents physically submitted (i.e. producers mail in CDs/DVDs etc.); born-digital documents electronically submitted (i.e. producers upload content via wizzy web tools); or digital scans of current holdings produced as part of the on-going digitisation efforts. The tools are a unified system with different workflows for unpublished material (=archive) and published material (=library). The unification of library and archival functionality seemed like fertile ground for miscommunication.

The infrastructure is (correctly) embedded across the library, and uses all the current tools for collection maintenance, searching and access.

As a whole it looks like the system is going to save time and money for large content producers and capture better metadata for small donors of content, which is great. By moving the capture of metadata closer to the source (while still allowing professional curatorial processes for selection and cataloguing), it looks like more context is going to be captured, which is fabulous.

A couple of things struck me as odd:

  1. The first feedback to the producer/uploader is via a human. Despite having an elaborate validation suite, the user wasn't told immediately "that .doc file looks like a PDF, would you like to try again?" or "Yep, that *.xml file is valid and conforming XML." Time and again, studies have shown that immediate feedback, allowing people to correct their mistakes straight away, is important and effective.
  2. The range of metadata fields available for tagging content was very limited. For example there was no Iwi/Hapu list, no Maori Subject Headings, no Gazetteer of Official Geographic Names. When I asked about these I was told "that's the CMS's role" (=Collection Management Software, i.e. those would be added by professional cataloguers later), but if you're going to move metadata collection as close as possible to content generation, it makes sense to at least have the option of proper authority control over it.
Or that's my take, anyway. Maybe I'm missing something.
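For what it's worth, the immediate "that .doc file looks like a PDF" feedback in point 1 can be done with a few bytes of content sniffing before any human gets involved; a toy sketch (the signatures shown and the message wording are mine):

```python
def sniff_mismatch(filename, data):
    """Immediate-feedback check: does the file's content match its
    extension? Only two signatures are shown here; a real format
    validator knows hundreds."""
    signatures = {
        b"%PDF-": ".pdf",   # PDF files start with %PDF-
        b"<?xml": ".xml",   # XML declaration
    }
    for magic, ext in signatures.items():
        if data.startswith(magic) and not filename.lower().endswith(ext):
            return f"{filename} looks like a {ext} file, would you like to try again?"
    return None  # no mismatch detected

# A Word document that's actually a PDF should be flagged at once
msg = sniff_mismatch("report.doc", b"%PDF-1.4 ...")
```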

Monday, 2 February 2009

Report from the NDHA's International Perspectives on Digital Preservation

NOTE: I'm a computer scientist by training and this was largely a librarian/archivist gig, so it's entirely possible I've got the wrong end of the stick on one or more points in the summary below. It's also my own summary, and not the position of my employer, even though I was on work time during the event.

The NDHA is about to announce that the NDHA project has been completed on time and under budget. This is particularly pleasing in light of the long history of government IT failures over the last 30 years, and a tribute to all concerned. Indeed, when I was taking undergraduate courses in software engineering, a contemporary national library project was used as a text-book example of how not to run a software development undertaking. It's good to see how far they've come.

The event itself was a one-day affair in the national library auditorium, with a handful of overseas speakers. I'm not entirely certain that a handful of foreigners counts as "international", but maybe that's just me being a snob. Certainly there was a fine turn-out of locals, including many from the National Library, the Ministry for Culture and Heritage and from VUW, including a number of students, who couldn't possibly have been there for the free food.

There seemed to be an underlying tension between librarianship and archivistship running through the event. I see this as a really crazy turf war, personally, since the chances of libraries and archives existing as separate entities and disciplines in fifty years seem pretty slim. The separation between the two, the "uniqueness" of objects in an archive, seems to me to be obliterated by the free duplication of digital objects. I've heard people say that archives also offer access controls and embargoes for their depositors, but then so can libraries, particularly those in the military and those working with classified documents.

It seemed to me that the word "reliability" was used in a confusing number of different ways by different people. Without naming the guilty parties:
  1. reliability as the truthfulness of the documents in the library/archive. This is the old problem of ingestors having to determine the absolute veracity of documents
  2. reliability as getting the same metadata every time. This seems odd to me, since systems with audit control give _different_ results every time, because information on the previous accesses is included in the metadata of subsequent accesses
  3. reliability as the degree to which the system conformed to a standard/specification
On reflection this may have been a symptom of the different vocabulary used by librarians and archivists. Whatever the cause, if we're wanting to spend public money, we have to be able to explain to the public what we're doing, and this isn't helping.

The organisers told us the presentations would be up by tonight (the evening of the presentation), but you won't find them on Google if you go looking, because they tell Google to please f**k off. I guess this is what someone was referring to when they said we had to work to make content accessible to Google. Most of the presentations were up at the time of writing.

I was hugely encouraged by the number of pieces of software that seem to be being open sourced, as I see this as a much better economic model than paying vendors for custom software, particularly since it's potentially scalable out from the national and top-tier libraries/archives/museums to the second- and third-tier ones, which by dint of their much larger numbers actually serve the most users and hold the most content. It was unfortunate that the national library hasn't looked beyond proprietary software for non-specialist needs, but continues to use Adobe Photoshop / Microsoft Windows, which are available only for limited periods of time on certain platforms (which will inevitably become obsolete), rather than OpenOffice, GIMP, etc., which are cross-platform and licensed under perpetual licences that include the right to port the software from one platform to another. I guess Photoshop / Windows is what their clients and funders know and use.

With a number of participants I had conversations about preservation. Andrew Wilson in his presentation used the quote:

“traditionally, preserving things meant keeping them unchanged; however our digital environment has fundamentally changed our concept of preservation requirements. If we hold on to digital information without modifications, accessing the information will become increasingly difficult, if not impossible” Su-Shing Chen, “The Paradox of Digital Preservation”, Computer, March 2001

If you think about what intellectual objects we have from the Greeks (which is where us Westerners traditionally trace our intellectual history from), the majority fall into two main classes: (a) art works, which have survived primarily through Roman copies, and (b) texts, which have survived by copying, including a body of mathematics which was kept alive in Arabic translation during a period when we Westerners were burning the works in Latin and Greek and claiming that the Bible was the only book we needed. I'll grant you that a high-quality book will last maybe 500 years in a controlled environment, maybe even 1000, but for real permanence, you just can't get past physical ubiquity. If we have things truly worthy of long-term preservation, we should be striking deals with the Warehouse to get them into every home in the country, and setting them as translation exercises in our language learning courses.

I had some excellent conversations with other participants at the event, including Phillipa Tocker from Museums Aotearoa / Te Tari o Nga Whare Taonga o te Motu who told me about the site they put together for their members.

Looking at the site, I'm struck by how similar the search functionality is to that of other sites of its kind. I'm not sure whether their relative similarity is a good thing (because it enables non-experts to search the holdings) or a bad thing (because by lowering themselves to the lowest common denominator they've devalued their uniqueness). While I'm certain that these websites have vital roles in the museums and archives community respectively, I can't help but feel that from an end-user's perspective having two sites rather than one seems redundant, and the fact that they don't seem to reference/suggest any other information sources doesn't help. I can't imagine a librarian/archivist not being forthcoming with a suggestion of where to look next if they've run out of local relevant content, so why should our websites be any different?

I recently changed the NZETC to point to likely-relevant memory institutions when a search returns no results (or when a user pages through to the end of any list of results).

I also talked to some chaps from Te Papa about the metadata they're using to represent place names (Getty Thesaurus of Geographic Names) and species names (ad hoc). At the NZETC we have many place names marked up (in NZ, Europe and the Pacific), but we're not currently syncing with an external authority. Doing so would greatly improve interoperability. Ideally we'd be using the shiny new New Zealand Gazetteer of Official Geographic Names, but it doesn't yet have enough of the places we need (it basically only covers places mentioned in legislation or Treaty settlements). It does have macrons in all the right places though, which is an excellent start. We currently don't mark up species names, but would like to, and again an external authority would be great.
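To give a flavour of what syncing with an authority involves, here is a minimal sketch of matching marked-up place names against a gazetteer export. It assumes the gazetteer can be flattened to a name-to-identifier mapping; the names and identifiers shown are purely illustrative, not real gazetteer records. The interesting wrinkle is the macron handling: because older texts and databases often lack macrons, it helps to index each official name under both its macronised and macron-stripped spellings.

```python
import unicodedata

def strip_macrons(name: str) -> str:
    """Remove combining macrons so that 'Taupō' also matches 'Taupo'."""
    decomposed = unicodedata.normalize("NFD", name)
    # U+0304 is COMBINING MACRON; drop it and recompose the rest.
    stripped = "".join(c for c in decomposed if c != "\u0304")
    return unicodedata.normalize("NFC", stripped)

def build_index(gazetteer: dict[str, str]) -> dict[str, str]:
    """Index official names under both macronised and macron-less keys."""
    index: dict[str, str] = {}
    for name, place_id in gazetteer.items():
        index.setdefault(name.casefold(), place_id)
        index.setdefault(strip_macrons(name).casefold(), place_id)
    return index

# Illustrative entries only; real identifiers would come from the
# gazetteer's own export, whatever form that takes.
gazetteer = {"Taupō": "place-1234", "Whangārei": "place-5678"}
index = build_index(gazetteer)

print(index["taupo"])      # macron-less spelling still resolves
print(index["whangārei"])  # macronised spelling resolves directly
```

The same normalisation trick would apply whichever authority we end up syncing with, since the macron problem sits in our source texts rather than in any particular gazetteer.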

It might have been useful if the day had included an overview of what the NDHA actually was and what had been achieved (maybe I missed this?).

Sunday, 1 February 2009

flickr promoting the commons / creative commons

flickr is promoting photos in what it calls "The Commons", but only to logged-in users. Normal users (who can't comment on / tag the photos in the commons anyway) don't get an obvious link to them (except via about a billion third-party sites such as blogs and Google search). The page also shows how the National Library's choice not to include text in their logo has come up trumps.

The confusingly similarly named "The Commons" and "Creative Commons" parts of the website apparently don't reference each other. Odd.

Friday, 9 January 2009

Excellent stuff from New Zealand Geographic Board Ngā Pou Taunaha o Aotearoa

A while ago, motivated by the need for an authoritative list of New Zealand place names for our work at the NZETC, I criticised the NZGB fairly roundly.
While they haven't produced what I/we want/need, in the last couple of months they've made huge progress in an unambiguously right direction.
Their primary work is the New Zealand Gazetteer of Official Geographic Names, a list of all official place names in New Zealand. It uses a peculiar definition of "official" (= mentioned in legislation or a Treaty of Waitangi settlement); it has very few names of inhabited places (and no linking with the much larger lists maintained by official bodies such as the police and fire service); it has no elevation data for mountains and passes (which are defined by their height); and it defines some things as points when they appear to be areas (such as Arthur's Pass National Park). Even so, it's much better than the New Zealand Place Names Database since:
  1. It has a statutory reference for every place, giving the source of the officialness of the name
  2. It fully supports macrons
  3. It has a machine-readable list of DoC-administered lands --- I can imagine this being used for all sorts of interesting things, like getting people out into scenic and marine reserves.
NZGB sent around an email in which they explicitly addressed some of the points I'd earlier raised (I'm sure I wasn't the only one):
It should be noted that some of the naming practices of the past will have to be lived with, despite inconsistencies. Moving forward, the rules of nomenclature followed by the NZGB are designed to promote standardisation, consistency, and non-ambiguity. The modern format for dual names is '<Maori name> / <non-Maori name>', which the NZGB has applied for the past 10 years, though Treaty settlement dual names sometimes deviate from this convention, because the decision is ultimately made by the Minister for Treaty of Waitangi Negotiations. Older forms of dual names, with brackets, will remain depicted as such until changed through the statutory processes of the NZGB Act 2008. These are not generally regarded as alternative names.
Macrons in Maori names have posed problems for electronic databases. Nevertheless they are part of the orthography, recommended by the Maori Language Commission, and the Board endorses their use. The Gazetteer will include macrons where they are formalised as part of the official name. When Section 32 of the new Act comes into force, official documents will be required to show official names, and these will need to include macrons where they have been included as part of the official name (unless the proviso is used). A list of those official names which have macrons is at . LINZ's Customer Services has some solutions for showing macrons in LINZ's own databases and on published maps and charts, and is currently investigating how bulk data extracts might include information about macrons, for the customer's benefit.
Despite the name, it isn't clear in my mind exactly what's official and what isn't. Is the content of the "coordinates" column official? For railway lines this column is a reference to a description, usually of the form "From X to Y", where X and Y are place names --- frequently place names that aren't on the list, and thus presumably not official. Unless I'm going blind, there is also no indication of the accuracy of the physical measurements.