Monday, 3 October 2016

How would we know when it was time to move from TEI/XML to TEI/JSON?

This post inspired by TEI Next by Hugh Cayless.

How would we know when it was time to move from TEI/XML to TEI/JSON?

If we stand back and think about what it is we (the TEI community) need from the format :
  1. A common format for storing and communicating Texts and augmentations of Texts (Transcriptions, Manuscript Description, Critical Apparatus, Authority Control, etc, etc.).
  2. A body of documentation for shared use and understanding of that format.
  3. A method of validating Texts in the format as being in the format.
  4. A method of transforming Texts in the format for computation, display or migration.
  5. The ability to reuse the work of other communities so we don't have to build everything for ourselves (Unicode, IETF language tags, URIs, parsers, validators, outsourcing providers who are tooled up to at least have a conversation about what we're trying to do, etc)
[Everyone will have their slightly different priorities for a list like this, but I'm sure we can agree that a list of important functionality could be drawn up and expanded to requirements list at a sufficiently granular level so we can assess different potential technologies against those items. ] 

If we really want to ponder whether TEI/JSON is the next step after TEI/XML we need to compare the two approaches against such as list of requirements. Personally I'm confident that TEI/XML will come out in front right now. Whether javascript has potential to replace XSLT as the preferred method for really exciting interfaces to TEI/XML docs is a much more open question, in my mind.  

That's not to say that the criticisms of XML aren't true (they are) or valid (they are) or worth repeating (they are), but perfection is commonly the enemy of progress.

Sunday, 2 October 2016

Whither TEI? The Next Thirty Years



This post is a direct response to some of the organisational issues raised in https://scalablereading.northwestern.edu/?p=477
I completely agree that we need to significantly broaden the base of the TEI. A 200 x 500 campaign is a great idea, but better is a 2,000 x 250 goal, or a 20,000 x 250 goal. If we can reduce the cost to the normal range of a hardback text, most libraries will have delegated signing authority to individuals in acquisitions and only one person will need to be convinced, rather than a chain of people.
But how could we scale 20,000 institutions? To scale like that, we to think (a) in terms of scale and (b) in terms of how to make it easy for members to be a part of us.

Scale (1)

A recent excellent innovation in the the TEI community has been the appointment of a social media coordinator. This is a great thing and I’ve certainly learnt about happenings I would not have otherwise been exposed to. But by nature the concept of ‘a social media coordinator’ can’t scale (one person in one time zone with one set of priorities...). If we look at what mature large-scale open projects do for social media (debian, wikimedia, etc), planets are almost always part of the solution. A planet for TEI might include (in no particular):
  1. 20x blog feeds from TEI-specific projects
  2. 20x blog feeds from TEI-using projects (limited to those posts tagged TEI)
  3. 1x RSS feed for changes to the TEI wiki (limited to one / day each)
  4. 1x RSS feed for jenkins server (limited to successful build only; limited to one / day each; tweaked to include full context and links)
  5. 20x RSS feeds for github repositories not covered by jenkins server (limited to one / day each)
  6. 10x RSS feeds for other sundry repositories (limited to one / day each)
  7. 50x blog feeds from TEI-people (limited to those posts tagged TEI)
  8. 15x RSS feeds from TEI-people’s zotero bibliographic databases (limited to those bibs tagged TEI; limited to one / day each)
  9. 1x RSS feed for official TEI news
  10. 7x RSS feed of edits for the TEI article on each language wikipedia (limited to one / day each)
  11. 1x RSS feed of announcements from the JTEI
  12. 1x RSS feed of new papers in the JTEI
The diversity of the planet would be incredible compared to current views of the TEI community and it’s all generated as a byproduct of what people are already doing. There might be some pressure to improve commit messages in some repos, but that might not be all bad.
Of course the whole planet is available as an RSS feed and there are RSS-to-facebook (and twitter, yammer, etc) converters if you wish to do TEI in your favourite social media. If the need for a curated facebook feed remains, there is now a diverse constant feed of items to select within.
This is a social media approach at scale.

Scale (2)

There is an annual international conference which is great to attend. There is a perception that engagement in the TEI community requires attendance at the said conference. It’s a huge barrier to entry to small projects, particularly those in far-away places (think global south / developing world / etc). The TEI community should seriously consider a policy for decision making that explicitly removes assumptions about attendances. Something as simple as requiring draft papers intended for submission and agendas to be published and 30 days in advance of meetings and a notice to be posted to TEI-L. That would allow for thoughtful global input, scaling community from those who can attend an annual international conference to a wider group of people who care about the TEI and have time to contribute.

Make it easy (1)

Libraries (at least the library I work in and libraries I talk to) buy resources based on suggestions and lobbying by faculty but renew resources based largely on usage. If we want 20,000 libraries to have TEI on automatic renewal we need usage statistics. The players in the field are SUSHI and COUNTER (SUSHI is a harvesting system for COUNTER).
Maybe the TEI offers members stats at 10 diverse TEI-using sites. It’s not clear to me without deep investigation whether the TEI could offer these stats to members at very little on-going cost to us, but it would be a member benefit that all acquisitions librarians, their supervisors and their auditors could understand and use to evaluate their TEI membership subscription. I believe that that comparison would be favourable.
Of course, the TEI-using sites generating the traffic are going to want at least some cut of the subs, even if it’s just a discount against their own membership (thus driving the number of participating sites up and the perceived member benefits up) and free support for the stats-generating infrastructure.
For the sake of clarity: I’m not suggesting charging for access to content, I’m suggesting charging institutions for access to statistics related to access to the content by their users.

Make it easy (2)

Academics using computers for research, whether or not they think or call the field digital humanities face a relatively large number of policies and rules imposed by their institutions, funders and governments. The TEI community can / should be selling itself as he approach to meet these.
  1. Copyright issues? Have some corpora that are available under a CC license.
  2. Need to prove academic outputs are archivable? Here’s the PRONOM entry (Note: I’m currently working on this)
  3. Management doesn’t think the department as the depth of TEI experience to enroll PhDs in TEI-centric work? Here’s a map of global TEI people to help you find local backups in case staff move on.
  4. Looking for a TEI consultant? A different facet of the same map gives you what you need.
  5. You’re a random academic who knows nothing about the TEI but assigned a TEI-centric paper as part of a national research assessment exercise? Here’s an outline of TEI’s academic credentials.
  6. ....

Make it easy (3)

Librarians love quality MARC / MARCXML records. Many of us have quality MARC / MARCXML records for our TEI-based web content. Might this be offered as a member benefit?

Make it easy (4)

As far as I can tell the TEI community makes very little attempt to reach out to academic communities other than ‘literature departments and cognate humanities disciplines’ attracting a more diverse range of skills and academics will increase our community in depth and breadth. Outreach could be:
  1. Something like CSS Zen Garden http://www.csszengarden.com/ only backed by TEI rather than HTML
  2. A list of ‘hard problems’ that we face that various divergent disciplines might want to set as second or third year projects. Each problem would have a brief description of the problem, pointers to Things like:
    1. Transformation for display for documents have five foot levels of footnotes, multiple obscure scripts, non-Unicode characters, and so forth.
    2. Schema / ODD auto-generation from a corpus of documents
    3. ...
  3. Engaging with a group like http://software-carpentry.org/ to ubiquify TEI training
  4. ..

End Note

I'm not advocating that any particular approach is the cure-all for everything that might be ailing the TEI community, but the current status-quo is increasingly seeming like benign neglect. We need to change the way we think about TEI as a community.


Tuesday, 20 October 2015

Thoughts on the NDFNZ wikipedia panel






Last week I was on an NDFNZ wikipedia panel with Courtney Johnston, Sara Barham and Mike Dickison. Having reflected a little and watched the youtube at https://www.youtube.com/watch?v=3b8X2SQO1UA I've got some comments to make (or to repeat, as the case may be).

Many people, including apparently including Courtney, seemed to get the most enjoyment out of writing the ‘body text’ of articles. This is fine, because the body text (the core textual content of the article) is the core of what the encyclopaedia is about. If you can’t be bothered with wikiprojects, categories, infoboxes, common names and wikidata, you’re not alone and there’s no reason you need to delve into them to any extent. If you start an article with body text and references that’s fine; other people will to a greater or less extent do that work for you over time. If you’re starting a non-trivial number of similar articles, get yourself a prototype which does most of the stuff for you (I still use https://en.wikipedia.org/wiki/User:Stuartyeates/sandbox/academicbio which I wrote for doing New Zealand women academics). If you need a prototype like this, feel free to ask me.

If you have a list of things (people, public art works, exhibitions) in some machine readable format (Excel, CSV, etc) it’s pretty straightforward to turn them into a table like https://en.wikipedia.org/wiki/Wikipedia:WikiProject_New_Zealand/Requested_articles/Craft#Proposed_artists or https://en.wikipedia.org/wiki/Enjoy_Public_Art_Gallery Send me your data and what kind of direction you want to take it.

If you have a random thing that you think needs a Wikipedia article, add to https://en.wikipedia.org/wiki/Wikipedia:WikiProject_New_Zealand/Requested_articles  if you have a hundred things that you think need articles, start a subpage, a la https://en.wikipedia.org/wiki/Wikipedia:WikiProject_New_Zealand/Requested_articles/Craft and https://en.wikipedia.org/wiki/Wikipedia:WikiProject_New_Zealand/Requested_articles/New_Zealand_academic_biographies both completed projects of mine.

Sara mentioned that they were thinking of getting subject matter experts to contribute to relevant wikipedia articles. In theory this is a great idea and some famous subject matter experts contributed to Britannica, so this is well-established ground. However, there have been some recent wikipedia failures particularly in the sciences. People used to ground-breaking writing may have difficulty switching to a genre where no original ideas are permitted and everything needs to be balanced and referenced.

Preparing for the event, I created a list of things the awesome Dowse team could do as follow-ups to they craft artists work, but we never got to that in the session, so I've listed them here:
  1. [[List of public art in Lower Hutt]] Since public art is out of copyright, someone could spend a couple of weeks taking photos of all the public art and creating a table with clickable thumbnail, name, artist, date, notes and GPS coordinates. Could probably steal some logic from somewhere to make the table convertible to a set of points inside a GPS for a tour.
  2. Publish from their archives a complete list of every exhibition ever held at the Dowse since founding. Each exhibition is a shout-out to the artists involved and the list can be used to check for potentially missing wikipedia articles.
  3. Digitise and release photos taken at exhibition openings, capturing the people, fashion and feeling of those era. The hard part of this, of course, is labelling the people.
  4. Reach out to their broader community to use the Dowse blog to publish community-written obituaries and similar content (i.e. encourage the generation of quality secondary sources).
  5. Engage with your local artists and politicians by taking pictures at Dowse events, uploading them to commons and adding them to the subjects’ wikipedia articles—have attending a Dowse exhibition opening being the easiest way for locals to get a new wikipedia image.
I've not listed the 'digitise the collections' option, since at the end of the day, the value of this (to wikipedia) declines over time (because there are more and more alternative sources) and the price of putting them online declines. I'd much rather people tried new innovative things when they had the agility and leadership that lets them do it, because that's how the community as a whole moves forward.

Thursday, 15 October 2015

Feedback on NLNZ ‘DigitalNZ Concepts API‘



This blog post is feedback on a recent blog post ‘Introducing the DigitalNZ Concepts API’ http://digitalnz.org/blog/posts/introducing-the-digitalnz-concepts-api by the National Library of New Zealand’s DigitalNZ team. Some of the feedback also rests on conversations I've had with various NLNZ staffers and other interested parties and a great stack of my own prejudices. I've not actually generated an API key and run the thing, since I'm currently on parental leave.
  1. Parts of the Concepts API look very much like authority control, but authority control is not mentioned in the blog post or the docs that I can find. It may be that there are good reasons for this (such as parallel comms in the pipeline for the authority control community) but there are also potentially very worrying reasons. Clarity is needed here when the system goes live.
  2. All the URLs in examples are HTTP, but the ALA’s Freedom to Read Statement requires all practical measures be taken to ensure the confidentiality of the reader’s searching and reading. Thus, if the API is to be used for real-time searching, HTTPS URLs must be an option. 
  3. There is insufficient detail of of the identifiers in use. If I'm building a system to interoperate with the Concepts API, which identifiers should I be keeping at my end to identify things that the DigitalNZ end? The clearer this definition is, the more robust this interoperability is likely to be, there’s a very good reason for the highly structured formats of identifiers such as ISNI and ISBN. If nothing else a regexp would be very useful. Personally I’d recommend browsing around http://id.loc.gov/ a little and rethinking the URL structure too.
  4. There needs to be an insanely clear statement on the exact relationship between DigitalNZ Concepts and those authority control systems mapped into VIAF. Both DigitalNZ Concepts and VIAF are semi-automated authority matching systems and if we’re not carefully they’ll end up polluting each other (as for example, DNB already has with gender data). 
  5. Deep interoperability is going to require large-scale matching of DigitalNZ Concepts with things in a wide variety of GLAM collections and incorporating identifiers into those collections’ metadata. That doesn't appear possible with the current licensing arrangements. Maybe a flat-file dump (csv or json) of all the Concepts under a CC0 license? URLs to rights-obsessed partners could be excluded.
  6. If non-techies are to understand Concepts, http://api.digitalnz.org/concepts/448 is going to have to provide human-comprehensible content without an API key (I’m guessing that this is going to happen when it comes out of beta?)
  7. Mistakes happen (see https://en.wikipedia.org/wiki/Wikipedia:VIAF/errors for recently found errors in VIAF, for example). There needs to be a clear contact point and likely timescale for getting errors fixed. 
Having said all that, it looks great!

Monday, 14 July 2014

BIBFRAME

Adrian Pohl ‏wrote some excellent thoughts about the current state of BIBFRAME at http://www.uebertext.org/2014/07/name-authority-files-linked-data.html The following started as a direct response but, after limiting myself to where I felt I knew what I was talking about and felt I was being constructive, turned out to be much much narrower in scope.

My primary concern in relation to BIBFRAME is interlinking and in particular authority control. My concern is that a number of the players (BIBFRAME, ISNI, GND, ORCID, Wikipedia, etc) define key concepts differently and that without careful consideration and planning we will end up muddying our data with bad mappings. The key concepts in question are those for persons, names, identities, sex and gender (there may be others that I’m not aware of).

Let me give you an example.

In the 19th Century there was a mass creation of male pseudonyms to allow women to publish novels. A very few of these rose to such prominence that the authors outed themselves as women (think Currer Bell), but the overwhelming majority didn’t. In the late 20th and early 21st Centuries, entries for the books published were created in computerised catalogue systems and some entries found their way into the GND. My understanding is that the GND assigned gender to entries based entirely on the name of the pseudonym (I’ll admit I don’t have a good source for that statement, it may be largely parable). When a new public-edited encyclopedia based on reliable sources called Wikipedia arose, the GND was very successfully cross-linked with Wikipedia, with hundreds of thousands of articles were linked to the catalogues of their works. Information that was in the GND was sucked into a portion of Wikipedia called Wikidata. A problem now arose: there were no reliable sources for the sex information in GND that had been sucked Wikidata by GND, the main part of Wikipedia (which requires strict sources) blocked itself from showing Wikidata sex information. A secondary problem was that the GND sex data was in ISO 5218 format (male/female/unknown/not applicable) whereas Wikipedia talks not about sex but gender and is more than happy for that to include fa'afafine and similar concepts. Fortunately, Wikidata keeps track of where assertions come from, so the sex info can, in theory, be removed; but while people in Wikipedia care passionately about this, no one on the Wikidata side of the fence seems to understand what the problem is. Stalemate.

There were two separate issues here: a mismatch between the Person in Wikipedia and the Pseudonym (I think) in GND; and a mismatch between a cataloguer-assigned ISO 5218 value and a free-form self-identified value. 

The deeper the interactions between our respective authority control systems become, the more these issues are going to come up, but we need them to come up at the planning and strategy stages of our work, rather than halfway through (or worse, once we think we’ve finished).

My proposed solution to this is examples: pick a small number of ‘hard cases’ and map them between as many pairs of these systems as possible.

The hard cases should include at least: Charlotte Brontë (or similar); a contemporary author who has transitioned between genders and published broadly similar work under both identities; a contemporary author who publishes in different genre using different identities; ...

The cases should be accompanied by instructions for dealing with existing mistakes found (and errors will be found, see https://en.wikipedia.org/wiki/Wikipedia:VIAF/errors for some of the errors recently found during he Wikipedia/VIAF matching).

If such an effort gets off the ground, I'll put my hand up to do the Wikipedia component (as distinct from the Wikidata component).


Wednesday, 19 June 2013

A wikipedia strategy for the Royal Society of New Zealand

Over the last 48 hours I’ve had a very unsatisfactory conversation with the individual(s) behind the @royalsocietynz twitter account regarding wikipedia. Rather than talk about what went wrong, I’d like to suggest a simple strategy that builds the Society’s causes in the long term.
First up, our resources: we have three wikipedia pages strongly related the Society, Royal Society of New Zealand, Rutherford Medal (Royal Society of New Zealand) and Hector Memorial Medal; we have a twitter account that appears to be widely followed; we have some employee of RSNZ with no apparent wikipedia skills wanting to use wikipedia to advance the public-facing causes of the Society, which are:
“to foster in the New Zealand community a culture that supports science, technology, and the humanities, including (without limitation)—the promotion of public awareness, knowledge, and understanding of science, technology, and the humanities; and the advancement of science and technology education: to encourage, promote, and recognise excellence in science, technology, and the humanities”
The first thing to notice is that promoting the Society is not a cause of the Society, so no effort should be expending polishing the Royal Society of New Zealand article (which would also breach wikipedia’s conflict of interest guidelines). The second thing to notice is that the two medal pages contain long lists of recipients, people whose contributions to science and the humanities in New Zealand are widely recognised by the Society itself.
This, to me, suggests a strategy: leverage @royalsocietynz’s followers to improve the coverage of New Zealand science and humanities on wikipedia:
  1. Once a week for a month or two, @royalsocietynz tweets about a medal recipient with a link to their wikipedia biography. In the initial phase recipients are picked with reasonably comprehensive wikipedia pages (possibly taking steps to improve the gender and racial demographic of those covered to meet inclusion targets). By the end of this part followers of @royalsocietynz have been exposed to wikipedia biographies of New Zealand people.
  2. In the second part, @royalsocietynz still tweets links to the wikipedia pages of recipients, but picks ‘stubs’ (wikipedia pages with little or almost no actual content). Tweets could look like ‘Hector Medal recipient XXX’s biography is looking bare. Anyone have secondary sources on them?’ In this part followers of @royalsocietynz are exposed to wikipedia biographies and the fact that secondary sources are needed to improve them. Hopefully a proportion of @royalsocietynz’s followers have access to the secondary sources and enough crowdsourcing / generic computer confidence to jump in and improve the article.
  3. In the third part, @royalsocietynz picks recipients who don’t yet have a wikipedia biography at all. Rather than linking to wikipedia, @royalsocietynz links to an obituary or other biography (ideally two or three) to get us started.
  4. In the fourth part @royalsocietynz finds other New Zealand related lists and get the by-now highly trained editors to work through them in the same fashion.
This strategy has a number of pitfalls for the unwary, including:
  • Wikipedia biographies of living people (BLPs) are strictly policed (primarily due to libel laws); the solution is to try new and experimental things out on the biographies of people who are safely dead.
  • Copyright laws prevent cut and pasting content into wikipedia; the solution is to encourage people to rewrite material from a source into an encyclopedic style instead.
  • Recentism is a serious flaw in wikipedia (if the Society is 150 years old, each of those decades should be approximately equally represented; coverage of recent political machinations or triumphs should not outweigh entire decades); the solution is to identify sources for pre-digital events and promote their use.
  • Systematic bias is an on-going problem in wikipedia, just as it is elsewhere; a solution in this case might be to set goals for coverage of women, Māori and/or non-science academics; another solution might be for the Society to trawl it's records and archives lists of  minorities to publish digitally.

Conflict of interest statement: I’m a high-active editor on wikipedia and am a significant contributor to all many of the wikipedia articles linked to from this post.

Friday, 2 December 2011

Prep notes for NDF2011 demonstration

I didn't really have a presentation for my demonstration at the NDF, but the event team have asked for presentations, so here are the notes for my practice demonstration that I did within the library. The notes served as an advert to attract punters to the demo; as a conversation starter in the actual demo and as a set of bookmarks of the URLs I wanted to open.




Depending on what people are interested in, I'll be doing three things

*) Demonstrating basic editing, perhaps by creating a page from the requested articles at http://en.wikipedia.org/wiki/Wikipedia:WikiProject_New_Zealand/Requested_articles

*) Discussing some of the quality control processes I've been involved with (http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion and http://en.wikipedia.org/wiki/New_pages_patrol)

*) Discussing how wikipedia handles authority control issues using redirects (https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Redirect ) and disambiguation (https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Disambiguation )

I'm also open to suggestions of other things to talk about.

Thursday, 1 December 2011

Metadata vocabularies LODLAM NZ cares about

At today's LODLAM NZ, in Wellington, I co-hosted a vocabulary schema / interoperability session. I kicked off the session with a list of the metadata schema we care about and counts of how many people in the room cared about it. Here are the results:

8 Library of Congress / NACO Name Authority List
7 Māori Subject Headings
6 Library of Congress Subject Headings
5 SONZ
5 Linnean
4 Getty Thesauri
3 Marsden Research Subject Codes / ANZRSC Codes
3 SCOT
3 Iwi Hapū List
2 Australian Pictorial Thesaurus
1 Powerhouse Object Names Thesaurus
0 MESH

This straw poll naturally only reflects on the participants who attended this particular session and counting was somewhat haphazard (people were still coming into the room), but is gives a sample of the scope.

I don't recall whether the heading was "Metadata we care about" or "Vocabularies we care about," but it was something very close to that.

Wednesday, 30 November 2011

Unexpected advice

During the NDF2011 today I was in "Digital initiatives in Māori communities" put on the the talented Honiana Love and Claire Hall from the Te Reo o Taranaki Charitable Trust about their work on He Kete Kōrero. At the end I asked a question "Most of us [the audience] are in institutions with te Reo Māori holdings or cultural objects of some description. What small thing can we do to help enable our collections for the iwi and hapū source communities? Use Māori Subject Headings? The Iwi / Hapū list? Geotagging? ..." Quick-as-a-blink the response was "Geotagging." If I understood the answer (given mainly by Honiana) correctly, the point was that geotagging is much more useful because it's much more likely to be done right in contexts like this. Presumably because geotagging lends itself to checking, validation and visualisations that make errors easy to spot in ways that these other metadata forms don't; it's better understood by those processing the documents and processing the data.

I think it's fabulous that we're getting feedback from indigenous groups using information systems in indigenous contexts, particularly feedback about previous attempts to cater to their needs. If this is the experience of other indigenous groups, it's really important.

Saturday, 26 November 2011

Goodbye 'social-media' world

You may or may not have noticed, but recently a number of 'social media' services have begun looking and working very similarly. Facebook is the poster-child, followed by google+ and twitter. Their modus operandi is to entice you to interact with family-members, friends and acquaintances and then leverage your interactions to both sell your attention advertisers and entice other members of you social circle to join the service.

There are, naturally, a number of shiny baubles you get for participating it the sale of your eyeballs to the highest bidder, but recently I have come to the conclusion that my eyeballs (and those of my friends, loved ones and colleagues) are worth more.

I'll be signing off google plus, twitter and facebook shortly. I my return for particular events, particularly those with a critical mass the size of Jupiter, but I shall not be using them regularly. I remain serenely confident that all babies born in my extended circle are cute, I do not need to see their pictures.

I will continue using other social media as before (email, wikipedia, irc, skype, etc) as usual. My deepest apologies to those who joined at least party on my account.

Sunday, 6 November 2011

Recreational authority control

Over the last week or two I've been having a bit of a play with Ngā Ūpoko Tukutuku / The Māori Subject Headings (for the uninitiated, think of the widely used Library of Congress Subject Headings, done Post-Colonial and bi-lingually but in the same technology) the main thing I've been doing is trying to munge the MSH into Wikipedia (Wikipedia being my addiction du jour).

My thinking has been to increase the use of MSH by taking it, as it were, to where the people are. I've been working with the English language Wikipedia, since the Māori language Wikipedia has fewer pages and sees much less use.

My first step was to download the MSH in MARC XML format (available from the website) and use XSL to transform it into a wikipedia table (warning: large page). When looking at that table, each row is a subject heading, with the first column being the the te reo Māori term, the second being permutations of the related terms and the third being the scope notes. I started a discussion about my thoughts (warning: large page) and got a clear green light to create redirects (or 'related terms' in librarian speak) for MSH terms which are culturally-specific to Māori culture.

I'm about 50% of the way through the 1300 terms of the MSH and have 115 redirects in the newly created Category:Redirects from Māori language terms. That may sound pretty average, until you remember that institutions are increasingly rolling out tools such as Summon, which use wikipedia redirects for auto-completion, taking these mappings to the heart of most Māori speakers in higher and further education.

I don't have a time-frame for the redirects to appear, but they haven't appeared in Otago's Summon, whereas redirects I created ~ two years ago have; type 'jack yeates' and pause to see it at work.

Tuesday, 16 August 2011

Thoughts on "Letter about the TEI" from Martin Mueller

Thoughts on "Letter about the TEI" from Martin Mueller

Note: I am a member of the TEI council, but this message is should be read as personal position at the time of writing, not a council position, nor the position of my employer.

Reading Martin's missive was painful. I should have responded earlier, I think perhaps I was hoping someone else could say what I wanted to say and I could just say "me too." They haven't so I've become the someone else.

I don't think that Martin's "fairly radical model" is nearly radical enough. I'd like to propose a significantly more radical model as strawman:


1) The TEI shall maintain a document called the 'The TEI Principals.' The purpose of The TEI is to advance The TEI Principals.

2) Institutional membership of The TEI is open to groups which publish, collect and/or curate documents in formats released by The TEI. Institutional membership requires members acknowledge The TEI Principals and permits the members to be listed at http://www.tei-c.org/Activities/Projects/ and use The TEI logos and branding.

3) Individual membership of The TEI is open to individuals; individual membership requires members acknowledge The TEI Principals and subscribe to The TEI mailing list at http://listserv.brown.edu/?A0=TEI-L.

4) All business of The TEI is conducted in public. Business which needs be conducted in private (for example employment matters, contract negotiation, etc) shall be considered out of scope for The TEI.

5) Changes to the structure of The TEI will be discussed on the TEI mailing list and put to a democratic vote with a voting period of at least one month, a two-thirds majority of votes cast is required to pass a motion, which shall be in English.

6) Groups of members may form for activities from time-to-time, such as members meetings, summer schools, promotions of The TEI or collective digitisation efforts, but these groups are not The TEI, even if the word 'TEI' appears as part of their name.




I'll admit that there are a couple of issues not covered here (such as who holds the IPR), but it's only a straw man for discussion. Feel free to fire it as necessary.



Thursday, 23 June 2011

unit testing framework for XSL transformations?

I'm part of the TEI community, which maintains an XML standard which is commonly transformed to HTML for presentation (more rarely PDF). The TEI standard is relatively large but relatively well documented, the transformation to HTML has thus far been largely piecemeal (from a software engineering point of view) and not error free.

Recently we've come under pressure to introduce significantly more complexity into transformations, both to produce ePub (which is wrapped HTML bundled with media and metadata files) and HTML5 (which can represent more of the formal semantics in TEI). The software engineer in me sees unit testing the a way to reduce our errors while opening development up to a larger more diverse group of people with a larger more diverse set of features they want to see implemented.

The problem is, that I can't seem to find a decent unit testing framework for XSLT. Does anyone know of one?

Our requirements are: XSLT 2.0; free to use; runnable on our ubuntu build server; testing the transformation with multiple arguments; etc;

We're already using: XSD, RNG, DTD and schematron schemas, epubcheck, xmllint, standard HTML validators, etc. Having the framework drive these too would be useful.

The kinds of things we want to test include:
  1. Footnotes appear once and only once
  2. Footnotes are referenced in the text and there's a back link from the footnote to the appropriate point in the text
  3. Internal references (tables of contents, indexes, etc) point somewhere
  4. Language encoding used xml:lang survives from the TEI to the HTML
  5. That all the paragraphs in the TEI appear at least once in the HTML
  6. That local links work
  7. Sanity check tables
  8. Internal links within parallel texts
  9. ....
Any of many languages could be used to represent these tests, but ideally it should have a DOM library and be able to run that library across entire directories of files. Most of our community speak XML fluently, so leveraging that would be good.

Wednesday, 23 March 2011

Is there a place for readers' collectives in the bright new world of eBooks?

The transition costs of migrating from the world of books-as-physical-artefacts-of-pulped-tree to the world of books-as-bitstreams are going to be non-trivial.

Current attempts to drive the change (and by implication apportion those costs to other parties) have largely been driven by publishers, distributors and resellers of physical books in combination with the e-commerce and electronics industries which make and market the physical eBook readers on which eBooks are largely read. The e-commerce and electronics industries appear to see traditional publishing as an industry full of lumbering giants unable to compete with the rapid pace of change in the electronics industry and the associated turbulence in business models, and have moved to poach market-share. By-and-large they've been very successful. Amazon and Apple have shipped millions of devices billed as 'eBook readers' and pretty much all best-selling books are available on one platform or another.

This top tier, however, is the easy stuff. It's not surprising that money can be made from the latest bodice-ripping page-turner, but most of the interesting reading and the majority of the units sold are outside the best-seller list, on the so-called 'long tail.'

There's a whole range of books that I'm interested in that don't appear to be on the business plan of any of the current eBook publishers, and I'll miss them if they're not converted:

  1. The back catalogue of local poetry. Almost nothing ever gets reprinted, even if the original has a tiny print run and the author goes on to have a wonderfully successful career. Some gets anthologised and a few authors are big enough to have a posthumous collected works, when their work is no longer cutting edge.
  2. Some fabulous theses. I'm thinking of things like: http://ir.canterbury.ac.nz/handle/10092/1978, http://victoria.lconz.ac.nz/vwebv/holdingsInfo?bibId=69659 and http://otago.lconz.ac.nz/vwebv/holdingsInfo?bibId=241527
  3. Lots of te reo Māori material (pick your local indigenous language if you're reading this outside New Zealand)
  4. Local writing by local authors.

Note that all of these are local content---no foreign mega-corporation is going to regard this as their home-turf. Getting these documents from the old world to the new is going to require a local program run by (read funded by) locals.

Would you pay for these things? I would, if it gave me what I wanted.


What is it that readers want?

We're all readers, of one kind or another, and we all want a different range of things, but I believe that what readers want / expect out of the digital transition is:

  1. To genuinely own books. Not to own them until they drop their eReader in the bath and lose everything. Not to own them until a company they've never heard of goes bust and turns off a DRM server they've never heard of. Not to own them until technology moves on and some new format is in use. To own them in a manner which enables them to use them for at least their entire lifetime. To own them in a manner that poses at least a question for their heirs.
  2. A choice of quality books. Quality in the broadest sense of the word. Choice in the broadest sense of the word. Universality is a pipe-dream, of course, but with releasing good books faster than I can read them.
  3. A quality recommendation service. We all have trusted sources of information about books: friends, acquaintances, librarians or reviewers that history have suggested have similar ideas as us about what a good read is.
  4. To get some credit for already having bought the book in pulp-of-murdered-tree work. Lots of us have collections of wood-pulp and like to maintain the illusion that in some way that makes us well read.
  5. Books bought to their attention based on whether they're worth reading, rather than what publishers have excess stock of. Since the concept of 'stock' largely vanishes with the transition from print to digital this shouldn't be too much of a problem.
  6. Confidentially for their reading habits. If you've never come across it, go and read the ALA's The Freedom to Read Statement

A not-for-profit readers' collective

It seems to me that the way to manage the transition from the old world to the new is as a not-for-profit readers' collective. By that I mean a subscription-funded system in which readers sign up for a range of works every year. The works are digitised by the collective (the expensive step, paid for up-front), distributed to the subscribers in open file formats such as ePub (very cheap via the internet) and kept in escrow for them (a tiny but perpetual cost, more on this later).

Authors, of course, need to pay their mortgage, and part of the digitisation would be obtaining the rights to the work. Authors of new work would be paid a 'reasonable' sum, based on their statue as authors (I have no idea what the current remuneration of authors is like, so I won't be specific). The collective would acquire (non-exclusive) the rights to digitise the work if not born digital, to edit it, distribute it to collective members and to sell it to non-members internationally (i.e. distribute it through 'conventional' digital book channels). In the case of sale to non-members through conventional digital book channels the author would get a cut. Sane and mutually beneficial deals could be worked out with libraries of various sizes.

Generally speaking, I'd anticipate the rights to digitise and distribute in-copyright but out-of-print poetry would would be fairly cheap; the rights to fabulous old university theses cheaper; and rights to out-of-copyright materials are, of course, free. The cost of rights to new novels / poetry would hugely depend on statue of the author and the quality of the work, which is where the collective would need to either employ a professional editor to make these calls or vote based on sample chapters / poems or some combination of the two. Costs of quality digitisation is non-trivial, but costs are much lower in bulk and dropping all the time. Depending on the platform in use, members of the collective might be recruited as proof-readers for OCR errors.

That leaves the question of how to fund the the escrow. The escrow system stores copies of all the books the collective has digitised for the future use of the collectives' members and is required to give efficacy to the promise that readers really own the books. By being held in escrow, the copies survive the collective going bankrupt, being wound up, or evolving into something completely different, but requires funding. The simplest method of obtaining funding would be to align the collective with another established consumer of local literature and have them underwrite the escrow, a university, major library, or similar.

The difference between a not-for-profit readers' collective and an academic press?

Of hundreds of years, major universities have had academic presses which publish quality content under the universities' auspices. The key difference between the not-for-profit readers' collective I am proposing and an academic press is that the collective would attempt to publish the unpublished and out-of-print books that the members wanted rather than aiming to meet some quality criterion. I acknowledge a popularist bias here, but it's the members who are paying the subscriptions.

Which links in the book chain do we want to cut out?

There are some links in the current book production chain which we need to keep, there are others wouldn't have a serious future in a not-for-profit. Certainly there is a role for judgement in which works to purchase with the collective's money. There is a role for editing, both large-scale and copy-editing. There is a role for illustrating works, be it cover images or icons. I don't believe there is a future for roles directly relating to the production, distribution, accounting for, sale, warehousing or pulping of physical books. There may be a role for the marketing books, depending on the business model (I'd like to think that most of the current marketing expense can be replaced by combination of author-driven promotion and word-of-month promotion, but I've been known to dream). Clearly there is an evolving techie role too.

The role not mentioned above that I'd must like to see cut, of course, is that of the multinational corporation as gatekeeper, holding all the copyrights and clipping tickets (and wings).

Saturday, 20 November 2010

HOWTO: Deep linking into the NZETC site

As the heaving mass of activity that is the mixandmash competition heats up, I have come to realise that I should have better documented a feature of the NZETC site, the ability to extract the TEI xml annotated with the IDs for deep linking.

Our content's archival form is TEI xml, which we massage for various output formats. There is a link from the top level of every document to the TEI for the document, which people are welcome to use in their mashups and remixes. Unfortunately, between that TEI and our HTML output is a deep magic that involves moving footnotes, moving page breaks, breaking pages into nicely browsable chunks, floating marginal notes, etc., and this makes it hard to deep link back to the website from anything derived from that TEI.

There is another form of the TEI available which is annotated with whether or not each structural element maps 1:1 to an HTML: nzetc:has-text and what the ID of that page is: nzetc:id This annotated XML is found by replacing the 'tei-source' in the URL with 'etexts'

Thus for The Laws of England, Compiled and translated into the Māori language at http://www.nzetc.org/tm/scholarly/tei-GorLaws.html there is the raw TEI at http://www.nzetc.org/tei-source/GorLaws.xml and the annotated TEI at http://www.nzetc.org/etexts/GorLaws.xml

Looking in the annotated TEI at http://www.nzetc.org/etexts/GorLaws.xml we see for example:

<div xml:id="t1-g1-t1-front1-tp1" xml:lang="en" rend="center" type="titlePage" nzetc:id="tei-GorLaws-t1-g1-t1-front1-tp1" nzetc:depth="5" nzetc:string-length="200" nzetc:has-text="true">


This means that this div has it's own page (because it has nzetc:has-text="true" and that the ID of that page is tei-GorLaws-t1-g1-t1-front1-tp1 (because of the nzetc:id="tei-GorLaws-t1-g1-t1-front1-tp1"). The ID can be plugged into: http://www.nzetc.org/tm/scholarly/<ID>.html to get a URL for the HTML. Thus the URL for this div is http://www.nzetc.org/tm/scholarly/tei-GorLaws-t1-g1-t1-front1-tp1.html This process should work for both text and figures.

Happy remixing everyone!

Sunday, 8 November 2009

ePubs and quality

You may have heard news about the release of "bookserver" by the good folks at the Internet Archive. This is a DRM-free ePub ecosystem, initially stocked with the prodigious output of Google's book scanning project and the Internet Archive's own book scanning project.

To see how the NZETC stacked up against the much larger (and better funded) collection I picked one of our Maori Language dictionaries. Our Maori and Pacifica dictionaries month-after-month make up the bulk of our top five must used resources, so they're in-demand resources. They're also an appropriate choice because when they were encoded by the NZETC into TEI, the decision was made not to use full dictionary encoding, but a cheaper/easier tradeoff which didn't capture the linguistic semantics of the underlying entries, but treated them as typeset text. I was interested in how well this tradeoff was wearing.

I did my comparison using the new firefox ePub plugin, things will be slightly different if you're reading these ePubs on an iPhone or Kindle.

The ePub I looked at was A Dictionary of the Maori Language by Herbert W. Williams. The NZETC has the 1957 sixth edition. There are two versions of the work on bookserver. A 1852 second edition scanned by Google books (original at the New York Public library) and a 1871 third edition scanned by the Internet Archive in association with Microsoft (original in the University of California library system). All the processing of both works appear to be been done in the U.S. The original print used macrons (NZETC), acutes (Google) and breves (Internet Archive) to mark long vowels. Find them here.


Lets take a look at some entries from each, starting at 'kapukapu':


NZETC:

kapukapu. 1. n. Sole of the foot.

2. Apparently a synonym for kaunoti, the firestick which was kept steady with the foot. Tena ka riro, i runga i nga hanga a Taikomako, i te kapukapu, i te kaunoti (M. 351).

3. v.i. Curl (as a wave). Ka kapukapu mai te ngaru.

4. Gush.

5. Gleam, glisten. Katahi ki te huka o Huiarau, kapukapu ana tera.

Kapua, n. 1. Cloud, bank of clouds. E tutakitaki ana nga kapua o te rangi, kei runga te Mangoroa e kopae pu ana (P.).

2. A flinty stone. = kapuarangi.

3. Polyprion oxygeneios, a fish. = hapuku.

4. An edible species of fungus.

5. Part of the titi pattern of tattooing.

Kapuarangi, n. A variety of matā, or cutting stone, of inferior quality. = kapua, 2.

Kāpuhi, kāpuhipuhi, n. Cluster of branches at the top of a tree.

Kāpui, v.t. 1. Gather up in a bunch. Ka kapuitia nga rau o te kiekie, ka herea.

2. Lace up or draw in the mouth of a bag.

3. Earth up crops, or cover up embers with ashes to keep them alight.

kāpuipui, v.t. Gather up litter, etc.

Kāpuka, n. Griselinia littoralis, a tree. = papauma.

Kapukiore, n. Coprosma australis, a shrub. = kanono.

Kāpuku = kōpuku, n. Gunwale.



Google Books:

Kapukapu, s. Sole of the foot,

Eldpukdpu, v. To curl* as a

wave.

Ka kapukapu mai te ngaru; The wave curls over.

Kapunga, v. To take up with both hands held together,

Kapungatia he kai i te omu; Take up food from the oven.

(B. C,

Kapura, s. Fire, -' Tahuna he kapura ; Kindle a fire.

Kapurangi, s. Rubbish; weeds,

Kara, s. An old man,

Tena korua ko kara ? How are you and the old man ?

Kara, s> Basaltic stone.

He kara te kamaka nei; This stone is kara.

Karaha, s. A calabash. ♦Kardhi, *. Glass,



Internet Archive:

kapukapu, n. sole of the foot.

kapukapu, v. i. 1. curl (as a wave). Ka kapukapu mai te ngaru. 2. gush.

kakapii, small basket for cooked food.

Kapua, n. cloud; hank of clouds,

Kapunga, n. palm of the hand.

kapunga, \. t. take up in both hands together.

Kapiira, n. fire.

Kapiiranga, n. handful.

kapuranga, v. t. take up by hand-fuls. Kapurangatia nga otaota na e ia. v. i. dawn. Ka kapuranga te ata.

Kapur&ngi, n. rubbish; uveds.

I. K&r&, n. old man. Tena korua ko kara.

II. K&r&, n. secret plan; conspiracy. Kei te whakatakoto kara mo Te Horo kia patua.

k&k&r&, D. scent; smell.

k&k&r&, a. savoury; odoriferous.

k^ar&, n. a shell-iish.


Unlike the other two, the NZETC version has accents, bold and italics in the right place. It' the only one with a workable and useful table of contents. It is also edition which has been extensively revised and expanded. Google's second edition has many character errors, while the Internet Archive's third edition has many 'á' mis-recognised as '&.' The Google and Internet Achive versions are also available as PDFs, but of course, without fancy tables of contents these PDFs are pretty challenging to navigate and because they're built from page images, they're huge.

It's tempting to say that the NZETC version is better than either of the others, and from a naïve point of it is, but it's more accurate to say that it's different. It's a digitised version of a book revised more than a hundred years after the 1852 second edition scanned by Google books. People who're interested in the history of the language are likely to pick the 1852 edition over the 1957 edition nine times out of ten.

Technical work is currently underway to enable third parties like the Internet Archive's bookserver to more easily redistribute our ePubs. For some semi-arcane reasons it's linked to upcoming new search functionality.

What LibraryThing metadata can the NZETC reasonable stuff inside it's CC'd epubs?

This is the second blog following on from an excellent talk about librarything by LibraryThing's Tim given the VUW in Wellington after his trip to LIANZA.

The NZETC publishes all of it's works as epubs (a file format primarily aimed at mobile devices), which are literally processed crawls of it's website bundled with some metadata. For some of the NZETC works (such as Erewhon and The Life of Captain James Cook), LibraryThing has a lot more metadata than the NZETC, becuase many LibraryThing users have the works and have entered metadata for them. Bundling as much metadata into the epubs makes sense, because these are commonly designed for offline use---call-back hooks are unlikely to be avaliable.

So what kinds of data am I interested in?
1) Traditional bibliographic metadata. Both LT and NZETC have this down really well.
2) Images. LT has many many cover images, NZETC has images of plates from inside many works too.
3) Unique identification (ISBNs, ISSNs, work ids, etc). LT does very well at this, NZETC very poorly
4) Genre and style information. LT has tags to do fancy statistical analysis on, and does. NZETC has full text to do fancy statistical analysis on, but doesn't.
5) Intra-document links. LT has work as the smallest unit. NZETC reproduces original document tables of contents and indexes, cross references and annotations.
6) Inter-document links. LT has none. NZETC captures both 'mentions' and 'cites' relationships between documents.

While most current-generation ebook readers, of course, can do nothing with most of this metadata, but I'm looking forward to the day when we have full-fledged OpenURL resolvers which can do interesting things, primarily picking the best copy (most local / highest quality / most appropiate format / cheapest) of a work to display to a user; and browsing works by genre (LibraryThing does genre very well, via tags).

Thursday, 15 October 2009

Interlinking of collections: the quest continues

After an excellent talk today about LibraryThing by LibraryThing's Tim, I got enthused to see how LibraryThing stacks up against other libraries for having matches in it's authority control system for entities we (the NZETC) care about.
The answer is averagely.
For copies of printed books less than a hundred years old (or reprinted in the last hundred years), and their authors, LibraryThing seems to do every well. These are the books likely to be in active circulation in personal libraries, so it stands to reason that these would be well covered.
I tried half a dozen books from our Nineteenth-Century Novels Collection, and most were missing, Erewhon, of course, was well represented. LibraryThing doesn't have the "Treaty of Waitangi" (a set of manuscripts) but it does have "Facsimiles of the Treaty of Waitangi." It's not clear to me whether these would be merged under their cataloguing rules.
Coverage of non-core bibliographic entities was lacking. Places get a little odd. Sydney is "http://www.librarything.com/place/Sydney,%20New%20South%20Wales,%20Australia" but Wellington is "http://www.librarything.com/place/Wellington" and Anzac Cove appears to be is missing altogether. This doesn't seem like a sane authority control system for places, as far as I can see. People who are the subjects rather than the authors of books didn't come out so well. I couldn't find Abel Janszoon Tasman, Pōtatau Te Wherowhero or Charles Frederick Goldie, all of which are near and dear to our hearts.

Here is the spreadsheet of how different web-enabled systems map entities we care about.

Correction: It seems that the correct URL for Wellington is http://www.librarything.com/place/Wellington,%20New%20Zealand which brings sanity back.

Saturday, 19 September 2009

eBook readers need OpenURL resolvers

Everyone's talking about the next generation of eBook readers having larger reading area, more battery life and more readable screen. I'd give up all of those, however, for an eBook reader that had an internal OpenURL resolver.

OpenURL is the nifty protocol that libraries use to find the closest copy of a electronic resources and direct patrons to copies that the library might have already licensed from commercial parties. It's all about finding the version of a resource that is most accessible to the user, dynamically.

Say I've loaded 500 eBooks into my eBook reader: a couple of encyclopedias and dictionaries; a stack of books I was meant to read in school but only skimmed and have been meaning to get back to; current block-busters; guidebooks to the half-dozen countries I'm planning on visiting over the next couple of years; classics I've always meant to read (Tolstoy, Chaucer, Cervantes, Plato, Descartes, Nietzsche); and local writers (Baxter, Duff, Ihimaera, Hulme, ...). My eBooks by Nietzsche are going to refer to books by Descartes and Plato; my eBooks by Descartes are going to refer to books by Plato; my encyclopaedias are going to refer to pretty much everything; most of the works in translation are going to contain terms which I'm going to need help with (help which theencyclopedias and dictionaries can provide).

Ask yourself, though, whether you'd want to flick between works on the current generation of readers---very painful, since these devices are not designed for efficient navigation between eBooks, but linear reading of them. You can't follow links between them, of course, because on current systems links must point either with the same eBook or out on to the internet---pointing to other eBooks on the same device is verboten. OpenURL can solve this by catching those URLs and making them point to local copies of works (and thus available for free even when the internet is unavailable) where possible while still retaining their

Until eBook readers have a mechanism like this eBooks will be at most a replacement only for paperback novels---not personal libraries.

Tuesday, 15 September 2009

Thoughts on koha


The Koha community is currently undergoing a spasm, with a company apparently forking the code.
As a result a bunch of people are looking at where the community should go from here and how it should be led. In particular the idea of a not-for-profit foundation has been floated and is to be discussed at a meeting early tomorrow morning .
My thoughts on this issue are pretty simple:
  • A not-for-profit is a fabulous idea
  • Reusing one of the existing software not-for-profit (Apache, Software in the Public Interest, etc) introduces a layer of non-library complexity. Libraries are have a long history with consortia, but tend to very much flock together with their own kind, I can see them being leary of a non-library entity.
  • A clear description of a forward-looking plan written in plain language that everyone can understand is vital to communicate the vision of the community, particularly to those currently on the fringes