Monday, 14 July 2014

BIBFRAME

Adrian Pohl wrote some excellent thoughts about the current state of BIBFRAME at http://www.uebertext.org/2014/07/name-authority-files-linked-data.html. The following started as a direct response but, after limiting myself to where I felt I knew what I was talking about and felt I was being constructive, it turned out to be much, much narrower in scope.

My primary concern in relation to BIBFRAME is interlinking and, in particular, authority control. A number of the players (BIBFRAME, ISNI, GND, ORCID, Wikipedia, etc.) define key concepts differently, and without careful consideration and planning we will end up muddying our data with bad mappings. The key concepts in question are those for persons, names, identities, sex and gender (there may be others that I’m not aware of).

Let me give you an example.

In the 19th century there was a mass creation of male pseudonyms to allow women to publish novels. A very few of these rose to such prominence that the authors outed themselves as women (think Currer Bell), but the overwhelming majority didn’t. In the late 20th and early 21st centuries, entries for the books published were created in computerised catalogue systems and some entries found their way into the GND. My understanding is that the GND assigned gender to entries based entirely on the name of the pseudonym (I’ll admit I don’t have a good source for that statement; it may be largely parable). When Wikipedia, a new public-edited encyclopedia based on reliable sources, arose, the GND was very successfully cross-linked with it, with hundreds of thousands of articles linked to the catalogues of their works. Information that was in the GND was sucked into a portion of Wikipedia called Wikidata. A problem now arose: there were no reliable sources for the sex information that had been sucked into Wikidata from the GND, so the main part of Wikipedia (which requires strict sourcing) blocked itself from showing Wikidata sex information. A secondary problem was that the GND sex data was in ISO 5218 format (male/female/unknown/not applicable), whereas Wikipedia talks not about sex but about gender and is more than happy for that to include fa'afafine and similar concepts. Fortunately, Wikidata keeps track of where assertions come from, so the sex info can, in theory, be removed; but while people in Wikipedia care passionately about this, no one on the Wikidata side of the fence seems to understand what the problem is. Stalemate.

There were two separate issues here: a mismatch between the Person in Wikipedia and the Pseudonym (I think) in GND; and a mismatch between a cataloguer-assigned ISO 5218 value and a free-form self-identified value. 
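As a rough sketch of those two mismatches, the following Python models a single sex/gender assertion together with what it is about and who assigned it. The class, field and function names are hypothetical and are not taken from GND, Wikidata or any real schema; only the four ISO 5218 codes come from the standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Iso5218(Enum):
    """The four codes defined by ISO 5218."""
    NOT_KNOWN = 0
    MALE = 1
    FEMALE = 2
    NOT_APPLICABLE = 9


@dataclass(frozen=True)
class Claim:
    """One sex/gender assertion (hypothetical model, for illustration only)."""
    value: str             # an ISO 5218 label, or free-form text
    scheme: str            # "iso5218" or "free-form"
    about: str             # "pseudonym" or "person"
    assigned_by: str       # "cataloguer" or "self"
    source: Optional[str]  # a citation, or None if unsourced


def problems_copying(claim: Claim, target_about: str,
                     target_scheme: str, target_needs_source: bool) -> List[str]:
    """List the reasons a claim cannot be copied unchanged into a target system."""
    problems = []
    if claim.about != target_about:
        problems.append(f"entity mismatch: {claim.about} -> {target_about}")
    if claim.scheme != target_scheme:
        problems.append(f"scheme mismatch: {claim.scheme} -> {target_scheme}")
    if target_needs_source and claim.source is None:
        problems.append("no reliable source")
    return problems


# A cataloguer-assigned value attached to a pseudonym, with no source.
currer_bell = Claim(value=Iso5218.MALE.name, scheme="iso5218",
                    about="pseudonym", assigned_by="cataloguer", source=None)

# Target: a record about a person, free-form gender, sources required.
issues = problems_copying(currer_bell, target_about="person",
                          target_scheme="free-form", target_needs_source=True)
for issue in issues:
    print(issue)
```

The point of the sketch is only that each mismatch is detectable if the entity type, the scheme and the provenance travel with the value; it says nothing about how any of these systems actually store the data.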

The deeper the interactions between our respective authority control systems become, the more these issues are going to come up, but we need them to come up at the planning and strategy stages of our work, rather than halfway through (or worse, once we think we’ve finished).

My proposed solution to this is examples: pick a small number of ‘hard cases’ and map them between as many pairs of these systems as possible.

The hard cases should include at least: Charlotte Brontë (or similar); a contemporary author who has transitioned between genders and published broadly similar work under both identities; a contemporary author who publishes in different genres using different identities; ...
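To give a feel for the size of the exercise, here is a minimal sketch that enumerates one worksheet entry per hard case and per pair of systems. The system list is the one named earlier in this post; the case labels and variable names are placeholders of my own, and the case list is deliberately left open-ended.

```python
from itertools import combinations

# Systems named earlier in the post; others could be added.
SYSTEMS = ["BIBFRAME", "ISNI", "GND", "ORCID", "Wikipedia"]

# Placeholder labels for the hard cases listed above (the list is not complete).
HARD_CASES = [
    "pseudonymous 19th-century author",
    "author who transitioned, similar work under both identities",
    "author using different identities for different genres",
]

# One worksheet entry per (case, system A, system B), each pair counted once.
worksheet = [
    (case, a, b)
    for case in HARD_CASES
    for a, b in combinations(SYSTEMS, 2)
]

for case, a, b in worksheet[:3]:
    print(f"{a} <-> {b}: {case}")
print(len(worksheet), "mappings to write up")
```

With five systems there are ten pairs, so three cases already mean thirty mappings, which is an argument for keeping the number of hard cases small.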

The cases should be accompanied by instructions for dealing with existing mistakes found (and errors will be found; see https://en.wikipedia.org/wiki/Wikipedia:VIAF/errors for some of the errors recently found during the Wikipedia/VIAF matching).

If such an effort gets off the ground, I'll put my hand up to do the Wikipedia component (as distinct from the Wikidata component).