Monday, 27 July 2009

Learning XSLT 2.0 Part 1; Finding Names

We mark up a lot of names, so one of the first things I decided to do was build an XSLT stylesheet that takes a list of names and tags those names wherever they occur in a separate XML file. To keep things simple and clear, I've ignored little things like namespaces, conformant TEI, etc.

First up, the list of names; these are multi-word names. Notice the simple structure: this could easily be built from a comma-separated list or similar:

<?xml version="1.0" encoding="UTF-8"?>
<!-- the root element name is arbitrary; the stylesheet only looks for //name -->
<names>
<name>Papaver argemone</name>
<name>Papaver dubium</name>
<name>Papaver Rhceas</name>
<name>Zanthoxylum novæ-zealandiæ</name>
</names>
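
As a rough sketch of the "built from a comma-separated list" idea, the following XSLT 2.0 stylesheet reads a plain-text file and writes out one name element per entry. The file name names.csv, the csv parameter, the names wrapper element and the main template name are all assumptions made for illustration; none of them come from the setup described here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Hypothetical input: names separated by commas and/or line breaks -->
  <xsl:param name="csv" select="'names.csv'"/>

  <!-- Named entry point, so no source XML document is needed -->
  <xsl:template name="main">
    <names>
      <!-- Split on commas and line breaks, swallowing surrounding whitespace -->
      <xsl:for-each select="tokenize(unparsed-text($csv, 'UTF-8'), '\s*[,\r\n]\s*')">
        <!-- Skip empty tokens, e.g. the one after a trailing newline -->
        <xsl:if test="normalize-space(.) != ''">
          <name><xsl:value-of select="normalize-space(.)"/></name>
        </xsl:if>
      </xsl:for-each>
    </names>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet has to be started with main as the initial template rather than with a source document; how that is done depends on the XSLT 2.0 processor in use.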

Next, some sample text:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
<p> There are several names Papaver argemone in this document Papaver argemone</p>
<p> Some of them are the same as others (Papaver Rhceas Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like Zanthoxylum novæ-zealandiæ AKA Zanthoxylum novae-zealandiae</p>
</doc>

Finally, the stylesheet. It consists of three parts: a regexp variable that builds a regexp from the names in the file; a default template for everything but text(); and a template for text() nodes that applies the regexp.

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >

<!-- build a regexp of the names: one alternation wrapped in a single group -->
<xsl:variable name="regexp">
<xsl:value-of select="concat('(',string-join(document('name-list.xml')//name/text(), '|'), ')')"/>
</xsl:variable>

<!-- generic copy-everything-except-texts template -->
<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
</xsl:copy>
</xsl:template>

<!-- wrap every occurrence of a listed name in a name element; copy all other text unchanged -->
<xsl:template match="text()">
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<name type="taxonomic" subtype="matched">
<xsl:value-of select="regex-group(1)"/>
</name>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
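
One caveat with building the regexp by plain concatenation is that any regex metacharacter inside a name (the full stop in an abbreviated genus, for instance) is interpreted as regex syntax rather than literal text. The sketch below shows one possible way to escape each name first. The my:escape function and its urn:example:names namespace are made up for this example, and the main template only prints the resulting regexp so it can be inspected.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:names"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: put a backslash in front of each regex metacharacter -->
  <xsl:function name="my:escape" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s, '([.\\?*+{}()\[\]|\^\$\-])', '\\$1')"/>
  </xsl:function>

  <!-- Same single-group alternation as in the stylesheet above, with each name escaped -->
  <xsl:variable name="regexp" as="xs:string"
      select="concat('(',
                string-join(for $n in document('name-list.xml')//name
                            return my:escape(string($n)), '|'),
              ')')"/>

  <!-- Named entry point that just prints the regexp -->
  <xsl:template name="main">
    <xsl:value-of select="$regexp"/>
  </xsl:template>

</xsl:stylesheet>
```

With this helper, a name such as P. rhceas would contribute P\. rhceas to the alternation, so the full stop only matches a literal full stop.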


The output looks like:

<?xml version="1.0" encoding="UTF-8"?><doc>
<p> There are several names <name type="taxonomic" subtype="matched">Papaver argemone</name> in this document <name type="taxonomic" subtype="matched">Papaver argemone</name></p>
<p> Some of them are the same as others (<name type="taxonomic" subtype="matched">Papaver Rhceas</name> Papaver rhceas P. rhceas)</p>
<p> Non ASCII characters shouldn't cause a problem in names like <name type="taxonomic" subtype="matched">Zanthoxylum novæ-zealandiæ</name> AKA Zanthoxylum novae-zealandiae</p>
</doc>

As you may notice, I've not yet worked out the best way to handle the 'æ': the 'novae-zealandiae' spelling is left untagged.
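
One possible approach, offered only as a sketch and not as a settled answer, is to widen the regexp itself: replace each 'æ' in a listed name with the alternation (æ|ae), and pass flags="i" to xsl:analyze-string so that case differences are ignored. The outer parentheses are still group 1, so regex-group(1) keeps returning the whole matched name. The file name name-list.xml is taken from the stylesheet above; everything else is an assumption about how one might want the matching to behave.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Build the alternation, letting every æ also match the two letters ae -->
  <xsl:variable name="regexp">
    <xsl:value-of select="concat('(',
        string-join(for $n in document('name-list.xml')//name
                    return replace($n, 'æ', '(æ|ae)'), '|'),
        ')')"/>
  </xsl:variable>

  <!-- Copy everything except text nodes unchanged -->
  <xsl:template match="@*|*|processing-instruction()|comment()">
    <xsl:copy>
      <xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Tag names in text; flags="i" makes the match case-insensitive -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regexp}" flags="i">
      <xsl:matching-substring>
        <name type="taxonomic" subtype="matched">
          <xsl:value-of select="regex-group(1)"/>
        </name>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Run against the sample text, this should also tag Zanthoxylum novae-zealandiae and the lower-case Papaver rhceas, while the abbreviated P. rhceas would still go untagged. Whether case-insensitive matching is actually desirable for taxonomic names is a separate question.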