Tuesday, 2 September 2008

Does anyone publish the Dataset of New Zealand Geographic Place Names already in XML form?

I've been playing with the Dataset of New Zealand Geographic Place Names which is a set of CSV files published by Toitū te whenua / Land Information New Zealand (LINZ). The data takes quite a bit of massaging, and I was wondering whether anyone else had already done the work of making acceptable XML out of the data rather than doing all the work myself.

I've attached the script I have so far, but it's not perfect. In particular:

  1. It doesn't include place names with Macrons
  2. It makes lots of ASCII-type assumptions
  3. Many of the element names are poorly named and map non-obviously to fields in the CSV files.
  4. The script isn't very generic and does little or no checking

Anyway, here's he script, hopefully it's successfully escaped. The basics are that it creates an sqlite database and streams the CSV files into it direct from the zip (which it expects to have been downloaded into the current directory). It then streams each point out using awk to transform it to XML.

# script to import data from
# http://www.linz.govt.nz/placenames/search/place-names-dataset-download/index.aspx
# into an XML file.
# this script licensed under the GPL/BSD/Apache 2 licences

echo \(re\)creating the database, expect DROP errors the first time you run this
sqlite nzgeonames.db << EOF
CREATE TABLE name (id, name, east, north, pdescription, district, sheet, lat, long);

DROP TABLE district;
CREATE TABLE district (district, description);

DROP TABLE pdescription;
CREATE TABLE pdescription (pdescription, short, description);

CREATE TABLE sheet (edition, map, sheet);


echo importing the names
unzip -p nznames_6Aug08.zip namedata.txt | sed 's/\r//' | sed 's/`/","/g' | awk -F^ '{print "INSERT INTO name VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db

echo importing the districts
unzip -p nznames_6Aug08.zip landdist.txt | sed 's/\r//' | sed 's/`/","/g' | awk -F^ '{print "INSERT INTO district VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db

echo importing the point descriptions \(expect two lines of errors\)
unzip -p nznames_6Aug08.zip pointdes.txt | sed 's/\r//' | sed 's/`/","/g' | awk -F^ '{print "INSERT INTO pdescription VALUES (\"" $0 "\");"}' | sed 's/:/","/' | sqlite nzgeonames.db

echo importing the sheet names
unzip -p nznames_6Aug08.zip sheetnam.txt | sed 's/\r//' | sed 's/`/","/g' | awk -F^ '{print "INSERT INTO sheet VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db

# pick up the ugly duckling
sqlite nzgeonames << EOF
INSERT INTO pdescription VALUES ("MRFM","MARINE ROCK FORMATION","Marine Rock Formation");

echo exporting points as xml
echo "<document source=\"Sourced from Land Information New Zealand, [date]. Crown copyright reserved.\">" > nzgeonames.xml
sqlite nzgeonames.db "SELECT name.id, name.name, name.east, name.north, name.pdescription, name.district, name.sheet, name.lat, name.long, district.description, pdescription.short, pdescription.description AS descriptionA, sheet.edition, sheet.map FROM name, district, pdescription, sheet WHERE name.district = district.district AND name.pdescription = pdescription.pdescription AND name.sheet = sheet.sheet;" | awk -F\| '{print "<point><id>" $1 "</id><name>" $2 "</name><east>" $3 "</east><north>" $4 "</north><pdescription>" $5 "</pdescription><district>" $6 "</district><sheet>" $7 "</sheet><lat>" $8 "</lat><long>" $9 "</long><description>" $10 "</description><short>" $11 "</short><descriptionA>" $12 "</descriptionA> <edition>" $13 "</edition> <map>" $14 "</map> </point>"}' | sed 's/&/&amp;/' >> nzgeonames.xml
echo "</document>" >> nzgeonames.xml

echo formatting the points nicely
xmllint --format nzgeonames.xml > nzgeonames-formatted.xml

No comments: