Open Source Exile: September 2008

I've been playing with the Dataset of New Zealand Geographic Place Names which is a set of CSV files published by Toitū te whenua / Land Information New Zealand (LINZ). The data takes quite a bit of massaging, and I was wondering whether anyone else had already done the work of making acceptable XML out of the data rather than doing all the work myself.

I've attached the script I have so far, but it's not perfect. In particular:

It doesn't include place names with Macrons
It makes lots of ASCII-type assumptions
Many of the element names are poorly named and map non-obviously to fields in the CSV files.
The script isn't very generic and does little or no checking

Anyway, here's he script, hopefully it's successfully escaped. The basics are that it creates an sqlite database and streams the CSV files into it direct from the zip (which it expects to have been downloaded into the current directory). It then streams each point out using awk to transform it to XML.


#!/bin/bash
# script to import data from
# http://www.linz.govt.nz/placenames/search/place-names-dataset-download/index.aspx
# into an XML file.
# this script licensed under the GPL/BSD/Apache 2 licences

echo \(re\)creating the database, expect DROP errors the first time you run this 
sqlite nzgeonames.db << EOF
DROP TABLE name;
CREATE TABLE  name (id, name, east, north, pdescription, district, sheet, lat, long);

DROP TABLE district;
CREATE TABLE  district (district, description);


DROP TABLE pdescription;
CREATE TABLE  pdescription (pdescription, short, description);


DROP TABLE sheet;
CREATE TABLE  sheet (edition, map, sheet);

VACUUM;
EOF

echo importing the names
unzip -p nznames_6Aug08.zip namedata.txt  | sed 's/\r//'  |  sed 's/`/","/g' | awk -F^ '{print "INSERT INTO name VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db 

echo importing the districts
unzip -p nznames_6Aug08.zip landdist.txt  | sed 's/\r//'  |  sed 's/`/","/g' | awk -F^ '{print "INSERT INTO district VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db

echo importing the point descriptions \(expect two lines of errors\) 
unzip -p nznames_6Aug08.zip pointdes.txt  | sed 's/\r//'  |  sed 's/`/","/g' | awk -F^ '{print "INSERT INTO pdescription VALUES (\"" $0 "\");"}' | sed 's/:/","/' | sqlite nzgeonames.db

echo importing the sheet names
unzip -p nznames_6Aug08.zip sheetnam.txt  | sed 's/\r//'  |  sed 's/`/","/g' | awk -F^ '{print "INSERT INTO sheet VALUES (\"" $0 "\");"}' | sqlite nzgeonames.db 


# pick up the ugly duckling
sqlite nzgeonames << EOF
 INSERT INTO pdescription VALUES ("MRFM","MARINE ROCK FORMATION","Marine Rock Formation");
EOF

echo exporting points as xml
echo "<document source=\"Sourced from Land Information New Zealand, [date]. Crown copyright reserved.\">" > nzgeonames.xml
sqlite nzgeonames.db "SELECT name.id, name.name, name.east, name.north, name.pdescription, name.district, name.sheet, name.lat, name.long, district.description, pdescription.short, pdescription.description AS descriptionA, sheet.edition, sheet.map FROM name, district, pdescription, sheet WHERE name.district = district.district AND name.pdescription = pdescription.pdescription AND name.sheet = sheet.sheet;" | awk -F\| '{print "<point><id>" $1 "</id><name>" $2 "</name><east>" $3 "</east><north>" $4 "</north><pdescription>" $5 "</pdescription><district>" $6 "</district><sheet>" $7 "</sheet><lat>" $8 "</lat><long>" $9 "</long><description>" $10 "</description><short>" $11 "</short><descriptionA>" $12 "</descriptionA> <edition>" $13 "</edition> <map>" $14 "</map>  </point>"}' | sed 's/&/&amp;/' >> nzgeonames.xml 
echo "</document>" >> nzgeonames.xml

echo formatting the points nicely
xmllint --format nzgeonames.xml  > nzgeonames-formatted.xml

Open Source Exile

Tuesday, 2 September 2008

Does anyone publish the Dataset of New Zealand Geographic Place Names already in XML form?

Library of Congress flickr experiment

About Me

Blog Archive