Archive for August, 2010

Refactoring an Apertium dictionary

August 21st, 2010

One of the great things about the Apertium machine translation project is that Fran Tyers and others connected with it have assembled sizeable collections of free (GPL) lexical data. So that was the first place to look when I wanted a Spanish dictionary to use with the Bangor Autoglosser. However, the dictionaries are in XML format, which is notoriously slow for this sort of task (in Apertium, the dictionaries are compiled before use), and clumsy to process in PHP. I therefore ended up refactoring the dictionary into a csv file (downloadable here), which I think is a more useable option for our autoglossing needs (it can be read in a spreadsheet or imported into a database).

To do this, we need to generate a text file containing the contents of the Apertium dictionary. For Ubuntu, the easiest way to go is to install apertium and apertium-en-es. We can test it by opening a terminal and typing:
echo "dog" | apertium en-es

echo "perro" | apertium es-en

We get “Perro” and “Dog” back, respectively (the capitalisation is due to Apertium’s somewhat problematical algorithm for this). To extract the dictionaries, we need to download the raw files for the en-es package, untar them, and then use an Apertium utility, lt-expand:

lt-expand > apertium_es.txt
lt-expand apertium-en-es.en.dix > apertium_en.txt

for the monolingual dictionaries, and:
lt-expand> apertium_enes.txt

for the bilingual one. The Spanish dictionary (which is a file of around 300Mb) is our main focus, and for our purposes we want to remove lines containing :>: or :<:, which will be duplicates, and those where the entries contain spaces (eg a fin de que). We then tag the lines to show the relevant field boundaries, and import them into a database. Once all the dictionaries are safely tucked up there, we can use SQL queries to insert the English lexemes (lemmas) into the Spanish entries.

The result is a table with around 690,000 entries. Around 95% of these are verbforms, and about 87% of those are verbforms with enclitic pronouns (eg háblenosles). Although the execution speed for database lookups gained by rationalising these a bit is probably negligible, decreasing the size of the file makes it easier to distribute.

The first thing I did was to convert the Apertium tags to make them slightly more mnemonic, and segment the categories into their own fields – there are nearly 1900 different tags in the original file, many of them with only a few entries. The number of determiners especially seemed excessive, and for adjusting these I used a very useful tool – SQL Workbench/J, which is the only GUI tool I’ve come across so far that lets you edit the resultsets of PostgreSQL queries. The refactored dictionary has 173 separate combinations of POS tags.

The second thing was to segment the roughly 560,000 clitic verbforms, leaving only around 15,000 base verbforms. This is on the understanding that we can deal with the unsegmented forms via dynamic analysis and tagging – the download of the refactored dictionary contains a file with sample PHP functions that will do this. These standalone verbforms then have to be added back to the dictionary, because they usually entail orthographical variations in terms of accents. For example, the imperative 3 singular of decir is diga when it is standalone, but díga when a clitic pronoun is attached, as in dígame.

The last thing was to remove all the names, because the autoglosser will assume that something is a name of some sort if it starts with a capital.

The end result is a dictionary file with around 130,000 entries. This is probably not perfect (eg the clitic functions will segment háblenosles above as imperative 3 singular + 1 plural + 3 singular, and not admit the alternative of imperative 3 plural + 2 plural + 3 singular), but the file is a lot more manageable now.