Refactoring an Apertium dictionary

August 21st, 2010 by donnek

One of the great things about the Apertium machine translation project is that Fran Tyers and others connected with it have assembled sizeable collections of free (GPL) lexical data. So that was the first place to look when I wanted a Spanish dictionary to use with the Bangor Autoglosser. However, the dictionaries are in XML format, which is notoriously slow for this sort of task (in Apertium, the dictionaries are compiled before use), and clumsy to process in PHP. I therefore ended up refactoring the dictionary into a CSV file (downloadable here), which I think is a more useable option for our autoglossing needs (it can be read in a spreadsheet or imported into a database).

To do this, we need to generate a text file containing the contents of the Apertium dictionary. For Ubuntu, the easiest way to go is to install apertium and apertium-en-es. We can test it by opening a terminal and typing:
echo "dog" | apertium en-es

or:
echo "perro" | apertium es-en

We get “Perro” and “Dog” back, respectively (the capitalisation is due to Apertium’s somewhat problematical algorithm for this). To extract the dictionaries, we need to download the raw files for the en-es package, untar them, and then use an Apertium utility, lt-expand:

lt-expand apertium-en-es.es.dix > apertium_es.txt
lt-expand apertium-en-es.en.dix > apertium_en.txt

for the monolingual dictionaries, and:
lt-expand apertium-en-es.es-en.dix > apertium_enes.txt

for the bilingual one. The Spanish dictionary (which is a file of around 300MB) is our main focus, and for our purposes we want to remove lines containing :>: or :<:, which will be duplicates, and those where the entries contain spaces (eg a fin de que). We then tag the lines to show the relevant field boundaries, and import them into a database. Once all the dictionaries are safely tucked up there, we can use SQL queries to insert the English lexemes (lemmas) into the Spanish entries.
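The filtering and tagging step might look something like the following Python sketch. It is only an illustration, not the script actually used: it assumes each line of lt-expand output has the shape surface:lemma<tag><tag>..., and the function names and tab-separated output are made up for the example.

```python
def parse_line(line):
    """Split one line of (assumed) lt-expand output into fields.

    Returns (surface, lemma, tags), or None if the line should be dropped.
    """
    line = line.rstrip("\n")
    # Direction-restricted entries are treated as duplicates and skipped.
    if ":>:" in line or ":<:" in line:
        return None
    surface, sep, analysis = line.partition(":")
    # Skip malformed lines and multiword entries such as "a fin de que".
    if not sep or " " in surface:
        return None
    # Everything from the first "<" onwards is kept as the raw tag string.
    lemma, lt, rest = analysis.partition("<")
    return surface, lemma, lt + rest


def to_rows(lines):
    """Yield tab-separated rows ready for import into a database."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            yield "\t".join(parsed)
```

Feeding the lines of apertium_es.txt through to_rows() would give one row per surviving entry, with the field boundaries marked by tabs.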

The result is a table with around 690,000 entries. Around 95% of these are verbforms, and about 87% of those are verbforms with enclitic pronouns (eg háblenosles). Although the execution speed for database lookups gained by rationalising these a bit is probably negligible, decreasing the size of the file makes it easier to distribute.

The first thing I did was to convert the Apertium tags to make them slightly more mnemonic, and segment the categories into their own fields – there are nearly 1900 different tags in the original file, many of them with only a few entries. The number of determiners especially seemed excessive, and for adjusting these I used a very useful tool – SQL Workbench/J, which is the only GUI tool I’ve come across so far that lets you edit the resultsets of PostgreSQL queries. The refactored dictionary has 173 separate combinations of POS tags.

The second thing was to segment the roughly 560,000 clitic verbforms, leaving only around 15,000 base verbforms. This is on the understanding that we can deal with the unsegmented forms via dynamic analysis and tagging – the download of the refactored dictionary contains a file with sample PHP functions that will do this. These standalone verbforms then have to be added back to the dictionary, because they usually entail orthographical variations in terms of accents. For example, the imperative 3 singular of decir is diga when it is standalone, but díga when a clitic pronoun is attached, as in dígame.
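As a rough illustration of the dynamic analysis idea (this is not the PHP from the download, just a Python sketch under assumptions), the code below peels clitic pronouns off the end of a form, drops the acute accent from what is left, and checks the result against a set of known standalone verbforms. The clitic list, the function names and the tiny set of known forms are all invented for the example.

```python
import unicodedata

# Illustrative list of enclitic pronouns; the real functions may use another.
CLITICS = ("me", "te", "se", "lo", "la", "le", "los", "las", "les", "nos", "os")


def strip_acute(word):
    """Remove acute accents only, leaving ñ and ü untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def segment(form, known_forms, max_clitics=3):
    """Return every (base form, clitics) analysis found for a verbform."""
    analyses = []

    def walk(stem, clitics):
        base = strip_acute(stem)
        if clitics and base in known_forms:
            analyses.append((base, list(clitics)))
        if len(clitics) < max_clitics:
            for clitic in CLITICS:
                if stem.endswith(clitic) and len(stem) > len(clitic):
                    walk(stem[: -len(clitic)], [clitic] + clitics)

    walk(form, [])
    return analyses
```

With diga as the only known form, segment("dígame", {"diga"}) gives diga plus the clitic me. Unlike a function that stops at the first match, this sketch returns every analysis it finds, so a form with two possible splits comes back with both.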

The last thing was to remove all the names, because the autoglosser will assume that something is a name of some sort if it starts with a capital.

The end result is a dictionary file with around 130,000 entries. This is probably not perfect (eg the clitic functions will segment háblenosles above as imperative 3 singular + 1 plural + 3 singular, and not admit the alternative of imperative 3 plural + 2 plural + 3 singular), but the file is a lot more manageable now.

Tweaking Ubuntu 10.04

July 21st, 2010 by donnek

When I came back from Gregynog, the motherboard on my main PC decided to bite the dust (and to be fair, there was quite a lot of it in the case). So, cue one week of building, installing, transferring data, and (not least) tweaking things so that they are the way they should be. I got a bundle from Overclockers, choosing an Intel processor for the first time in a decade. Not sure if that was a good idea, since it doesn’t seem to play well with the nVidia proprietary drivers. So it may be a while before I can get my wobbly windows back ….

I usually install the stock Ubuntu, and then upgrade to Kubuntu via Synaptic. GNOME is getting better, though it seems to have gone on a bit of a Mac OS X trip lately. KDE 4.4 is just about useable now – it’s got about 90% of what KDE 3 had. There are some weird things about it, though. For instance, in the 4.4.5 version of KMail, it seems you have to put together your own recipe for quoting emails you’re replying to. Maybe it was always like that, and the people at openSUSE put in a sensible default so I never noticed it before, but it’s a strange thing to have to start experimenting with the format of your email replies. Then of course I need to install all the stuff I like, including TeX Live and R, which are both big-budget downloads. Unfortunately, the Kile install pulls in over 250MB of documentation, including the docs for LilyPond! I just can’t get to like the new Amarok, with that big slab in the middle – it just seems to use space badly. (Indeed, that’s something I think applies to all of KDE 4 – everything seems to be surrounded by fat widgets, instead of being a bit less in-your-face.) So I’m pleased to say that there is a fork of the old Amarok – Clementine – which works really well. There are a couple of things that aren’t in it yet (like the popup when you mouse over the tray icon, telling you which track is playing), but it’s light and neat – really much better than the current Amarok.

Talking of sound, PulseAudio may be a great idea, but I have never managed to get any sound out of my cards without ripping it out as follows:
sudo apt-get purge pulseaudio gstreamer0.10-pulseaudio
sudo apt-get autoremove
sudo apt-get install alsa-base alsa-tools alsa-tools-gui alsa-utils alsa-oss linux-sound-base
gstreamer-properties # to set the default to ALSA

With ALSA, everything works fine. Unfortunately, GNOME apps seem to just do their own thing once PulseAudio disappears, and things like Zim give weird pops and drumrolls when you click on things. To get rid of that:
nano ~/.gtkrc-2.0-kde4

and append the following lines to the end:
gtk-enable-event-sounds=0
gtk-enable-input-feedback-sounds=0
gtk-error-bell=0

You need to log out and back in again for it to take effect.

Synaptic looks quite yucky, since it uses the GUI for the root user. To fix that, install qtcurve, and then run:
sudo cp ~/.gtkrc-2.0-kde4 /root/.gtkrc-2.0

I took advantage of having to transfer my data to rationalise it a bit. I usually have multiple copies of things now, ever since the Great Hard-Disk Crash of 2005 laid waste nearly 2 years of records, and I’m trying to put all the web-apps I’ve done into some sort of order. Some will just go in a museum somewhere online (eg Kartouche, Kyfieithu), but others will be updated – Eurfa is going to get about 30,000 extra words, an embedded Konjugator, and citations, and Klebran needs to be finished properly (the old site just disappeared one day last year – my ISP was brazen enough to say that I must have deleted it!).

Talking of which, I’ve just discovered a backup app that appears on first acquaintance to be excellent – SpiderOak. This is cross-platform, allows you to select directories and files to upload to a web repository, does incremental uploads of changed files only (à la rsync), offers syncing and sharing, and (best of all) does on-the-fly encryption of what you back up – the lack of this is a big hole in the Ubuntu One service. You get 2GB of storage for free, and you can buy more. Well worth looking at – it’s best to have as many backup locations as you can. Mind you, I now have two 1TB drives in this machine, so that gives plenty of room for manoeuvre.

I’m now keeping my fingers crossed that this mobo doesn’t decide to go down the Swanee …

The autoglosser takes a bow …

July 16th, 2010 by donnek

At the very interesting Welsh Syntax seminar in Gregynog, there were a couple of presentations from the ESRC Centre, which should shortly be up on the seminar’s webpage. I spoke to a few slides on the first, summarising why the Bangor autoglosser had been developed, and what it does, and we also publicised the BangorTalk test site. Quite a lot of material has now been posted there, mainly to check how well the importer/autoglosser works, and also to experiment with presentation and layout. At the minute, the pages showing the text are pretty heavy to process, and also slow to show the gloss popups on older browsers (because the webpage is pretty big), so the next step there is to find a way (preferably using AJAX) to page through the text in chunks of about 50 utterances without interfering with the audio playback. Over the next few weeks I’ll be revising the Welsh and Spanish dictionaries, and doing a first version of a Spanish constraint grammar.

Costing the segmenter

June 8th, 2010 by donnek

Just for interest, I registered the Swahili verb segmenter at Ohloh. This is quite a clever setup, because they analyse the various bits of code in the repo and come up with a nice set of tables on the analysis page. The number of code lines comes out at around 2,400 (not counting comment lines or blank lines, which I like to strew liberally around my stuff, so that I have at least a chance of remembering what it’s supposed to do!), and on the little widget I’ve added to the site, it concludes that the segmenter would cost around $27,000 (£19,000) to write from scratch.

However, this is of course a little optimistic. Firstly, the Basic COCOMO model they use tends to overestimate the value of small projects. Secondly, about 38% of the code consists of the very nice CSS stuff from Blueprint – not really “mine”. Lastly, the average salary is assumed to be $55,000 per year, which is probably unlikely around here – I think between $35,000 (£24,000) and $45,000 (£31,000) would be more realistic. So if we put the lower figure in, and assume only 62% of the code is my own, and take a bit off for being a small project, that comes to around £9,000 ($13,000), which is probably a closer reflection of the monetary value.
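For anyone wanting to play with the numbers, here is a minimal Python sketch of a Basic COCOMO estimate. It assumes the standard "organic mode" coefficients (effort in person-months = 2.4 × KLOC^1.05); whether Ohloh uses exactly these is an assumption on my part, and the function and parameter names are made up for the example.

```python
def basic_cocomo_cost(kloc, salary_per_year, a=2.4, b=1.05):
    """Return (person-months, cost) for a project of `kloc` thousand lines.

    a and b are the textbook Basic COCOMO organic-mode coefficients,
    assumed here rather than taken from Ohloh.
    """
    person_months = a * kloc ** b
    cost = person_months * salary_per_year / 12.0
    return person_months, cost


# Roughly 2,400 lines of code at the assumed $55,000 average salary.
months, cost = basic_cocomo_cost(2.4, 55_000)
print(f"{months:.1f} person-months, about ${cost:,.0f}")
```

With these inputs the estimate comes out at roughly six person-months and a cost in the region of the $27,000 headline figure. The function can be rerun with a smaller line count or a different salary to try out other assumptions; it will not necessarily match an adjustment done by hand.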

It’s still quite a significant amount – the equivalent, assuming the lower salary figure, of around 4.5 months of work. Since the actual coding time was probably only about half that, it suggests that around 50% of “standard” costs go on things like overheads, meetings, etc. Perhaps that’s another argument for the free software development model – more emphasis on the code than on the organisational framework for it.

Swahili segmenter now online

May 27th, 2010 by donnek

At the weekend I finally managed to get the segmenter tidy enough to release. The web version is here, and the code is available for download from a Git repository. This also includes a pretty detailed manual on how to get it working from scratch on an Ubuntu machine.

Beata has found a couple of minor problems so far, and I’ll fix these when I get some other stuff finished. More testing and comments would be very welcome.

Constraint Grammar tutorial

May 14th, 2010 by donnek

I’ve been doing some work over the past few weeks with the ESRC Centre for Research on Bilingualism at Bangor University, focussing on autoglossing their Welsh and Spanish conversation transcripts. As part of that, I’ve been using Constraint Grammar again as a possible approach to disambiguating words in the text.

Fran Tyers introduced me to CG, which is licensed under the GPL, when we were working on the Apertium Welsh translator 18 months ago. The Welsh grammar we ended up with, containing about 130 rules, was quite small by CG standards (the Portuguese CG grammar has around 9,000 rules), but was pretty effective.

In the course of revising and expanding that grammar, I thought it would consolidate my own learning to write a short tutorial, which might be useful to others as a gentler introduction to this very elegant and versatile system than the manual and howto. The result is a short note on Getting started with Constraint Grammar, using a Welsh sentence as the example text. The TeX source file is here, in case anyone wants to improve on it or extend it.

A Swahili verb analyser

April 10th, 2010 by donnek

Many years ago I studied Bantu languages, and I’ve recently returned again to perhaps the best-known of them, Swahili. On the FreeDict list, Piotr Bański and Jimmy O’Regan had noted the absence of a free (GPL) morphological analyser for Swahili – some have been written, but they are not available under a free license. I have now completed a free (GPL) analyser for Swahili verbs (analysing the nouns is relatively easy), and my hope is that it might not be all that difficult to port to other Bantu languages like Shona or Zulu.

My first idea was to write a generating conjugator like the one I did for Welsh, where I would set up rules and forms in a database, and then scripts would stick all these together. This would have been much easier to do for a Bantu language than for an inflected language like Welsh. Even though many of the forms would have been semantically dubious, even if morphologically possible, that would not have mattered, since they could just sit quietly in the database, offending no-one. The main drawback would be that as new verbs were added to the dictionary, the relevant forms would have to be generated, but that could be done by a script.

So I took the first steps of generating forms for the current (-na-) tense, adding subject and object pronouns for all the classes. Hmm. A first run-through yields about 400 forms for -ambia (say, tell). And it turns out that of these 400 “possible” forms, only 15 occur in Kevin Scannell’s 5m-word Crúbadán Swahili corpus. Worrying – let’s (roughly) do the maths: 20 subject prefixes × 20 object prefixes × 25 tenses × 20 relatives × (say) 2,000 verbs (initially) = 400m entries! Not really practical, I think …
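The back-of-envelope sum can be checked with a few lines of Python. The slot counts are the rough figures quoted above, not exact counts of Swahili morphemes, and the variable names are just for illustration.

```python
from math import prod

# Rough number of options in each slot of the verb, as quoted in the text.
slots = {
    "subject prefixes": 20,
    "object prefixes": 20,
    "tenses": 25,
    "relatives": 20,
}

# Forms for a single tense with no relative: subject x object combinations.
forms_one_tense = slots["subject prefixes"] * slots["object prefixes"]

# Every combination for one verb, then for an initial 2,000-verb dictionary.
forms_per_verb = prod(slots.values())
total_entries = forms_per_verb * 2000

print(forms_one_tense, forms_per_verb, total_entries)
```

That gives 400 forms for one tense, 200,000 per verb, and 400 million entries overall, which is where the "400m" comes from.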

So instead I’ve written something which segments and tags the verbal form given to it. Type in anawaambia (he/she is telling you/them), for instance, and you get:
a[sp1-3s]+na[curr]+wa[op2-2p,op2-3p]+ambia (tell)

or for alivyompiga (how he hit him):
a[sp1-3s]+li[past]+vyo[rel8]+m[op1-3s]+piga (hit)

where rel8 is “relative particle of class 8”, sp1-3s is “subject pronoun of class 1, third person singular”, curr is “current present tense”, and so on. This has the big benefit that you don’t need to generate the forms beforehand and add them to the dictionary. Of course, you won’t get a verb lemma until you put an entry for it into the dictionary!
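If you want to do something with this output in another program, the notation is easy to take apart. The Python sketch below is only an illustration of that (the analyser itself is PHP and its internals are not shown here); the function name and the returned structure are assumptions made for the example.

```python
import re

# One morpheme: a form, optionally followed by [tag,tag,...].
# Tags may themselves contain "+", so the form excludes "[", "]" and "+".
MORPHEME = re.compile(r"([^\[\]+]+)(?:\[([^\]]*)\])?")


def parse_analysis(text):
    """Turn 'a[sp1-3s]+na[curr]+ambia (tell)' into (morphemes, gloss)."""
    body, _, gloss = text.partition(" (")
    gloss = gloss.rstrip(")")
    morphemes = []
    for match in MORPHEME.finditer(body):
        form, tags = match.group(1), match.group(2)
        morphemes.append((form, tags.split(",") if tags else []))
    return morphemes, gloss
```

For anawaambia this gives a list of four morphemes, with wa carrying the two candidate tags op2-2p and op2-3p, plus the gloss "tell".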

At the moment, the system is working in a console, which is useful for debugging, but I’ll add a web interface for it. The aspects I’m focussing on now are a disambiguator and a rudimentary grammar checker.

The disambiguator is a bit like constraint grammar, but working on morpheme tags inside the word instead of across word boundaries, and of course it’s using PHP regexes working on a database entry instead of C++! This is helping to tighten up the analysis – in fact, as I was parsing an example to put in here, I realised several rules could be conflated by changing just a couple of things, so I’ll use a simpler example: with singejua (I would not know), the entry for the negative imperative marker is removed to leave the (correct) negative first person singular subject pronoun, so:
si[neg+sp1-1s,neg-imp]+nge[cond]+jua (know)

becomes:
si[neg+sp1-1s]+nge[cond]+jua (know)
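A rule of this kind can be sketched as a regex substitution on the analysis string. The Python below is a guess at the general shape, not the actual PHP rule: in particular, the condition used here (drop neg-imp when the next morpheme is nge[cond]) is an assumption chosen to reproduce the example, and the names are invented.

```python
import re

# Each rule is (pattern, replacement), applied in order to the analysis.
# Assumed condition: 'neg-imp' is impossible directly before nge[cond].
RULES = [
    (re.compile(r"\[neg\+sp1-1s,neg-imp\](?=\+nge\[cond\])"), "[neg+sp1-1s]"),
]


def disambiguate(analysis):
    """Apply every rule in turn and return the tightened analysis."""
    for pattern, replacement in RULES:
        analysis = pattern.sub(replacement, analysis)
    return analysis
```

Calling disambiguate() on the first analysis of singejua above returns the second one, and an analysis that matches no rule is passed through unchanged.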

The checker is due to the fact that the analyser trusts you to put in correct forms – it will analyse incorrect forms as best it can. It would be difficult, on this implementation model, to completely rule out all incorrect forms (although the original generator model would have done this), but it is feasible to flag the most obvious, and this is what I will be doing. For instance, if you enter *hawasingejua (*they would not know), you get the following:
Incorrect: There are two negatives here. Either remove the initial
negative ‘ha-’ (corrected below), or use ‘nge’ instead of ‘singe’.
wa[sp2-3p]+singe[neg-cond]+jua (know)

with a correct form (wasingejua) offered instead. I’m not sure yet how best to cater for this in command-line mode (in cases where the analyser would be being used without supervision to tag text in bulk).

As regards verbal extensions (where -pika (cook) can produce -pikwa, -pikia, -pikiwa, -pikisha, -pikika, and so on), these are not handled directly in the analyser. The main reasons for this are (a) they are less productive (of the 8 or so main extensions, many verbs may have only a few in common use) and (b) the morphology is more variable (often depending on the source of the verb), which makes analysis more complicated. So my current plans are to handle them in a revised version of Beata Wójtowicz’ FreeDict Swahili dictionary, with the extensions being marked in the verb entry (eg -pikisha, v, caus) along with the root of the verb (so that other extensions of the same root will come up in any search). This means that nilipikiwa (I had something cooked for me) might show up in the analyser as something like:
ni[sp1-1s]+li[past]+PIK+prep+pass (have something cooked for one)

meaning that the prepositional and passive extensions are marked in the analysis.

There are undoubtedly many shortcomings in this first version of the analyser, but at least the code (such as it is) will be out there for people to comment on and amend. It may be that it can be rewritten in a compiled language to make it faster, or that the existing constraint grammar engine can be included to make the disambiguation more flexible. Since it’s fewer than 500 lines of PHP, it should be easy to get to grips with.

Font-face …?

April 2nd, 2010 by donnek
Testing … Linux works fine, but Mac won’t play ball.
Perhaps there’s something it doesn’t like about the Scheherazade font file …

أَلِپٗپٖنْدَ مَنَانِ · كَمُؤٗنَ مُعَيَنِ
كُنَ كِسِمَ مْوِٹُنِ · أَكٖنْدَ كُچَنْڠَلِيَ

alipopenda manani ✽ kamuona mu’ayani
kuna kisima mwiţuni ✽ akenda kuchangaliya

Swahili layout now part of xkb

April 2nd, 2010 by donnek

Thanks to fast work by Sergey Udaltsov, the proposed keyboard layout for Swahili in Arabic script is now in the xkeyboard git tree. This means that at some point in the near future you will not have to edit your xkb files manually, as in the howto. All you will need to do is choose the new Tanzania or Kenya layouts on your keyboard switcher, and you will get the Arabic script layout automatically. Having this available as a default in all GNU/Linux distros should make it a lot easier to get started.

Swahili in Arabic script: a howto

March 27th, 2010 by donnek

I’ve written up what I did to set up my machine to write Swahili in Arabic script, and the result is contained in this howto document. The keyboard layout file listed in Annex 1 can be downloaded here. Any corrections or additions are welcome.

I give another example there of a transcription from the Jaafari manuscript:

What has actually surprised me is that it is almost as fast to type Swahili in Arabic script as it is in Roman, and it takes much less time than it would to add all the diacritics to make a proper transliteration of the manuscript.

I’m not sure whether this has already been done for other African languages that have used Arabic script in the past (eg Hausa), but it seems a useful way of using modern technology to help safeguard an important area of cultural heritage.