Autoglossing Gàidhlig

August 28th, 2013

Over the last month I’ve been wondering how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago at Sabhal Mòr Ostaig when it was still a collection of farm buildings …

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and it gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, with the lemma given for any word that is a derived form, and each derived word also listed within the entry of the lemma that generates it. It still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.
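To make that layout concrete, here is a minimal sketch in Python of the cross-referencing idea. It is not the TeX script itself, and the data format (a flat list of form/lemma pairs), the function name build_listing and the sample words are assumptions chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: (surface form, lemma) pairs.
# A form that equals its lemma is treated as a headword.
ENTRIES = [
    ("taigh", "taigh"),
    ("taighean", "taigh"),
    ("taighe", "taigh"),
    ("cù", "cù"),
    ("coin", "cù"),
]


def build_listing(entries):
    """Return one line per word, in sorted order.

    Derived forms point to their lemma; headwords list their derived forms.
    """
    # Collect the derived forms that belong to each lemma.
    derived = defaultdict(list)
    for form, lemma in entries:
        if form != lemma:
            derived[lemma].append(form)

    lines = []
    # Plain code-point sorting is used here for simplicity; a real dictionary
    # would probably want proper collation for accented letters.
    for form, lemma in sorted(entries):
        if form == lemma:
            forms = ", ".join(sorted(derived[form]))
            lines.append(f"{form}: {forms}" if forms else form)
        else:
            lines.append(f"{form} -> {lemma}")
    return lines


if __name__ == "__main__":
    for line in build_listing(ENTRIES):
        print(line)
```

Every word appears once in the alphabetical run, so a reader who looks up a derived form is pointed to its lemma, while the lemma’s own entry gathers all of its forms in one place.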

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (e.g. Sotho).
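As a rough illustration of the two problems mentioned here, the following Python sketch splits plain text into words and punctuation marks, then joins any known multiword into a single token using a greedy longest-match pass. This is not how the Autoglosser does it: the MULTIWORDS list, the function names and the underscore used to join the parts are all assumptions made for the example.

```python
import re

# Hypothetical multiword list; in practice this would come from the dictionary.
MULTIWORDS = {("a", "h-uile"), ("mu", "dheidhinn")}
MAX_LEN = max(len(m) for m in MULTIWORDS)


def tokenise(text):
    """Split text into words and separate punctuation marks.

    A word may contain internal hyphens or apostrophes (e.g. "h-uile").
    """
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)


def merge_multiwords(tokens):
    """Join known multiwords into one token, trying the longest match first."""
    out = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_LEN, 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MULTIWORDS:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword starts at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(merge_multiwords(tokenise("Bha a h-uile duine ann.")))
```

Keeping punctuation as separate tokens means a multiword is still recognised when it is followed directly by a comma or full stop, and the same merging step could in principle be used to rejoin the separately written parts of a word in a disjunctively written language.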