Archive for May, 2010

Swahili segmenter now online

May 27th, 2010

At the weekend I finally managed to get the segmenter tidy enough to release. The web version is here, and the code is available for download from a Git repository. The repository also includes a pretty detailed manual on how to get it working from scratch on an Ubuntu machine.

Beata has found a couple of minor problems so far, and I’ll fix these when I get some other stuff finished. More testing and comments would be very welcome.

Constraint Grammar tutorial

May 14th, 2010

I’ve been doing some work over the past few weeks with the ESRC Centre for Research on Bilingualism at Bangor University, focussing on autoglossing their Welsh and Spanish conversation transcripts. As part of that, I’ve been using Constraint Grammar again as a possible approach to disambiguating words in the text.

Fran Tyers introduced me to CG, which is licensed under the GPL, when we were working on the Apertium Welsh translator 18 months ago. The Welsh grammar we ended up with, containing about 130 rules, was quite small by CG standards (the Portuguese CG grammar has around 9,000 rules), but was pretty effective.

In the course of revising and expanding that grammar, I thought that writing a short tutorial would consolidate my own learning, and might give others a gentler introduction to this very elegant and versatile system than the manual and howto do. The result is a short note on Getting started with Constraint Grammar, using a Welsh sentence as the example text. The TeX source file is here, in case anyone wants to improve on it or extend it.