A Swahili verb analyser

April 10th, 2010 by donnek

Many years ago I studied Bantu languages, and I’ve recently returned to perhaps the best-known of them, Swahili. On the FreeDict list, Piotr Bański and Jimmy O’Regan had noted the absence of a free (GPL) morphological analyser for Swahili – some have been written, but they are not available under a free license. I have now completed a free (GPL) analyser for Swahili verbs (analysing the nouns is relatively easy), and my hope is that it will not be too difficult to port to other Bantu languages like Shona or Zulu.

My first idea was to write a generating conjugator like the one I did for Welsh, where I would set up rules and forms in a database, and then scripts would stick all these together. This would have been much easier to do for a Bantu language than for an inflected language like Welsh. Many of the forms would have been semantically dubious, though morphologically possible, but that would not have mattered, since they could just sit quietly in the database, offending no-one. The main drawback would be that as new verbs were added to the dictionary, the relevant forms would have to be generated, but that could be done by a script.

So I took the first steps of generating forms for the current (-na-) tense, adding subject and object pronouns for all the classes. Hmm. A first run-through yields about 400 forms for -ambia (say, tell). And it turns out that of these 400 “possible” forms, only 15 occur in Kevin Scannell’s 5m-word Crúbadán Swahili corpus. Worrying – let’s (roughly) do the maths: 20 subject prefixes × 20 object prefixes × 25 tenses × 20 relatives × (say) 2,000 verbs (initially) = 400m entries! Not really practical, I think …
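The rough maths above can be checked in a few lines. This is only a sketch in Python, using the round figures quoted in the paragraph (they are estimates, not measured counts), and the variable names are purely illustrative.

```python
# Rough upper bound on the number of generated forms, using the
# round figures quoted in the text (estimates, not measured counts).
subject_prefixes = 20
object_prefixes = 20
tenses = 25
relatives = 20
verbs = 2_000

# Subject/object combinations for one verb in one tense (e.g. -na-).
forms_per_tense = subject_prefixes * object_prefixes
# All tense and relative combinations for one verb.
forms_per_verb = forms_per_tense * tenses * relatives
# Everything that would have to be stored for the whole dictionary.
total_entries = forms_per_verb * verbs

print(f"{forms_per_tense:,} subject/object combinations per tense")
print(f"{forms_per_verb:,} forms per verb")
print(f"{total_entries:,} database entries")
```

The first figure matches the "about 400 forms" produced for -ambia in a single tense.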

So instead I’ve written something which segments and tags the verbal form given to it. Type in anawaambia (he/she is telling you/them), for instance, and you get:
a[sp1-3s]+na[curr]+wa[op2-2p,op2-3p]+ambia (tell)

or for alivyompiga (how he hit him):
a[sp1-3s]+li[past]+vyo[rel8]+m[op1-3s]+piga (hit)

where rel8 is “relative particle of class 8”, sp1-3s is “subject pronoun of class 1, third person singular”, curr is “current present tense”, and so on. This has the big benefit that you don’t need to generate the forms beforehand and add them to the dictionary. Of course, you won’t get a verb lemma until you put an entry for it into the dictionary!
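To make the segment-and-tag idea concrete, here is a minimal sketch in Python. It is not the analyser's actual PHP code: the tables hold only the morphemes needed for the two examples above, and the slot order (subject, tense, optional relative, optional object, stem) and the names `SLOTS` and `segment` are assumptions made for illustration.

```python
# Tiny illustrative tables: only the morphemes needed for the two examples.
SUBJECT = {"a": ["sp1-3s"]}
TENSE = {"na": ["curr"], "li": ["past"]}
RELATIVE = {"vyo": ["rel8"]}
OBJECT = {"wa": ["op2-2p", "op2-3p"], "m": ["op1-3s"]}
STEMS = {"ambia": "tell", "piga": "hit"}

# Assumed slot order in the verb: (table, whether the slot may be skipped).
SLOTS = [(SUBJECT, False), (TENSE, False), (RELATIVE, True), (OBJECT, True)]


def segment(word, slot=0):
    """Return every analysis of `word` as a list of formatted strings."""
    if slot == len(SLOTS):
        # All prefix slots consumed: what remains must be a known stem.
        return [f"{word} ({STEMS[word]})"] if word in STEMS else []
    table, optional = SLOTS[slot]
    analyses = []
    for morpheme, tags in table.items():
        if word.startswith(morpheme):
            # Strip the matched prefix and analyse the remainder.
            for rest in segment(word[len(morpheme):], slot + 1):
                analyses.append(f"{morpheme}[{','.join(tags)}]+{rest}")
    if optional:
        # Also try the reading in which this slot is empty.
        analyses.extend(segment(word, slot + 1))
    return analyses


for form in ("anawaambia", "alivyompiga"):
    print(form, "->", segment(form))
```

A verb whose stem is not in `STEMS` simply comes back with an empty list, which mirrors the point that no lemma is found until the dictionary has an entry for it.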

At the moment, the system is working in a console, which is useful for debugging, but I’ll add a web interface for it. The aspects I’m focussing on now are a disambiguator and a rudimentary grammar checker.

The disambiguator is a bit like constraint grammar, but it works on morpheme tags inside the word instead of across word boundaries, and of course it uses PHP regexes working on a database entry instead of C++! This is helping to tighten up the analysis – in fact, as I was parsing an example to put in here, I realised several rules could be conflated by changing just a couple of things. So I’ll use a simpler example: with singejua (I would not know), the negative imperative reading of si- is removed, leaving the (correct) negative first person singular subject pronoun, so:
si[neg+sp1-1s,neg-imp]+nge[cond]+jua (know)

becomes:
si[neg+sp1-1s]+nge[cond]+jua (know)
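A rule of this kind can be sketched as a regular-expression substitution over the tagged string. The Python below is an illustration only, not one of the analyser's real PHP rules: the single rule, the `RULES` list and the `disambiguate` function are hypothetical, and the rule assumes that a following conditional marker is enough to exclude the negative-imperative reading.

```python
import re

# Each rule is (pattern over the tagged analysis, replacement).
# Hypothetical rule: drop the "neg-imp" reading from a morpheme's tag list
# when the next morpheme is tagged as conditional.
RULES = [
    (re.compile(r"\[([^\]]*),neg-imp\](?=\+\w+\[cond\])"), r"[\1]"),
]


def disambiguate(analysis):
    """Apply every rule in order and return the tightened analysis."""
    for pattern, replacement in RULES:
        analysis = pattern.sub(replacement, analysis)
    return analysis


print(disambiguate("si[neg+sp1-1s,neg-imp]+nge[cond]+jua (know)"))
```

The lookahead `(?=...)` plays the part of a constraint-grammar context condition: it inspects the following morpheme without consuming it, so only the tag list of si- is rewritten.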

The checker is needed because the analyser trusts you to put in correct forms – it will analyse incorrect forms as best it can. It would be difficult, on this implementation model, to completely rule out all incorrect forms (although the original generator model would have done this), but it is feasible to flag the most obvious, and this is what I will be doing. For instance, if you enter *hawasingejua (*they would not know), you get the following:
Incorrect: There are two negatives here. Either remove the initial
negative ‘ha-’ (corrected below), or use ‘nge’ instead of ‘singe’.
wa[sp2-3p]+singe[neg-cond]+jua (know)

with a correct form (wasingejua) offered instead. I’m not sure yet how best to cater for this in command-line mode (in cases where the analyser is being used without supervision to tag text in bulk).
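The double-negative check can be sketched as follows. This is a crude Python illustration working on the raw word, not the analyser's own PHP check, which would more plausibly look at the analysed tags. The pattern `DOUBLE_NEGATIVE` and the function `check` are hypothetical names.

```python
import re

# Hypothetical check: an initial negative "ha-" together with the negative
# conditional "singe" later in the same word.
DOUBLE_NEGATIVE = re.compile(r"^ha(?P<rest>\w*singe\w+)$")


def check(word):
    """Return (message, suggestion); message is None if nothing is flagged."""
    match = DOUBLE_NEGATIVE.match(word)
    if match is None:
        return None, word
    message = ("Incorrect: There are two negatives here. Either remove the "
               "initial negative 'ha-' (corrected below), or use 'nge' "
               "instead of 'singe'.")
    # Suggest the form with the initial "ha-" removed.
    return message, match.group("rest")


message, suggestion = check("hawasingejua")
print(message)
print(suggestion)
```

Returning the message and the suggestion as values, rather than printing them inside the check, is one possible way to handle unsupervised bulk tagging: the caller can log the flag and carry on with the suggested form.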

As regards verbal extensions (where -pika (cook) can produce -pikwa, -pikia, -pikiwa, -pikisha, -pikika, and so on), these are not handled directly in the analyser. The main reasons for this are (a) they are less productive (of the 8 or so main extensions, many verbs may have only a few in common use) and (b) the morphology is more variable (often depending on the source of the verb), which makes analysis more complicated. So my current plan is to handle them in a revised version of Beata Wójtowicz’ FreeDict Swahili dictionary, with the extensions being marked in the verb entry (e.g. -pikisha, v, caus) along with the root of the verb (so that other extensions of the same root will come up in any search). This means that nilipikiwa (I had something cooked for me) might show up in the analyser as something like:
ni[sp1-1s]+li[past]+PIK+prep+pass (have something cooked for one)

meaning that the prepositional and passive extensions are marked in the analysis.
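One way to picture such dictionary entries is sketched below in Python. The layout is an assumption for illustration, not the FreeDict format: the `ENTRIES` table, its field names and the two helper functions are hypothetical, and the gloss given for -pikisha is an illustrative one rather than a dictionary citation.

```python
# Hypothetical dictionary entries: each stem records its root, the
# extensions it carries, and a gloss.
ENTRIES = {
    "pika": {"root": "PIK", "ext": [], "gloss": "cook"},
    "pikisha": {"root": "PIK", "ext": ["caus"], "gloss": "cause to cook"},
    "pikiwa": {"root": "PIK", "ext": ["prep", "pass"],
               "gloss": "have something cooked for one"},
}


def stem_analysis(stem):
    """Format a stem as ROOT+extension tags followed by its gloss."""
    entry = ENTRIES[stem]
    parts = "+".join([entry["root"], *entry["ext"]])
    return f"{parts} ({entry['gloss']})"


def related(stem):
    """List the other stems in the dictionary that share the same root."""
    root = ENTRIES[stem]["root"]
    return sorted(s for s, e in ENTRIES.items()
                  if e["root"] == root and s != stem)


print("ni[sp1-1s]+li[past]+" + stem_analysis("pikiwa"))
print(related("pikiwa"))
```

Storing the root in every entry is what lets a search on one extended stem bring up the other extensions of the same verb.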

There are undoubtedly many shortcomings in this first version of the analyser, but at least the code (such as it is) will be out there for people to comment on and amend. It may be that it can be rewritten in a compiled language to make it faster, or that the existing constraint grammar engine can be included to make the disambiguation more flexible. Since it’s less than 500 lines of PHP, it should be easy to get to grips with.
