Swahili verb segmenter

Background

This webpage allows Swahili verbforms to be segmented for use in parsers or taggers. Although some segmenters already exist, they either handle only a few basic forms, or are not available under an open license. This segmenter handles all the one-word tenses in Swahili, and is licensed under the GPLv3 and the AGPLv3.

The segmenter uses the data from Beata Wójtowicz's FreeDict Swahili dictionary, so it identifies the verb only if it is one of the 500-odd verbs that the dictionary already contains. As the dictionary is expanded, the segmenter will recognise more verbs.

Two variants of the segmenter exist: this web version, and a version that can be run from the command line against a file listing verbforms to be analysed. It would also be possible to use this latter version as part of an application that would tag connected text.

The segmenter is written in PHP and uses a PostgreSQL database. Many improvements could probably be made to the code. For instance, it is relatively slow: working through a list of 140,000 possible verbforms drawn from Kevin Scannell's Crúbadán Swahili corpus, it averaged around 87 forms per minute, or roughly 27 hours for the whole list.

Method

The approach used here takes advantage of the fact that Swahili verbs have specific slots for each type of morphological affix. There is some overlap, particularly where negative markers and the general present or habitual markers are concerned, but in general the sequence is fixed.

The segmenter first removes and stores suffixes such as relative particles and the enclitics -ni, -pi, -je. Prefixes are then split off the verbform slot by slot, and tagged with morphological information. After each such split, the remainder of the verbform is checked against the dictionary to see if it can be interpreted as a verb, and whether or not it seems to have a negative (-i) or subjunctive (-e) verb ending. If so, then no further splitting is done, and the stored suffixes are added back to give a full parsed form.
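The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the segmenter's actual PHP code: the stem list, the affix tables, the tag names and the function names (lookup_verb, segment) are all hypothetical, and the real application draws its data from the dictionary database. The sketch also assumes at most one enclitic, takes the longest matching prefix in each slot, and ignores stems that themselves end in -ni, -pi or -je.

```python
# Illustrative sketch only: tiny hypothetical tables, not the segmenter's real data.
VERB_STEMS = {"taka", "fika", "piga", "simama"}
ENCLITICS = ("ni", "pi", "je")

# Prefix slots in their fixed order; every slot is optional in this sketch.
SLOTS = [
    ("subject", {"ni": "1sg", "u": "2sg", "a": "3sg", "tu": "1pl",
                 "wa": "3pl", "ki": "cl7", "si": "neg.1sg"}),
    ("tense", {"na": "present", "ta": "future", "li": "past", "me": "perfect"}),
    ("relative", {"vyo": "rel.cl8", "cho": "rel.cl7", "ye": "rel.cl1"}),
    ("object", {"m": "obj.3sg", "ki": "obj.cl7"}),
]


def lookup_verb(remainder):
    """Return (stem, ending) if the remainder can be read as a known verb."""
    if remainder in VERB_STEMS:
        return remainder, "indicative"
    if len(remainder) > 1 and remainder[-1] in "ie":
        candidate = remainder[:-1] + "a"  # undo the -i / -e ending
        if candidate in VERB_STEMS:
            ending = "negative" if remainder[-1] == "i" else "subjunctive"
            return candidate, ending
    return None


def segment(form):
    """Split a verbform into (morpheme, tag) pairs."""
    # 1. Remove and store a trailing enclitic, if there is one.
    stored = []
    for enclitic in ENCLITICS:
        if form.endswith(enclitic) and len(form) > len(enclitic):
            form = form[: -len(enclitic)]
            stored = [(enclitic, "enclitic")]
            break

    # 2. Split prefixes off slot by slot, stopping once the rest is a verb.
    parts = []
    rest = form
    for _slot_name, prefixes in SLOTS:
        if lookup_verb(rest):
            break
        matches = [p for p in prefixes
                   if rest.startswith(p) and len(rest) > len(p)]
        if matches:
            prefix = max(matches, key=len)
            parts.append((prefix, prefixes[prefix]))
            rest = rest[len(prefix):]

    # 3. Tag what is left, then add the stored suffixes back.
    verb = lookup_verb(rest)
    if verb:
        stem, ending = verb
        parts.append((rest, f"verb:{stem}+{ending}"))
    else:
        parts.append((rest, "unknown"))
    return parts + stored
```

With these toy tables, segment("utafikaje") splits into u (2sg), ta (future), fika (verb) and the stored enclitic je, while segment("sitaki") stops after the first split because taki can be read as taka with the negative ending.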

Once the parse has been completed, it is examined in order to see whether some tags are inconsistent with others and can therefore be struck out. This disambiguation is similar to constraint grammar, but working within the word boundary rather than across it. A related area is error-checking. Initial work has been done on this, with suggestions to the user as to how to correct obvious errors, but this needs to be greatly extended.
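As a rough illustration of this word-internal disambiguation, the Python sketch below strikes out candidate tags that clash with tags already known to be certain. The rules, tag names and the disambiguate function are hypothetical simplifications chosen for the example, not the segmenter's own rule set. They cover one overlap only: the prefix hu- can be either the habitual marker or the negative second person singular marker.

```python
def has_tense(certain):
    """True if some morpheme is unambiguously tagged as a tense marker."""
    return any(tag.startswith("tense.") for tag in certain)


# Hypothetical rules: (tag to strike out, test on the set of certain tags).
RULES = [
    # Habitual hu- takes neither a tense marker nor the negative ending -i.
    ("habitual", lambda c: "ending.negative" in c or has_tense(c)),
    # With no tense marker, a negative subject marker needs the -i ending.
    ("neg.2sg", lambda c: "ending.indicative" in c and not has_tense(c)),
]


def disambiguate(parse):
    """parse is a list of (morpheme, set of candidate tags); return a pruned copy."""
    pruned = [(morph, set(tags)) for morph, tags in parse]
    # A tag is certain when it is the only candidate on its morpheme.
    certain = {tag for _, tags in pruned if len(tags) == 1 for tag in tags}
    for _, tags in pruned:
        for tag, is_inconsistent in RULES:
            # Never strike out the last remaining candidate.
            if len(tags) > 1 and tag in tags and is_inconsistent(certain):
                tags.discard(tag)
    return pruned
```

Given hu- tagged as both habitual and neg.2sg, the sketch keeps only neg.2sg in hutaki (negative ending) and only habitual in husimama (plain ending, no tense marker).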

Usage

Simply type the word you want to check into the box above and press the button. Suggested test forms are: wanavyotaka, kilichompiga, afanyaye, husimama, sitaki, amekwendapi, utafikaje.

Examples of incorrect forms (to test error correction) are: hawasingejua, asifikaye, atayeona.

Source code

The code and data for this application are available from GitHub. Any suggestions for improvement in either code or data are very welcome - contact me on:

A note on verb roots

Because of the extensive use of affixes in Bantu languages, grouping related forms in a dictionary cuts across the alphabetical listing of such forms. For an interesting discussion, see John C Kiango, "Problems of citation forms in dictionaries of Bantu languages", Nordic Journal of African Studies, 14(3), 2005. The system used here is intended to allow related dictionary forms to come up in queries of the expanded version of the FreeDict Swahili dictionary that I am working on. For verbs derived from Arabic (e.g. -fahamu), the Arabic stem (in this case f-h-m) is used as the root.
