This is a very simple segmenter for Quechua (Runasimi), spoken by several million people in Peru, Bolivia, Ecuador and Argentina. This version of the segmenter is a proof-of-concept, and there is a lot still to be done with it.
The major shortcomings are as follows:
- The vocabulary that the segmenter "knows" is only a couple of hundred words at present. However, it would be possible to raise this to around 15,000 relatively easily.
- If you enter an accented character, it will do the segmentation, but will precede the output with pointless complaints about encoding.
- If you enter a word that it does not know, it will mutter darkly about undefined variables and invalid arguments, instead of just telling you plainly that it doesn't know.
- Entering words containing ejective consonants (eg p', t', ch', k', q') does not work yet - sorry!
- Using punctuation of any kind, or capital letters where they are not part of a placename, may confuse the segmenter's small brain.
- The segmenter only deals with single words at present, but extending it to handle multiple words will be trivial (in theory ...).
- The output merely lists possible stems and affixes - it makes no attempt to choose the most likely ones. The next step here would be to assign some sort of likelihood to particular combinations of affixes (eg to say that kuna is more likely to be the plural affix than a combination of ku (benefactive) and na (verbifier), or that only a subset of affixes can appear after particular affixes, eg rqa (past)). Output from a corpus such as that produced by Kevin Scannell's Crúbadán would be helpful here.
- Last but not least, my Quechua is limited, so the tags I have assigned to the affixes may cause annoyance or shock to first-language speakers or linguists specialising in Quechua. These tags probably need to be improved, but the current set aims at giving a general indication of the meaning as tersely as possible.
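As a rough illustration of the ranking idea in the list above, here is a minimal Python sketch. It is not the segmenter's actual code: the tiny lexicon, the function names `split_affixes` and `segment`, the tag for y, and the "fewer affixes first" heuristic are all assumptions made for the example. A real version would more likely use likelihoods drawn from a corpus. The sketch lists every stem-plus-affixes analysis of a word, puts the analysis with kuna as a single plural affix ahead of ku + na, and returns an empty list for an unknown word so that the caller can say so plainly.

```python
# Toy lexicon for illustration only; not the segmenter's real data.
STEMS = {"wasi": "house"}
AFFIXES = {
    "y": "1sg.poss",      # tag is an assumption made for this example
    "kuna": "plural",
    "ku": "benefactive",
    "na": "verbifier",
}


def split_affixes(rest):
    """Return every way to split `rest` into a sequence of known affixes."""
    if not rest:
        return [[]]
    results = []
    for affix in AFFIXES:
        if rest.startswith(affix):
            for tail in split_affixes(rest[len(affix):]):
                results.append([affix] + tail)
    return results


def segment(word):
    """List (stem, affixes) analyses of `word`, fewest affixes first."""
    analyses = []
    for i in range(1, len(word) + 1):
        stem = word[:i]
        if stem in STEMS:
            for affixes in split_affixes(word[i:]):
                analyses.append((stem, affixes))
    # Assumed heuristic: an analysis with fewer affixes is more likely.
    analyses.sort(key=lambda analysis: len(analysis[1]))
    return analyses


if __name__ == "__main__":
    for word in ["wasiykuna", "xyz"]:
        found = segment(word)
        if not found:
            print(f"{word}: not in the vocabulary")
        for stem, affixes in found:
            print(word, "=", "-".join([stem] + affixes))
```

The same shape would allow a table of permitted affix sequences (for instance, which affixes may follow rqa) to filter the analyses before they are sorted.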
In order to save you having to find Quechua text to put into the segmenter (and to ensure that at least some of the text you put in will actually give you output!), you could try entering words from the following list:
- wasiykuna (my houses)
- rantishan (he is buying)
- Qusquman (to Cuzco)
- qhawani (I see)
- kutirqamun (he returned here)
- ruwashanki (you are doing)
- tukurquniñan (I already finished it)
A substantial number of pages in Quechua are available on Wikipedia, but given the low number of words currently in this version of the segmenter, you may not have much luck with them.
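Running text such as a Wikipedia page would first have to be cut into single words with the punctuation removed, since the segmenter currently handles neither. The Python sketch below shows one way that preprocessing might look. The function name `tokenize` and the regular expression are assumptions for illustration, not part of the segmenter. Capital letters are left alone so that placenames survive, and an apostrophe between letters is kept so that ejectives stay attached to their word.

```python
import re

# One or more letters, optionally followed by apostrophe + letters (ejectives).
# [^\W\d_] matches any Unicode letter, so characters such as ñ are kept.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def tokenize(text):
    """Return the words in `text`, dropping punctuation and digits."""
    return WORD.findall(text)


if __name__ == "__main__":
    print(tokenize("Qusquman, qhawani; tukurquniñan!"))
```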