New Swahili materials

January 9th, 2016 by donnek

For much of 2015, I was working on Swahili. The main effort has been on Andika!, a set of tools to allow handling of Swahili in Arabic script. I spoke on this at a seminar in Hamburg in April 2015 (thanks to Ridder Samsom for the invite!), and since then my top priority has been to use Andika! for the real-world task of providing a digital version of Swahili poetry in Arabic script. So I started working on two manuscripts of the hitherto-unpublished Utenzi wa Jaʿfar (The Ballad of Jaʿfar), adding pieces to Andika! as required, and I finished that last summer. Although there are still a few rough edges, I think it’s worth putting it up on the web.

Since then, I’ve been working on a paper to demonstrate how having a ballad text in a database might help in analysing it textually, and the results so far are very interesting. For instance, half of the stanzas use just three rhymes, one in five verbs is a speech verb, the majority of verbs (67%) have no specific time reference, 70% of words in rhyming position are Arabic-derived, 83% of lines have two syntactic slots, and the four commonest syntactic sequences display marked word-order almost as frequently as the “normal” unmarked word-order. I’m currently looking at repetition and formula use, and there the ability to query the database for similar word-sequences is crucial. Another benefit is that concordances (based on either word or root) can be created very easily.
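To give a flavour of the kind of query involved, here is a minimal sketch of a keyword-in-context concordance in Python over SQLite. The table and column names (lines, stanza, line_no, text) and the database file are invented for illustration; they are not Andika!’s actual schema.

# Sketch: a keyword-in-context (KWIC) concordance over a hypothetical
# SQLite table lines(stanza, line_no, text).
import sqlite3

def concordance(db_path, keyword, width=4):
    # Return (stanza, line_no, left context, keyword, right context)
    # for every line containing the keyword.
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT stanza, line_no, text FROM lines WHERE text LIKE ?",
        ("%" + keyword + "%",),
    )
    hits = []
    for stanza, line_no, text in rows:
        words = text.split()
        for i, w in enumerate(words):
            if w == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append((stanza, line_no, left, w, right))
    conn.close()
    return hits

# Example: every occurrence of a given word-form, with its context.
for hit in concordance("jafar.db", "akasema"):
    print(hit)

A root-based concordance would work the same way, except that the query would match against a lemma or root column rather than the surface form.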

Ideally, you could load all your classical Swahili manuscripts into a big database, which would give you an overview of spelling variations and word usage over time and space, with all the text available to be queried easily and printed in either Arabic or Roman script (or both), with a full textual apparatus if desired.
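As a sketch of what such a query might look like (again with an invented schema, a words table holding one row per token with the manuscript, its date and region, the surface form and the lexeme, rather than anything that exists yet), grouping the spellings of a single lexeme by manuscript and date is only a few lines:

# Sketch: spelling variants of one lexeme across manuscripts, assuming a
# hypothetical table words(ms, year, region, form, lexeme).
import sqlite3

def spelling_variants(db_path, lexeme):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """SELECT form, ms, year, region, COUNT(*) AS n
           FROM words
           WHERE lexeme = ?
           GROUP BY form, ms, year, region
           ORDER BY year, n DESC""",
        (lexeme,),
    ).fetchall()
    conn.close()
    return rows

# Example: how one (hypothetical) lexeme is spelt in each manuscript.
for form, ms, year, region, n in spelling_variants("corpus.db", "jaafari"):
    print(year, region, ms, form, n)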

For conclusions based on a poetic corpus to be valid, they would have to be compared against a prose corpus, and I’m offering a small contribution to that too: a new Swahili corpus based on Wikipedia. It contains about 2.8m words in 150,000 sentences, so it’s reasonably comprehensive. In contrast to the work I’ve done on Welsh corpora, I used the NLTK sentence tokenizer to split the sentences, and I’m not entirely sure yet whether I like the results as much as those from the hand-rolled splitting code I was using earlier. Something to come back to, possibly.
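For anyone curious about that step, the NLTK call itself is only a couple of lines. Note that NLTK’s punkt tokenizer does not ship a Swahili model, so the splitting falls back on the default (English-trained) rules; the file names below are placeholders for the cleaned Wikipedia text, not the actual corpus files.

# Sketch: sentence-splitting a cleaned Wikipedia text dump with NLTK's
# punkt tokenizer; file names are placeholders.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

with open("sw_wikipedia.txt", encoding="utf-8") as f:
    text = f.read()

# No Swahili punkt model is bundled with NLTK, so the default rules are used.
sentences = sent_tokenize(text)
print(len(sentences), "sentences")

with open("sw_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))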
