Swahili in Arabic script

Typesetting Swahili poetry

Introduction
Input
Output
Analysis

The prime aim of Andika! is to facilitate the production of digital versions of classical Swahili manuscripts, and this section highlights one approach to handling traditional poetry. My main interest is تٖينْزِ (tenzi, ballads), so the tools currently focus on that format, but they could be further developed to meet different requirements. Samples of input and output are shown here, and the process is summarised, but these are not (currently) online tools - to use them you need to download Andika! and install a variety of other software on your own PC.

The first step is to transcribe the manuscript using the Swahili keyboard - see the Input tab. This input is then imported into a database, where various annotations can be added to individual words. A transcription is supplied automatically, and an English translation can be added. The imported text can then be output in a number of formats allowing for both print and online publication - see the Output tab.

A sample layout of part of أُتٖينْزِ وَ ڤِيتَ ڤِكُؤُ (Utenzi wa Vita Vikuu, The Ballad of the Great Battle) is given below. (Click to see a full-sized copy.) The manuscript is in the SOAS collection, and this extract containing stanzas 138-147 is taken letter-for-letter from a copy of the manuscript.

The storage of the poem contents in a database allows a number of extra tools to be used in analysing the poem - see the Analysis tab. These include: various word frequency lists (from which a variety of glossaries can be produced), concordances, n-grams, etc.

Format of the input file

تٖينْزِ (tenzi, ballads) consist of quatrains of 8-syllable lines rhyming aaab, where the b rhyme is Va, and is maintained throughout the ballad. Because the lines are short, they are often written two to a manuscript line, with a space dividing them, so each quatrain takes up two manuscript lines. تٖينْزِ (tenzi) which have been printed in Roman script usually print the four lines one after the other, so that the quatrain format is more obvious.

The input file is a LibreOffice document (odt format) where the text of the ballad is transcribed letter-for-letter in a layout reflecting the manuscript procedure - every two lines of the quatrain are written to one line of the file, but they are separated by an asterisk (AltGr+Shift+8 on the Swahili keyboard) instead of space. Each quatrain is separated by a blank line. The input file for the extract from أُتٖينْزِ وَ ڤِيتَ ڤِكُؤُ (Utenzi wa Vita Vikuu) is shown below, and is available in odt format, and also in pdf format.

This input layout may not suit all requirements - for instance, you may prefer to have each line of the quatrain on its own line, or for the quatrains to be numbered, or for the input file to be in text format. Changes in the input routines to handle these would be easy to implement

Items such as لَ زَ يَ نَ na, ya, za, la are written according to the manuscript rendering. In most instances they are attached to the word, or (in the case of non-connecting letters like زَ) written very close to it. But where there is a larger space in the manuscript between the item and the following word, the item has been written separately from that word in the transcription/transliteration. This leads to some inconsistency in the transcription/transliteration, but seeks to reflect the manuscript more faithfully.

Import and annotation of the input file

Once the manuscript is transcribed into the input file, a PHP script is run to import it into a PostgreSQL database table. During this import, the Arabic text is transliterated into standard Swahili Roman orthography, and also into a close transliteration which more closely reflects the Arabic letters.

Another script is then run to segment each line of the ballad into words, which allows each word to have various annotations added to it in the database. Various annotations could be accommodated as necessary, but the ones currently handled are:

Adjustments to the transliterated lemma - for instance, to reflect its scansion, eg miliḥi (salt) instead of مِلْحِ (milḥi, salt).
Segmentation of the transliteration to reflect standard Swahili word boundaries, eg na ẖubuzi (and bread) instead of نَخُبُوزِ (naẖubuzi, and bread).
A note on the meaning or reference of the lemma.
The root of the word, eg fik for akafika (he arrived), or klm for katakalama (he spoke) - this is useful when constructing frequency lists. Many of the roots could be filled in automatically by looking up the word against a digital Swahili dictionary.
An English translation of the line, anchored to the first word of each line.
Variant readings of the word in different manuscript versions.

The first few words of the extract as they appear in this database format are shown below:

Currently, the database table is edited using an interface like phpPgAdmin or SQLWorkbench/J , but it would also be possible to create a web-based interface for this purpose.

Once the annotations have been added to the database import of the input file, it can be output in a variety of formats by running other PHP scripts - see the Output tab.

Output to pdf

Output to pdf (which would be the most likely format for print publication) uses XeTeX, a development of the typesetting system TeX, to produce an intermediate tex file, which is then used to generate a pdf file. The creation of the tex file is handled by a PHP script, so in practice the only reason for opening it would be to make one-off edits.

Notes to the text are printed at the bottom of each page, with the footnote number marked in red in the text. Interpolated letters or other textual emendations are marked in blue.

It would be relatively simple to adjust the layout and contents of the pdf output to meet other requirements.

Output to html

Output to html (the most likely format for web publication) is shown in this html file. In this version, the transcription colour has been changed to blue to fit in with the site livery, and lemma adjustments are in purple. Notes are displayed in a pop-up when the mouse hovers over the note number (to dismiss the note, click on it and move the mouse away). An example of how variant readings might be handled is given in stanza 142 - hovering over the word with a grey background pops up the variant. (Note that this variant reading is an invention for the purposes of this example.)

A paging mechanism could be added to allow easy navigation through a long text.

Again, it would be relatively simple to adjust the layout and contents of the html output to meet other requirements, and indeed, a variant is shown below.

Output to odt

Output to odt (LibreOffice format) is only indirectly supported - it is handled by generating a standalone html file. In this, notes to the text are printed at the end of the entire text, with the endnote number marked in red in the text. The endnote number in the text is a link to the endnote, and the number beside the endnote itself is a link back to the relevant word in the text. To create an odt document from this, open the file in the default KDE web-browser Rekonq (see below), and copy the text using Ctrl+A and Crtl+C. Then open a new LibreOffice document and use Ctrl+V to paste the text into that. When saving the file in LibreOffice, select odt as the save format.

Note that, of the browsers available on Linux, Rekonq provides the best results. It retains all the formatting (fonts, colours, etc), and converts the endnote links in the resulting odt file so that they can be accessed by pressing Ctrl while clicking. Chromium retains the formatting, but does not convert the links, so that they still point to the original html file, and clicking them opens that file in a browser. Chromium also converts some spaces to non-breaking spaces, which are marked with a grey background - to get rid of this, select Tools → Options → LibreOffice Writer → Formatting Aids, untick Non-breaking spaces, and press OK. Firefox is worst of all, losing all the formatting, and leaving the links unconverted. Note that even with Rekonq and Chromium, though, you will need to select all the endnotes in LibreOffice and press the LTR button in order to align them to the left.

You can also open the html file in LibreOffice itself, but in this case you will not be able to save it as an odt file.

With the contents of the poem in a database table, it becomes possible to analyse the text in more detail. This section gives some examples of what might be done here.

Listing words

The following is an alphabetical list of the words in close transcription in the poem extract, along with how often they occur (because the extract consists of only 10 stanzas out of the 900+ in the ballad, most of the words occur only once). Clicking on the word shows the stanzas it appears in, producing a concordance.