Archive for March, 2010

Swahili in Arabic script: a howto

March 27th, 2010

I’ve written up what I did to set up my machine to write Swahili in Arabic script, and the result is contained in this howto document. The keyboard layout file listed in Annex 1 can be downloaded here. Any corrections or additions are welcome.

I give another example there of a transcription from the Jaafari manuscript:

What has actually surprised me is that it is almost as fast to type Swahili in Arabic script as it is in Roman, and it takes much less time than it would to add all the diacritics to make a proper transliteration of the manuscript.

I’m not sure whether this has already been done for other African languages that have used Arabic script in the past (eg Hausa), but it seems a useful way of using modern technology to help safeguard an important area of cultural heritage.

White hair …

March 26th, 2010

I was reading some Chinese poems today, and came across a couple of lines by 杜甫 (Dù Fǔ, 712-770AD) which sadly are all too applicable to me:

“When I scratch my white hair it gets even thinner,
There is hardly enough left to fix a hairpin.”

Swahili in Arabic script on WordPress

March 26th, 2010

It’s possible to get Swahili in Arabic script to display quite nicely in WordPress using Firefox (3.5.7) on Linux (Kubuntu 9.10), after a bit of tweaking. Here is the last stanza of the Utendi wa Ja’afari:

نِمٖپٖنْدَ كُكَرِرِ nimependa kukariri
نَنْيِ سٗمَنِ ضَمِيْرِ nanyi somani ḍamïri
أُتٖنْدِ وَ جَعْفَرِ utendi wa Ja’fari
وَمَوْلاَنَا عَلِيَ wa Maũlãnã ‘Aliya

You need to have the SIL Scheherazade font installed to see it to best advantage. It should look like this:

The default fonts on Linux seem to be missing glyphs for 067E (peh) in serif or non-serif fonts, and the glyph for 06A0 (ain with three dots above) doesn’t show up in a medial form. (That makes typing into the edit box on WordPress a little more difficult than it need be, but it’s not a show-stopper.) Even with Scheherazade installed, Konqueror (4.3.2) produces a bit of a mess unless I set the fallback font in the stylesheet to monospace (see below). Even then, it doesn’t see the glyph for 0656 (subscript alef) which I’m using for the vowel e (in the absence of a vertical equivalent of 0650, kasra) or 0657 (inverted damma) for the vowel o. That means it puts a box there:

An older version of Konqueror (3.5.5) on openSUSE 10.2 puts the boxes inline, and this leads to medials not being joined.

On Apple Mac OSX (10.4.11) with Scheherazade installed, Safari (3.1.1) behaves similarly to the older version of Konqueror on Linux (ie boxes instead of e, and medials unjoined). Firefox (3.0.17) is roughly the same, but it doesn’t even seem to see the Scheherazade font, so it falls back to monospace and gets the alignment of the <span> (see below) wrong:

I have no Microsoft Windows machine to test this on.
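For anyone checking their own fonts, the code points mentioned above can be listed with their official Unicode names. This is only a small sketch using Python's standard unicodedata module; the list of code points is taken from this post, and kasra (0650) is included just for comparison.

```python
import unicodedata

# Code points named in this post, plus kasra (0650) for comparison.
CODE_POINTS = [0x067E, 0x06A0, 0x0656, 0x0657, 0x0650]


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME (category CC)' for a single code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"


for cp in CODE_POINTS:
    print(describe(cp))
```

The vowel signs come out with category Mn (non-spacing mark), which is why a font lacking them leaves a box sitting where the mark should combine with its base letter.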

The default setup produces Arabic script that is too minuscule for my old eyes (I have the same problem with Chinese!), so I’ve added a couple of CSS stanzas to the theme stylesheet to adjust positioning and size:

.swahili {
    font-family: Scheherazade, monospace;
    font-size: 220%;
    line-height: 130%;
    unicode-bidi: embed;
    text-align: right;
    width: 80%;
}

.trlit {
    font-family: "Liberation Sans", sans-serif;
    font-size: 50%;
    margin-left: 50px;
    text-align: left;
}

The first handles the Swahili, and the second handles the transliteration.

Then the actual Swahili in the post looks like this:

I’m sure all of this (especially the CSS) could be done more elegantly, but it’s a start. There are many things I still have to figure out, though. My next post will be on setting up your PC to type Swahili in Arabic script.


March 25th, 2010

My wife needed some pdfs sliced and diced to put on a work-related website, so I ended up learning a bit about the PDF Toolkit.

My HP printer has a rather nice webscan app, which scans over the network directly into a pdf on the computer.  However, the scans average around 1MB a page, which is far too big.  So we need to squish them down a bit first, using a ghostscript invocation I found here:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH -sOutputFile=1a.pdf 1.pdf

This will usually reduce the size to less than 100KB.

We can combine individual pdfs using pdftk:

pdftk first.pdf second.pdf cat output combined.pdf

and conversely we can split an existing pdf into individual pages by using:

pdftk mybig.pdf burst
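If there are a lot of scanned pages, the squish-then-combine steps can be wrapped in a small script. The following is only a sketch: the function names and the "-small" suffix are my own choices, and it assumes gs and pdftk are on the PATH. The command lines themselves are the ones given above.

```python
import subprocess
from pathlib import Path


def gs_compress_cmd(src, dst):
    """Build the ghostscript command used above to shrink one pdf."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/screen", "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}", str(src),
    ]


def pdftk_cat_cmd(parts, dst):
    """Build the pdftk command that joins several pdfs into one."""
    return ["pdftk", *[str(p) for p in parts], "cat", "output", str(dst)]


def compress_and_combine(pages, combined, run=subprocess.run):
    """Shrink each page with gs, then join the shrunken pages with pdftk.

    `run` is a parameter only so that the commands can be inspected
    without actually executing anything.
    """
    small = []
    for page in pages:
        page = Path(page)
        out = page.with_name(page.stem + "-small.pdf")  # hypothetical naming
        run(gs_compress_cmd(page, out), check=True)
        small.append(out)
    run(pdftk_cat_cmd(small, combined), check=True)
    return small
```

Calling compress_and_combine(["1.pdf", "2.pdf"], "combined.pdf") would then run gs once per page and pdftk once at the end.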

You can open a pdf in the GIMP for touching up (eg removing the darkest bits from a bad scan if you don’t have the original to rescan), and then save it out as a Postscript file.  After that, you can convert the ps file to a pdf by running:

ps2pdf mynew.ps mynew.pdf

Theoretically, pdfs should orientate themselves correctly automatically.  If you have some existing scanned pages that are in portrait format when they should be landscape, and you can’t get them to show up in the correct orientation, you can open them in the GIMP, correct the orientation there, save as Postscript, and then use ps2pdf to create the pdf.  Occasionally, however, when you open the reoriented landscape file, you get it appearing in portrait mode, with the right-hand half of the file disappearing off the edge of the sheet.  You can fix this by using another ghostscript invocation directly on the Postscript file:

gs -dBATCH -dNOPAUSE -sOutputFile=mylandscape.pdf -sDEVICE=pdfwrite \
  -c "<< /PageSize [792 612]  >> setpagedevice"  -f

Jan Deloof Breton-Dutch Dictionary

March 25th, 2010

When Fran Tyers and I were finishing the first version of the apertium-cy Welsh-English translator in late 2008, he decided to try porting that experience to another Celtic language, Breton.  With Welsh, we had received less than enthusiastic responses to our requests to organisations like the Welsh Language Board for publicly-funded language material to be made available under a license compatible with the GPL, so he expected little when he made a similar request to the Breton equivalent, the Ofis ar Brezhoneg.  We had to wait 18 months for a non-committal sort-of kind-of response from the WLB, but Fran got an email back from Fulup Jakez at OB the same week, saying that they would make their terminology lists available, and Gwenvael Jekel even worked with him for a couple of weeks to get the project off the ground.  The Breton-French translator is on the OB website.

As part of the effort to collect GPLed resources for Breton, I put together the Breizh-Llydaw Sentence Bank, with the kind assistance of Dr Rhisiart Hincks in Aberystwyth.  This is a corpus of about 3,500 paired Breton and Welsh sentences, along with a small dictionary of about 1,200 words.  I had also come across the name Jan Deloof in various web-searches, but knew very little about him or his work.

It turned out that he lived in the Belgian town of Zwevegem, and not only did he have an 800-page Breton-Dutch dictionary, but he was quite happy for it to be licensed under the GPL, and (at the age of 80) was still working hard on his Dutch-Breton dictionary!  What a man!  The dictionary was in Microsoft Word document format, and held on an FTP site which hadn’t come up in any of my searches.  So I offered to port the dictionary onto the Web by converting it to a database format, and provide a web interface for it similar to the one for Eurfa and the Sentence Bank.  It took me longer than expected (nearly 18 months!) to get around to this, though Jan was always gracious when I contacted him to apologise – “I hadn’t really noticed the passage of time – I’m working my way through the letter K”.

However, at long last this wonderful piece of free (GPL) data can get the wider audience it deserves.  After three weeks of work, a database version is now available.  The first step was converting the doc files into text using Writer, and joining the broken lines that result from that.  For this I used the amazing sed, and as I used it more I both marvelled at its capabilities, and wondered why it has taken me so long to discover it.  I was able to use sed to correct the encoding issues that plague Microsoft Word documents, and do an initial tag of the text to prepare it for database import.  In the source files, font attributes alone (eg bold) delineate things like citations, and it was difficult to find a way of segmenting these consistently.  I tried converting to HTML, and then searching and replacing on tags, but that didn’t work well either.  In the end, I just read through the treated file as produced by sed, added some tags manually, and then let good old PHP slice and dice each line into a PostgreSQL table.

I separated the headwords proper from the citations Jan gives (called phrases here), on the grounds that this means that the citations will come up on any searchword that they contain, thus extending their relevance.  Separating out the bulk of multiword entries may also be useful in building machine translation systems.  The only drawback is that where the phrases were specifically chosen to illustrate a particular grammatical point, that relevance is slightly weakened now.  The main determiner of whether something was a phrase was basically whether it had gender information attached!

I’ve also added the ability to look up mutated words, as in Eurfa.  For instance, if you enter bro, you get words related to that, but if you enter the soft-mutated form, vro, you will get words containing vro, and also ones containing bro.  I need to expand this to cater for the hard, spirant and mixed mutations, and also make it more consistent (for instance, for bro you get vro in the phrases section, but not in the words section).  The functions and queries used to do this are quite interesting, and I suppose I should package up something about this, because it may be useful, if only as a source of ideas, to other people working with Celtic languages.
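As a rough illustration of the idea (not the actual functions and queries behind the dictionary, which are PHP and PostgreSQL), here is a Python sketch of undoing the soft mutation before lookup. The mutation table is the standard Breton soft mutation as I understand it, and the tiny headword set is a stand-in for the real table.

```python
# Breton soft mutation of initial consonants: radical -> mutated.
SOFT_MUTATION = {
    "k": "g", "t": "d", "p": "b", "gw": "w",
    "g": "c'h", "d": "z", "b": "v", "m": "v",
}

# Stand-in for the headword column of the real table (an assumption).
HEADWORDS = {"bro", "mamm"}


def radical_candidates(word):
    """Return the word itself plus every radical it could be a soft mutation of."""
    candidates = [word]
    for radical, mutated in SOFT_MUTATION.items():
        if word.startswith(mutated):
            candidate = radical + word[len(mutated):]
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


def lookup(word, headwords=HEADWORDS):
    """Return the candidates that actually exist as headwords."""
    return [c for c in radical_candidates(word) if c in headwords]


print(radical_candidates("vro"))  # ['vro', 'bro', 'mro']
print(lookup("vro"))              # ['bro']
```

The candidate list deliberately over-generates (mro is not a word), and the check against the headwords throws away the nonsense. The hard, spirant and mixed mutations would each need a table of their own.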

A good few things in the db itself also need to be tidied up, though.  For instance, the tagging wasn’t perfect, especially for the first letters I did (AB).  Where you have something like:

diskibl (m) [-ed, diskiblien]:  leerling; volgeling

the tagging separates the two plurals (diskibled and diskiblien), so that the latter is prefixed to the definition, though separated by two slashes.  So the next step is to run some db queries to shift those items out to where they belong – the plural field.  I also used the plural field to hold the past participle of verbs, so those need to be given a field of their own:

dec'hervel (ww) dec'halvet: bij zich roepen

Another issue related to both those points is that if you enter diskiblien or dec’halvet into the dictionary, you will get no results back, even though there is information aplenty there about the lemmas.  So those inflected items need to be pulled out and reformatted so that they can be used as headwords.  In Eurfa, I put all of these into the one table, but it may be better to put them into separate tables (plurals, participles, etc) – time for some head-scratching.
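To make the problem concrete, here is a Python sketch of how the two example lines above could be pulled apart and the inflected forms indexed back to their lemmas. The regular expression and the field names are my own assumptions based only on those two lines; the real source files are a good deal messier, and the actual import was done with sed and PHP.

```python
import re

# headword (pos) [plurals] participle: definition
ENTRY = re.compile(r"""
    ^(?P<head>\S+)\s+
    \((?P<pos>[^)]+)\)\s*
    (?:\[(?P<plurals>[^\]]*)\])?\s*
    (?P<participle>[^:\[\]]*?)\s*
    :\s*(?P<definition>.*)$
""", re.VERBOSE)


def parse_entry(line):
    """Split one dictionary line into fields, expanding '-ed' style plurals."""
    m = ENTRY.match(line.strip())
    if m is None:
        return None
    head = m["head"]
    plurals = []
    for item in (m["plurals"] or "").split(","):
        item = item.strip()
        if item:
            # '-ed' is a suffix on the headword; anything else is a full form.
            plurals.append(head + item[1:] if item.startswith("-") else item)
    return {
        "head": head,
        "pos": m["pos"],
        "plurals": plurals,
        "participle": m["participle"] or None,
        "definition": m["definition"].strip(),
    }


def inflected_index(entries):
    """Map each plural or participle back to its lemma."""
    index = {}
    for e in entries:
        for form in e["plurals"] + ([e["participle"]] if e["participle"] else []):
            index[form] = e["head"]
    return index


lines = [
    "diskibl (m) [-ed, diskiblien]:  leerling; volgeling",
    "dec'hervel (ww) dec'halvet: bij zich roepen",
]
entries = [parse_entry(line) for line in lines]
print(inflected_index(entries))
```

With an index like this, a search for diskiblien or dec'halvet can be redirected to diskibl or dec'hervel, whether the forms end up in one table or several.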

The Dutch-Breton direction is useable, but not ideal.  The Dutch side contains multiple entries for each Breton headword, so the responses are a bit difficult to filter when reading.  What needs to be done here is to split the Dutch entries at each comma or semicolon, and write them into the table as separate entries.  Again, one-to-one correspondences may be useful in MT work (although then the problem of disambiguation appears).
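The split itself is simple enough. A minimal Python sketch, using the diskibl entry from above (the function name is made up, and the real job would be done against the database table):

```python
import re


def split_equivalents(breton, dutch):
    """Turn one Breton headword with several Dutch glosses into one pair per gloss."""
    parts = [p.strip() for p in re.split(r"[;,]", dutch)]
    return [(breton, p) for p in parts if p]


print(split_equivalents("diskibl", "leerling; volgeling"))
# [('diskibl', 'leerling'), ('diskibl', 'volgeling')]
```

This naive version would also split on a comma inside brackets or inside a longer explanatory gloss, so the results would need checking by eye.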

The only drawback I can see here is that if the Breton headword originally had 4 or 5 Dutch equivalents,  searching the revised table from the Breton side may provide a surfeit of results – 4 or 5 entries for the same Breton word, plus plurals, plus mutated forms, etc.  But I think this can be mediated by the interface if it is designed properly.

One definite benefit is that the table could then be used to check or help build the Dutch-Breton dictionary that Jan is steaming through (he already has half completed!).  A single query could order the table in terms of the Dutch headword, and then a PHP script could amalgamate all the Breton equivalents listed for that Dutch headword into one Breton entry, ready for checking.  Any amendments or additions which that throws up can then be incorporated into the db table.
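The amalgamation step might look something like the following, sketched in Python rather than PHP. The diskibl pairs come from the entry quoted earlier; the skoliad row is an illustrative assumption of mine, not quoted from Jan's dictionary.

```python
from collections import defaultdict


def amalgamate(pairs):
    """Group (breton, dutch) pairs by Dutch headword, in Dutch alphabetical order."""
    grouped = defaultdict(list)
    for breton, dutch in pairs:
        if breton not in grouped[dutch]:
            grouped[dutch].append(breton)
    return {dutch: "; ".join(bretons) for dutch, bretons in sorted(grouped.items())}


pairs = [
    ("diskibl", "leerling"),
    ("diskibl", "volgeling"),
    ("skoliad", "leerling"),  # illustrative assumption, not from the dictionary
]
print(amalgamate(pairs))
# {'leerling': 'diskibl; skoliad', 'volgeling': 'diskibl'}
```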

But in the meantime Jan’s dictionary is useable.  With over 40,000 entries, it’s an impressive achievement, and certainly the largest free (GPL) dictionary available for any Celtic language.

Swahili in Arabic script

March 24th, 2010

Over the past couple of days, in preparation for a little project on Swahili, I’ve been investigating getting Kubuntu to show Swahili in Arabic script (the way in which it was traditionally written on the coast). I seem to have got there, and I’ll do a longer post later on how to configure your system to allow this.

In the meantime, as a taster, here is a verse from an utendi (ballad), typed in from a photocopy of the original manuscript:

It means, roughly: “I have accomplished the telling of these [things], and I will put others into verse, that you may all understand them, I have enjoyed relating [them] to you.”

This is still a work-in-progress, but it seems to be on the right track for now.

Chinese cars

March 2nd, 2010

With the official environment for Welsh being so unremittingly hostile to free (GPL) software, over the last couple of months I’ve taken to learning Chinese as a pick-me-up, mainly via the excellent UWB Life-Long Learning courses run by Manman Jones, and also via the international students at the Business School. Jason and Frances taught me various things about cars last night:

我花了五千镑买的这辆车。(的 is a really interesting syntactic “glue” word)

我的车是雪铁龙。(Citroën phonetically, but it translates to Snow Iron Dragon, which is very poetic!)
我的车是大众。(Volkswagen – because they’re so popular!)
我的车是丰田。(Toyota – very popular in the South)
我的车是奥迪。(Audi phonetically)
我的车是沃尔沃。(Volvo phonetically)
我的车是尼桑。(Nissan phonetically)


I’ve put them up on purpose without pinyin, so that I force myself to remember the characters when I read the sentences again. If you’d like a transliteration, use the free Firefox dictionary plugin at the excellent Popup Chinese site manned by David Lancashire and colleagues – well worth a subscription if you want to access a lot of good modern idiomatic stuff to help you learn Chinese. There’s also a plugin there that uses Nathan Dummitt’s system for colourising Chinese characters according to tone: red (tone 1, high); orange (tone 2, rising); green (tone 3, low-lift); blue (tone 4, falling); black (neutral). This is quite helpful when you’re trying to read what you’ve written out loud. For speed, it’s best to open this post on its own page first, and then call the colouriser.
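The colour scheme is easy to reproduce if you want to colourise your own notes. This is just a sketch of the mapping described above, not the Popup Chinese plugin's code; it assumes you supply the pinyin yourself as numbered syllables (eg du4 fu3), with one syllable per character.

```python
import html

# Tone -> colour, as described above; anything else counts as neutral.
TONE_COLOURS = {1: "red", 2: "orange", 3: "green", 4: "blue"}


def colourise(chars, pinyin):
    """Wrap each character in a <span> coloured by the tone of its syllable."""
    syllables = pinyin.split()
    if len(chars) != len(syllables):
        raise ValueError("need exactly one pinyin syllable per character")
    spans = []
    for ch, syl in zip(chars, syllables):
        tone = int(syl[-1]) if syl[-1].isdigit() else 0
        colour = TONE_COLOURS.get(tone, "black")
        spans.append(f'<span style="color: {colour}">{html.escape(ch)}</span>')
    return "".join(spans)


print(colourise("杜甫", "du4 fu3"))
```

For 杜甫 this gives a blue 杜 (tone 4, falling) followed by a green 甫 (tone 3, low-lift).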