When Fran Tyers and I were finishing the first version of the apertium-cy Welsh-English translator in late 2008, he decided to try porting that experience to another Celtic language, Breton. With Welsh, we had received less than enthusiastic responses to our requests to organisations like the Welsh Language Board for publicly-funded language material to be made available under a license compatible with the GPL, so he expected little when he made a similar request to the Breton equivalent, the Ofis ar Brezhoneg. We had to wait 18 months for a non-committal sort-of kind-of response from the WLB, but Fran got an email back from Fulup Jakez at OB the same week, saying that they would make their terminology lists available, and Gwenvael Jekel even worked with him for a couple of weeks to get the project off the ground. The Breton-French translator is on the OB website.
As part of the effort to collect GPLed resources for Breton, I put together the Breizh-Llydaw Sentence Bank, with the kind assistance of Dr Rhisiart Hincks in Aberystwyth. This is a corpus of about 3,500 paired Breton and Welsh sentences, along with a small dictionary of about 1,200 words. I had also come across the name Jan Deloof in various web-searches, but knew very little about him or his work.
It turned out that he lived in the Belgian town of Zwevegem, and not only did he have an 800-page Breton-Dutch dictionary, but he was quite happy for it to be licensed under the GPL, and (at the age of 80) was still working hard on his Dutch-Breton dictionary! What a man! The dictionary was in Microsoft Word document format, and held on an FTP site which hadn’t come up in any of my searches. So I offered to port the dictionary onto the Web by converting it to a database format, and provide a web interface for it similar to the one for Eurfa and the Sentence Bank. It took me longer than expected (nearly 18 months!) to get around to this, though Jan was always gracious when I contacted him to apologise – “I hadn’t really noticed the passage of time – I’m working my way through the letter K”.
However, at long last this wonderful piece of free (GPL) data can get the wider audience it deserves. After three weeks of work, a database version is now available. The first step was converting the doc files into text using OpenOffice.org Writer, and joining the broken lines that result from that. For this I used the amazing sed, and as I used it more I both marvelled at its capabilities, and wondered why it has taken me so long to discover it. I was able to use sed to correct the encoding issues that plague Microsoft Word documents, and do an initial tag of the text to prepare it for database import. In the source files, font attributes alone (eg bold) delineate things like citations, and it was difficult to find a way of segmenting these consistently. I tried converting to HTML, and then searching and replacing on tags, but that didn’t work well either. In the end, I just read through the treated file as produced by sed, added some tags manually, and then let good old PHP slice and dice each line into a PostgreSQL table.
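The line-joining step can be illustrated roughly as follows. This is only a sketch in Python rather than the sed commands actually used, and it rests on an assumption that is mine, not something the source files guarantee: that every entry starts with a single-token headword followed by a bracketed grammatical tag such as (m) or (ww), so any line that does not start that way is a continuation of the previous one. The names ENTRY_START and join_broken_lines are made up for the illustration.

```python
import re

# Assumption: an entry begins with one token followed by a tag in
# parentheses, e.g. "diskibl (m)" or "dec'hervel (ww)".
ENTRY_START = re.compile(r"^\S+ \([^)]+\)")


def join_broken_lines(lines):
    """Glue continuation lines back onto the entry they belong to."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines left over from the conversion
        if ENTRY_START.match(line) or not entries:
            entries.append(line)  # a new entry starts here
        else:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries


if __name__ == "__main__":
    broken = [
        "diskibl (m) [-ed, diskiblien]: leerling;",
        "volgeling",
        "dec'hervel (ww) dec'halvet: bij zich",
        "roepen",
    ]
    for entry in join_broken_lines(broken):
        print(entry)
```

Multiword headwords would defeat this simple pattern, which is one reason a manual read-through of the result is still needed.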
I separated the headwords proper from the citations Jan gives (called phrases here), on the grounds that this means that the citations will come up on any searchword that they contain, thus extending their relevance. Separating out the bulk of multiword entries may also be useful in building machine translation systems. The only drawback is that where the phrases were specifically chosen to illustrate a particular grammatical point, that relevance is slightly weakened now. The main determiner of whether something was a phrase was basically whether it had gender information attached!
I’ve also added the ability to look up mutated words, as in Eurfa. For instance, if you enter bro, you get words related to that, but if you enter the soft-mutated form, vro, you will get words containing vro, and also ones containing bro. I need to expand this to cater for the hard, spirant and mixed mutations, and also make it more consistent (for instance, for bro you get vro in the phrases section, but not in the words section). The functions and queries used to do this are quite interesting, and I suppose I should package up something about this, because it may be useful, if only as a source of ideas, to other people working with Celtic languages.
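To give a flavour of the idea, here is a minimal sketch in Python of how a search word might be "un-mutated" before lookup. It is not the code behind the site (which uses PHP and database queries); the function name radical_candidates and the table layout are invented for the illustration, and it covers the soft mutation only. The approach deliberately over-generates: every possible radical is tried, and those that don't exist in the dictionary simply return nothing.

```python
# Soft mutation (lenition) in Breton: radical initial -> mutated initial.
SOFT_MUTATION = {
    "k": "g",
    "t": "d",
    "p": "b",
    "gw": "w",
    "g": "c'h",
    "d": "z",
    "b": "v",
    "m": "v",
}


def radical_candidates(word):
    """Return the word itself plus every radical it might be a soft mutation of."""
    candidates = [word]
    for radical, mutated in SOFT_MUTATION.items():
        if word.startswith(mutated):
            candidate = radical + word[len(mutated):]
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


if __name__ == "__main__":
    # "vro" could come from "bro" or from a (non-existent) "mro";
    # the dictionary lookup itself weeds out the false candidates.
    print(radical_candidates("vro"))
```

Each candidate would then be fed into the same search as the word the user typed, so that vro finds entries for both vro and bro.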
A good few things in the db itself also need to be tidied up, though. For instance, the tagging wasn’t perfect, especially for the first letters I did (A and B). Where you have something like:
diskibl (m) [-ed, diskiblien]: leerling; volgeling
the tagging separates the two plurals (diskibled and diskiblien), so that the latter is prefixed to the definition, though separated by two slashes. So the next step is to run some db queries to shift those items out to where they belong – the plural field. I also used the plural field to hold the past participle of verbs, so those need to be given a field of their own:
dec'hervel (ww) dec'halvet: bij zich roepen
Another issue related to both those points is that if you enter diskiblien or dec’halvet into the dictionary, you will get no results back, even though there is information aplenty there about the lemmas. So those inflected items need to be pulled out and reformatted so that they can be used as headwords. In Eurfa, I put all of these into the one table, but it may be better to put them into separate tables (plurals, participles, etc) – time for some head-scratching.
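As a rough sketch of what that reformatting might look like, the Python below splits the two sample entries above into fields, expands a suffix like -ed into the full plural, and builds an index from each inflected form back to its lemma. The regular expression is an assumption based on those two examples only, and the names ENTRY, parse_entry and inflected_index are invented here; the real data has many more patterns than this handles.

```python
import re

# Assumed layout: headword (tag) [plurals] participle: definition
# where the bracketed plurals and the participle are both optional.
ENTRY = re.compile(
    r"^(?P<headword>.+?) \((?P<tag>[^)]+)\)"
    r"(?: \[(?P<plurals>[^\]]+)\])?"
    r"(?: (?P<participle>[^:\[\]]+))?"
    r": (?P<definition>.+)$"
)


def parse_entry(line):
    """Split one dictionary line into fields, or return None if it doesn't fit."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    headword = match.group("headword")
    plurals = []
    if match.group("plurals"):
        for form in match.group("plurals").split(","):
            form = form.strip()
            # "-ed" is a suffix on the headword; anything else is a full form.
            plurals.append(headword + form[1:] if form.startswith("-") else form)
    return {
        "headword": headword,
        "tag": match.group("tag"),
        "plurals": plurals,
        "participle": match.group("participle"),
        "definition": match.group("definition"),
    }


def inflected_index(entries):
    """Map each plural or participle to the lemma(s) it belongs to."""
    index = {}
    for entry in entries:
        forms = list(entry["plurals"])
        if entry["participle"]:
            forms.append(entry["participle"])
        for form in forms:
            index.setdefault(form, []).append(entry["headword"])
    return index


if __name__ == "__main__":
    lines = [
        "diskibl (m) [-ed, diskiblien]: leerling; volgeling",
        "dec'hervel (ww) dec'halvet: bij zich roepen",
    ]
    parsed = [parse_entry(line) for line in lines]
    print(inflected_index(parsed))
```

Whether the index ends up in one table or several, the lookup is the same: a search for diskiblien first finds diskibl, and then returns that entry.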
The Dutch-Breton direction is useable, but not ideal. The Dutch side contains multiple entries for each Breton headword, so the responses are a bit difficult to filter when reading. What needs to be done here is to split the Dutch entries at each comma or semicolon, and write them into the table as separate entries. Again, one-to-one correspondences may be useful in MT work (although then the problem of disambiguation appears).
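The split itself is simple enough to sketch in a few lines of Python (the function name split_equivalents is invented for the illustration, and the real job would be done against the database table rather than on strings like this):

```python
import re


def split_equivalents(breton, dutch):
    """Turn one Breton headword with several Dutch glosses into one pair per gloss."""
    parts = re.split(r"[,;]", dutch)
    return [(breton, part.strip()) for part in parts if part.strip()]


if __name__ == "__main__":
    for pair in split_equivalents("diskibl", "leerling; volgeling"):
        print(pair)
```

One caveat: a blind split like this would also cut at commas inside brackets or inside a longer explanatory gloss, so the results would need checking before they replace the original rows.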
The only drawback I can see here is that if the Breton headword originally had 4 or 5 Dutch equivalents, searching the revised table from the Breton side may provide a surfeit of results – 4 or 5 entries for the same Breton word, plus plurals, plus mutated forms, etc. But I think this can be mediated by the interface if it is designed properly.
One definite benefit is that the table could then be used to check or help build the Dutch-Breton dictionary that Jan is steaming through (he already has half of it completed!). A single query could order the table in terms of the Dutch headword, and then a PHP script could amalgamate all the Breton equivalents listed for that Dutch headword into one Breton entry, ready for checking. Any amendments or additions which that throws up can then be incorporated into the db table.
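The amalgamation step might look something like this, sketched in Python rather than PHP. It assumes the table has already been split into one-to-one (Breton, Dutch) pairs; the function name dutch_to_breton is invented, and the skoliad pair is there purely for illustration and has not been checked against Jan's dictionary.

```python
from collections import defaultdict


def dutch_to_breton(pairs):
    """Group Breton equivalents under each Dutch headword, in Dutch alphabetical order."""
    grouped = defaultdict(list)
    for breton, dutch in pairs:
        if breton not in grouped[dutch]:
            grouped[dutch].append(breton)  # keep first-seen order, skip duplicates
    return {dutch: "; ".join(grouped[dutch]) for dutch in sorted(grouped)}


if __name__ == "__main__":
    pairs = [
        ("diskibl", "leerling"),
        ("diskibl", "volgeling"),
        ("skoliad", "leerling"),  # illustrative pair, not taken from the dictionary
    ]
    for dutch, breton in dutch_to_breton(pairs).items():
        print(f"{dutch}: {breton}")
```

In the database, the sorting would be done by the query and only the joining by the script, but the result is the same: one draft Dutch entry per headword, ready to be compared with what Jan has written.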
But in the meantime Jan’s dictionary is useable. With over 40,000 entries, it’s an impressive achievement, and certainly the largest free (GPL) dictionary available for any Celtic language.