Archive for the ‘Welsh’ category

Kwici – a Welsh Wikipedia corpus

January 14th, 2014

A couple of days ago I visited a resources page belonging to Linas Vepstas, the maintainer of the Link Grammar package of the OpenCog project, and that took me to the WaCky website, with some nice examples of tagged corpora.

That in turn led to Giuseppe Attardi and Antonio Fuschetto’s Wikipedia Extractor, which generates plain text from a Wikipedia dump, discarding markup etc. I’ve tried this sort of thing before, and the results leave a lot to be desired, but this one worked wonderfully.

Download it, make it executable, and point it at the dump: ./WikiExtractor.py -cb 250K -o output < cywiki-20131230-pages-articles-multistream.xml. The -cb switch compresses the output into files of 250K, and the -o switch stores them in the output dir.

Then combine all these into one big file: find output -name '*bz2' -exec bzip2 -d -c {} \; > text.xml (the command on the website is missing the -d switch).
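If you'd rather not shell out to find, the same recombination can be sketched in Python using the standard bz2 module (combine_chunks is a hypothetical helper of my own, not part of Wikipedia Extractor):

```python
import bz2
import glob

def combine_chunks(pattern, dest):
    """Decompress every .bz2 chunk matching the glob pattern and
    concatenate the plain text into a single output file."""
    with open(dest, "wb") as out:
        for path in sorted(glob.glob(pattern, recursive=True)):
            with bz2.open(path, "rb") as chunk:
                out.write(chunk.read())

# e.g. combine_chunks("output/**/*.bz2", "text.xml")
```

Sorting the paths keeps the chunks in the order WikiExtractor wrote them.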

This gave a very tidy file with each document in its own doc tag. About the only stuff it didn't strip was some English in includeonly tags - I'm not sure what these are for, but they may have been updates that hadn't been translated yet. So Wikipedia Extractor did exactly what it said on the tin!

Once the doc tags, the English, and blank lines were removed, the text was relatively easily split into sentences, and imported into PostgreSQL for more tidying. I spent a day on that, but I didn't like the results, so I redid it over another day, and the outcome is Kwici - a 200k-sentence, 3.9m-word corpus of Welsh.
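The first pass of that tidying can be sketched like this (extract_sentences is a made-up name, and the single-regex splitter is far more naive than the clean-up actually done):

```python
import re

def extract_sentences(text):
    """Drop the <doc ...> wrappers and blank lines, then split the
    remaining prose into sentences on terminal punctuation."""
    # remove the <doc id=...> and </doc> wrapper tags
    text = re.sub(r"</?doc[^>]*>", "", text)
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # split on . ! or ? followed by whitespace
        for sent in re.split(r"(?<=[.!?])\s+", line):
            if sent:
                sentences.append(sent)
    return sentences
```

A splitter this simple mishandles abbreviations and numbers, which is part of why the real clean-up took a couple of days.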

Kwici is now added to Eurfa as a seventh citation source.

More corpora …

December 16th, 2013

Korrect/Kywiro was a collection of English-Welsh software translations that I put together in 2004, when there was a lot of community activity going on. It fell off the web a few years ago, but I’ve now resurrected it. In the new version I’ve chopped up the longer strings into chunks of one or two sentences, and stripped out a lot of the HTML. I also spent some time fixing up character encoding issues, which is an occupational hazard of dealing with older text from various machines, not all of which will have been using UTF-8. Hopefully it should be fairly tidy now, so I’ve added it to the citation sources at Eurfa.

Going even further back, in the late 90s Bob Morris Jones and colleagues put together two corpora dealing with child language acquisition – CIG1 looked at children aged 18-30 months, and CIG2 looked at children aged 3-7 years. I’m not sure how many people know about or use these, but they’re an excellent resource that deserves to be more widely available, so I’ve imported the text into a database to provide a searchable interface to them at Kig. That site also includes the original .cha files, along with information from the original website (which is showing signs of bitrot).

Although other search parameters could be introduced (along the lines of the BangorTalk site), I’ve chosen for the moment just to split the searches between child and adult utterances, since I think that’s what most people would be interested in initially – what sort of output do the children in these corpora produce? (That in turn should probably be segmented by age, but the data to do that is now available in the downloadable versions of the database, and it can be added to the website if there is any demand).

The adult utterances, of course, can also be used as another citation source in Eurfa, so I’ve added them there too.

Two more corpora for Eurfa

November 13th, 2013

I’ve just added citations from the Siarad and Patagonia corpora to Eurfa. Both these corpora are GPL-licensed, and were put together by a team led by Prof. Margaret Deuchar. They contain transcriptions of actual spoken Welsh – and as with all languages, the spoken version can be a rather different beast from the formal written language – and Siarad in particular contains quite a few “codeswitches” (where the speaker uses a word from another language, in this case English, in utterances that contain the main language, in this case Welsh). In the past, people have criticised this sort of thing as a signal of poor Welsh, but in fact the evidence suggests that it is actually a marker of linguistic competence – only speakers who are equally capable in both languages tend to do this, while those who are less capable tend to stick to one language.

The corpora provide important data for looking at this and other bilingual phenomena, and among other things have been used to show that the degree to which a codeswitch becomes a loanword is related to its frequency, that “cognates” (words that are similar in both languages, eg siop/shop, or perswadio/persuade) tend to raise the likelihood of codeswitches being used for other words (especially nouns), and so on.

The Patagonia corpus tends to use codeswitching less than the Siarad one, with the speakers switching into the other language for entire utterances or passages, and of course the main codeswitching language in that corpus is Spanish rather than English.

For use in Eurfa, both corpora had to be stripped down a bit. If you look at the material on Bangortalk, you’ll see that it follows the CLAN transcription format, which includes various markers for pauses, hesitations, non-verbal interactions, and so on. To make things easier to read, it made sense to get rid of most of these, with the exception of the brackets to indicate elision, eg mae’n gets represented as mae (y)n.

Then utterances where the transcription was less than 20 characters were removed – about 45% of the original 78,000-odd utterances in Siarad, and about 53% of the original 37,500-odd in Patagonia. One of the things that is noticeable in these corpora is how short the average spoken utterance is compared to the written language.
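A rough Python sketch of those two steps (the marker patterns here are illustrative only; real CLAN transcripts use many more marker types):

```python
import re

def clean_utterance(utt):
    """Strip common CLAN-style markers but keep the elision
    brackets, e.g. 'mae (y)n' stays as it is."""
    utt = re.sub(r"&\S+", "", utt)        # event/filler codes like &=laughs
    utt = re.sub(r"\[[^\]]*\]", "", utt)  # bracketed retraces and comments
    utt = re.sub(r"[<>+/]", "", utt)      # angle brackets and pause symbols
    return re.sub(r"\s+", " ", utt).strip()

def keep(utt, min_len=20):
    """Discard utterances shorter than min_len characters."""
    return len(utt) >= min_len
```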

The last thing was to remove the utterances that are entirely in a language other than Welsh. The CLAN format was changed a couple of years ago to incorporate what I consider a regression. Previously, a default language would be chosen for the transcription (usually the most frequent language in the conversation), and all words not in that language would be tagged. The new system allows a “precode” marker to be attached to utterances which are entirely in the non-default language, presumably on the grounds that this makes the transcriptions easier to read. However, the problem with this is that you can no longer tell simply from the word what language it is in: an untagged word could be Welsh if the default language is Welsh, but it could also be English if there is a precode marker.

I chose the “easy” (aka “fast”) way to do this: deleting any items with precodes marking English or Spanish. Unfortunately, that still leaves items that were not marked with precodes because the default language of the transcription was English or Spanish! I looked at addressing this with Alberto Simões’ Lingua::Identify, but, as I expected, the results are variable: the utterances are mostly too short to give a good “fingerprint”, and this is compounded by the brackets, the lack of punctuation, and the codeswitches.
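To see why short utterances give a poor “fingerprint”, here is a toy character-trigram profile of the kind n-gram language identifiers build (an illustration only, not Lingua::Identify's actual algorithm):

```python
from collections import Counter

def trigram_profile(text, n=50):
    """Top-n character trigrams: the sort of fingerprint an
    n-gram language identifier compares against its models."""
    text = text.lower()
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g for g, _ in grams.most_common(n)}

def similarity(profile_a, profile_b):
    """Overlap between two fingerprints; short texts give tiny,
    unreliable profiles."""
    return len(profile_a & profile_b)
```

An utterance of only a few characters yields almost no trigrams at all, so there is very little for the identifier to match on.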

For Siarad, some 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 31% of the utterances remaining after removing those with less than 20 characters. Most of these were in fact Welsh, and even the ones marked English were about 50% Welsh. I therefore left them all, giving a total of around 42,300 citations.

For Patagonia, the same 20 languages were given for about 32% of the remaining utterances. Again, most of these were in fact Welsh, but the ones marked as Spanish and Portuguese were predominantly Spanish, so I removed them, leaving a total of around 15,500 citations.

The upshot is that if in Eurfa you search for an English word, and then click to get citations from Siarad or Patagonia, there is a small chance you will get a “Welsh” equivalent of the English citation that is actually in English or Spanish!

Citations added to Eurfa

November 12th, 2013

Eurfa will now allow you to get in-context citations for its words from the Kynulliad3 corpus. I’ve limited it to 5 at the minute, but I also want to look at doing random lookups to produce a different 5 each time.

I’m hoping to add some other corpora to Eurfa in the same way over the coming months.

I also took the opportunity of changing the AJAX search-boxes to HTML ones, because the AJAX ones had the annoying “feature” of not doing the search if you pressed Return – instead, they just cleared the searchbox.

Language of first delivery

July 4th, 2013

I’ve added some data to Kynulliad3 (it’s not in the current download, but it will be in the next one) so that the “language of first delivery” can be calculated. In the Assembly, Members can speak in either language, which is placed on the left-hand side of the Record. The translation, into English or Welsh as appropriate, then goes on the right-hand side of the Record. If we separate out those sentences which were first delivered in one language from those which have been translated from the other language, we get the following word totals:
Welsh: 879,964 first-delivery words out of 8,865,452 (9.9%)
English: 7,916,148 first-delivery words out of 8,775,382 (90.2%)
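The percentages can be checked directly from the figures given above:

```python
# word totals from the Third Assembly Record, as quoted above
welsh_first, welsh_total = 879_964, 8_865_452
english_first, english_total = 7_916_148, 8_775_382

welsh_pct = 100 * welsh_first / welsh_total      # ~9.9
english_pct = 100 * english_first / english_total  # ~90.2
```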

So about 10% of the word total in the Third Assembly used Welsh as the language of first delivery. This is a bit lower than the proportion of Welsh-speakers in Wales (19% according to the 2011 Census), but I don’t know how it relates to the proportion of Welsh-speakers in the Third Assembly (or at least the proportion who felt confident enough to use Welsh in this setting). It’s a pretty sizeable percentage, though.

Kynulliad3 released

July 3rd, 2013

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!

Language detection for Welsh

June 6th, 2013

For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.

There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 useable pairs, which then need to be split to give sentences. They also need to be rearranged in a consistent direction: the Records put the first language of the pair as whatever language was actually used by the speaker, with the translation in the other language following. So a language detection routine is required.

Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.

I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
    use Lingua::Identify qw(langof);
    return langof( shift );
$perlcode$;

select langof('my string in an unknown language');

will then give an identifier for the language of the string.

When I ran this against the first item in the 94,000 pairs, there was a misidentification of the language in only 26 cases (0.03%) – 10 of these should have been English and 16 Welsh. So far I haven’t come across any instances where an English item was identified as Welsh, or vice versa. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!

It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.

I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.
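Spotting those holes is straightforward once each block is split on both sides: compare the sentence counts. A minimal sketch (find_holes is my name for it, not part of any existing tool):

```python
def find_holes(pairs):
    """Given (welsh_sentences, english_sentences) block pairs, flag
    the 'holes' where one side split into more sentences than the
    other, i.e. a 1-to-2 (or worse) translation."""
    holes = []
    for i, (cy, en) in enumerate(pairs):
        if len(cy) != len(en):
            holes.append(i)
    return holes
```

Fixing the flagged blocks (merging or re-splitting sentences) still has to be done by hand, which is the annoying part.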

Hopefully the new corpus, Kynulliad3, should be ready in the next couple of weeks, with a website and download.

Eurfa v3.0

May 23rd, 2013

In 2003 I started putting together a Welsh wordlist to help with KDE translation, since we were barred from using output from any of the publicly-funded lexical projects (!). In 2005 I put together a verb conjugator (Konjugator) to generate the inflected forms of around 4,000 verbs, and combined those with the wordlist to produce the first version of Eurfa in 2006, with a second edition following in 2007.

At the time it was published, Eurfa was the first Celtic dictionary to list mutated words and verb inflections (though others have copied that idea since). It is still the largest free (GPL) dictionary in Welsh (over 10,000 lemmas at the minute), and was used for the Apertium Welsh-English gist translator and for tagging 900k words of multilingual spoken conversations (BangorTalk).

The original 2007 website was still up until around 3 weeks ago, when server changes meant it stopped working. So I’ve given it a complete makeover using Joshua Gatcke’s very attractive HTML Kickstart. This included streamlining the contents. The old website had a lot of space devoted to proselytising openness, but that battle is pretty much won (except where Welsh language resources are concerned!) – the new Government Service Design Manual mandates a preference for open-source software, the UK Research Councils now have a policy of open access for the outputs of funded research, the Government has set up an open data website giving access to 9,000 public sector datasets, an open operating system (Android) is whuppin’ the ass of the proprietary operating systems, and so on. So the only extraneous bit of the old site I kept was the poem on Pangur Bán, which I think is as fresh and relevant now as it was when it was written by an Irish monk some 1,100 years ago!

I did take the opportunity of folding the conjugator into the new version of Eurfa – the Konjugator site went down some years ago, and I never bothered setting it up again. The current incarnation is much better, and perhaps I have learnt a little bit in the meantime, because the code for printing out the inflected tenses is only 15% of the length of the previous code, but it also handles periphrastic tenses (with auxiliary verbs)!

Rhymer is still there, allowing you to get lists of rhyming words – again, another feature that has been copied (but not bettered!) since.

I’ve continued work on Eurfa over the years, though much of that hasn’t made it into the wild. But I hope to add some nice features to the new site over the next 6-8 months.


Meddaliadur

March 15th, 2007

When I put up the first version of Eurfa, I had the idea of doing a directory of programs and apps that were available in Welsh. I’ve now published the initial version of Meddaliadur, which is a start on this task. It lists a handful of programs, with a short description, a link to the website, license and cost details, information about who did the translation and where it can be got, and (last but not least) a few screenshots.

The idea is to show that there are quite a few pieces of software in Welsh, and the number is growing. This may attract some people to the apps themselves, and it might encourage others to think about making their apps available in Welsh too.

What I’ll be doing is splitting out the various programs that form part of the KDE and GNOME desktops, and then adding any other programs that I know have been translated. So that means, for instance, that there will be a Games section, with separate pages for KSokoban, Kolf, and so on. This will give a much better idea of the range of software (especially free software) already in Welsh.

Software in Welsh on other platforms (eg Microsoft Windows, Apple Mac OSX, Solaris, etc) will also be included, since the aim is to give a reasonable overview of all that is available.

The pages at present are simple HTML, but that would become unmanageable as the number of programs grows, so I need to move it to a database-backed system. I’ll take that opportunity to add things like the ability to leave comments about particular programs, and perhaps a space for beginner’s tutorials or howtos on the programs.

Klebran released at last!

February 16th, 2007

I’ve finally announced the release of Klebran. In October last year I was trying to think of a way of using the Welsh port of Gramadóir in some sort of GUI, because I just find it really difficult to use something at the terminal only. I think it must be because it makes me think about things serially, whereas I prefer to get more of a multi-faceted view of things. Anyway, I couldn’t find any tame C++ programmers, so I began to wonder about doing something in PHP. Hmm, probably won’t work, I thought – but I wrote a page to print output from Gramadóir to the browser, and (to my surprise) within a day I had something working.

That was the easy bit – the last 4 months have been down to improving it, and there are still lots of things that are not done “properly”. I’m sure the code could be improved, and I still think I’m not using AJAX properly! I originally used the --api switch on Gramadóir, but Kevin Scannell pointed out that using the --xml switch would give me a lot more of the info I use to do the neat mouseovers giving you meanings (from Eurfa) and part-of-speech info. The Elixir grammar-checking backend for Sonnet in KDE4 will use the --api output, though.

There are still improvements to be made in the Gramadóir port, particularly as regards disambiguation. For example, if you get a verb-form that is tagged as two separate parts-of-speech (eg gweler, newch), both forms will appear in the Gramadóir output, and the Klebran code will just ignore them as it does its regexing, so you get a space where the word should be!

But at least it’s a start – it’s surprising how many typos Klebran picks up even in text that has been eyeball-checked a couple of times, and I am pretty good at spotting typos. The next step is to combine Klebran with an importer for PO-files, so that I can take completed files and check they are using “standard” words, aren’t producing obvious typos, and so on. I’ll probably use the import part of Kartouche for that, suitably amended.