For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.
There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 usable pairs, which then need to be split into sentences. They also need to be rearranged into a consistent direction: in the Records, the first item of each pair is in whichever language the speaker actually used, and its translation into the other language follows. So a language detection routine is required.
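As a rough illustration of the rearranging step, here is a minimal Python sketch. It assumes each pair is a (first, second) tuple and that some detector returns "cy", "en" or nothing. The names `orient_pairs` and `toy_detect`, the hint-word lists and the sample pairs are all made up for this example; the toy detector is only a stand-in and is not how a real language identifier works.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical hint words, just enough to drive the example below.
WELSH_HINTS = {"mae", "yn", "yr", "y", "ac", "sydd", "diolch"}
ENGLISH_HINTS = {"the", "is", "and", "of", "thank", "you", "to"}


def toy_detect(text: str) -> Optional[str]:
    """Toy stand-in for a real detector: count hint words per language."""
    words = set(text.lower().split())
    cy = len(words & WELSH_HINTS)
    en = len(words & ENGLISH_HINTS)
    if cy > en:
        return "cy"
    if en > cy:
        return "en"
    return None  # undecided: leave for manual checking


def orient_pairs(
    pairs: List[Tuple[str, str]],
    detect: Callable[[str], Optional[str]],
) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Reorder pairs as (welsh, english, original_language).

    Only the first item of each pair is tested, since the second is
    its translation. Pairs the detector cannot place are returned
    separately rather than guessed at.
    """
    oriented, unresolved = [], []
    for first, second in pairs:
        lang = detect(first)
        if lang == "cy":
            oriented.append((first, second, "cy"))
        elif lang == "en":
            oriented.append((second, first, "en"))
        else:
            unresolved.append((first, second))
    return oriented, unresolved


# Made-up sample data, in the speaker-first order used by the Records.
sample = [
    ("Diolch yn fawr.", "Thank you very much."),
    ("The motion is agreed.", "Mae'r cynnig wedi'i dderbyn."),
    ("12345", "12345"),
]
oriented, unresolved = orient_pairs(sample, toy_detect)
```

Keeping the original language as a third field means the share of blocks first spoken in Welsh can be counted later without running detection again.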
Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.
I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:
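-- Wrap Lingua::Identify's langof() as a PostgreSQL function.
-- plperlu (untrusted PL/Perl) is needed so that the module can be loaded.
create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
use Lingua::Identify qw(langof);
return langof( shift );
$perlcode$;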
select langof('my string in an unknown language');
will then give an identifier for the language of the string.
When I ran this against the first item of each of the 94,000 pairs, the language was misidentified in only 26 cases (0.03%) – 10 of these should have been English and 16 Welsh. So far I haven’t come across any instances where an English item was identified as Welsh, or vice versa; in other words, the errors seen so far were items assigned to some other language altogether. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!
It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.
I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.
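One simple way to flag such “holes” is to compare the number of sentences on each side of a block. The Python sketch below does that under loud assumptions: the splitter is a naive regex that breaks after ".", "!" or "?" followed by whitespace (so abbreviations and the like will trip it up), the names `split_sentences` and `find_holes` are invented for the example, and the sample blocks are made up. It is not the procedure actually used to build the corpus.

```python
import re
from typing import List, Tuple

# Naive boundary: whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(block: str) -> List[str]:
    """Split a block of text into sentences with a naive regex."""
    return [s for s in _SENTENCE_END.split(block.strip()) if s]


def find_holes(blocks: List[Tuple[str, str]]) -> List[Tuple[int, int, int]]:
    """Return (index, n_source, n_target) for blocks whose two sides
    split into different numbers of sentences."""
    holes = []
    for index, (source, target) in enumerate(blocks):
        n_source = len(split_sentences(source))
        n_target = len(split_sentences(target))
        if n_source != n_target:
            holes.append((index, n_source, n_target))
    return holes


# Made-up sample blocks: the second has one source sentence
# rendered as two sentences in the translation.
sample_blocks = [
    ("Diolch. Croeso.", "Thanks. Welcome."),
    ("Mae hyn yn bwysig, ac mae angen gweithredu.",
     "This is important. We need to act."),
]
holes = find_holes(sample_blocks)
```

Blocks flagged this way still need a human decision about which sentences to merge, since the counts only show that the two sides are out of step, not where.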
Hopefully the new corpus, Kynulliad3, will be ready in the next couple of weeks, with a website and download.