Language of first delivery

July 4th, 2013 by donnek

I’ve added some data to Kynulliad3 (it’s not in the current download, but it will be in the next one) so that the “language of first delivery” can be calculated. In the Assembly, Members can speak in either language, and whatever they say goes on the left-hand side of the Record. The translation, into English or Welsh as appropriate, then goes on the right-hand side. If we separate out the sentences which were first delivered in one language from those which have been translated from the other, we get the following word totals:
Welsh: 879,964  /  8,865,452 (9.9%)
English: 7,916,148  /  8,775,382 (90.2%)
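
For the record, the percentages can be recalculated from the raw totals (a quick Python check, not part of the corpus tooling):

```python
# Recompute the "language of first delivery" percentages from the totals above.
welsh_first, welsh_total = 879_964, 8_865_452
english_first, english_total = 7_916_148, 8_775_382

welsh_pct = 100 * welsh_first / welsh_total
english_pct = 100 * english_first / english_total
print(f"Welsh: {welsh_pct:.1f}%")      # ~9.9
print(f"English: {english_pct:.1f}%")  # ~90.2
```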

So about 10% of the word total in the Third Assembly used Welsh as the language of first delivery. This is only around half the proportion of Welsh-speakers in Wales (19% according to the 2011 Census), but I don’t know how it relates to the proportion of Welsh-speakers in the Third Assembly (or at least the proportion who felt confident enough to use Welsh in this setting). It’s a pretty sizeable percentage, though.

Kynulliad3 released

July 3rd, 2013 by donnek

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!
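
As a sketch of how that kind of lookup works (the sentence pair below is invented for illustration, not taken from the corpus itself), a phrase search over aligned pairs is just a filter on the English side:

```python
def search_pairs(pairs, phrase):
    """Return all (English, Welsh) sentence pairs whose English side
    contains the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [(en, cy) for en, cy in pairs if phrase in en.lower()]

# Invented illustrative pairs -- not actual corpus sentences:
pairs = [
    ("It contains a total of almost nine million words.",
     "Mae'n cynnwys cyfanswm o bron naw miliwn o eiriau."),
    ("The debate continued.", "Parhaodd y ddadl."),
]
hits = search_pairs(pairs, "of almost")
```

Every hit shows you how the Assembly’s translators actually rendered the phrase, which settles the mutation question at a glance.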

Language detection for Welsh

June 6th, 2013 by donnek

For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.

There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 useable pairs, which then need to be split to give sentences. They also need to be rearranged in a consistent direction: the Records put the first language of the pair as whatever language was actually used by the speaker, with the translation in the other language following. So a language detection routine is required.
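
In outline, the reordering step looks something like this (a Python sketch with a pluggable detector function, not the actual code used):

```python
def order_pair(first, second, detect):
    """Reorder one Record block so that Welsh always comes first.

    `first` is the text as actually delivered, `second` its translation;
    `detect` is any function returning 'cy' or 'en' for a piece of text.
    Returns ((welsh, english), language_of_first_delivery)."""
    if detect(first) == "cy":
        return (first, second), "cy"
    return (second, first), "en"
```

With a real detector plugged in, a single pass over the 94,000 pairs both normalises their direction and records the language of first delivery.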

Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.

I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
    use Lingua::Identify qw(langof);
    return langof( shift );
$perlcode$;

select langof('my string in an unknown language');

will then give an identifier for the language of the string.

When I ran this against the first item of each of the 94,000 pairs, the language was misidentified in only 26 cases (0.03%) – 10 of these should have been English and 16 Welsh, but were tagged as some other language. So far I haven’t come across any instance where an English item was identified as Welsh, or vice versa. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!

It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.

I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.
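
The holes can be found mechanically by comparing sentence counts on the two sides of each block – a rough sketch, using a naive sentence splitter:

```python
import re

def split_sentences(text):
    # Naive splitter: break on ., ? or ! followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def find_holes(block_pairs):
    """Return the indices of blocks whose two sides split into different
    numbers of sentences (1-to-2 translations and the like)."""
    holes = []
    for i, (src, tgt) in enumerate(block_pairs):
        if len(split_sentences(src)) != len(split_sentences(tgt)):
            holes.append(i)
    return holes
```

Anything this flags still needs a human eye, since the mismatch might equally be a mis-split sentence rather than a genuine 1-to-2 translation.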

Hopefully the new corpus, Kynulliad3, should be ready in the next couple of weeks, with a website and download.

Eurfa v3.0

May 23rd, 2013 by donnek

In 2003 I started putting together a Welsh wordlist to help with KDE translation, since we were barred from using output from any of the publicly-funded lexical projects (!). In 2005 I put together a verb conjugator (Konjugator) to generate the inflected forms of around 4,000 verbs, and combined those with the wordlist to produce the first version of Eurfa in 2006, with a second edition following in 2007.

At the time it was published, Eurfa was the first Celtic dictionary to list mutated words and verb inflections (though others have copied that idea since). It is still the largest free (GPL) dictionary in Welsh (over 10,000 lemmas at the minute), and was used for the Apertium Welsh-English gist translator and for tagging 900k words of multilingual spoken conversations (BangorTalk).
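
To give a flavour of what listing mutated words involves (a simplified sketch, not Eurfa’s actual code), soft mutation can be generated from a small substitution table over word-initial consonants:

```python
# Soft mutation replaces the initial consonant of a Welsh word.
SOFT = {"p": "b", "t": "d", "c": "g", "b": "f", "d": "dd",
        "g": "", "m": "f", "ll": "l", "rh": "r"}

def soft_mutate(word):
    """Return the soft-mutated form of a word, or the word unchanged
    if its initial letter does not mutate."""
    for initial in ("ll", "rh"):  # check the digraphs first
        if word.startswith(initial):
            return SOFT[initial] + word[len(initial):]
    if word[0] in SOFT:
        return SOFT[word[0]] + word[1:]
    return word
```

Running every lemma through tables like this (and the corresponding nasal and aspirate ones) is what lets a dictionary recognise mutated forms as they actually occur in text.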

The original 2007 website was still up until around 3 weeks ago, when server changes meant it stopped working. So I’ve given it a complete makeover using Joshua Gatcke’s very attractive HTML Kickstart. This included streamlining the contents. The old website had a lot of space devoted to proselytising openness, but that battle is pretty much won (except where Welsh language resources are concerned!) – the new Government Service Design Manual mandates a preference for open-source software, the UK Research Councils now have a policy of open access for the outputs of funded research, the Government has set up an open data website giving access to 9,000 public sector datasets, an open operating system (Android) is whuppin’ the ass of the proprietary operating systems, and so on. So the only extraneous bit of the old site I kept was the poem on Pangur Bán, which I think is as fresh and relevant now as it was when it was written by an Irish monk some 1,100 years ago!

I did take the opportunity of folding the conjugator into the new version of Eurfa – the Konjugator site went down some years ago, and I never bothered setting it up again. The current incarnation is much better, and perhaps I have learnt a little bit in the meantime, because the code for printing out the inflected tenses is only 15% the length of the previous code, but also handles periphrastic tenses (with auxiliary verbs)!
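
As a toy illustration of why periphrastic tenses are cheap to generate (my own sketch, not the Eurfa code), the periphrastic present is just an inflected form of the auxiliary bod plus yn and the verbnoun, with yn contracting to ’n after a vowel:

```python
# A few present-tense forms of the auxiliary bod (colloquial Northern forms).
BOD_PRESENT = {"1s": "rydw i", "2s": "rwyt ti", "3sm": "mae o", "3sf": "mae hi"}

VOWELS = "aeiouwy"

def periphrastic_present(person, verbnoun):
    """Build e.g. 'mae o'n canu' (he sings / is singing) from bod + yn + verbnoun."""
    aux = BOD_PRESENT[person]
    # yn contracts to 'n after a word ending in a vowel
    link = "'n" if aux[-1] in VOWELS else " yn"
    return f"{aux}{link} {verbnoun}"
```

Since the auxiliary paradigm is fixed, the same handful of lines covers every verbnoun in the dictionary, which is why folding this in added so little code.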

Rhymer is still there, allowing you to get lists of rhyming words – again, another feature that has been copied (but not bettered!) since.
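
A rhymer can be sketched very crudely as a suffix match over the wordlist (Rhymer itself is presumably cleverer than this – real rhyme depends on phonology, not spelling):

```python
def rhymes(word, wordlist, n=3):
    """Return words from wordlist sharing the last n letters with `word`,
    excluding the word itself."""
    ending = word[-n:]
    return [w for w in wordlist if w != word and w.endswith(ending)]
```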

I’ve continued work on Eurfa over the years, though much of that hasn’t made it into the wild. But I hope to add some nice features to the new site over the next 6-8 months.


April 25th, 2013 by donnek

Till Tantau’s PGF (Portable Graphics Format) is a package that adds graphics capabilities to {La|Xe}TeX, and TikZ (Tikz ist kein Zeichenprogramm – “Tikz is not a drawing program”) is a set of commands on top of PGF that makes it easier to handle. A number of packages that are useful to linguists use TikZ (eg Daniele Pighin’s TikZ-dependency for drawing dependency diagrams, or David Chiang’s tikz-qtree), and I’ve recently been investigating how to use it for representing pitch in tone and intonational languages.
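
As a toy example of the sort of thing I mean (my own sketch, not taken from any of the packages above), a rising-falling pitch contour over syllables can be drawn with a smooth plot:

```latex
\begin{tikzpicture}[x=2em,y=2em]
  % syllables on a baseline
  \node at (0,0) {ma};  \node at (1,0) {ke};  \node at (2,0) {te};
  % pitch track drawn above them
  \draw[thick] plot[smooth] coordinates {(0,0.6) (1,1.4) (2,0.9)};
\end{tikzpicture}
```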

One useful program that makes it easier to experiment with TikZ is Florian Hackenberger’s ktikz. Although this hasn’t been updated since 2010, it works well. You enter the TikZ code in the left-hand pane, and the right-hand pane gives you an immediate representation of what your graphic looks like. Some templates for standard LaTeX documents are available in /usr/share/kde4/apps/ktikz/templates, but it’s a good idea to set up a template in your home dir so that you don’t have to be root to edit it, and point ktikz to it. You can then add additional packages and frequently-used tikzlibrary options to that. Note that new tikzlibrary options will not be accessed until ktikz has been closed and restarted. Alternatively, you can just add tikzlibraries you will not be using frequently above the \begin{tikzpicture}.
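
A home-directory template might look something like the following (if I remember rightly, ktikz replaces a `<>` placeholder with the contents of the editor pane – check the shipped templates for the exact placeholder your version expects):

```latex
\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{arrows,positioning}
\pagestyle{empty}
\begin{document}
<>
\end{document}
```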

Once you have drawn your graphic, you may want to put it on a webpage. One handy way of doing this is Pavel Holoborodko’s QuickLaTeX. The QuickLaTeX server compiles the LaTeX code you enter, and gives you a URL for the resulting output. You can then paste that URL into the webpage where you want the LaTeX output. For a TikZ graphic, put the tikzpicture code in the top box, and the preamble lines (packages and tikzlibraries) in the Choose Options box. Then click Render to get the compiled code and its URL.

An alternative approach for a WordPress blog is to use Pavel’s plugin, which compiles the code and sends your blog the image to be inserted in place of the code. Add the same preamble lines to the preamble box on the Advanced tab of the plugin settings, and then you can, for instance, use the following tikz-qtree code (taken from a TeX StackExchange question) to produce a nicely-formatted syntax tree:

\Tree [
  .TP [
      .T\1 \node(C){T+verb}; [
          .vP \qroof{`ana}.DP [
             .v\1 \node(B){v+verb}; [
                 .VP [
                     .V\1 \node(A){V+verb}; \qroof{taalib}.DP
                 ]
             ]
          ]
      ]
  ]
]
\draw [semithick,->] (A) to[out=240,in=270] (B.south);
\draw [semithick,->] (B) to[out=240,in=180] (C);


TikZ is a very powerful system for producing almost any kind of printed graphic, and QuickLaTeX allows those graphics to be made easily available on the web.

Aaron Swartz RIP

January 14th, 2013 by donnek

This is an extremely sad event. I have written the following to the President of MIT, Rafael Reif:

I am a UK citizen, but I am writing to express my disgust and horror over the
way MIT has behaved in the matter of the death of Aaron Swartz.

While there is some debate over the methods Swartz used, there can be none
over his motives – they were based on a desire to ensure that knowledge is
available to all, without artificial barriers of price or being part of a
privileged elite.

I would have thought that an institution like MIT would have recognised and,
if not acquiesced in, at least not opposed this concept. Doesn’t the very
word “university” share the same Latin root as “universal”?

JSTOR, to its credit, decided to adopt a low-key approach to this “crime”,
but, to its eternal shame, MIT did not – it is an institution which appears to
have no conception of the meaning of the words “ethical” or “proportional”.

Non-US citizens like myself can only shake our heads in disbelief over the way
the US patent and copyright circus is poisoning the idea of “doing the right
thing”.

What a sad day for American letters – that a university of MIT’s pedigree
should let its moral compass go so completely adrift.


November 8th, 2012 by donnek

After 3 months of work, I’m pleased to have completed Andika!, a set of tools to allow Swahili to be written in Arabic script, with converters to turn this into Roman script and vice versa. Not least, it provides a way to digitise Swahili manuscripts so that they can be printed attractively, and so that their contents are available for research using computational linguistic methods. More details are on the website.

Keyboard Layout Editor

August 20th, 2012 by donnek

I’m revising my Swahili-Arabic keyboard layout, and was looking for something that allows me to see what’s on each key, because it’s difficult trying to remember where everything is when you’re making revisions every 5 minutes.

I found Simos Xenitellis’ Keyboard Layout Editor here, with the code here. To install on Ubuntu 12.04, you need Java installed. The simplest way of getting Oracle (Sun) Java (I find OpenJDK still has a few issues with some applications) is to install Andrei Alin’s PPA:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Next, install python-lxml and python-antlr.

Then download the KLE tarball from Github, untar it, and move into the directory. Download the ANTLR package (antlr-3.1.2.jar) with wget, and process the ANTLR grammars:
export CLASSPATH=$CLASSPATH:antlr-3.1.2.jar
java org.antlr.Tool *.g

Then launch KLE:

With the help of the blog post above, the interface is quite simple to get to grips with – you just drag your desired character into the appropriate slot on the key. The “Start Character Map” button does not launch the GNOME character chooser unless you edit line 951 near the end of the file KeyboardLayoutEditor from:
os.system(Common.gucharmapapp)
to:
os.system(Common.gucharmappath)
but even then the character chooser doesn’t seem to want to change fonts. So instead I opened the KDE kcharselect app manually, and dragged the characters from that.

In general, KLE is a nice app, but perhaps needs a bit of a springclean. Things that struck me were:

  • You can only drag and drop the character glyph – you can’t edit it once in place on the key, you can only remove it. So you can’t, say, right-click and change 062D to 062E if you’ve made a mistake.
  • Likewise, you can’t drag characters between the slots on a key to move them around – you have to remove them, and then drag them back in from scratch.
  • The final file is a little untidy – it doesn’t use tabs to separate the columns of characters, or a blank line between the row-groups (AB to AE) of the keyboard.
  • If you’re taking the most likely course of having two modifiers (Shift and AltGr), the written file is missing the line:
    include "level3(ralt_switch)"
    without which the AltGr options will not work. So you need to add that manually.

But all in all, Simos has written a very handy application that greatly simplifies designing a keyboard layout, so a big thank-you to him for making my work a good bit easier!

Patagonia and Miami corpora complete!

July 27th, 2012 by donnek

Phew! There was just no time for anything over the last 6 months but the Patagonia and Miami corpora, and they’ve now been completed on schedule and sent off to Talkbank. Kudos to Professor Margaret Deuchar and her team of researchers.

Patagonia (Welsh/Spanish) corpus main statistics:
150k words, 78% Welsh, 17% Spanish, 5% indeterminate

Miami (Spanish/English) corpus main statistics:
167k words, 63% English, 34% Spanish, 3% indeterminate

Apart from producing the glosses via the Autoglosser, I was involved in running a Git repository, translation and editing, silencing audiofiles for privacy purposes, tweaking the files to accommodate last-minute changes to the CLAN format, and doing a website for the corpora themselves – very interesting indeed.

Possibly the best thing about the corpora is that, like the 2009 Siarad corpus, they are under the GPL. In fact, Siarad and Patagonia seem to be the only large-scale collections of Welsh text to use this free license, which ensures that everyone can have unfettered access to high-quality materials produced using public funds.

Talking of which, the website devoted to all three corpora is at BangorTalk. Have a look and listen to real language in action!

Autoglossing historical Welsh

December 11th, 2011 by donnek

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200-word piece from 1779 – a translation of the autobiography of James Gronniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.
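
The old-spelling handling can be sketched as a normalisation pass before dictionary lookup (the substitution below is purely illustrative – the real rules for the 1779 orthography would differ):

```python
# Hypothetical old-to-modern spelling substitutions; the actual rules
# used for the historical text are more involved than this.
SUBSTITUTIONS = [
    ("v", "f"),   # e.g. a hypothetical old spelling "vy" -> modern "fy"
]

def normalise(word):
    """Rewrite an old-style spelling into its modern equivalent."""
    for old, new in SUBSTITUTIONS:
        word = word.replace(old, new)
    return word

def lookup(word, dictionary):
    """Try the surface form first, then fall back to the normalised spelling."""
    return dictionary.get(word) or dictionary.get(normalise(word))
```

The attraction of doing it this way is that the rest of the pipeline (lookup, constraint grammar, write-out) never needs to know the text is old.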

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug in to the system as appropriate.

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.