November 12th, 2013 by donnek

Eurfa will now allow you to get in-context citations for its words from the Kynulliad3 corpus. I’ve limited it to 5 at the minute, but I also want to look at doing random lookups to produce a different 5 each time.
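The random-lookup idea is straightforward to sketch. This is only an illustration, not Eurfa's actual code: `citations` stands in for the rows pulled from the Kynulliad3 tables for a given word.

```python
import random

def sample_citations(citations, k=5):
    """Return up to k citations, chosen differently on each call."""
    return random.sample(citations, min(k, len(citations)))

# 'hits' is a stand-in for citation sentences fetched from the corpus.
hits = ["example sentence 1", "example sentence 2", "example sentence 3"]
print(sample_citations(hits))
```

In PostgreSQL the same effect could be had with an `ORDER BY random() LIMIT 5` on the citation query.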

I’m hoping to add some other corpora to Eurfa in the same way over the coming months.

I also took the opportunity of changing the AJAX search-boxes to HTML ones, because the AJAX ones had the annoying “feature” of not doing the search if you pressed Return – instead, they just cleared the searchbox.

# Autoglossing Gàidhlig

August 28th, 2013 by donnek

Over the last month I’ve been wondering about how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago at Sabhal Mòr Ostaig when it was still a collection of farm buildings…

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, but with the lemma specified where they are derived forms, and also each derived word listed as part of the generating word/lemma’s own entry. Still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.
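The layout logic can be sketched independently of the TeX script. This is not that script, just a toy illustration of the idea, with a handful of (form, lemma) pairs standing in for the dictionary (bha and bidh are forms of bi, “to be”):

```python
from collections import defaultdict

# Toy (surface form, lemma) pairs; the real dictionary is much larger.
entries = [("bha", "bi"), ("bi", "bi"), ("bidh", "bi"), ("cat", "cat")]

# Collect derived forms under their generating lemma.
derived = defaultdict(list)
for form, lemma in entries:
    if form != lemma:
        derived[lemma].append(form)

# Every word in alphabetical order; derived forms point to their lemma,
# and each lemma entry lists its own derived forms.
lines = []
for form, lemma in sorted(entries):
    if form == lemma:
        subs = ", ".join(sorted(derived[lemma])) or "(none)"
        lines.append(f"{form}  [derived forms: {subs}]")
    else:
        lines.append(f"{form}  -> see {lemma}")
print("\n".join(lines))
```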

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (eg Sotho).

# Language of first delivery

July 4th, 2013 by donnek

I’ve added some data to Kynulliad3 (it’s not in the current download, but it will be in the next one) so that the “language of first delivery” can be calculated. In the Assembly, Members can speak in either language, which is placed on the left-hand side of the Record. The translation, into English or Welsh as appropriate, then goes on the right-hand side of the Record. If we separate out those sentences which were first delivered in one language from those which have been translated from the other language, we get the following word totals:
Welsh: 879,964 words first delivered in Welsh, out of 8,865,452 on the Welsh side of the Record (9.9%)
English: 7,916,148 words first delivered in English, out of 8,775,382 on the English side (90.2%)

So about 10% of the word total in the Third Assembly used Welsh as the language of first delivery. This is a bit lower than the proportion of Welsh-speakers in Wales (19% according to the 2011 Census), but I don’t know how it relates to the proportion of Welsh-speakers in the Third Assembly (or at least the proportion who felt confident enough to use Welsh in this setting). It’s a pretty sizeable percentage, though.

July 3rd, 2013 by donnek

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!
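That kind of lookup is just a concordance over the aligned pairs. A minimal sketch, with invented pairs (the Welsh is illustrative only, not corpus rows):

```python
# Invented (English, Welsh) pairs for illustration -- not rows from Kynulliad3.
pairs = [
    ("a population of almost three million", "poblogaeth o bron i dair miliwn"),
    ("it took almost a year", "cymerodd bron i flwyddyn"),
]

def concordance(phrase, pairs):
    """Return the aligned pairs whose English side contains the phrase."""
    return [(en, cy) for en, cy in pairs if phrase in en]

for en, cy in concordance("of almost", pairs):
    print(en, "|", cy)
```

Each hit shows the English context alongside the translators' actual Welsh, which is what answers the mutation question.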

# Language detection for Welsh

June 6th, 2013 by donnek

For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.

There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 useable pairs, which then need to be split to give sentences. They also need to be rearranged in a consistent direction: the Records put the first language of the pair as whatever language was actually used by the speaker, with the translation in the other language following. So a language detection routine is required.
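The rearrangement step is simple once a detector exists. A sketch, where `detect()` is a stand-in for a real detector such as Lingua::Identify's `langof()`:

```python
def normalise(pair, detect):
    """Return the pair with the Welsh member first, whatever order it came in."""
    a, b = pair
    return (a, b) if detect(a) == "cy" else (b, a)

# Toy detector for the demo only: treats the word 'yn' as a Welsh marker.
toy_detect = lambda text: "cy" if "yn" in text.split() else "en"

print(normalise(("the meeting is open", "mae'r cyfarfod yn agored"), toy_detect))
```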

Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.

I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
use Lingua::Identify qw(langof);
return langof( shift );
$perlcode$;

Running

select langof('my string in an unknown language');

will then give an identifier for the language of the string.

When I ran this against the first item of each of the 94,000 pairs, the language was misidentified in only 26 cases (0.03%) – 10 of these should have been English, and 16 Welsh. So far I haven’t come across any instances where an English item was identified as Welsh, or vice versa. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!

It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.

I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.
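One way to flag these holes is to compare sentence counts on the two sides of each pair. A sketch only (real sentence splitting needs to handle abbreviations and the like):

```python
import re

def sentence_count(text):
    """Crude split on terminal punctuation."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

def find_holes(pairs):
    """Return pairs whose two sides yield different numbers of sentences."""
    return [p for p in pairs if sentence_count(p[0]) != sentence_count(p[1])]

pairs = [
    ("One. Two.", "Un. Dau."),                 # aligned cleanly
    ("A single sentence.", "Two here. See?"),  # a 'hole'
]
print(find_holes(pairs))
```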

# Eurfa v3.0

May 23rd, 2013 by donnek

In 2003 I started putting together a Welsh wordlist to help with KDE translation, since we were barred from using output from any of the publicly-funded lexical projects (!). In 2005 I put together a verb conjugator (Konjugator) to generate the inflected forms of around 4,000 verbs, and combined those with the wordlist to produce the first version of Eurfa in 2006, with a second edition following in 2007.

At the time it was published, Eurfa was the first Celtic dictionary to list mutated words and verb inflections (though others have copied that idea since). It is still the largest free (GPL) dictionary in Welsh (over 10,000 lemmas at the minute), and was used for the Apertium Welsh-English gist translator and for tagging 900k words of multilingual spoken conversations (BangorTalk).

The original 2007 website was still up until around 3 weeks ago, when server changes meant it stopped working. So I’ve given it a complete makeover using Joshua Gatcke’s very attractive HTML Kickstart. This included streamlining the contents. The old website had a lot of space devoted to proselytising openness, but that battle is pretty much won (except where Welsh language resources are concerned!) – the new Government Service Design Manual mandates a preference for open-source software, the UK Research Councils now have a policy of open access for the outputs of funded research, the Government has set up an open data website giving access to 9,000 public sector datasets, an open operating system (Android) is whuppin’ the ass of the proprietary operating systems, and so on. So the only extraneous bit of the old site I kept was the poem on Pangur Bán, which I think is as fresh and relevant now as it was when it was written by an Irish monk some 1,100 years ago!

I did take the opportunity of folding the conjugator into the new version of Eurfa – the Konjugator site went down some years ago, and I never bothered setting it up again. The current incarnation is much better, and perhaps I have learnt a little bit in the meantime, because the code for printing out the inflected tenses is only 15% the length of the previous code, but also handles periphrastic tenses (with auxiliary verbs)!
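The periphrastic idea can be sketched as auxiliary + pronoun + yn + verbnoun. This is not Eurfa's code or data, just an illustration with a two-cell fragment of the paradigm of bod (“to be”); a real implementation would also contract yn after vowels (mae hi’n mynd):

```python
# Illustrative fragment of the auxiliary bod ("to be") -- not Eurfa's data.
BOD = {"present": {"3sg": "mae"}, "future": {"3sg": "bydd"}}

def periphrastic(tense, person, pronoun, verbnoun):
    """Build auxiliary + pronoun + yn + verbnoun (no contraction of yn)."""
    return f"{BOD[tense][person]} {pronoun} yn {verbnoun}"

print(periphrastic("present", "3sg", "hi", "mynd"))  # mae hi yn mynd (she is going)
print(periphrastic("future", "3sg", "hi", "mynd"))   # bydd hi yn mynd (she will go)
```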

Rhymer is still there, allowing you to get lists of rhyming words – again, another feature that has been copied (but not bettered!) since.
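A naive version of such a rhymer is easy to sketch (this is not Rhymer's actual method; shared final letters stand in for shared final sounds, and the wordlist is illustrative):

```python
def rhymes(word, wordlist, tail=3):
    """Words sharing the last 'tail' letters with 'word', excluding the word itself."""
    ending = word[-tail:]
    return [w for w in wordlist if w != word and w.endswith(ending)]

# Illustrative wordlist only.
words = ["canu", "ganu", "chanu", "llyfr", "cath"]
print(rhymes("canu", words))
```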

I’ve continued work on Eurfa over the years, though much of that hasn’t made it into the wild. But I hope to add some nice features to the new site over the next 6-8 months.

# TikZ

April 25th, 2013 by donnek

Till Tantau’s PGF (Portable Graphics Format) is a package that adds graphics capabilities to {La|Xe}TeX, and TikZ (Tikz ist kein Zeichenprogramm – “Tikz is not a drawing program”) is a set of commands on top of PGF that makes it easier to handle. A number of packages that are useful to linguists use TikZ (eg Daniele Pighin’s TikZ-dependency for drawing dependency diagrams, or David Chiang’s tikz-qtree), and I’ve recently been investigating how to use it for representing pitch in tone and intonational languages.
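To give a flavour of the kind of pitch representation I mean, here is a minimal sketch using only core TikZ (no extra libraries); the syllables and pitch values are invented:

```latex
\documentclass{standalone}
\usepackage{tikz}
\begin{document}
\begin{tikzpicture}
  % Axes: syllables along x, relative pitch on y (all values illustrative)
  \draw[->] (0,0) -- (4.2,0) node[right] {time};
  \draw[->] (0,0) -- (0,2.2) node[above] {pitch};
  % A smooth contour through four invented pitch targets
  \draw[thick] plot[smooth] coordinates {(0.5,1.2) (1.5,1.8) (2.5,0.8) (3.5,1.1)};
  % Syllable labels under the contour
  \foreach \x/\syl in {0.5/ma, 1.5/ri, 2.5/po, 3.5/sa}
    \node at (\x,-0.35) {\syl};
\end{tikzpicture}
\end{document}
```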

One useful program that makes it easier to experiment with TikZ is Florian Hackenberger’s ktikz. Although this hasn’t been updated since 2010, it works well. You enter the TikZ code in the left-hand pane, and the right-hand pane gives you an immediate representation of what your graphic looks like. Some templates for standard LaTeX documents are available in /usr/share/kde4/apps/ktikz/templates, but it’s a good idea to set up a template in your home dir so that you don’t have to be root to edit it, and point ktikz to it. You can then add additional packages and frequently-used tikzlibrary options to that. Note that new tikzlibrary options will not be accessed until ktikz has been closed and restarted. Alternatively, you can just add tikzlibraries you will not be using frequently above the \begin{tikzpicture}.

Once you have drawn your graphic, you may want to put it on a webpage. One handy way of doing this is Pavel Holoborodko’s QuickLaTeX. The QuickLaTeX server compiles the LaTeX code you enter, and gives you a URL for the resulting output. You can then paste that URL into the webpage where you want the LaTeX output. For a TikZ graphic, put the tikzpicture code in the top box, and the following preamble lines:
\usepackage{tikz}
\usepackage{tikz-qtree}
\usepackage{tikz-qtree-compat}
in the Choose Options box. Then click Render to get the compiled code and its URL.

An alternative approach for a WordPress blog is to use Pavel’s plugin, which compiles the code and sends your blog the image to be inserted in place of the code. Add the above lines to the preamble box on the Advanced tab of the plugin settings, and then you can, for instance, use the following tikz-qtree code (taken from a TeX StackExchange question) to produce a nicely-formatted syntax tree:

\begin{tikzpicture}
\Tree [.TP [.T\1 \node(C){T+verb};
        [.vP \qroof{`ana}.DP
          [.v\1 \node(B){v+verb};
            [.VP [.V\1 \node(A){V+verb}; \qroof{taalib}.DP ] ] ] ] ] ]
\draw [semithick,->] (A) to[out=240,in=270] (B.south);
\draw [semithick,->] (B) to[out=240,in=180] (C);
\end{tikzpicture}

TikZ is a very powerful system for producing almost any kind of printed graphic, and QuickLaTeX allows those graphics to be made easily available on the web.

# Aaron Swartz RIP

January 14th, 2013 by donnek

This is an extremely sad event. I have written the following to the President of MIT, Rafael Reif:

I am a UK citizen, but I am writing to express my disgust and horror over the
way MIT has behaved in the matter of the death of Aaron Swartz.

While there is some debate over the methods Swartz used, there can be none
over his motives – they were based on a desire to ensure that knowledge is
available to all, without artificial barriers of price or being part of a
privileged elite.

I would have thought that an institution like MIT would have recognised and,
if not acquiesced in, at least not opposed this concept. Doesn’t the very
word “university” share the same Latin root as “universal”?

JSTOR, to its credit, decided to adopt a low-key approach to this “crime”,
but, to its eternal shame, MIT did not – it is an institution which appears to
have no conception of the meaning of the words “ethical” or “proportional”.

Non-US citizens like myself can only shake our heads in disbelief over the way
the US patent and copyright circus is poisoning the idea of “doing the right
thing”.

What a sad day for American letters – that a university of MIT’s pedigree
should let its moral compass go so completely adrift.

# Andika!

November 8th, 2012 by donnek

After 3 months of work, I’m pleased to have completed Andika!, a set of tools to allow Swahili to be written in Arabic script, with converters to turn this into Roman script and vice versa. Not least, it provides a way to digitise Swahili manuscripts so that they can be printed attractively, and so that their contents are available for research using computational linguistic methods. More details are on the website.
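At its core, a converter of this kind is a mapping from one script to the other. A toy sketch only: the mappings below are a few illustrative Arabic-to-Roman correspondences, while Andika!'s real tables are far larger and handle vowel signs, digraphs, and positional forms.

```python
# Illustrative Arabic-script-to-Roman mappings (Unicode escapes used so the
# right-to-left text displays predictably) -- not Andika!'s actual tables.
AJAMI_TO_ROMAN = {"\u0628": "b", "\u0627": "a", "\u0645": "m", "\u062A": "t"}

def to_roman(text):
    """Character-by-character conversion; unknown characters pass through."""
    return "".join(AJAMI_TO_ROMAN.get(ch, ch) for ch in text)

print(to_roman("\u0628\u0627\u0628\u0627"))  # baba ('father')
```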

# Keyboard Layout Editor

August 20th, 2012 by donnek

I’m revising my Swahili-Arabic keyboard layout, and was looking for something that allows me to see what’s on each key, because it’s difficult trying to remember where everything is when you’re making revisions every 5 minutes.

I found Simos Xenitellis’ Keyboard Layout Editor here, with the code here. To run it on Ubuntu 12.04, you need Java installed. The simplest way of getting Oracle (Sun) Java (I find OpenJDK still has a few issues with some applications) is to add Andrei Alin’s PPA and install from there:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

Next, install python-lxml and python-antlr. Then, in the directory containing the KLE source, compile the ANTLR grammar files:

export CLASSPATH=$CLASSPATH:antlr-3.1.2.jar
java org.antlr.Tool *.g

Then launch KLE:
./KeyboardLayoutEditor

With the blog post above, the interface is quite simple to get to grips with – you just drag your desired character into the appropriate slot on the key. The “Start Character Map” button does not launch the GNOME character chooser unless you edit line 951 near the end of the file KeyboardLayoutEditor, changing:

os.system(Common.gucharmapapp)

to:

os.system(Common.gucharmappath)

Even then, the character chooser doesn’t seem to want to change fonts, so instead I opened the KDE kcharselect app manually and dragged the characters from that.

In general, KLE is a nice app, but perhaps needs a bit of a springclean. Things that struck me were:

• You can only drag and drop the character glyph – you can’t edit it once in place on the key, you can only remove it. So you can’t, say, right-click and change 062D to 062E if you’ve made a mistake.
• Likewise, you can’t drag characters between the slots on a key to move them around – you have to remove them, and then drag them back in from scratch.
• The final file is a little untidy – it doesn’t use tabs to separate the columns of characters, or a blank line between the row-groups (AB to AE) of the keyboard.
• If you’re taking the most likely course of having two modifiers (Shift and AltGr), the written file is missing the line:
include "level3(ralt_switch)"
without which the AltGr options will not work. So you need to add that manually.

But all in all, Simos has written a very handy application that greatly simplifies designing a keyboard layout, so a big thank-you to him for making my work a good bit easier!