# Kwici – a Welsh Wikipedia corpus

January 14th, 2014 by donnek

A couple of days ago I visited a resources page belonging to Linas Vepstas, the maintainer of the Link Grammar package of the OpenCog project, and that took me to the WaCky website, with some nice examples of tagged corpora.

That in turn led to Giuseppe Attardi and Antonio Fuschetto’s Wikipedia Extractor, which generates plain text from a Wikipedia dump, discarding markup etc. I’ve tried this sort of thing before, and the results leave a lot to be desired, but this one worked wonderfully.

Download it, make it executable, and point it at the dump:

./WikiExtractor.py -cb 250K -o output < cywiki-20131230-pages-articles-multistream.xml

The -c switch compresses the output, -b 250K splits it into files of roughly 250K each, and the -o switch stores them in the output dir.

Then combine all these into one big file:

find output -name '*bz2' -exec bzip2 -d -c {} \; > text.xml

(the command on the website is missing the -d switch, without which bzip2 would try to compress rather than decompress).

This gave a very tidy file with each document in its own doc tag. About the only stuff it didn't strip was some English in includeonly tags - I'm not sure what these are for, but they may have been updates that hadn't been translated yet. So Wikipedia Extractor did exactly what it said on the tin!

Once the doc tags, the English, and blank lines were removed, the text was relatively easy to split into sentences and import into PostgreSQL for more tidying. I spent a day on that, but I didn't like the results, so I redid it over another day, and the outcome is Kwici – a 200k-sentence, 3.9m-word corpus of Welsh.
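For what it's worth, the tidying steps can be sketched like this (a minimal illustration, not the actual pipeline – the regexes and the naive sentence splitter here are my own assumptions):

```python
import re

def clean_and_split(raw):
    """Strip <doc> wrappers, includeonly fragments, and blank lines,
    then naively split the remaining text into sentences."""
    sentences = []
    for line in raw.splitlines():
        line = line.strip()
        # Drop the <doc id=...> / </doc> wrappers that Wikipedia Extractor emits
        if not line or line.startswith("<doc") or line.startswith("</doc"):
            continue
        # Remove any leftover <includeonly>...</includeonly> fragments
        line = re.sub(r"<includeonly>.*?</includeonly>", "", line, flags=re.S)
        # Very naive split on terminal punctuation followed by whitespace
        for s in re.split(r"(?<=[.!?])\s+", line):
            if s:
                sentences.append(s)
    return sentences
```

A real pipeline needs to handle abbreviations, decimals and so on, but this is enough to get sentence-per-row data into PostgreSQL for further tidying.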

Kwici is now added to Eurfa as a seventh citation source.

# More corpora …

December 16th, 2013 by donnek

Korrect/Kywiro was a collection of English-Welsh software translations that I put together in 2004, when there was a lot of community activity going on. It fell off the web a few years ago, but I’ve now resurrected it. In the new version I’ve chopped up the longer strings into chunks of one or two sentences, and stripped out a lot of the HTML. I also spent some time fixing up character encoding issues, which is an occupational hazard of dealing with older text from various machines, not all of which will have been using UTF-8. Hopefully it should be fairly tidy now, so I’ve added it to the citation sources at Eurfa.
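That sort of encoding repair usually comes down to trying UTF-8 first and falling back to a legacy codepage. A generic sketch (not the actual script I used – the fallback order is just a common-sense guess):

```python
def to_utf8(raw_bytes):
    """Decode text of unknown origin: try UTF-8 first, then fall back to
    cp1252 (common on older Windows machines), then latin-1."""
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes rather than crash
    return raw_bytes.decode("utf-8", errors="replace")
```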

Going even further back, in the late 90s Bob Morris Jones and colleagues put together two corpora dealing with child language acquisition – CIG1 looked at children aged 18-30 months, and CIG2 looked at children aged 3-7 years. I’m not sure how many people know about or use these, but they’re an excellent resource that deserves to be more widely available, so I’ve imported the text into a database to provide a searchable interface to them at Kig. That site also includes the original .cha files, along with information from the original website (which is showing signs of bitrot).

Although other search parameters could be introduced (along the lines of the BangorTalk site), I’ve chosen for the moment just to split the searches between child and adult utterances, since I think that’s what most people would be interested in initially – what sort of output do the children in these corpora produce? (That in turn should probably be segmented by age, but the data to do that is now available in the downloadable versions of the database, and it can be added to the website if there is any demand).

The adult utterances, of course, can also be used as another citation source in Eurfa, so I’ve added them there too.

# Two more corpora for Eurfa

November 13th, 2013 by donnek

I’ve just added citations from the Siarad and Patagonia corpora to Eurfa. Both corpora are GPL-licensed, and were put together by a team led by Prof. Margaret Deuchar. They contain transcriptions of actual spoken Welsh – and, as with all languages, the spoken version can be a rather different beast from the formal written language. Siarad in particular contains quite a few “codeswitches”, where the speaker uses a word from another language (here English) in utterances that are mainly in the other language (here Welsh). In the past, people have criticised this sort of thing as a sign of poor Welsh, but in fact the evidence suggests it is actually a marker of linguistic competence – it tends to be speakers who are equally capable in both languages who do this, while less capable speakers tend to stick to one language.

The corpora provide important data for looking at this and other bilingual phenomena, and among other things have been used to show that the degree to which a codeswitch becomes a loanword is related to its frequency, that “cognates” (words that are similar in both languages, eg siop/shop, or perswadio/persuade) tend to raise the likelihood of codeswitches being used for other words (especially nouns), and so on.

The Patagonia corpus shows less word-level codeswitching than Siarad – speakers there more often switch into the other language for entire utterances or passages – and of course the main codeswitching language in that corpus is Spanish rather than English.

For use in Eurfa, both corpora had to be stripped down a bit. If you look at the material on Bangortalk, you’ll see that it follows the CLAN transcription format, which includes various markers for pauses, hesitations, non-verbal interactions, and so on. To make things easier to read, it made sense to get rid of most of these, with the exception of the brackets to indicate elision, eg mae’n gets represented as mae (y)n.

Then utterances whose transcription was fewer than 20 characters long were removed – about 45% of the original 78,000-odd utterances in Siarad, and about 53% of the original 37,500-odd in Patagonia. One of the things that is noticeable in these corpora is how short the average spoken utterance is compared to the written language.
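As a rough illustration, the marker-stripping and the 20-character cut might look like this (the marker patterns below are simplified stand-ins, nothing like the full CLAN inventory):

```python
import re

def tidy_utterance(utt):
    """Strip common CLAN/CHAT-style markers, but keep the elision
    brackets, so 'mae (y)n' stays as it is."""
    utt = re.sub(r"\[[^\]]*\]", "", utt)   # square-bracket markers such as [/]
    utt = re.sub(r"&\S+", "", utt)         # non-word events such as &=laugh
    utt = re.sub(r"\s+", " ", utt).strip() # collapse leftover whitespace
    return utt

def keep(utt, min_chars=20):
    """Apply the same length cut used for Siarad and Patagonia."""
    return len(tidy_utterance(utt)) >= min_chars
```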

The last thing was to remove the utterances that are entirely in a language other than Welsh. The CLAN format was changed a couple of years ago to incorporate what I consider a regression. Previously, a default language would be chosen for the transcription (usually the most frequent language in the conversation), and all words not in that language would be tagged. The new system allows a “precode” marker to be attached to utterances which are entirely in the non-default language, presumably on the grounds that this makes the transcriptions easier to read. However, the problem with this is that you can no longer tell simply from the word what language it is in: an untagged word could be Welsh if the default language is Welsh, but it could also be English if there is a precode marker.

I chose the “easy” (aka “fast”) way to do this: deleting any items with precodes marking English or Spanish. Unfortunately, that still leaves items that were not marked with precodes because the default language of the transcription was English or Spanish! I looked at addressing this with Alberto Simões’ Lingua::Identify, but, as I expected, the results are variable: the utterances are mostly too short to give a good “fingerprint”, and this is compounded by the brackets, the lack of punctuation, and the codeswitches.
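The “easy” route itself is simple enough to sketch – drop anything carrying an English or Spanish precode (I'm assuming the CHAT-style "[- eng]" form for precodes here; adjust the pattern if your transcripts use a different convention):

```python
import re

# Matches an utterance-initial precode marking English or Spanish,
# e.g. "[- eng] how are you ?"
PRECODE = re.compile(r"^\s*\[-\s*(eng|spa)\s*\]")

def is_nondefault(utt):
    """True if the utterance carries an English or Spanish precode."""
    return bool(PRECODE.match(utt))
```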

For Siarad, some 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 31% of the utterances remaining after removing those with fewer than 20 characters. Most of these were in fact Welsh, and even the ones marked English were about 50% Welsh. I therefore left them all, giving a total of around 42,300 citations.

For Patagonia, the same 20 languages were given for about 32% of the remaining utterances. Again, most of these were in fact Welsh, but the ones marked as Spanish and Portuguese were predominantly Spanish, so I removed them, leaving a total of around 15,500 citations.

The upshot is that if in Eurfa you search for an English word, and then click to get citations from Siarad or Patagonia, there is a small chance you will get a “Welsh” equivalent of the English citation that is actually in English or Spanish!

November 12th, 2013 by donnek

Eurfa will now allow you to get in-context citations for its words from the Kynulliad3 corpus. I’ve limited it to 5 at the minute, but I also want to look at doing random lookups to produce a different 5 each time.

I’m hoping to add some other corpora to Eurfa in the same way over the coming months.

I also took the opportunity of changing the AJAX search-boxes to HTML ones, because the AJAX ones had the annoying “feature” of not doing the search if you pressed Return – instead, they just cleared the searchbox.

# Autoglossing Gàidhlig

August 28th, 2013 by donnek

Over the last month I’ve been wondering about how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago at Sabhal Mòr Ostaig when it was still a collection of farm buildings ….

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, but with the lemma specified where they are derived forms, and also each derived word listed as part of the generating word/lemma’s own entry. Still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (eg Sotho).

# Language of first delivery

July 4th, 2013 by donnek

I’ve added some data to Kynulliad3 (it’s not in the current download, but it will be in the next one) so that the “language of first delivery” can be calculated. In the Assembly, Members can speak in either language, which is placed on the left-hand side of the Record. The translation, into English or Welsh as appropriate, then goes on the right-hand side of the Record. If we separate out the sentences first delivered in each language from those translated from the other, we get the following totals (words first delivered in that language / total words in that language):

Welsh: 879,964 / 8,865,452 (9.9%)
English: 7,916,148 / 8,775,382 (90.2%)

So about 10% of the word total in the Third Assembly used Welsh as the language of first delivery. This is a bit lower than the proportion of Welsh-speakers in Wales (19% according to the 2011 Census), but I don’t know how it relates to the proportion of Welsh-speakers in the Third Assembly (or at least the proportion who felt confident enough to use Welsh in this setting). It’s a pretty sizeable percentage, though.

July 3rd, 2013 by donnek

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!

# Language detection for Welsh

June 6th, 2013 by donnek

For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.

There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 useable pairs, which then need to be split to give sentences. They also need to be rearranged in a consistent direction: the Records put the first language of the pair as whatever language was actually used by the speaker, with the translation in the other language following. So a language detection routine is required.
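Once you have a detector, the rearrangement step is trivial – something along these lines (a sketch; detect here is any callable returning an ISO code such as 'cy' for Welsh, standing in for Lingua::Identify's langof):

```python
def order_pair(first, second, detect):
    """Given a (first-delivered, translation) pair in whatever order the
    Record printed them, return it consistently as (welsh, english)."""
    if detect(first) == "cy":
        return first, second
    return second, first
```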

Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.

I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
use Lingua::Identify qw(langof);
return langof( shift );
$perlcode$;


Running

select langof('my string in an unknown language');


will then give an identifier for the language of the string.

When I ran this against the first item of each of the 94,000 pairs, the language was misidentified in only 26 cases (0.03%) – 10 of these should have been English and 16 Welsh. So far I haven’t come across any instance where an English item was identified as Welsh, or vice versa. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!

It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.

I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.

# Eurfa v3.0

May 23rd, 2013 by donnek

In 2003 I started putting together a Welsh wordlist to help with KDE translation, since we were barred from using output from any of the publicly-funded lexical projects (!). In 2005 I put together a verb conjugator (Konjugator) to generate the inflected forms of around 4,000 verbs, and combined those with the wordlist to produce the first version of Eurfa in 2006, with a second edition following in 2007.

At the time it was published, Eurfa was the first Celtic dictionary to list mutated words and verb inflections (though others have copied that idea since). It is still the largest free (GPL) dictionary in Welsh (over 10,000 lemmas at the minute), and was used for the Apertium Welsh-English gist translator and for tagging 900k words of multilingual spoken conversations (BangorTalk).

The original 2007 website was still up until around 3 weeks ago, when server changes meant it stopped working. So I’ve given it a complete makeover using Joshua Gatcke’s very attractive HTML Kickstart. This included streamlining the contents. The old website had a lot of space devoted to proselytising openness, but that battle is pretty much won (except where Welsh language resources are concerned!) – the new Government Service Design Manual mandates a preference for open-source software, the UK Research Councils now have a policy of open access for the outputs of funded research, the Government has set up an open data website giving access to 9,000 public sector datasets, an open operating system (Android) is whuppin’ the ass of the proprietary operating systems, and so on. So the only extraneous bit of the old site I kept was the poem on Pangur Bán, which I think is as fresh and relevant now as it was when it was written by an Irish monk some 1,100 years ago!

I did take the opportunity of folding the conjugator into the new version of Eurfa – the Konjugator site went down some years ago, and I never bothered setting it up again. The current incarnation is much better, and perhaps I have learnt a little in the meantime, because the code for printing out the inflected tenses is only 15% of the length of the previous code, yet it also handles periphrastic tenses (those formed with auxiliary verbs)!
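To give a flavour of why periphrastic tenses need so little code: instead of storing every inflected form, you combine a conjugated form of the auxiliary (bod, "to be") with yn and the verbnoun. A toy sketch, nothing like Eurfa's actual code, using a deliberately tiny colloquial paradigm:

```python
# Simplified colloquial present-tense forms of the auxiliary bod ("to be").
BOD_PRESENT = {
    "1s": "dw i",
    "2s": "rwyt ti",
    "3s": "mae o",
}

def periphrastic_present(verbnoun, person):
    """Build a periphrastic present: auxiliary + 'yn' + verbnoun,
    e.g. 'dw i yn mynd' ('I am going')."""
    return f"{BOD_PRESENT[person]} yn {verbnoun}"
```

The same auxiliary paradigm then serves every verbnoun in the dictionary, which is where the saving over listing inflections verb-by-verb comes from.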

Rhymer is still there, allowing you to get lists of rhyming words – again, another feature that has been copied (but not bettered!) since.

I’ve continued work on Eurfa over the years, though much of that hasn’t made it into the wild. But I hope to add some nice features to the new site over the next 6-8 months.

# TikZ

April 25th, 2013 by donnek

Till Tantau’s PGF (Portable Graphics Format) is a package that adds graphics capabilities to {La|Xe}TeX, and TikZ (Tikz ist kein Zeichenprogramm – “Tikz is not a drawing program”) is a set of commands on top of PGF that makes it easier to handle. A number of packages that are useful to linguists use TikZ (eg Daniele Pighin’s TikZ-dependency for drawing dependency diagrams, or David Chiang’s tikz-qtree), and I’ve recently been investigating how to use it for representing pitch in tone and intonational languages.

One useful program that makes it easier to experiment with TikZ is Florian Hackenberger’s ktikz. Although this hasn’t been updated since 2010, it works well. You enter the TikZ code in the left-hand pane, and the right-hand pane gives you an immediate representation of what your graphic looks like. Some templates for standard LaTeX documents are available in /usr/share/kde4/apps/ktikz/templates, but it’s a good idea to set up a template in your home dir so that you don’t have to be root to edit it, and point ktikz to it. You can then add additional packages and frequently-used tikzlibrary options to that. Note that new tikzlibrary options will not be accessed until ktikz has been closed and restarted. Alternatively, you can just add tikzlibraries you will not be using frequently above the \begin{tikzpicture}.

Once you have drawn your graphic, you may want to put it on a webpage. One handy way of doing this is Pavel Holoborodko’s QuickLaTeX. The QuickLaTeX server compiles the LaTeX code you enter, and gives you a URL for the resulting output. You can then paste that URL into the webpage where you want the LaTeX output to appear. For a TikZ graphic, put the tikzpicture code in the top box, and the preamble lines:
\usepackage{tikz}
\usepackage{tikz-qtree}
\usepackage{tikz-qtree-compat}
in the Choose Options box. Then click Render to get the compiled code and its URL.

An alternative approach for a WordPress blog is to use Pavel’s plugin, which compiles the code and sends your blog the image to be inserted in place of the code. Add the above lines to the preamble box on the Advanced tab of the plugin settings, and then you can, for instance, use the following tikz-qtree code (taken from a TeX StackExchange question) to produce a nicely-formatted syntax tree:

\begin{tikzpicture}
\Tree [
.TP [
.T\1 \node(C){T+verb}; [
.vP \qroof{ana}.DP [
.v\1 \node(B){v+verb};
.VP [
.V\1 \node(A){V+verb}; \qroof{taalib}.DP
]
]
]
]
]
\draw [semithick,->] (A) to[out=240,in=270] (B.south);
\draw [semithick,->] (B) to[out=240,in=180] (C);
\end{tikzpicture}

TikZ is a very powerful system for producing almost any kind of printed graphic, and QuickLaTeX allows those graphics to be made easily available on the web.