Article on typesetting pitch in LaTeX

November 3rd, 2015 by donnek

I should have mentioned earlier, but the 2013 article I published in TUGboat is now freely available after the subscriber purdah period. The article shows how most pitch-related markings can be represented in LaTeX.

Note that Mark Wibrow’s excellent tikz-pitch-contour code is now located at GitHub, since Gitorious is no longer extant.

Sharing a Git repo between two computers

October 14th, 2015 by donnek

I have a desktop PC which I use for big jobs — it has a bigger screen and more oomph. But I also have a laptop which I use for day-to-day work that requires less power, and which is also useable in the living-room. So if I do some coding on the laptop, I’d like that work to be available on the PC as well. The best way to do this is via Git, but how do you share the same Git repo between two computers? This post shows how to set up a single Git repo on the PC which can be committed to from either the PC or the laptop.

Install Git on the PC and set it to allow updates directly to a working copy. The --local option only works inside an existing repository, so run this command from inside the gitstuff repo once it has been initialised (see below):

git config --local receive.denyCurrentBranch updateInstead

This option is only available in Git 2.3 or later. To install a recent enough version from the git-core PPA:

sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update
sudo apt-get install git
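
To confirm that the newer version has been picked up:

git --version   # should now report 2.3 or later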

On the PC, open a terminal in the dir to be used as the Git repo (let’s call it gitstuff), and add a file there to specify files that will not be tracked:

nano .gitignore

Add .* (dot-star) to it, and then save it.

Initialise the gitstuff repo on the PC:

git init
git add . # note final dot
git commit -m 'Initial commit'

Note that a push is not necessary, because you are working directly in the git repo.

Clone the new gitstuff repo to the laptop:

git clone ssh://user@pc-hostname/home/user/gitstuff

(Here user, pc-hostname and /home/user/gitstuff are placeholders: substitute your own username on the PC, the PC’s hostname or IP address, and the full path to the gitstuff dir.)

This should give a new dir on the laptop called gitstuff.

Change into the gitstuff dir:

cd gitstuff

Choose a file there, make some changes to it, and save them. Then commit the changes and push them back to the Git repo on the PC.

git commit -am "laptop1st"
git push

Now check that those changes have been recorded on the PC. Note that a pull is not necessary, because you are working directly in the git repo, but if the file is already open in your editor you may need to reload it to see the changes.
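
For example, on the PC you can confirm that the laptop’s commit has arrived and that the working copy has been updated:

git log --oneline -3   # the "laptop1st" commit should now be at the top
git status             # should report a clean working tree (assuming no other local edits)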

So let’s go the other way. Make changes to a file on the PC and save them, and then commit:

git commit -am "pc1st"

(Remember, you don’t need to push, because you are working directly in the git repo.)

On the laptop, pull the changes you just made on the PC:

git pull

Make more changes to a file on the laptop and save them. Then commit and push:

git commit -am "laptop2nd"
git push

And on you go, making edits on the PC files and committing them, and pulling/editing/committing/pushing files on the laptop. Git records both sets of changes in one repo on the PC.

Some issues you might come across:

(1) If you commit some changes on the PC, then commit some changes on the laptop, and try to push a commit from laptop to PC without doing git pull, it will be rejected (with a message like: Updates were rejected because the remote contains work that you do not have locally). On the laptop, do git pull to get the latest edits from the PC – git will do a merge, opening an editor so you can adjust the merge message (you can just save and exit to accept the default one). Then do git push as normal to send the commits from the laptop to the PC.

(2) If you made some edits on the PC and did not commit them, and you then commit some changes on the laptop and then push, the push will fail because git will refuse to overwrite the uncommitted changes on the PC (with a message like: Working directory has unstaged changes). On the PC, do git commit -am "message" to commit the changes there, and then on the laptop do git pull to get the new commit from the PC, and then git push to send the failed commit on the laptop to the PC.

(3) If you commit some changes on the PC, but have made some changes on the laptop which have not been committed, doing git pull on the laptop will fail (with a message like: Your local changes would be overwritten). On the laptop, do git commit -am "message" to commit the changes on the laptop. Then do git pull to get the latest material from the PC, and git push to send the failed commit from the laptop to the PC.
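
In all three cases the fix on the laptop boils down to the same short sequence (for case (2), commit the stray changes on the PC first):

git commit -am "message"   # only needed if the laptop itself has uncommitted changes, as in (3)
git pull                   # merge in whatever the PC already has
git push                   # re-send the laptop's commits to the PC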

Kwici – a Welsh Wikipedia corpus

January 14th, 2014 by donnek

A couple of days ago I visited a resources page belonging to Linas Vepstas, the maintainer of the Link Grammar package of the OpenCog project, and that took me to the WaCky website, with some nice examples of tagged corpora.

That in turn led to Giuseppe Attardi and Antonio Fuschetto’s Wikipedia Extractor, which generates plain text from a Wikipedia dump, discarding markup etc. I’ve tried this sort of thing before, and the results leave a lot to be desired, but this one worked wonderfully.

Download it, make it executable, and point it at the dump:

./WikiExtractor.py -cb 250K -o output < cywiki-20131230-pages-articles-multistream.xml

The -cb switch compresses the output into files of 250K, and the -o switch stores them in the output dir.

Then combine all these into one big file:

find output -name '*bz2' -exec bzip2 -d -c {} \; > text.xml

(The command on the website is missing the -d switch.)

This gave a very tidy file with each document in its own doc tag. About the only stuff it didn't strip was some English in includeonly tags - I'm not sure what these are for, but they may have been updates that hadn't been translated yet. So Wikipedia Extractor did exactly what it said on the tin!

Once the doc tags, the English, and blank lines were removed, the text was relatively easily split into sentences, and imported into PostgreSQL for more tidying. I spent a day on that, but I didn't like the results, so I redid it over another day, and the outcome is Kwici - a 200k-sentence, 3.9m-word corpus of Welsh.
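
For reference, something along these lines will strip the doc tags and blank lines. This is only a rough sketch, assuming the combined text.xml from the previous step and an illustrative output name text_clean.txt; the leftover English still needed separate handling, and the real sentence-splitting and tidying happened in PostgreSQL:

sed -e '/^<doc /d' -e '/^<\/doc>$/d' text.xml | grep -v '^$' > text_clean.txt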

Kwici is now added to Eurfa as a seventh citation source.

More corpora …

December 16th, 2013 by donnek

Korrect/Kywiro was a collection of English-Welsh software translations that I put together in 2004, when there was a lot of community activity going on. It fell off the web a few years ago, but I’ve now resurrected it. In the new version I’ve chopped up the longer strings into chunks of one or two sentences, and stripped out a lot of the HTML. I also spent some time fixing up character encoding issues, which is an occupational hazard of dealing with older text from various machines, not all of which will have been using UTF-8. Hopefully it should be fairly tidy now, so I’ve added it to the citation sources at Eurfa.

Going even further back, in the late 90s Bob Morris Jones and colleagues put together two corpora dealing with child language acquisition – CIG1 looked at children aged 18-30 months, and CIG2 looked at children aged 3-7 years. I’m not sure how many people know about or use these, but they’re an excellent resource that deserves to be more widely available, so I’ve imported the text into a database to provide a searchable interface to them at Kig. That site also includes the original .cha files, along with information from the original website (which is showing signs of bitrot).

Although other search parameters could be introduced (along the lines of the BangorTalk site), I’ve chosen for the moment just to split the searches between child and adult utterances, since I think that’s what most people would be interested in initially – what sort of output do the children in these corpora produce? (That in turn should probably be segmented by age, but the data to do that is now available in the downloadable versions of the database, and it can be added to the website if there is any demand).

The adult utterances, of course, can also be used as another citation source in Eurfa, so I’ve added them there too.

Two more corpora for Eurfa

November 13th, 2013 by donnek

I’ve just added citations from the Siarad and Patagonia corpora to Eurfa. Both these corpora are GPL-licensed, and were put together by a team led by Prof. Margaret Deuchar. They contain transcriptions of actual spoken Welsh – and as with all languages, the spoken version can be a rather different beast from the formal written language – and Siarad in particular contains quite a few “codeswitches” (where the speaker uses a word from another language, in this case English, in an utterance that is otherwise in the main language, in this case Welsh). In the past, people have criticised this sort of thing as a sign of poor Welsh, but in fact the evidence suggests that it is actually a marker of linguistic competence – it tends to be speakers who are equally capable in both languages who do this, while those who are less capable tend to stick to one language.

The corpora provide important data for looking at this and other bilingual phenomena, and among other things have been used to show that the degree to which a codeswitch becomes a loanword is related to its frequency, that “cognates” (words that are similar in both languages, eg siop/shop, or perswadio/persuade) tend to raise the likelihood of codeswitches being used for other words (especially nouns), and so on.

The Patagonia corpus tends to use this sort of within-utterance codeswitching less than Siarad, with speakers instead switching into the other language for entire utterances or passages, and of course the main codeswitching language in that corpus is Spanish rather than English.

For use in Eurfa, both corpora had to be stripped down a bit. If you look at the material on Bangortalk, you’ll see that it follows the CLAN transcription format, which includes various markers for pauses, hesitations, non-verbal interactions, and so on. To make things easier to read, it made sense to get rid of most of these, with the exception of the brackets to indicate elision, eg mae’n gets represented as mae (y)n.

Then utterances where the transcription was less than 20 characters were removed – about 45% of the original 78,000-odd utterances in Siarad, and about 53% of the original 37,500-odd in Patagonia. One of the things that is noticeable in these corpora is how short the average spoken utterance is compared to the written language.
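
As an illustration of the length filter just mentioned (assuming one utterance per line in a hypothetical file utterances.txt, which is not the format of the original transcripts):

awk 'length($0) >= 20' utterances.txt > kept.txt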

The last thing was to remove the utterances that are entirely in a language other than Welsh. The CLAN format was changed a couple of years ago to incorporate what I consider a regression. Previously, a default language would be chosen for the transcription (usually the most frequent language in the conversation), and all words not in that language would be tagged. The new system allows a “precode” marker to be attached to utterances which are entirely in the non-default language, presumably on the grounds that this makes the transcriptions easier to read. However, the problem with this is that you can no longer tell simply from the word what language it is in: an untagged word could be Welsh if the default language is Welsh, but it could also be English if there is a precode marker.

I chose the “easy” (aka “fast”) way to do this: deleting any items with precodes marking English or Spanish. Unfortunately, that still leaves items that were not marked with precodes because the default language of the transcription was English or Spanish! I looked at addressing this with Alberto Simões’ Lingua::Identify, but, as I expected, the results are variable: the utterances are mostly too short to give a good “fingerprint”, and this is compounded by the brackets, the lack of punctuation, and the codeswitches.

For Siarad, some 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 31% of the utterances remaining after removing those with less than 20 characters. Most of these were in fact Welsh, and even the ones marked English were about 50% Welsh. I therefore left them all, giving a total of around 42,300 citations.

For Patagonia, the same 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 32% of the remaining utterances. Again, most of these were in fact Welsh, but the ones marked as Spanish and Portuguese were predominantly Spanish, so I removed them, leaving a total of around 15,500 citations.

The upshot is that if in Eurfa you search for an English word, and then click to get citations from Siarad or Patagonia, there is a small chance you will get a “Welsh” equivalent of the English citation that is actually in English or Spanish!

Citations added to Eurfa

November 12th, 2013 by donnek

Eurfa will now allow you to get in-context citations for its words from the Kynulliad3 corpus. I’ve limited it to 5 at the minute, but I also want to look at doing random lookups to produce a different 5 each time.

I’m hoping to add some other corpora to Eurfa in the same way over the coming months.

I also took the opportunity of changing the AJAX search-boxes to HTML ones, because the AJAX ones had the annoying “feature” of not doing the search if you pressed Return – instead, they just cleared the searchbox.

Autoglossing Gàidhlig

August 28th, 2013 by donnek

Over the last month I’ve been wondering about how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago in Sabhal Mòr Ostaig when it was still a collection of farm buildings ….

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, but with the lemma specified where they are derived forms, and also each derived word listed as part of the generating word/lemma’s own entry. Still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (eg Sotho).

Language of first delivery

July 4th, 2013 by donnek

I’ve added some data to Kynulliad3 (it’s not in the current download, but it will be in the next one) so that the “language of first delivery” can be calculated. In the Assembly, Members can speak in either language, and whatever they actually say is placed on the left-hand side of the Record. The translation, into English or Welsh as appropriate, then goes on the right-hand side of the Record. If we separate out those sentences which were first delivered in one language from those which have been translated from the other language, we get the following word totals:
Welsh: 879,964 first-delivery words / 8,865,452 total Welsh words (9.9%)
English: 7,916,148 first-delivery words / 8,775,382 total English words (90.2%)

So about 10% of the word total in the Third Assembly used Welsh as the language of first delivery. This is a bit lower than the proportion of Welsh-speakers in Wales (19% according to the 2011 Census), but I don’t know how it relates to the proportion of Welsh-speakers in the Third Assembly (or at least the proportion who felt confident enough to use Welsh in this setting). It’s a pretty sizeable percentage, though.

Kynulliad3 released

July 3rd, 2013 by donnek

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!

Language detection for Welsh

June 6th, 2013 by donnek

For the last week or so I’ve been putting together a corpus of aligned Welsh and English sentences from the Records of Proceedings of the Third Welsh Assembly (2007-2011). This is intended as one of the citation sources for Eurfa, and will complement Jones and Eisele’s earlier PNAW corpus – Dafydd Jones & Andreas Eisele (2006): “Phrase-based statistical machine translation between English and Welsh.” LREC-2006: Fifth International Conference on Language Resources and Evaluation: 5th SALTMIL Workshop on Minority Languages: “Strategies for developing machine translation for minority languages”, Genoa, Italy.

There are 256 documents with blocks of text in both languages, giving a total of around 129,000 aligned items. After cleaning, these yield about 94,000 useable pairs, which then need to be split to give sentences. They also need to be rearranged in a consistent direction: the Records put the first language of the pair as whatever language was actually used by the speaker, with the translation in the other language following. So a language detection routine is required.

Alberto Manuel Brandão Simões has written Lingua::Identify, a very nice Perl module that handles 33 languages. It turns out that Welsh had been deactivated because of a lack of “clean” Welsh text to prime the analyser. Once I sent him some text that was Welsh-only, Alberto produced (in what must be a record-breaking 20 minutes!) a new version of Lingua::Identify (0.54) that identifies Welsh and English with high precision.

I plugged it into a PostgreSQL function, kindly offered by Filip Rembiałkowski:

create or replace function langof( text ) returns varchar(2)
immutable returns null on null input
language plperlu as $perlcode$
    use Lingua::Identify qw(langof);
    return langof( shift );
$perlcode$;


select langof('my string in an unknown language');

will then give an identifier for the language of the string.
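
The module can also be called directly from the shell as a quick sanity check (assuming Lingua::Identify 0.54 or later is installed; the sample sentence is just an illustration):

perl -MLingua::Identify=langof -E 'print langof("Mae hen wlad fy nhadau yn annwyl i mi"), "\n"'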

When I ran this against the first item in the 94,000 pairs, there was a misidentification of the language in only 26 cases (0.03%) – 10 of these should have been English and 16 Welsh. So far I haven’t come across any instances where an English item was identified as Welsh, or vice versa. Lingua::Identify is an amazing tool for saving a LOT of work – hats off to Alberto!

It turns out that Welsh was the original language used in around 12% of the blocks (about 12,000 out of the 94,000 blocks). It should be possible later to get the number of words in each language over all the Third Assembly’s debates.

I’ve now split these blocks into sentence pairs (around 350,000 after cleaning). There are about 3,000 “holes”, where one sentence in the source language has been translated by two in the target language, or vice versa (annoying!), and I’m working on fixing those.

Hopefully the new corpus, Kynulliad3, should be ready in the next couple of weeks, with a website and download.