Autoglosser2 released

February 2nd, 2018 by donnek

During 2009-11 I wrote the Bangor Autoglosser to gloss the Bangor ESRC corpora of multilingual (Welsh, Spanish, English) conversational text. I’ve done a new version, Autoglosser2, that focusses on Welsh written text, and outputs CorCenCC tags as well as Bangor-type glosses. Speed has been greatly increased too, from 1,000 to 22,000 glosses/minute. You can test it online, but for detailed work it’s better to download and install locally. There’s also a detailed manual available. Lots of work to do on it still, but it’s pretty robust, and gives reasonably good results.

New output options for Andika!

December 1st, 2017 by donnek

Gosh – such a long time since I posted! Too much stuff happening. I just completed a major refactoring of the pdf output possibilities for Andika!, adding lots of options that you can use to tweak how the Swahili text gets printed. All the details are in Chapter 8 of the revised manual.

North Wales Tech

February 2nd, 2016 by donnek

Carwyn Edwards and friends organised a very thought-provoking set of talks around the inaugural lecture at Pontio by Prof Sir John O’Reilly last Thursday (28 January).

The group is North Wales Tech, and the ambitious aim is to build a tech hub in this part of Wales, with makers, tinkerers, prototype-builders, etc all coming together to swap ideas and generally provide encouragement. This is going to be supported by a subscription to the FabLab and coLab in Pontio, which will hire time on a variety of prototyping machines like laser cutters that it would be difficult to justify buying just to make one prototype.

After an intro about local IT opportunities by Sian Williams of Supertemps, the three lightning talks were:

(1) Simulating Emotions in Games, where Chris Headleand showed how he was using 2D spaces mapping 2 emotions to allow game characters to look as though they are experiencing emotions like fear, reluctance, etc.

(2) Tablets on the Beach, where Graham Worley took us through developing a mobile app to allow the Environment Department of the Government of the Cayman Islands to log incidents more easily. It must have been absolutely awful having to work in the Caymans!

(3) DIY Space, where Jo Hinchliffe (concretedog) talked about launching DIY satellites based on the ProtoQube system.

Very thought-provoking indeed, and here’s looking forward to the next one!

New Swahili materials

January 9th, 2016 by donnek

For much of 2015, I was working on Swahili. The main effort has been on Andika!, a set of tools to allow handling of Swahili in Arabic script. I spoke on this at a seminar in Hamburg in April 2015 (thanks to Ridder Samsom for the invite!), and since then my top priority has been to use Andika! for the real-world task of providing a digital version of Swahili poetry in Arabic script. So I started working on two manuscripts of the hitherto-unpublished Utenzi wa Jaʿfar (The Ballad of Jaʿfar), adding pieces to Andika! as required, and I finished that last summer. Although there are still a few rough edges, I think it’s worth putting it up on the web.

Since then, I’ve been working on a paper to demonstrate how having a ballad text in a database might help in textually analysing it, and the results so far are very interesting. For instance, half of the stanzas use just 3 rhymes, one in five verbs are speech verbs, the majority of verbs (67%) have no specific time reference, 70% of words in rhyming positions are Arabic-derived, 83% of lines have two syntactic slots, and the four commonest sequences display marked word-order almost as frequently as the “normal” unmarked word-order. I’m currently looking at repetition and formula use, and there the ability to query the database to bring up similar word-sequences is crucial. Another benefit has been that concordances (based either on word or root) can be created very easily.
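To give a rough idea of what a word-based concordance involves, here is a minimal Python sketch that builds keyword-in-context rows from a list of lines. It is only an illustration: the function name, the context width, and the sample lines are all invented (the lines are not from the Utenzi wa Jaʿfar), and the real work is done by querying a database rather than with code like this.

```python
import re

def concordance(lines, keyword, width=2):
    """Return keyword-in-context rows as (line_no, left, keyword, right).

    Hypothetical helper for illustration; 'width' is the number of
    context words kept on each side of the keyword.
    """
    rows = []
    for line_no, line in enumerate(lines, start=1):
        words = re.findall(r"\w+", line.lower())
        for i, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                rows.append((line_no, left, word, right))
    return rows

# Invented sample lines, used only to show the shape of the output.
sample = [
    "mtoto anasoma kitabu kizuri",
    "mwalimu analeta kitabu kipya leo",
]

for row in concordance(sample, "kitabu"):
    print(row)
```

A root-based concordance would work the same way, except that each word would first be mapped to its root before the comparison.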

Ideally, you could load all your classical Swahili manuscripts into a big database, and this would give you an overview of spelling variations and word usage over time and space, with all this text available to be queried easily and printed in either Arabic or Roman script (or both), with a full textual apparatus if desired.

For conclusions based on a poetic corpus to be valid, they would also have to be compared with a prose corpus, and I’m also offering a small contribution to that, with a new Swahili corpus based on Wikipedia. This contains about 2.8m words in 150,000 sentences, so it’s reasonably comprehensive. In contrast to the work I’ve done on Welsh corpora, I used the NLTK tokenizer to split the sentences, and I’m not entirely sure yet whether I like the results as much as the hand-rolled splitting code I was using earlier. Something to come back to, possibly.

Gàidhlig autoglosser

January 8th, 2016 by donnek

I just realised that when I moved my code to GitHub I forgot to include the batch for the test application of the autoglosser to Gàidhlig. I’ve now rectified that, and also taken the opportunity to tidy up a few other things.

Article on typesetting pitch in LaTeX

November 3rd, 2015 by donnek

I should have mentioned earlier, but the 2013 article I published in TUGboat is now freely available after the subscriber purdah period. The article shows how most pitch-related markings can be represented in LaTeX.

Note that Mark Wibrow’s excellent tikz-pitch-contour code is now located at GitHub, since Gitorious is no longer extant.

Sharing a Git repo between two computers

October 14th, 2015 by donnek

I have a desktop PC which I use for big jobs — it has a bigger screen and more oomph. But I also have a laptop which I use for day-to-day work that requires less power, and which is also useable in the living-room. So if I do some coding on the laptop, I’d like to have it shared onto the PC. The best way to do this is via Git, but how do you share the same Git repo between two computers? This post shows how to set up a single Git repo on the PC which can be committed to from either the PC or the laptop.

The version of Git you install on the PC needs to allow a push to update the working copy directly, using the updateInstead value of the receive.denyCurrentBranch option. This is only available from Git 2.3 onwards, so to ensure you install the latest Git version, add a new repository first:

sudo add-apt-repository ppa:git-core/ppa
sudo apt-get update
sudo apt-get install git

On the PC, open a terminal in the dir to be used as the Git repo (let’s call it gitstuff), and add a file there to specify files that will not be tracked:

nano .gitignore

Add .* (dot-star) to it, and then save it.

Initialise the gitstuff repo on the PC:

git init
git add . # note final dot
git commit -m 'Initial commit'

Note that a push is not necessary, because you are working directly in the git repo.

Then set the gitstuff repo to allow updates directly to a working copy (this has to come after git init, because git config --local only works inside an existing repo):

git config --local receive.denyCurrentBranch updateInstead

Clone the new gitstuff repo to the laptop:

git clone ssh://

This should give a new dir on the laptop called gitstuff.

Change into the gitstuff dir:

cd gitstuff

Choose a file there, make some changes to it, and save them. Then commit the changes and push them back to the Git repo on the PC.

git commit -am "laptop1st"
git push

Now check that those changes have been recorded on the PC. Note that a pull is not necessary, because you are working directly in the git repo, but you may have to refresh the page in your editor if it is already open in order to see the changes.

So let’s go the other way. Make changes to a file on the PC and save them, and then commit:

git commit -am "pc1st"

(Remember, you don’t need to push, because you are working directly in the git repo.)

On the laptop, pull the changes you just made on the PC:

git pull

Make more changes to a file on the laptop and save them. Then commit and push:

git commit -am "laptop2nd"
git push

And on you go, making edits on the PC files and committing them, and pulling/editing/committing/pushing files on the laptop. Git records both sets of changes in one repo on the PC.

Some issues you might come across:

(1) If you commit some changes on the PC, then commit some changes on the laptop, and try to push a commit from laptop to PC without doing git pull, it will be rejected (with a message like: Updates were rejected because the remote contains work that you do not have locally). On the laptop, do git pull to get the latest edits from the PC – git will do a merge, opening an editor to allow you to add a merge message (you can just exit it to leave an empty one). Then do git push as normal to commit from the laptop to the PC.

(2) If you made some edits on the PC and did not commit them, and you then commit some changes on the laptop and then push, the push will fail because git will refuse to overwrite the uncommitted changes on the PC (with a message like: Working directory has unstaged changes). On the PC, do git commit -am "message" to commit the changes there, and then on the laptop do git pull to get the new commit from the PC, and then git push to send the failed commit on the laptop to the PC.

(3) If you commit some changes on the PC, but have made some changes on the laptop which have not been committed, doing git pull on the laptop will fail (with a message like: Your local changes would be overwritten). On the laptop, do git commit -am "message" to commit the changes on the laptop. Then do git pull to get the latest material from the PC, and git push to send the failed commit from the laptop to the PC.

Kwici – a Welsh Wikipedia corpus

January 14th, 2014 by donnek

A couple of days ago I visited a resources page belonging to Linas Vepstas, the maintainer of the Link Grammar package of the OpenCog project, and that took me to the WaCky website, with some nice examples of tagged corpora.

That in turn led to Giuseppe Attardi and Antonio Fuschetto’s Wikipedia Extractor, which generates plain text from a Wikipedia dump, discarding markup etc. I’ve tried this sort of thing before, and the results leave a lot to be desired, but this one worked wonderfully.

Download it, make it executable, and point it at the dump:

./ -cb 250K -o output < cywiki-20131230-pages-articles-multistream.xml

The -cb switch compresses the output into files of 250K, and the -o switch stores them in the output dir.

Then combine all these into one big file (the command on the website is missing the -d switch):

find output -name '*bz2' -exec bzip2 -d -c {} \; > text.xml

This gave a very tidy file with each document in its own doc tag. About the only stuff it didn't strip was some English in includeonly tags - I'm not sure what these are for, but they may have been updates that hadn't been translated yet. So Wikipedia Extractor did exactly what it said on the tin!

Once the doc tags, the English, and blank lines were removed, the text was relatively easily split into sentences, and imported into PostgreSQL for more tidying. I spent a day on that, but I didn't like the results, so I redid it over another day, and the outcome is Kwici - a 200k-sentence, 3.9m-word corpus of Welsh.
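The clean-up step can be pictured with a short Python sketch along the following lines. This is not the code used to build Kwici (that tidying was done largely in PostgreSQL); the regexes, the function name, and the sample text are assumptions for illustration, and the splitter is deliberately naive, so it will over-split on abbreviations.

```python
import re

# Assumed shape of the extractor's wrapper lines: <doc ...> and </doc>.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")
# Naive rule: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def clean_and_split(raw):
    """Drop doc-tag lines and blank lines, then split the rest into sentences."""
    sentences = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        sentences.extend(s for s in SENTENCE_END.split(line) if s)
    return sentences

# Invented sample, only to show the input and output shapes.
raw = '''<doc id="1" url="http://example.org/1" title="Enghraifft">
Enghraifft

Mae hon yn frawddeg. Dyma un arall!
</doc>
'''

for sentence in clean_and_split(raw):
    print(sentence)
```

Note that a bare title line survives as a "sentence" here, which is one example of the kind of thing that needs further tidying afterwards.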

Kwici is now added to Eurfa as a seventh citation source.

More corpora …

December 16th, 2013 by donnek

Korrect/Kywiro was a collection of English-Welsh software translations that I put together in 2004, when there was a lot of community activity going on. It fell off the web a few years ago, but I’ve now resurrected it. In the new version I’ve chopped up the longer strings into chunks of one or two sentences, and stripped out a lot of the HTML. I also spent some time fixing up character encoding issues, which is an occupational hazard of dealing with older text from various machines, not all of which will have been using UTF-8. Hopefully it should be fairly tidy now, so I’ve added it to the citation sources at Eurfa.

Going even further back, in the late 90s Bob Morris Jones and colleagues put together two corpora dealing with child language acquisition – CIG1 looked at children aged 18-30 months, and CIG2 looked at children aged 3-7 years. I’m not sure how many people know about or use these, but they’re an excellent resource that deserves to be more widely available, so I’ve imported the text into a database to provide a searchable interface to them at Kig. That site also includes the original .cha files, along with information from the original website (which is showing signs of bitrot).

Although other search parameters could be introduced (along the lines of the BangorTalk site), I’ve chosen for the moment just to split the searches between child and adult utterances, since I think that’s what most people would be interested in initially – what sort of output do the children in these corpora produce? (That in turn should probably be segmented by age, but the data to do that is now available in the downloadable versions of the database, and it can be added to the website if there is any demand).

The adult utterances, of course, can also be used as another citation source in Eurfa, so I’ve added them there too.

Two more corpora for Eurfa

November 13th, 2013 by donnek

I’ve just added citations from the Siarad and Patagonia corpora to Eurfa. Both these corpora are GPL-licensed, and were put together by a team led by Prof. Margaret Deuchar. They contain transcriptions of actual spoken Welsh – and as with all languages, the spoken version can be a rather different beast from the formal written language – and Siarad in particular contains quite a few “codeswitches” (where the speaker uses a word from another language, in this case English, in utterances that contain the main language, in this case Welsh). In the past, people have criticised this sort of thing as a signal of poor Welsh, but in fact the evidence suggests that it is actually a marker of linguistic competence – only speakers who are equally capable in both languages tend to do this, while those who are less capable tend to stick to one language.

The corpora provide important data for looking at this and other bilingual phenomena, and among other things have been used to show that the degree to which a codeswitch becomes a loanword is related to its frequency, that “cognates” (words that are similar in both languages, eg siop/shop, or perswadio/persuade) tend to raise the likelihood of codeswitches being used for other words (especially nouns), and so on.

The Patagonia corpus tends to use codeswitching less than the Siarad one, with the speakers switching into the other language for entire utterances or passages, and of course the main codeswitching language in that corpus is Spanish rather than English.

For use in Eurfa, both corpora had to be stripped down a bit. If you look at the material on Bangortalk, you’ll see that it follows the CLAN transcription format, which includes various markers for pauses, hesitations, non-verbal interactions, and so on. To make things easier to read, it made sense to get rid of most of these, with the exception of the brackets to indicate elision, eg mae’n gets represented as mae (y)n.

Then utterances where the transcription was less than 20 characters were removed – about 45% of the original 78,000-odd utterances in Siarad, and about 53% of the original 37,500-odd in Patagonia. One of the things that is noticeable in these corpora is how short the average spoken utterance is compared to the written language.
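The length filter itself is simple; a minimal Python sketch might look like this. The function name and the sample utterances are invented for illustration, and only the 20-character threshold comes from the description above.

```python
MIN_CHARS = 20  # threshold from the text: shorter transcriptions are removed

def keep_long_utterances(utterances, min_chars=MIN_CHARS):
    """Keep only utterances whose transcription has at least min_chars characters."""
    return [u for u in utterances if len(u.strip()) >= min_chars]

# Invented sample utterances, not taken from Siarad or Patagonia.
sample = [
    "ie",
    "dw i (y)n iawn",
    "mae (y)n bwrw glaw heddiw",
]

kept = keep_long_utterances(sample)
print(kept)
print(f"removed {len(sample) - len(kept)} of {len(sample)}")
```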

The last thing was to remove the utterances that are entirely in a language other than Welsh. The CLAN format was changed a couple of years ago to incorporate what I consider a regression. Previously, a default language would be chosen for the transcription (usually the most frequent language in the conversation), and all words not in that language would be tagged. The new system allows a “precode” marker to be attached to utterances which are entirely in the non-default language, presumably on the grounds that this makes the transcriptions easier to read. However, the problem with this is that you can no longer tell simply from the word what language it is in: an untagged word could be Welsh if the default language is Welsh, but it could also be English if there is a precode marker.

I chose the “easy” (aka “fast”) way to do this: deleting any items with precodes marking English or Spanish. Unfortunately, that still leaves items that were not marked with precodes because the default language of the transcription was English or Spanish! I looked at addressing this with Alberto Simões’ Lingua::Identify, but, as I expected, the results are variable: the utterances are mostly too short to give a good “fingerprint”, and this is compounded by the brackets, the lack of punctuation, and the codeswitches.
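The ambiguity, and the gap left by the “easy” approach, can be sketched in a few lines of Python. Everything here is a simplified model rather than the actual CLAN syntax or the code used for Eurfa: the dictionary layout, the function names, and the language codes (cym, eng, spa) are assumptions for illustration.

```python
def word_language(tag, default_lang, precode=None):
    """Resolve the language of one word in this simplified model.

    tag: language explicitly marked on the word, or None if untagged.
    An untagged word takes the precode language if the utterance has one,
    otherwise the default language of the whole transcription.
    """
    if tag is not None:
        return tag
    if precode is not None:
        return precode
    return default_lang

def drop_precoded(utterances, unwanted=("eng", "spa")):
    """The 'easy' filter: delete utterances whose precode marks an unwanted language."""
    return [u for u in utterances if u["precode"] not in unwanted]

# The same untagged word resolves differently depending on the precode.
print(word_language(None, "cym"))                 # default language applies
print(word_language(None, "cym", precode="eng"))  # precode overrides it

# Invented records: the third comes from a transcription whose default is English.
utterances = [
    {"id": 1, "default": "cym", "precode": None},
    {"id": 2, "default": "cym", "precode": "eng"},
    {"id": 3, "default": "eng", "precode": None},
]

kept = drop_precoded(utterances)
print([u["id"] for u in kept])
```

Utterance 3 survives the filter even though its untagged words are English, which is exactly the leftover case described above.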

For Siarad, some 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 31% of the utterances remaining after removing those with less than 20 characters. Most of these were in fact Welsh, and even the ones marked English were about 50% Welsh. I therefore left them all, giving a total of around 42,300 citations.

For Patagonia, the same 20 languages (cs, da, de, en, es, fi, fr, hr, hu, id, it, la, nl, pl, pt, ro, sl, sq, sv, tr) were given for about 32% of the remaining utterances. Again, most of these were in fact Welsh, but the ones marked as Spanish and Portuguese were predominantly Spanish, so I removed them, leaving a total of around 15,500 citations.

The upshot is that if in Eurfa you search for an English word, and then click to get citations from Siarad or Patagonia, there is a small chance you will get a “Welsh” equivalent of the English citation that is actually in English or Spanish!