Archive for the ‘Linguistics’ category

Autoglosser2 released

February 2nd, 2018

During 2009-11 I wrote the Bangor Autoglosser to gloss the Bangor ESRC corpora of multilingual (Welsh, Spanish, English) conversational text. I’ve done a new version, Autoglosser2, that focusses on Welsh written text, and outputs CorCenCC tags as well as Bangor-type glosses. Speed has been greatly increased too, from 1,000 to 22,000 glosses/minute. You can test it online, but for detailed work it’s better to download and install locally. There’s also a detailed manual available. Lots of work to do on it still, but it’s pretty robust, and gives reasonably good results.

Article on typesetting pitch in LaTeX

November 3rd, 2015

I should have mentioned this earlier, but the 2013 article I published in TUGboat is now freely available after the subscriber purdah period. The article shows how most pitch-related markings can be represented in LaTeX.

Note that Mark Wibrow’s excellent tikz-pitch-contour code is now located at GitHub, since Gitorious is no longer extant.

Autoglossing Gàidhlig

August 28th, 2013

Over the last month I’ve been wondering about how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago in Sabhal Mòr Ostaig when it was still a collection of farm buildings ….

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, but with the lemma specified where they are derived forms, and also each derived word listed as part of the generating word/lemma’s own entry. Still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.
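To make the dictionary layout described above more concrete, here is a minimal sketch in Python (not the TeX script itself). It assumes the dictionary can be reduced to a mapping from each word form to its lemma, with a lemma mapping to itself; the function name build_entries and the toy English entries are invented purely for illustration and are not taken from the Gàidhlig dictionary.

```python
from collections import defaultdict


def build_entries(forms):
    """Lay out a dictionary alphabetically.

    `forms` maps each surface form to its lemma (a lemma maps to itself).
    Derived forms get a pointer to their lemma; lemmas list their derived forms.
    """
    # Collect the derived forms belonging to each lemma.
    derived = defaultdict(list)
    for form, lemma in forms.items():
        if form != lemma:
            derived[lemma].append(form)

    lines = []
    for form in sorted(forms):
        lemma = forms[form]
        if form != lemma:
            # Derived form: listed in its own alphabetical slot, with its lemma.
            lines.append(f"{form}: see {lemma}")
        elif form in derived:
            # Lemma: its entry also lists every form derived from it.
            lines.append(f"{form}: {', '.join(sorted(derived[form]))}")
        else:
            lines.append(form)
    return lines


# Toy data (hypothetical, English only, for illustration).
toy = {"run": "run", "ran": "run", "running": "run", "cat": "cat", "cats": "cat"}
for line in build_entries(toy):
    print(line)
```

Each word appears once in alphabetical order, and the derived forms turn up a second time inside the entry for their lemma, which is the double listing described above.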

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (eg Sotho).

Kynulliad3 released

July 3rd, 2013

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!
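For anyone who prefers to work with the downloaded corpus rather than the search box, the same kind of lookup can be sketched in a few lines of Python. This is only an illustration: it assumes the aligned sentences have already been loaded as (Welsh, English) pairs, and the function name search_aligned and the three toy pairs are made up for the example rather than taken from Kynulliad3.

```python
def search_aligned(pairs, phrase):
    """Return the (welsh, english) pairs whose English side contains `phrase`.

    Matching is a simple case-insensitive substring test.
    """
    needle = phrase.lower()
    return [(cy, en) for cy, en in pairs if needle in en.lower()]


# Toy aligned pairs (hypothetical stand-ins for corpus sentences).
pairs = [
    ("Bore da.", "Good morning."),
    ("Diolch yn fawr.", "Thank you very much."),
    ("Nos da.", "Good night."),
]

for cy, en in search_aligned(pairs, "good"):
    print(f"{en}\t{cy}")
```

Printing the Welsh side next to each English hit lets you see at a glance how the translators handled the phrase in context.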

Patagonia and Miami corpora complete!

July 27th, 2012

Phew! There was just no time for anything over the last 6 months but the Patagonia and Miami corpora, and they’ve now been completed on schedule and sent off to Talkbank. Kudos to Professor Margaret Deuchar and her team of researchers.

Patagonia (Welsh/Spanish) corpus main statistics:
150k words, 78% Welsh, 17% Spanish, 5% indeterminate

Miami (Spanish/English) corpus main statistics:
167k words, 63% English, 34% Spanish, 3% indeterminate

Apart from producing the glosses via the Autoglosser, I was involved in running a Git repository, translation and editing, silencing audiofiles for privacy purposes, tweaking the files to accommodate last-minute changes to the CLAN format, and doing a website for the corpora themselves – very interesting indeed.

Possibly the best thing about the corpora is that, like the 2009 Siarad corpus, they are under the GPL. In fact, Siarad and Patagonia seem to be the only large-scale collections of Welsh text to use this free license, which ensures that everyone can have unfettered access to high-quality materials produced using public funds.

Talking of which, the website devoted to all three corpora is at BangorTalk. Have a look and listen to real language in action!

Autoglossing historical Welsh

December 11th, 2011

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200 word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug in to the system as appropriate.
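To put those rough percentages into word counts, here is a small back-of-envelope calculation in Python. It assumes the piece is exactly 1,200 words and that both percentages are taken over the whole text; since the figures above are approximate, the results are only indicative.

```python
total_words = 1200          # approximate length of the 1779 piece
incorrect_pct = 3           # tagged incorrectly (rough count)
ambiguous_pct = 40          # not sufficiently disambiguated (rough count)

# Integer arithmetic avoids floating-point rounding noise.
incorrect = total_words * incorrect_pct // 100
ambiguous = total_words * ambiguous_pct // 100
resolved = total_words - incorrect - ambiguous

print(f"incorrect: {incorrect}, ambiguous: {ambiguous}, correct and unambiguous: {resolved}")
```

On those assumptions, a few dozen words are actually wrong, while several hundred are waiting on constraint grammar rules suited to this sort of text.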

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

Mining corpora with the Autoglosser

December 8th, 2011

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.

Presentation at ITA11

September 9th, 2011

Just back from a day at ITA11 in Wrexham, where I presented a paper giving a broad overview of the Autoglosser (now published in the conference Proceedings). The session was on web content, so this fitted in well, but much of the rest of the conference was more technical. There seems to be a lot going on at Glyndŵr University, which has been a bit under the radar. It was the first time I had been there, and I can recommend the 1887 restaurant!

ISB8 presentation

June 20th, 2011

I presented our paper last week in Oslo. ISB8 was a great experience! The accuracy comparison showed that the Autoglosser was on a par with CLAN’s MOR tagger for Spanish (97.4%), and within 2% of human tagging for Welsh (97.9%). The final presentation is available here.

Paper accepted for ISB8

January 30th, 2011

On Friday we were very pleased to get word that our paper (Glossing CHAT files using the Bangor Autoglosser) has been accepted for presentation at ISB8. We’ll be using texts from the ESRC Centre’s Miami and Siarad corpora to test the coverage and accuracy of the autoglosser, and compare its output to CLAN’s MOR/POST tagger (for Spanish) and manual glossing (for Welsh). The results should be interesting.