Archive for the ‘Linguistics’ category

Article on typesetting pitch in LaTeX

November 3rd, 2015

I should have mentioned this earlier, but the 2013 article I published in TUGboat is now freely available after the subscriber purdah period. The article shows how most pitch-related markings can be represented in LaTeX.

Note that Mark Wibrow’s excellent tikz-pitch-contour code is now located at GitHub, since Gitorious is no longer extant.

Autoglossing Gàidhlig

August 28th, 2013

Over the last month I’ve been wondering about how easy it would be to port the Autoglosser to some completely new language material. This would give me the opportunity to look at things like importing normal text instead of CHAT files, dealing with multiwords (collocations where the meaning of the whole is different from the combined meaning of the parts), better handling of capitalised items, etc. Eventually I decided to take the plunge with Gàidhlig (Scottish Gaelic), which I learnt 30 years ago in Sabhal Mòr Ostaig when it was still a collection of farm buildings ….

Surprisingly, once I had assembled a dictionary, the actual port took only a couple of days, and gives pretty good results, as you can see from the output at the website. There’s obviously a lot to be done yet – in particular, developing a stemmer to simplify the dictionary. Talking of which, I also put together a little TeX script which typesets the dictionary in the form I’ve always wanted: all words listed in alphabetical order, but with the lemma specified where they are derived forms, and also each derived word listed as part of the generating word/lemma’s own entry. Still needs a bit of work (it should be in two columns, for instance), but it shows that the dictionary layout is robust enough to give quite sophisticated output.
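
To make the dictionary layout described above a little more concrete, here is a minimal sketch in Python (not the TeX script itself). It assumes the dictionary can be reduced to (form, lemma) pairs; the `ENTRIES` data, the `layout` function and the output wording are all hypothetical, and the words are English stand-ins rather than items from the Gàidhlig dictionary.

```python
from collections import defaultdict

# Toy (form, lemma) pairs: English stand-ins, not taken from the real dictionary.
ENTRIES = [
    ("run", "run"),
    ("ran", "run"),
    ("running", "run"),
    ("mouse", "mouse"),
    ("mice", "mouse"),
]

def layout(entries):
    """List every form alphabetically. Derived forms point to their lemma,
    and each lemma's own entry lists the forms derived from it."""
    derived = defaultdict(list)
    for form, lemma in entries:
        if form != lemma:
            derived[lemma].append(form)

    lines = []
    for form, lemma in sorted(entries):
        if form != lemma:
            # Derived form: cross-reference to the lemma.
            lines.append(f"{form} -> see {lemma}")
        elif derived[form]:
            # Lemma with derived forms: list them in its entry.
            lines.append(f"{form}: {', '.join(sorted(derived[form]))}")
        else:
            lines.append(form)
    return lines

lines = layout(ENTRIES)
for line in lines:
    print(line)
```

The point is only that a single flat list of forms with their lemmas is enough to generate both the cross-references and the grouped entries.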

This port opens the way for more work on streamlining text consumption by the Autoglosser – at present, punctuation is not handled as well as I’d like. The multiword work is also a first step in allowing the handling of languages with disjunctive writing systems (eg Sotho).

Kynulliad3 released

July 3rd, 2013

I’ve just released Kynulliad3 – a corpus of aligned Welsh and English sentences, drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales. It contains over 350,000 sentences and almost 9m words in each language, and is a sort of sequel to Jones and Eisele’s earlier PNAW corpus.

Hopefully it will mean that all the great work done by the Assembly’s translators can be re-used for language processing research. And it also has practical benefits for language checking – if, for instance, you are translating “of almost” and wonder whether bron (almost) should have the usual soft-mutation after o (of), all you need to do is put “of almost” in the search box, and you have your answer!

Patagonia and Miami corpora complete!

July 27th, 2012

Phew! There was just no time for anything over the last 6 months but the Patagonia and Miami corpora, and they’ve now been completed on schedule and sent off to Talkbank. Kudos to Professor Margaret Deuchar and her team of researchers.

Patagonia (Welsh/Spanish) corpus main statistics:
150k words, 78% Welsh, 17% Spanish, 5% indeterminate

Miami (Spanish/English) corpus main statistics:
167k words, 63% English, 34% Spanish, 3% indeterminate

Apart from producing the glosses via the Autoglosser, I was involved in running a Git repository, translation and editing, silencing audiofiles for privacy purposes, tweaking the files to accommodate last-minute changes to the CLAN format, and doing a website for the corpora themselves – very interesting indeed.

Possibly the best thing about the corpora is that, like the 2009 Siarad corpus, they are under the GPL. In fact, Siarad and Patagonia seem to be the only large-scale collections of Welsh text to use this free license, which ensures that everyone can have unfettered access to high-quality materials produced using public funds.

Talking of which, the website devoted to all three corpora is at BangorTalk. Have a look and listen to real language in action!

Autoglossing historical Welsh

December 11th, 2011

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200 word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug in to the system as appropriate.

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

Mining corpora with the Autoglosser

December 8th, 2011

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.

Presentation at ITA11

September 9th, 2011

Just back from a day at ITA11 in Wrexham, where I presented a paper giving a broad overview of the Autoglosser (now published in the conference Proceedings). The session was on web content, so this fitted in well, but much of the rest of the conference was more technical. There seems to be a lot going on at Glyndŵr University, which has been a bit under the radar. It was the first time I had been there, and I can recommend the 1887 restaurant!

ISB8 presentation

June 20th, 2011

I presented our paper last week in Oslo. ISB8 was a great experience! The accuracy comparison showed that the Autoglosser was on a par with CLAN’s MOR tagger for Spanish (97.4%), and within 2% of human tagging for Welsh (97.9%). The final presentation is available here.

Paper accepted for ISB8

January 30th, 2011

On Friday we were very pleased to get word that our paper (Glossing CHAT files using the Bangor Autoglosser) has been accepted for presentation at ISB8. We’ll be using texts from the ESRC Centre’s Miami and Siarad corpora to test the coverage and accuracy of the autoglosser, and compare its output to CLAN’s MOR/POST tagger (for Spanish) and manual glossing (for Welsh). The results should be interesting.

Conversation profiles

December 16th, 2010

At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.

Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:

J. N. Jørgensen profile: number 903

For further details, see pp320ff of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.

The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
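
As a rough illustration of that calculation, here is a minimal sketch in Python rather than R. It assumes each utterance is available as a list of (word, language tag) pairs; the tag names in `LANGS`, the `profile` function and the toy data are hypothetical, and the transcripts' actual tags and format may well differ.

```python
from collections import Counter

# Hypothetical language tags; the corpora's real tag names may differ.
LANGS = ("spa", "eng", "und")

def profile(utterances):
    """Return, for each utterance, the word count and percentage per language tag."""
    rows = []
    for words in utterances:
        counts = Counter(lang for _, lang in words)
        nums = {lang: counts.get(lang, 0) for lang in LANGS}
        total = sum(nums.values())
        pcts = {
            lang: (100.0 * nums[lang] / total if total else 0.0)
            for lang in LANGS
        }
        rows.append({"counts": nums, "percent": pcts})
    return rows

# Toy data: each utterance is a list of (word, language tag) pairs.
sample = [
    [("quiero", "spa"), ("un", "spa"), ("sandwich", "und")],
    [("yeah", "eng")],
    [("pero", "spa"), ("it", "eng"), ("is", "eng"), ("okay", "eng")],
]

rows = profile(sample)
for i, row in enumerate(rows, start=1):
    print(i, row["counts"], row["percent"])
```

Each row then corresponds to one vertical line in the graph, with the three values stacked. In the percentage view, the one-word second utterance fills the full height just as a longer monolingual utterance would.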

In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (ie the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the scale on the y-axis):
Sastre1 profile: numbers
Stammers4 profile: numbers

These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.

The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance:
Sastre1 profile: percentages
Stammers4 profile: percentages