Archive for the ‘Linguistics’ category

Autoglossing historical Welsh

December 11th, 2011

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200-word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug into the system as appropriate.

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

Mining corpora with the Autoglosser

December 8th, 2011

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.

Presentation at ITA11

September 9th, 2011

Just back from a day at ITA11 in Wrexham, where I presented a paper giving a broad overview of the Autoglosser (now published in the conference Proceedings). The session was on web content, so this fitted in well, but much of the rest of the conference was more technical. There seems to be a lot going on at Glyndŵr University, which has been a bit under the radar. It was the first time I had been there, and I can recommend the 1887 restaurant!

ISB8 presentation

June 20th, 2011

I presented our paper last week in Oslo. ISB8 was a great experience! The accuracy comparison showed that the Autoglosser was on a par with CLAN’s MOR tagger for Spanish (97.4%), and within 2% of human tagging for Welsh (97.9%). The final presentation is available here.

Paper accepted for ISB8

January 30th, 2011

On Friday we were very pleased to get word that our paper (Glossing CHAT files using the Bangor Autoglosser) has been accepted for presentation at ISB8. We’ll be using texts from the ESRC Centre’s Miami and Siarad corpora to test the coverage and accuracy of the autoglosser, and compare its output to CLAN’s MOR/POST tagger (for Spanish) and manual glossing (for Welsh). The results should be interesting.

Conversation profiles

December 16th, 2010

At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.

Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:

J. N. Jørgensen profile: number 903

For further details, see pp320ff of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.

The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
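As a rough illustration of that calculation (in Python rather than the R I actually used, and leaving out the plotting), the sketch below counts the words per language tag in each utterance and converts the counts to percentages. The tag names and the sample utterances are invented for the example, and are not taken from the corpora.

```python
from collections import Counter

# Hypothetical language tags; the real transcripts may mark language differently.
LANGS = ("main", "eng", "undetermined")


def profile(utterances):
    """Return (numbers, percentages) per utterance, keyed by language tag.

    Each utterance is a list of (word, language_tag) pairs.
    """
    rows = []
    for words in utterances:
        counts = Counter(lang for _, lang in words)
        total = len(words)
        numbers = {lang: counts.get(lang, 0) for lang in LANGS}
        percentages = {
            lang: (100.0 * n / total if total else 0.0)
            for lang, n in numbers.items()
        }
        rows.append((numbers, percentages))
    return rows


# Invented sample: one four-word utterance and one single-word utterance.
sample = [
    [("mae", "main"), ("hi", "undetermined"), ("'n", "main"), ("nice", "eng")],
    [("okay", "undetermined")],
]

for numbers, percentages in profile(sample):
    print(numbers, percentages)
```

Each pair of dictionaries corresponds to one vertical line in the graph: the numbers give the stacked heights for the count-based profile, and the percentages give the heights for the percentage-based one.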

In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (ie the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the scale on the y-axis):
Sastre1 profile: numbers
Stammers4 profile: numbers

These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.

The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance:
Sastre1 profile: percentages
Stammers4 profile: percentages

Transcription Workshop

November 23rd, 2010

The Corpus Linguistics group at the ESRC Centre held a transcription workshop on 19-20 November to look at practical issues in transcription. Professor Brian MacWhinney, from Carnegie Mellon University, the originator of the CLAN software and the Talkbank repository, was the guest speaker. There were 11 other presentations from speakers from as far afield as Denmark and Estonia, with plenty of time for discussion. I gave a short presentation on the autoglosser, and preparing it was quite a useful review of how far we’ve come. We can now import CHAT files in 4 different marking formats, and autogloss the text in 3 languages – Welsh, English, and Spanish. A great deal more remains to be done, particularly in regard to increasing the accuracy above the magic 95% mark, but it looks quite doable.

Refactoring an Apertium dictionary

August 21st, 2010

One of the great things about the Apertium machine translation project is that Fran Tyers and others connected with it have assembled sizeable collections of free (GPL) lexical data. So that was the first place to look when I wanted a Spanish dictionary to use with the Bangor Autoglosser. However, the dictionaries are in XML format, which is notoriously slow for this sort of task (in Apertium, the dictionaries are compiled before use), and clumsy to process in PHP. I therefore ended up refactoring the dictionary into a CSV file (downloadable here), which I think is a more usable option for our autoglossing needs (it can be read in a spreadsheet or imported into a database).

To do this, we need to generate a text file containing the contents of the Apertium dictionary. For Ubuntu, the easiest way to go is to install apertium and apertium-en-es. We can test it by opening a terminal and typing:
echo "dog" | apertium en-es

or:
echo "perro" | apertium es-en

We get “Perro” and “Dog” back, respectively (the capitalisation is due to Apertium’s somewhat problematical algorithm for this). To extract the dictionaries, we need to download the raw files for the en-es package, untar them, and then use an Apertium utility, lt-expand:

lt-expand apertium-en-es.es.dix > apertium_es.txt
lt-expand apertium-en-es.en.dix > apertium_en.txt

for the monolingual dictionaries, and:
lt-expand apertium-en-es.es-en.dix > apertium_enes.txt

for the bilingual one. The Spanish dictionary (which is a file of around 300MB) is our main focus, and for our purposes we want to remove lines containing :>: or :<:, which will be duplicates, and those where the entries contain spaces (eg a fin de que). We then tag the lines to show the relevant field boundaries, and import them into a database. Once all the dictionaries are safely tucked up there, we can use SQL queries to insert the English lexemes (lemmas) into the Spanish entries.
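The filtering step could be sketched along the following lines. This is a minimal Python illustration and not the script I actually used: it assumes each line of the expanded file has the form surface:lemma followed by tags in angle brackets, and the sample lines (including their tags) are made up for the example.

```python
def keep_line(line):
    """Keep an expanded-dictionary line unless it is marked :>: or :<:,
    is blank, or contains a space (i.e. a multiword entry)."""
    line = line.strip()
    if not line:
        return False
    if ":>:" in line or ":<:" in line:
        return False
    return " " not in line


def filter_expanded(lines):
    """Return the stripped lines that pass keep_line, in their original order."""
    return [line.strip() for line in lines if keep_line(line)]


# Invented sample lines; the tags are illustrative only.
sample = [
    "perro:perro<n><m><sg>\n",
    "perros:>:perro<n><m><pl>\n",
    "a fin de que:a fin de que<cnjsub>\n",
    "\n",
]

print(filter_expanded(sample))
```

For a file of this size it would be better to read and write line by line than to hold everything in a list, but the test applied to each line would be the same.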

The result is a table with around 690,000 entries. Around 95% of these are verbforms, and about 87% of those are verbforms with enclitic pronouns (eg háblenosles). Rationalising these a bit probably makes a negligible difference to the speed of database lookups, but decreasing the size of the file makes it easier to distribute.

The first thing I did was to convert the Apertium tags to make them slightly more mnemonic, and segment the categories into their own fields – there are nearly 1900 different tags in the original file, many of them with only a few entries. The number of determiners especially seemed excessive, and for adjusting these I used a very useful tool – SQL Workbench/J, which is the only GUI tool I’ve come across so far that lets you edit the resultsets of PostgreSQL queries. The refactored dictionary has 173 separate combinations of POS tags.
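To give an idea of what segmenting the categories into their own fields involves, here is a small Python sketch. It assumes lines of the form surface:lemma followed by tags in angle brackets, and the mapping to more mnemonic labels is hypothetical: it does not reproduce the tagset used in the refactored dictionary.

```python
import re

# Hypothetical mapping from Apertium-style tags to more mnemonic labels.
TAG_MAP = {
    "vblex": "verb",
    "n": "noun",
    "m": "masc",
    "f": "fem",
    "sg": "sing",
    "pl": "plur",
}

# surface form, colon, lemma, then zero or more <tag> groups
LINE_RE = re.compile(
    r"^(?P<surface>[^:]+):(?P<lemma>[^<:]+)(?P<tags>(?:<[^>]+>)*)$"
)


def split_fields(line):
    """Split one line into surface, lemma, POS and features.

    Returns None for lines that do not fit the assumed format.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tags = re.findall(r"<([^>]+)>", match.group("tags"))
    pos = tags[0] if tags else ""
    return {
        "surface": match.group("surface"),
        "lemma": match.group("lemma"),
        "pos": TAG_MAP.get(pos, pos),
        "features": [TAG_MAP.get(tag, tag) for tag in tags[1:]],
    }


print(split_fields("perros:perro<n><m><pl>"))
```

Tags with no entry in the mapping are passed through unchanged, so rare tags can be dealt with later (by hand if need be) once they are in the database.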

The second thing was to segment the roughly 560,000 clitic verbforms, leaving only around 15,000 base verbforms. This is on the understanding that we can deal with the unsegmented forms via dynamic analysis and tagging – the download of the refactored dictionary contains a file with sample PHP functions that will do this. These standalone verbforms then have to be added back to the dictionary, because they usually entail orthographical variations in terms of accents. For example, the imperative 3 singular of decir is diga when it is standalone, but díga when a clitic pronoun is attached, as in dígame.
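The general idea of the dynamic analysis can be sketched as follows. This is a Python illustration and not the PHP functions in the download: the clitic list and the tiny lexicon of standalone verbforms are assumptions made for the example, and the sketch lists every segmentation it can find instead of choosing one.

```python
import unicodedata

# Illustrative list of enclitic pronouns; the real functions may use another set.
CLITICS = ("me", "te", "se", "nos", "os", "lo", "la", "le", "los", "las", "les")

# Tiny invented lexicon of standalone (unaccented) verbforms.
BASE_FORMS = {"diga", "hable", "hablen"}


def strip_acute(word):
    """Remove acute accents (á -> a), leaving marks such as ñ and ü alone."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))


def segment(word, max_clitics=3):
    """Return every (base, clitics) split whose accent-stripped base is known."""
    results = []

    def walk(stem, found):
        # Only accept a split once at least one clitic has been removed.
        if found:
            base = strip_acute(stem)
            if base in BASE_FORMS:
                results.append((base, tuple(found)))
        if len(found) == max_clitics:
            return
        for clitic in CLITICS:
            if stem.endswith(clitic) and len(stem) > len(clitic):
                walk(stem[: -len(clitic)], [clitic] + found)

    walk(word, [])
    return results


print(segment("dígame"))
print(segment("háblenosles"))
```

Stripping the accent before the lookup is a shortcut for the sketch only; it sidesteps the need to store accented variants such as díga, which the refactored dictionary handles by keeping those forms as entries.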

The last thing was to remove all the names, because the autoglosser will assume that something is a name of some sort if it starts with a capital.

The end result is a dictionary file with around 130,000 entries. This is probably not perfect (eg the clitic functions will segment háblenosles above as imperative 3 singular + 1 plural + 3 plural, and not admit the alternative of imperative 3 plural + 2 plural + 3 plural), but the file is a lot more manageable now.

The autoglosser takes a bow …

July 16th, 2010

At the very interesting Welsh Syntax seminar in Gregynog, there were a couple of presentations from the ESRC Centre, which should shortly be up on the seminar’s webpage. I spoke to a few slides on the first, summarising why the Bangor autoglosser had been developed, and what it does, and we also publicised the BangorTalk test site. Quite a lot of material has now been posted there, mainly to check how well the importer/autoglosser works, and also to experiment with presentation and layout. At the minute, the pages showing the text are pretty heavy to process, and also slow to show the gloss popups on older browsers (because the webpage is pretty big), so the next step there is to find a way (preferably using AJAX) to page through the text in chunks of about 50 utterances without interfering with the audio playback. Over the next few weeks I’ll be revising the Welsh and Spanish dictionaries, and doing a first version of a Spanish constraint grammar.

Constraint Grammar tutorial

May 14th, 2010

I’ve been doing some work over the past few weeks with the ESRC Centre for Research on Bilingualism at Bangor University, focussing on autoglossing their Welsh and Spanish conversation transcripts. As part of that, I’ve been using Constraint Grammar again as a possible approach to disambiguating words in the text.

Fran Tyers introduced me to CG, which is licensed under the GPL, when we were working on the Apertium Welsh translator 18 months ago. The Welsh grammar we ended up with, containing about 130 rules, was quite small by CG standards (the Portuguese CG grammar has around 9,000 rules), but was pretty effective.

In the course of revising and expanding that grammar, I thought it would consolidate my own learning to write a short tutorial, which might be useful to others as a gentler introduction to this very elegant and versatile system than the manual and howto. The result is a short note on Getting started with Constraint Grammar, using a Welsh sentence as the example text. The TeX source file is here, in case anyone wants to improve on it or extend it.