Archive for the ‘Linguistics’ category

Conversation profiles

December 16th, 2010

At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.

Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:

J. N. Jørgensen profile: number 903

For further details, see pp320ff of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.

The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
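To make the calculation concrete, here is a minimal PHP sketch of the tallying step; the token/language pairs and tag names are invented for illustration, and the stacked plots themselves were produced in R from counts of this kind:

<?php
// Minimal sketch of the per-utterance tallying. The token/language pairs
// below are invented; the stacked plots themselves were drawn in R.

$utterances = array(
    array(array('bueno', 'spa'), array('how', 'eng'), array('are', 'eng'), array('you', 'eng')),
    array(array('pues', 'spa'), array('sí', 'spa'), array('no', 'und')),
);

foreach ($utterances as $i => $utterance) {
    $counts = array('spa' => 0, 'eng' => 0, 'und' => 0);
    foreach ($utterance as $token) {
        list($surface, $lang) = $token;
        if (isset($counts[$lang])) {
            $counts[$lang]++;
        }
    }
    $total = array_sum($counts);
    echo "utterance " . ($i + 1) . ":";
    foreach ($counts as $lang => $n) {
        // A number profile stacks the raw counts; a percentage profile
        // rescales each utterance to 100%, so a one-word utterance gets
        // the same height as a long one.
        $pct = $total > 0 ? round(100 * $n / $total) : 0;
        echo " $lang=$n ($pct%)";
    }
    echo "\n";
}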

In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (ie the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the number of words on the y-axis):
Sastre1 profile: numbers
Stammers4 profile: numbers

These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.

The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance:
Sastre1 profile: percentages
Stammers4 profile: percentages

Transcription Workshop

November 23rd, 2010

The Corpus Linguistics group at the ESRC Centre held a transcription workshop on 19-20 November to look at practical issues in transcription. Professor Brian MacWhinney of Carnegie Mellon University, the originator of the CLAN software and the TalkBank repository, was the guest speaker. There were 11 other presentations from speakers from as far afield as Denmark and Estonia, with plenty of time for discussion. I gave a short presentation on the autoglosser, and preparing it was quite a useful review of how far we’ve come. We can now import CHAT files in 4 different marking formats, and autogloss the text in 3 languages – Welsh, English, and Spanish. A great deal more remains to be done, particularly in regard to increasing the accuracy above the magic 95% mark, but it looks quite doable.

Refactoring an Apertium dictionary

August 21st, 2010

One of the great things about the Apertium machine translation project is that Fran Tyers and others connected with it have assembled sizeable collections of free (GPL) lexical data. So that was the first place to look when I wanted a Spanish dictionary to use with the Bangor Autoglosser. However, the dictionaries are in XML format, which is notoriously slow for this sort of task (in Apertium, the dictionaries are compiled before use), and clumsy to process in PHP. I therefore ended up refactoring the dictionary into a CSV file (downloadable here), which I think is a more usable option for our autoglossing needs (it can be read in a spreadsheet or imported into a database).

To do this, we need to generate a text file containing the contents of the Apertium dictionary. For Ubuntu, the easiest way to go is to install apertium and apertium-en-es. We can test it by opening a terminal and typing:
echo "dog" | apertium en-es

or:
echo "perro" | apertium es-en

We get “Perro” and “Dog” back, respectively (the capitalisation is due to Apertium’s somewhat problematical algorithm for this). To extract the dictionaries, we need to download the raw files for the en-es package, untar them, and then use an Apertium utility, lt-expand:

lt-expand apertium-en-es.es.dix > apertium_es.txt
lt-expand apertium-en-es.en.dix > apertium_en.txt

for the monolingual dictionaries, and:
lt-expand apertium-en-es.es-en.dix > apertium_enes.txt

for the bilingual one. The Spanish dictionary (which is a file of around 300MB) is our main focus, and for our purposes we want to remove lines containing :>: or :<:, which will be duplicates, and those where the entries contain spaces (eg a fin de que). We then tag the lines to show the relevant field boundaries, and import them into a database. Once all the dictionaries are safely tucked up there, we can use SQL queries to insert the English lexemes (lemmas) into the Spanish entries.
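As a rough sketch of that clean-up step, something along these lines works in PHP; the file names, the CSV layout and the tag separator are my own choices for illustration rather than the format of the downloadable file, and it assumes the usual surface:lemma<tag1><tag2> lines that lt-expand produces:

<?php
// Rough sketch of the clean-up step. File names and CSV layout are
// illustrative only, not the format of the downloadable dictionary.

$in  = fopen('apertium_es.txt', 'r');
$out = fopen('apertium_es.csv', 'w');

while (($line = fgets($in)) !== false) {
    $line = trim($line);
    // Skip direction-restricted entries, which are duplicates for our purposes.
    if (strpos($line, ':>:') !== false || strpos($line, ':<:') !== false) {
        continue;
    }
    // lt-expand lines look like  surface:lemma<tag1><tag2>...
    if (!preg_match('/^([^:]+):([^<]+)(<.+>)$/u', $line, $m)) {
        continue;
    }
    list(, $surface, $lemma, $tags) = $m;
    // Drop multiword entries such as "a fin de que".
    if (strpos($surface, ' ') !== false) {
        continue;
    }
    // Turn <tag1><tag2> into a pipe-separated field.
    $tags = str_replace('><', '|', trim($tags, '<>'));
    fputcsv($out, array($surface, $lemma, $tags));
}

fclose($in);
fclose($out);

The resulting file can then be loaded into PostgreSQL (with COPY, for instance) so that the English lexemes from the bilingual dictionary can be added with ordinary SQL joins.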

The result is a table with around 690,000 entries. Around 95% of these are verbforms, and about 87% of those are verbforms with enclitic pronouns (eg háblenosles). Although the execution speed for database lookups gained by rationalising these a bit is probably negligible, decreasing the size of the file makes it easier to distribute.

The first thing I did was to convert the Apertium tags to make them slightly more mnemonic, and to segment the categories into their own fields – there are nearly 1,900 different tag combinations in the original file, many of them with only a few entries. The number of determiners especially seemed excessive, and for adjusting these I used a very useful tool – SQL Workbench/J, which is the only GUI tool I’ve come across so far that lets you edit the result sets of PostgreSQL queries. The refactored dictionary has 173 separate combinations of POS tags.
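For illustration, the renaming can be done with a simple lookup table; the mnemonic labels below are invented for the example and are not the ones used in the refactored dictionary:

<?php
// Illustrative only: map a few Apertium tags to more mnemonic labels
// and split the pipe-separated tag field into separate values.

$tagmap = array(
    'n'     => 'noun',
    'vblex' => 'verb',
    'adj'   => 'adjective',
    'm'     => 'masculine',
    'f'     => 'feminine',
    'sg'    => 'singular',
    'pl'    => 'plural',
);

function relabel($tags, $tagmap)
{
    $out = array();
    foreach (explode('|', $tags) as $tag) {
        $out[] = isset($tagmap[$tag]) ? $tagmap[$tag] : $tag;
    }
    return $out;  // e.g. array('noun', 'masculine', 'singular')
}

print_r(relabel('n|m|sg', $tagmap));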

The second thing was to segment the roughly 560,000 clitic verbforms, leaving only around 15,000 base verbforms. This is on the understanding that we can deal with the unsegmented forms via dynamic analysis and tagging – the download of the refactored dictionary contains a file with sample PHP functions that will do this. The accented standalone forms then have to be added back to the dictionary, because attaching a clitic usually changes the orthography, in particular the written accent: the imperative 3 singular of decir is diga when it stands alone, but díga when a clitic pronoun is attached, as in dígame.
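The download has the actual sample functions; purely as a sketch of the idea, a greedy clitic-stripper might look something like this (dictionary_lookup() is a placeholder, not a real autoglosser function):

<?php
// Minimal sketch of the dynamic clitic analysis (not the sample functions
// shipped with the refactored dictionary). dictionary_lookup() stands in
// for whatever database query the autoglosser actually runs.

function dictionary_lookup($form)
{
    // Placeholder: pretend the dictionary only knows these forms.
    $known = array('diga', 'díga', 'hable', 'háble');
    return in_array($form, $known);
}

function segment_clitics($form)
{
    $clitics = array('me', 'te', 'se', 'nos', 'os', 'le', 'les', 'lo', 'los', 'la', 'las');
    $found = array();
    $stem = $form;
    // Peel pronouns off the end until the remainder is a known verbform.
    while (!dictionary_lookup($stem)) {
        $stripped = false;
        foreach ($clitics as $clitic) {
            $len = strlen($clitic);
            if (substr($stem, -$len) === $clitic) {
                array_unshift($found, $clitic);
                $stem = substr($stem, 0, -$len);
                $stripped = true;
                break;
            }
        }
        if (!$stripped) {
            return null;  // no analysis found
        }
    }
    return array('verb' => $stem, 'clitics' => $found);
}

print_r(segment_clitics('dígame'));
// -> verb: díga, clitics: (me), which is why the accented standalone
//    forms have to be in the dictionary.

Because a sketch like this strips greedily and stops at the first known verbform, it finds only one analysis per form – the limitation noted below for háblenosles.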

The last thing was to remove all the names, because the autoglosser will assume that something is a name of some sort if it starts with a capital.

The end result is a dictionary file with around 130,000 entries. This is probably not perfect (eg the clitic functions will segment háblenosles above as imperative 3 singular + 1 plural + 3 plural, and not admit the alternative of imperative 3 plural + 2 plural + 3 plural), but the file is a lot more manageable now.

The autoglosser takes a bow …

July 16th, 2010

At the very interesting Welsh Syntax seminar in Gregynog, there were a couple of presentations from the ESRC Centre, which should shortly be up on the seminar’s webpage. In the first, I presented a few slides summarising why the Bangor autoglosser had been developed and what it does, and we also publicised the BangorTalk test site. Quite a lot of material has now been posted there, mainly to check how well the importer/autoglosser works, and also to experiment with presentation and layout. At the moment, the pages showing the text are pretty heavy to process, and the gloss popups are slow to appear on older browsers (because the webpage is pretty big), so the next step there is to find a way (preferably using AJAX) to page through the text in chunks of about 50 utterances without interfering with the audio playback. Over the next few weeks I’ll be revising the Welsh and Spanish dictionaries, and doing a first version of a Spanish constraint grammar.

Constraint Grammar tutorial

May 14th, 2010

I’ve been doing some work over the past few weeks with the ESRC Centre for Research on Bilingualism at Bangor University, focussing on autoglossing their Welsh and Spanish conversation transcripts. As part of that, I’ve been using Constraint Grammar again as a possible approach to disambiguating words in the text.

Fran Tyers introduced me to CG, which is licensed under the GPL, when we were working on the Apertium Welsh translator 18 months ago. The Welsh grammar we ended up with, containing about 130 rules, was quite small by CG standards (the Portuguese CG grammar has around 9,000 rules), but was pretty effective.

In the course of revising and expanding that grammar, I thought it would consolidate my own learning to write a short tutorial, which might be useful to others as a gentler introduction to this very elegant and versatile system than the manual and howto. The result is a short note on Getting started with Constraint Grammar, using a Welsh sentence as the example text. The TeX source file is here, in case anyone wants to improve on it or extend it.

Quechua segmenter

August 24th, 2007

I’ve been working with Fran Tyers and the Apertium people over the past few months, and one of the issues for any MT system is dealing with the source language text that is fed into it. For interest, I decided to look at how an agglutinative language like Quechua might be dealt with, and the result is a very basic Quechua segmenter – there’s more info on the page. This needs much more work on the code (eg the ability to input connected, punctuated text) and a much bigger dictionary, but it actually works quite well.