Patagonia and Miami corpora complete!

Phew! There was just no time for anything over the last 6 months but the Patagonia and Miami corpora, and they’ve now been completed on schedule and sent off to Talkbank. Kudos to Professor Margaret Deuchar and her team of researchers.

Patagonia (Welsh/Spanish) corpus main statistics:
150k words, 78% Welsh, 17% Spanish, 5% indeterminate

Miami (Spanish/English) corpus main statistics:
167k words, 63% English, 34% Spanish, 3% indeterminate

Apart from producing the glosses via the Autoglosser, I was involved in running a Git repository, translation and editing, silencing audiofiles for privacy purposes, tweaking the files to accommodate last-minute changes to the CLAN format, and doing a website for the corpora themselves – very interesting indeed.

Possibly the best thing about the corpora is that, like the 2009 Siarad corpus, they are under the GPL. In fact, Siarad and Patagonia seem to be the only large-scale collections of Welsh text to use this free license, which ensures that everyone can have unfettered access to high-quality materials produced using public funds.

Talking of which, the website devoted to all three corpora is at BangorTalk. Have a look and listen to real language in action!

Autoglossing historical Welsh

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200-word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug in to the system as appropriate.

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

OCR Classics GCE: Roman Britain

If you’re doing the Roman History from Original Sources (Option 3: Britain in the Roman Empire) (F392) paper in the OCR GCE (AS-level) 2011 Classics syllabus (HO38), you might be interested in these notes, which have been drawn from a variety of sources. The notes will open in OpenOffice or LibreOffice, and you can then edit them as necessary to suit yourself. There is also a package of 14 maps (snaps of hand-drawn sheets, I’m afraid, so the download is 11MB), which might also be useful.

Mining corpora with the Autoglosser

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.

LaTeX template for play scripts

I was recently asked by Steffan to put some playtexts into a neater format, and of course I chose LaTeX for the job. It may be that the template I came up with would be of use to others, so I’m posting it here, along with a pdf of the template output.

The template is fairly simple, depending on description lists for the most part, and I’ve commented it so that it should be fairly easy to adjust for what you need. Apologies to Sheridan for mucking about with The Rivals to give some sample text. The output is pretty similar to that of the Methuen playtexts series, so it should be acceptable in most good theatres …

The only issue I haven’t resolved yet is getting rid of the blank page at the very start of the pdf.

Presentation at ITA11

Just back from a day at ITA11 in Wrexham, where I presented a paper giving a broad overview of the Autoglosser (now published in the conference Proceedings). The session was on web content, so this fitted in well, but much of the rest of the conference was more technical. There seems to be a lot going on at Glyndŵr University, which has been a bit under the radar. It was the first time I had been there, and I can recommend the 1887 restaurant!

ISB8 presentation

I presented our paper last week in Oslo. ISB8 was a great experience! The accuracy comparison showed that the Autoglosser was on a par with CLAN’s MOR tagger for Spanish (97.4%), and within 2% of human tagging for Welsh (97.9%). The final presentation is available here.

Paper accepted for ISB8

On Friday we were very pleased to get word that our paper (Glossing CHAT files using the Bangor Autoglosser) has been accepted for presentation at ISB8. We’ll be using texts from the ESRC Centre’s Miami and Siarad corpora to test the coverage and accuracy of the autoglosser, and compare its output to CLAN’s MOR/POST tagger (for Spanish) and manual glossing (for Welsh). The results should be interesting.

Conversation profiles

At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.

Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:

J. N. Jørgensen profile: number 903

For further details, see pp. 320ff. of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.

The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
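The calculation described above can be sketched roughly as follows. This is only an illustration in Python rather than the R I actually used, and the data layout (each utterance as a list of word/language pairs), the language tags `spa`, `eng` and `und`, and the `profile` function are all assumptions made for the example, not the format of the real transcripts.

```python
from collections import Counter

# Hypothetical language tags; the real transcripts may label languages differently.
LANGS = ("spa", "eng", "und")


def profile(utterances):
    """Return, for each utterance, word counts and percentages per language.

    `utterances` is a list of utterances, each a list of (word, lang) pairs.
    """
    rows = []
    for utt in utterances:
        counts = Counter(lang for _, lang in utt)
        total = sum(counts[lang] for lang in LANGS)
        numbers = {lang: counts[lang] for lang in LANGS}
        percents = {
            lang: (100.0 * counts[lang] / total if total else 0.0)
            for lang in LANGS
        }
        rows.append((numbers, percents))
    return rows


# Invented sample data, for illustration only.
sample = [
    [("quiero", "spa"), ("un", "spa"), ("sandwich", "und"), ("please", "eng")],
    [("no", "und")],
]

for numbers, percents in profile(sample):
    print(numbers, percents)
```

Each pair of dictionaries corresponds to one vertical line in a profile: the numbers give the stacked heights on the count scale, and the percentages give the stacked heights on the 0-100 scale. In this sketch a one-word utterance reaches 100% for its language, however short it is.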

In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (i.e. the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the scale on the y-axis):
Sastre1 profile: numbers
Stammers4 profile: numbers

These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.

The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance:
Sastre1 profile: percentages
Stammers4 profile: percentages

Transcription Workshop

The Corpus Linguistics group at the ESRC Centre held a transcription workshop on 19-20 November to look at practical issues in transcription. Professor Brian MacWhinney, from Carnegie Mellon University, the originator of the CLAN software and the Talkbank repository, was the guest speaker. There were 11 other presentations from speakers from as far afield as Denmark and Estonia, with plenty of time for discussion. I gave a short presentation on the autoglosser, and preparing it was quite a useful review of how far we’ve come. We can now import CHAT files in 4 different marking formats, and autogloss the text in 3 languages – Welsh, English, and Spanish. A great deal more remains to be done, particularly in regard to increasing the accuracy above the magic 95% mark, but it looks quite doable.