OCR Classics GCE: Roman Britain

December 9th, 2011 by donnek

If you’re doing the Roman History from Original Sources (Option 3: Britain in the Roman Empire) (F392) paper in the OCR GCE (AS-level) 2011 Classics syllabus (HO38), you might be interested in these notes, which have been drawn from a variety of sources. The notes will open in OpenOffice or LibreOffice, and you can then edit them as necessary to suit yourself. There is also a package of 14 maps (snaps of hand-drawn sheets, I’m afraid, so the download is 11MB), which might also be useful.

Mining corpora with the Autoglosser

December 8th, 2011 by donnek

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.

LaTeX template for play scripts

October 12th, 2011 by donnek

I was recently asked by Steffan to put some playtexts into a neater format, and of course I chose LaTeX for the job. It may be that the template I came up with would be of use to others, so I’m posting it here, along with a pdf of the template output.

The template is fairly simple, depending on description lists for the most part, and I’ve commented it so that it should be fairly easy to adjust for what you need. Apologies to Sheridan for mucking about with The Rivals to give some sample text. The output is pretty similar to that of the Methuen playtexts series, so it should be acceptable in most good theatres …

The only issue I haven’t resolved yet is getting rid of the blank page at the very start of the pdf.
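The description-list approach can be sketched in a few lines; the environment and macro names below are illustrative assumptions of mine, not necessarily the ones used in the actual template:

```latex
\documentclass{book}
% Illustrative sketch only: a dialogue environment built on a
% description list, with speaker names as small-caps item labels.
\newenvironment{dialogue}{\begin{description}}{\end{description}}
\newcommand{\speaker}[1]{\item[\textsc{#1}]}
\begin{document}
\begin{dialogue}
\speaker{Mrs Malaprop} Sure, if I reprehend any thing in this world,
it is the use of my oracular tongue, and a nice derangement of epitaphs!
\speaker{Absolute} Indeed, ma'am.
\end{dialogue}
\end{document}
```

Stage directions, scene headings and so on would each get a similar small macro, which is what makes the template easy to adjust.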

Presentation at ITA11

September 9th, 2011 by donnek

Just back from a day at ITA11 in Wrexham, where I presented a paper giving a broad overview of the Autoglosser (now published in the conference Proceedings). The session was on web content, so this fitted in well, but much of the rest of the conference was more technical. There seems to be a lot going on at Glyndŵr University, which has been a bit under the radar. It was the first time I had been there, and I can recommend the 1887 restaurant!

ISB8 presentation

June 20th, 2011 by donnek

I presented our paper last week in Oslo. ISB8 was a great experience! The accuracy comparison showed that the Autoglosser was on a par with CLAN’s MOR tagger for Spanish (97.4%), and within 2% of human tagging for Welsh (97.9%). The final presentation is available here.

Paper accepted for ISB8

January 30th, 2011 by donnek

On Friday we were very pleased to get word that our paper (Glossing CHAT files using the Bangor Autoglosser) has been accepted for presentation at ISB8. We’ll be using texts from the ESRC Centre’s Miami and Siarad corpora to test the coverage and accuracy of the autoglosser, and compare its output to CLAN’s MOR/POST tagger (for Spanish) and manual glossing (for Welsh). The results should be interesting.

Conversation profiles

December 16th, 2010 by donnek

At the transcription workshop last month, Jens Normann Jørgensen showed some graphics which mapped the development of a bilingual conversation over time, and they struck me as a very interesting way of trying to grasp the overall profile of the conversation. This inspired me to see if I could use R to produce something similar for the ESRC corpora.

Normann categorised the utterances in his conversations into 5 groups (Danish utterances with no loan, Danish-based utterances with loans, code-switching utterances and other utterances, Turkish-based utterances with loans, and Turkish utterances with no loan), to give profiles like this:

J. N. Jørgensen profile: number 903

For further details, see pp320ff of:
J. N. Jørgensen (2008): Languaging: Nine years of poly-lingual development of young Turkish-Danish grade school students (2 volumes), Copenhagen Studies in Bilingualism, Volumes 15 and 16, Copenhagen.

The approach I have used is altogether more basic – I calculate the number or percentage of words in each utterance that are tagged in the transcript as belonging to a specific language, and then graph each utterance as a vertical line, with the numbers or percentages stacked.
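The per-utterance calculation can be sketched in a few lines. The actual plots were produced with R; this hypothetical Python version (with made-up language codes and function names) just shows the stacking arithmetic behind each vertical line:

```python
from collections import Counter

def utterance_profile(utterances):
    """For each utterance (a list of (word, lang) pairs, where lang is a
    language tag from the transcript, e.g. 'spa', 'eng', or 'und' for
    undetermined), return the percentage of words in each language.
    Each dict corresponds to one stacked column in the profile plot."""
    profiles = []
    for words in utterances:
        counts = Counter(lang for _, lang in words)
        total = sum(counts.values())
        profiles.append({lang: 100 * n / total for lang, n in counts.items()})
    return profiles

# A two-utterance toy example: one mixed Spanish/English, one Welsh.
profile = utterance_profile([
    [("hola", "spa"), ("you", "eng"), ("know", "eng"), ("si", "spa")],
    [("dw", "cym"), ("i", "cym")],
])
```

For the number-based profiles, the raw counts would be stacked instead of the percentages.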

In the profiles below, blue is Spanish or Welsh, yellow is English, and purple is undetermined (ie the item occurs in a printed dictionary for more than one language). First, the profiles based on numbers (with the scale on the y-axis):
Sastre1 profile: numbers
Stammers4 profile: numbers

These two conversations are very different: in sastre1 (from the Miami corpus), the speakers move into and out of both Spanish and English; in stammers4 (from the Welsh Siarad corpus), the conversation is almost monolingual, except for a few English words, and there is a greater use of indeterminate items.

The profiles based on percentages make the general “texture” of the conversation clearer, but overemphasise indeterminate items, since they give a one-word utterance the same prominence as a multiword utterance:
Sastre1 profile: percentages
Stammers4 profile: percentages

Transcription Workshop

November 23rd, 2010 by donnek

The Corpus Linguistics group at the ESRC Centre held a transcription workshop on 19-20 November to look at practical issues in transcription. Professor Brian MacWhinney, from Carnegie Mellon University, the originator of the CLAN software and the Talkbank repository, was the guest speaker. There were 11 other presentations from speakers from as far afield as Denmark and Estonia, with plenty of time for discussion. I gave a short presentation on the autoglosser, and preparing it was quite a useful review of how far we’ve come. We can now import CHAT files in 4 different marking formats, and autogloss the text in 3 languages – Welsh, English, and Spanish. A great deal more remains to be done, particularly in regard to increasing the accuracy above the magic 95% mark, but it looks quite doable.

Platform and browser

October 21st, 2010 by donnek

I’ve just finished a project for the Psychology Department at Bangor University, which involved logging various pieces of data on participants as they used the web interface to the survey. One of the most interesting aspects from my point of view was the platform and browser of the participant.

Out of the 834 participants in this sample, 770 (92%) were using Internet Explorer or Firefox, with a 71/29% split between these two. In the “other” category was Chrome (4%), Safari (2%), and Opera (less than 1%).

Browser numbers

Non-Microsoft platforms were noticeable by their paucity – 2% for the Mac, and 1% for Linux. Of the Windows flavours, XP was the most numerous (37%), followed by Vista (33%), and then 7 (26%) – there were even a couple of instances of 2000!

It could be argued that the sample of participants in a consumer survey may be slightly skewed, and not totally random, but these figures are a useful corrective to the figures for “power users” (itself a skewed sample) which we may be more used to seeing reported on IT-oriented sites.

Refactoring an Apertium dictionary

August 21st, 2010 by donnek

One of the great things about the Apertium machine translation project is that Fran Tyers and others connected with it have assembled sizeable collections of free (GPL) lexical data. So that was the first place to look when I wanted a Spanish dictionary to use with the Bangor Autoglosser. However, the dictionaries are in XML format, which is notoriously slow for this sort of task (in Apertium, the dictionaries are compiled before use), and clumsy to process in PHP. I therefore ended up refactoring the dictionary into a csv file (downloadable here), which I think is a more useable option for our autoglossing needs (it can be read in a spreadsheet or imported into a database).

To do this, we need to generate a text file containing the contents of the Apertium dictionary. For Ubuntu, the easiest way to go is to install apertium and apertium-en-es. We can test it by opening a terminal and typing:
echo "dog" | apertium en-es

echo "perro" | apertium es-en

We get “Perro” and “Dog” back, respectively (the capitalisation is due to Apertium’s somewhat problematical algorithm for this). To extract the dictionaries, we need to download the raw files for the en-es package, untar them, and then use an Apertium utility, lt-expand:

lt-expand apertium-en-es.es.dix > apertium_es.txt
lt-expand apertium-en-es.en.dix > apertium_en.txt

for the monolingual dictionaries, and:
lt-expand apertium-en-es.es-en.dix > apertium_enes.txt

for the bilingual one. The Spanish dictionary (which is a file of around 300Mb) is our main focus, and for our purposes we want to remove lines containing :>: or :<:, which will be duplicates, and those where the entries contain spaces (eg a fin de que). We then tag the lines to show the relevant field boundaries, and import them into a database. Once all the dictionaries are safely tucked up there, we can use SQL queries to insert the English lexemes (lemmas) into the Spanish entries.
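Assuming lt-expand’s usual surface:lemma&lt;tag&gt;&lt;tag&gt;… line format, the filtering and field-splitting step might look something like this hypothetical Python sketch (the actual processing was done via tagging and SQL imports, so treat this as an illustration of the logic only):

```python
import re

def filter_expanded(lines):
    """Drop LR/RL-restricted duplicates (lines containing :>: or :<:)
    and multiword entries (surface forms containing spaces), then split
    each remaining line into (surface, lemma, tags) fields, ready for
    import into a database table."""
    rows = []
    for line in lines:
        if ":>:" in line or ":<:" in line:
            continue  # duplicate of an unrestricted entry
        surface, _, analysis = line.partition(":")
        if " " in surface:
            continue  # multiword entry, e.g. "a fin de que"
        m = re.match(r"([^<]*)((?:<[^>]+>)*)$", analysis)
        if not m:
            continue
        lemma, tagstr = m.groups()
        tags = re.findall(r"<([^>]+)>", tagstr)
        rows.append((surface, lemma, tags))
    return rows

rows = filter_expanded([
    "perro:perro<n><m><sg>",
    "a fin de que:a fin de que<cnjsub>",
    "perros:>:perro<n><m><pl>",
])
```

Only the first line survives: the second is a multiword entry and the third is a restricted duplicate.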

The result is a table with around 690,000 entries. Around 95% of these are verbforms, and about 87% of those are verbforms with enclitic pronouns (eg háblenosles). Although the execution speed for database lookups gained by rationalising these a bit is probably negligible, decreasing the size of the file makes it easier to distribute.

The first thing I did was to convert the Apertium tags to make them slightly more mnemonic, and segment the categories into their own fields – there are nearly 1900 different tags in the original file, many of them with only a few entries. The number of determiners especially seemed excessive, and for adjusting these I used a very useful tool – SQL Workbench/J, which is the only GUI tool I’ve come across so far that lets you edit the resultsets of PostgreSQL queries. The refactored dictionary has 173 separate combinations of POS tags.

The second thing was to segment the roughly 560,000 clitic verbforms, leaving only around 15,000 base verbforms. This is on the understanding that we can deal with the unsegmented forms via dynamic analysis and tagging – the download of the refactored dictionary contains a file with sample PHP functions that will do this. These standalone verbforms then have to be added back to the dictionary, because they usually entail orthographical variations in terms of accents. For example, the imperative 3 singular of decir is diga when it is standalone, but díga when a clitic pronoun is attached, as in dígame.
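The segmentation idea can be illustrated roughly as follows. The clitic list, the accent map, and the function name are all illustrative assumptions of mine, not the PHP functions shipped with the download:

```python
# Spanish enclitic pronouns (illustrative subset, longest matched first).
CLITICS = ["me", "te", "se", "nos", "os", "le", "les", "lo", "los", "la", "las"]

def segment(form, base_forms, accent_map):
    """Peel enclitic pronouns off the end of a verbform, then map the
    accented stem back to its standalone spelling, e.g. dígame -> diga + me.
    base_forms is the set of standalone verbforms in the dictionary;
    accent_map restores the unaccented form (díga -> diga)."""
    clitics = []
    while True:
        stem = accent_map.get(form, form)
        if stem in base_forms:
            return stem, clitics
        for c in sorted(CLITICS, key=len, reverse=True):
            if form.endswith(c) and len(form) > len(c):
                form = form[: -len(c)]
                clitics.insert(0, c)
                break
        else:
            return None  # no analysis found

stem, clitics = segment("dígame", {"diga"}, {"díga": "diga"})
```

A greedy longest-match scheme like this is also why only one segmentation of an ambiguous form is returned, as with the háblenosles example below.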

The last thing was to remove all the names, because the autoglosser will assume that something is a name of some sort if it starts with a capital.

The end result is a dictionary file with around 130,000 entries. This is probably not perfect (eg the clitic functions will segment háblenosles above as imperative 3 singular + 1 plural + 3 singular, and not admit the alternative of imperative 3 plural + 2 plural + 3 singular), but the file is a lot more manageable now.