Autoglossing historical Welsh

December 11th, 2011

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.
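To make the four stages concrete, here is a minimal sketch in Python of the same flow: import running text, look up each word, apply a constraint rule, and write out. It is not the Autoglosser's actual code. The function names, the tiny lexicon, the tag labels and the single rule are all assumptions made up for illustration.

```python
import re

# Hypothetical toy lexicon: surface form -> candidate (lemma, tag) readings.
# The tag labels are invented for this sketch, not the Autoglosser's tagset.
LEXICON = {
    "mae": [("bod", "V.3S.PRES")],
    "ci": [("ci", "N.M.SG")],
    "yn": [("yn", "PREP"), ("yn", "PRT")],  # ambiguous: preposition or particle
    "y": [("y", "DET.DEF")],
    "cae": [("cae", "N.M.SG")],
}


def import_text(text):
    """Import stage: split running monolingual text into lower-cased words."""
    return re.findall(r"\w+", text.lower())


def lookup(tokens):
    """Lookup stage: attach every candidate reading; unknown words get UNK."""
    return [(tok, list(LEXICON.get(tok, [(tok, "UNK")]))) for tok in tokens]


def disambiguate(glossed):
    """Constraint stage: one toy rule, standing in for a real rule file.

    If a word is ambiguous and the next word can be a determiner,
    keep only its PREP reading (when it has one).
    """
    out = []
    for i, (tok, readings) in enumerate(glossed):
        following = glossed[i + 1][1] if i + 1 < len(glossed) else []
        next_is_det = any(tag.startswith("DET") for _, tag in following)
        if len(readings) > 1 and next_is_det:
            preps = [r for r in readings if r[1] == "PREP"]
            if preps:
                readings = preps
        out.append((tok, readings))
    return out


def write_out(glossed):
    """Write-out stage: one line per word, readings separated by ' | '."""
    return "\n".join(
        tok + "\t" + " | ".join(f"{lemma}.{tag}" for lemma, tag in readings)
        for tok, readings in glossed
    )


def autogloss(text):
    """Run import, lookup and disambiguation in order."""
    return disambiguate(lookup(import_text(text)))


result = autogloss("Mae ci yn y cae.")
print(write_out(result))
```

The point of the layout is that only `import_text` knows what kind of source it is reading, so swapping conversational transcripts for running text leaves the later stages untouched.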

The section I chose was a 1,200-word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug into the system as appropriate.
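As a rough illustration of the pluggable-ruleset idea, the Python sketch below keeps one list of rules per text type and measures how many words are still ambiguous after a given ruleset has run. Everything here is hypothetical: the ruleset names, the rule, the tags and the sample data are invented, and the empty list for the older text just stands for rules that have yet to be written.

```python
def prefer_prep_before_det(tokens):
    """Toy rule: an ambiguous word keeps only its PREP reading
    when the following word can be a determiner."""
    out = []
    for i, (tok, readings) in enumerate(tokens):
        following = tokens[i + 1][1] if i + 1 < len(tokens) else []
        next_is_det = any(tag.startswith("DET") for _, tag in following)
        preps = [r for r in readings if r[1] == "PREP"]
        if len(readings) > 1 and next_is_det and preps:
            readings = preps
        out.append((tok, list(readings)))
    return out


# Hypothetical registry: one ruleset (a list of rule functions) per text type.
RULESETS = {
    "modern_colloquial": [prefer_prep_before_det],
    "c18_prose": [],  # placeholder: rules for this text type not yet written
}


def apply_ruleset(tokens, name):
    """Run every rule in the named ruleset, in order, over the tokens."""
    for rule in RULESETS[name]:
        tokens = rule(tokens)
    return tokens


def ambiguous_share(tokens):
    """Fraction of words that still have more than one reading."""
    if not tokens:
        return 0.0
    return sum(1 for _, readings in tokens if len(readings) > 1) / len(tokens)


# Invented sample: (word, candidate (lemma, tag) readings).
sample = [
    ("yn", [("yn", "PREP"), ("yn", "PRT")]),
    ("y", [("y", "DET.DEF")]),
    ("cae", [("cae", "N.M.SG")]),
]

for name in RULESETS:
    share = ambiguous_share(apply_ruleset(sample, name))
    print(f"{name}: {share:.0%} of words still ambiguous")
```

Keeping the rulesets in a lookup table means a new period or genre is added by registering another list, without changing the code that applies them.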

It’s gratifying that the Autoglosser can make a pretty good showing at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

OCR Classics GCE: Roman Britain

December 9th, 2011

If you’re doing the Roman History from Original Sources (Option 3: Britain in the Roman Empire) (F392) paper in the OCR GCE (AS-level) 2011 Classics syllabus (HO38), you might be interested in these notes, which have been drawn from a variety of sources. The notes will open in OpenOffice or LibreOffice, and you can then edit them as necessary to suit yourself. There is also a package of 14 maps (snaps of hand-drawn sheets, I’m afraid, so the download is 11MB), which might also be useful.

Mining corpora with the Autoglosser

December 8th, 2011

I did a presentation last Friday at the ESRC Centre, focussing on using the Autoglosser output to pull stuff out of the corpora. It was interesting to note that over the last 7 months we’ve moved quite a distance in this direction, with about 6 different areas where the autoglossing is able to assist linguistic analysis of the texts in the corpora.