Autoglossing historical Welsh

December 11th, 2011 by donnek Leave a reply »

David Willis asked about the feasibility of using the Autoglosser to tag texts in his Historical Corpus of Welsh. It proved easier than expected to do a proof-of-concept: set up the Autoglosser to import from running monolingual text instead of conversational bilingual text, and then let everything else (lookup, constraint grammar and write-out) work as normal.

The section I chose was a 1,200 word piece from 1779 – a translation of the autobiography of James Groniosaw, an African prince who was enslaved. I set up a way of handling the old-style spelling, though that would need some more work, and the results are available here.

On a rough count, only about 3% of the words are actually tagged incorrectly. Another 40% are not sufficiently disambiguated, but that is more a matter of writing constraint grammar rules that will apply to this sort of text (we were in the same position with the conversation transcripts a few months ago). An interesting option might be to set up different rulesets for different periods or types of Welsh text, which you could plug in to the system as appropriate.

It’s gratifying that the Autoglosser can make a pretty good show at tagging Welsh that is over 230 years old, as well as tagging the modern colloquial Welsh it was designed for.

Leave a Reply