Patagonia and Miami corpora complete!

July 27th, 2012 by donnek

Phew! For the last 6 months there was just no time for anything but the Patagonia and Miami corpora, and they’ve now been completed on schedule and sent off to TalkBank. Kudos to Professor Margaret Deuchar and her team of researchers.

Patagonia (Welsh/Spanish) corpus main statistics:
150k words, 78% Welsh, 17% Spanish, 5% indeterminate

Miami (Spanish/English) corpus main statistics:
167k words, 63% English, 34% Spanish, 3% indeterminate

Apart from producing the glosses via the Autoglosser, I was involved in running a Git repository, translating and editing, silencing audio files for privacy purposes, tweaking the files to accommodate last-minute changes to the CLAN format, and building a website for the corpora themselves. All very interesting indeed.

Possibly the best thing about the corpora is that, like the 2009 Siarad corpus, they are under the GPL. In fact, Siarad and Patagonia seem to be the only large-scale collections of Welsh text to use this free license, which ensures that everyone can have unfettered access to high-quality materials produced using public funds.

Talking of which, the website devoted to all three corpora is at BangorTalk. Have a look and listen to real language in action!
