More corpora …

December 16th, 2013 by donnek

Korrect/Kywiro was a collection of English-Welsh software translations that I put together in 2004, when there was a lot of community activity going on. It fell off the web a few years ago, but I’ve now resurrected it. In the new version I’ve chopped up the longer strings into chunks of one or two sentences, and stripped out a lot of the HTML. I also spent some time fixing up character encoding issues, which are an occupational hazard of dealing with older text from various machines, not all of which will have been using UTF-8. It should be fairly tidy now, so I’ve added it to the citation sources at Eurfa.
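
For anyone curious what that kind of clean-up involves, here is a minimal Python sketch. It is not the script actually used for Korrect/Kywiro: the function names (`strip_html`, `fix_mojibake`, `chunk_sentences`) are hypothetical, the tag removal is a naive regular expression, and the encoding repair only handles the common case where UTF-8 bytes were mis-read as Latin-1.

```python
import html
import re


def strip_html(s):
    """Naively drop tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", s)   # assumption: no '>' inside attributes
    text = html.unescape(no_tags)          # &amp; -> &, &nbsp; -> U+00A0, etc.
    return re.sub(r"\s+", " ", text).strip()


def fix_mojibake(s):
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (e.g. 'tÅ·' -> 'tŷ').

    If the round trip fails, the string was probably fine, so return it as-is.
    """
    try:
        return s.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return s


def chunk_sentences(text, max_sentences=2):
    """Split on ., ! or ? followed by whitespace, then group the sentences."""
    sentences = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]
```

Real translation strings will contain abbreviations, ellipses and placeholders that trip up a splitter this simple, so its output would still need checking by eye.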

Going even further back, in the late 90s Bob Morris Jones and colleagues put together two corpora dealing with child language acquisition – CIG1 looked at children aged 18-30 months, and CIG2 looked at children aged 3-7 years. I’m not sure how many people know about or use these, but they’re an excellent resource that deserves to be more widely available, so I’ve imported the text into a database to provide a searchable interface to them at Kig. That site also includes the original .cha files, along with information from the original website (which is showing signs of bitrot).

Although other search parameters could be introduced (along the lines of the BangorTalk site), I’ve chosen for the moment just to split the searches between child and adult utterances, since I think that’s what most people would be interested in initially – what sort of output do the children in these corpora produce? (That in turn should probably be segmented by age, but the data to do that is now available in the downloadable versions of the database, and it can be added to the website if there is any demand.)
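
As a rough illustration of the child/adult split, here is a small Python sketch that reads main-tier lines from a CHAT (.cha) transcript. It is not the import code behind Kig. It assumes the general CHAT layout (speaker lines beginning with `*`, dependent tiers with `%`, headers with `@`, continuation lines with a tab) and it assumes the child is coded `CHI`, which may not match the speaker codes used in CIG1 and CIG2; `split_by_speaker` and `child_codes` are hypothetical names.

```python
def split_by_speaker(lines, child_codes=("CHI",)):
    """Return (child_utterances, adult_utterances) from CHAT-format lines.

    Assumption: any speaker whose code is not in child_codes is an adult.
    """
    child, adult = [], []
    current = None  # list receiving the utterance currently being read
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("*"):
            # Main tier, e.g. "*CHI:<tab>utterance text"
            speaker, _, text = line[1:].partition(":")
            current = child if speaker.strip() in child_codes else adult
            current.append(text.strip())
        elif line.startswith("\t") and current is not None:
            # Continuation of the previous main-tier line
            current[-1] += " " + line.strip()
        else:
            # Header (@), dependent tier (%), or blank line: not an utterance
            current = None
    return child, adult


# Made-up sample lines, not taken from either corpus
sample = [
    "@Begin\n",
    "*CHI:\tfirst child line .\n",
    "%com:\ta dependent tier\n",
    "*MOT:\tan adult line\n",
    "\tthat continues .\n",
    "@End\n",
]
child, adult = split_by_speaker(sample)
```

Segmenting by age would be a matter of reading the relevant header information for each file and carrying it alongside each utterance.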

The adult utterances, of course, can also be used as another citation source in Eurfa, so I’ve added them there too.
