Kwici – a Welsh Wikipedia corpus

January 14th, 2014 by donnek

A couple of days ago I visited a resources page belonging to Linas Vepstas, the maintainer of the Link Grammar package of the OpenCog project, and that took me to the WaCky website, with some nice examples of tagged corpora.

That in turn led to Giuseppe Attardi and Antonio Fuschetto’s Wikipedia Extractor, which generates plain text from a Wikipedia dump, discarding markup etc. I’ve tried this sort of thing before, and the results leave a lot to be desired, but this one worked wonderfully.

Download it, make it executable, and point it at the dump:

    ./WikiExtractor.py -cb 250K -o output < cywiki-20131230-pages-articles-multistream.xml

The -cb switch compresses the output and splits it into files of 250K each, and the -o switch stores them in the output dir.

Then combine all these into one big file:

    find output -name '*bz2' -exec bzip2 -d -c {} \; > text.xml

(The command on the website is missing the -d switch.)

This gave a very tidy file with each document in its own doc tag. About the only stuff it didn't strip was some English in includeonly tags - I'm not sure what these are for, but they may have been updates that hadn't been translated yet. So Wikipedia Extractor did exactly what it said on the tin!

Once the doc tags, the English, and blank lines were removed, the text was relatively easily split into sentences, and imported into PostgreSQL for more tidying. I spent a day on that, but I didn't like the results, so I redid it over another day, and the outcome is Kwici - a 200k-sentence, 3.9m-word corpus of Welsh.
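
For anyone wanting to try the first part of that clean-up on the command line, here is a minimal sketch. It is not the procedure used for Kwici, which was done mostly in PostgreSQL. It assumes each document starts with a <doc ...> line and ends with a </doc> line, and that a sentence ends at a full stop, exclamation mark or question mark followed by a space. That rule is naive: it will also break after abbreviations, and it does nothing about the leftover English. The function name clean_and_split is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads extractor output on stdin and writes
# roughly one sentence per line on stdout.
clean_and_split() {
  # 1. drop the <doc ...> and </doc> lines
  # 2. drop blank lines
  # 3. start a new line after ., ! or ? when followed by spaces (naive rule)
  grep -v -E '^</?doc[ >]' \
    | grep -v -E '^[[:space:]]*$' \
    | sed -E 's/([.!?]) +/\1\
/g'
}
```

It could then be run as clean_and_split < text.xml > sentences.txt, where sentences.txt is just an example output name; the result would still need further tidying before import.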

Kwici has now been added to Eurfa as a seventh citation source.
