Kwici




Kwici is a 4-million-word corpus drawn from the Welsh Wikipedia as it stood on 30 December 2013. It is licensed under CC-BY-SA.

The final pages-and-articles dump for 2013 was downloaded from the Wikimedia dump page. WikiExtractor was then used to extract plain text (discarding markup etc.) from the 165MB dump, resulting in a 33MB output file. This was tidied by removing remaining XML, blank lines, and any blocks of English text.
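The tidying step could be approximated along the following lines. This is only a minimal sketch, not the script actually used for Kwici: the function name `tidy` and the tag pattern are assumptions made for illustration, and removing blocks of English text is left out because the source does not say how that was done.

```python
import re

# Assumed pattern: anything that looks like a leftover XML/HTML-style tag.
TAG_RE = re.compile(r"<[^>]+>")


def tidy(text: str) -> str:
    """Strip leftover tags and drop any lines that end up blank."""
    cleaned = []
    for line in text.splitlines():
        line = TAG_RE.sub("", line).strip()
        if line:  # skip lines that are empty once tags are removed
            cleaned.append(line)
    return "\n".join(cleaned)
```

For example, `tidy('<doc id="1">\nMae hi\n\n</doc>\n')` keeps only the line `Mae hi`.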

The text was then split into a total of 360,477 sentences, which were imported into a PostgreSQL database table. The sentences were pruned by removing all items shorter than 50 characters, all items containing numbers only (e.g. timelines), and all duplicates, to give a final total of 204,789 sentences in the corpus.
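The splitting and pruning rules described above might be sketched as follows. The pruning for Kwici was done on a PostgreSQL table, so this plain-Python version is a stand-in rather than the original method. The names `split_sentences`, `prune` and `MIN_LENGTH` are hypothetical, the sentence splitter is deliberately naive, and "numbers only" is interpreted here as "contains no alphabetic characters", which is an assumption.

```python
import re

MIN_LENGTH = 50  # items shorter than this many characters are dropped


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def prune(sentences):
    """Apply the three pruning rules, keeping first occurrences in order."""
    seen = set()
    kept = []
    for sentence in sentences:
        if len(sentence) < MIN_LENGTH:
            continue  # too short
        if not any(ch.isalpha() for ch in sentence):
            continue  # no letters at all, e.g. a timeline of years
        if sentence in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(sentence)
        kept.append(sentence)
    return kept
```

A real sentence splitter would also need to handle abbreviations and initials, which this sketch ignores.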

When you enter a word in the search box, 20 sentences in the corpus containing that word will be shown.
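The lookup behind the search box is not described, but its behaviour could be imitated roughly as below. This is a hypothetical in-memory sketch, not the site's database query: the function name `search`, the case-insensitive whole-word matching and the choice to return the first matches in corpus order are all assumptions.

```python
import re


def search(sentences, word, limit=20):
    """Return up to `limit` sentences containing `word` as a whole word."""
    # re.escape guards against regex metacharacters in the search term.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        if len(hits) >= limit:
            break  # stop once enough sentences have been collected
        if pattern.search(sentence):
            hits.append(sentence)
    return hits
```

With whole-word matching, a search for `gath` would match "Y gath ddu." but not "Gathod yw'r rhain."; a substring search would return both.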

Download