
 

Swwiki

A 2.8m-word Swahili Wikipedia corpus

 



 Introduction

Swwiki is a 2.8m-word corpus drawn from the Swahili Wikipedia as it was on 26 December 2015. When you enter a word in the search box above, up to 20 sentences in the corpus containing that word will be shown.
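The search behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the code behind the site: the function name find_sentences, the whole-word and case-sensitive matching, and the sample sentences are all assumptions.

```python
import re

def find_sentences(sentences, word, limit=20):
    """Return up to `limit` sentences that contain `word` as a whole word."""
    # \b stops "mji" from matching inside "mjini"; re.escape protects
    # against regex metacharacters in the query. Matching is case-sensitive,
    # which is an assumption rather than documented site behaviour.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for sentence in sentences:
        if pattern.search(sentence):
            hits.append(sentence)
            if len(hits) == limit:
                break
    return hits

# Invented sample sentences, for illustration only.
sample = [
    "Dodoma ni mji mkuu wa Tanzania.",
    "Anaishi mjini Arusha.",
]
matches = find_sentences(sample, "mji")
```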

If using Swwiki in research, the following citation can be used:

Kevin Donnelly (2016). "Swwiki: a 2.8m-word corpus drawn from the Swahili Wikipedia." http://kevindonnelly.org.uk/swahili/swwiki. (BibTeX)

 Creation

The pages and articles dump for 26 December 2015 was downloaded from the Wikimedia dump page. Giuseppe Attardi and Antonio Fuschetto's WikiExtractor tool was then used to extract plain text (discarding most markup etc.) from the 165 MB dump, resulting in a 25 MB output file. This was tidied by removing remaining XML, blank lines, and blocks of English text.
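The tidying step might look something like the sketch below. It covers only the removal of leftover XML-style tags and blank lines; the regular expression, the function name tidy and the sample input are assumptions, and removing blocks of English text is not attempted here.

```python
import re

# Matches anything that looks like an XML tag. This is deliberately crude:
# it would also remove literal text of the form "<...>" if any occurred.
TAG = re.compile(r"<[^>]+>")

def tidy(raw_text):
    """Strip leftover XML-style tags and drop lines that end up empty."""
    cleaned = []
    for line in raw_text.splitlines():
        line = TAG.sub("", line).strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)

# Invented input in the general shape of extractor output.
raw = '<doc id="1" title="Dodoma">\nDodoma ni mji mkuu wa Tanzania.\n\n</doc>\n'
tidied = tidy(raw)
```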

The text was then split into sentences using the NLTK tokenize package, and these were imported into a PostgreSQL database table. The sentences were then pruned by removing all items shorter than 50 characters, all duplicates, and all variant items (e.g. sentences giving the population of a specific place).
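A minimal sketch of the splitting and pruning steps follows. Several things here are assumptions: that NLTK's sent_tokenize function with its default Punkt model was the splitter used, that pruning can be done in Python rather than in the database, and the function names themselves. The removal of variant items is not covered, since the criteria used for it are not described.

```python
def split_sentences(text):
    """Split text into sentences with NLTK (requires nltk and its Punkt data)."""
    # Imported lazily so that prune() below can be used without NLTK installed.
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)

def prune(sentences, min_chars=50):
    """Drop items shorter than min_chars and exact duplicates, keeping order."""
    seen = set()
    kept = []
    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) < min_chars or sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    return kept
```

Keeping the first occurrence of each duplicate preserves the original order of the text, which makes the result easier to check against the source file.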

 Key statistics

The final corpus contains a total of 151,753 sentences containing 2,821,431 words. There are 172,998 tokens and 5,409 types (TTR 32), but a small percentage of these will be garbage (typos, errant brackets, etc.).
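The quoted TTR appears to be the token count divided by the type count, which comes to about 32. The short check below reproduces that figure from the numbers above, and also derives an average sentence length; the variable names are illustrative, and the average is a derived value rather than one stated in the source.

```python
sentences = 151_753
words = 2_821_431
tokens = 172_998
types = 5_409

# Tokens per type: rounds to the quoted TTR of 32.
tokens_per_type = tokens / types

# Average sentence length in words (derived here, not quoted in the source).
words_per_sentence = words / sentences

print(f"tokens per type:    {tokens_per_type:.2f}")
print(f"words per sentence: {words_per_sentence:.2f}")
```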

The distribution of sentences by their wordcount is shown below:

The commonest words (> 5,000 instances) are as follows:

 Count  Word        Count  Word        Count  Word        Count  Word
183276  ya          13186  mwaka        7484  watu         6155  wakati
142598  na          12202  nchini       7202  mji          5765  hii
108506  wa          12085  au           7007  kati         5624  moja
 54051  kwa         11266  yake         6825  pamoja       5539  Tanzania
 49673  katika       9537  kutoka       6796  lakini       5464  mara
 46691  ni           9296  pia          6766  jina         5411  miaka
 33648  la           8737  kwenye       6593  wake         5147  ambayo
 30420  za           8563  vya          6510  kwanza       5105  juu
 19654  kama         8271  lugha        6382  hadi         5071  sehemu
 18113  cha          7803  zaidi        6164  alikuwa      5015  Katika
 13800  kuwa         7769  nchi
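Counts of this kind, and the distribution of sentences by word count, can be produced with Python's collections.Counter. The sketch below assumes plain whitespace splitting, so punctuation stays attached to words, and counts case-sensitively, which is consistent with "katika" and "Katika" being listed separately above. The function names and sample sentences are invented for illustration.

```python
from collections import Counter

def word_counts(sentences):
    """Count whitespace-separated words, case-sensitively."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts

def common_words(counts, threshold=5000):
    """Return (word, count) pairs with count above threshold, commonest first."""
    return [(word, n) for word, n in counts.most_common() if n > threshold]

def sentence_length_distribution(sentences):
    """Map each sentence length in words to the number of such sentences."""
    return Counter(len(sentence.split()) for sentence in sentences)

# Invented sample, for illustration only.
sample = ["mji wa Tanzania", "Katika mji huu", "katika"]
counts = word_counts(sample)
```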

 Downloads

The Swwiki corpus, which is licensed under CC-BY-SA, can be downloaded below in CSV format.

Download Swwiki
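Once downloaded, the CSV file can be read with Python's standard csv module. The sketch below makes no claim about the actual column layout, which is not described here; the filename in the comment, the function names and the two-column sample row are all hypothetical.

```python
import csv
import io

def read_rows(handle):
    """Read every CSV record from an open text handle as a list of strings."""
    return list(csv.reader(handle))

def load_corpus(path):
    """Read a CSV file from disk, e.g. load_corpus("swwiki.csv") (hypothetical name)."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, newline="", encoding="utf-8") as handle:
        return read_rows(handle)

# Invented two-column sample standing in for the real file.
sample = io.StringIO('1,"Dodoma ni mji, na ni mji mkuu."\n')
rows = read_rows(sample)
```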

A frequency list of the words in the Swwiki corpus can be downloaded here in ODF format.