Kynulliad3





Kynulliad3 is a substantial corpus of nearly 360,000 aligned sentences in Welsh and English (around 8.8 million words in each language) drawn from the Record of Proceedings of the Third Assembly (2007-2011) of the National Assembly for Wales.

The corpus is intended for use in natural language processing research. Because it contains only the longer sentences from the plenary sessions of the Proceedings, without attribution or any indication of the context, the date, or the language the speaker used, it is not a record of the activity of the Assembly.

When you enter a word in the search box, 20 sentences in the corpus containing that word will be shown. Each time you press the Search button, a different set of 20 sentences will be shown.
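The search behaviour described above can be approximated offline once the corpus has been loaded into memory. The following is only a minimal sketch: the function name `search_sample`, the in-memory list of `(welsh, english)` tuples, and the whole-word, case-insensitive matching rule are all assumptions made for illustration, since the page does not say how the site itself matches words.

```python
import random
import re

def search_sample(pairs, word, k=20, rng=random):
    """Return up to k randomly chosen (welsh, english) pairs containing `word`.

    Assumption: a pair matches if `word` appears as a whole token,
    case-insensitively, in either language. The real site may differ.
    """
    needle = word.lower()
    hits = [
        pair for pair in pairs
        if any(needle in re.findall(r"\w+", side.lower()) for side in pair)
    ]
    # Sampling afresh on each call gives a different set each time.
    return rng.sample(hits, min(k, len(hits)))
```

Calling `search_sample(pairs, "cat")` repeatedly will usually return a different selection each time, provided more than 20 pairs match.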

The data is drawn from the HTML versions of the Proceedings. Markup was discarded to leave blocks of aligned text. These blocks were cleaned to remove low-value items like "I move that". Lingua::Identify, a Perl language-identification module, was used to swap the blocks where necessary so that Welsh always came first. The blocks were then split into individual sentences, which were cleaned further to remove duplicates and sentences shorter than 20 characters.
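The final cleaning steps (sentence splitting, duplicate removal, and the 20-character minimum) can be sketched roughly as follows. This is an illustration in Python, not the code used to build Kynulliad3. The function names, the naive punctuation-based splitter, the choice to skip blocks whose sentence counts differ, and the choice to apply the length threshold to both languages are all assumptions. The language-identification swap is omitted.

```python
import re

MIN_CHARS = 20  # minimum sentence length stated in the text

def split_sentences(block):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", block) if s.strip()]

def clean_pairs(blocks):
    """Turn (welsh_block, english_block) tuples into cleaned sentence pairs."""
    seen = set()
    cleaned = []
    for cy_block, en_block in blocks:
        cy = split_sentences(cy_block)
        en = split_sentences(en_block)
        if len(cy) != len(en):
            continue  # assumption: skip blocks that do not align one-to-one
        for pair in zip(cy, en):
            if pair in seen:
                continue  # drop duplicate pairs
            if min(len(pair[0]), len(pair[1])) < MIN_CHARS:
                continue  # drop pairs where either side is too short
            seen.add(pair)
            cleaned.append(pair)
    return cleaned
```

Tracking seen pairs in a set keeps the first occurrence of each sentence pair and preserves the original order.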

The database table contains the following fields:

Kynulliad3 can be downloaded as a PostgreSQL dump.

Download   

You can also download a frequency list of the almost 48,400 words in Kynulliad3.
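A comparable frequency list can be rebuilt from the sentences themselves. The sketch below is one possible approach and will not necessarily reproduce the published list, because the page does not state how words were tokenised or whether the list covers one language or both. The function name `frequency_list` and the tokenisation rule (lowercased runs of letters, with internal apostrophes kept so that forms like "a'r" stay whole) are assumptions.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by internal apostrophes (e.g. "mae'r").
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def frequency_list(sentences):
    """Return (word, count) tuples sorted by descending count."""
    counts = Counter()
    for sentence in sentences:
        counts.update(TOKEN.findall(sentence.lower()))
    return counts.most_common()
```

For example, `frequency_list(welsh_sentences)[:10]` would give the ten most frequent word forms in a hypothetical list `welsh_sentences`.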

Download