
Gàidhlig

Demo corpus, dictionary and POS-tagger
Corpus search:
Dictionary search:

Fàilte!

 Introduction

This is a small demo site containing resources for Gàidhlig (Scots Gaelic). At present, these include:

  1. A small aligned corpus of 160 sentences, drawn from the website of Sabhal Mòr Ostaig (with some minor typos corrected). To search for sentences containing specific words, use the input boxes above. A second small bilingual corpus was created from the exercise sentences in Danaidh M'Clelland's concise yet very comprehensive online Gàidhlig course Taic. These sentences are not (yet) available via the web search, but glossed versions of some 380 short sentences are available as a 68-page PDF.
  2. A dictionary. This includes a basic set of words, and also the vocabulary list from Danaidh M'Clelland's Taic. The dictionary includes around 3,000 words and 1,500 lemmas. To search for specific words, use the input boxes above. The dictionary is also available as a printable 124-page PDF.
  3. Glosses for every sentence in the corpus, giving the lexeme (English lemma and Gàidhlig part-of-speech) for each word. The entire glossed (tagged) corpus is also available as a 61-page PDF.

 

 Get the code and resources

The code for the Gàidhlig autoglosser is available from GitHub under the GPLv3 license.

 

 Technical information

The glosser takes about 3 minutes to import the 4,100 words of the SMO corpus, tag them with the lemmas and POS-tags in the dictionary, and disambiguate according to a small constraint grammar of 55 rules, so the throughput is around 1,300 words a minute. (It takes just over 2 minutes to import and gloss the 2,800 words of the Taic corpus.)
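As a rough illustration of the tag-then-disambiguate steps described above, here is a minimal Python sketch. It is not the autoglosser's own code: the mini-dictionary, the tag names (`V`, `DET`, `Q`, and so on) and the single rule are all assumptions made up for this example, and the real constraint grammar has 55 rules written in its own formalism.

```python
# Hypothetical mini-dictionary: surface form -> list of (English lemma, POS tag).
# The entries and tag names are invented for this sketch.
DICTIONARY = {
    "tha": [("be", "V")],
    "robh": [("be", "V")],
    "an": [("the", "DET"), ("(question)", "Q")],
    "cù": [("dog", "N")],
    "mòr": [("big", "ADJ")],
}


def tag(tokens):
    """Attach every dictionary reading to each token (unknown words get none)."""
    return [(tok, list(DICTIONARY.get(tok.lower(), []))) for tok in tokens]


def disambiguate(tagged):
    """Apply one toy context rule to ambiguous tokens.

    If the next word has a verb reading, keep only the question-particle
    reading (Q); otherwise discard the Q reading. If the rule would leave
    a token with no readings, the original readings are kept unchanged.
    """
    result = []
    for i, (tok, readings) in enumerate(tagged):
        next_readings = tagged[i + 1][1] if i + 1 < len(tagged) else []
        next_is_verb = any(pos == "V" for _, pos in next_readings)
        if len(readings) > 1:
            if next_is_verb:
                kept = [r for r in readings if r[1] == "Q"]
            else:
                kept = [r for r in readings if r[1] != "Q"]
            readings = kept or readings
        result.append((tok, readings))
    return result


statement = disambiguate(tag("Tha an cù mòr".split()))
question = disambiguate(tag("An robh an cù mòr".split()))
```

In this sketch, `an` before `cù` keeps only its article reading, while `An` before `robh` keeps only the question-particle reading. A real rule set would need many more contexts than this single check on the following word.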

The technology behind these Gàidhlig materials is based on my previous work for Welsh: the Kynulliad3 corpus, the Eurfa dictionary, and the Bangor Autoglosser. Apart from any intrinsic usefulness they may have, they also demonstrate how lexical resources, once created, can be repurposed relatively simply for use with another language. This is especially important for minority languages, where financial and other resources may be limited.

This site has been tested on standards-compliant browsers (Firefox, Chromium, Opera).

 

 Improvements

These materials have been put together as a proof-of-concept over 8 days (4 of which were spent on compiling the dictionary, and 2 on developing the website), and could be refined further. Some things that spring to mind:

  1. The dictionary could be simplified by having the autoglosser do more stemming during lookup (i.e. derive singular nouns from plural nouns, derive nominative nouns from genitive nouns, derive the verb root from non-root verbal forms, and so on).
  2. The tags need to be reviewed: for instance, it may be preferable to have two (identical) entries, one for noun and one for present participle, instead of the current verbal noun category.
  3. The printable dictionary needs further work to tweak the layout.
  4. The corpus import needs to handle punctuation better.
  5. The corpus search caters for "whole" words only, so that entering t-saoghal or saoghal will find all instances of saoghal, but will not return instances of saoghail or t-saoghail. This constraint could be relaxed so that all related instances are found.
  6. The finer points of the Gàidhlig case system are not fully catered for as yet.
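One possible way to relax the "whole word" constraint in item 5 is to search via the lemma stored in the glosses rather than via the surface form. The Python sketch below is only an illustration under assumptions: the `TOKENS` rows, their sentence ids and the function name `search_related` are hypothetical, and the site's actual storage and search code may work quite differently.

```python
# Hypothetical glossed tokens: (sentence id, surface form, English lemma).
# The surface forms are the ones mentioned in item 5; the ids are invented.
TOKENS = [
    (1, "saoghal", "world"),
    (2, "t-saoghal", "world"),
    (3, "saoghail", "world"),
    (4, "t-saoghail", "world"),
    (5, "cù", "dog"),
]


def search_related(query):
    """Return (sentence id, surface form) for every token sharing a lemma
    with the query, instead of matching the query string alone."""
    q = query.lower()
    # Step 1: find the lemma(s) glossed for the exact form that was entered.
    lemmas = {lemma for _, surface, lemma in TOKENS if surface == q}
    # Step 2: return every token glossed with one of those lemmas.
    return [(sid, surface) for sid, surface, lemma in TOKENS if lemma in lemmas]


hits = search_related("saoghal")
```

Here a search for any one of the four forms returns all four. The trade-off is that a form with several glossed lemmas would pull in the instances of each of them, so a real implementation might want to offer this as an option alongside the exact search.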