Don’t forget your READMEs!

March 7th, 2007 by donnek

Klebran is now listed on the OpenOffice.org site as one of the free grammar checkers. This thread on the relevant discussion list was quite interesting, because it shows that you need to check your README files every so often, and update them in line with the way the project develops. In this case, my failure to update the README led Dewi to an incorrect conclusion. I checked back over my email archive (going back to 2003) to see what actually happened.

A tagged wordlist is a prerequisite for any grammar checker, and when I began considering a GPL Welsh one in early 2004, I decided to add meanings as well, since there was then no GPL dictionary available for Welsh. A number of lists fed into this dictionary project: Jim Killock’s 2003 aspell list, the Crúbadán web-crawler Welsh corpus, the UWB Cronfa Electroneg (which has fallen off the web, but is still available here, although the downloads don’t seem to work), and the UWB myspell list. They fed in only in the sense that they provided “control” lists of frequently occurring words. I had to make a start somewhere, so in the event I took a list of the 5,000 most common words in Crúbadán that also occurred in the myspell list, and began checking these and adding tags and meanings.

The 1.2 revision of the README that Dewi referred to dates from that point (March 2005). However, almost immediately (April 2005, according to my emails), I decided not to pursue this approach. Instead, once the “5,000 most common words” were done, I started inputting words from multiple everyday sources (books, magazines, already-completed translations) as I met them.

Why did I do this? Reading the emails, one reason for the change of plan seems to have been uncertainty about the myspell license – in the UWB README, the license is specified as GPL, but condition 3 appeared to me to be incompatible with that (I’m not from the FSF, so I can’t be sure, but it does look odd). But another reason was dissatisfaction with the verb-expansion rules giving inflected forms (I won’t go into details here, but I can give chapter and verse if anyone is that interested). I therefore decided to ignore the myspell-generated verb forms, and began a “clean-room” implementation of my own, where the abstraction rules behind the generated forms would be open to scrutiny. That led to Konjugator in July 2005, which was a necessary detour, and by the beginning of 2006, as Kevin Scannell said in his response to Dewi’s post, work on the dictionary was progressing well – the first version of Eurfa was released in April 2006. The dictionary in turn, as he said, is now the basis for Klebran, giving a nice example of how tools like this can build on each other!

It is therefore wholly untrue to say that my dictionary “incorporated” the myspell list, and the README should really have been updated in April or May 2005 to reflect that (revision 1.2 was left untouched for more than 23 months, in contrast to other parts of the repository!). This is trivial to show: download lexicon-cy.txt, pick a couple of less common non-mutated, non-inflected words that are in the myspell list, and search for them in the lexicon. You will likely get no hit (I have just done this, for instance, with “perffeithiadwy” and “syndrom”, to take two at random). This is because the myspell list contains far more items than the 13,000 citation forms I have in my dictionary so far.
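For anyone who would rather script that check than search by hand, here is a minimal sketch in Python. It rests on assumptions of mine, not on the documented layout of lexicon-cy.txt: that each entry sits on its own line, that the citation form is the first tab-separated field, and that lines starting with "#" are comments. The function names are hypothetical too, so adjust all of this to the real file.

```python
from typing import Iterable, List, Set


def load_headwords(lines: Iterable[str], sep: str = "\t") -> Set[str]:
    """Collect the first field of each entry line, lower-cased.

    Assumption: one entry per line, citation form in the first field.
    Blank lines and lines starting with '#' are skipped.
    """
    headwords: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        headwords.add(line.split(sep)[0].strip().lower())
    return headwords


def missing_from_lexicon(words: Iterable[str], headwords: Set[str]) -> List[str]:
    """Return the words (in their original order) that have no entry."""
    return [w for w in words if w.lower() not in headwords]


def load_headwords_from_file(path: str, sep: str = "\t") -> Set[str]:
    """Convenience wrapper: read a lexicon file from disk."""
    with open(path, encoding="utf-8") as fh:
        return load_headwords(fh, sep)
```

With the file downloaded, something like `missing_from_lexicon(["perffeithiadwy", "syndrom"], load_headwords_from_file("lexicon-cy.txt"))` would list whichever of those words have no entry. If the real file uses a different separator or field order, only `load_headwords` needs changing.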

An additional point, of course, is that the myspell list cannot provide the meanings and POS information in the dictionary, since it does not include them in the first place. So what Kevin Scannell said in his first post was perfectly accurate: this work was indeed done “from scratch”, because (sadly) no such material is available in Welsh under a free license, even though it would be a tremendous help to the Welsh language to have it.

The entries in my dictionary are each tagged, because it’s useful to be able to give some indication of their provenance. The “5,000 most common words” work (tagged as wl1 – working list 1 – in the dictionary) equates to 3,852 entries (the reduction is due to data cleaning), which is a mere 31% of the total citation forms in the current version of Eurfa.

So the moral of the story is: don’t forget your READMEs! If they are intended to give an overview of the project, keep them up to date, and commit revisions as you go. I’ve now revised the one for Gramadóir-cy, to ensure there is no future misunderstanding.
