White hair …

March 26th, 2010 by donnek

I was reading some Chinese poems today, and came across a couple of lines by 杜甫 (Dù Fǔ, 712-770AD) which sadly are all too applicable to me:
白头搔更短
浑欲不胜簪

(roughly)
“When I scratch my white hair it gets even thinner,
There is hardly enough left to fix a hairpin.”
LOL.

Swahili in Arabic script on WordPress

March 26th, 2010 by donnek

It’s possible to get Swahili in Arabic script to display quite nicely in WordPress using Firefox (3.5.7) on Linux (Kubuntu 9.10), after a bit of tweaking. Here is the last stanza of the Utendi wa Ja’afari:

نِمٖپٖنْدَ كُكَرِرِ nimependa kukariri
نَنْيِ سٗمَنِ ضَمِيْرِ nanyi somani ḍamïri
أُتٖنْدِ وَ جَعْفَرِ utendi wa Ja’fari
وَمَوْلاَنَا عَلِيَ wa Maũlãnã ‘Aliya

You need to have the SIL Scheherazade font installed to see it to best advantage. It should look like this:

The default fonts on Linux seem to be missing glyphs for 067E (peh) in serif or non-serif fonts, and the glyph for 06A0 (ain with three dots above) doesn’t show up in a medial form. (That makes typing into the edit box on WordPress a little more difficult than it need be, but it’s not a show-stopper.) Even with Scheherazade installed, Konqueror (4.3.2) produces a bit of a mess unless I set the fallback font in the stylesheet to monospace (see below). Even then, it doesn’t see the glyph for 0656 (subscript alef) which I’m using for the vowel e (in the absence of a vertical equivalent of 0650, kasra) or 0657 (inverted damma) for the vowel o. That means it puts a box there:

An older version of Konqueror (3.5.5) on openSUSE 10.2 puts the boxes inline, and this leads to medials not being joined.

On Mac OS X (10.4.11) with Scheherazade installed, Safari (3.1.1) behaves similarly to the older version of Konqueror on Linux (ie boxes instead of e, and medials unjoined). Firefox (3.0.17) is roughly the same, but it doesn’t even seem to see the Scheherazade font, so it falls back to monospace and gets the alignment of the <span> (see below) wrong:

I have no Microsoft Windows machine to test this on.

The default setup produces Arabic script that is too minuscule for my old eyes (I have the same problem with Chinese!), so I’ve added a couple of CSS stanzas to the theme stylesheet to adjust positioning and size:

.swahili {
  /* Scheherazade first, with monospace as the fallback font */
  font-family: Scheherazade, monospace;
  font-size: 220%;
  line-height: 130%;
  direction: rtl;
  unicode-bidi: embed;
  text-align: right;
  width: 80%;
}

.trlit {
  /* transliteration: half size, left-aligned, floated to the left */
  font-family: "Liberation Sans", sans-serif;
  font-size: 50%;
  margin-left: 50px;
  text-align: left;
  float: left;
}

The first handles the Swahili, and the second handles the transliteration.

Then the actual Swahili in the post looks like this:

I’m sure all of this (especially the CSS) could be done more elegantly, but it’s a start. There are many things I still have to figure out, though. My next post will be on setting up your PC to type Swahili in Arabic script.

PDFing

March 25th, 2010 by donnek

My wife needed some pdfs sliced and diced to put on a work-related website, so I ended up learning a bit about the PDF Toolkit.

My HP printer has a rather nice webscan app, which scans over the network directly into a pdf on the computer.  However, the scans average around 1MB a page, which is far too big.  So we need to squish them down a bit first, using a Ghostscript invocation I found here:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH -sOutputFile=1a.pdf 1.pdf

This will usually reduce the size to less than 100KB a page.
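If there are a lot of scans to squish, the same invocation can be wrapped in a few lines of Python. This is only a rough sketch: the function names and the "-small" suffix are made up for illustration, and the Ghostscript flags are exactly the ones used above.

```python
import subprocess
from pathlib import Path


def shrink_command(src, dst, quality="/screen"):
    """Build the Ghostscript argument list used above for one input pdf."""
    return [
        "gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={quality}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}", str(src),
    ]


def shrink_folder(folder):
    """Shrink every pdf in a folder, writing name-small.pdf next to it.

    Hypothetical helper: it assumes gs is on the PATH.
    """
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if pdf.stem.endswith("-small"):
            continue  # skip files produced by an earlier run
        target = pdf.with_name(pdf.stem + "-small.pdf")
        subprocess.run(shrink_command(pdf, target), check=True)


# Show the command for a single file without running it.
print(" ".join(shrink_command("1.pdf", "1a.pdf")))
```

Passing the arguments as a list rather than one long string means filenames with spaces in them need no extra quoting.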

We can combine individual pdfs using pdftk:

pdftk first.pdf second.pdf cat output combined.pdf

and conversely we can split an existing pdf into individual pages by using:

pdftk mybig.pdf burst

You can open a pdf in the GIMP for touching up (eg removing the darkest bits from a bad scan if you don't have the original to rescan), and then save it out as a Postscript file.  After that, you can convert the ps file to a pdf by running:

ps2pdf mypostscript.ps mynew.pdf

Theoretically, pdfs should orientate themselves correctly automatically.  If you have some existing scanned pages that are in portrait format when they should be landscape, and you can't get them to show up in the correct orientation, you can open them in the GIMP, correct the orientation there, save as Postscript, and then use ps2pdf to create the pdf.  Occasionally, however, when you open the reoriented landscape file, you get it appearing in portrait mode, with the right-hand half of the file disappearing off the edge of the sheet.  You can fix this by using another ghostscript invocation directly on the Postscript file:

gs -dBATCH -dNOPAUSE -sOutputFile=mylandscape.pdf -sDEVICE=pdfwrite \
  -c "<< /PageSize [792 612]  >> setpagedevice"  -f mywonky.ps
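The two numbers in /PageSize are the page width and height in PostScript points (72 to the inch), so 792 by 612 is an 11 by 8.5 inch page, ie US Letter on its side. For other paper sizes the values need recalculating. A small sketch of the arithmetic, with hypothetical function names:

```python
def landscape_page_size(width_mm, height_mm):
    """Return (long side, short side) in PostScript points, rounded."""
    points = [round(side / 25.4 * 72) for side in (width_mm, height_mm)]
    return max(points), min(points)


def setpagedevice_snippet(width_mm, height_mm):
    """Format the PostScript fragment passed to gs with -c."""
    long_side, short_side = landscape_page_size(width_mm, height_mm)
    return f"<< /PageSize [{long_side} {short_side}] >> setpagedevice"


print(setpagedevice_snippet(215.9, 279.4))  # US Letter
print(setpagedevice_snippet(210, 297))      # A4
```

For A4 this gives [842 595] in place of [792 612].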

Jan Deloof Breton-Dutch Dictionary

March 25th, 2010 by donnek

When Fran Tyers and I were finishing the first version of the apertium-cy Welsh-English translator in late 2008, he decided to try porting that experience to another Celtic language, Breton.  With Welsh, we had received less than enthusiastic responses to our requests to organisations like the Welsh Language Board for publicly-funded language material to be made available under a license compatible with the GPL, so he expected little when he made a similar request to the Breton equivalent, the Ofis ar Brezhoneg.  We had to wait 18 months for a non-committal sort-of kind-of response from the WLB, but Fran got an email back from Fulup Jakez at OB the same week, saying that they would make their terminology lists available, and Gwenvael Jekel even worked with him for a couple of weeks to get the project off the ground.  The Breton-French translator is on the OB website.

As part of the effort to collect GPLed resources for Breton, I put together the Breizh-Llydaw Sentence Bank, with the kind assistance of Dr Rhisiart Hincks in Aberystwyth.  This is a corpus of about 3,500 paired Breton and Welsh sentences, along with a small dictionary of about 1,200 words.  I had also come across the name Jan Deloof in various web-searches, but knew very little about him or his work.

It turned out that he lived in the Belgian town of Zwevegem, and not only did he have an 800-page Breton-Dutch dictionary, but he was quite happy for it to be licensed under the GPL, and (at the age of 80) was still working hard on his Dutch-Breton dictionary!  What a man!  The dictionary was in Microsoft Word document format, and held on an FTP site which hadn’t come up in any of my searches.  So I offered to port the dictionary onto the Web by converting it to a database format, and provide a web interface for it similar to the one for Eurfa and the Sentence Bank.  It took me longer than expected (nearly 18 months!) to get around to this, though Jan was always gracious when I contacted him to apologise – “I hadn’t really noticed the passage of time – I’m working my way through the letter K”.

However, at long last this wonderful piece of free (GPL) data can get the wider audience it deserves.  After three weeks of work, a database version is now available.  The first step was converting the doc files into text using OpenOffice.org Writer, and joining the broken lines that result from that.  For this I used the amazing sed, and as I used it more I both marvelled at its capabilities, and wondered why it has taken me so long to discover it.  I was able to use sed to correct the encoding issues that plague Microsoft Word documents, and do an initial tag of the text to prepare it for database import.  In the source files, font attributes alone (eg bold) delineate things like citations, and it was difficult to find a way of segmenting these consistently.  I tried converting to HTML, and then searching and replacing on tags, but that didn’t work well either.  In the end, I just read through the treated file as produced by sed, added some tags manually, and then let good old PHP slice and dice each line into a PostgreSQL table.
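The line-joining step is the easiest one to illustrate. The real work was done with sed; what follows is a Python stand-in, not the actual script, and it rests on an assumed heuristic: a line starts a new entry if it opens with a single headword followed by a bracketed grammatical tag such as (m) or (ww), and anything else is a continuation of the previous entry.

```python
import re

# Assumed pattern for the start of an entry: headword, then a tag like (m).
ENTRY_START = re.compile(r"^\S+\s+\([a-z]+\)")


def join_broken_lines(lines):
    """Glue continuation lines back onto the entry they belong to."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines left by the conversion
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


sample = [
    "diskibl (m) [-ed, diskiblien]: leerling;",
    "volgeling",
    "",
    "dec'hervel (ww) dec'halvet: bij zich",
    "roepen",
]
for entry in join_broken_lines(sample):
    print(entry)
```

Multiword headwords, or continuation lines that happen to begin with a word and a bracket, would defeat this heuristic, which is one reason a manual read-through was still needed.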

I separated the headwords proper from the citations Jan gives (called phrases here), on the grounds that this means that the citations will come up on any searchword that they contain, thus extending their relevance.  Separating out the bulk of multiword entries may also be useful in building machine translation systems.  The only drawback is that where the phrases were specifically chosen to illustrate a particular grammatical point, that relevance is slightly weakened now.  The main determiner of whether something was a phrase was basically whether it had gender information attached!

I’ve also added the ability to look up mutated words, as in Eurfa.  For instance, if you enter bro, you get words related to that, but if you enter the soft-mutated form, vro, you will get words containing vro, and also ones containing bro.  I need to expand this to cater for the hard, spirant and mixed mutations, and also make it more consistent (for instance, for bro you get vro in the phrases section, but not in the words section).  The functions and queries used to do this are quite interesting, and I suppose I should package up something about this, because it may be useful, if only as a source of ideas, to other people working with Celtic languages.
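As a source of ideas, here is a rough sketch of the "undo the soft mutation" half of the lookup. It is written in Python rather than the PHP and SQL the site actually uses, the function name is hypothetical, and the table is the standard Breton soft mutation (k to g, t to d, p to b, g to c'h, gw to w, d to z, b and m to v) read backwards, not something taken from the dictionary's code.

```python
# Mutated initial -> radical initials it may have come from (soft mutation).
SOFT_MUTATION_REVERSED = {
    "c'h": ["g"],
    "g": ["k"],
    "d": ["t"],
    "b": ["p"],
    "z": ["d"],
    "v": ["b", "m"],
    "w": ["gw"],
}


def candidate_forms(term):
    """Return the search term plus any radical forms it might be a mutation of."""
    term = term.lower()
    forms = [term]
    # Try the longest initial first so that c'h is checked as a unit.
    for mutated in sorted(SOFT_MUTATION_REVERSED, key=len, reverse=True):
        if term.startswith(mutated):
            for radical in SOFT_MUTATION_REVERSED[mutated]:
                forms.append(radical + term[len(mutated):])
            break
    return forms


print(candidate_forms("vro"))  # ['vro', 'bro', 'mro']
print(candidate_forms("bro"))  # ['bro', 'pro']
```

Every candidate is then looked up, and the ones that are not real words (mro, pro) simply return nothing. Running the table in the forward direction instead would give the missing bro to vro match in the words section.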

A good few things in the db itself also need to be tidied up, though.  For instance, the tagging wasn’t perfect, especially for the first letters I did (AB).  Where you have something like:

diskibl (m) [-ed, diskiblien]:  leerling; volgeling

the tagging separates the two plurals (diskibled and diskiblien), so that the latter is prefixed to the definition, though separated by two slashes.  So the next step is to run some db queries to shift those items out to where they belong – the plural field.  I also used the plural field to hold the past participle of verbs, so those need to be given a field of their own:

dec'hervel (ww) dec'halvet: bij zich roepen

Another issue related to both those points is that if you enter diskiblien or dec’halvet into the dictionary, you will get no results back, even though there is information aplenty there about the lemmas.  So those inflected items need to be pulled out and reformatted so that they can be used as headwords.  In Eurfa, I put all of these into the one table, but it may be better to put them into separate tables (plurals, participles, etc) – time for some head-scratching.
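To make the tidy-up concrete, here is a sketch of how the two sample entries above could be pulled apart so that plurals and participles get their own fields and can be indexed as headwords. It is Python rather than the PHP actually used, the field names are invented, and the regular expression assumes every entry follows one of the two layouts shown, which the real data certainly will not.

```python
import re

# headword (tag) [plurals] : definition   or   headword (tag) participle : definition
ENTRY = re.compile(
    r"^(?P<headword>\S+)\s+\((?P<tag>[a-z]+)\)\s*"
    r"(?:\[(?P<plurals>[^\]]*)\]|(?P<participle>[^:\[]+?))?"
    r"\s*:\s*(?P<definition>.+)$"
)


def parse_entry(line):
    """Split one dictionary line into fields, expanding plurals like -ed."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    headword = match.group("headword")
    plurals = []
    for item in (match.group("plurals") or "").split(","):
        item = item.strip()
        if item:
            # A leading hyphen means "add this ending to the headword".
            plurals.append(headword + item[1:] if item.startswith("-") else item)
    participle = (match.group("participle") or "").strip() or None
    return {
        "headword": headword,
        "tag": match.group("tag"),
        "plurals": plurals,
        "participle": participle,
        "definition": match.group("definition").strip(),
    }


def inflected_index(entries):
    """Map each plural or participle back to the lemma it belongs to."""
    index = {}
    for entry in entries:
        forms = list(entry["plurals"])
        if entry["participle"]:
            forms.append(entry["participle"])
        for form in forms:
            index.setdefault(form, []).append(entry["headword"])
    return index


sample = [
    "diskibl (m) [-ed, diskiblien]:  leerling; volgeling",
    "dec'hervel (ww) dec'halvet: bij zich roepen",
]
entries = [parse_entry(line) for line in sample]
print(inflected_index(entries))
```

With an index like this, a search for diskiblien or dec'halvet can be redirected to the lemma, whether the forms end up in one table or several.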

The Dutch-Breton direction is useable, but not ideal.  The Dutch side contains multiple entries for each Breton headword, so the responses are a bit difficult to filter when reading.  What needs to be done here is to split the Dutch entries at each comma or semicolon, and write them into the table as separate entries.  Again, one-to-one correspondences may be useful in MT work (although then the problem of disambiguation appears).
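The splitting step might look something like this. It is a Python sketch with a hypothetical function name, and it naively assumes that every comma or semicolon separates two equivalents.

```python
import re


def split_equivalents(breton, dutch):
    """Turn one Breton entry into a (dutch, breton) row for each Dutch equivalent."""
    parts = [part.strip() for part in re.split(r"[;,]", dutch)]
    return [(part, breton) for part in parts if part]


print(split_equivalents("diskibl", "leerling; volgeling"))
# [('leerling', 'diskibl'), ('volgeling', 'diskibl')]
```

A comma inside brackets or inside a longer gloss would be split wrongly, so the real thing would need some exceptions.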

The only drawback I can see here is that if the Breton headword originally had 4 or 5 Dutch equivalents,  searching the revised table from the Breton side may provide a surfeit of results – 4 or 5 entries for the same Breton word, plus plurals, plus mutated forms, etc.  But I think this can be mediated by the interface if it is designed properly.

One definite benefit is that the table could then be used to check or help build the Dutch-Breton dictionary that Jan is steaming through (he already has half of it completed!).  A single query could order the table in terms of the Dutch headword, and then a PHP script could amalgamate all the Breton equivalents listed for that Dutch headword into one Breton entry, ready for checking.  Any amendments or additions which that throws up can then be incorporated into the db table.
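The amalgamation is essentially a group-by on the Dutch side. A sketch in Python (the plan above is a query plus PHP), where the function name is hypothetical and the second Breton word in the sample is a placeholder, not a real dictionary entry:

```python
from collections import defaultdict


def amalgamate(rows):
    """Collect the Breton equivalents for each Dutch headword, in Dutch order."""
    grouped = defaultdict(list)
    for dutch, breton in rows:
        if breton not in grouped[dutch]:
            grouped[dutch].append(breton)
    return {dutch: ", ".join(bretons) for dutch, bretons in sorted(grouped.items())}


rows = [
    ("volgeling", "diskibl"),
    ("leerling", "diskibl"),
    ("leerling", "PLACEHOLDER_WORD"),  # stand-in for a second Breton headword
]
for dutch, breton in amalgamate(rows).items():
    print(f"{dutch}: {breton}")
```

Each line of output is then a draft Dutch-Breton entry, ready to be checked against Jan's own.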

But in the meantime Jan’s dictionary is useable.  With over 40,000 entries, it’s an impressive achievement, and certainly the largest free (GPL) dictionary available for any Celtic language.

Swahili in Arabic script

March 24th, 2010 by donnek

Over the past couple of days, in preparation for a little project on Swahili, I’ve been investigating getting Kubuntu to show Swahili in Arabic script (the way in which it was traditionally written on the coast). I seem to have got there, and I’ll do a longer post later on how to configure your system to allow this.

In the meantime, as a taster, here is a verse from an utendi (ballad), typed in from a photocopy of the original manuscript:

It means, roughly: “I have accomplished the telling of these [things], and I will put others into verse, that you may all understand them, I have enjoyed relating [them] to you.”

This is still a work-in-progress, but it seems to be on the right track for now.

Chinese cars

March 2nd, 2010 by donnek

With the official environment for Welsh being so unremittingly hostile to free (GPL) software, over the last couple of months I’ve taken to learning Chinese as a pick-me-up, mainly via the excellent UWB Life-Long Learning courses run by Manman Jones, and also via the international students at the Business School. Jason and Frances taught me various things about cars last night:

你的车是什么颜色?
我的车是红色的。
这辆车是去年买的。
我花了五千镑买的这辆车。(的 is a really interesting syntactic “glue” word)
我开这辆车上班。
我开这辆车去工作。

你的车是什么牌子?
我的车是雪铁龙。(Citroën phonetically, but it translates to Snow Iron Dragon, which is very poetic!)
我的车是大众。(Volkswagen – because they’re so popular!)
我的车是丰田。(Toyota – very popular in the South)
我的车是本田。(Honda)
我的车是奥迪。(Audi phonetically)
我的车是沃尔沃。(Volvo phonetically)
我的车是尼桑。(Nissan phonetically)

我的车坐五个人。
你的车有七个座位。
你的车比我的车大。

I’ve put them up on purpose without pinyin, so that I force myself to remember the characters when I read the sentences again. If you’d like a transliteration, use the free Firefox dictionary plugin at the excellent Popup Chinese site manned by David Lancashire and colleagues – well worth a subscription if you want to access a lot of good modern idiomatic stuff to help you learn Chinese. There’s also a plugin there that uses Nathan Dummitt’s system for colourising Chinese characters according to tone: red (tone 1, high); orange (tone 2, rising); green (tone 3, low-lift); blue (tone 4, falling); black (neutral). This is quite helpful when you’re trying to read what you’ve written out loud. For speed, it’s best to open this post on its own page first, and then call the colouriser.
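The colour scheme is simple enough to sketch. The snippet below is not the plugin's code, just an illustration of the mapping described above: it takes the characters plus a hand-supplied list of tone numbers (using 5 for the neutral tone, which is my own convention here) and wraps each character in a coloured span.

```python
# Tone number -> colour, as in the scheme described above (5 = neutral).
TONE_COLOURS = {1: "red", 2: "orange", 3: "green", 4: "blue", 5: "black"}


def colourise(chars, tones):
    """Wrap each character in a span coloured according to its tone."""
    if len(chars) != len(tones):
        raise ValueError("need exactly one tone per character")
    return "".join(
        f'<span style="color:{TONE_COLOURS[tone]}">{char}</span>'
        for char, tone in zip(chars, tones)
    )


# 我 (tone 3) 的 (neutral) 车 (tone 1)
print(colourise("我的车", [3, 5, 1]))
```

The hard part, which the plugin does and this sketch does not, is working out the tones from the characters in the first place.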

A fond farewell to openSUSE …

February 20th, 2010 by donnek

My first experiments with GNU/Linux began in December 1998. I bought a boxed edition of Red Hat, and couldn’t get it to install. I moved on to Caldera OpenLinux, which did install, and stayed with it for 6 months before moving to SuSE 6.4. I’ve used SuSE/openSUSE ever since, and it’s been wonderful. I’ve looked at a few other distros in the meantime, but I’ve never been tempted away from what seemed to me to be the Rolls-Royce of Linux distros. Even the Novell purchase, which some people were concerned about at the time, resulted in new and exciting developments like the Build Service and openSUSE Studio.

But over the last couple of years, openSUSE seems to have lost its way a bit. I don’t blame Novell for trying to focus on enterprise use (that’s where its bread-and-butter comes from), and minimise the amount of time devoted to non-enterprise users, but this seems to have led to some odd decisions being made.

I’m still running 10.2 on my main machine. 10.2 is a couple of years old now, and in fact the repos for it have been emptied, so there are no updated packages available any more. So why am I doing this? The main reason is package management. Sometime around 9.2, the package management system began to be overhauled, and the net result was a tremendous loss of functionality. The first few iterations of the new system were slow and buggy, so much so that I started using first apt4rpm, and then the excellent Smart for package management. The 10.2 system was much faster, but had the unfeature that you could not save the packages you downloaded, so if you wanted to install them on several machines on your network you had to download them separately on each one (I know there are ways around this, but Smart was easier :-) ). The new 11.x systems are very fast (using deltas), and you can save the downloaded packages, but 11.2 has the bizarre unfeature that the package management system will not list or download packages in any new repos that you add, until you tell it to (see here, for instance – there is a better post about this from an openSUSE developer, but I can’t find it ATM).

This appears to have been done in order to minimise bug reports and queries from people along the lines of “I installed x from repo y, and now my system is broked”. Fair enough. But when I tried the new 11.2 a couple of months ago, I was pretty surprised when I added a new repo, and couldn’t get a listing of packages that I knew from the webpage listing were there. It took me about 3 hours to figure out what the problem was, and I wasn’t pleased, particularly since there was nothing in the release notes or in the package manager itself to tell you about this – it might have been sensible for whoever signed off on this decision to ensure that the repo manager popped up a message when you added a new repo, saying that it will not be activated unless you confirm that you know what you’re doing.

So for me, this was the last straw – four iterations of the distro and the package manager was still behaving unintuitively. I’d been using 64Studio, and quite liked apt-get, and then that was moved from a Debian base to an Ubuntu one last year. I was also noticing that there’s a lot of Ubuntu about – packages for it seem to be made quite widely, and there is a lot of info about it on the web. I’d tried Ubuntu last year, and was quite impressed, especially when I switched to Kubuntu, and did a complete dist-upgrade without a hitch (though mistrust returned when one morning said Kubuntu just refused to boot, and seemed to have FUBARed itself). It also got brownie points for doing a flawless UNR install on my eeePC, and a similarly flawless one on my R41 laptop, and even going online without a hitch via wired, wireless and mobile dongle. And the clincher was that Linode offers a virtual Ubuntu instance, with some excellent notes on how to configure various bits of software. It made sense to run the same distro on my desktop as I had on the server, not least because you can test things there before they go online.

So I took the plunge and did an Ubuntu install on my second PC. I kept GNOME on for about 3 weeks until I couldn’t stand their file dialogue any more, and upgraded to Kubuntu. KDE 4.3 is actually very nice now, and has most of the old KDE 3.x features back in. The only problem I have noticed is that IBus, the new input method framework for typing scripts like Chinese, seems to freeze the display after about 36 hours. The workaround is to quit it when it’s not in use.

Setting up Apache, PHP, R, LaTeX, etc, has been very easy. The only sad point there is that the cran2deb repo isn’t really useable on Ubuntu, because Ubuntu decided to break binary compatibility with Debian. But all in all, I’m quite impressed so far – a consumer version of GNU/Linux.

It will be interesting to see if Ubuntu stands the test of time (10 years) like openSUSE, but for me at the minute it’s certainly a better bet.

Shiny pretty thinks

February 19th, 2010 by donnek

For 6 months in 2007, I tried blogging on Google’s Blogger.  But I didn’t really warm to the layout or the configuration panel.  Now that I’ve moved to the very excellent Linode, I decided to look at WordPress again.  It’s improved a lot since I last considered it in 2006 :-)   Add in the rather elegant Green Park 2 theme from Artis Cordobo (aka Andreas Jacob), stir, and viola – is it a flower? is it a musical instrument? no, it’s a new lease of life for Me Myself Why? Who can say whether it will last longer this time?  Does the grass ask the sky what time the rain will come, little cricket?

Printer gotcha on openSUSE

September 12th, 2007 by donnek

I just spent a couple of hours yesterday trying to figure out why the printer on 2 new openSUSE 10.2 installs was only printing in greyscale. This is a very reliable network printer (Xerox DocuPrint C20), which I got for a song about 6 years ago (although the cartridges cost a packet!). I had just run through the standard YaST printer setup, and accepted the defaults.

Now, with a daughter wanting a printout of her homework poster 5 minutes before the bus went, the darned thing was only printing greyscale. But why??? Daughter was packed off with a “holding” greyscale version, and cue some headscratching.

It eventually transpired that the default setup in YaST installed the Gutenprint drivers (which for this printer and some others only seem to print in greyscale), instead of the Foomatic drivers. The odd thing is that the Foomatic ones have “(recommended)” next to them, so why does YaST go and install something other than the ones that are recommended? That may not be a bug per se, but in my view it comes pretty close to it.

The other strange thing is that changing the printer driver in YaST made no difference – it still printed greyscale. A reboot was required to make openSUSE start using the newly-selected driver!!! There is presumably some reason for this, but again, it seems to me that if something had to be restarted YaST should have done the needful by itself. It’s certainly one of the few times I have come across where a Linux box has to be rebooted just in order to have a configuration change take …..

Quechua segmenter

August 24th, 2007 by donnek

I’ve been working with Fran Tyers and the Apertium people over the past few months, and one of the issues for any MT system is dealing with the source language text that is fed into it. For interest, I decided to look at how an agglutinative language like Quechua might be dealt with, and the result is a very basic Quechua segmenter – there’s more info on the page. This needs much more work on the code (eg the ability to input connected, punctuated text) and a much bigger dictionary, but it actually works quite well.