Archive for the ‘Text software’ category

Autoglosser2 released

February 2nd, 2018

During 2009-11 I wrote the Bangor Autoglosser to gloss the Bangor ESRC corpora of multilingual (Welsh, Spanish, English) conversational text. I’ve done a new version, Autoglosser2, that focusses on Welsh written text, and outputs CorCenCC tags as well as Bangor-type glosses. Speed has been greatly increased too, from 1,000 to 22,000 glosses/minute. You can test it online, but for detailed work it’s better to download and install locally. There’s also a detailed manual available. Lots of work to do on it still, but it’s pretty robust, and gives reasonably good results.

Gàidhlig autoglosser

January 8th, 2016

I just realised that when I moved my code to GitHub I forgot to include the batch for the test application of the autoglosser to Gàidhlig. I’ve now rectified that, and also taken the opportunity to tidy up a few other things.


March 25th, 2010

My wife needed some pdfs sliced and diced to put on a work-related website, so I ended up learning a bit about the PDF Toolkit.

My HP printer has a rather nice webscan app, which scans over the network directly into a pdf on the computer.  However, the scans average around 1MB a page, which is far too big.  So we need to squish them down a bit first, using a ghostscript invocation I found here:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
  -dNOPAUSE -dQUIET -dBATCH -sOutputFile=1a.pdf 1.pdf

This will usually reduce the size to less than 100KB.
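
If there is a whole folder of scans to shrink, the ghostscript invocation above can be wrapped in a small loop. This is only a sketch: the function name shrink_pdfs and the "-small" suffix are my own illustrative choices, and the gs options are exactly the ones shown above.

```shell
# Sketch only: shrink_pdfs and the "-small" suffix are illustrative names.
# Each input such as 1.pdf is written to 1-small.pdf; the original is kept.
shrink_pdfs() {
    for in_file in "$@"; do
        out_file="${in_file%.pdf}-small.pdf"
        gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
           -dNOPAUSE -dQUIET -dBATCH -sOutputFile="$out_file" "$in_file"
    done
}
```

Running shrink_pdfs *.pdf in the scan folder should then process every scan in one go. Note that a second run would also pick up the "-small" copies, so it is probably best to move those elsewhere first.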

We can combine individual pdfs using pdftk:

pdftk first.pdf second.pdf cat output combined.pdf

and conversely we can split an existing pdf into individual pages by using:

pdftk mybig.pdf burst
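
If only a few consecutive pages are wanted rather than every page as a separate file, pdftk's cat operation also accepts a page range. A minimal sketch, where extract_pages is a hypothetical helper name and the file names are placeholders:

```shell
# Sketch only: extract_pages is an illustrative helper name.
# usage: extract_pages INPUT.pdf RANGE OUTPUT.pdf   (RANGE such as 2-4)
extract_pages() {
    pdftk "$1" cat "$2" output "$3"
}
```

For example, extract_pages mybig.pdf 2-4 excerpt.pdf should copy pages 2 to 4 of mybig.pdf into excerpt.pdf.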

You can open a pdf in the GIMP for touching up (eg removing the darkest bits from a bad scan if you don’t have the original to rescan), and then save it out as a Postscript file.  After that, you can convert the ps file to a pdf by running:

ps2pdf mynew.ps mynew.pdf

Theoretically, pdfs should orientate themselves correctly automatically.  If you have some existing scanned pages that are in portrait format when they should be landscape, and you can’t get them to show up in the correct orientation, you can open them in the GIMP, correct the orientation there, save as Postscript, and then use ps2pdf to create the pdf.  Occasionally, however, when you open the reoriented landscape file, you get it appearing in portrait mode, with the right-hand half of the file disappearing off the edge of the sheet.  You can fix this by using another ghostscript invocation directly on the Postscript file:

gs -dBATCH -dNOPAUSE -sOutputFile=mylandscape.pdf -sDEVICE=pdfwrite \
  -c "<< /PageSize [792 612]  >> setpagedevice"  -f
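
The two numbers in /PageSize are the page width and height in PostScript points, at 72 points to the inch, so [792 612] is an 11 x 8.5 inch sheet, i.e. US Letter turned on its side. For other paper sizes the values can be worked out along these lines. This is a rough sketch, and page_size_pts is a hypothetical helper name of my own:

```shell
# Sketch only: page_size_pts is an illustrative helper name.
# Prints a PostScript /PageSize array for a width and height given in inches,
# using 72 points per inch and rounding to the nearest whole point.
page_size_pts() {
    awk -v w="$1" -v h="$2" 'BEGIN { printf "[%d %d]\n", w * 72 + 0.5, h * 72 + 0.5 }'
}

page_size_pts 11 8.5       # US Letter landscape: prints [792 612]
page_size_pts 11.69 8.27   # A4 landscape: prints [842 595]
```

The printed array can then be dropped into the setpagedevice call in place of [792 612].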