


Korean tools for learners

1. Introduction

I developed KoSeg (Korean segmenter) over the last few years as a tool to help me learn Korean, and although it has many shortcomings, I'm presenting it here in case it might be of some help to others. In brief, KoSeg splits each word in a sentence into a meaning segment and an affix segment. For more information, see below, or go to the Segmenter page itself.

In many cases the output for a word will be equivalent to that from a glosser (or tagger). However, I call KoSeg a segmenter rather than a tagger because in a significant number of cases it performs no disambiguation of the output, and just lists all possible meanings for a word.

KoSaj (Korean sajeon, meaning "dictionary") is the dictionary used to look up the words and affixes. For more information, see below, or go to the Dictionary page itself.

I have created both these resources myself from my own reading and research, but I owe a great debt of gratitude to many authoritative sources for their Korean language knowledge, which I have unashamedly sculpted into something that suited my own purposes. If by doing so I have misrepresented any aspects of the language, I beg indulgence and forgiveness.

Both KoSeg and KoSaj are made available under the GPL3 and AGPL3 - see below for further details.

Note that because both tools are under development, the output you get may differ slightly from that shown in this documentation.

2. KoSeg

Before you start...

It is important to emphasise that KoSeg is NOT a translator like Kagi Translate (by far the best Korean-English translation resource), Papago, or Google Translate. It is located in the space between those translation resources and online dictionaries like the excellent Naver offering, in that the main aim is to produce a sense of how the words in a sentence are syntactically related to each other. With a dictionary, you know what the individual words mean (once you remove the affixes attached to them, which is not necessarily easy for a learner), but even then it may not be clear what the overall meaning of the sentence is. With a translator, you know what the overall sentence means, but it may not be clear how the component words contribute to that meaning. I've found KoSeg quite useful in a strongly left-branching language like Korean, where you usually get the topic at the beginning, and a verb at the end, but the relationship of the words in between can be a bit opaque.

The key point here is that KoSeg does not give you a "finished" explanation of the sentence -- rather, it does an automatic dictionary lookup, and gives you the general meaning of the words in the sentence. It is up to you to do a bit more thinking about the meaning those words convey.

Usage

On the Segmenter page, enter a Korean sentence into the text box on the left. You can copy and paste a sentence of your own, or use one of the sample sentences on the right. Alternatively, you can just type directly into the text box. Note that your text will be truncated to 300 characters (hangeul syllable blocks or Roman characters) to prevent server overload, and only common characters that you are likely to meet in ordinary text are permitted. If you have an English translation for your sentence, you can attach it to the Korean sentence using § (on the standard Ubuntu keyboard, this can be typed using AltGr+Shift+W), and that will be printed out as well, at the bottom (see the sample sentences for examples).
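The input handling described above can be sketched in a few lines of Python. This is only an illustration, not KoSeg's actual code: the function name `prepare_input` is hypothetical, and I am assuming that the 300-character limit is applied to the whole input before the § separator is looked for. The filtering of uncommon characters is not modelled here.

```python
from typing import Optional, Tuple

MAX_CHARS = 300   # limit stated in the text above
SEPARATOR = "§"   # attaches an optional English translation to the Korean sentence


def prepare_input(raw: str) -> Tuple[str, Optional[str]]:
    """Return (korean, translation); translation is None if no § is present.

    Assumption: truncation happens first, on the whole input.
    A precomposed hangeul syllable block counts as one character.
    """
    text = raw.strip()[:MAX_CHARS]
    korean, sep, translation = text.partition(SEPARATOR)
    if not sep:
        return korean.strip(), None
    # An empty translation after § is treated the same as no translation
    return korean.strip(), translation.strip() or None
```

For example, `prepare_input("언제 오셨는지 § I wonder when she came")` gives the Korean text and the translation as separate strings, ready to be segmented and printed respectively.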

Once the text is in the box, press the "Segment it!" button. KoSeg will then produce a printout on the right, with each word in the sentence stacked vertically. Against each word (across the row) there will be a romanisation (see below), then a meaning for the word, and a meaning for the affixes (if any) attached to the word.
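To give a feel for how a syllable-by-syllable romanisation of this kind can be produced, here is a minimal Python sketch. It is not KoSeg's own code. The arithmetic for splitting a hangeul syllable block into initial, vowel and final is standard Unicode behaviour, but the letter tables are my reconstruction from the romanised examples in this documentation (for instance ㅊ as "c" and final ㅄ as "bs"); letters that do not occur in those examples, such as the doubled consonants, are guesses.

```python
# Letter tables in Unicode jamo order. Assumption: values inferred from the
# romanised examples in this documentation; unattested letters are guesses.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s", "ss", "",
            "j", "jj", "c", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
FINALS = ["", "g", "kk", "gs", "n", "nj", "nh", "d", "l", "lg", "lm", "lb",
          "ls", "lt", "lp", "lh", "m", "b", "bs", "s", "ss", "ng", "j", "c",
          "k", "t", "p", "h"]

HANGEUL_BASE = 0xAC00      # first precomposed syllable block
HANGEUL_COUNT = 11172      # 19 initials * 21 vowels * 28 finals


def romanise_word(word: str) -> str:
    """Romanise one word, joining syllables with hyphens."""
    parts = []
    for ch in word:
        index = ord(ch) - HANGEUL_BASE
        if 0 <= index < HANGEUL_COUNT:
            initial, rest = divmod(index, 21 * 28)
            vowel, final = divmod(rest, 28)
            parts.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
        else:
            parts.append(ch)   # leave non-hangeul characters unchanged
    return "-".join(parts)


def romanise(text: str) -> str:
    """Romanise a sentence word by word, keeping the spaces between words."""
    return " ".join(romanise_word(word) for word in text.split())
```

With these tables, `romanise("이야기를")` returns `"i-ya-gi-reul"`, matching the romanisation used in the examples on this page.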

The following image shows the printout for the sentence "언제 오셨는지 엄마가 내 등 뒤에 서서 우리 이야기를 듣고 있었어요." (from 할머니 테왁, by 김종배).

Segmented sentence

Meanings

The entry for the meaning consists of the "prime" meaning for each word. The prime is a single meaning that best sums up the constellation of meanings for any given word. This is done purely in order to keep the output manageable, and it should be said that the prime meaning is somewhat subjective. However, clicking on the prime meaning will open a popup showing the full KoSaj dictionary entry for that word, which allows you to view the range of meanings associated with that word. One of those other meanings may be more appropriate for the word's usage in the sentence. If the prime meaning seems not to fit well into the sentence, it's a good idea to click on it to show the popup, since that will offer more options for the meaning. Click the "Close" button at the top right to close the popup.

Clicking on the word "story" against 이야기를, i-ya-gi-reul, for instance, will show the following:

Word pop-up

In some cases, words will have multiple meanings, and in that case the primes are listed in a group. This is most common with shorter words. In the above sentence, 내, nae, and 등, deung, have multiple meanings. Clicking on the group will give all the meanings separately, as with the popup for 내:

Prime group

In this case, the choices for 내 and 등 that make most sense are "I+POSS" (equals "my") and "back", so that 내 등 뒤에 means "behind my back" (lit. at the back of my back). That fits in with a viable translation for the whole sentence: "I'm not sure when she came in, but my mum was standing behind me, listening to our conversation."

Affixes

I use the term "affix" instead of "ending" or "suffix", because although many affixes usually appear at the end of the word, it is not unusual to have further affixes attached to these normally "word-final" affixes. To take one example, the plural affix -들, -deul very often occurs as the last item in a word, but is frequently followed by other affixes: -들처럼, -deul-ceo-reom, -들하고도, -deul-ha-go-do, -들한테만, -deul-han-te-man, -들만큼이나, -deul-man-keum-i-na, and so on.

The final column on the printout gives the affixes attached to the word. All affixes are marked in glosses with an initial + (plus sign). So far as possible, these are broken down into the main semantic components. For instance, in the example above, the word 오셨는지, o-syeoss-neun-ji, is segmented as the verb 오, o, "come", with attached affixes -셨는지, -syeoss-neun-ji. This affix sequence consists of -시- (-si-, honorific), -었- (-eoss-, past tense), and -는지 (-neun-ji, an interrogative expressing uncertainty or inquiry). The affix sequence is therefore listed as +HON+PAST+QVAGUE (see below for the meaning of abbreviations), so that the first two words (언제 오셨는지) can be translated as "I wonder when she came" or "I'm not sure when she came".
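The split into a meaning segment and an affix segment can be pictured with a small Python sketch. This is a toy illustration under my own assumptions, not a description of how KoSeg itself is implemented: the two tables stand in for the KoSaj dictionary, the function name `segment` is hypothetical, and the strategy of trying the longest affix sequence first is simply one plausible approach. The gloss for -셨는지 comes from the text above; the glosses +OBJ for -를 and +SUBJ for -가 are assumed for the example.

```python
from typing import Optional, Tuple

# Toy stand-ins for the dictionary (assumption: the real tables are far larger)
WORDS = {"오": "come", "이야기": "story", "엄마": "mum"}
AFFIXES = {"셨는지": "+HON+PAST+QVAGUE", "를": "+OBJ", "가": "+SUBJ"}

Segmentation = Tuple[str, str, str, str]   # (stem, meaning, affix, gloss)


def segment(word: str) -> Optional[Segmentation]:
    """Split a word into a listed stem and a listed affix sequence.

    Returns None when no split is found in the toy tables.
    """
    if word in WORDS:                       # bare word, no affixes attached
        return word, WORDS[word], "", ""
    # Shortest stem first, which means the longest affix sequence first
    for cut in range(1, len(word)):
        stem, affix = word[:cut], word[cut:]
        if stem in WORDS and affix in AFFIXES:
            return stem, WORDS[stem], affix, AFFIXES[affix]
    return None
```

Here `segment("오셨는지")` returns `("오", "come", "셨는지", "+HON+PAST+QVAGUE")`. Note that the affix sequence is looked up in its surface form (-셨는지), since -시- and -었- have already contracted to -셨-.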

As with the prime meanings, you can click on the affixes to open a popup. If an affix sequence has multiple meanings, they will be listed in a group, and the popup will show each of those meanings. In some cases (as with +LOC in this example), the affix entry in the popup may have additional notes explaining general meaning. I would like to extend these notes in the future, and there may also be a need to review and possibly standardise some of the affix tags.

At present, some 1,300 affix sequences are listed in the dictionary, though many of these are variants (eg, with or without the polite affix -요, -yo, the plural affix -들, -deul, the honorific affix -시, -si, and so on). Nevertheless, the variety and number of these sequences is one of the reasons why Korean makes such demands on the learner.

The Affixes page on this website lists the components of all affixes. Clicking on a component will show all the affix sequences it occurs in, and clicking again will hide them. It is worth saying that not all affix sequence variants have been listed yet in the dictionary, so (for example) a sequence following a vowel-final word may be missing, while its equivalent following a consonant-final word may be listed -- this is a function of whether or not it has been encountered in my reading! As noted above, more work needs to be done on normalising the affixes.

Affix abbreviations

Affixes are divided into two types for glossing. The first type uses general ideas as glosses, and these are written in capitals, eg +PAST, +REALISE, +SUGGEST. The second type uses specific English words as glosses, and these are written in lower-case, eg +so, +must, +also. This is an ad hoc and somewhat artificial distinction, and it is possible that if and when the affixes are normalised, this distinction may be removed, affixes re-glossed, etc.

Abbreviations which are not full words are listed here: ADV adverb, AFF affirmative, CONT continuous, DESC descriptive (adjective), FUT future, FUTDEF future definite, HON honorific, HORT hortative, HYP hypothetical, INDIC indicative, INFIN infinitive, INST instrumental, INT interrogative, L L-form (future participle), LOC locative, NEG negative, OBJ object, PL plural, PLUPERF pluperfect, POL polite, POSS possessive (genitive), PROG progressive, PROP propositive, QQUO question quotation (indirect interrogative), QUO quotation (indirect affirmative), QVAGUE vague question, SUBJ subject.
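If you want to read a gloss string mechanically, the conventions above are enough to expand it. The following Python sketch is my own illustration (the function name `expand_gloss` is hypothetical and is not part of KoSeg): it splits a gloss on the plus signs, looks up each abbreviation in the list above, and passes through glosses that are already full words, putting capitalised ones such as PAST into lower case.

```python
# Abbreviations copied from the list above
ABBREVIATIONS = {
    "ADV": "adverb", "AFF": "affirmative", "CONT": "continuous",
    "DESC": "descriptive (adjective)", "FUT": "future",
    "FUTDEF": "future definite", "HON": "honorific", "HORT": "hortative",
    "HYP": "hypothetical", "INDIC": "indicative", "INFIN": "infinitive",
    "INST": "instrumental", "INT": "interrogative",
    "L": "L-form (future participle)", "LOC": "locative", "NEG": "negative",
    "OBJ": "object", "PL": "plural", "PLUPERF": "pluperfect", "POL": "polite",
    "POSS": "possessive (genitive)", "PROG": "progressive",
    "PROP": "propositive",
    "QQUO": "question quotation (indirect interrogative)",
    "QUO": "quotation (indirect affirmative)", "QVAGUE": "vague question",
    "SUBJ": "subject",
}


def expand_gloss(gloss: str) -> list:
    """Expand an affix gloss such as '+HON+PAST+QVAGUE' into readable terms."""
    terms = []
    for tag in gloss.split("+"):
        if not tag:                      # skip the empty piece before the first +
            continue
        if tag in ABBREVIATIONS:         # abbreviated general idea
            terms.append(ABBREVIATIONS[tag])
        elif tag.isupper():              # general idea that is already a full word
            terms.append(tag.lower())
        else:                            # specific English word, eg +so, +must
            terms.append(tag)
    return terms
```

For instance, `expand_gloss("+HON+PAST+QVAGUE")` gives `["honorific", "past", "vague question"]`.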

Some affixes are attached only to nouns, and others only to verbs. This means that there is a fourfold distinction in the rules used to select meanings (see below for further details).

Partial segmentation

A complete segmentation of a word depends on (a) the base word being listed in the dictionary, and (b) its affixes being listed among the affix sequences in the dictionary. Either of these conditions may not be met, in which case you get a partial segmentation, or none at all.
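The two conditions can be expressed as a small check. The Python sketch below is only my reading of them, not KoSeg's actual logic: the function name `classify` and the result labels are hypothetical, and in particular I am assuming that "partial" means that either a listed base word or a listed affix sequence can be found in the word, but not both at the same split.

```python
def classify(word: str, words: set, affixes: set) -> str:
    """Return 'complete', 'partial' or 'none' for a word.

    words:   base words listed in the dictionary (condition a)
    affixes: affix sequences listed in the dictionary (condition b)
    """
    if word in words:                        # bare word with nothing attached
        return "complete"
    splits = [(word[:i], word[i:]) for i in range(1, len(word))]
    if any(stem in words and affix in affixes for stem, affix in splits):
        return "complete"                    # both conditions met
    if any(stem in words or affix in affixes for stem, affix in splits):
        return "partial"                     # only one condition met
    return "none"                            # neither condition met
```

With `words = {"오", "반지"}` and `affixes = {"셨는지"}`, the word 오셨는지 is classified as complete, 반지값을 as partial (반지 is listed, but -값을 is not), and 대한여학사협회 as none. Adding -값을 to the affix set makes 반지값을 complete.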

4. Licensing

Both KoSeg and KoSaj are available under the Free Software Foundation's General Public License version 3 or later, and the Affero General Public License version 3 or later. These licenses mean that (a) you can freely change the code to suit your own purposes, and even redistribute that changed version if you wish (with the proviso that you must also use the GPLv3 for that changed version), and (b) you can rewrite the web interface to suit yourself, but that rewritten version must also be made available under the Affero license. Ideally, you would give me a copy of any changes you make, so that they can be incorporated into this original version, but that is not a necessary condition for editing the code or interface as you see fit.

5. Local installation

If you wish, you can set up KoSeg and KoSaj on your own PC or laptop. Both tools have been developed on Linux (Ubuntu 24.04 LTS), but they should run on legacy operating systems too, though you might need to do some tweaking to fit how those OSs do things. You first need to install the Apache2 webserver, the PHP scripting language, and the PostgreSQL database management software. phpPgAdmin is also useful for oversight of the database. The Git repo for both tools is here, so clone that repo. Then load the database (dbs/kosajseg.sql) into PostgreSQL, and set up a virtual server in Apache2 so that you can access the web interface when you enter "http://kosajseg" in your browser's address bar. If you want to edit the dictionary, I would recommend DBeaver, since it allows multiple edits to be made on one result set. If you run into problems setting this up, contact me - I can give you further details. My ability to help with legacy OSs is very limited though, since I haven't used one since 2005.

6. How KoSeg works

This is technical stuff you probably don't need to read.

7. Shortcomings

I find KoSeg useful for my purposes, but at the time of writing it has a number of undeniable shortcomings.

Compound words. Korean tends towards combined compound words (like German) instead of spaced or hyphenated compound words (like English). Examples are 대한여학사협회 dae-han-yeo-hag-sa-hyeob-hoe, Korean Association of University Women, 운전면허증 un-jeon-myeon-heo-jeung, driving licence, 쇼핑지역 syo-ping-ji-yeog, shopping district, or even just 반지값 ban-ji-gabs, the price of a ring. KoSeg currently has no functionality that would allow words to be internally segmented to produce compound meanings, so this is addressed using two options. The first is the "brute-force" one of including such compound words as separate entries in the dictionary. This approach has been used in nearly all instances, but of course it is not ideal, since many possible compounds are not as yet listed in the dictionary. In such cases, users have to try segmenting the word themselves, based on their knowledge of other words. The second option is to add the last word in the compound as an affix, so in the case of 반지값을 ban-ji-gabs-eul, the dictionary has two entries: the word 반지 ban-ji, ring, and the affix -값을 -gabs-eul, +price+OBJ. This has only been used with a few one-syllable final words in a compound, and is again not ideal.

Can't split words, and Korean can sometimes have German-type words.

Names. KoSeg finds names hard to deal with.

Source texts. The work so far is based on analysis of around 70,000 words of formal written and some less formal spoken Korean. These texts are under copyright, so they cannot be reproduced here.

Punctuation less common. No hyphens.



©2022-25 Kevin Donnelly