I started building the MLMT (multi-language, multi-translation) dictionary system by expanding on a simple record-based Chinese-to-English dictionary I had put together to help me remember my homework in Chinese language studies at university. I had been attempting to branch out into Japanese by adding a few fields to the project. However, upon encountering JMdict, I determined that the format I was using was inadequate: it was a little too simple to store a complete picture of any particular word. I then rewrote the database in a format very similar to JMdict. The records were actually interchangeable; a JMdict entry imported and then exported by my program would come out looking exactly the same. Yet for some reason I couldn't shake the feeling that the JMdict format was moving in the wrong direction. For one, it was annoying to code for. Second, it just didn't feel like the right way to store the information; there were too many inefficiencies and kludges in how it was organized. I decided to start fresh with a very simple principle:

95% of what a learner needs in a dictionary is a direct translation.

This makes the target format easy to code; I can worry about specifics later. The second thing I wanted to do was separate the translation data from the dictionary data. In an MLMT project with 50 languages, that implies roughly 2,500 (50 × 50) dictionary pairs. But if we re-used source-language data, we could cut the project size by at least half, since the source side of every pair would be written once and shared rather than duplicated for each target language. Not to mention, the source-language data would tend to be of higher quality, since there would be only one copy of it to edit. This was the initial reasoning behind the Source Language Catalogue.

How would such a system be organized? It would need to be flexible enough to store any kind of language. Considering English as the standard alphabetic language, and Chinese and Japanese as representative character-based languages, we could imagine the following fields:

  • Language Number: which language this entry belongs to.
  • Headword field: contains the main entry, the "canonical dictionary form" for an entry.
  • Alternate Readings field: contains alternate forms of the word, e.g. "color" and "colour" would both be listed for the headword/canonical entry "color".
  • Pronunciation fields: IPA plus language-specific fields.
  • Tags field: language-specific information for this headword.

There are no dedicated fields for things like radical or stroke count, because they would only be useful in some languages. Instead, this data can be added in the Tags field as custom property tags.
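
To make this concrete, here is a minimal sketch of one catalogue record in Python, assuming the fields listed above. The class name, language codes, and sample tag values are my own illustrative choices, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    language: int                                               # Language Number: which language this entry belongs to
    headword: str                                               # canonical dictionary form
    alternates: list[str] = field(default_factory=list)         # alternate forms, e.g. "colour" alongside "color"
    pronunciations: dict[str, str] = field(default_factory=dict)  # IPA plus language-specific schemes
    tags: dict[str, str] = field(default_factory=dict)          # language-specific properties as custom tags

# An English entry: the alternate spelling lives in the alternates field.
color = CatalogueEntry(
    language=1,                      # hypothetical code for English
    headword="color",
    alternates=["colour"],
    pronunciations={"ipa": "ˈkʌlɚ"},
)

# A Chinese entry: character-specific data goes into tags rather than dedicated fields.
yanse = CatalogueEntry(
    language=2,                      # hypothetical code for Chinese
    headword="颜色",
    pronunciations={"pinyin": "yánsè"},
    tags={"traditional": "顏色"},
)
```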

Languages like Chinese or English fit very neatly into a system like this, but Japanese has proven difficult to organize, because words can appear as a mix of hiragana and kanji, and some kanji have different readings in different contexts. The neat way around this is to add type fields to the headword and alternate-reading fields. In fact, these fields can be merged, with a headword being, say, 10 + type. That would limit us to ten different types of word, but what language has ten or more writing systems? In any case we could say 100 + type. So if hiragana were type 1, a hiragana headword would be type 101.
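
As a sketch of that merged coding, here is roughly what the arithmetic looks like. The base values (100 for headwords, 200 for alternate readings) and all script codes except hiragana = 1 are my own illustrative choices.

```python
# Field role bases and script type codes (illustrative, except hiragana = 1 from the text).
HEADWORD, ALTERNATE = 100, 200
SCRIPT = {"kanji": 0, "hiragana": 1, "katakana": 2, "romaji": 3}

def field_code(role_base: int, script: str) -> int:
    """Combine a field role with a script type into one numeric code."""
    return role_base + SCRIPT[script]

def decode(code: int) -> tuple[int, str]:
    """Split a combined code back into (role base, script name)."""
    role_base, script_num = code - code % 100, code % 100
    script_name = next(s for s, n in SCRIPT.items() if n == script_num)
    return role_base, script_name

print(field_code(HEADWORD, "hiragana"))   # 101 -> a hiragana headword
print(field_code(ALTERNATE, "kanji"))     # 200 -> a kanji alternate reading
print(decode(101))                        # (100, 'hiragana')
```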

Now we have a system which is functionally similar to JMdict, except that it has no ent_seq or other meta fields and does not use restriction tags like stagk. In practice I find these tags to be an outgrowth of the policy of listing kanji as headwords, of "tossing in" reading elements without any sort of order, and of attaching part of speech to translation data rather than headword data. The end result is copious amounts of extraneous data being shown to the user, even on mature search engines like Tangorin (for example, searching for "hito" gives you secondary meanings for "ichi", because hito and ichi share a secondary kanji). Our solution is to break entries up by part of speech and/or sense attached to the headword. This naturally splits out the entries which would normally take an exclusion tag like stagr or stagk. A great example is "pounded fish cake" sitting in the same entry as "ticket stub" simply because (again) the kanji can be confused. This kind of mess would never occur with kana headwords and separate entries for each sense/part of speech.
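
To illustrate the split, here is a rough sketch using plain Python dicts. I am assuming the pair behind that example is はんぺん ("pounded fish cake") and はんけん ("ticket stub"); the exact words matter less than the fact that each sense gets its own kana-headword entry, so no restriction tags are needed. The sense labels are shown as English glosses purely for readability.

```python
# One entry per sense / part of speech, keyed by a kana headword.
# The kanji forms are kept as alternates rather than used as the entry key.
entries = [
    {
        "language": 3,            # hypothetical code for Japanese
        "headword": "はんぺん",
        "part_of_speech": "noun",
        "sense": "pounded fish cake",
        "alternates": ["半片"],
    },
    {
        "language": 3,
        "headword": "はんけん",
        "part_of_speech": "noun",
        "sense": "ticket stub",
        "alternates": ["半券"],
    },
]

# A search for はんぺん now returns only the fish cake entry;
# nothing about ticket stubs comes along for the ride.
print([e for e in entries if e["headword"] == "はんぺん"])
```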

This looks like it would fold up well into a single table, with standard comma and semicolon delimiters. On the other hand, I have been meaning to try a fully normalized system. In any case it does not look overly complex, and it seems like a good base for cataloging basic word data in various languages.
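
As a rough sketch of that single-table idea, assuming a plain CSV layout where commas separate fields and semicolons separate repeated values within a field (the column names are mine), it might look like this:

```python
import csv
import io

# One row per entry; semicolons separate repeated values inside a field.
flat = """language,headword,alternates,pronunciations,tags
1,color,colour,ipa=ˈkʌlɚ,
2,颜色,,pinyin=yánsè,traditional=顏色
"""

for row in csv.DictReader(io.StringIO(flat)):
    alternates = row["alternates"].split(";") if row["alternates"] else []
    tags = dict(t.split("=") for t in row["tags"].split(";") if t)
    print(row["headword"], alternates, tags)
```

A fully normalized version would instead move alternates, pronunciations, and tags into their own tables keyed by an entry id, trading the simple flat file for easier querying and editing.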

By Serena
