Now that we have understood the need for a source language catalogue — the need to separate the dictionary headword language from the translation language inside the database — let us return to the discussion of how the headword and translation information should be stored and how it should be presented to editors.

A standard first approach might be to implement a normalized SQL database and a UI which attempts to shield the user from having to know about things like enumerated keywords and the database schema.

One idea I have been tossing around for this is to just present the user with XML and not worry about a schema at all.

  • XML can be edited by humans, and therefore we would not need a complex UI for editing the dictionary.
  • Although editors would be required to understand the XML document type, we could slowly code a more user-friendly editor around an initial simple textarea and submit button.
  • We would not need complex schema code to handle language-specific data in an editor, i.e. lots of little buttons (or empty text inputs) for inserting things like tags onto various fields.
  • When changes to the schema are necessary, only the display code would need a programmatic change, not the code for storage or editing.
  • We could use a fancy-schmancy XML database like Sedna.

The most interesting pro is that it would be easier to code for, especially in that things such as tags could be attached anywhere and displayed without the need for a well-defined document structure. In SQL this would be done by defining a “tags table” with something like ID, entry_no, tag_name, and tag_value (for tags like ichi1 or ichi2). These could be read and attached to a particular entry without requiring the program to understand what those tags mean. For example, if tags such as strokes:15 or pos:n were read, they could simply be displayed to the user directly (or with a quick parse to expand pos to Part of Speech, for example). The program would not need to understand that a strokes tag is only applicable to a Kanji entry; the editors would simply add a strokes:# tag only to Kanji entries.
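As a rough sketch of the “tags table” idea, here is how it might look with Python's built-in sqlite3 module. The column names follow the ones listed above; the LABELS mapping and the helper function names are assumptions made up for illustration, not part of any existing system.

```python
import sqlite3

# Hypothetical label expansions; tags not listed here are shown as-is.
LABELS = {"pos": "Part of Speech"}

def make_db():
    """Create an in-memory database with a generic tags table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tags ("
        " id INTEGER PRIMARY KEY,"
        " entry_no INTEGER NOT NULL,"
        " tag_name TEXT NOT NULL,"
        " tag_value TEXT)"
    )
    return conn

def add_tag(conn, entry_no, tag_name, tag_value):
    """Attach a tag to an entry; no check on which tags fit which entries."""
    conn.execute(
        "INSERT INTO tags (entry_no, tag_name, tag_value) VALUES (?, ?, ?)",
        (entry_no, tag_name, tag_value),
    )

def display_tags(conn, entry_no):
    """Return display strings for an entry's tags without interpreting them."""
    rows = conn.execute(
        "SELECT tag_name, tag_value FROM tags WHERE entry_no = ? ORDER BY id",
        (entry_no,),
    ).fetchall()
    return [f"{LABELS.get(name, name)}: {value}" for name, value in rows]

conn = make_db()
add_tag(conn, 1, "strokes", "15")
add_tag(conn, 1, "pos", "n")
print(display_tags(conn, 1))  # ['strokes: 15', 'Part of Speech: n']
```

Nothing in this code knows that strokes belongs only on Kanji entries; that rule lives with the editors, as described above.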

However, named properties are not an incentive to use SQL. The exact same thing is possible in XML with named tags like <strokes>10</strokes>. The real benefit of showing the user XML seems to be the three tons of schema and editing code that doesn’t need to be written to prevent the user from ruining the document structure (we just assume he won’t, and that he doesn’t mind getting his hands dirty with a little XML). Yet if anything we do in XML could also be done with a normalized SQL database, what really is the reason to use XML? We can represent hierarchical data using a flat record structure too, so long as the order of elements is used to store information. JMDictDB uses the order of tags to store information even within an XML database, but if you’re going to do that then why use XML at all?
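For comparison, the same “named properties” approach in XML might look something like the sketch below, using Python's standard xml.etree.ElementTree. The <entry> wrapper and the sample values are hypothetical; only the <strokes> element comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; the element names are illustrative, not a real DTD.
ENTRY_XML = "<entry><strokes>10</strokes><pos>n</pos></entry>"

def named_properties(xml_text):
    """Return (name, value) pairs for every child element, in document order.

    The code does not know what any element means; it only reads names.
    """
    root = ET.fromstring(xml_text)
    return [(child.tag, (child.text or "").strip()) for child in root]

for name, value in named_properties(ENTRY_XML):
    print(f"{name}: {value}")
```

Adding a new kind of property here means adding a new element to the document; the reading code stays the same.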

Some of the model/view categories we came up with are:

  1. Full-blown editing UI: The user does not need training to edit dictionary entries.
    • Pro: Easiest for users.
    • Con: Schema is defined by code; a nightmare of code maintenance.
    • Database: Fully normalized SQL.
  2. A series of textareas named “r_ele, pos, gloss, etc.”
    • Pro: easy to code for, things like order are represented by lines in a textarea.
    • Pro: Tree-like schema can be represented with a flat record structure (therefore) no difficult code required to “build” the HTML editing form.
    • Con: Discordant representation requires the use of exclusionary tags like stagr and stagk.
    • Database: Each textarea could be a record with line delimiters, or could be fully normalized.
  3. One big textarea: SQL
    • Each line would contain a type specifier with a colon and then data i.e.
      • r_ele: shuko
      • gloss1: lunch
      • gloss2: eclipse
    • Pro: Easiest to store and present to users
    • Pro/Con: Have to write our own processing library from scratch (but) this can be seen as an advantage re: speed and size.
    • Database: these could be placed in separate tables (for elements and glosses) or in a single table with the tag and tag value in columns. Both have advantages and disadvantages for searching the database.
  4. One big textarea: XML
    • Essentially the same as #3 but has a more mainstream structure (maybe even easier for editors to learn) and will have a larger code and data size.
    • Database: The entire record is stored in one line of a table. Or in a SQL database.
    • Con: Makes it a little more difficult to search unless a SQL database is used.
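To make option 3 a bit more concrete, here is a minimal sketch of the “type specifier, colon, data” format being parsed and stored in a single table with the tag and tag value in columns. The table name entry_lines, its columns, and the function names are assumptions for illustration; the sample lines are the ones from the list above.

```python
import sqlite3

def parse_entry(text):
    """Split 'tag: value' lines into (position, tag, value) rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        tag, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line: {line!r}")
        rows.append((len(rows), tag.strip(), value.strip()))
    return rows

def store_entry(conn, entry_no, text):
    """Store each parsed line as one row; position preserves line order."""
    conn.executemany(
        "INSERT INTO entry_lines (entry_no, position, tag, value)"
        " VALUES (?, ?, ?, ?)",
        [(entry_no, pos, tag, value) for pos, tag, value in parse_entry(text)],
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry_lines"
    " (entry_no INTEGER, position INTEGER, tag TEXT, value TEXT)"
)

textarea = """r_ele: shuko
gloss1: lunch
gloss2: eclipse"""
store_entry(conn, 1, textarea)

# Searching glosses becomes a plain query on the tag and value columns.
hits = conn.execute(
    "SELECT entry_no FROM entry_lines WHERE tag LIKE 'gloss%' AND value = ?",
    ("eclipse",),
).fetchall()
print(hits)  # [(1,)]
```

The “processing library” here is a dozen lines, which is roughly the speed-and-size argument made under option 3. The trade-off is that every query has to filter on the tag column.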

Before we make a choice we might as well ask the question: do we really want completely untrained people editing the database? The answer is of course no. Do we want “casual editors”? Probably not, because casual editors would have no enduring incentive to provide high-quality input. Since users would be responsible for high-quality, canonical, language-specific data (the product we sell), you would expect anyone in charge of that data to have some amount of training, so that we can establish a minimum level of quality. You see this in similar products like JMDict, where users are advised (trained) to replace adj tags with more specific tags like adj-i and adj-na, and on how and when to use exclusionary tags like stagr and stagk. Therefore, if users must already be aware of such information, it does not seem like too much to ask them to read and present the information in a defined structure, such as one described by an XML DTD or any other kind of structured text.

So what’s easier for Joe Random the editor?

Example #1:

Example #2:

Above are two examples of language information using different forms of structured text. All the same information is presented in both examples.

If XML were to be used we would require a well-ordered document. But XML is thought of as more difficult to maintain precisely because of its verbosely well-ordered nature. It may be programmatically easier to simply assume a structure based on the order of tags, as in Example #2 (i.e. if subsequent tag elements are encountered after a reading element, they should be attached to that reading element).
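The ordering rule described above could be sketched roughly as follows. The tag name r_ele is borrowed from the list of options earlier; the sample lines and the function name are hypothetical and are not a reconstruction of the post's examples. Leading whitespace is stripped, so indentation is purely a visual aid for the editor.

```python
def attach_by_order(lines, owner_tag="r_ele"):
    """Group flat 'name: value' lines under the most recent reading element.

    A line whose name equals owner_tag starts a new reading; every later
    line is attached to that reading until the next one appears.
    """
    readings = []
    for raw in lines:
        line = raw.strip()  # indentation is ignored by the parser
        if not line:
            continue
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name == owner_tag:
            readings.append({"reading": value, "tags": []})
        elif readings:
            readings[-1]["tags"].append((name, value))
        else:
            raise ValueError(f"tag {name!r} appears before any {owner_tag}")
    return readings

# Hypothetical sample input, indented only for readability.
sample = [
    "r_ele: reading-one",
    "  pos: n",
    "r_ele: reading-two",
    "  pos: v",
    "  misc: rare",
]
for reading in attach_by_order(sample):
    print(reading)
```

No closing tags or nesting are needed; the hierarchy is implied entirely by the order of the lines.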

A conclusion can be stated as follows: programmatically interpreting a loosely defined, user-created structure is better than hardcoding the structure of the schema in code. Let the display code look for named properties; let them not be column names in SQL tables or required fields in an XML document. From the editor’s standpoint, whitespace can be used to help visualize the hierarchy of data without requiring it to be explicitly pointed out in code.

The decision now appears to be whether to use an XML database to store the underlying information (à la Example #1) or a shorter kind of representation in a SQL database (à la Example #2). The first step in reaching a final decision will be to take a more detailed look at the state of XML databases like Sedna and see whether they have evolved to the point where it is worthwhile to use one of them, or whether we should just stuff the data into tables.

By Serena
