As it turns out, there is an interesting natural language problem in Kongzi beyond the multiple choice problem discussed years ago. Let me recap quickly so you know what I mean. Assume a multiple choice quiz generator that picks its answers at random (from among known and unknown cards, and also from similar parts of speech). I had noticed that certain types of words were too easy to dismiss. For example, students would quickly learn to recognize sentence patterns that required a noun and then rule out simple past tense verbs (e.g. “opened”). I discovered that students could perform this kind of pattern and grammar recognition even when they knew neither the correct word nor the meaning of the word they were ruling out.
To get around this I began manually tagging dictionary entries with their part of speech, so that the multiple choice generator could choose at least one answer that students could not so easily dismiss. Yet this brought to the fore another problem that had lain dormant until then: answers that are marked incorrect yet are perfectly correct. For example: “I went to the store and bought a ____.” a) pencil, b) watermelon, c) banana, d) fish. Every option fits, and no amount of clever question writing will make this easy.
If we are forced to write “I went to the fruit store and bought a long, yellow ____”, one answer becomes obvious. This suggests the idea of “related adjectives”: one could in theory use the adjectives associated with a word to help choose questions. True, but this is tricky to get right, it needs a lot of data, and there is no guarantee that it will solve the problem. One can substitute almost any kind of dessert into “I went to the cake shop to buy a delicious, sweet ____.”
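The part-of-speech-aware answer selection described above could be sketched roughly as follows. This is only an illustration, not Kongzi's actual generator: the function name `pick_distractors`, the `dictionary` mapping of word to part of speech, and the tag names are all assumptions made for the example.

```python
import random

def pick_distractors(answer, dictionary, count=3, rng=None):
    """Pick `count` wrong answers, at least one sharing the answer's part of speech.

    `dictionary` is a hypothetical {word: part_of_speech} mapping.
    """
    rng = rng or random.Random()
    answer_pos = dictionary[answer]
    same_pos = [w for w, pos in dictionary.items() if pos == answer_pos and w != answer]
    others = [w for w, pos in dictionary.items() if pos != answer_pos]

    chosen = []
    if same_pos:
        # Guarantee one option that cannot be dismissed on grammar alone.
        chosen.append(rng.choice(same_pos))

    # Fill the remaining slots at random from everything that is left.
    pool = [w for w in same_pos + others if w not in chosen]
    rng.shuffle(pool)
    chosen.extend(pool[:count - len(chosen)])
    return chosen
```

Note that this only addresses the grammar-based elimination problem. It does nothing about options that are wrong according to the answer key but fit the sentence perfectly well.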
A related problem is automated part of speech tagging. Taken on its own, without context, there is almost no way to tell whether the word “house” is a noun or a verb.
House the home (n) — House the poor (v).
There is, however, a “best shot” approach known as the Brill tagger, and a more detailed tag set is available from the Brown corpus, a hand-tagged corpus that taggers can be trained on.
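To make the “best shot” idea concrete, here is a toy sketch in the spirit of a transformation-based tagger: first give every word its most likely tag, then apply correction rules that look at context. A real Brill tagger learns its rules from a tagged corpus; here the tiny `LEXICON` and the single rule in `RULES` are hand-written assumptions purely for illustration.

```python
# Hypothetical most-frequent-tag lexicon; a real one would come from a tagged corpus.
LEXICON = {
    "we": "PRON", "need": "VERB", "to": "TO", "the": "DET",
    "house": "NOUN", "poor": "NOUN", "is": "VERB", "old": "ADJ",
}

# Each rule: (tag to change, new tag, required tag of the previous word).
RULES = [("NOUN", "VERB", "TO")]

def tag(words):
    """Tag words with their default tag, then patch tags using context rules."""
    tags = [LEXICON.get(w.lower(), "NOUN") for w in words]  # unknown words default to NOUN
    for old, new, prev in RULES:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return list(zip(words, tags))

print(tag("we need to house the poor".split()))
print(tag("the house is old".split()))
```

With this one rule, “house” comes out as a verb after “to” and as a noun after “the”. It would still be wrong for a bare imperative such as “House the poor”, which is exactly why a human check remains useful.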
There may be others, but I wanted to comment on this since it is now a focus of Kongzi Online. The issue is the command “add all words as cards”: when it runs, we cannot seem to tag parts of speech accurately.
My initial plan was to assign each dictionary word a “part of speech priority” and then somehow automate the choice, but this is a fool’s errand.
My current thinking is to have the program auto-analyze each word for part of speech and then ask a human to confirm it. This only needs to be done once, and seeing as mulching is an admin command anyway, I think that will be the best approach. Once we have accurate parts of speech for a story, whole new worlds open up: automatic sentence pattern frequency lists, better multiple choice (and cloze) testing, and so forth. A decent associated adjectives list becomes only a few lines of code away.
Now, how to store it? The data could be attached to the story directly and simply not shown when the story is printed on the screen. This is probably the best way to do it, with the tag attached via a slash, like this/pos, or brackets of some kind, such as this(pos).
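A minimal sketch of the slash format, assuming whitespace-separated tokens and a small made-up tag set (`TAGS`). The function names are hypothetical, and punctuation handling is left out on purpose. It also shows how an associated adjectives list falls out of the tagged text.

```python
from collections import defaultdict

# Assumed tag set; checking against it keeps tokens like "and/or" from being misread.
TAGS = {"n", "v", "adj", "adv", "det", "prep", "pron"}

def parse_tagged(text):
    """Turn 'word/pos' tokens into (word, pos) pairs; pos is None when untagged."""
    pairs = []
    for token in text.split():
        word, sep, pos = token.rpartition("/")
        if sep and word and pos in TAGS:
            pairs.append((word, pos))
        else:
            pairs.append((token, None))
    return pairs

def display_text(text):
    """The story as shown on screen, with the tags hidden."""
    return " ".join(word for word, _ in parse_tagged(text))

def associated_adjectives(pairs):
    """Map each noun to the adjectives that appear directly before it."""
    assoc = defaultdict(set)
    pending = []
    for word, pos in pairs:
        if pos == "adj":
            pending.append(word)
        else:
            if pos == "n" and pending:
                assoc[word].update(pending)
            pending = []
    return dict(assoc)

story = "I/pron bought/v a/det long/adj yellow/adj banana/n"
print(display_text(story))
print(associated_adjectives(parse_tagged(story)))
```

The slash form is easy to split on, but a story containing literal slashes needs the tag-set check (or an escape rule) to stay unambiguous. The bracket form would have the same issue with ordinary parentheses.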
Then we will present the user (an administrator) with a series of sentences with their parts of speech shown below them, and the ability to enter a new part of speech, or perhaps click on the part of speech (or the word) to change it, maybe in a modal. Submitting then brings up the next sentence. Since the data is tagged onto each word, the work can be done incrementally: any word without a pos indicates that its sentence has not been analyzed yet.
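The incremental review could work along these lines: find the first sentence that still contains a word without a tag and show that one next. This is a sketch that assumes the word/pos slash format with lowercase tags, and that the story has already been split into sentences. The names `needs_review` and `next_unreviewed` are made up for the example.

```python
import re

# A token counts as tagged when it looks like word/pos with a lowercase pos.
TAGGED_TOKEN = re.compile(r"^\S+/[a-z]+$")

def needs_review(sentence):
    """True if any word in the sentence has no part of speech attached."""
    return any(not TAGGED_TOKEN.match(tok) for tok in sentence.split())

def next_unreviewed(sentences):
    """Index of the first sentence still needing review, or None when all are done."""
    for index, sentence in enumerate(sentences):
        if needs_review(sentence):
            return index
    return None

sentences = [
    "House/v the/det poor/n",
    "I/pron went to/prep the/det store/n",  # "went" is untagged
    "Fish swim",
]
print(next_unreviewed(sentences))
```

Because the state lives in the text itself, no separate progress record is needed. The admin can stop at any point and pick up at the same sentence later.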
We could even simply tag each word with a number to confirm it belongs to a certain dictionary id.
Shall we then merely create an alternate text in which every word possible is replaced by a number? This is probably along the lines we’re looking for. Every word replaced by a number in the alternate text would then mark a word that can possibly be added as a card, and the others would be more or less ignored. As far as ease of manual checking goes, this may be it: “click to show the dictionary definition in a side window”, then “choose the correct form of the word from the dictionary (or none)”, then click “save”. This could be done on a word-by-word, sentence-by-sentence basis rather easily, but the UI for it will be a bit complex.
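One way the alternate text might look, as a rough sketch. The `#` prefix (used so that dictionary ids cannot be confused with numbers that occur in the story itself), the `dictionary_ids` mapping, and both function names are assumptions for illustration only.

```python
def to_id_text(text, dictionary_ids):
    """Replace every word found in the dictionary with '#<id>'; leave the rest as-is."""
    out = []
    for word in text.split():
        entry_id = dictionary_ids.get(word.lower())
        out.append(f"#{entry_id}" if entry_id is not None else word)
    return " ".join(out)

def addable_ids(id_text):
    """Dictionary ids present in the alternate text, i.e. words that could become cards."""
    return [
        int(tok[1:])
        for tok in id_text.split()
        if tok.startswith("#") and tok[1:].isdigit()
    ]

# Hypothetical ids; a real dictionary would have separate entries for house (n) and house (v).
dictionary_ids = {"house": 12, "poor": 40}
alternate = to_id_text("House the poor", dictionary_ids)
print(alternate)
print(addable_ids(alternate))
```

The lookup here picks a single id per spelling, so it is only a first guess. The manual step of choosing the correct form from the dictionary is what would swap in the id of the right entry.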
Priority is probably somewhere below multiple choice questions and SRS flashcards.
We’ll keep SRS for the flashcards and F-Score for the multiple choice, since the latter is more like a test anyway.