Sometimes you just get awestruck by things you discover that you didn’t know you were looking for.

On a lark I looked for a part-of-speech (POS) tagging library and discovered spaCy. Yes, it’s in Python, but beggars cannot be choosers.

spaCy looks absolutely amazing, and in preliminary tests it appears to be a perfect tokenizer for the story save process. Unbelievably, it seems perfect (other than being written in Python), and the POS tagging is so good that I believe the marketing when it claims spaCy is nearly state-of-the-art. It uses context to decide whether a word like ‘lead’ is a verb or a noun. I could never write something like that without pulling significant, life-shifting time away from my other projects. Unfortunately we humans all get the same 24 hours a day, and only 24. We must make choices. We must make sacrifices. But spaCy exists, so I can resume writing kongzi.

Now, should I include a record of the tagging process, such as an intermediary .a file to a .c file, so that if there is an error I can manually alter the file? That would be interesting, but the people working on spaCy seem to have done such an amazing job already (no, really) that not only would I strain to improve it, I would again be diverting my time far away from working on the site itself and its content.

Here’s an example from “The Animals and the Plague” (one of Aesop’s fables):

Note that the word, lemma and pos/tag data form, essentially, a “sense” — a nearly-if-not-always unique key into the dictionary.
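To make the “sense as a key” idea concrete, here is a minimal sketch of how it might look in Python. It is not the actual kongzi code: `Sense`, `sense_from_token`, `senses_from_text` and the tiny `dictionary` are names made up for illustration, and the two ‘lead’ tokens are hand-written stand-ins rather than real spaCy output. The only thing borrowed from spaCy is the set of token attribute names (`text`, `lemma_`, `pos_`, `tag_`).

```python
from types import SimpleNamespace
from typing import Callable, NamedTuple


class Sense(NamedTuple):
    """Hypothetical dictionary key: word, lemma, coarse POS, fine-grained tag."""
    word: str
    lemma: str
    pos: str
    tag: str


def sense_from_token(token) -> Sense:
    # Accepts any object exposing spaCy's Token attribute names.
    return Sense(token.text, token.lemma_, token.pos_, token.tag_)


def senses_from_text(nlp: Callable, text: str) -> list:
    # `nlp` is any callable returning an iterable of tokens. With spaCy
    # installed, this would be a loaded pipeline such as
    # spacy.load("en_core_web_sm") (assumed here, not tested).
    return [sense_from_token(token) for token in nlp(text)]


# Hand-written stand-ins for two tagged tokens (NOT real spaCy output).
verb = SimpleNamespace(text="lead", lemma_="lead", pos_="VERB", tag_="VB")
noun = SimpleNamespace(text="lead", lemma_="lead", pos_="NOUN", tag_="NN")

# The same spelling yields two distinct keys, one per sense.
dictionary = {
    sense_from_token(verb): "to guide or go first",
    sense_from_token(noun): "a heavy, soft metal",
}

print(dictionary[sense_from_token(noun)])
```

Because the key is a plain tuple, it hashes and compares by value, so the verb and the noun ‘lead’ land in separate dictionary entries even though the word and lemma are identical.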

That is utterly amazing and exactly what I need.

It’s 3am now and I can hardly sleep after this discovery!

I will need to take a break from all this, mostly, for the next few days due to work. But this discovery has changed everything. What a development!

By Serena
