# Most Readability Scores Are Inaccurate

As it turns out, most readability scores are inaccurate. This should come as no surprise, because simply joining sentences with commas or semicolons (versus separating them with periods) will inflate the score. I’ve found it trivial to push or pull a given document by up to two grade years just by replacing some commas with periods (and vice versa), without changing what the document says. Notably, this doesn’t affect the Dale-Chall or Coleman-Liau indexes, but those two have other problems that make them unusable, such as giving scores that are too high for simple documents.
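
The inflation is easy to demonstrate with the published Flesch-Kincaid grade formula, 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59. Only those constants are standard; the vowel-group syllable counter and the sample sentences below are rough stand-ins:

```python
import re

def syllables(word):
    # Crude heuristic: count vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.;!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syl / len(words)) - 15.59

short = "The dog ran home. The cat slept. The bird sang a song."
joined = "The dog ran home, the cat slept, the bird sang a song."

print(round(fk_grade(short), 1))   # -1.2
print(round(fk_grade(joined), 1))  # 1.9  (same words, ~3 grades higher)
```

The exact same twelve words score about three grade levels apart purely because of punctuation.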

It strikes me as odd that formulas which count the number of words per sentence, or the number of letters or syllables per word or per sentence, are considered accurate in determining the grade level of a text. How can this be, when some words are simply less important or useful than others at a given level?

Combining my idea of targeted frequency analysis with the general idea behind existing readability formulas, we may be able to produce an accurate readability score, given:

p1. the natural order hypothesis

p2. the input hypothesis / n+1 theory

p3. the conclusion that students therefore increase their vocabulary in predictable ways as they grow older and progress through the education system.

I’m currently working on the idea of word frequency. So let’s assume for a moment that there is no correlation between the words a student knows and the frequency of those words. This is the situation where we have some frequency f that is not consistent with n (the word’s position in the natural order hypothesis). In that case there is no real value in using word frequency, and we should simply look for documents where “the user knows most of the words, if not all”. Let’s define a document as containing n+p, where n are the words the user knows and p are the words the user does not know.

However, even in such a case, if we wish to introduce n+1 it would make sense to choose the p word (with p = 1) whose frequency is as high (common) as possible, so long as we know nothing else about the text or the target of the frequency analysis. Then again, any +1 is assumed to be valid because of its n component, which is assumed to be very large; it seems to make sense that a +1 would be connected in some way to the level of the text in general, such that a +1 rarely presents itself with a severely out-of-order word frequency. In the absence of a targeted frequency list, then, the general frequency list should be assumed to provide the shortest path to language competency; it would at the very least open the greatest number of next-possible +1 jumps.
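
As a minimal sketch of choosing a +1 by general frequency, assume a frequency list expressed as ranks (rank 1 = most frequent); every word and rank below is invented:

```python
# Hypothetical frequency ranks (rank 1 = most frequent word in the language).
freq_rank = {"the": 1, "ran": 700, "dog": 950, "scamper": 18000, "ubiquitous": 21000}
known = {"the", "dog", "ran"}  # the user's n

def best_plus_one(doc_words, known, freq_rank, default=10**6):
    """Among unknown words, pick the most frequent (lowest-rank) one;
    unlisted words get a huge default rank, i.e. lowest priority."""
    unknown = set(doc_words) - known
    if not unknown:
        return None  # a pure-n document
    return min(unknown, key=lambda w: freq_rank.get(w, default))

print(best_plus_one(["the", "dog", "ran", "scamper", "ubiquitous"],
                    known, freq_rank))  # scamper
```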

### Making it Work

RS1; An article has a word-set, the list of its unique words. If all of these words are known to the user, it is an “n” article; even within n, texts at a higher grade or difficulty level will be harder for the student to read. All remaining articles are then sorted by how many new words they contain: there will be +1 stories, +2 stories, and so forth. We may also consider the length of the story; if there is a new word every 250 to 300 words, this is probably a decent level, roughly one new word per page of text. In any case, for short-story or article reading, articles with 1 to 5 new words per article are going to be the targets for n+1 presentation. So for n we have the reading analysis scores, but for the +1 (or +p) words we do not. This leads us to RS2.
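
RS1 might be sketched as follows; the 250-word threshold is only the rough guess from the paragraph above, and the sample vocabulary is invented:

```python
def rs1_class(article_words, known):
    """Number of unique unknown words: 0 means an 'n' article, k means '+k'."""
    return len(set(article_words) - known)

def decent_level(article_words, known, words_per_new=250):
    """Rough density check: about one new word per 250+ running words."""
    p = rs1_class(article_words, known)
    return p > 0 and len(article_words) / p >= words_per_new

known = {"the", "cat", "sat", "on", "a", "mat"}
print(rs1_class(["the", "cat", "sat"], known))            # 0: an n article
print(rs1_class(["the", "cat", "sat", "purred"], known))  # 1: an n+1 article
print(decent_level(["the"] * 299 + ["purred"], known))    # True (1 new in 300)
```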

RS2; We apply the frequency of the p words (the +1 or extra words) as the method for choosing among them. We then see no reason not to use the frequency of the words in n to score the n texts for n presentation. Thus the RS is some variable related to the frequency of the words in the text. Using an average would imply that the standard deviation is relevant for deciding +1; I think it makes more sense to use word frequency to determine the +1, since that is essentially what we have done anyway. And as time goes by, presenting n texts that contain low-frequency words amounts to presenting the “most recent n+1” words as a form of review.
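
A sketch of RS2 under the same rank-based assumption (higher rank = rarer); the plain mean over unique words used here is just one of the options left open above:

```python
def rs2_score(words, freq_rank, default=10**6):
    """Mean frequency rank over unique words; a higher score means rarer
    vocabulary, i.e. a harder text (and more recent-n+1 review as n)."""
    ranks = [freq_rank.get(w, default) for w in set(words)]
    return sum(ranks) / len(ranks)

# Invented ranks: 1 = most frequent word in the language.
freq_rank = {"the": 1, "cat": 300, "dog": 500, "ubiquitous": 21000}
easy = ["the", "cat", "dog"]
hard = ["the", "cat", "ubiquitous"]
print(rs2_score(easy, freq_rank) < rs2_score(hard, freq_rank))  # True
```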

Therefore the highest-frequency words in a document should be used to determine its difficulty, as the lower-frequency words will be assumed to be known by the user. This leads to a secondary conclusion: when a user knows a sufficient number of words at a given frequency, he should be introduced to words below that frequency before words above it. Perhaps an average or statistical-deviation method should be used here.

### To introduce words

Introduce new words that the user does not know in frequency order. If there is a targeted frequency list, such as one for “Harry Potter” because the student must learn those words in order to write a book report on “Harry Potter”, use that; in all other cases the general list is assumed to be most useful.
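
A possible sketch of this introduction queue; the word lists, ranks, and the `targeted_rank` parameter are all illustrative:

```python
def next_words(unknown, general_rank, targeted_rank=None, k=5):
    """Queue up to k unknown words in frequency order; a targeted list
    (e.g. built from one book) takes priority over the general list."""
    rank = targeted_rank if targeted_rank else general_rank
    candidates = [w for w in unknown if w in rank]
    return sorted(candidates, key=rank.get)[:k]

general = {"run": 150, "house": 400, "castle": 3000}
targeted = {"castle": 50, "wand": 1200}   # e.g. a "Harry Potter" list
unknown = {"run", "house", "castle", "wand"}

print(next_words(unknown, general))            # ['run', 'house', 'castle']
print(next_words(unknown, general, targeted))  # ['castle', 'wand']
```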

Once a word is “in” the set of words the user is studying, some SRS (spaced-repetition) algorithm is assumed to work for the user. Frequency analysis is only useful for determining the +1 when there is a choice among discovered n+1 propositions.

### Finding N vs N+1

Since the user already knows all the words in an n story, it is possible that the average frequency will determine its level. However, whether this should be computed over all words, over unique words, over only the most difficult words, etc. is unknown. If we average the most difficult 20% of the words in the document, we might get an idea of the “difficulty impact” of that document. Perhaps as few as 10 words could be used for this number, perhaps 5; or perhaps it should be based on document length: one word per 100 headwords? Per 200? Some experimentation must be done to determine which average, over which composition of words, should be used to analyze a document.
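
One possible reading of the most-difficult-20% average, again with invented ranks:

```python
import math

def difficulty_impact(words, freq_rank, frac=0.20, default=10**6):
    """Mean frequency rank of the rarest `frac` of unique words
    (at least one word), as a stand-in for 'difficulty impact'."""
    ranks = sorted((freq_rank.get(w, default) for w in set(words)), reverse=True)
    k = max(1, math.ceil(len(ranks) * frac))
    return sum(ranks[:k]) / k

freq_rank = {"the": 1, "cat": 300, "dog": 500,
             "ubiquitous": 21000, "sesquipedalian": 60000}
words = ["the", "cat", "dog", "ubiquitous", "sesquipedalian"]
print(difficulty_impact(words, freq_rank))  # 60000.0 (rarest 20% = 1 word)
```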

Also, is it really more difficult to understand a document with more copies of a low-frequency word? Yes and no; it should be taken into account, but it is perhaps not that important. It would in any case be covered by an average of, say, the most difficult 20% of words.

A document will also need to be scored by its most difficult word alone; assuming this word will represent a +1 at some point, that is the number we will need when looking for a word near a certain frequency.

This implies we will need at least one story for every word in the target language!

RS3; for every ~200 words in the text, drop the least frequent word. The rarest remaining word is the ‘n’ for an n+1 where you want a +1 every ~200 words. We could carry this idea further, especially for longer stories, though for longer stories the concept has less meaning. The point is to avoid documents with outlier words of very low frequency. Thus two or three numbers should perhaps be recorded: the n (least frequent word), the n for n+1 (second least frequent word), and so on, with perhaps a fourth or fifth serving as an over-200-word estimate. This would allow us to find stories within the user’s n, at exactly n+1 or n+2 (etc.), or at n+p where p is over x words. These could then be used as “difficulty ratings” at a certain n level, like the star ratings used in some ESL magazines. Within a certain band (a certain n), articles containing a greater number of the band’s rarer words would be considered more difficult.
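
One possible reading of RS3, with made-up frequency data: sort the unique words rarest-first, then keep one marker per ~200 running words plus a couple of extras for the n / n+1 ladder:

```python
def rs3_markers(words, freq_rank, per=200, extra=2, default=10**6):
    """Rarest-first difficulty markers: drop one outlier word per ~`per`
    running words, and record the next few rarest as the n / n+1 ladder."""
    rarest_first = sorted(set(words),
                          key=lambda w: freq_rank.get(w, default),
                          reverse=True)
    drops = len(words) // per
    return rarest_first[: drops + extra]

freq_rank = {"the": 1, "cat": 300, "glimmer": 70000,
             "parapet": 80000, "sibilant": 90000}
story = ["the"] * 396 + ["cat", "glimmer", "parapet", "sibilant"]  # 400 words
print(rs3_markers(story, freq_rank))  # ['sibilant', 'parapet', 'glimmer', 'cat']
```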

What is preferred? According to Krashen, the more n+1 available the better: both n and n+1 material.

How much n vs. n+1 should be presented? The number of articles is meaningless; only the amount of reading matters, so it must be measured in words. Look for data on 200 vs. 250 vs. 300 words, or on what percentage is a decent amount. It is probably rare, at least for shorter articles or at lower levels, to find an article with only one new word that is over x words in length (one new word in 500 may be considered rare, or at least easier than one new word in 200). Maybe there is a method for this?

Top frequency divided by word count?

Let’s take all articles which are n+1. Longer ones will be preferred, and/or articles with more occurrences of the +1 word. Thus within n+1 articles, preferred articles will be longer articles and also articles where the +1 is of lower frequency. The longer the article, the more n practice (review) a user will get.

This would then be represented by a fractional new-word count over x words, such as 0.4 per 200 for one new word in 500. This appears to be a reasonable metric for determining n+1: the fractional +1-per-x score among the n+1 articles with the highest frequency scores. Prefer 70% frequency and 30% occurrence? Again, more experimentation would need to be done to see how this works in practice.
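
The arithmetic checks out (200 * 1 / 500 = 0.4), and the tentative 70/30 blend can be sketched as below; the normalization constants are arbitrary placeholders, not values from the text:

```python
def plus_one_rate(new_words, length, per=200):
    """Fractional new-word count per `per` running words."""
    return per * new_words / length

def preference(plus_one_rank, occurrences, max_rank=50000, max_occ=10,
               w_freq=0.7, w_occ=0.3):
    """Tentative 70/30 blend: the +1 word's global rarity weighted against
    how often it occurs in the article (both clamped to [0, 1])."""
    return (w_freq * min(1.0, plus_one_rank / max_rank)
            + w_occ * min(1.0, occurrences / max_occ))

print(plus_one_rate(1, 500))                       # 0.4, as in the text
print(preference(40000, 5) > preference(1000, 5))  # rarer +1 preferred: True
```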

In practice: “These three articles all have approximately one new word per 200” (because the articles you studied previously eliminated the ones with one new word per 1000). For any given frequency, then, we will need at least one n+1 story. We might also determine whether too many stories fit a given n by analyzing each story’s value as if it were an n story; too many stories at the same n are not so helpful. But this would likely be so rare as to make checking meaningless. Looking at averages, or at the number of stories per approximate grade level (or band), would be enough to determine whether more or fewer were needed at a particular level.

### Ignoring Grammar

This method ignores the progression of grammar, e.g. counting how many sentence patterns someone knows and slowly introducing new sentence patterns and grammatical constructs. No known readability metric covers precisely this, as it would require identifying sentence patterns in a text, which is a non-trivial endeavour: one would first have to identify sentence parts, and then classify patterns by the string of parts or by equivalent strings of parts.

This may not be as difficult as it first seems. Initially, many words could be classified with 100% accuracy: “she” is always a pronoun; “walk” will always be a verb or a noun, and we can determine that it is a noun if it appears after an article, and a verb if it appears after a subject and is properly conjugated.

In this sense, the auto-recognition of a sentence pattern could be done in stages: first by identifying the obvious words, then by working out the less obvious ones. And that’s it: once a pattern has been broken down into parts, it can be hashed into a list of parts. SP (sentence-pattern) frequency can then be used in conjunction with word frequency to determine n+1. The user should be offered both kinds of learning, though in what mix or order is not immediately clear; perhaps grammatical complexity should be introduced before word frequency. All of this would need to be studied and overlaid onto grade levels, or analyzed to see how they correlate.
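
A toy version of this staged approach; every word list, rule, and tag below is illustrative, not a real tagger:

```python
# Pass 1 vocabulary: words we can tag with certainty.
SURE = {"she": "PRON", "he": "PRON", "the": "ART", "a": "ART"}
# Words needing context to resolve.
AMBIG = {"walk": {"NOUN", "VERB"}, "walks": {"NOUN", "VERB"}}

def tag(sentence):
    words = sentence.lower().split()
    tags = [SURE.get(w, "?") for w in words]
    # Pass 2: resolve ambiguous words from their left neighbour.
    for i, w in enumerate(words):
        if w in AMBIG:
            if i > 0 and tags[i - 1] == "ART":
                tags[i] = "NOUN"   # after an article: a noun
            elif i > 0 and tags[i - 1] == "PRON":
                tags[i] = "VERB"   # after a subject pronoun: a verb
    return tags

def pattern_hash(sentence):
    """Join the tag string so identical patterns fall into one bucket."""
    return " ".join(tag(sentence))

print(pattern_hash("she walks the walk"))  # PRON VERB ART NOUN
```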

Perhaps certain SPs could be pegged to be introduced at certain word-frequency levels (and vice versa); that would be very convenient indeed!