LXX Vocabulary Coverage
The Blog of Nathan D. Smith
James Tauber is blogging daily until SBL, and several of his posts have piqued my interest, so expect to see a few derivative posts here.
The first post covers vocabulary coverage statistics for the SBLGNT. The concept can take a moment to wrap your mind around: given a count of vocabulary learned (the rows), and assuming you want to be able to read a certain percentage of the words in a verse (the columns), in what percentage of verses will you be successful (the cell where the two meet)?
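To make that concrete, here is a minimal Python sketch of the calculation as I understand it. It is not James' code: the function name `verse_coverage` and the toy verses are made up for illustration, and it assumes that each verse is a list of lemmas and that vocabulary is learned in descending order of frequency.

```python
from collections import Counter

def verse_coverage(verses, vocab_size, target):
    """Fraction of verses in which at least `target` (0.0-1.0) of the tokens
    are known, assuming the `vocab_size` most frequent lemmas were learned."""
    # Count every lemma across the whole corpus.
    counts = Counter(lemma for verse in verses for lemma in verse)
    # Assumption: the learner knows exactly the most frequent lemmas.
    known = {lemma for lemma, _ in counts.most_common(vocab_size)}
    readable = 0
    for verse in verses:
        known_tokens = sum(1 for lemma in verse if lemma in known)
        if known_tokens >= target * len(verse):
            readable += 1
    return readable / len(verses)

# Toy corpus (hypothetical data, four tiny "verses" of lemmas).
verses = [
    ["ὁ", "θεός", "λέγω"],
    ["καί", "ὁ", "θεός"],
    ["καί", "ὁ", "καί", "ὁ"],
    ["ὁ", "ἄνθρωπος"],
]

print(verse_coverage(verses, 2, 0.5))  # two words learned, 50% target
print(verse_coverage(verses, 2, 1.0))  # two words learned, 100% target
```

Each cell in a coverage table is one call like this, with the row supplying `vocab_size` and the column supplying `target`.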
My usual instinct when reading posts about New Testament Greek is to try the same thing with the Septuagint. Here is the data for the LXXM using the methodology outlined in James' post:
             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.78%    88.63%    27.16%     1.99%     0.74%     0.62%
   200    99.80%    94.19%    51.25%     8.65%     2.58%     1.56%
   500    99.84%    98.38%    78.01%    33.00%    13.95%     8.30%
  1000    99.89%    99.35%    89.86%    58.46%    34.27%    23.08%
  2000    99.92%    99.61%    95.93%    79.25%    59.45%    46.20%
  5000    99.99%    99.87%    98.67%    93.72%    85.12%    77.44%
 10000   100.00%    99.99%    99.78%    98.31%    95.33%    92.15%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
(In order to obtain the necessary input data, I had to restructure the lxxmorph-unicode dataset - after proofing I'd like to release the new format soon.)
Say you had learned 500 words and only wanted to look up about one word per verse (90%): per the table, you would be successful in 33.00% of verses. Another way of looking at it: if you wanted to know 75% of words in 90% of verses, how big would your vocabulary need to be? About 1000 words.
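The "about 1000 words" answer comes from reading down the 75% column of the table above until it reaches 90% of verses. As a rough sketch, the same reading can be done with straight-line interpolation between rows. The helper name `interpolate_vocab` is hypothetical, and the assumption that coverage grows linearly between two rows is only an approximation.

```python
# Percentage of verses readable at 75% word coverage, keyed by vocabulary
# size (values copied from the 75.00% column of the first table).
coverage_75 = {
    100: 27.16, 200: 51.25, 500: 78.01, 1000: 89.86,
    2000: 95.93, 5000: 98.67, 10000: 99.78,
}

def interpolate_vocab(column, wanted):
    """Estimate the vocabulary size at which `column` reaches `wanted` percent
    of verses, interpolating linearly between neighbouring table rows."""
    sizes = sorted(column)
    if column[sizes[0]] >= wanted:
        return float(sizes[0])
    for lo, hi in zip(sizes, sizes[1:]):
        if column[lo] < wanted <= column[hi]:
            fraction = (wanted - column[lo]) / (column[hi] - column[lo])
            return lo + fraction * (hi - lo)
    return None  # the column never reaches the wanted percentage

print(round(interpolate_vocab(coverage_75, 90)))
```

With these figures the estimate lands a little above 1000, since the 1000-word row sits at 89.86%, just short of 90%.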
I have been convinced by smart and experienced educators that vocabulary mastery really is the key to mastery of reading Greek. Just imagine the frustration of having to look up words that often even after learning so many. Wait, you probably don't have to imagine it - we've all been there! Vocab is king.
The LXX is a much bigger corpus than the New Testament (and may have more lexical diversity - perhaps the subject of a forthcoming post). By way of comparison with the above, in the SBLGNT a vocab of 500 targeting 90% coverage would be successful in 36.57% of verses.
I wonder whether the number of proper nouns in the LXX significantly skews these numbers. Proper nouns are not vocabulary words per se - the knowledge and memory of them works differently than for vocab words. So what if I remove them from consideration (in this case just filtering out words which start with a capital letter from the input file)? This decreased the word count from 623,685 to 589,731. Here is the updated coverage:
             ANY    50.00%    75.00%    90.00%    95.00%   100.00%
------------------------------------------------------------------
   100    99.91%    91.57%    40.48%     6.54%     3.36%     3.02%
   200    99.92%    95.92%    63.30%    18.17%     7.77%     5.79%
   500    99.97%    99.21%    85.79%    47.92%    26.03%    18.13%
  1000    99.99%    99.84%    94.92%    72.53%    49.98%    38.05%
  2000    99.99%    99.97%    98.84%    89.13%    74.26%    63.61%
  5000   100.00%   100.00%    99.92%    98.40%    93.92%    89.70%
 10000   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
   ALL   100.00%   100.00%   100.00%   100.00%   100.00%   100.00%
That change upped the 500/90% result to 47.92% from 33.00%. Still pretty daunting, but less intimidating when you think of it that way.
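For the curious, the proper-noun filter described above amounts to something like the following. This is a sketch rather than the exact script I ran: the function name `drop_proper_nouns` is made up, and it assumes the input has already been split into one word per list item.

```python
def drop_proper_nouns(tokens):
    """Keep only tokens that do not start with a capital letter.

    Assumption: in the input, a leading capital marks a proper noun.
    Empty strings are kept as-is.
    """
    return [token for token in tokens if not token[:1].isupper()]

# Hypothetical sample: two names and two ordinary words.
sample = ["Ἀβραάμ", "καί", "Ἰσαάκ", "θεός"]
print(drop_proper_nouns(sample))
```

Python's `str.isupper` handles Greek capitals with breathing marks, so names like Ἀβραάμ are caught. The filter is crude, though: any ordinary word that happens to be capitalized in the input would be dropped as well.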