Fun with LXXM-Corpus
The Blog of Nathan D. Smith
Once I have a text available for natural language processing, there are a few basic tasks I like to perform to kick the tires. First, I like to run the collocations method of NLTK, which gives common word pairs from the text. For the LXXM, here are the results:
- ἐν τῇ
- ἐν τῷ
- ὁ θεὸς
- τῆς γῆς
- καὶ εἶπεν
- λέγει κύριος
- ἀνὰ μέσον
- τὴν γῆν
- τοῦ θεοῦ
- ὁ θεός
- τάδε λέγει
- πρός με
- πάντα τὰ
- ὁ βασιλεὺς
- οὐ μὴ
- οὐκ ἔστιν
- τῇ ἡμέρᾳ
- οἱ υἱοὶ
- τῷ κυρίῳ
- τοῦ βασιλέως
If you disregard the stop words (articles, prepositions, particles), the remaining words, such as θεὸς, γῆς, κύριος, and βασιλεὺς, give a decent idea of the fundamental thematic content of the text.
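For readers who want to try something similar without NLTK, here is a minimal sketch of the idea in plain Python. It is a simplified stand-in, not NLTK's collocations method: it just counts adjacent word pairs by raw frequency. The function name `top_bigrams`, the toy token list, and the tiny stop list are all assumptions made for illustration; a real run would use the tokenized LXXM text.

```python
from collections import Counter

def top_bigrams(tokens, stop_words=frozenset(), n=20):
    """Return the n most frequent adjacent word pairs.

    Pairs containing a word from stop_words are skipped.
    This ranks by raw count only (an assumption of this sketch).
    """
    pairs = Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if a not in stop_words and b not in stop_words
    )
    return [" ".join(pair) for pair, _ in pairs.most_common(n)]

# Toy token list (hypothetical, built from words in the list above).
tokens = "καὶ εἶπεν ὁ θεὸς καὶ εἶπεν κύριος ὁ θεὸς τῆς γῆς".split()

# Hypothetical, very partial stop list: first, third, and tenth toy tokens.
stop_words = {tokens[0], tokens[2], tokens[9]}

print(top_bigrams(tokens, n=2))              # most frequent pairs, stop words included
print(top_bigrams(tokens, stop_words, n=3))  # same, with stop-word pairs dropped
```

With the stop words left in, the top pairs are dominated by articles and conjunctions, much like the list above; filtering them out leaves only pairs of content words.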
Now for the silliness, using the n-gram random text generator:
ἐν ἀρχῇ ὁδοῦ πόλεως ἐπ' ὀνόμασιν φυλῶν τοῦ Ισραηλ παρώξυναν οὐκ ἐμνήσθησαν διαθήκης ἀδελφῶν καὶ ἐξαποστελῶ πῦρ ἐπὶ Μωαβ ἐν τῷ ἐξαγαγεῖν σε τὸν ἱματισμόν
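The general technique behind this kind of output can be sketched in a few lines of plain Python. This is a bare bigram random walk, assuming nothing about how NLTK's generator is implemented (which may use a different n-gram order or smoothing). The names `build_bigram_model` and `generate` and the toy token list are hypothetical.

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each word to the list of words observed immediately after it."""
    model = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        model[current].append(following)
    return model

def generate(model, start, length=20, seed=None):
    """Random walk over the bigram model, beginning at `start`."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    words = [start]
    while len(words) < length:
        followers = model.get(words[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        # Repeated followers stay in the list, so frequent pairs are more likely.
        words.append(rng.choice(followers))
    return " ".join(words)

# Toy token list (hypothetical); a real run would use the whole corpus.
tokens = "καὶ εἶπεν ὁ θεὸς καὶ εἶπεν κύριος ὁ θεὸς τῆς γῆς".split()
model = build_bigram_model(tokens)

print(generate(model, tokens[0], length=8, seed=0))
```

Every adjacent pair in the output occurs somewhere in the training text, which is why the result looks locally plausible while meaning nothing as a whole.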