Prep the SBLGNT for use as an NLTK corpus
The Blog of Nathan D. Smith

The SBLGNT is available as a plain-text download, which is my personal favorite format for text processing. I have wanted to put the SBLGNT into a Natural Language Toolkit (NLTK) corpus for quite some time, and yesterday I finally got around to it.

The plain text of the SBLGNT has a few undesirable features for this task. First, each verse is prefixed with the verse number and a tab character, which is great for many applications but not for corpus linguistics. Second, the text contains Windows-style line breaks and other extraneous whitespace. Third, the text contains text-critical signs.

So I wrote a script to download the plaintext archive, extract the text, and normalize it for use in NLTK.
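As a rough illustration of the normalization step (this is a sketch, not the script itself), the per-line cleanup could look something like the following. The function name `normalize_verse` and the list of text-critical signs are assumptions made for the example; the real files may contain other sigla.

```python
# Assumed set of text-critical sigla; the actual SBLGNT files may use others.
CRITICAL_SIGNS = "⸀⸁⸂⸃⸄⸅"

# Translation table that deletes every character in CRITICAL_SIGNS.
STRIP_SIGNS = {ord(sign): None for sign in CRITICAL_SIGNS}


def normalize_verse(line):
    """Return the bare text of one 'reference<TAB>text' line."""
    # Drop Windows or Unix line endings.
    line = line.rstrip("\r\n")
    # Drop the verse reference, i.e. everything up to the first tab.
    # Lines without a tab (such as book titles) are kept whole.
    reference, tab, text = line.partition("\t")
    if not tab:
        text = line
    # Remove the text-critical signs.
    text = text.translate(STRIP_SIGNS)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())
```

Calling `normalize_verse` on every line of a book and writing the results to one file per book would give output in the shape described here.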

First, download and extract or check out the repo. To install requirements:

$ pip install -r requirements.txt

Next, run the script:

$ python

Now you have a collection of text files, one for each book of the New Testament, in a directory called "out". You can now use these with NLTK. For example:

>>> import nltk
>>> sblgnt = nltk.corpus.PlaintextCorpusReader('out','.*',encoding='utf-8')
>>> sblgnt_text = nltk.text.Text([w.encode('utf-8') for w in sblgnt.words()])

You end up with sblgnt as an NLTK corpus object and sblgnt_text as an NLTK text object. You can refer to the NLTK documentation for the various uses of these. Please take note of the encodings. If you don't pay attention, you'll get lots of encoding errors when working with a Unicode text and NLTK. (The explicit encode call is there for Python 2; on Python 3, where strings are already Unicode, it should not be needed.)

One thing you can do is run the collocations method on sblgnt_text:

>>> sblgnt_text.collocations()
Building collocations list
τοῦ θεοῦ; ἐν τῷ; ἀλλ ’; ἐν τῇ; ὁ Ἰησοῦς; δι ’; ἐπ ’; ὁ θεὸς; μετ ’; εἰς τὴν; ἀπ ’; τῆς γῆς; λέγω ὑμῖν; Ἰησοῦ Χριστοῦ; ἐκ τοῦ; τῷ θεῷ; τοῦ κυρίου; κατ ’; εἰς τὸ; οὐκ ἔστιν
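Notice that elided forms such as ἀλλ ’ and δι ’ come out as two tokens: the word and a separate apostrophe. One possible way to keep them together is a regular expression that lets a word carry a trailing elision mark. The sketch below uses only the standard library; the set of apostrophe-like characters and the names `ELISION_MARKS`, `TOKEN_PATTERN` and `tokenize` are assumptions for the example.

```python
import re
import unicodedata

# Assumption: elision is marked with one of these apostrophe-like characters.
ELISION_MARKS = "’ʼ'"

# A run of word characters, optionally followed by one elision mark,
# or else a run of punctuation.
TOKEN_PATTERN = re.compile(r"\w+[" + ELISION_MARKS + r"]?|[^\w\s]+")


def tokenize(text):
    """Split text into words and punctuation, keeping elisions attached."""
    # NFC composes base letters with their accents, so \w matches whole letters.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_PATTERN.findall(text)
```

With this pattern, "ἀλλ’ ἐν" yields two tokens rather than three. One caveat: a closing single quotation mark directly after a word would be attached in the same way. The same pattern could presumably be handed to `nltk.tokenize.RegexpTokenizer` and passed to `PlaintextCorpusReader` through its `word_tokenizer` argument, though that is untested here.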

I'll have to look into tweaking the NLTK tokenizer, because, as you can see, it splits the apostrophe off elided words and treats it as a separate token, which may or may not be grammatically correct (I'll have to think about that and ask around). Another cool trick is the generate method:

>>> sblgnt_text.generate(50)
Building ngram index...

ΠΡΟΣ ΚΟΡΙΝΘΙΟΥΣ Α Παῦλος ἀπόστολος Χριστοῦ Ἰησοῦ καὶ τοῖς βουνοῖς · Καλύψατε ἡμᾶς · πολλοὶ ἐλεύσονται ἐπὶ τῷ λόγῳ διὰ τῆς στενῆς θύρας , ὅτι τὸ μωρὸν τοῦ θεοῦ . Καὶ ἐγένετο ἐν τῷ βυθῷ πεποίηκα · ὁδοιπορίαις πολλάκις , ἐν κόποις , ἐλπίδα δὲ ἔχοντες αὐξανομένης τῆς πίστεως ,
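To give a feel for what is going on, here is a minimal sketch of the general idea: a random walk over an n-gram index built from the tokens. This is not NLTK's implementation, which may use a different n-gram order and smoothing; the function names and the choice of bigrams are assumptions for the example.

```python
import random
from collections import defaultdict


def build_bigram_index(tokens):
    """Map each token to the list of tokens seen immediately after it."""
    followers = defaultdict(list)
    for current, following in zip(tokens, tokens[1:]):
        followers[current].append(following)
    return followers


def generate(tokens, length=50, seed=None):
    """Produce up to `length` tokens by walking the bigram index at random."""
    if not tokens:
        return []
    rng = random.Random(seed)
    followers = build_bigram_index(tokens)
    # Start from the first token of the text.
    output = [tokens[0]]
    while len(output) < length:
        candidates = followers.get(output[-1])
        if not candidates:
            # Dead end: the last token is never followed by anything.
            break
        # Frequent continuations appear more often in the list,
        # so they are chosen more often.
        output.append(rng.choice(candidates))
    return output
```

Every adjacent pair in the output occurs somewhere in the source text, which is why the result reads like plausible but scrambled Greek.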

So that's that. At some point I'll attempt to make a tagged text based on the MorphGNT (which is being re-based off SBLGNT).

Date: 2013-02-24 15:26