A categorized, tagged Greek New Testament corpus
The Blog of Nathan D. Smith

I have published a categorized, tagged Greek New Testament useful for natural language processing. I am calling it sblgnt-corpus. The text comes from the SBLGNT and the morphological tags come from the MorphGNT project.

The text is split into one file per book, and each file belongs to one or more categories (e.g. gospel and pauline). Within a file there is one sentence (not verse) per line, with sentences ending at the punctuation marks . ; and ·. This makes it easy to tokenize sentences by splitting on newlines. Each word is paired with its morphological tag in word/tag format (NLTK will automatically split word and tag on the slash). Within the tag, the part of speech is separated from the parsing information by a hyphen, which enables the use of the simplify_tags option in NLTK.

Here is an example:

εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;
ψαλλέτω/V-3PADS ./.
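
To make the format concrete, here is a minimal sketch in plain Python (no NLTK) that reads the two example lines and separates each token into word, part of speech, and parsing. The function name parse_line is only illustrative and is not part of the corpus or of NLTK.

```python
def parse_line(line):
    """Split one corpus line into (word, pos, parsing) triples."""
    tokens = []
    for item in line.split():
        # Split on the last slash so the word itself is kept intact.
        word, tag = item.rsplit("/", 1)
        # 'V-3PAIS' -> ('V', '3PAIS'); tags without a hyphen get parsing ''.
        pos, _, parsing = tag.partition("-")
        tokens.append((word, pos, parsing))
    return tokens

# The two example sentences, one per line, as they appear in a corpus file.
text = "εὐθυμεῖ/V-3PAIS τις/RI-NSM ;/;\nψαλλέτω/V-3PADS ./."
sentences = [parse_line(line) for line in text.splitlines()]

for sentence in sentences:
    print(sentence)
```

Punctuation carries itself as its tag, so it comes back with an empty parsing field.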

Here follows an example of how to load this corpus into NLTK:

from nltk.corpus.reader import CategorizedTaggedCorpusReader

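def simplify_tag(tag):
    # Keep only the part of speech: 'V-3PAIS' becomes 'V'.
    # Anything that is not a string (e.g. None) is returned unchanged.
    try:
        return tag.split('-')[0]
    except (AttributeError, TypeError):
        return tag

# The pattern matches the per-book files by their two-digit prefix;
# cats.txt maps each file to its categories.
# Note: tag_mapping_function is the NLTK 2.x API; NLTK 3 dropped it in favor of tagset.
sblgnt = CategorizedTaggedCorpusReader('sblgnt-corpus/',
    r'\d{2}-.*', encoding='utf8',
    tag_mapping_function=simplify_tag,
    cat_file='cats.txt')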

Now through the sblgnt object you have access to:

sblgnt.tagged_words()
sblgnt.tagged_words(simplify_tags=True)
sblgnt.tagged_sents()
sblgnt.words(categories='gospel')
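
As a first exploration step, one might count how often each part of speech occurs. The following is only a sketch: it does not load the corpus, but uses hand-written sample data in the shape that tagged_sents() returns (a list of sentences, each a list of (word, tag) pairs), and the helper name pos_counts is hypothetical.

```python
from collections import Counter

# Sample data in the shape tagged_sents() returns: one list of (word, tag)
# pairs per sentence. These are the two sentences from the format example.
tagged_sents = [
    [("εὐθυμεῖ", "V-3PAIS"), ("τις", "RI-NSM"), (";", ";")],
    [("ψαλλέτω", "V-3PADS"), (".", ".")],
]

def pos_counts(sents):
    """Count part-of-speech tags, ignoring the parsing code after the hyphen."""
    return Counter(tag.split("-")[0] for sent in sents for _, tag in sent)

counts = pos_counts(tagged_sents)
for pos, n in counts.most_common():
    print(pos, n)
```

With the real corpus loaded, the same function should work on sblgnt.tagged_sents() or on a single category's sentences.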

That should be enough to kickstart the exploration of the Greek New Testament with natural language processing.

Date: 2013-03-13
