A categorized, tagged Septuagint corpus
The Blog of Nathan D. Smith
Last year I created a version of the SBLGNT for use as a categorized, tagged corpus for natural language processing. Now I have done the same with a Septuagint text, which I am calling LXXMorph-Corpus. The source for the text and tags is my Unicode conversion of the CATSS LXXMorph text.
The text is arranged with one book per file. Certain books in the source LXXMorph text are split where there is significant textual divergence (manuscripts B and A, or the Old Greek and Theodotion). Each file has one or more categories (e.g. pentateuch and writings).
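As I understand NLTK's categorized corpus readers, the file-to-category mapping lives in a plain-text category file with one line per corpus file: the file name followed by its categories, separated by spaces by default. Here is a minimal sketch of reading such a mapping. The file names and category assignments are invented for illustration and are not copied from the corpus, and parse_category_lines is my own helper name, not an NLTK function.

```python
def parse_category_lines(lines):
    """Map each file name to its list of categories (space-delimited lines)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        file_id, *categories = line.split(' ')
        mapping[file_id] = categories
    return mapping

# Hypothetical file names and category assignments, for illustration only.
sample = [
    "01.example.txt pentateuch",
    "02.example.txt former-prophets writings",
]
file_categories = parse_category_lines(sample)
```

A file listed with several categories simply shows up under each of them when you filter by category.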
Since there is no punctuation in the source text, the files are laid out with one verse per line. A better arrangement from an NLP perspective would be one line per sentence (thereby preserving the semantic structure). Maybe someday we'll have a freely-licensed LXX text which will include sentence breaks.
Each word is accompanied by its morphological tag in the word/tag format (NLTK will automatically split word and tag on the slash). Within the tag, the part of speech is separated from the parsing information by a hyphen, which makes it easy to reduce each tag to its part of speech with the simplify tags option in NLTK.
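To make the format concrete, here is a small sketch in plain Python (no NLTK needed) that takes one line apart by hand. The sample line and its tag strings are invented to show the shape of the data and are not copied from the corpus files. split_token is a hypothetical helper, and it assumes every token contains a slash.

```python
def split_token(token):
    """Split 'word/tag' on the last slash, then the tag on its first hyphen."""
    word, _, tag = token.rpartition('/')
    pos, _, parsing = tag.partition('-')
    return word, pos, parsing

# Invented sample line in word/tag format, one verse per line.
line = "ἐν/P ἀρχῇ/N-DSF"
tokens = [split_token(t) for t in line.split()]
```

A tag with no hyphen is treated as a bare part of speech, so the parsing field comes back as an empty string.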
Here follows an example of how to load this corpus into NLTK:
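from nltk.corpus.reader import CategorizedTaggedCorpusReader

def simplify_tag(tag):
    # Keep only the part of speech: the portion of the tag before the first hyphen.
    try:
        if '-' in tag:
            tag = tag.split('-')[0]
        return tag
    except TypeError:
        # Non-string tag (e.g. None): return it unchanged.
        return tag

lxx = CategorizedTaggedCorpusReader(
    'lxxmorph-corpus/',
    r'\d{2}\..*',  # raw string: file names start with two digits and a dot
    encoding=u'utf8',
    tag_mapping_function=simplify_tag,
    cat_file='cats.txt')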
Now through the lxx object you have access to:
- tagged words:
lxx.tagged_words()
- simplified tags:
lxx.tagged_words(simplify_tags=True)
- tagged sentences:
lxx.tagged_sents()
- textual categories:
lxx.words(categories='former-prophets')
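As a rough illustration of what the simplified tags are good for, the sketch below counts parts of speech over a list of (word, tag) pairs, which is the shape that tagged words come back in. The sample pairs and their tags are invented for this example and are not taken from the corpus. With the corpus loaded you would feed it the real tagged words instead.

```python
from collections import Counter

def simplify_tag(tag):
    # Keep only the part of speech (text before the first hyphen).
    return tag.split('-')[0]

# Invented (word, tag) pairs standing in for real corpus output.
tagged = [
    ("ἐν", "P"),
    ("ἀρχῇ", "N-DSF"),
    ("ἐποίησεν", "V-AAI3S"),
    ("θεὸς", "N-NSM"),
]

pos_counts = Counter(simplify_tag(tag) for _, tag in tagged)
```

Since pos_counts is a Counter, pos_counts.most_common() lists the parts of speech from most to least frequent.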
This is a derivative work of the original CATSS LXXMorph text, and so your use of it is subject to the terms of that license. See the README file for more details.