Better tokenization of the SBLGNT
In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elision marks as separate tokens. In my opinion, they should be grouped with the words to which they are attached. I decided today to look into this and fix the problem.
The SBLGNT uses the Unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpus is apparently the wordpunct_tokenize function. This uses the following regular expression for matching tokens:
\w+|[^\w\s]+
That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence made up of characters that are neither alphanumeric nor whitespace ([^\w\s]+) - e.g. punctuation. The problem is that Python's regular expressions do not treat 0x2019 as an alphanumeric character, so it gets tokenized on its own by the latter expression meant to catch punctuation.
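To see the problem in action, here is a quick sketch using a made-up elided phrase (assuming NLTK is installed; the sample text is mine, not drawn from the SBLGNT files):

# -*- coding: utf-8 -*-
from nltk.tokenize import wordpunct_tokenize

text = u'ἀλλ’ ἐγὼ λέγω'  # ἀλλ’ ends in the 0x2019 elision mark
print(wordpunct_tokenize(text))
# the elision mark is split off as its own token, roughly:
# ['ἀλλ', '’', 'ἐγὼ', 'λέγω']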
So I crafted a new regular expression to alter this behavior:
\w+\u2019?|[^\w\s\u2019]+
Now, at the end of each sequence of alphanumeric characters, there can optionally be a 0x2019 to catch elisions (I also explicitly exclude 0x2019 from the latter expression, though I am not entirely sure this is necessary). To actually use this:
tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')
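Applied to the same sample phrase as above (again just a sketch, assuming NLTK is installed):

# -*- coding: utf-8 -*-
import nltk

text = u'ἀλλ’ ἐγὼ λέγω'
tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')
print(tokens)
# the elision mark now stays attached to its word:
# ['ἀλλ’', 'ἐγὼ', 'λέγω']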
Using the custom regexp_tokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo with the name load-sblgnt.py. It should be run after the sblgnt-nltk.py script has run to download and prep the data. The load script provides an example workflow: get an NLTK text object and then run collocations() and generate(). A rough sketch of that kind of workflow is below. Enjoy!
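Something along these lines (the corpus directory, file pattern, and use of a corpus reader here are my assumptions; see load-sblgnt.py in the repo for how it is actually done):

# -*- coding: utf-8 -*-
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# plug the custom pattern into a corpus reader in place of wordpunct_tokenize
pattern = u'\w+\u2019?|[^\w\s\u2019]+'
corpus = PlaintextCorpusReader(
    'sblgnt',     # assumed directory of the prepped text files
    r'.*\.txt',   # assumed file name pattern
    word_tokenizer=nltk.tokenize.RegexpTokenizer(pattern),
)

text = nltk.Text(corpus.words())
text.collocations()  # frequent word pairs in the corpus
text.generate()      # random text in the style of the corpus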