Better tokenization of the SBLGNT
The Blog of Nathan D. Smith

In my previous post on this topic I mentioned that the default NLTK tokenizer was erroneously treating elisions as separate tokens. They should be grouped with the word to which they are attached in my opinion. I decided today to look into this and fix the problem.

The SBLGNT uses unicode character 0x2019 ("right single quotation mark") for elisions. The default tokenizer for the NLTK PlaintextCorpus is apparently the wordpunct\tokenize function. This uses the following regular expression for matching tokens: \w+|[^\w\s]+

That essentially means: match any sequence of alphanumeric characters (\w+), or (|) any sequence comprised of neither alphanumeric characters nor whitespace ([\^\w\s]+) - e.g. punctuation. The problem is that in Python's implementation of unicode, 0x2019 is not considered an alphanumeric character, so it is getting tokenized on its own by the latter expression meant to catch punctuation.

So I crafted a new regular expression to alter this behavior: \w+\u2019?|[^\w\s\u2019]+

So now for each sequence of alphanumeric characters, there can optionally be a 0x2019 at the end to catch elisions (I also explicitly exclude 0x2012 from the latter expression, though I am not entirely sure this is necessary). So now to actually use this:

tokens = nltk.tokenize.regexp.regexp_tokenize(text, u'\w+\u2019?|[^\w\s\u2019]+')

Using the custom regexptokenize function we can tokenize a text using any old regular expression our heart desires. I put a full example of this in the same repo with the name It should be run after the script has run to download and prep the data. The load script provides an example workflow for getting an NLTK text object and then running collocations() and generate() as an example. Enjoy!

Date: 2013-03-02 15:41