How to tweak the NLTK sentence tokenizer

You need to supply a set of abbreviations to the tokenizer (lowercase, without the trailing period), like so:

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Abbreviations are given in lowercase, without the trailing period
punkt_param = PunktParameters()
punkt_param.abbrev_types = {'dr', 'vs', 'mr', 'mrs', 'prof', 'inc'}

# Build a tokenizer that treats these abbreviations as non-terminal
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)

sentences is now:

['is THAT what you mean, Mrs. Hussey?']

Update: This does not work when the last word of a sentence has an apostrophe or a quotation mark attached to it (like Hussey?’). A quick-and-dirty workaround is to insert a space before any apostrophe or quotation mark that directly follows a sentence-end symbol (.!?). The line below covers straight double quotes only; other quote characters need their own replace() calls:

text = text.replace('?"', '? "').replace('!"', '! "').replace('."', '. "')
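
If you want to cover apostrophes and curly quotes as well, the same idea can be sketched with a single regular expression. The helper name pad_closing_quotes and the exact set of quote characters are my own assumptions for illustration, not part of NLTK; adjust the character class to whatever your text actually contains.

```python
import re

# Sentence-end punctuation followed directly by a closing quote or apostrophe.
# Covers straight quotes (" and ') and right curly quotes (\u201d and \u2019).
_CLOSING_QUOTE = re.compile(r"([.!?])([\"'\u2019\u201d])")

def pad_closing_quotes(text):
    """Insert a space between sentence-end punctuation and a closing quote.

    Hypothetical helper, not an NLTK function.
    """
    return _CLOSING_QUOTE.sub(r"\1 \2", text)

print(pad_closing_quotes('"Is THAT what you mean, Mrs. Hussey?" he asked.'))
```

Run the text through this helper before passing it to sentence_splitter.tokenize(). Like the replace() version, it changes the text itself, so strip the extra spaces again afterwards if you need the original spacing.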
