How to tweak the NLTK sentence tokenizer

You need to supply a list of abbreviations to the tokenizer, like so:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

punkt_param = PunktParameters()
punkt_param.abbrev_types = set(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc'])
sentence_splitter = PunktSentenceTokenizer(punkt_param)
text = "is THAT what you mean, Mrs. Hussey?"
sentences = sentence_splitter.tokenize(text)
```

`sentences` is now: `['is THAT what you mean, Mrs. Hussey?']`

Update: … Read more

NLTK download SSL: Certificate verify failed

TLDR: Here is a better solution: https://github.com/gunthercox/ChatterBot/issues/930#issuecomment-322111087

Note that when you run nltk.download(), a window pops up and lets you select which packages to download (the download does not start automatically).

To complement the accepted answer, the following is a complete list of directories that will be searched on Mac (not limited to … Read more
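The workaround in the linked comment is, in essence, to stop Python from verifying HTTPS certificates before calling the downloader. A minimal sketch of that approach (disabling certificate verification is a security trade-off, so treat it as a stop-gap):

```python
import ssl

import nltk

# If this Python build supports unverified HTTPS contexts, make them the
# default so nltk.download() can reach the server despite the failing
# certificate check. This disables verification: use only as a stop-gap.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass  # Very old Python versions don't verify certificates anyway.
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download()
```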

How to get the most informative features for a scikit-learn classifier for different classes?

In the case of binary classification, it seems like the coefficient array has been flattened. Let's try to relabel our data with only two labels:

```python
import codecs, re, time
from itertools import chain
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

trainfile = "train.txt"

# Vectorizing data.
train = []
word_vectorizer = CountVectorizer(analyzer="word")
```

… Read more
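To make the per-class ranking concrete, here is a small self-contained sketch (the toy corpus and variable names are illustrative, not from the original answer): with two classes, MultinomialNB's feature_log_prob_ has one row per class, and the difference between the rows scores how informative each word is for one class over the other.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus, purely for illustration.
docs = ["good great excellent", "bad awful terrible",
        "great fine good", "awful poor bad"]
labels = [1, 0, 1, 0]

word_vectorizer = CountVectorizer(analyzer="word")
X = word_vectorizer.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)

# feature_log_prob_ has shape (n_classes, n_features); with two classes,
# row 1 minus row 0 measures each word's pull toward class 1.
feature_names = word_vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
diff = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
for i in np.argsort(diff)[::-1][:3]:
    print(feature_names[i], diff[i])
```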

Getting 405 error while trying to download nltk data

This is caused by an outage of the GitHub raw file link. Meanwhile, a stop-gap solution would be to manually download the file:

```bash
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
```

We're working on finding an alternative to the data and model downloading. Meanwhile, @everyone please help to check your script(s) and make sure that … Read more
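After moving the data into place manually, it helps to confirm where NLTK actually looks for it. A quick check, using NLTK's real nltk.data.path attribute (the custom path below is the same placeholder as above, not a required location):

```python
import nltk

# nltk.data.path is the ordered list of directories NLTK searches for
# corpora and models; the manually unzipped data must live in one of them.
print(nltk.data.path)

# Optionally prepend a custom location, e.g. the $PATH_TO_NLTK_DATA above.
nltk.data.path.insert(0, "/home/username/nltk_data/")
```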

Fast n-gram calculation

Since you didn't indicate whether you want word- or character-level n-grams, I'm just going to assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. What you can easily do is write the n-gram extraction yourself.

```python
def ngrams(tokens, MIN_N, MAX_N):
    n_tokens = len(tokens)
    for i in
```

… Read more
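The excerpt cuts off mid-function. A minimal sketch of how such a generator plausibly continues, under the stated assumptions (this is a reconstruction, not the answer's verbatim code):

```python
def ngrams(tokens, min_n, max_n):
    """Yield every word-level n-gram of length min_n..max_n as a tuple."""
    n_tokens = len(tokens)
    for i in range(n_tokens):
        # j is the exclusive slice end; cap it at n_tokens so the shorter
        # windows near the end of the list are still emitted.
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            yield tuple(tokens[i:j])

# Usage:
print(list(ngrams("the quick brown fox".split(), 2, 3)))
# [('the', 'quick'), ('the', 'quick', 'brown'), ('quick', 'brown'), ...]
```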

Implementing Bag-of-Words Naive-Bayes classifier in NLTK

scikit-learn has an implementation of multinomial naive Bayes, which is the right variant of naive Bayes in this situation. A support vector machine (SVM) would probably work better, though. As Ken pointed out in the comments, NLTK has a nice wrapper for scikit-learn classifiers. Modified from the docs, here’s a somewhat complicated one that does … Read more
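Before the more complicated example, a minimal sketch of the wrapper in action (the toy feature sets are illustrative): NLTK's SklearnClassifier accepts standard NLTK (featureset, label) pairs and delegates to any scikit-learn estimator.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical bag-of-words featuresets in NLTK's dict-of-counts format.
train_data = [
    ({"love": 2, "great": 1}, "pos"),
    ({"hate": 1, "awful": 2}, "neg"),
]

# Wrap the scikit-learn estimator so it speaks NLTK's classifier API.
classifier = SklearnClassifier(MultinomialNB()).train(train_data)
print(classifier.classify({"love": 1, "great": 1}))  # -> 'pos'
```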

Topic distribution: How do we see which document belongs to which topic after doing LDA in Python?

Using the probabilities of the topics, you can try to set some threshold and use it as a clustering baseline, but I am sure there are better ways to do clustering than this 'hacky' method.

```python
from gensim import corpora, models, similarities
from itertools import chain

""" DEMO """
documents = ["Human machine interface for lab
```

… Read more
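A minimal, self-contained sketch of the thresholding idea (the tiny corpus and the 0.5 cutoff are illustrative choices, not from the original demo):

```python
from gensim import corpora, models

# Hypothetical tiny tokenized corpus, just to show the thresholding step.
texts = [["human", "machine", "interface"],
         ["graph", "trees", "minors"],
         ["graph", "minors", "survey"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)

# lda[bow] yields (topic_id, probability) pairs for one document; keeping
# only topics above a threshold gives a crude document -> topic assignment.
THRESHOLD = 0.5
for doc_id, bow in enumerate(corpus):
    topics = [topic for topic, prob in lda[bow] if prob > THRESHOLD]
    print(doc_id, topics)
```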