Scikit-learn – fit_transform on the test set

You should not call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization from the one used during training. For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (for example, by removing rare unigrams). UPDATE: If the only problem is fitting test data, …
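A minimal sketch of the fit-on-train, transform-on-test pattern described above, assuming scikit-learn is installed. The toy documents and the `min_df=2` threshold are illustrative choices, not values from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, used only for illustration.
train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
test_docs = ["the cat chased the dog", "an unseen zebra appeared"]

# min_df=2 drops terms that occur in fewer than 2 training documents,
# which is one way to shrink the vocabulary (and memory use).
vectorizer = TfidfVectorizer(min_df=2)

# Learn the vocabulary and IDF weights from the training data only.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary on the test data: transform, not fit_transform.
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns because the vocabulary comes from the training set alone; words seen only in the test documents are ignored.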

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

You have at least two options: Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: “very small”, “small”, “regular”, “big”, “very big” ensuring that each …
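The percentile-binning option can be sketched with NumPy as follows. The height values and the choice of the 20th/40th/60th/80th percentiles as boundaries are assumptions made for illustration; the original answer only names the five bin labels.

```python
import numpy as np

# Hypothetical heights in centimetres.
heights_cm = np.array([150, 155, 160, 165, 170, 172, 175, 180, 185, 195])
labels = ["very small", "small", "regular", "big", "very big"]

# Four interior boundaries split the data into five bins.
edges = np.percentile(heights_cm, [20, 40, 60, 80])

# np.digitize returns, for each value, the index of the bin it falls into (0..4).
bin_idx = np.digitize(heights_cm, edges)
binned = [labels[i] for i in bin_idx]

for height, label in zip(heights_cm, binned):
    print(height, label)
```

In practice the boundaries would be computed on the training data and then reused unchanged for any new data.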

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If you have your data in exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to “hack” your way through: 1. Put your corpus directory where nltk.data is saved. First check where your nltk.data is saved:
    >>> import nltk
    >>> nltk.data.find('corpora/movie_reviews')
    FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')
Then move your directory to where …

Cost function in logistic regression gives NaN as a result

There are two possible reasons why this may be happening to you. 1. The data is not normalized. When you apply the sigmoid (logistic) function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 - 1) or log(0) will produce -Inf. …
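The effect of unnormalized inputs can be reproduced with a small NumPy sketch. The data, the parameter vector `theta`, and the use of standardization (zero mean, unit variance) as the normalization are all assumptions made for illustration, not details from the original answer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Standard logistic regression cross-entropy cost.
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# Hypothetical data: an intercept column plus one large-valued feature.
X_raw = np.array([[1.0, 2000.0], [1.0, 3500.0], [1.0, 500.0], [1.0, 1200.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = np.array([0.0, 0.5])

# With raw features the sigmoid saturates to exactly 1.0, so log(1 - h)
# becomes log(0) = -inf, and 0 * -inf turns the cost into NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    raw_cost = cost(theta, X_raw, y)

# Standardize the feature column (leave the intercept column alone).
X_scaled = X_raw.copy()
col = X_raw[:, 1]
X_scaled[:, 1] = (col - col.mean()) / col.std()
scaled_cost = cost(theta, X_scaled, y)

print("raw:", raw_cost, "scaled:", scaled_cost)
```

With the scaled feature, the sigmoid outputs stay strictly between 0 and 1 for this `theta`, so both logarithms are finite.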