classification - w3toppers.com

Recognise an arbitrary date string [closed]

Use JChronic. Alternatively, you may want to use DateParser2 from the edu.mit.broad.genome.utils package.

scikit-learn .predict() default threshold

The threshold can be set using clf.predict_proba(), for example:

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(random_state=2)
    clf.fit(X_train, y_train)

    # y_pred = clf.predict(X_test)  # default threshold is 0.5
    y_pred = (clf.predict_proba(X_test)[:, 1] >= 0.3).astype(bool)  # set threshold as 0.3

Error in Confusion Matrix : the data and reference factors must have the same number of levels

Scikit learn – fit_transform on the test set

You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training. For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (for example, by removing rare unigrams). UPDATE: If the only problem is fitting test data, …
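The fit-on-train, transform-on-test pattern described above can be sketched as follows. This is a minimal illustration, not the original answer's code: the toy documents and the min_df=2 setting are assumptions chosen only to show rare terms being dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora, used only for illustration.
train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the bird sat on the branch",
]
test_docs = [
    "the hamster sat on the wheel",
    "fish are quiet pets",
]

# min_df=2 drops terms that occur in fewer than two training documents,
# which is one way to shrink the vocabulary (and the memory footprint).
vectorizer = TfidfVectorizer(min_df=2)

# Learn the vocabulary and IDF weights from the training data only.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary on the test data: transform, not fit_transform.
X_test = vectorizer.transform(test_docs)

# Both matrices share the same columns, so a model trained on X_train
# can be applied to X_test directly.
n_features_train = X_train.shape[1]
n_features_test = X_test.shape[1]
```

Words that appear only in the test set (such as "hamster" here) are simply ignored, because they have no column in the fitted vocabulary.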

Save Naive Bayes Trained Classifier in NLTK

To save:

    import pickle

    with open('my_classifier.pickle', 'wb') as f:
        pickle.dump(classifier, f)

To load later:

    import pickle

    with open('my_classifier.pickle', 'rb') as f:
        classifier = pickle.load(f)

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

You have at least two options: Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: “very small”, “small”, “regular”, “big”, “very big”, ensuring that each …
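The percentile binning in the first option might look like the sketch below. The height data is synthetic, and the choice of the 20th, 40th, 60th and 80th percentiles as boundaries is an assumption made for illustration, not something the excerpt specifies.

```python
import numpy as np

# Synthetic heights in centimetres (hypothetical data for illustration).
rng = np.random.default_rng(0)
heights_cm = rng.normal(loc=170.0, scale=10.0, size=1000)

labels = ["very small", "small", "regular", "big", "very big"]

# Assumed boundaries: four percentiles that split the data into five bins.
edges = np.percentile(heights_cm, [20, 40, 60, 80])

# np.digitize returns, for each value, the index (0..4) of the bin it falls in.
bin_index = np.digitize(heights_cm, edges)

# Categorical representation of the continuous variable.
height_category = [labels[i] for i in bin_index]

# Number of samples that ended up in each bin.
counts = np.bincount(bin_index, minlength=len(labels))
```

The resulting integer codes in bin_index can then be treated like any other categorical feature.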

Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If you have your data in exactly the same structure as the movie_reviews corpus in NLTK, there are two ways to “hack” your way through:

1. Put your corpus directory into the location where nltk_data is saved

First check where your nltk_data is saved:

    >>> import nltk
    >>> nltk.data.find('corpora/movie_reviews')
    FileSystemPathPointer(u'/home/alvas/nltk_data/corpora/movie_reviews')

Then move your directory to where …

10 fold cross-validation in one-against-all SVM (using LibSVM)

Mainly, there are two reasons we do cross-validation:

- as a testing method, which gives us a nearly unbiased estimate of the generalization power of our model (by avoiding overfitting)
- as a way of model selection (e.g., finding the best C and gamma parameters over the training data; see this post for an example)

For the …
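Both uses of cross-validation can be sketched in Python. Note the swap: this uses scikit-learn's SVC (which is built on LIBSVM) rather than the LibSVM interface the question asks about, and the iris dataset and the parameter grid are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Small multi-class dataset, used here purely for illustration.
X, y = load_iris(return_X_y=True)

# 10 stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# One-against-all wrapper around an RBF-kernel SVM.
ova_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))

# Use 1: cross-validation as a testing method (one accuracy per fold).
scores = cross_val_score(ova_svm, X, y, cv=cv)
mean_accuracy = scores.mean()

# Use 2: cross-validation for model selection over C and gamma.
# The "estimator__" prefix routes parameters to the wrapped SVC.
param_grid = {
    "estimator__C": [0.1, 1, 10],
    "estimator__gamma": [0.01, 0.1, 1],
}
search = GridSearchCV(ova_svm, param_grid, cv=cv)
search.fit(X, y)
best_params = search.best_params_
```

If the same data is used both to pick C and gamma and to report accuracy, the reported figure tends to be optimistic; keeping a separate held-out test set is one way to avoid that.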

Understanding concept of Gaussian Mixture Models

I think it would help if you first look at what a GMM represents. I’ll be using functions from the Statistics Toolbox, but you should be able to do the same using VLFeat. Let’s start with the case of a mixture of two 1-dimensional normal distributions. Each Gaussian is represented by a pair of …
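A mixture of two 1-dimensional normal distributions can be written down directly. The sketch below is in Python with NumPy rather than the Statistics Toolbox used in the answer, and it assumes the standard parametrization (a mean and a standard deviation per component, plus mixing weights that sum to 1); the specific numbers are made up for illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a 1-D normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of the component densities."""
    x = np.asarray(x, dtype=float)
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical parameters for the two components.
weights = [0.4, 0.6]   # mixing proportions, must sum to 1
means = [-2.0, 3.0]
stds = [1.0, 1.5]

# Evaluate the mixture density on a grid wide enough to cover both components.
grid = np.linspace(-15.0, 20.0, 20001)
density = mixture_pdf(grid, weights, means, stds)

# Riemann-sum check that the mixture is itself a valid density.
area = float(np.sum(density) * (grid[1] - grid[0]))
```

Because the weights sum to 1 and each component integrates to 1, the area under the mixture density is also approximately 1.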

Cost function in logistic regression gives NaN as a result

There are two possible reasons why this may be happening to you. The first is that the data is not normalized. When you apply the sigmoid (logistic) function to your hypothesis, the output probabilities are almost all approximately 0 or 1, and with your cost function, log(1 – 1) or log(0) will produce -Inf. …
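The effect of unnormalized data on the cost can be reproduced with a few lines of NumPy. This is a minimal sketch under stated assumptions: the two-sample dataset, the parameter vector theta, and the use of z-score standardization as the normalization step are all chosen for illustration and are not taken from the original answer.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Mean cross-entropy cost for logistic regression."""
    h = sigmoid(X @ theta)
    # Silence the warnings so the NaN shows up in the return value instead.
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h)))

# Hypothetical data: one feature with very large raw values.
feature = np.array([1000.0, 2000.0])
y = np.array([1.0, 1.0])
theta = np.array([0.0, 1.0])  # [intercept, slope], fixed for the demo

# Raw feature: sigmoid saturates to exactly 1.0, so log(1 - 1) = -inf,
# and 0 * -inf evaluates to NaN.
X_raw = np.column_stack([np.ones_like(feature), feature])
raw_cost = logistic_cost(theta, X_raw, y)

# Standardized feature (zero mean, unit variance): sigmoid stays away from 0 and 1.
feature_std = (feature - feature.mean()) / feature.std()
X_std = np.column_stack([np.ones_like(feature_std), feature_std])
scaled_cost = logistic_cost(theta, X_std, y)
```

With the raw feature, raw_cost is NaN; after standardization, scaled_cost is a finite number.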