Scikit-learn – fit_transform on the test set

You should not call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training. For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams etc.). UPDATE: If the only problem is fitting test data, … Read more
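A minimal sketch of the fit-on-train, transform-on-test pattern described above. The toy documents and the `max_features` value are assumptions made for illustration; they are not from the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical data, for illustration only).
train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play",
]
test_docs = ["the cat and the dog", "an unseen zebra"]

# max_features caps the vocabulary size; min_df can drop rare terms.
vectorizer = TfidfVectorizer(min_df=1, max_features=1000)

# Learn the vocabulary and IDF weights from the training data only.
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary on the test data; words that were not
# seen during training are ignored.
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so a model trained on `X_train` can score `X_test` directly.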

How are feature_importances in RandomForestClassifier determined?

There are indeed several ways to get feature “importances”. As often, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
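As a rough illustration of the impurity-based importances mentioned above, the sketch below fits a forest on synthetic data and reads `feature_importances_`. The dataset and hyperparameters are assumptions chosen for the example. The comparison with the per-tree average reflects how current scikit-learn versions compute the forest-level value, as far as I know.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (hypothetical, for illustration only).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One impurity-based importance per feature, normalized to sum to 1.
importances = forest.feature_importances_

# Average of the per-tree importances, for comparison.
per_tree_mean = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```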

Why is Random Forest with a single tree much better than a Decision Tree classifier?

Isn’t a random forest with one estimator just a decision tree? Well, this is a good question, and the answer turns out to be no; the Random Forest algorithm is more than a simple bag of individually-grown decision trees. Apart from the randomness induced by ensembling many trees, the Random Forest (RF) algorithm also … Read more
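One way to look at the question in scikit-learn is to compare the default parameters of the two estimators. The sketch below only inspects defaults; it does not reproduce the truncated answer, and the exact default values depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

forest = RandomForestClassifier(n_estimators=1, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

forest_params = forest.get_params()
tree_params = tree.get_params()

# By default the forest trains each tree on a bootstrap sample of the rows.
print("forest bootstrap:", forest_params["bootstrap"])

# By default the forest considers only a subset of the features at each
# split, while a plain decision tree considers all of them (None).
print("forest max_features:", forest_params["max_features"])
print("tree max_features:", tree_params["max_features"])
```

So even with `n_estimators=1`, the single tree inside the forest is grown under different conditions than a standalone `DecisionTreeClassifier` unless these parameters are changed.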

Predict classes or class probabilities?

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper “Hard or Soft Classification? Large-margin Unified Machines”: Margin-based classifiers have been popular in both machine learning and statistics for classification problems. … Read more
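In scikit-learn terms, the hard/soft distinction roughly corresponds to `predict` versus `predict_proba`. The sketch below uses a logistic regression on synthetic data; the dataset and the 0.8 threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical, for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Hard classification: one class label per sample.
hard = clf.predict(X[:5])

# Soft classification: one probability per class per sample.
soft = clf.predict_proba(X[:5])

# For this model, the hard label is the class with the highest probability.
from_soft = clf.classes_[soft.argmax(axis=1)]

# With probabilities available, a different decision threshold can be used.
custom = (soft[:, 1] >= 0.8).astype(int)

print(hard, from_soft, custom)
```

Keeping the probabilities leaves the choice of threshold to the application, which a hard classifier does not allow.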

How to extract the decision rules from scikit-learn decision-tree?

I believe that this answer is more correct than the other answers here:

    from sklearn.tree import _tree

    def tree_to_code(tree, feature_names):
        tree_ = tree.tree_
        feature_name = [
            feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
            for i in tree_.feature
        ]
        print("def tree({}):".format(", ".join(feature_names)))

        def recurse(node, depth):
            indent = " " * depth
            if tree_.feature[node] != _tree.TREE_UNDEFINED:

… Read more
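As an alternative to the hand-written traversal excerpted above, scikit-learn (0.21 and later) ships `sklearn.tree.export_text`, which prints the rules of a fitted tree as indented text. This is a different technique from the answer's `tree_to_code` function, shown here as a sketch on the iris dataset with an assumed `max_depth` of 2.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the decision rules as a plain string.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the output is a split condition or a leaf with its predicted class, indented by depth.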