random-forest
Scikit learn – fit_transform on the test set
You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization from the one used during training. For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams, etc.). UPDATE If the only problem is fitting test data, … Read more
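A minimal sketch of the fit-on-train, transform-on-test pattern described above (the toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog ran", "cats and dogs"]
test_docs = ["the cat ran"]

vectorizer = TfidfVectorizer(min_df=1)  # raise min_df to drop rare terms and shrink the vocabulary

X_train = vectorizer.fit_transform(train_docs)  # learn the vocabulary on training data only
X_test = vectorizer.transform(test_docs)        # reuse that same vocabulary on test data

# Same number of columns, so a model trained on X_train accepts X_test
assert X_train.shape[1] == X_test.shape[1]
```

Calling fit_transform on the test set instead would rebuild the vocabulary from the test documents, producing a matrix whose columns no longer line up with the training features.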
Numpy Array Get row index searching by a row
Why not simply do something like this?

```python
>>> a
array([[ 0.,  5.,  2.],
       [ 0.,  0.,  3.],
       [ 0.,  0.,  0.]])
>>> b
array([ 0.,  0.,  3.])
>>> a == b
array([[ True, False, False],
       [ True,  True,  True],
       [ True,  True, False]], dtype=bool)
>>> np.all(a == b, axis=1)
array([False,  True, False], dtype=bool)
>>> np.where(np.all(a == b, axis=1))
(array([1]),)
```
Got continuous is not supported error in RandomForestRegressor
It’s because accuracy_score is for classification tasks only. For regression you should use a different metric, for example: clf.score(X_test, y_test), where X_test contains the samples and y_test the corresponding ground-truth values. It computes the predictions internally.
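A short sketch of scoring a regressor this way, on synthetic data (the dataset and hyperparameters here are arbitrary, chosen just for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

r2 = clf.score(X_test, y_test)  # R^2: the default score for regressors; predicts internally
mse = mean_squared_error(y_test, clf.predict(X_test))  # an alternative regression metric
```

Unlike accuracy_score, both of these accept continuous targets; R^2 tops out at 1.0 and can go negative for models worse than predicting the mean.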
How are feature_importances in RandomForestClassifier determined?
There are indeed several ways to get feature “importances”. As is often the case, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
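In scikit-learn these mean-decrease-impurity values are exposed on the fitted model as feature_importances_; a minimal sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean-decrease-impurity importances, averaged over trees and normalized to sum to 1
importances = forest.feature_importances_
assert abs(importances.sum() - 1.0) < 1e-9
```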
Why is Random Forest with a single tree much better than a Decision Tree classifier?
Isn’t a random forest with a single estimator just a decision tree? Well, this is a good question, and the answer turns out to be no; the Random Forest algorithm is more than a simple bag of individually-grown decision trees. Apart from the randomness induced by ensembling many trees, the Random Forest (RF) algorithm also … Read more
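One way to see this concretely: the extra randomness comes from bootstrap sampling of rows and from random feature subsets at each split. A sketch that switches both off, at which point a single-tree forest should behave like a plain tree (iris is used here just as a convenient dataset):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Disable both extra sources of randomness: bootstrap row sampling
# and random feature subsets (max_features) at each split
rf_one = RandomForestClassifier(
    n_estimators=1, bootstrap=False, max_features=None, random_state=0
).fit(X, y)

# With those options off, the single-tree forest matches the plain tree
assert (tree.predict(X) == rf_one.predict(X)).all()
```

With the defaults (bootstrap=True, max_features="sqrt"), the single tree inside the forest is grown differently, which is why RandomForestClassifier(n_estimators=1) is not equivalent to DecisionTreeClassifier out of the box.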
Predict classes or class probabilities?
In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines: Margin-based classifiers have been popular in both machine learning and statistics for classification problems. … Read more
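In scikit-learn the two approaches correspond to predict (hard) and predict_proba (soft); a minimal sketch of the relationship between them, using iris purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

hard = clf.predict(X[:3])        # class labels
soft = clf.predict_proba(X[:3])  # per-class probabilities; each row sums to 1

# The hard prediction is the argmax of the soft probabilities
assert (clf.classes_[soft.argmax(axis=1)] == hard).all()
```

The soft output carries more information (how confident the model is), which matters when downstream decisions use a threshold other than the default argmax.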
How to extract the decision rules from scikit-learn decision-tree?
I believe that this answer is more correct than the other answers here:

```python
from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
```
… Read more
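As an aside, recent scikit-learn versions (0.21+) ship a built-in helper, sklearn.tree.export_text, that dumps the decision rules without touching the private _tree module; a short sketch (the dataset and depth are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Textual dump of the learned decision rules, one threshold comparison per line
rules = export_text(tree, feature_names=["sepal_l", "sepal_w", "petal_l", "petal_w"])
print(rules)
```

This trades the executable-Python output of tree_to_code for an indented rule listing, which is often all that is needed.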