Accuracy Score ValueError: Can’t Handle mix of binary and continuous target

Despite the plethora of wrong answers here that attempt to circumvent the error by numerically manipulating the predictions, the root cause of your error is theoretical, not computational: you are trying to use a classification metric (accuracy) with a regression (i.e. numeric prediction) model (LinearRegression), which is meaningless. Just like the majority … Read more
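A minimal sketch of the point above (the toy data is made up): `accuracy_score` expects class labels, so a `LinearRegression`'s continuous output raises the ValueError, while a regression metric such as `r2_score` — or an actual classifier — works.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() > 10).astype(int)          # binary target

reg = LinearRegression().fit(X, y)
# accuracy_score(y, reg.predict(X))       # ValueError: continuous predictions
print(r2_score(y, reg.predict(X)))        # regression metric: fine

clf = LogisticRegression().fit(X, y)
print(accuracy_score(y, clf.predict(X)))  # classification metric on a classifier
```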

Save classifier to disk in scikit-learn

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

```python
import cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
```

Edit: if you are using a sklearn Pipeline in which you have custom … Read more
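The same idea on Python 3, where `cPickle` was merged into `pickle` (the tiny training set here is made up for illustration; `joblib.dump` is the alternative scikit-learn suggests for large models):

```python
import pickle
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit([[0.0], [1.0]], [0, 1])

with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)

print(gnb_loaded.predict([[0.9]]))  # same predictions as the original model
```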

scikit-learn & statsmodels – which R-squared is correct?

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don’t. Our best friend is always the relevant documentation, combined with simple experiments. So… Although scikit-learn’s LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is … Read more
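A sketch of the mechanism using sklearn alone (the synthetic data is made up): `fit_intercept=False` behaves like a statsmodels OLS fitted without `sm.add_constant(X)`, and the two R-squared values stop matching — once both models include an intercept they agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = 3.0 * X.ravel() + 2.0 + 0.1 * rng.randn(50)   # nonzero true intercept

r2_with = LinearRegression(fit_intercept=True).fit(X, y).score(X, y)
r2_without = LinearRegression(fit_intercept=False).fit(X, y).score(X, y)

print(r2_with > r2_without)  # True: dropping the intercept hurts the fit
```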

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’)

This might happen inside scikit-learn, and it depends on what you’re doing. I recommend reading the documentation for the functions you’re using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criterion.

EDIT: How could I miss that:

```python
np.isnan(mat.any())    # and gets False
np.isfinite(mat.all())  # and gets True
```

… Read more
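The checks that snippet was aiming for reduce *after* testing each element, not before — `mat.any()` collapses the matrix to a single bool first. A small sketch on a made-up matrix:

```python
import numpy as np

mat = np.array([[1.0, 2.0],
                [np.nan, np.inf]])

print(np.isnan(mat).any())     # True: at least one NaN present
print(np.isfinite(mat).all())  # False: inf is not finite

# Drop the offending rows before handing the data to scikit-learn:
clean = mat[np.isfinite(mat).all(axis=1)]
print(clean)                   # [[1. 2.]]
```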

How to one-hot-encode from a pandas column containing a list?

We can also use sklearn.preprocessing.MultiLabelBinarizer. Often we want to use a sparse DataFrame for real-world data in order to save a lot of RAM.

Sparse solution (for Pandas v0.25.0+):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))
```

result:

```
In [38]: df
Out[38]:
  Col1 Col2  Apple  Banana  Grape  Orange
```

… Read more
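A dense sketch of the same approach on a made-up frame, for older pandas versions where the sparse accessor isn't available:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'Col1': [1, 2],
                   'Col2': ['x', 'y'],
                   'Col3': [['Apple', 'Banana'], ['Grape']]})

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          index=df.index,
                          columns=mlb.classes_))
print(df)  # one 0/1 indicator column per distinct list element
```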

How to extract the decision rules from scikit-learn decision-tree?

I believe that this answer is more correct than the other answers here:

```python
from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
```

… Read more
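If you only need the rules as readable text, scikit-learn 0.21+ ships `sklearn.tree.export_text`, which avoids touching the private `_tree` API entirely (the iris model below is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One indented line per split, plus the predicted class at each leaf
print(export_text(clf, feature_names=list(iris.feature_names)))
```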

Label encoding across multiple columns in scikit-learn

You can easily do this with

```python
df.apply(LabelEncoder().fit_transform)
```

EDIT2: Since scikit-learn 0.20, the recommended way is

```python
OneHotEncoder().fit_transform(df)
```

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.

EDIT: Since this original answer is over a year old and has generated many upvotes (including a bounty), I should probably extend … Read more
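Both routes side by side on a made-up frame — `LabelEncoder` applied per column gives integer codes, while `OneHotEncoder` (string input works since 0.20) gives indicator columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple'],
                   'size':  ['small', 'large', 'large']})

labels = df.apply(LabelEncoder().fit_transform)
print(labels['fruit'].tolist())   # [0, 1, 0]: integer codes per column

onehot = OneHotEncoder().fit_transform(df).toarray()
print(onehot.shape)               # (3, 4): one column per category
```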

How to split data into 3 sets (train, validation and test)?

Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts: 60% train set, 20% validation set, 20% test set.

```
In [305]: train, validate, test = \
              np.split(df.sample(frac=1, random_state=42),
                       [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919
```

… Read more
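The same 60/20/20 split on a small made-up frame, so the sizes are easy to verify — `np.split` cuts at the 60% and 80% marks:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})

train, validate, test = np.split(df.sample(frac=1, random_state=42),
                                 [int(0.6 * len(df)), int(0.8 * len(df))])

print(len(train), len(validate), len(test))  # 6 2 2
```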