Accuracy Score ValueError: Can’t Handle mix of binary and continuous target

Despite the plethora of wrong answers here that attempt to circumvent the error by numerically manipulating the predictions, the root cause of your error is theoretical, not computational: you are trying to use a classification metric (accuracy) with a regression (i.e. numeric prediction) model (LinearRegression), which is meaningless. Just like the majority … Read more
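A minimal sketch of the point above (the toy data is made up): `accuracy_score` expects class labels, so a `LinearRegression`'s continuous output raises the ValueError, while a regression metric such as `r2_score` — or an actual classifier — works.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() > 10).astype(int)          # binary target

reg = LinearRegression().fit(X, y)
# accuracy_score(y, reg.predict(X))       # ValueError: continuous predictions
print(r2_score(y, reg.predict(X)))        # regression metric: fine

clf = LogisticRegression().fit(X, y)
print(accuracy_score(y, clf.predict(X)))  # classification metric on a classifier
```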

Save classifier to disk in scikit-learn

Classifiers are just objects that can be pickled and dumped like any other. To continue your example:

```python
import cPickle

# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    cPickle.dump(gnb, fid)

# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = cPickle.load(fid)
```

Edit: if you are using a sklearn Pipeline in which you have custom … Read more
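The same idea on Python 3, where `cPickle` was merged into `pickle` (the tiny training set here is made up for illustration; `joblib.dump` is the alternative scikit-learn suggests for large models):

```python
import pickle
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit([[0.0], [1.0]], [0, 1])

with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)

with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)

print(gnb_loaded.predict([[0.9]]))  # same predictions as the original model
```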

scikit-learn & statsmodels – which R-squared is correct?

Arguably, the real challenge in such cases is to be sure that you compare apples to apples. And in your case, it seems that you don’t. Our best friend is always the relevant documentation, combined with simple experiments. So… Although scikit-learn’s LinearRegression() (i.e. your 1st R-squared) is fitted by default with fit_intercept=True (docs), this is … Read more
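A sketch of the mechanism using sklearn alone (the synthetic data is made up): `fit_intercept=False` behaves like a statsmodels OLS fitted without `sm.add_constant(X)`, and the two R-squared values stop matching — once both models include an intercept they agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 1)
y = 3.0 * X.ravel() + 2.0 + 0.1 * rng.randn(50)   # nonzero true intercept

r2_with = LinearRegression(fit_intercept=True).fit(X, y).score(X, y)
r2_without = LinearRegression(fit_intercept=False).fit(X, y).score(X, y)

print(r2_with > r2_without)  # True: dropping the intercept hurts the fit
```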

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’)

This might happen inside scikit-learn, and it depends on what you’re doing. I recommend reading the documentation for the functions you’re using. You might be using one which depends e.g. on your matrix being positive definite and not fulfilling that criterion.

EDIT: How could I miss that:

```python
np.isnan(mat.any())    # and gets False
np.isfinite(mat.all())  # and gets True
```

… Read more
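The checks that snippet was aiming for reduce *after* testing each element, not before — `mat.any()` collapses the matrix to a single bool first. A small sketch on a made-up matrix:

```python
import numpy as np

mat = np.array([[1.0, 2.0],
                [np.nan, np.inf]])

print(np.isnan(mat).any())     # True: at least one NaN present
print(np.isfinite(mat).all())  # False: inf is not finite

# Drop the offending rows before handing the data to scikit-learn:
clean = mat[np.isfinite(mat).all(axis=1)]
print(clean)                   # [[1. 2.]]
```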

How to one-hot-encode from a pandas column containing a list?

We can also use sklearn.preprocessing.MultiLabelBinarizer. Often we want to use a sparse DataFrame for real-world data in order to save a lot of RAM.

Sparse solution (for Pandas v0.25.0+):

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))
```

result:

```
In [38]: df
Out[38]:
  Col1 Col2  Apple  Banana  Grape  Orange
```

… Read more
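A dense sketch of the same approach on a made-up frame, for older pandas versions where the sparse accessor isn't available:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'Col1': [1, 2],
                   'Col2': ['x', 'y'],
                   'Col3': [['Apple', 'Banana'], ['Grape']]})

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          index=df.index,
                          columns=mlb.classes_))
print(df)  # one 0/1 indicator column per distinct list element
```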

How to extract the decision rules from scikit-learn decision-tree?

I believe that this answer is more correct than the other answers here:

```python
from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
```

… Read more
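If you only need the rules as readable text, scikit-learn 0.21+ ships `sklearn.tree.export_text`, which avoids touching the private `_tree` API entirely (the iris model below is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# One indented line per split, plus the predicted class at each leaf
print(export_text(clf, feature_names=list(iris.feature_names)))
```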

Label encoding across multiple columns in scikit-learn

You can easily do this with

```python
df.apply(LabelEncoder().fit_transform)
```

EDIT2: Since scikit-learn 0.20, the recommended way is

```python
OneHotEncoder().fit_transform(df)
```

as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.

EDIT: Since this original answer is over a year old and has generated many upvotes (including a bounty), I should probably extend … Read more
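Both routes side by side on a made-up frame — `LabelEncoder` applied per column gives integer codes, while `OneHotEncoder` (string input works since 0.20) gives indicator columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'fruit': ['apple', 'orange', 'apple'],
                   'size':  ['small', 'large', 'large']})

labels = df.apply(LabelEncoder().fit_transform)
print(labels['fruit'].tolist())   # [0, 1, 0]: integer codes per column

onehot = OneHotEncoder().fit_transform(df).toarray()
print(onehot.shape)               # (3, 4): one column per category
```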

How to split data into 3 sets (train, validation and test)?

Numpy solution. We will shuffle the whole dataset first (df.sample(frac=1, random_state=42)) and then split our data set into the following parts: 60% train set, 20% validation set, 20% test set.

```
In [305]: train, validate, test = \
              np.split(df.sample(frac=1, random_state=42),
                       [int(.6*len(df)), int(.8*len(df))])

In [306]: train
Out[306]:
          A         B         C         D         E
0  0.046919
```

… Read more
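The same 60/20/20 split on a small made-up frame, so the sizes are easy to verify — `np.split` cuts at the 60% and 80% marks:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(10)})

train, validate, test = np.split(df.sample(frac=1, random_state=42),
                                 [int(0.6 * len(df)), int(0.8 * len(df))])

print(len(train), len(validate), len(test))  # 6 2 2
```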