Passing categorical data to Sklearn Decision Tree

(This is just a reformat of my comment above from 2016…it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data – see issue #5442.

The recommended approach, label encoding, converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: the tree can only split on thresholds over an arbitrary integer ordering, so you'll end up with splits that do not make sense.
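To make the problem concrete, here is a minimal sketch on made-up data (the colour feature and the target are my own toy assumptions, not from the question). Only the middle category in alphabetical order is positive, so no single threshold on the integer codes can isolate it:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy, non-ordinal feature (hypothetical data): only "green" is positive.
colors = np.array(["blue", "green", "red"] * 10)
y = (colors == "green").astype(int)

# LabelEncoder sorts the classes, so blue=0, green=1, red=2.
codes = LabelEncoder().fit_transform(colors).reshape(-1, 1)

# A single split is a threshold on the code, e.g. code <= 0.5 or code <= 1.5.
# Neither threshold separates {green} from {blue, red}.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(codes, y)
label_acc = stump.score(codes, y)
print(label_acc)  # below 1.0: one split cannot isolate the middle code
```

A deeper tree can recover by stacking two thresholds, but it is spending splits to work around an ordering that means nothing.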

Using a OneHotEncoder is currently the only valid way. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive.
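For illustration, a minimal sketch of the one-hot route, again on a made-up colour feature where only "green" is positive (the data and variable names are my own assumptions). Each category gets its own 0/1 column, so a single split can pick out any one category:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy, non-ordinal feature (hypothetical data): only "green" is positive.
X = np.array(["blue", "green", "red"] * 10).reshape(-1, 1)
y = (X[:, 0] == "green").astype(int)

# One indicator column per category; categories unseen at fit time
# are encoded as all zeros instead of raising an error.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(max_depth=1, random_state=0),
)
model.fit(X, y)

onehot_acc = model.score(X, y)
preds = model.predict(np.array([["green"], ["red"]]))
print(onehot_acc, preds)
```

The cost is one extra column per category, which is where the expense comes from when a feature has many distinct values.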
