Passing categorical data to Sklearn Decision Tree

(This is just a reformat of my comment above from 2016…it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data – see issue #5442.

The recommended approach of using Label Encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: the tree can only split on thresholds over those arbitrary integer codes, so you'll end up with splits that do not make sense.
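To make the problem concrete, here is a minimal sketch on made-up data (the `colors` column and the "is it green?" target are hypothetical, chosen only for illustration). LabelEncoder assigns codes in sorted order, so "green" lands between "blue" and "red", and a single threshold split cannot separate it from both neighbours:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: a nominal feature with no natural order.
colors = np.array(["blue", "green", "red"] * 4)
y = (colors == "green").astype(int)  # target: is the colour green?

# LabelEncoder sorts the classes, so blue=0, green=1, red=2.
codes = LabelEncoder().fit_transform(colors)
X = codes.reshape(-1, 1)

# A depth-1 tree gets one threshold on the integer code, which always
# lumps green together with either blue or red.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
stump_acc = stump.score(X, y)

# A second level is needed just to undo the arbitrary ordering.
deeper = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
deeper_acc = deeper.score(X, y)

print(stump_acc, deeper_acc)
```

On this toy set the depth-1 tree cannot reach perfect training accuracy while the depth-2 tree can, purely because of where the encoder happened to place "green" in the ordering.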

Using a OneHotEncoder is currently the only valid way. It allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive, since every category becomes its own column.
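A minimal sketch of the one-hot route, again on made-up data (the `colors` column and the "is it green?" target are hypothetical). Wrapping the encoder and the tree in a pipeline is just one reasonable way to wire them together, not the only one:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one nominal feature, shaped (n_samples, 1).
colors = np.array(["blue", "green", "red"] * 4).reshape(-1, 1)
y = (colors.ravel() == "green").astype(int)  # target: is the colour green?

# Each category becomes its own 0/1 column, so the tree can split on
# "is green" directly, regardless of any label ordering.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    DecisionTreeClassifier(max_depth=1, random_state=0),
)
model.fit(colors, y)
onehot_acc = model.score(colors, y)

# The cost: one column per category (3 here, far more for
# high-cardinality features).
encoder = model.named_steps["onehotencoder"]
n_columns = encoder.transform(colors).shape[1]

print(onehot_acc, n_columns)
```

Here a single split is enough to isolate one category, at the price of widening the feature matrix to one column per category.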
