(This is just a reformat of my comment above from 2016…it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data – see issue #5442.
The approach recommended there, label encoding, converts the categories to integers, which DecisionTreeClassifier()
will then treat as numeric. If your categorical data is not ordinal, this is a problem – you'll end up with splits that compare arbitrary integer codes and therefore do not make sense.
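To illustrate the pitfall (the category values here are made up): LabelEncoder assigns codes in sorted order, so the resulting integers encode alphabetical order, not any meaningful one.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Codes are assigned alphabetically: high=0, low=1, medium=2
codes = le.fit_transform(["low", "medium", "high"])
print(list(codes))  # [1, 2, 0]
```

A tree split such as `code <= 1.5` would then group {high, low} against {medium}, which is meaningless for nominal data.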
Using a OneHotEncoder
is currently the only valid way: it allows arbitrary splits that do not depend on the label ordering, but it is computationally expensive, since each category becomes its own column.