Label encoding across multiple columns in scikit-learn

You can easily do this, though:

df.apply(LabelEncoder().fit_transform)
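For example, a minimal self-contained sketch with the needed imports and a small hypothetical DataFrame (the column names fruit and weather are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'apple', 'pear'],
    'weather': ['hot', 'cold', 'cold', 'hot'],
})

# apply calls fit_transform once per column, so each column
# gets its own integer codes (classes are sorted alphabetically)
encoded = df.apply(LabelEncoder().fit_transform)
print(encoded)
#    fruit  weather
# 0      0        1
# 1      1        0
# 2      0        0
# 3      2        1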

EDIT2:

In scikit-learn 0.20, the recommended way is

from sklearn.preprocessing import OneHotEncoder
OneHotEncoder().fit_transform(df)

as the OneHotEncoder now supports string input.
Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
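
A minimal sketch of that combination, assuming the same hypothetical df and column names as above (only the listed columns are one-hot encoded, the rest are passed through unchanged):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode only the two categorical columns; keep any other columns as-is
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['fruit', 'weather'])],
    remainder='passthrough',
)
encoded = ct.fit_transform(df)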

EDIT:

Since this original answer is over a year old and has generated many upvotes (including a bounty), I should probably extend it further.

For inverse_transform and transform, you have to do a little bit of a hack.

from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

d = defaultdict(LabelEncoder)

With this, you retain one LabelEncoder per column, keyed by column name in the dictionary.

# Encoding the variable
fit = df.apply(lambda x: d[x.name].fit_transform(x))

# Inverse the encoded
fit.apply(lambda x: d[x.name].inverse_transform(x))

# Using the dictionary to label future data
df.apply(lambda x: d[x.name].transform(x))
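
Because d keeps one encoder per column name, you can also inspect a single column's mapping or encode new data later; a short sketch, assuming the hypothetical df from above has already been encoded:

# Classes learned for one column, in the order of their integer codes
d['fruit'].classes_    # array(['apple', 'orange', 'pear'], dtype=object)

# Encoding previously unseen rows with the stored encoders;
# transform raises a ValueError for labels not seen during fit
new_df = pd.DataFrame({'fruit': ['pear'], 'weather': ['hot']})
new_df.apply(lambda x: d[x.name].transform(x))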

MOAR EDIT:

Using Neuraxle's FlattenForEach step, it is also possible to use the same LabelEncoder on all of the flattened data at once:

FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)

To use separate LabelEncoders for different columns, or if only some of your columns need to be label-encoded and not others, a ColumnTransformer is a solution that gives you more control over your column selection and your encoder instances.
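
Note that for feature columns, scikit-learn's OrdinalEncoder is the column-wise counterpart of LabelEncoder and slots directly into a ColumnTransformer; a minimal sketch, again assuming the hypothetical column names from above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Integer-encode only the listed columns; pass all other columns through untouched
ct = ColumnTransformer(
    [('ordinal', OrdinalEncoder(), ['fruit', 'weather'])],
    remainder='passthrough',
)
encoded = ct.fit_transform(df)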
