How to one-hot-encode from a pandas column containing a list?

We can also use sklearn.preprocessing.MultiLabelBinarizer:

Often we want to use sparse DataFrame for the real world data in order to save a lot of RAM.

Sparse solution (for Pandas v0.25.0+)

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

df = df.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(df.pop('Col3')),
                index=df.index,
                columns=mlb.classes_))

result:

In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    #  <--- NOTE!
Banana     16    #  <--- NOTE!
Grape       8    #  <--- NOTE!
Orange      8    #  <--- NOTE!
dtype: int64

Dense solution

mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))

Result:

In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

Leave a Comment