Reconstruct a categorical variable from dummies in pandas

It’s been a few years, so this may not have been in the pandas toolkit when the question was originally asked, but this approach seems a little easier to me. idxmax returns the label of the largest element, i.e. the one holding the 1 (on ties, it returns the first occurrence). We pass axis=1 because we want, for each row, the name of the column where the 1 occurs.

EDIT: I didn’t bother making the result categorical instead of plain strings, but you can do that the same way @Jeff did, by wrapping it in pd.Categorical (and pd.Series, if desired).

In [1]: import pandas as pd

In [2]: s = pd.Series(['a', 'b', 'a', 'c'])

In [3]: s
Out[3]: 
0    a
1    b
2    a
3    c
dtype: object

In [4]: dummies = pd.get_dummies(s)

In [5]: dummies
Out[5]: 
   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1

In [6]: s2 = dummies.idxmax(axis=1)

In [7]: s2
Out[7]: 
0    a
1    b
2    a
3    c
dtype: object

In [8]: (s2 == s).all()
Out[8]: True
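
If you do want a categorical result rather than strings, here is a minimal sketch of the wrapping mentioned above. Passing the dummy column names as the categories and using dtype=int are my own choices for illustration, not something the original session did.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# dtype=int is an assumption made here so the dummies are 0/1 integers
# regardless of the pandas version's default dummy dtype.
dummies = pd.get_dummies(s, dtype=int)

# Column label holding the 1 in each row
labels = dummies.idxmax(axis=1)

# Wrap in pd.Categorical, using the dummy columns as the category set,
# then in pd.Series to keep the original row index.
s_cat = pd.Series(
    pd.Categorical(labels, categories=dummies.columns),
    index=dummies.index,
)
```

Using the dummy columns as categories keeps any level that happens to have no rows, which idxmax alone would not tell you about.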

EDIT in response to @piRSquared’s comment:
This solution does indeed assume there’s exactly one 1 per row, which I think is the usual format. pd.get_dummies can return all-zero rows if you pass drop_first=True, or if there are NaN values and dummy_na=False (the default). (Any cases I’m missing?) Because idxmax returns the first column on ties, an all-zero row is treated as if it were an instance of the variable named in the first column (e.g. a in the example above).
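
A small sketch of that failure mode, with one possible guard. The hand-built frame and the names has_one and guarded are hypothetical, just for illustration; masking all-zero rows to NaN is one option among several.

```python
import pandas as pd

# Hand-built dummies where the last row contains no 1 at all
dummies = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [0, 0, 0]})

# Every value in the last row ties at 0, so idxmax falls back to the
# first column and silently reports 'a'.
naive = dummies.idxmax(axis=1)

# Hypothetical guard: keep the label only where the row really has a 1,
# and leave NaN elsewhere.
has_one = dummies.eq(1).any(axis=1)
guarded = naive.where(has_one)
```

Here naive ends in 'a' for the all-zero row, while guarded leaves that row as NaN so the problem stays visible.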

If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the “first” variable was, so that operation isn’t invertible unless you keep extra information around; I’d recommend leaving drop_first=False (default).
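
If you are stuck with drop_first=True, this is roughly what “keeping extra information around” could look like. It is only a sketch: the variable dropped is a name I made up, and it assumes the data has no NaNs, so every all-zero row really is the dropped level.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# Record the dropped level at encoding time by comparing the full and
# reduced column sets. This is the extra information to keep around.
full_columns = pd.get_dummies(s, dtype=int).columns
reduced = pd.get_dummies(s, drop_first=True, dtype=int)
dropped = full_columns.difference(reduced.columns)[0]

# Rows with a 1 take their column label; all-zero rows are assigned
# the remembered dropped level (assumes no NaNs in the data).
has_one = reduced.eq(1).any(axis=1)
restored = reduced.idxmax(axis=1).where(has_one, dropped)
```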

Since dummy_na=False is the default, NaNs could certainly cause problems. Set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the “dummification” and your data contains any NaNs. Setting dummy_na=True always adds a NaN column, even if that column is all 0s, so you probably don’t want to set it unless you actually have NaNs. A nice approach might be dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What’s also nice is that the idxmax solution will then correctly regenerate your NaNs (as real NaN values, not a string that says “nan”).
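
A quick round trip to illustrate, as a sketch rather than anything definitive. The bool(...) cast and dtype=int are my additions for safety across versions, and the sample series are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', 'a'])

# Only add the NaN column when the data actually contains NaNs
dummies = pd.get_dummies(s, dummy_na=bool(s.isnull().any()), dtype=int)

# The NaN row has its 1 in the NaN-labelled column, so idxmax hands
# back a real NaN for it rather than the string "nan".
restored = dummies.idxmax(axis=1)

# With clean data the same call adds no extra column
clean = pd.Series(['a', 'b'])
clean_dummies = pd.get_dummies(
    clean, dummy_na=bool(clean.isnull().any()), dtype=int
)
```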

It’s also worth mentioning that setting drop_first=True together with dummy_na=False makes NaNs indistinguishable from an instance of the first variable, since both produce all-zero rows. I’d strongly discourage that combination if your dataset may contain any NaN values.
