What are the pros and cons of get_dummies (pandas) versus OneHotEncoder (scikit-learn)?

For machine learning, you almost certainly want to use sklearn.preprocessing.OneHotEncoder. For other tasks, such as simple analyses, you may be able to use pd.get_dummies, which is a bit more convenient. Note that OneHotEncoder has accepted strings as well as integers for categorical variables since scikit-learn 0.20. The crux of …
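One practical difference that is easy to check is how each tool behaves when the data seen later contains different categories from the data seen first. The sketch below uses made-up `train` and `test` frames and a hypothetical `color` column (none of these names come from the original answer). It only illustrates that behavior and is not a full comparison of the two APIs.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# get_dummies derives its columns from whatever values are present,
# so the two frames end up with different column sets.
train_dummies = pd.get_dummies(train["color"])
test_dummies = pd.get_dummies(test["color"])

# OneHotEncoder learns the categories once at fit time and reuses them.
# handle_unknown="ignore" encodes unseen values as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])
test_encoded = encoder.transform(test[["color"]]).toarray()

print(list(train_dummies.columns))
print(list(test_dummies.columns))
print(test_encoded)
```

Because the fitted encoder keeps the training categories, the encoded test matrix always has the same width as the training matrix, which is what a downstream model expects.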

Dummy variables when not all categories are present

TL;DR:

    pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))

Older pandas:

    pd.get_dummies(cat.astype('category', categories=categories))

"Is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don’t appear in a given dataframe, it’d just create a column of 0s?"

Yes, there is! Pandas has a special type of Series just for categorical …
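As a minimal sketch of the TL;DR above, the following uses a made-up Series `cat` and category list `categories` (both values are assumptions chosen for illustration). The category "c" never occurs in the data but still gets its own column.

```python
import pandas as pd

# Hypothetical inputs: "c" is a known category that is absent from the data.
categories = ["a", "b", "c"]
cat = pd.Series(["a", "b", "a"])

# Casting to a categorical dtype with explicit categories makes
# get_dummies emit one column per declared category, present or not.
typed = cat.astype(pd.CategoricalDtype(categories=categories))
dummies = pd.get_dummies(typed)

print(dummies)
```

The column for "c" contains only zeros (or False, depending on the pandas version), so frames built from different subsets of the data line up column for column.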

Dummify character column and find unique values [duplicate]

I’d use splitstackshape and mtabulate from the qdapTools package to get this as a one-liner, i.e.

    library(splitstackshape)
    library(qdapTools)
    mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
    #   a cc ff rr e
    #V1 1  1  1  1 0
    #V2 1  1  0  1 1

It can also be done entirely in splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments: cSplit_e(test, "col", …
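For readers working in pandas rather than R, a rough analogue of the same task is `Series.str.get_dummies`, which splits each string on a separator and builds indicator columns. This is a swapped-in technique, not the answer's method, and the `test` frame below is a hypothetical reconstruction, since the original input is not shown in the excerpt.

```python
import pandas as pd

# Hypothetical input: one semicolon-delimited string per row.
test = pd.DataFrame({"col": ["a;cc;ff;rr", "a;cc;rr;e"]})

# Split on ";" and create one 0/1 column per distinct token.
# Columns come back in sorted order, not order of appearance.
dummies = test["col"].str.get_dummies(sep=";")

# The column labels double as the set of unique values.
unique_values = list(dummies.columns)

print(dummies)
print(unique_values)
```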