dummy-variable
What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?
For machine learning, you almost certainly want to use sklearn.preprocessing.OneHotEncoder. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient. Note that since scikit-learn 0.20, OneHotEncoder accepts strings as well as integers for categorical variables. The crux of …
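One practical difference between the two can be sketched as follows (this is an illustration, not necessarily the point the truncated answer goes on to make). The `color` column and its values are made up for the example; the sketch assumes pandas and scikit-learn 0.20 or later are installed.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# pd.get_dummies encodes each frame on its own, so the columns can differ.
train_dummies = pd.get_dummies(train["color"])
test_dummies = pd.get_dummies(test["color"])

# OneHotEncoder learns the categories once and reuses them on new data.
# handle_unknown="ignore" encodes unseen categories as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])
train_encoded = encoder.transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()

learned_categories = list(encoder.categories_[0])
print(list(train_dummies.columns), list(test_dummies.columns))
print(learned_categories)
print(test_encoded)
```

Here the two get_dummies results have different columns, while the fitted encoder always produces one column per category seen during fitting.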
Converting pandas column of comma-separated strings into dummy variables
Use str.get_dummies:

    df['col'].str.get_dummies(sep=',')

       a  b  c  d
    0  1  0  0  0
    1  1  1  1  0
    2  1  1  0  1
    3  0  0  0  1
    4  0  0  1  1

Edit: Updating the answer to address some questions.

Qn 1: Why is it that the series method get_dummies does not accept the argument …
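A self-contained version of the call above might look like this. The input values are an assumption, chosen so that the result matches the table shown in the answer; the original question's data is not included in the excerpt.

```python
import pandas as pd

# Assumed input: one comma-separated string per row.
df = pd.DataFrame({"col": ["a", "a,b,c", "a,b,d", "d", "c,d"]})

# Split each string on "," and build one indicator column per distinct token.
dummies = df["col"].str.get_dummies(sep=",")
print(dummies)
```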
Creating dummy variables in R data.table
This seems to do what you're looking for:

    inds <- unique(test$index)
    test[, (inds) := lapply(inds, function(x) index == x)]

which gives

       index        var1     a     b     c     d     e     f     g     h     i     j
    1:     a  0.25331851  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    2:     b -0.02854676 FALSE  TRUE FALSE FALSE FALSE FALSE …
Keep same dummy variable in training and testing data
You can also just get the missing columns and add them to the test dataset:

    # Get the columns present in the training set but missing from the test set
    missing_cols = set(train.columns) - set(test.columns)
    # Add each missing column to the test set with a default value of 0
    for c in missing_cols:
        test[c] = 0
    # Ensure …
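As an alternative to the loop, a swapped-in technique that reaches a similar result is `DataFrame.reindex`, which adds missing columns, drops extra ones, and fixes the column order in one call. This is a minimal sketch, not the answer's own method; the `city` data and the `train_raw`/`test_raw` names are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: "Lima" is only in the test set,
# "Paris" and "Oslo" are only in the training set.
train_raw = pd.DataFrame({"city": ["Paris", "Rome", "Oslo"]})
test_raw = pd.DataFrame({"city": ["Rome", "Lima"]})

train = pd.get_dummies(train_raw, columns=["city"], dtype=int)
test = pd.get_dummies(test_raw, columns=["city"], dtype=int)

# Align the test columns to the training columns:
# missing columns are filled with 0, columns unknown to train are dropped.
test_aligned = test.reindex(columns=train.columns, fill_value=0)
print(test_aligned)
```

Note that dropping columns unseen in training (here `city_Lima`) is a design choice of this sketch; the loop in the answer keeps them.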
Dummy variables when not all categories are present
TL;DR:

    pd.get_dummies(cat.astype(pd.CategoricalDtype(categories=categories)))

Older pandas:

    pd.get_dummies(cat.astype('category', categories=categories))

"Is there a way to pass to get_dummies (or an equivalent function) the names of the categories, so that, for the categories that don't appear in a given dataframe, it'd just create a column of 0s?"

Yes, there is! Pandas has a special type of Series just for categorical …
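The TL;DR can be tried out with a small example. The `cat` values and the `categories` list below are assumptions for illustration; `dtype=int` is added only to make the output 0/1 regardless of pandas version.

```python
import pandas as pd

# Assumed full set of categories, including ones absent from the data.
categories = ["a", "b", "c", "d"]
cat = pd.Series(["a", "c", "a"])

# Casting to a CategoricalDtype with explicit categories makes get_dummies
# emit a column for every category, even those that never occur.
typed = cat.astype(pd.CategoricalDtype(categories=categories))
dummies = pd.get_dummies(typed, dtype=int)
print(dummies)
```

The `b` and `d` columns come out as all zeros because those categories are declared but never observed.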
Pandas: Get Dummies
You can try:

    df = pd.get_dummies(df, columns=['type'])
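For illustration, here is what that call does on a small made-up frame. The `name` column and the data are hypothetical; only the `type` column name comes from the answer, and `dtype=int` is an addition to get 0/1 output on any pandas version.

```python
import pandas as pd

# Hypothetical frame with one categorical column named "type".
df = pd.DataFrame({"name": ["x", "y", "z"], "type": ["A", "B", "A"]})

# Only the listed column is encoded; it is replaced by prefixed indicator
# columns (type_A, type_B) while the other columns are left untouched.
encoded = pd.get_dummies(df, columns=["type"], dtype=int)
print(encoded)
```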
Dummify character column and find unique values [duplicate]
I'd use the splitstackshape package and mtabulate from the qdapTools package to get this as a one-liner, i.e.

    library(splitstackshape)
    library(qdapTools)

    mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
    #    a cc ff rr e
    # V1 1  1  1  1 0
    # V2 1  1  0  1 1

It can also be done entirely with splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments:

    cSplit_e(test, "col", …
How to force R to use a specified factor level as reference in a regression?
See the relevel() function. Here is an example:

    set.seed(123)
    x <- rnorm(100)
    DF <- data.frame(x = x,
                     y = 4 + (1.5*x) + rnorm(100, sd = 2),
                     b = gl(5, 20))
    head(DF)
    str(DF)

    m1 <- lm(y ~ x + b, data = DF)
    summary(m1)

Now alter the factor b in DF by use of the …