Linear regression analysis with string/categorical features (variables)?

Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent. Usually there are three possibilities: (1) one-hot encoding for categorical data; (2) arbitrary numbers for ordinal data; (3) something like group means for categorical data (e.g. mean prices for city districts). You have to be careful not to infuse …
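
The three options can be sketched with pandas as below. This is only an illustration under assumed data: the frame, the column names (`district`, `condition`, `price`) and the ordering of the condition levels are all made up here and do not come from the original answer. If the averaged column were also the regression target, the group means would presumably have to be computed on training rows only; the sketch ignores that for brevity.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
df = pd.DataFrame({
    "district": ["north", "south", "north", "east", "south", "east"],
    "condition": ["poor", "good", "excellent", "good", "poor", "excellent"],
    "price": [200.0, 150.0, 260.0, 180.0, 120.0, 240.0],
})

# 1) One-hot encoding: one 0/1 column per category of an unordered feature.
one_hot = pd.get_dummies(df["district"], prefix="district", dtype=int)

# 2) Ordinal data: map the ordered levels to increasing integers.
#    The spacing between the integers is arbitrary; only the order matters.
condition_order = {"poor": 0, "good": 1, "excellent": 2}
condition_code = df["condition"].map(condition_order)

# 3) Group means: replace each district by the mean price observed in it.
district_mean_price = df.groupby("district")["price"].transform("mean")

# Assemble a purely numeric feature table.
features = pd.concat(
    [
        one_hot,
        condition_code.rename("condition_code"),
        district_mean_price.rename("district_mean_price"),
    ],
    axis=1,
)
print(features)
```

Which of the three fits depends on what the attribute means: one-hot makes no assumption about order, integer codes assert an order, and group means compress many categories into a single numeric column.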

Should Feature Selection be done before Train-Test Split or after?

It is not actually difficult to demonstrate why using the whole dataset (i.e. before splitting into train/test) for selecting features can lead you astray. Here is one such demonstration using random dummy data with Python and scikit-learn:

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics …
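
Since the excerpt is cut off, here is an independent minimal sketch of the same idea; it is not the original author's code. The dataset sizes, `k=25`, the `f_classif` score function and the helper `fit_and_score` are assumptions chosen for illustration. Features and labels are pure noise, so any accuracy clearly above chance can only come from information leaking out of the test rows during selection.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(500, 5000)           # random features, no real signal
y = rng.randint(0, 2, size=500)    # random labels, unrelated to X


def fit_and_score(X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a logistic regression, return test accuracy."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Leaky order: select features using ALL rows, then split.
X_leaky = SelectKBest(f_classif, k=25).fit_transform(X, y)
Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_leaky, y, test_size=0.25, random_state=0
)
acc_leaky = fit_and_score(Xl_train, Xl_test, yl_train, yl_test)

# Clean order: split first, fit the selector on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
selector = SelectKBest(f_classif, k=25).fit(X_train, y_train)
acc_clean = fit_and_score(
    selector.transform(X_train), selector.transform(X_test), y_train, y_test
)

print(f"selection before split: {acc_leaky:.2f}")
print(f"selection after split:  {acc_clean:.2f}")
```

With noise data like this, one would expect the first number to sit noticeably above 0.5 and the second to stay close to it, although the exact values depend on the seed and the sizes chosen.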

How are feature_importances in RandomForestClassifier determined?

There are indeed several ways to get feature “importances”. As is often the case, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “Gini importance” or “mean decrease impurity” and is defined as the total decrease in …
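
As a small illustration of how these values can be inspected in scikit-learn, the sketch below fits a forest on synthetic data and compares `feature_importances_` of the forest with the average of the per-tree `feature_importances_`. The dataset parameters and variable names are assumptions made for the example, and the code does not try to complete the truncated definition above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative (assumed setup).
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances reported by the forest (they sum to 1).
importances = forest.feature_importances_

# Each fitted tree exposes its own normalized importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")

# The forest-level vector matches the average over its trees.
print(np.allclose(importances, per_tree.mean(axis=0)))
```

The per-tree array can also be used to look at the spread of an importance across trees, for example with `per_tree.std(axis=0)`.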