How to one-hot encode variable-length features?

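You can use MultiLabelBinarizer from scikit-learn (sklearn.preprocessing), which is designed for exactly this: it turns a collection of label sets of varying length into a fixed-width binary indicator matrix.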

Code for your example:

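from sklearn.preprocessing import MultiLabelBinarizer

# Each row holds a different number of features
features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# One output column per distinct feature, in sorted order (f1 ... f6)
mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)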

Output:

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
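To see which feature each output column stands for, you can inspect the fitted binarizer. The short sketch below reuses the same example data; the variable names `columns` and `recovered` are only illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

# classes_ lists the label behind each output column, in sorted order
columns = list(mlb.classes_)

# inverse_transform maps the indicator rows back to tuples of labels
recovered = mlb.inverse_transform(new_features)
```

Keeping `mlb.classes_` around is handy if you later wrap the array in a DataFrame and want meaningful column names.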

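The encoded output can also feed into a scikit-learn pipeline, along with other utilities such as those in feature_selection. Keep in mind that MultiLabelBinarizer is written as a label (y) transformer whose fit and transform take a single argument, so using it as a Pipeline step generally needs a small wrapper transformer.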
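One possible way to put this into a Pipeline is sketched below. As far as I know, Pipeline calls each step with both X and y, while MultiLabelBinarizer's methods accept only one argument, so the sketch wraps it in a thin transformer. The class name `MultiLabelBinarizerStep` is hypothetical (not part of scikit-learn), and VarianceThreshold is chosen only as an example of a feature_selection utility.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer


class MultiLabelBinarizerStep(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving MultiLabelBinarizer the (X, y) interface."""

    def fit(self, X, y=None):
        # Fit the binarizer on the feature lists; y is accepted but ignored
        self.mlb_ = MultiLabelBinarizer().fit(X)
        return self

    def transform(self, X):
        return self.mlb_.transform(X)


features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

pipe = Pipeline([
    ('binarize', MultiLabelBinarizerStep()),
    # Default threshold drops columns that are constant across all rows
    ('select', VarianceThreshold()),
])
selected = pipe.fit_transform(features)
```

With this data, 'f2' appears in every row, so its column is constant and the selector removes it, leaving five columns.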
