Use FeatureUnion in scikit-learn to combine two pandas columns for TF-IDF

FeatureUnion is not meant to be used that way. It takes a list of feature extractors / vectorizers, applies each of them to the input, and concatenates their outputs. It does not take data in its constructor the way the question shows it.
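To make the distinction concrete, here is a minimal sketch of FeatureUnion used as intended: transformers go into the constructor and the data is passed only at fit time. The toy documents and the step names "words" and "chars" are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy documents, used only for illustration.
docs = ["the cat sat", "the dog barked", "a cat and a dog"]

# FeatureUnion is built from (name, transformer) pairs, not from data.
union = FeatureUnion([
    ("words", CountVectorizer(analyzer="word")),
    ("chars", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# The data is supplied only here; every transformer sees the same input.
features = union.fit_transform(docs)

# The output columns are the word features followed by the char features.
fitted = dict(union.transformer_list)
n_word_features = len(fitted["words"].vocabulary_)
n_char_features = len(fitted["chars"].vocabulary_)
print(features.shape, n_word_features, n_char_features)
```

Note that both vectorizers receive the same sequence of strings, which is why pointing them at two different dataframe columns needs one of the approaches described here.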

CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the two columns, which passes the text from both columns to the same CountVectorizer.

combined_2 = df['Subject'] + ' ' + df['body_text']
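As a rough end-to-end sketch of the concatenation approach: the dataframe below is invented (only the column names come from the question), the `fillna("")` guard is my own addition for rows with missing text, and TfidfVectorizer is swapped in for CountVectorizer because the goal is TF-IDF.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame; only the column names come from the question.
df = pd.DataFrame({
    "Subject": ["Meeting tomorrow", "Invoice attached", None],
    "body_text": ["Please join at noon", "Payment is due Friday", "No subject here"],
})

# fillna("") is an assumption about how to treat missing text: without it,
# a missing value in either column makes the concatenated value missing too.
combined = df["Subject"].fillna("") + " " + df["body_text"].fillna("")

# One vectorizer, one shared vocabulary across both columns.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(combined)
print(tfidf.shape)
```

With this approach a word counts the same whether it appeared in the subject or the body, since both end up in a single vocabulary.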

An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.

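import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# "..." stands for whatever vectorizer options you use
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# stack the two sparse matrices side by side (one row per document)
combined_2 = sp.hstack([subject_vectors, body_vectors], format="csr")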
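If you also want the optional TfidfTransformer step, one way to sketch it is below. The helper `counts_then_tfidf` and the toy dataframe are hypothetical, and default vectorizer settings are assumed in place of the elided options.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy frame; only the column names come from the question.
df = pd.DataFrame({
    "Subject": ["Meeting tomorrow", "Invoice attached", "Lunch plans"],
    "body_text": ["Please join at noon", "Payment is due Friday", "Pizza or salad today"],
})


def counts_then_tfidf(texts):
    """Fit CountVectorizer, then TfidfTransformer, on one column of text."""
    vectorizer = CountVectorizer()  # default options, an assumption
    tfidf = TfidfTransformer()
    weighted = tfidf.fit_transform(vectorizer.fit_transform(texts))
    return vectorizer, tfidf, weighted


subject_vectorizer, subject_tfidf, subject_vectors = counts_then_tfidf(df["Subject"])
body_vectorizer, body_tfidf, body_vectors = counts_then_tfidf(df["body_text"])

# Each column keeps its own vocabulary and its own IDF weights.
combined = sp.hstack([subject_vectors, body_vectors], format="csr")
print(combined.shape)
```

For new data you would call `transform` (not `fit_transform`) on the already fitted vectorizers and transformers, then stack the results in the same column order.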

A third option is to implement your own transformer that would extract a dataframe column.

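from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator supplies get_params/set_params, which recent scikit-learn
# versions expect from every pipeline step
class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # nothing to learn; the transformer is stateless
        return self

    def transform(self, X, y=None):
        # return the selected column as a pandas Series
        return X[self.column]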

In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.

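from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)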

This feature union of pipelines takes the dataframe, and each pipeline processes its own column. The result is the horizontal concatenation of the term-count matrices from the two columns.

sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be added as the first step in a larger pipeline.
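As an illustration of that last point, here is a self-contained sketch of a larger pipeline. The toy data, the labels, and the choice of TfidfTransformer followed by LogisticRegression are all assumptions made for the example, not something the question specifies.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select a single column from a DataFrame."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Hypothetical toy data and labels (1 = spam, 0 = not spam).
df = pd.DataFrame({
    "Subject": ["Win a prize now", "Meeting agenda", "Cheap loans today", "Project update"],
    "body_text": [
        "Click here to claim your free prize",
        "Agenda for the meeting on Monday",
        "Get a cheap loan with no credit check",
        "Status update for the project this week",
    ],
})
labels = [1, 0, 1, 0]

# Step 1: per-column term counts, concatenated side by side.
feature_union = make_union(
    make_pipeline(DataFrameColumnExtracter("Subject"), CountVectorizer()),
    make_pipeline(DataFrameColumnExtracter("body_text"), CountVectorizer()),
)

# Steps 2 and 3: TF-IDF weighting over the combined counts, then a classifier.
model = make_pipeline(feature_union, TfidfTransformer(), LogisticRegression())
model.fit(df, labels)
predictions = model.predict(df)
print(predictions)
```

One design consequence worth noting: placing TfidfTransformer after the union normalizes each row across both columns together. If you would rather weight each column independently, the transformer could instead go inside each per-column pipeline.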
