FeatureUnion was not meant to be used that way. It takes two or more feature extractors / vectorizers, applies each of them to the same input, and concatenates their outputs. It does not take data in its constructor the way it is shown.
CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the two columns row by row, which passes the text from both columns to the same CountVectorizer.
combined_2 = df['Subject'] + ' ' + df['body_text']
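As a minimal sketch of the concatenation approach, here is a complete run on a small made-up dataframe. Only the column names come from the question; the sample rows and the variable names `combined`, `vectorizer` and `counts` are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Subject": ["meeting tomorrow", "free prize"],
    "body_text": ["see you at the meeting", "claim your free prize now"],
})

# Join the two text columns row by row, separated by a space.
combined = df["Subject"] + " " + df["body_text"]

# One vectorizer, so both columns share a single vocabulary.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(combined)

# "meeting" appears once in the subject and once in the body of row 0,
# and both occurrences land in the same feature column.
meeting_count = counts[0, vectorizer.vocabulary_["meeting"]]
```

If either column can contain missing values, the concatenation produces NaN for that row, so filling the gaps with empty strings first (for example with `fillna('')`) is probably needed before vectorizing.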
An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.
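import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Fit one vectorizer per column so each column keeps its own vocabulary.
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])
body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# Stack the two sparse matrices side by side (same rows, more columns).
combined_2 = sp.hstack([subject_vectors, body_vectors], format="csr")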
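The per-column approach, including the optional TfidfTransformer step, could look roughly like the sketch below. The sample rows are made up, and the default vectorizer settings are an assumption standing in for whatever options you actually need.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "Subject": ["meeting tomorrow", "free prize"],
    "body_text": ["see you at the meeting", "claim your free prize now"],
})

# Separate vectorizers: each column gets its own vocabulary.
subject_counts = CountVectorizer().fit_transform(df["Subject"])
body_counts = CountVectorizer().fit_transform(df["body_text"])

# Optional: reweight each block of counts with tf-idf, independently.
subject_tfidf = TfidfTransformer().fit_transform(subject_counts)
body_tfidf = TfidfTransformer().fit_transform(body_counts)

# Side-by-side stack: one row per document, subject features first.
stacked = sp.hstack([subject_tfidf, body_tfidf], format="csr")
```

Unlike plain string concatenation, a word that occurs in both columns gets two feature columns here, one per source column, so a downstream model can weight a word in the subject differently from the same word in the body.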
A third option is to implement your own transformer that would extract a dataframe column.
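from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    # BaseEstimator supplies get_params/set_params, which pipeline tooling expects.
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Return the column as a 1-D Series of strings.
        return X[self.column]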
In that case you could use FeatureUnion on two pipelines, each containing your custom transformer followed by CountVectorizer.
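from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)
body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)
feature_union = make_union(subj_pipe, body_pipe)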
This feature union of pipelines takes the dataframe, and each pipeline processes its own column. The result is the horizontal concatenation of the term count matrices from the two columns.
sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.
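For instance, the union could feed a classifier directly. The sketch below is self-contained, so it repeats the column extractor (with BaseEstimator added as a base class, which is an assumption made here for compatibility with recent scikit-learn versions). The sample rows, the labels and the choice of LogisticRegression are all made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select a single dataframe column and pass it on as a Series."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Made-up sample data and labels; only the column names come from the question.
df = pd.DataFrame({
    "Subject": ["meeting tomorrow", "free prize", "project update", "win cash"],
    "body_text": [
        "see you at the meeting",
        "claim your free prize now",
        "the report is attached",
        "win cash today, click here",
    ],
})
labels = [0, 1, 0, 1]

# The feature union is the first step; the classifier (an arbitrary choice)
# receives the stacked count matrix.
model = make_pipeline(
    make_union(
        make_pipeline(DataFrameColumnExtracter("Subject"), CountVectorizer()),
        make_pipeline(DataFrameColumnExtracter("body_text"), CountVectorizer()),
    ),
    LogisticRegression(),
)

model.fit(df, labels)
predictions = model.predict(df)
```

Because the whole chain is a single estimator, `fit` and `predict` both accept the raw dataframe, and the vectorizers are only ever fitted on the data passed to `fit`.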