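Using this as a model for spam classification, I'd like to add the Subject as an additional feature alongside the body.
I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'], and the spam/ham label is df['ham/spam'].
I receive the following error: TypeError: 'FeatureUnion' object is not iterable
How can I use both df['Subject'] and df['body_text'] as features while running them through the pipeline?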
    from sklearn.pipeline import FeatureUnion

    features = df[['Subject', 'body_text']].values
    combined_2 = FeatureUnion(list(features))

    pipeline = Pipeline([
        ('count_vectorizer', CountVectorizer(ngram_range=(1, 2))),
        ('tfidf_transformer', TfidfTransformer()),
        ('classifier', MultinomialNB())])

    pipeline.fit(combined_2, df['ham/spam'])

    k_fold = KFold(n=len(df), n_folds=6)
    scores = []
    confusion = numpy.array([[0, 0], [0, 0]])
    for train_indices, test_indices in k_fold:
        train_text = combined_2.iloc[train_indices]
        train_y = df.iloc[test_indices]['ham/spam'].values

        test_text = combined_2.iloc[test_indices]
        test_y = df.iloc[test_indices]['ham/spam'].values

        pipeline.fit(train_text, train_y)
        predictions = pipeline.predict(test_text)
        prediction_prob = pipeline.predict_proba(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, pos_label='spam')
        scores.append(score)
David Maust.. 25
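FeatureUnion was not meant to be used that way. Instead, it takes two (or more) feature extractors / vectorizers and applies each of them to the input. It does not take data in its constructor the way the question shows.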
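To make the intended usage concrete, here is a minimal sketch of a FeatureUnion built from transformers rather than data. The sample texts and the step names `word_counts` and `char_counts` are made up for illustration; they are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Toy input, invented for this example.
texts = ["Win money now", "Meeting agenda for Monday"]

# The constructor receives (name, transformer) pairs, not data.
union = FeatureUnion([
    ("word_counts", CountVectorizer()),
    ("char_counts", CountVectorizer(analyzer="char")),
])

# Both vectorizers see the same input; their outputs are
# concatenated column-wise into one sparse matrix.
features = union.fit_transform(texts)

word_dim = len(union.transformer_list[0][1].vocabulary_)
char_dim = len(union.transformer_list[1][1].vocabulary_)
print(features.shape, word_dim, char_dim)
```

The data only enters at `fit_transform`, which is why passing dataframe rows to the constructor fails.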
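CountVectorizer expects a sequence of strings. The easiest way to provide one is to concatenate the strings together. That passes the text of both columns to the same CountVectorizer.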
    combined_2 = df['Subject'] + ' ' + df['body_text']
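As a rough end-to-end sketch of the concatenation approach, the combined series can be fed straight into the pipeline from the question. The four-row dataframe below is invented for illustration. Note that if either column contains missing values, the concatenation yields NaN for that row, so something like `fillna('')` may be needed first.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Subject": ["Win money now", "Meeting agenda",
                "Cheap pills offer", "Lunch tomorrow"],
    "body_text": ["Claim your free prize today",
                  "Please review the attached agenda",
                  "Buy cheap pills online",
                  "Are you free for lunch tomorrow"],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# One string per email: subject and body joined by a space.
combined = df["Subject"] + " " + df["body_text"]

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

pipeline.fit(combined, df["ham/spam"])
predictions = pipeline.predict(combined)
print(predictions)
```

With this approach a word counts the same whether it appears in the subject or in the body, since both share one vocabulary.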
Another way is to run CountVectorizer, and optionally TfidfTransformer, on each column individually and then stack the results.
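    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer

    # `...` stands for whatever vectorizer options you use
    subject_vectorizer = CountVectorizer(...)
    subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

    body_vectorizer = CountVectorizer(...)
    body_vectors = body_vectorizer.fit_transform(df['body_text'])

    # stack the two sparse matrices side by side
    combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')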
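One detail worth spelling out with the stacking approach is that each vectorizer must be fitted on the training data only and then reused with `transform` on new data, so the columns line up. A small sketch, using made-up `train` and `test` dataframes and default vectorizer settings:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only.
train = pd.DataFrame({
    "Subject": ["Win money now", "Meeting agenda"],
    "body_text": ["Claim your free prize", "Please review the agenda"],
})
test = pd.DataFrame({
    "Subject": ["Free money"],
    "body_text": ["Review your prize"],
})

subject_vectorizer = CountVectorizer()
body_vectorizer = CountVectorizer()

# Fit each vectorizer on its own training column.
train_X = sp.hstack([
    subject_vectorizer.fit_transform(train["Subject"]),
    body_vectorizer.fit_transform(train["body_text"]),
], format="csr")

# Reuse the fitted vocabularies for unseen data.
test_X = sp.hstack([
    subject_vectorizer.transform(test["Subject"]),
    body_vectorizer.transform(test["body_text"]),
], format="csr")

n_features = (len(subject_vectorizer.vocabulary_)
              + len(body_vectorizer.vocabulary_))
print(train_X.shape, test_X.shape, n_features)
```

Unlike plain concatenation of the strings, this keeps subject terms and body terms in separate columns.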
A third option is to implement your own transformer that extracts a dataframe column.
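    from sklearn.base import BaseEstimator, TransformerMixin

    # BaseEstimator supplies get_params/set_params, which sklearn
    # needs in order to clone the transformer (e.g. in cross-validation)
    class DataFrameColumnExtracter(BaseEstimator, TransformerMixin):

        def __init__(self, column):
            self.column = column

        def fit(self, X, y=None):
            # nothing to learn; the column name is fixed
            return self

        def transform(self, X, y=None):
            return X[self.column]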
In that case you can use FeatureUnion on two pipelines, each containing your custom transformer followed by a CountVectorizer.
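    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline, make_union

    subj_pipe = make_pipeline(
        DataFrameColumnExtracter('Subject'),
        CountVectorizer()
    )

    body_pipe = make_pipeline(
        DataFrameColumnExtracter('body_text'),
        CountVectorizer()
    )

    feature_union = make_union(subj_pipe, body_pipe)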
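This feature union of pipelines takes the dataframe, and each pipeline processes its own column. The result is the concatenation of the term count matrices from the two given columns.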
    sparse_matrix_of_counts = feature_union.fit_transform(df)
This feature union can also be added as the first step in a larger pipeline.
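A larger pipeline along those lines might look like the sketch below. Several things here are assumptions rather than part of the answer: the six-row dataframe is invented, `ColumnExtractor` is a hypothetical stand-in for the custom transformer, and `cross_val_score` from `sklearn.model_selection` is used in place of the question's manual `KFold(n=..., n_folds=...)` loop, which relies on an older scikit-learn API.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union


class ColumnExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: selects one dataframe column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Subject": ["Win money now", "Meeting agenda", "Cheap pills offer",
                "Lunch tomorrow", "Free prize inside", "Project update"],
    "body_text": ["Claim your free prize today",
                  "Please review the attached agenda",
                  "Buy cheap pills online now",
                  "Are you free for lunch tomorrow",
                  "Win a free prize and money",
                  "Here is the weekly project update"],
    "ham/spam": ["spam", "ham", "spam", "ham", "spam", "ham"],
})

# The union is the first step; tf-idf and the classifier follow it.
model = make_pipeline(
    make_union(
        make_pipeline(ColumnExtractor("Subject"), CountVectorizer()),
        make_pipeline(ColumnExtractor("body_text"), CountVectorizer()),
    ),
    TfidfTransformer(),
    MultinomialNB(),
)

# The whole dataframe is passed in; each branch picks its own column.
spam_f1 = make_scorer(f1_score, pos_label="spam", zero_division=0)
scores = cross_val_score(model, df, df["ham/spam"], cv=3, scoring=spam_f1)

model.fit(df, df["ham/spam"])
predictions = model.predict(df)
print(scores, predictions)
```

Because the vectorizers live inside the pipeline, they are refitted on each training fold, so no vocabulary leaks from the test fold.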