I have a DataFrame with 14 columns, and I am using custom transformers.
My custom ColumnSelector transformer is:
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
It is followed by a custom TypeSelector:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
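The original DataFrame from which I select the required columns is df_with_types, which has 981 rows. The columns I want to extract, with their data types, are listed below:
meeting_subject_stem_sentence: 'object', priority_label_stem_sentence: 'object', attendees: 'category', day_of_week: 'category', meeting_time_mins: 'int64'
I then build the pipeline as follows: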
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
        'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
        ))
    ])
)
When I fit the pipeline to the data, it throws this error:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
My hunch is that this is caused by the TfidfVectorizer. Fitting only the TfidfVectorizer, without the FeatureUnion:
the_pipe = Pipeline([
    ('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence',
        'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence'])),
    ('type_selector', TypeSelector('object')),
    ('tfidf', TfidfVectorizer())
])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2 x 2 matrix instead of one with 981 rows.
(0, 0) 1.0
(1, 1) 1.0
Calling the feature-names method through named_steps, I get:
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
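It seems to fit only on the column names and does not iterate over the documents. How can I make it do that in the pipeline above? Also, if I want to apply a pairwise distance/similarity function to each feature after ColumnSelector and TypeSelector, as part of the pipeline, how do I do that?
An example would be: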
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence',
        'attendees', 'day_of_week', 'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler(),
            'Pairwise manhattan distance between each element of the integer feature'
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc(),
            'Pairwise dice coefficient here'
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords),
            'Pairwise cosine similarity here'
        ))
    ])
)
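Please help. As a beginner I have been banging my head against this without getting anywhere. I have been through zac_stewart's blog and many similar articles, but none of them seem to cover how to use TF-IDF together with a TypeSelector or ColumnSelector. Thank you very much for any help. I hope I have stated the question clearly.
Edit 1:
If I use a TextSelector transformer, as follows: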
class TextSelector(BaseEstimator, TransformerMixin):
    """ Transformer that selects text column from DataFrame by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        '''Create X attribute to be transformed'''
        return self

    def transform(self, X, y=None):
        '''the key passed here indicates column name'''
        return X[self.key]
text_processing_pipe_line_1 = Pipeline([('selector', TextSelector(key='meeting_subject')), ('text_1', TfidfVectorizer(stop_words=stopWords))])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works and iterates over the documents, so if I could make TypeSelector return a Series, would that be the right approach? Thanks again for your help.
Answer 0 (score: 1)
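Question 1
You have 2 columns that contain text. Either apply a separate TfidfVectorizer to each of them and then combine the results with a FeatureUnion, or concatenate the strings into 1 column and treat each concatenated string as one document.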
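I suspect this is the root of your problem: TfidfVectorizer.fit() takes raw_documents, which must be an iterable that yields str. In your case it receives a two-column DataFrame, and iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer sees only 2 "documents" (one per text column).
Read the official docs for more information.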
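The first option can be sketched roughly as below. This is only an illustration under assumptions: the toy DataFrame stands in for df_with_types and its contents are made up, and TextSelector is the single-column selector from the question's Edit 1. Each text column gets its own TfidfVectorizer, and FeatureUnion stacks the resulting matrices side by side.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline


class TextSelector(BaseEstimator, TransformerMixin):
    """Return one column as a 1-D Series, so iterating yields strings."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Toy stand-in for df_with_types (values invented for illustration).
df = pd.DataFrame({
    "meeting_subject_stem_sentence": ["budget review", "weekly sync", "budget plan"],
    "priority_label_stem_sentence": ["high", "low", "high"],
})

# Iterating over a DataFrame yields its column labels, which is why the
# vectorizer in the question only ever saw two "documents".
documents_seen = list(df)

# One TfidfVectorizer per text column, combined column-wise.
text_features = FeatureUnion([
    ("subject", make_pipeline(
        TextSelector("meeting_subject_stem_sentence"), TfidfVectorizer())),
    ("priority", make_pipeline(
        TextSelector("priority_label_stem_sentence"), TfidfVectorizer())),
])

tfidf_matrix = text_features.fit_transform(df)
print(tfidf_matrix.shape)  # one row per DataFrame row
```

In the full pipeline, a union like this would presumably replace the "text_features" branch, while the integer and categorical branches stay as they are.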
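Question 2
You cannot use pairwise similarity/distance as part of a pipeline because it is not a transformer. A transformer transforms each sample independently of the others, whereas a pairwise metric needs 2 samples at once. However, you can simply compute it after the pipeline's fit_transform, using metrics.pairwise.pairwise_distances.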
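A minimal sketch of that idea, assuming toy inputs in place of the real pipeline output: the features are transformed first, and the pairwise matrices are computed afterwards with functions from sklearn.metrics.pairwise. The sample documents and minute values are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for one text column and one integer column.
docs = ["budget review", "weekly sync", "budget plan"]
minutes = np.array([[30.0], [60.0], [90.0]])

# Step 1: transform the features (this is the part a pipeline can do).
tfidf = TfidfVectorizer().fit_transform(docs)
scaled_minutes = StandardScaler().fit_transform(minutes)

# Step 2: compute pairwise matrices on the transformed output.
text_similarity = cosine_similarity(tfidf)
minute_distance = pairwise_distances(scaled_minutes, metric="manhattan")

print(text_similarity.shape, minute_distance.shape)
```

Each result is an n_samples x n_samples matrix, so it describes pairs of rows rather than adding feature columns to each row.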