I use this function to compute tf-idf on text for 1,100,000 samples:
# Calculating tf-idf using a Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

transformer = FeatureUnion([
    ('Source1_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['Text1'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())])),
    ('Source2_tfidf',
     Pipeline([('extract_field',
                FunctionTransformer(lambda x: x['Text2'],
                                    validate=False)),
               ('tfidf',
                TfidfVectorizer())]))])
transformer.fit(Fulldf31)
# Now our vocabulary has merged
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
Source1_vocab = list(transformer.transformer_list[0][1].steps[1][1].get_feature_names_out())
Source2_vocab = list(transformer.transformer_list[1][1].steps[1][1].get_feature_names_out())
vocab = Source1_vocab + Source2_vocab
# vocab
tfidf_vectorizer_vectors31 = transformer.transform(Fulldf31)
After training the model, I computed tf-idf on 100,000 texts, and when I try to predict I get this error:
ValueError: X has a different shape than during fitting.
Answer 0 (score: 0)
Instead of fitting two TfidfVectorizers and then trying to combine them, join their text data row by row and pass the result to a single TfidfVectorizer.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

fruit = ['apple', 'banana', 'pear', 'kiwi']
vegetables = ['tomatoes', 'peppers', 'broccoli', 'carrots']
df = pd.DataFrame(
    {'Fruit': fruit, 'Vegetables': vegetables, 'Integers': np.arange(1, 5)})

# Select the text columns and join them along each row
def prepare_text_data(data):
    # Inspect the DataFrame that was passed in, not the global df
    text_cols = [col for col in data.columns if (data[col].dtype == 'object')]
    text_data = data[text_cols].apply(lambda x: ' '.join(x), axis=1)
    return text_data

pipeline = Pipeline([
    ('text_selector', FunctionTransformer(prepare_text_data,
                                          validate=False)),
    ('vectorizer', TfidfVectorizer())])
pipeline = pipeline.fit(df)
tfidf = pipeline.transform(df)

# Check the vocabulary to verify it contains all tokens from df
pipeline['vectorizer'].vocabulary_
Out[39]:
{'apple': 0,
 'tomatoes': 7,
 'banana': 1,
 'peppers': 6,
 'pear': 5,
 'broccoli': 2,
 'kiwi': 4,
 'carrots': 3}
# Here is the resulting tf-idf matrix with 4 rows and 8 columns, corresponding to
# the number of rows in df and the number of tokens in the tf-idf vocabulary
tfidf.toarray()
Out[40]:
array([[0.70710678, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678],
       [0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        ],
       [0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.70710678, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.70710678, 0.70710678,
        0.        , 0.        , 0.        ]])
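As for the ValueError itself: one common cause (an assumption here, since the question does not show the prediction code) is fitting a new vectorizer on the 100,000 prediction texts, which yields a different vocabulary and therefore a different number of columns than the model saw during training. A minimal sketch of the usual remedy is to fit the pipeline once on the training data and only call transform on new data. The column names Text1 and Text2 come from the question; the tiny DataFrames and the helper join_text_columns are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Column names taken from the question; the data below is illustrative only
TEXT_COLUMNS = ['Text1', 'Text2']


def join_text_columns(data):
    """Join the text columns of each row into a single string."""
    return data[TEXT_COLUMNS].apply(lambda row: ' '.join(row), axis=1)


train_df = pd.DataFrame({'Text1': ['red apple', 'green pear'],
                         'Text2': ['sweet fruit', 'sour fruit']})
new_df = pd.DataFrame({'Text1': ['yellow banana'],
                       'Text2': ['sweet fruit']})

pipeline = Pipeline([
    ('text_selector', FunctionTransformer(join_text_columns, validate=False)),
    ('vectorizer', TfidfVectorizer())])

# Fit once on the training data: this fixes the vocabulary (the columns)
X_train = pipeline.fit_transform(train_df)

# Only transform the new data: tokens unseen during fitting are ignored,
# so the number of columns matches what a downstream model was trained on
X_new = pipeline.transform(new_df)

print(X_train.shape, X_new.shape)
```

Because the same fitted pipeline produces both matrices, they share the same columns, and a model trained on X_train can predict on X_new without a shape mismatch. Calling fit or fit_transform again on new_df would rebuild the vocabulary and bring the mismatch back.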