Question

I use this function to compute tf-idf on text with 1,100,000 samples:

    # Calculating tf-idf using a Pipeline
    transformer = FeatureUnion([
        ('Source1_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['Text1'],
                                        validate=False)),
                   ('tfidf',
                    TfidfVectorizer())])),
        ('Source2_tfidf',
         Pipeline([('extract_field',
                    FunctionTransformer(lambda x: x['Text2'],
                                        validate=False)),
                   ('tfidf',
                    TfidfVectorizer())]))])

    transformer.fit(Fulldf31)

    # Now our vocabulary has merged
    Source1_vocab = transformer.transformer_list[0][1].steps[1][1].get_feature_names()
    Source2_vocab = transformer.transformer_list[1][1].steps[1][1].get_feature_names()
    vocab = Source1_vocab + Source2_vocab
    # vocab

    tfidf_vectorizer_vectors31 = transformer.transform(Fulldf31)

After training the model, I computed tf-idf on 100,000 texts, and when I try to predict I get this error:

  ValueError: X has a different shape than during fitting.

Answer 1

Instead of fitting two TfidfVectorizers and then trying to combine them, join the text columns together row by row and pass the result to a single TfidfVectorizer.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

fruit = ['apple', 'banana', 'pear', 'kiwi']
vegetables = ['tomatoes', 'peppers', 'broccoli', 'carrots']

df = pd.DataFrame(
        {'Fruit': fruit, 'Vegetables': vegetables, 'Integers': np.arange(1, 5)})

# Select text data and join them along each row
def prepare_text_data(data):
    # Check dtypes on the argument `data`, not on the global `df`
    text_cols = [col for col in data.columns if (data[col].dtype == 'object')]
    text_data = data[text_cols].apply(lambda x: ' '.join(x), axis=1)
    return text_data

pipeline = Pipeline([
                     ('text_selector', FunctionTransformer(prepare_text_data,
                                                           validate=False)),
                     ('vectorizer', TfidfVectorizer())])

pipeline = pipeline.fit(df)
tfidf = pipeline.transform(df)

# Check the vocabulary to verify it contains all tokens from df
pipeline['vectorizer'].vocabulary_
Out[39]: 
{'apple': 0,
 'tomatoes': 7,
 'banana': 1,
 'peppers': 6,
 'pear': 5,
 'broccoli': 2,
 'kiwi': 4,
 'carrots': 3}

# Here is the resulting Tfidf matrix with 4 rows and 8 columns corresponding to 
# the number of rows in the df and the number of tokens in the Tfidf vocabulary
tfidf.toarray()
Out[40]: 
array([[0.70710678, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.70710678],
       [0.        , 0.70710678, 0.        , 0.        , 0.        ,
        0.        , 0.70710678, 0.        ],
       [0.        , 0.        , 0.70710678, 0.        , 0.        ,
        0.70710678, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.70710678, 0.70710678,
        0.        , 0.        , 0.        ]])
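
The question does not show the prediction step, so what follows is only a sketch under an assumption: that the error appears because a vectorizer was fitted a second time on the 100,000 new texts, which gives a vocabulary (and therefore a column count) different from the one the model was trained on. If that is the case, keeping the vectorizer and the model in one pipeline and calling only `predict` (or `transform`) on new data should avoid it. The names `join_text_columns`, `train_df`, `new_df` and `train_labels`, the toy data, and the choice of `LogisticRegression` are all illustrative and not taken from the question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def join_text_columns(data):
    # Join the two text columns of each row into a single string.
    return data[['Text1', 'Text2']].apply(lambda row: ' '.join(row), axis=1)


# Hypothetical toy data standing in for the training and prediction sets.
train_df = pd.DataFrame({
    'Text1': ['red apple', 'green pear', 'ripe banana', 'sweet kiwi'],
    'Text2': ['fresh fruit', 'fresh fruit', 'yellow fruit', 'green fruit'],
})
train_labels = [0, 1, 0, 1]
new_df = pd.DataFrame({
    'Text1': ['green apple', 'purple plum'],
    'Text2': ['sour fruit', 'sweet fruit'],
})

# Vectorizer and classifier live in one pipeline, so the vocabulary learned
# during fit is the one reused at prediction time.
model = Pipeline([
    ('text_selector', FunctionTransformer(join_text_columns, validate=False)),
    ('vectorizer', TfidfVectorizer()),
    ('classifier', LogisticRegression()),
])

model.fit(train_df, train_labels)      # fit once, on the training data only
predictions = model.predict(new_df)    # new data is transformed, never re-fitted

# model[:-1] is the pipeline without the classifier (feature steps only).
train_matrix = model[:-1].transform(train_df)
new_matrix = model[:-1].transform(new_df)
print(train_matrix.shape, new_matrix.shape)  # same number of columns

# For contrast: fitting a fresh vectorizer on the new texts builds a different
# vocabulary, so its column count no longer matches what the model expects.
refit_matrix = TfidfVectorizer().fit_transform(join_text_columns(new_df))
print(refit_matrix.shape)
```

Words in the new texts that were not seen during fitting are simply ignored by `transform`, which is why the column count stays fixed.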
