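I trained a logistic regression model for multi-class classification of text data. I want to generate a sample prediction from the model, but I get this error: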
ValueError: X has 30 features per sample; expecting 100000
This is the code that vectorizes the text data:
tfidf_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=50000, ngram_range=(1, 3),
                              stop_words='english', strip_accents='ascii')),
])
preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('short_description', tfidf_pipeline, 'short_description'),
        ('details', tfidf_pipeline, 'details'),
    ])
This is the code I am trying to run, which raises the error above:
d = {'short_description': ['[mitigated] [ubl5] ssd slam station not working'],
     'details': ['ssd slam station not working, unable to take slam from the station.']}
df_test = pd.DataFrame(data=d)
X = df_test[['short_description', 'details']]
X_prep = preprocessor_pipeline.fit_transform(X)
y_p = lr.predict(X_prep)
Answer 0 (score: 2)
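The preprocessor_pipeline must be the same for the training and test steps: fit it once on the training data, then reuse that fitted object to transform the test data.
Here is a minimal reproducible example: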
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
raw_input = [
    "first sentence looks like this",
    "second sentence looks like that",
    "it's going to demonstrate something",
]
vectorizer = TfidfVectorizer(stop_words="english", strip_accents="ascii")
X = vectorizer.fit_transform(raw_input)
y = np.array([0, 0, 1])
clf = LogisticRegression()
clf.fit(X, y)
d = {
    "short_description": ["[mitigated] [ubl5] ssd slam station not working"],
    "details": ["ssd slam station not working, unable to take slam from the station."],
}
df_test = pd.DataFrame(data=d)
# Reproduces the problem: fit_transform re-learns the vocabulary from the test data
X_test = vectorizer.fit_transform(df_test)
print(clf.predict(X_test))
Result:
Traceback (most recent call last):
  File "vectorizer_test.py", line 27, in <module>
    print(clf.predict(X_test))
  File "/home/hayesall/miniconda3/envs/stackoverflow/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 309, in predict
    scores = self.decision_function(X)
  File "/home/hayesall/miniconda3/envs/stackoverflow/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 289, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 2 features per sample; expecting 6
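The mismatch arises because fit_transform learns a new vocabulary from whatever it is given, so the number of columns no longer matches what the classifier was trained on. A small sketch of this, assuming a plain list of strings and default TfidfVectorizer settings (both are illustrative choices of mine, not the setup from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "first sentence looks like this",
    "second sentence looks like that",
    "it's going to demonstrate something",
]
# Illustrative new document (assumption, shortened from the question's text)
new_docs = ["ssd slam station not working"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# transform() reuses the vocabulary learned from train_docs
X_reused = vectorizer.transform(new_docs)

# A second vectorizer fitted on new_docs learns a different vocabulary,
# which is what calling fit_transform on the test data amounts to
X_refit = TfidfVectorizer().fit_transform(new_docs)

print("train columns: ", X_train.shape[1])
print("reused columns:", X_reused.shape[1])
print("refit columns: ", X_refit.shape[1])
```

Only the matrix produced by transform() has the same number of columns as the training matrix, so only that one can be passed to a classifier fitted on X_train.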
It needs transform rather than fit_transform:
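# Reuse the vocabulary learned during training instead of re-fitting it.
# Note: iterating over a DataFrame yields its column names, so two
# "documents" (the column labels) are scored here, hence two predictions.
X_test = vectorizer.transform(df_test)
print(clf.predict(X_test))
# [0 0]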
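For the setup in the question, the same fix is to call preprocessor_pipeline.transform(X) on the already-fitted preprocessor instead of fit_transform. One way to make this hard to get wrong is to put the ColumnTransformer and the classifier into a single Pipeline, so the preprocessing is fitted only when fit is called. The following is a rough sketch under assumptions: the training rows and labels in df_train are invented for illustration, and the names model, "prep" and "lr" are my own.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data; the real training set is not shown in the question
df_train = pd.DataFrame({
    "short_description": [
        "ssd slam station not working",
        "printer out of toner",
        "badge reader offline",
        "slam station label misprint",
    ],
    "details": [
        "unable to take slam from the station",
        "printer reports empty toner cartridge",
        "badge reader at dock door shows no power",
        "labels from slam station print blank",
    ],
    "label": ["station", "printer", "badge", "station"],
})


def make_tfidf():
    # Same vectorizer settings as in the question
    return TfidfVectorizer(max_features=50000, ngram_range=(1, 3),
                           stop_words="english", strip_accents="ascii")


preprocessor = ColumnTransformer(
    transformers=[
        ("short_description", make_tfidf(), "short_description"),
        ("details", make_tfidf(), "details"),
    ]
)

# Preprocessing and classifier are fitted together, once, on the training data
model = Pipeline([("prep", preprocessor), ("lr", LogisticRegression())])
model.fit(df_train[["short_description", "details"]], df_train["label"])

df_test = pd.DataFrame({
    "short_description": ["[mitigated] [ubl5] ssd slam station not working"],
    "details": ["ssd slam station not working, unable to take slam from the station."],
})

# predict() only calls transform() on the fitted preprocessor
y_p = model.predict(df_test[["short_description", "details"]])
print(y_p)
```

With this arrangement there is no separate preprocessing call at prediction time, so fit_transform cannot be applied to the test data by accident.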