Making predictions with a trained model

Asked: 2020-03-23 15:12:56

Tags: python machine-learning scikit-learn regression logistic-regression

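I built a model using logistic regression and saved it with joblib. Later, I tried to load that model and predict labels for test.csv. Whenever I try this, I get an error saying "X has 1433445 features per sample; expecting 3797015". Here is my initial code: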

import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression


#reading data 
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')

train=train.iloc[:,1:]
test=test.iloc[:,1:]

test.info()
train.info()

test['label']='t'

test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']


transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)


targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)

#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)



logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of Lasso classifier on training set: {:.2f}'
     .format(logreg.score(X_train, y_train)))
print('Accuracy of Lasso classifier on test set: {:.2f}'
     .format(logreg.score(X_test, y_test)))


targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)

example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()

#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')

Later, I load the model in a separate script:

import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load

test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']

#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))


test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check

#load_model

logreg = load('mypredmodel1.joblib')


example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)

When I run it, I get this error:

predictions = logreg.predict(example_counts)
Traceback (most recent call last):

  File "<ipython-input-58-f28afd294d38>", line 1, in <module>
    predictions = logreg.predict(example_counts)

  File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
    scores = self.decision_function(X)

  File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
    % (X.shape[1], n_features))

ValueError: X has 1433445 features per sample; expecting 3797015

1 answer:

Answer 0 (score: 1)

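The most likely cause is that you are re-fitting the transformers on the test set. You should never do this. A re-fitted CountVectorizer learns a new vocabulary from the test text, so the number of columns it produces no longer matches what the model was trained on. Instead, fit the transformers on your training set, save them, and use them only to transform the test set (or any other future data).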
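
To see why re-fitting changes the feature count, here is a minimal sketch on a toy corpus. The documents and variable names are made up for illustration and are not taken from the question's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, invented for this illustration only
train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
test_docs = ["the cat sang"]

# Fit the vocabulary once, on the training text
vectorizer = CountVectorizer(ngram_range=(1, 2))
train_counts = vectorizer.fit_transform(train_docs)

# Correct: transform() reuses the training vocabulary,
# so the test matrix has the same number of columns
test_ok = vectorizer.transform(test_docs)

# Incorrect: a fresh vectorizer fitted on the test text learns a
# different (here much smaller) vocabulary
refit_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_bad = refit_vectorizer.fit_transform(test_docs)

print("train features:      ", train_counts.shape[1])
print("transform() features:", test_ok.shape[1])
print("re-fitted features:  ", test_bad.shape[1])
```

A model trained on `train_counts` can score `test_ok`, but passing it `test_bad` raises the same kind of "X has ... features" error shown in the question.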

This is easier to do with a pipeline.

So, remove the following lines from your first code block:

transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)


targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)

and replace them with:

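from joblib import dump
from sklearn.pipeline import Pipeline

# Chain the vectorizer and the TF-IDF transformer so they are fitted together
pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False))
])

# Fit on the training text only
pipeline.fit(train['total'].values)

tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values

# Transform (never re-fit) the test text
test_tfidf = pipeline.transform(test['total'].values)

# Save the fitted transformers so the prediction script can reuse them
dump(pipeline, 'transform_predict.joblib')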

Now, in your second code block, remove this part:

#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))

test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check

and replace it with:

pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)

After that, you should be fine as long as you predict on the test_tfidf variable rather than on example_counts, which has not been TF-IDF transformed:

predictions = logreg.predict(test_tfidf)
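
As a variation on the approach above (not part of the original answer), the classifier can be placed inside the same pipeline, so that a single joblib file holds the vectorizer, the TF-IDF transformer, and the model. The sketch below uses a made-up toy dataset, a hypothetical file name `text_clf.joblib`, and a temporary directory. In the question's setting the inputs would be `train['total']`, `train['label']`, and `test['total']`.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, invented for this illustration only
train_texts = [
    "government announces new budget",
    "aliens secretly control the budget",
    "parliament passes health bill",
    "miracle pill cures everything overnight",
]
train_labels = [0, 1, 0, 1]

# Vectorizer, TF-IDF, and classifier are fitted as one unit
text_clf = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer(smooth_idf=False)),
    ("logreg", LogisticRegression()),
])
text_clf.fit(train_texts, train_labels)

# Save the whole fitted pipeline (hypothetical path in a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.joblib")
dump(text_clf, model_path)

# In the prediction script: load and predict directly on raw text.
# No CountVectorizer or TfidfTransformer is created or fitted here.
loaded_clf = load(model_path)
new_texts = ["new health budget announced", "secret miracle pill"]
predictions = loaded_clf.predict(new_texts)
print(predictions)
```

Keeping the classifier in the pipeline also guarantees that it is trained and applied on identically transformed features, which avoids mixing raw counts and TF-IDF values between training and prediction.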