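I'm building a machine learning model using Pandas, but I'm having trouble applying the model to user-supplied test input. My data is basically a DataFrame with two columns: text and sentiment. I want to be able to predict the sentiment of what a user types in. Here is what I do: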
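import joblib
import pandas as pd
from sklearn import metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import CountVectorizer
# reading dataset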
df = pd.read_csv('dataset/dataset.tsv', sep='\t')
# splitting training/test set
test_size = 0.1
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['sentiment'], test_size=test_size)
# label encode the target variable (i.e. negative = 0, positive = 1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the mapping learned from the training labels
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
# function to train the model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, name):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # save the trained model in the "models" folder
    joblib.dump(classifier, 'models/' + name + '.pkl')
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(predictions, valid_y)
# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count, 'NB-COUNT')
print("NB, Count Vectors: ", accuracy)
Everything works fine, and the accuracy is about 80%.
Then I load the saved model again, take the user input, and try to make a prediction (for now the user input is hard-coded in input_text):
clf = joblib.load('models/NB-COUNT.pkl')
dataset_df = pd.read_csv('dataset/dataset.tsv', sep='\t')
input_text = 'stackoverflow is the best' # the sentence I want to predict the sentiment for
test_df = pd.Series(data=input_text)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(dataset_df['text']) # fit the count vectorizer again so we can extract features from test_df
features = count_vect.transform(test_df)
result = clf.predict(features)[0]
print(result)
But I get a "dimension mismatch" error:
Traceback (most recent call last):
File "C:\Users\vdvax\iCloudDrive\Freelance\09. Arabic Sentiment Analysis\test.py", line 20, in <module>
result = clf.predict(features)[0]
File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
jll = self._joint_log_likelihood(X)
File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
return (safe_sparse_dot(X, self.feature_log_prob_.T) +
File "C:\Python36\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
ret = a * b
File "C:\Python36\lib\site-packages\scipy\sparse\base.py", line 515, in __mul__
raise ValueError('dimension mismatch')
ValueError: dimension mismatch
Answer 0 (score: 1)
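The dimension mismatch error arises because the output of the CountVectorizer transform does not match the shape that the fitted estimator expects. This happens because you are fitting a separate CountVectorizer for your test data, and its vocabulary (and therefore its number of feature columns) need not match the one the model was trained with.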
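To make the failure concrete, here is a minimal sketch on a made-up two-sentence corpus (the sentences, labels, and variable names are illustrative assumptions, not your data). A classifier trained on features from one vectorizer rejects features from a second vectorizer that learned a different vocabulary, while reusing the training-time vectorizer always yields the expected number of columns:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, for illustration only.
train_texts = ["good movie", "bad movie"]
train_labels = [1, 0]

# The training-time vectorizer learns three tokens: bad, good, movie.
train_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
X_train = train_vect.fit_transform(train_texts)
clf = MultinomialNB().fit(X_train, train_labels)

# A second vectorizer fitted on different text learns a different vocabulary.
other_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
X_other = other_vect.fit_transform(["stackoverflow is the best"])

# The two feature matrices have different numbers of columns.
print(X_train.shape[1], X_other.shape[1])

try:
    clf.predict(X_other)
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print("ValueError:", exc)

# Transforming with the vectorizer fitted at training time keeps the width
# the classifier expects; unseen words are simply ignored.
X_ok = train_vect.transform(["stackoverflow is the best"])
prediction = clf.predict(X_ok)
```

The exact wording of the ValueError depends on the scikit-learn version, but the cause is the same: the column count at prediction time differs from the one seen during fit.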
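Scikit-learn provides a convenient interface called Pipeline that lets you stack preprocessors and an estimator together in a single estimator class. You should put all of your transformers in a Pipeline ahead of the estimator; your test data will then be transformed by the already-fitted transformers. Here is how you would fit a pipelined version of your estimator: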
from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# takes a list of tuples where the first element is the step name,
# and the second is the transformer or estimator itself.
pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB())
])
# you can fit a pipeline in the same way you would any other estimator,
# and it will go sequentially through every stage.
# train_x is the raw text here, not the count matrix.
pipe.fit(train_x, train_y)
# you can produce predictions by feeding raw text into the pipe,
# e.g. valid_x from the split in the question
pipe.predict(valid_x)
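Note that this way you also don't have to create lots of copies of your data across the preprocessing stages, because the output of one stage is fed directly into the next.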
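As a rough sketch of what that chaining means at prediction time, the snippet below fits a pipeline on a small invented corpus (the texts and labels are assumptions for illustration) and checks that pipe.predict gives the same result as calling the fitted vectorizer and the fitted classifier by hand:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data, for illustration only.
texts = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]
new_texts = ["good film", "stackoverflow is the best"]

pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")),
    ("clf", MultinomialNB()),
])
pipe.fit(texts, labels)

# The same prediction spelled out step by step: transform with the
# vectorizer fitted during pipe.fit, then classify the result.
fitted_vect = pipe.named_steps["cvec"]
fitted_clf = pipe.named_steps["clf"]
manual_pred = fitted_clf.predict(fitted_vect.transform(new_texts))

pipe_pred = pipe.predict(new_texts)
same = np.array_equal(manual_pred, pipe_pred)
print(same)
```

Because the vectorizer inside the pipeline is never refitted at prediction time, the feature width always matches what the classifier saw during training.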
Now, for your persistence problem. A pipeline can be persisted in the same way as any other model:
joblib.dump(pipe, 'models/NB-COUNT.pkl')
loaded_model = joblib.load('models/NB-COUNT.pkl')
loaded_model.predict(test_df)
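Putting it together, a self-contained round trip might look like the sketch below. The toy corpus, the string labels, and the temporary directory (used here instead of your models folder so the example runs anywhere) are assumptions for illustration. The point is that the saved file contains the fitted vectorizer as well as the classifier, so nothing has to be refitted before predicting on user input:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data standing in for dataset.tsv.
texts = ["good movie", "bad movie", "great film", "awful film"]
labels = ["positive", "negative", "positive", "negative"]

pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")),
    ("clf", MultinomialNB()),
]).fit(texts, labels)

# Save and reload the whole pipeline (vectorizer + classifier) as one object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "NB-COUNT.pkl")
    joblib.dump(pipe, model_path)
    loaded_model = joblib.load(model_path)

# Raw text goes straight in; predict expects an iterable of documents,
# so a single sentence is wrapped in a list.
input_text = "stackoverflow is the best"
result = loaded_model.predict([input_text])[0]
print(result)
```

If you would rather keep the vectorizer and classifier as separate objects, the same idea applies: dump the fitted CountVectorizer alongside the model and load both at prediction time instead of fitting a new one.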