Question

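I am building a machine learning model using Pandas, but I am having a hard time applying the model to test data entered by the user. My data is basically a dataframe with two columns: text and sentiment. I want to be able to predict the sentiment of whatever the user enters. Here is what I do: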

1. Train/test the model

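import joblib
import pandas as pd
from sklearn import metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# reading dataset
df = pd.read_csv('dataset/dataset.tsv', sep='\t')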
# splitting training/test set
test_size = 0.1
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['text'], df['sentiment'], test_size=test_size)

# label encode the target variable (i.e. negative = 0, positive = 1)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels instead of refitting
valid_y = encoder.transform(valid_y)

# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

# function to train the model
def train_model(classifier, feature_vector_train, label, feature_vector_valid, name):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # save the trained model in the "models" folder
    joblib.dump(classifier, 'models/' + name + '.pkl') 

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    return metrics.accuracy_score(predictions, valid_y)

# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count, 'NB-COUNT')
print("NB, Count Vectors: ", accuracy)

Everything works fine, with an accuracy of about 80%.

2. Test the model on user input

Then I load the saved model again, get the user input, and try to make a prediction (the user input is hardcoded in input_text for now):

import joblib
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

clf = joblib.load('models/NB-COUNT.pkl')
dataset_df = pd.read_csv('dataset/dataset.tsv', sep='\t')
input_text = 'stackoverflow is the best'  # the sentence I want to predict the sentiment for
test_df = pd.Series(data=input_text)

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(dataset_df['text'])  # fit the count vectorizer again so we can extract features from test_df
features = count_vect.transform(test_df)
result = clf.predict(features)[0]
print(result)

But I get a "dimension mismatch" error:

Traceback (most recent call last):
  File "C:\Users\vdvax\iCloudDrive\Freelance\09. Arabic Sentiment Analysis\test.py", line 20, in <module>
    result = clf.predict(features)[0]
  File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Python36\lib\site-packages\sklearn\naive_bayes.py", line 725, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Python36\lib\site-packages\sklearn\utils\extmath.py", line 135, in safe_sparse_dot
    ret = a * b
  File "C:\Python36\lib\site-packages\scipy\sparse\base.py", line 515, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch

Answer 1

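The dimension mismatch error occurs because the output of the CountVectorizer transform does not have the shape that the fitted estimator expects. This happens because you are fitting a separate CountVectorizer at prediction time instead of reusing the one the model was trained with.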
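As a rough illustration of that mismatch, here is a minimal sketch on a made-up four-sentence corpus (the texts and labels are assumptions for demonstration, not the asker's dataset). A vectorizer fitted on different text learns a different vocabulary, so the number of columns it produces no longer matches what the classifier saw during training:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical, for illustration only)
train_texts = ["good movie", "bad movie", "great film", "terrible film"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Vectorizer used for training: learns a 6-word vocabulary
train_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
x_train = train_vect.fit_transform(train_texts)
clf = MultinomialNB().fit(x_train, train_labels)

# A second vectorizer fitted on other text learns a different vocabulary
other_vect = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
x_other = other_vect.fit_transform(["stackoverflow is the best"])

n_train_features = x_train.shape[1]  # columns the classifier was trained on
n_other_features = x_other.shape[1]  # columns produced by the new vectorizer

# Feeding the mismatched matrix to the classifier raises a ValueError
try:
    clf.predict(x_other)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Reusing the original fitted vectorizer keeps the column layout consistent
x_ok = train_vect.transform(["great movie"])
prediction = clf.predict(x_ok)[0]
```

The exact wording of the error depends on the scikit-learn version, but in either case the fix is the same: transform new text with the vectorizer that was fitted alongside the model.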

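Scikit-learn provides a convenient interface called Pipeline that lets you stack preprocessors and an estimator together in a single estimator class. You should put all of your transformers in a Pipeline ahead of your estimator, and your test data will then be transformed by the already-fitted transformers. Here is a pipelined version of how you would fit your estimator: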

from sklearn import naive_bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# takes a list of tuples where the first arg is the step name,
# and the second is the estimator itself.
pipe = Pipeline([
    ('cvec', CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')),
    ('clf', naive_bayes.MultinomialNB())
])

# you can fit a pipeline in the same way you would any other estimator,
# and it will go sequentially through every stage
pipe.fit(train_x, train_y)

# you can produce predictions by feeding your held-out data into the pipe
# (valid_x is the validation split from the question)
pipe.predict(valid_x)

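Note that this way you also do not have to create lots of copies of your data at the various stages of preprocessing, since the output of one stage is fed directly into the next.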
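To make the stage-to-stage hand-off concrete, the sketch below (again on a made-up toy corpus, which is an assumption and not the asker's data) compares `pipe.predict` with doing the two steps by hand through the pipeline's fitted steps. Both routes should give the same predictions:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only)
texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")),
    ("clf", MultinomialNB()),
])
pipe.fit(texts, labels)

new_texts = ["great movie", "terrible movie"]

# One call: the pipeline vectorizes the raw text, then classifies it
pipe_pred = pipe.predict(new_texts)

# The same thing by hand, using the steps the pipeline fitted
features = pipe.named_steps["cvec"].transform(new_texts)
manual_pred = pipe.named_steps["clf"].predict(features)

same = np.array_equal(pipe_pred, manual_pred)
```

Because the vectorizer lives inside the pipeline, there is no separate object to keep in sync with the classifier.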

Now, for your persistence question. A pipeline can be persisted in the same way as any other model:

joblib.dump(pipe, 'models/NB-COUNT.pkl')
loaded_model = joblib.load('models/NB-COUNT.pkl')
loaded_model.predict(test_df)
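A self-contained sketch of the full round trip might look like the following. The toy corpus, the string labels, and the temporary directory are assumptions made so the example runs on its own; in the asker's setup the path would be 'models/NB-COUNT.pkl' and the training data would come from the TSV file.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only)
texts = ["good movie", "bad movie", "great film", "terrible film"]
labels = ["positive", "negative", "positive", "negative"]

pipe = Pipeline([
    ("cvec", CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")),
    ("clf", MultinomialNB()),
])
pipe.fit(texts, labels)

# Save the whole pipeline (vectorizer + classifier) and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "NB-COUNT.pkl")
    joblib.dump(pipe, model_path)
    loaded_model = joblib.load(model_path)

# The loaded pipeline accepts raw text directly; no vectorizer refit needed
input_text = "great movie"
result = loaded_model.predict([input_text])[0]
```

Note that `predict` expects an iterable of documents, so a single user sentence is wrapped in a list (a `pd.Series` as in the question works as well).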

Scikit-Learn / Pandas: make a prediction using a saved model based on user input
