Question

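So basically, I have a test corpus of 350 text files (350 rows), and I built an ML model to predict the gender of the author of the SMS in each text file.

After preprocessing, these are my final lines of code:

(Joined is the preprocessed column in the dataframe df)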

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

y = df['Gender']
# train_test_split comes from model_selection; the old
# sklearn.cross_validation module is gone from current scikit-learn releases
X_train, X_test, y_train, y_test = train_test_split(
    df['Joined'], y, test_size=0.20, random_state=53)

count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)

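Now I have a new test corpus with 150 text files (150 rows), and I have to predict the gender for these files using the previous model!

I created a new dataframe called newdf and preprocessed the test corpus files into a column called new_test, which has 150 rows.

How do I now use the previous nb_classifier model on this new_test column?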

Answer 1

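Assuming you have preprocessed new_test the same way as count_test, you only need to call nb_classifier.predict or predict_proba and pass in the new_test array.

I prefer predict_proba because it returns the probability of each class rather than a single prediction.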
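To make the difference between the two calls concrete, here is a minimal sketch on toy data. The texts, labels, and the names train_texts, vectorizer, and clf are made up for illustration and are not the asker's corpus or variables.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data, not the asker's corpus
train_texts = ["fox and dog", "great food", "lazy brown dog", "really great pizza"]
train_labels = ["M", "F", "M", "F"]

vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# New text is transformed with the already-fitted vectorizer
new_counts = vectorizer.transform(["brown fox", "great pizza"])

labels = clf.predict(new_counts)       # one label per row
probs = clf.predict_proba(new_counts)  # one row per sample, one column per class

print(clf.classes_)  # column order of probs
print(labels)
print(probs)
```

predict gives a single hard label per sample, while each row of predict_proba sums to 1 and lets you apply your own threshold or inspect uncertain cases.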

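Update per comments

It looks like you have run into a dimension problem. Once a MultinomialNB classifier is trained, it can only handle data with the same number of features it was trained on. For example:

You used CountVectorizer to create training data with n samples and m features. Any data passed to the classifier must also have m features, otherwise the classifier does not know how to handle the difference.

So it is critical that, when preprocessing with CountVectorizer, you also use the same fitted instance to transform any data you want to predict on.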
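The mismatch can be seen by comparing feature counts directly. This is a small sketch with made-up sentences (train_texts, new_texts, and the variable names are assumptions for illustration): a vectorizer fitted again on the new text learns its own vocabulary, while the fitted one keeps the training columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences
train_texts = ["the quick red fox", "some great food"]
new_texts = ["fox and dog", "something entirely different"]

fitted = CountVectorizer()
train_counts = fitted.fit_transform(train_texts)

# Problematic: a second vectorizer fitted on the new text builds a
# different vocabulary, so the column count no longer matches training
refit_counts = CountVectorizer().fit_transform(new_texts)

# Intended: reuse the fitted vectorizer with transform() only
new_counts = fitted.transform(new_texts)

print("training features:", train_counts.shape[1])
print("refit on new text:", refit_counts.shape[1])
print("transform with fitted vectorizer:", new_counts.shape[1])
```

Words in the new text that were never seen during training are simply ignored by transform, which is why the column count stays at m.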

In code:

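import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: 'joined' is the preprocessed text column
df = pd.DataFrame({
    'joined': [
        'a sentence', 'This is some great food',
        'the quick red fox jumped over the lazy brown dog'],
    'label': ['M', 'F', 'M']})
# New, unseen data to predict on
df2 = pd.DataFrame({
    'new_text': [
        'a different sentence',
        'something entirely different that hasnt been seen before',
        'fox and dog'],
    'label': ['M', 'M', 'F']})

# Fit the vectorizer on the training text only
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(df.joined.values)

nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, df.label)

# Reuse the fitted vectorizer (transform, not fit_transform) so the new
# data has the same feature columns as the training data
new_test = count_vectorizer.transform(df2.new_text.values)
nb_classifier.predict_proba(new_test)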
array([[0.27272727, 0.72727273],
       [0.33333333, 0.66666667],
       [0.2195122 , 0.7804878 ]])
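The probability columns follow the order of nb_classifier.classes_. As a rough sketch of how the predictions might be attached back to the new dataframe, assuming a newdf with a new_test text column as described in the question (the output column names predicted_gender, p_F, and p_M, and the sample texts, are made up here):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame({
    'joined': ['a sentence', 'This is some great food',
               'the quick red fox jumped over the lazy brown dog'],
    'label': ['M', 'F', 'M']})
# Hypothetical stand-in for the asker's 150-row newdf
newdf = pd.DataFrame({'new_test': ['fox and dog', 'great food here']})

count_vectorizer = CountVectorizer(stop_words='english')
nb_classifier = MultinomialNB().fit(
    count_vectorizer.fit_transform(df['joined']), df['label'])

new_counts = count_vectorizer.transform(newdf['new_test'])
proba = nb_classifier.predict_proba(new_counts)

# classes_ gives the label that belongs to each probability column
proba_df = pd.DataFrame(proba, columns=nb_classifier.classes_, index=newdf.index)
newdf['predicted_gender'] = nb_classifier.predict(new_counts)
result = newdf.join(proba_df.add_prefix('p_'))
print(result)
```

Keeping the probabilities next to the hard label makes it easier to spot rows where the model is close to undecided.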
