Question

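from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# ItemSelector is not part of scikit-learn; it is assumed to be a
# user-defined transformer that selects one column of the input.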
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('Comments', Pipeline([
            ('selector', ItemSelector(column="Comments")),
            ('tfidf', TfidfVectorizer(use_idf=False, ngram_range=(1, 2), max_df=0.95, min_df=0, sublinear_tf=True)),
        ])),
        ('Vendor', Pipeline([
            ('selector', ItemSelector(column="Vendor Name")),
            ('tfidf', TfidfVectorizer(use_idf=False)),
        ])),
    ])),
    ('clf', RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=45)),
    # ('clf', LogisticRegression())
])


X_train, X_test, y_train, y_test = train_test_split(X,
                                df['code Description'],
                                test_size=0.3,
                                train_size=0.7,
                                random_state=100)
model = pipeline.fit(X_train, y_train)
s = pipeline.score(X_test, y_test)        # mean accuracy on the test set
pred = model.predict(X_test)              # predicted class labels
predicted = model.predict_proba(X_test)   # per-class probabilities

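For some samples, the output of predict matches the predicted probabilities. But in some cases, for example with

proba_predict = [0.3,0.18,0.155]

the sample is classified as class B instead of class A.

Predicted class: B

Actual class: A

The right column is my label and the left column is my input text data: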

Answer 1

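I think you are describing the following situation: for a test vector X_test, the predict_proba() method gives a predicted probability distribution y = [p1, p2, p3] with p1 > p2 and p1 > p3, but the predict() method does not output class 0 for this vector.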

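If you check the source code of the predict function of sklearn's RandomForestClassifier, you will see that the RandomForest's predict_proba() method is called there:

proba = self.predict_proba(X)

From these probabilities, argmax is used to output the class.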

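So the prediction step derives its output from predict_proba() by taking the argmax. To me, it seems impossible for anything to go wrong there.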
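
This can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset from make_classification in place of the asker's text pipeline (the data, sizes, and variable names are illustrative and not taken from the question). It compares predict() with the argmax of predict_proba() mapped through the fitted classes_ attribute:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the asker's features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=45).fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, n_classes)
pred = clf.predict(X)          # predicted class labels

# Columns of proba follow clf.classes_, so the argmax index has to be
# mapped through classes_ to get a label.
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]

agree = bool(np.array_equal(pred, pred_from_proba))
print(agree)
```

If the two arrays agree here but seem to disagree on your own data, the difference most likely comes from how the probability columns are being matched to class names rather than from the classifier.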

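I think you mixed up some class names somewhere in your routine and got confused there. But with the information you provided, a more detailed answer is not possible.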
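
One way such a mix-up can happen is by assuming the probability columns follow the order in which labels appear in the data. A hedged sketch, using toy random features and made-up string labels "A", "B", "C" (all assumptions for illustration), shows that classes_ is sorted and how to pair each probability with its label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (assumption): string labels that appear in the order C, A, B.
rng = np.random.RandomState(0)
X = rng.rand(90, 4)
y = np.array(["C", "A", "B"] * 30)

clf = RandomForestClassifier(n_estimators=20, random_state=45).fit(X, y)

# classes_ is sorted, regardless of the order labels appear in y.
print(list(clf.classes_))

# Pair each probability with its label for one sample, highest first.
row = clf.predict_proba(X[:1])[0]
ranked = sorted(zip(clf.classes_, row), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(label, round(float(p), 3))

# The top-ranked label is the one predict() returns for that sample.
top_label = ranked[0][0]
```

Printing the labels next to the probabilities like this makes it easy to see which class a value such as 0.3 actually belongs to.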

Random forest classifier: predict_proba() results do not match predict()?
