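Comparing the accuracy of a random forest (RF) model in Python

Time: 2017-10-09 12:41:14

Tags: python pandas random-forest

I want to compute the model's accuracy on the test dataset. The model produces the following predictions: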

[0 1 0 1 1 1 1 0 1 0 1 0 1 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0
 1 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0]

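How can I compare these predictions with the actual values (B or M in this case) to get the accuracy on the test data? The approach should also generalize to the label values of other datasets. This is the code I used for the RandomForest model: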

import pandas as pd
import numpy as np
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
dataset2 = pd.read_csv(file_path, header=None, sep=',')

train, test = train_test_split(dataset2, test_size=0.1)
y = pd.factorize(train[1])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
features = train.columns[2:]
clf.fit(train[features], y)

# Output echoed by the interpreter for clf.fit(...):
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#             max_depth=None, max_features='auto', max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             n_estimators=10, n_jobs=2, oob_score=False, random_state=0,
#             verbose=0, warm_start=False)
# Apply the Classifier we trained to the test data 
clf.predict(test[features])
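One detail of the factorize call above matters for the comparison: pd.factorize returns both the integer codes and the unique labels, and the codes follow the order of first appearance, so 0 is not guaranteed to mean B. A minimal sketch of mapping predictions back to labels, assuming small made-up label columns in place of train[1] and test[1] and a made-up prediction array in place of the clf.predict output:

```python
import numpy as np
import pandas as pd

# Hypothetical label columns standing in for train[1] and test[1].
train_labels = pd.Series(["M", "B", "B", "M", "B"])
test_labels = pd.Series(["B", "M", "B"])

# factorize returns (codes, uniques); codes follow order of first appearance,
# so here M -> 0 and B -> 1.
codes, uniques = pd.factorize(train_labels)

# Hypothetical predictions in the encoded space, as clf.predict would return.
y_pred = np.array([1, 0, 0])

# Map the integer predictions back to the original labels.
y_pred_label = uniques.to_numpy()[y_pred]

# Accuracy is the fraction of predictions that match the actual labels.
accuracy = np.mean(y_pred_label == test_labels.to_numpy())
print(y_pred_label, accuracy)
```

Keeping uniques around means the same mapping works for any label values, not only B and M.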

1 answer:

Answer 0 (score: 0)

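You can encode B and M with sklearn's preprocessing.LabelEncoder(), as shown below, and convert back to the original labels with inverse_transform(). To evaluate accuracy, you can use ConfusionMatrix() from the pandas_ml package and accuracy_score() from sklearn.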

import pandas as pd
import numpy as np
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'
dataset2 = pd.read_csv(file_path, header=None, sep=',')

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# Encode B, M to 0, 1
y = le.fit_transform(dataset2[1])
dataset2[1] = y

train, test = train_test_split(dataset2, test_size=0.1)
y = train[1]
y_test = test[1]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
features = train.columns[2:]
clf.fit(train[features], y)

# Output echoed by the interpreter for clf.fit(...):
# RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
#             max_depth=None, max_features='auto', max_leaf_nodes=None,
#             min_impurity_split=1e-07, min_samples_leaf=1,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             n_estimators=10, n_jobs=2, oob_score=False, random_state=0,
#             verbose=0, warm_start=False)
# Apply the Classifier we trained to the test data
y_pred = clf.predict(test[features])

# Decode from 0, 1 to B, M
y_test_label = le.inverse_transform(y_test)
y_pred_label = le.inverse_transform(y_pred)

from pandas_ml import ConfusionMatrix
confusion_matrix = ConfusionMatrix(y_test_label, y_pred_label)
print("Confusion matrix:\n%s" % confusion_matrix)
# Confusion matrix:
# Predicted   B   M  __all__
# Actual                    
# B          35   1       36
# M           4  17       21
# __all__    39  18       57

from sklearn.metrics import accuracy_score
accuracy_score(y_test_label, y_pred_label)
# For the confusion matrix shown above, this evaluates to (35 + 17) / 57, roughly 0.912
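The link between the confusion matrix and the accuracy score can be checked on a small offline example. This is a minimal sketch, assuming made-up label arrays (not taken from the WDBC data) and using sklearn.metrics.confusion_matrix in place of pandas_ml:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels, not taken from the WDBC data.
y_true_label = np.array(["B", "M", "B", "B", "M", "M", "B", "M"])

le = LabelEncoder()
# LabelEncoder sorts the classes, so B -> 0 and M -> 1.
y_true = le.fit_transform(y_true_label)

# Hypothetical model output in the encoded space (two deliberate mistakes).
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_pred_label = le.inverse_transform(y_pred)

acc = accuracy_score(y_true_label, y_pred_label)

# Rows are actual classes, columns are predicted classes, in le.classes_ order.
cm = confusion_matrix(y_true_label, y_pred_label, labels=le.classes_)

# Accuracy is the sum of the diagonal divided by the number of samples.
acc_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(acc, acc_from_cm)
```

Both ways of computing the accuracy agree, which makes the diagonal of the confusion matrix a quick sanity check on any reported score.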

Note that pandas_ml can be installed with pip as follows.

pip install pandas_ml
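If installing an extra package is not an option, a similar labelled table can be built with pandas alone. This is a minimal sketch, assuming made-up actual and predicted label series; pd.crosstab with margins=True adds the row and column totals, and margins_name is set to "__all__" here only to mirror the layout printed above:

```python
import pandas as pd

# Hypothetical actual and predicted labels, not taken from the WDBC data.
actual = pd.Series(["B", "B", "M", "M", "B", "M"], name="Actual")
predicted = pd.Series(["B", "M", "M", "M", "B", "B"], name="Predicted")

# Cross-tabulate actual against predicted labels, with totals in the margins.
table = pd.crosstab(actual, predicted, margins=True, margins_name="__all__")

# Accuracy is the fraction of positions where the two series agree.
accuracy = (actual == predicted).mean()

print(table)
print(accuracy)
```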