Suspiciously high accuracy when running ML

Time: 2018-11-15 23:43:11

Tags: python pandas scikit-learn

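Hello, I'm working with a difficult dataset: the correlation between the inputs and the output is low, yet the results are extremely good (99.9% accuracy on the test set). I'm sure I'm doing something wrong, I just don't know what.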

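The label is the 'unsafe' column, which is 0 or 1 (it was originally 0 or 100, but I capped the maximum value, which made no difference to the results). I started with a random forest, then ran k-nearest neighbors, and got almost the same accuracy, 99.9%. Screenshots of the df: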

[Two screenshots of the DataFrame; images not included]

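There are far more 0s than 1s (in a training set of about 80,000 rows there are only 169 1s; there is also a 1 at the very end, but that is just how the original file was imported).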

import os
import glob

import numpy as np
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_pickle('/Users/shellyganga/Downloads/ola.pickle')

maxVal = 1
df.unsafe = df['unsafe'].where(df['unsafe'] <= maxVal, maxVal)

print(df.head())

df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[-2], axis=1, inplace=True)

#setting features and labels
labels = np.array(df['unsafe'])
features= df.drop('unsafe', axis = 1)

# Saving feature names for later use
feature_list = list(features.columns)

# Convert to numpy array
features = np.array(features)

from sklearn.model_selection import train_test_split

# 30% examples in test data
train, test, train_labels, test_labels = train_test_split(features, labels,
                                                          stratify = labels,
                                                          test_size = 0.3,
                                                          random_state = 0)

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(train, train_labels)

print(np.mean(train_labels))
print(train_labels.shape)

print('accuracy on train: {:.5f}'.format(knn.score(train, train_labels)))
print('accuracy on test: {:.5f}'.format(knn.score(test, test_labels)))

Output:

0.0023654350798950337
(81169,)
accuracy on train: 0.99763
accuracy on test: 0.99761

1 answer:

Answer 0 (score: 2)

The fact that there are far more 0 instances than 1 instances is an example of class imbalance. Here is a really cool stats.stackexchange question on the topic.

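Basically, if only 169 of your 80,000 labels are 1 and the rest are 0, your model can naively predict label 0 for every instance and still have a training-set accuracy (= the percentage of correctly classified instances) of 99.78875%.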
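To make that arithmetic concrete, here is a minimal sketch in plain Python. It assumes exactly 80,000 labels with 169 positives (the rounded figures from the question, not the real data), and the "always predict 0" model is a hypothetical baseline, not something from the original code.

```python
# Counts taken from the question (rounded): 80,000 labels, 169 of them equal to 1.
n_total = 80_000
n_positive = 169

# Hypothetical baseline "model": predict 0 for every instance.
labels = [1] * n_positive + [0] * (n_total - n_positive)
predictions = [0] * n_total

# Accuracy = fraction of instances whose prediction matches the label.
correct = sum(y == p for y, p in zip(labels, predictions))
baseline_accuracy = correct / n_total

# Recall = fraction of the true 1s that the model actually finds.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
baseline_recall = true_positives / n_positive

print('baseline accuracy: {:.7f}'.format(baseline_accuracy))
print('baseline recall:   {:.7f}'.format(baseline_recall))
```

So the baseline scores about 99.79% accuracy while finding none of the unsafe cases, which is why accuracy alone says very little here.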

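I suggest trying the F1 score, which is the harmonic mean of precision (a.k.a. positive predictive value = TP / (TP + FP)) and recall (a.k.a. sensitivity = TP / (TP + FN)): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score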

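from sklearn.metrics import f1_score

# f1_score expects (y_true, y_pred), so compare the labels with the model's predictions
print('F1 score on train: {:.5f}'.format(f1_score(train_labels, knn.predict(train))))
print('F1 score on test:  {:.5f}'.format(f1_score(test_labels, knn.predict(test))))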
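If it helps to see what F1 measures, the definition above can be worked through by hand. The sketch below is plain Python on made-up toy labels (not your data); the helper name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are my own assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumption: treat an undefined ratio (zero denominator) as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, invented for illustration: 2 positives among 10 instances.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0] * 10                      # never predicts the positive class
one_hit = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]    # 1 TP, 1 FP, 1 FN

for name, y_pred in [('always_zero', always_zero), ('one_hit', one_hit)]:
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print('{}: accuracy={:.2f} precision={:.2f} recall={:.2f} f1={:.2f}'.format(
        name, accuracy(y_true, y_pred), p, r, f1))
```

Both toy predictors have the same accuracy, but only the one that actually finds a positive gets a nonzero F1, which is the distinction accuracy hides on imbalanced data.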