Hi, I'm working with a difficult dataset: the correlation between the inputs and the output is low, yet the results are suspiciously good (99.9% accuracy on the test set). I'm sure I'm doing something wrong, I just don't know what.
The label is the 'unsafe' column, which is 0 or 1 (it was originally 0 or 100, but I capped the maximum value; that made no difference to the results). I started with a random forest, then ran k-nearest neighbors and got almost the same accuracy of 99.9%. A screenshot of the df:
There are far more 0s than 1s (out of about 80,000 training rows, only 169 are 1, plus one more 1 at the very end, but that's just how the original file was imported).
import os
import glob
import numpy as np
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_pickle('/Users/shellyganga/Downloads/ola.pickle')
maxVal = 1
df.unsafe = df['unsafe'].where(df['unsafe'] <= maxVal, maxVal)
print(df.head())
df.drop(df.columns[0], axis=1, inplace=True)
df.drop(df.columns[-2], axis=1, inplace=True)
#setting features and labels
labels = np.array(df['unsafe'])
features= df.drop('unsafe', axis = 1)
# Saving feature names for later use
feature_list = list(features.columns)
# Convert to numpy array
features = np.array(features)
from sklearn.model_selection import train_test_split
# 30% examples in test data
train, test, train_labels, test_labels = train_test_split(features, labels,
stratify = labels,
test_size = 0.3,
random_state = 0)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(train, train_labels)
print(np.mean(train_labels))
print(train_labels.shape)
print('accuracy on train: {:.5f}'.format(knn.score(train, train_labels)))
print('accuracy on test: {:.5f}'.format(knn.score(test, test_labels)))
Output:
0.0023654350798950337
(81169,)
accuracy on train: 0.99763
accuracy on test: 0.99761
Answer (score: 2):
The fact that you have many more 0 instances than 1 instances is an example of class imbalance. Here is a really cool stats.stackexchange question on the topic.
Basically, if only 169 of your 80,000 labels are 1 and the rest are 0, then your model can naively predict the label 0 for every instance and still achieve a training-set accuracy (= percentage of correctly classified instances) of 99.78875%.
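As a quick sanity check (a sketch, not from the original post), scikit-learn's `DummyClassifier` reproduces exactly this majority-class baseline on synthetic labels with the same imbalance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels mirroring the imbalance above: 169 ones out of 80,000
y = np.zeros(80000, dtype=int)
y[:169] = 1
X = np.zeros((80000, 1))  # features are irrelevant to the dummy model

# Always predict the majority class (0)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X, y)

# Predicting 0 everywhere already scores ~99.79% accuracy
print('baseline accuracy: {:.5f}'.format(dummy.score(X, y)))
```

So a 99.76% accuracy from KNN is no better than never predicting the positive class at all.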
I suggest trying the F1 score, which is the harmonic mean of precision (AKA positive predictive value = TP / (TP + FP)) and recall (AKA sensitivity = TP / (TP + FN)): https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score
from sklearn.metrics import f1_score
# f1_score expects (y_true, y_pred), so predict first rather than passing the features
print('F1 score on train: {:.5f}'.format(f1_score(train_labels, knn.predict(train))))
print('F1 score on test: {:.5f}'.format(f1_score(test_labels, knn.predict(test))))
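Beyond a single F1 number, a confusion matrix makes the failure mode visible. A small self-contained sketch (synthetic data, not the post's `df`) showing what an "always predict 0" model looks like under these metrics:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic imbalanced ground truth: 10 positives out of 1000
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1
# A degenerate model that predicts 0 for everything
y_pred = np.zeros(1000, dtype=int)

# Rows = true class, columns = predicted class:
# all 10 positives land in the false-negative cell
print(confusion_matrix(y_true, y_pred))
# Per-class precision/recall/F1; class 1 scores 0 across the board
print(classification_report(y_true, y_pred, zero_division=0))
```

If the confusion matrix on your real test set looks like this (an empty positive column), consider resampling the training data or passing `class_weight='balanced'` to classifiers that support it.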