Feature importance and model score

Time: 2021-01-10 16:53:07

Tags: tensorflow machine-learning random-forest

Hi, I'm practicing with a random forest classifier on the credit card fraud dataset from Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud).

First, I built a model with 20 trees and fitted it to the full dataset (30 features); it scored about 99.96%. I then checked the feature importances, and features 12, 14 and 17 came out as the most important.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# df holds the feature columns, y the 'Class' labels from the Kaggle CSV
x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.25)
model = RandomForestClassifier(n_estimators=20, verbose=2)
model.fit(x_train, y_train)
model.score(x_test, y_test)
================================================================
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    0.2s finished
0.999592708069998
================================================================
importance = model.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %d, Score: %.5f' % (i, v))
================================================================
Feature: 0, Score: 0.01206
Feature: 1, Score: 0.01748
Feature: 2, Score: 0.01454
Feature: 3, Score: 0.02275
Feature: 4, Score: 0.02935
Feature: 5, Score: 0.01019
Feature: 6, Score: 0.01946
Feature: 7, Score: 0.03668
Feature: 8, Score: 0.01059
Feature: 9, Score: 0.03032
Feature: 10, Score: 0.07120
Feature: 11, Score: 0.09098
Feature: 12, Score: 0.10580
Feature: 13, Score: 0.01147
Feature: 14, Score: 0.12103
Feature: 15, Score: 0.01332
Feature: 16, Score: 0.04959
Feature: 17, Score: 0.14742
Feature: 18, Score: 0.04764
Feature: 19, Score: 0.01404
Feature: 20, Score: 0.01091
Feature: 21, Score: 0.01968
Feature: 22, Score: 0.01265
Feature: 23, Score: 0.01125
Feature: 24, Score: 0.00876
Feature: 25, Score: 0.00678
Feature: 26, Score: 0.02034
Feature: 27, Score: 0.00833
Feature: 28, Score: 0.01272
Feature: 29, Score: 0.01267
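
The top features above can also be pulled out programmatically rather than by eye. A minimal sketch (synthetic data stands in for the credit-card frame; `argsort` on `feature_importances_` gives the indices ranked by importance):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

importance = model.feature_importances_
# Indices of features sorted from most to least important
ranked = np.argsort(importance)[::-1]
for i in ranked[:3]:
    print('Feature: %d, Score: %.5f' % (i, importance[i]))
```

Note that impurity-based importances always sum to 1, so they measure relative, not absolute, contribution.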

I wanted to see how the important features affect the model score, so I dropped a bunch of them to see what would happen. However, I noticed that even after dropping every feature except feature 0 (whose importance is only about 0.01), the model score remained high (about 94%).

tiny_x_train = x_train.copy()
tiny_x_test = x_test.copy()
tiny_x_train.drop(columns=df.columns.difference(['V1']), inplace=True)  # Feature 0 is V1
tiny_x_test.drop(columns=df.columns.difference(['V1']), inplace=True)
model.fit(tiny_x_train, y_train)
model.score(tiny_x_test, y_test)
================================================================
0.9473684210526315

My guess is that the score stays high partly because the data is highly imbalanced (fraud occurs in <1% of the records). Is my hypothesis correct, or am I missing something here?
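
To sanity-check this, here is a minimal sketch on synthetic data with a similar ~1% positive rate, using scikit-learn's `DummyClassifier` as a stand-in for a model that ignores the features entirely:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# ~1% positive class, mimicking the fraud rate
X, y = make_classification(n_samples=10000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Always predict the majority ("no fraud") class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
# Accuracy lands close to the majority-class fraction, despite using no features
print(baseline.score(X_te, y_te))
```

If a feature-free baseline already scores in the high 90s, a 94% accuracy from a single feature is much less impressive than it first appears.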

1 Answer:

Answer 0: (score: 0)

You're right.

You are getting such good accuracy because of the dataset. If you check the other scores derived from the confusion matrix (precision, recall, etc.), you will see that your model is not as good as it seems.

As a final comment, suppose I build a model that always returns "No Fraud". What is this dumb model's accuracy? About 99%. So I suggest checking the confusion-matrix metrics to get a better idea of how your model behaves.
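
A minimal sketch of that check (synthetic labels stand in for the real test set; the "always No Fraud" model is just an array of zeros):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10000) < 0.01).astype(int)  # ~1% fraud, like the real data
y_pred = np.zeros_like(y_true)                   # always predict "No Fraud"

print(accuracy_score(y_true, y_pred))   # ~0.99 despite predicting nothing useful
print(recall_score(y_true, y_pred))     # 0.0 -- not a single fraud case caught
print(confusion_matrix(y_true, y_pred))
```

Accuracy looks great while recall on the fraud class is zero, which is exactly why the confusion-matrix metrics matter here.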