class_weight hyperparameter in Random Forest changes the number of samples in the confusion matrix

Asked: 2017-11-02 15:41:21

Tags: machine-learning scikit-learn random-forest confusion-matrix

I'm currently working on a Random Forest classification model with 24,000 samples, of which 20,000 belong to class 0 and 4,000 to class 1. I made a train_test_split where the test set is 0.2 of the whole dataset (around 4,800 samples). Since I'm dealing with imbalanced data, I looked at the class_weight hyperparameter, which aims to address this issue.

The problem is that the moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:

array([[13209,   747],
       [ 2776,  2468]])

As I read it, the lower row corresponds to False Negative = 2776 followed by True Positive = 2468, while the upper row corresponds to True Negative = 13209 followed by False Positive = 747. The problem is that, according to the confusion_matrix, the number of samples belonging to class 1 is 2,776 (False Negative) + 2,468 (True Positive) = 5,244. This doesn't make any sense, since the whole dataset contains only 4,000 class 1 samples, of which only about 3,200 are in the training set.

It looks like confusion_matrix returns a transposed version of the matrix, because the actual number of class 1 samples should be about 3,200 in the training set and 800 in the test set. The right numbers should instead be 747 + 2,468 = 3,215, which is roughly the expected number of class 1 training samples.

Can someone explain what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way? I have tried looking for an answer and visited several questions which are somewhat similar, but none of them really covered this issue.
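To see what should happen, here is a minimal sketch using a synthetic stand-in for the dataset (make_classification with an assumed 5:1 imbalance and default model settings, not the real data): whatever value class_weight takes, the row sums of confusion_matrix always equal the actual class counts in y_true.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 24,000-sample, 5:1 imbalanced dataset
X, y = make_classification(n_samples=24000, weights=[20000 / 24000],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                             random_state=0)
clf.fit(X_tr, y_tr)

cm = confusion_matrix(y_tr, clf.predict(X_tr))
# Row i of the matrix holds the samples whose TRUE label is i,
# so each row sum must equal that class's count in y_tr:
print(cm.sum(axis=1) == np.bincount(y_tr))  # [ True  True]
```

If a row sum exceeds the true count of that class, as in the question, the arguments were almost certainly swapped somewhere.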

Those are some of the sources I looked at:

scikit-learn: Random forest class_weight and sample_weight parameters

How to tune parameters in Random Forest, using Scikit Learn?

https://datascience.stackexchange.com/questions/11564/how-does-class-weights-work-in-randomforestclassifier

https://stats.stackexchange.com/questions/244630/difference-between-sample-weight-and-class-weight-randomforest-classifier

using sample_weight and class_weight in imbalanced dataset with RandomForest Classifier

Any help would be appreciated, thanks.

1 Answer:

Answer 0 (score: 1)

As explained in the docs, the i-th row of confusion_matrix corresponds to the true class i. Reproducing with a toy example:
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
(tn, fp, fn, tp)
# (0, 2, 1, 1)

So your reading of the confusion matrix you provided appears to be correct.
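Checking the row semantics directly on another small toy example (my own data, not from the question):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows follow the TRUE labels, columns the PREDICTED ones:
print(cm)
# [[1 2]
#  [0 1]]
print(cm.sum(axis=1))  # [3 1] -- exactly the true class counts
```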

"Is it true that confusion_matrix returns a transposed version of the matrix?"

As shown in the example above, no. However, a very simple (and seemingly innocent) mistake could be that you have swapped the order of the y_true and y_pred arguments, which matters; the result is indeed a transposed matrix:

# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
#        [1, 1]])

# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
#        [2, 1]])

Whether this is the cause is impossible to tell from the information you have provided, which is a good reminder of why you should always post your actual code, rather than a verbal description of what you think your code is doing...
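One simple safeguard (my suggestion, not part of the original answer): pass the arguments by keyword, so a silent swap cannot happen.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]

# Keyword arguments make the intended order explicit:
cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
print(cm)
# [[0 2]
#  [1 1]]
```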