I'm currently working on a Random Forest classification model on a dataset of 24,000 samples, where 20,000 belong to class 0 and 4,000 belong to class 1. I made a train_test_split where the test set is 0.2 of the whole dataset (around 4,800 samples in the test set). Since I'm dealing with imbalanced data, I looked at the hyperparameter class_weight, which is aimed at solving this issue.
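(For reference, my understanding is that 'balanced' reweights each class by n_samples / (n_classes * np.bincount(y)); here's a quick sketch with the class sizes above, just to make the weights concrete:)

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels matching the dataset described above: 20,000 zeros, 4,000 ones
y = np.array([0] * 20000 + [1] * 4000)

# class_weight='balanced' uses n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # [0.6 3. ]
```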
The problem: the moment I set class_weight='balanced' and look at the confusion_matrix of the training set, I get something like this:
array([[13209, 747],
[ 2776, 2468]])
As you can see, the lower row corresponds to False Negatives = 2776 followed by True Positives = 2468, while the upper row corresponds to True Negatives = 13209 followed by False Positives = 747. The problem is that, according to this confusion_matrix, the number of samples belonging to class 1 is 2,776 (False Negatives) + 2,468 (True Positives), which sums up to 5,244 samples. This doesn't make any sense, since the whole dataset contains only 4,000 samples belonging to class 1, of which only about 3,200 are in the train set. It looks like confusion_matrix returns a transposed version of the matrix, because the actual number of class 1 samples should come to about 3,200 in the train set and 800 in the test set. In other words, the right numbers should be 747 + 2468, which sums up to 3,215 — the right amount of samples belonging to class 1.
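(For what it's worth, the check I'd expect to hold is that the label counts agree with the row sums of the matrix; a minimal sketch, with made-up arrays standing in for my real y_train and training-set predictions:)

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up stand-ins for the real data: y_train are the true training labels,
# y_pred_train are the model's predictions on the training set.
y_train = np.array([0, 0, 0, 1, 1, 0])
y_pred_train = np.array([0, 1, 0, 1, 0, 0])

cm = confusion_matrix(y_train, y_pred_train)
# Each ROW of cm should sum to the number of samples truly in that class:
print(cm.sum(axis=1))        # [4 2]
print(np.bincount(y_train))  # [4 2] -- must match the row sums
```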
Can someone explain to me what happens the moment I use class_weight? Is it true that confusion_matrix returns a transposed version of the matrix? Am I looking at it the wrong way?
I have tried looking for an answer and visited several questions that are somewhat similar, but none of them really covers this issue.
Those are some of the sources I looked at:
scikit-learn: Random forest class_weight and sample_weight parameters
How to tune parameters in Random Forest, using Scikit Learn?
using sample_weight and class_weight in imbalanced dataset with RandomForest Classifier
Any help would be appreciated, thanks.
Answer 0 (score: 1)
From the docs: entry C[i, j] counts the samples known to be in group i and predicted to be in group j. Reproducing with a toy example:

from sklearn.metrics import confusion_matrix
y_true = [0, 1, 0, 1]
y_pred = [1, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
(tn, fp, fn, tp)
# (0, 2, 1, 1)
So your reading of the confusion matrix you provided seems correct.

Does confusion_matrix return a transposed version of the matrix?

As the example above shows, no. However, a very simple (and innocent-looking) mistake may be that you have swapped the order of the y_true and y_pred arguments, which matters; the result would indeed be a transposed matrix:
# correct order of arguments:
confusion_matrix(y_true, y_pred)
# array([[0, 2],
# [1, 1]])
# inverted (wrong) order of the arguments:
confusion_matrix(y_pred, y_true)
# array([[0, 1],
# [2, 1]])
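A quick way to check which situation you are in: with the arguments in the correct order, the row sums of the matrix equal the true per-class counts (np.bincount(y_true)); with them swapped, those counts show up in the column sums instead. A minimal sketch:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0, 1])

correct = confusion_matrix(y_true, y_pred)
swapped = confusion_matrix(y_pred, y_true)  # arguments inverted

print(correct.sum(axis=1))  # [2 3] -> matches np.bincount(y_true)
print(swapped.sum(axis=0))  # [2 3] -> the true counts moved to the columns
```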
It's impossible to tell whether this is actually the case from the information you have provided, which is a good reminder of why you should always post your actual code, rather than a verbal description of what you think your code is doing...