Question

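I'm looking at data to try to predict who will open a message. The binary outcome is heavily imbalanced: the actual click rate is below 1%.

First, I try different subsets of the dataset, taken after the split. Initially I applied the cut before splitting, but I realized the data had not been shuffled yet, so the subset could be biased. After the split, however, it should be fully randomized.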

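from sklearn.model_selection import train_test_split

cut = 10000  # total number of rows to keep (train + test)

# Shuffle and split the full data first, then truncate each part to the cut size
X_train, X_test, Y_train, Y_test = train_test_split(X_df, y, test_size = 0.2, random_state = 0)

X_train, X_test, Y_train, Y_test = X_train[:int(cut*0.8)], X_test[:int(cut*0.2)], Y_train[:int(cut*0.8)], Y_test[:int(cut*0.2)]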
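With a click rate below 1%, a small truncated subset can end up with very few positives (the 10K test set below contains a single click). One possible alternative, shown here only as a sketch on synthetic stand-in data (`X_demo` and `y_demo` are made up for illustration, not the real `X_df` and `y`), is to let `train_test_split` draw the subset itself with absolute `train_size` and `test_size` counts and `stratify`, so both parts keep roughly the same click rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_df / y (assumption): about 1% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50_000, 3))
y_demo = (rng.random(50_000) < 0.01).astype(int)

cut = 10_000

# Integer train_size / test_size draw a shuffled subset of exactly `cut` rows;
# stratify keeps the click rate approximately equal in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, y_demo,
    train_size=int(cut * 0.8),
    test_size=int(cut * 0.2),
    stratify=y_demo,
    random_state=0,
)

print(X_train.shape[0], X_test.shape[0])
print(Y_train.mean(), Y_test.mean(), y_demo.mean())
```

This does not change the modelling problem itself; it only makes the small subsets more comparable to the full data.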

After this, I fit a random forest model:

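from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" weights each class inversely to its frequency
rf = RandomForestClassifier(n_estimators = 100,
                           n_jobs = -1,
                           oob_score = True,
                           bootstrap = True,
                           random_state = 42,
                           class_weight="balanced")
rf.fit(X_train, Y_train)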
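To see what `class_weight="balanced"` does to the training objective, the weights can be computed directly. The sketch below uses a hypothetical label vector with a 1% click rate (an assumption, not the real data); scikit-learn derives the weights as `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with a 1% click rate (assumption, not the real data).
y_demo = np.array([0] * 9_900 + [1] * 100)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# The same formula written out by hand.
manual = len(y_demo) / (2 * np.bincount(y_demo))

ratio = weights[1] / weights[0]
print(weights, manual, ratio)
```

With these labels each click carries 99 times the weight of a non-click, so both classes have equal total weight during training. A model fitted this way is no longer rewarded for predicting all zeros, which may be part of why it flags many rows as clicks.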

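I repeat this for 10K, 100K, 1M, and the full dataset (~1.58M rows). Strangely, performance keeps getting worse. One problem with the data is that predicting 0 (no click) for every row already yields a very high accuracy. When I inspect the confusion matrices, I see that as the sample size grows, the model labels a larger and larger share of the data as "click": it becomes more aggressive, and accuracy drops.

How should I handle this? I also tried logistic regression on the same problem. What I hope to achieve is to find subsets of the data with a higher click rate so that they can be targeted (for example, getting 80% of the clicks by targeting 20% of the audience, or similar). The more data I use, the further I get from that goal. What should I do?

I also don't understand why the RF model doesn't simply predict all 0s every time (since that would give very high accuracy), or why its behavior changes as the sample size grows. In particular, going from the 100K sample to the 1M sample, the model goes from predicting about 5% clicks to 27%, and accuracy falls. Again, accuracy is clearly not a good metric here, but how can I use the full dataset to find the top-converting subsets without the model predicting more and more clicks, until it eventually predicts a click for everyone and has the same accuracy as the sample average?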
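The "target 20% of the audience" goal can be framed as a ranking problem rather than a 0/1 classification. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, the feature effect, and the `min_samples_leaf=50` setting are all assumptions chosen for illustration): it ranks test rows by `predict_proba` and measures the share of clicks captured in the top 20%.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): click probability rises with the first feature.
rng = np.random.default_rng(0)
n = 20_000
X_demo = rng.normal(size=(n, 5))
p_click = 1 / (1 + np.exp(-(2.0 * X_demo[:, 0] - 5.0)))
y_demo = (rng.random(n) < p_click).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Larger leaves give smoother probability estimates (illustrative setting).
rf_demo = RandomForestClassifier(
    n_estimators=100, min_samples_leaf=50, random_state=42
)
rf_demo.fit(X_tr, y_tr)

# Rank by predicted click probability instead of using the hard 0/1 label.
scores = rf_demo.predict_proba(X_te)[:, 1]
order = np.argsort(-scores)
top = order[: int(0.2 * len(order))]

# Share of all test clicks that fall in the top 20% of ranked rows.
capture = y_te[top].sum() / y_te.sum()
print(f"Top 20% by score captures {capture:.0%} of clicks")
```

Because only the ordering of the scores is used, the number of rows the model would label as "click" at the default 0.5 threshold does not affect this measure.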

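from sklearn.metrics import confusion_matrix

# 10K samples
# Note: for a classifier, rf.score() returns mean accuracy, not R^2,
# so the "R^2" labels below are actually accuracy.
print('{:.0f} samples \nR^2 Training Score: {:.2f}  \nR^2 Validation Score: {:.2f}'.format(
    X_train.shape[0]/0.8, rf.score(X_train, Y_train), rf.score(X_test, Y_test)))
confusion_matrix(Y_test, rf.predict(X_test))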


out:
10000 samples 
R^2 Training Score: 0.97  
R^2 Validation Score: 0.97
array([[1947,   52],
       [   1,    0]], dtype=int64)

#100K samples
print('{:.0f} samples \nR^2 Training Score: {:.2f}  \nR^2 Validation Score: {:.2f}'.format(
    X_train.shape[0]/0.8, rf.score(X_train, Y_train), rf.score(X_test, Y_test)))
confusion_matrix(Y_test, rf.predict(X_test))

out:
100000 samples 
R^2 Training Score: 0.95  
R^2 Validation Score: 0.95
array([[19065,   904],
       [    8,    23]], dtype=int64)

#1M samples
print('{:.0f} samples \nR^2 Training Score: {:.2f}  \nR^2 Validation Score: {:.2f}'.format(
    X_train.shape[0]/0.8, rf.score(X_train, Y_train), rf.score(X_test, Y_test)))
confusion_matrix(Y_test, rf.predict(X_test))

out:
1000000 samples 
R^2 Training Score: 0.73  
R^2 Validation Score: 0.73
array([[145618,  53718],
       [   126,    538]], dtype=int64)


#Full dataset
print('{:.0f} samples \nR^2 Training Score: {:.2f}  \nR^2 Validation Score: {:.2f}'.format(
    X_train.shape[0]/0.8, rf.score(X_train, Y_train), rf.score(X_test, Y_test)))
confusion_matrix(Y_test, rf.predict(X_test))

out:
1585125 samples 
R^2 Training Score: 0.66  
R^2 Validation Score: 0.66
array([[206654, 108214],
       [   985,   1173]], dtype=int64)
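Since accuracy hides what is happening, it may help to derive other quantities from the four confusion matrices reported above. The sketch below copies those matrices (scikit-learn puts actual classes in rows and predicted classes in columns) and computes accuracy, the share of rows predicted as clicks, recall, and precision; the dictionary keys are just labels chosen here.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: actual 0/1, columns: predicted 0/1).
matrices = {
    "10K": np.array([[1947, 52], [1, 0]]),
    "100K": np.array([[19065, 904], [8, 23]]),
    "1M": np.array([[145618, 53718], [126, 538]]),
    "full": np.array([[206654, 108214], [985, 1173]]),
}

results = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    results[name] = {
        "accuracy": (tn + tp) / total,
        "predicted_click_share": (fp + tp) / total,
        "recall": tp / (fn + tp),        # share of real clicks that were found
        "precision": tp / (fp + tp),     # share of predicted clicks that are real
        "base_rate": (fn + tp) / total,  # actual click rate in the test set
    }
    print(name, {k: round(float(v), 4) for k, v in results[name].items()})
```

The predicted click share comes out at roughly 4.6% for 100K and 27% for 1M, matching the jump described in the question, while recall on the 1M run is about 81% compared with a test-set click rate of about 0.33%.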

Random forest model gets worse the more data I use (imbalanced dataset)

0 answers: