Increasing recall for the minority class in XGBClassifier

Posted: 2019-01-08 07:05:31

Tags: python classification xgboost

I am using XGBClassifier to predict whether a user will click on an ad.
I am looking for suggestions on how to increase recall for the minority class.

About my data:

1. Total rows: 1,266,267
2. Total clicks: 1960 rows (0.15%) => imbalanced dataset
3. Features used:
    - Num of views 
    - Device used 
    - Time (categorized into 6 buckets)
    - Ad category
    - Site id (338 unique site ids)
    - User features (583 unique features; note: only available for 60% of the data)

After one-hot encoding, there are 943 columns/features in total.
The final data is in sparse matrix format.
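To illustrate how one-hot encoding produces the sparse layout described above (the function below is a hypothetical sketch, not code from the post; in practice this would be done with a library encoder that emits a scipy sparse matrix):

```python
def one_hot_sparse(values):
    """One-hot encode a categorical column into (row, col) coordinate pairs,
    mirroring a sparse-matrix layout: only the nonzero entries are stored."""
    categories = sorted(set(values))            # one output column per category
    col = {c: i for i, c in enumerate(categories)}
    coords = [(row, col[v]) for row, v in enumerate(values)]
    return categories, coords

# Each row contributes exactly one nonzero entry per categorical feature,
# which is why 943 one-hot columns still fit comfortably in sparse format.
categories, coords = one_hot_sparse(["mobile", "desktop", "mobile"])
```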

Model results:

Model                                        | AUC    | Logloss | Recall* | Precision*
---------------------------------------------|--------|---------|---------|-----------
Using all 943 features                       | 0.7359 | 0.05392 | 0.47    | 0.85
Clustered user features into groups          | 0.7548 | 0.05470 | 0.51    | 0.80
(final model, num features = 361)            |        |         |         |

*Recall and precision refer to the minority class (click=1).
**Recall and precision for the majority class (click=0) are 1.

To increase recall on the imbalanced dataset, I have tried:

  1. Undersampling (highest recall 0.92, but precision 0.03)
  2. SMOTE (highest recall 0.77, but precision 0.05)
  3. Different algorithms (XGBoost performed best)
  4. Hyperparameter tuning (recall improved by 0.01)
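One option related to the approaches listed above, but not mentioned in the post, is XGBoost's built-in class weighting. `scale_pos_weight` is a real `XGBClassifier` parameter; a common heuristic (an assumption here, not something the author tried) is to set it to the ratio of negatives to positives:

```python
# Class counts taken from the post: 1,266,267 rows, 1,960 clicks.
def pos_weight(n_total, n_pos):
    """Ratio of negative to positive examples, a common starting value
    for XGBoost's scale_pos_weight on imbalanced data."""
    return (n_total - n_pos) / n_pos

spw = pos_weight(1_266_267, 1_960)  # roughly 645

# Hypothetical usage (scale_pos_weight is a real XGBClassifier parameter):
# model = xgboost.XGBClassifier(scale_pos_weight=spw)
```

Unlike undersampling or SMOTE, this reweights the loss without altering the training data, so it often degrades precision less severely; the value usually still needs tuning downward from the raw ratio.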

Questions:

1. Is my model too complex to generalise well?
2. I have compared my AUC with results from other research papers.
   Research AUC ranges from 0.7 to 0.82.
   However, none of them reported the recall/confusion matrix.
   To anyone who has done CTR prediction before, can I know your recall/confusion matrix?
3. Are there other ways that can help increase recall on an imbalanced dataset?
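On question 3, one lever worth making explicit (an addition here, not from the post) is the classification threshold: recall/precision numbers like those in the table depend on where the probability cutoff sits, and lowering it below the default 0.5 trades precision for recall without retraining. A minimal sketch over predicted probabilities:

```python
def recall_precision(y_true, y_prob, threshold):
    """Recall and precision for the positive class at a given
    probability cutoff (lower cutoff -> higher recall, lower precision)."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision
```

Sweeping `threshold` over the validation-set probabilities gives the full recall/precision trade-off curve, which is often more informative for an imbalanced problem than any single operating point.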

1 Answer:

Answer 0 (score: 0):

In addition, it is suggested that you try other things, for example: if the data is imbalanced, balance it with SMOTE (50%/50%); if you have many categorical variables, try other types of encoding... etc.
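As one example of the "other types of encoding" the answer mentions: mean target encoding replaces each category with a smoothed average of the click rate, collapsing high-cardinality columns (like the 338 site ids) into a single numeric feature. This is a generic sketch of the technique, not code from the answer, and the smoothing scheme here is just one common choice:

```python
from collections import defaultdict

def mean_target_encode(categories, targets, smoothing=10, prior=None):
    """Map each category to a smoothed mean of the binary target.

    `smoothing` pulls rare categories toward the global `prior`
    (the overall click rate), reducing overfitting on sparse categories.
    """
    if prior is None:
        prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + smoothing * prior) / (counts[c] + smoothing)
            for c in counts}
```

In practice the encoding should be fit on training folds only (or with out-of-fold estimates), since computing it on the full data leaks the target.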