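I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the following points, since I am mainly interested in correctly predicting more of the 1's.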
1. Should I give preference to sensitivity or to positive predictive value?
Some ensemble techniques give at most 45% sensitivity with a low positive predictive value,
while others give 62% positive predictive value with low sensitivity.
2. My dataset has around 450K observations and 250 features.
After a power test I took 10K observations by simple random sampling. When I compute
variable importance with ensemble techniques on this sample, the important features
differ from the ones I got when I tried with 150K observations.
Based on my intuition and domain knowledge, the features that came up as important in
the 150K-observation sample seem more relevant. What is the best practice here?
3. Lastly, can I use the variable importance generated by RF in other ensemble
techniques to predict the accuracy?
Could you help me sort this out? I am a bit confused.
Answer 0 (score: 1)
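Whether to prefer sensitivity or positive predictive value depends on the end goal of your analysis. The difference between the two is explained well here: https://onlinecourses.science.psu.edu/stat507/node/71/ In short, the two measure the result from two different points of view. Sensitivity gives you the probability that the test finds the "condition" among those who actually have it. Positive predictive value looks at how prevalent the "condition" is among those who test positive.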
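To make the distinction concrete, here is a minimal sketch that computes both measures from confusion-matrix counts. The counts are hypothetical, chosen only to match the 25%/75% class split in the question; they are not results from the asker's data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of the actual 1's that the model predicts as 1."""
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Share of the predicted 1's that are actually 1."""
    return tp / (tp + fp)


# Hypothetical counts: 100 actual 1's and 300 actual 0's.
tp, fn = 45, 55   # actual 1's: caught vs. missed
fp, tn = 30, 270  # actual 0's: false alarms vs. correctly rejected

sens = sensitivity(tp, fn)               # 45 / (45 + 55)
ppv = positive_predictive_value(tp, fp)  # 45 / (45 + 30)
print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

If missing a 1 is the costlier mistake, sensitivity is the number to watch; if acting on a false alarm is costlier, positive predictive value matters more.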
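Accuracy depends on your classification results: it is defined as (true positives + true negatives) / (total), so it is not determined by the variable importance produced by RF.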
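As an illustration of that definition, the sketch below computes accuracy purely from confusion-matrix counts; no variable importance enters the calculation. The counts are hypothetical and only mirror the 25%/75% split from the question. The second call shows a model that predicts 0 for every observation, which suggests why accuracy alone can be a weak guide on imbalanced data.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(true positives + true negatives) / total number of observations."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical classifier on 100 actual 1's and 300 actual 0's.
acc_model = accuracy(tp=45, tn=270, fp=30, fn=55)

# Hypothetical baseline that predicts 0 for every observation:
# it never finds a 1, yet still scores the majority-class share.
acc_all_zero = accuracy(tp=0, tn=300, fp=0, fn=100)

print(f"model: {acc_model:.4f}, always-0 baseline: {acc_all_zero:.4f}")
```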
In addition, the imbalance in the dataset can be compensated for; see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
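One common way to compensate, shown here only as a sketch and not necessarily the approach taken in the linked thread, is to weight each class inversely to its frequency. The helper name `balanced_class_weights` is made up for this example; the formula n_samples / (n_classes * class_count) is the same heuristic scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with the 25% / 75% split described in the question.
labels = [1] * 25 + [0] * 75
weights = balanced_class_weights(labels)
print(weights)  # the minority class 1 gets the larger weight
```

The resulting dictionary could then be passed to a learner that accepts per-class weights, so that errors on the 1's count for more during training.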