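The way decision trees and random forests use splitting logic gave me the impression that label encoding would not be a problem for these models, since we split the column anyway. For example: if we label encode gender with the values "male", "female" and "other", it becomes 0, 1, 2, which is interpreted as 0 < 1 < 2. But since we split the column, I thought it would not matter, because splitting on "male" or on "0" is the same thing. However, when I tried both label encoding and one-hot encoding on my dataset, one-hot encoding gave slightly better accuracy and precision. Could you share your thoughts?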
The ACCURACY SCORE of various models on train and test are:
The accuracy score of simple decision tree on label encoded data : TRAIN: 86.46% TEST: 79.42%
The accuracy score of tuned decision tree on label encoded data : TRAIN: 81.74% TEST: 81.33%
The accuracy score of random forest ensembler on label encoded data: TRAIN: 82.26% TEST: 81.63%
The accuracy score of simple decision tree on one hot encoded data : TRAIN: 86.46% TEST: 79.74%
The accuracy score of tuned decision tree on one hot encoded data : TRAIN: 82.04% TEST: 81.46%
The accuracy score of random forest ensembler on one hot encoded data: TRAIN: 82.41% TEST: 81.66%
The PRECISION SCORE of various models on train and test are:
The precision score of simple decision tree on label encoded data : TRAIN: 78.26% TEST: 57.92%
The precision score of tuned decision tree on label encoded data : TRAIN: 66.54% TEST: 64.6%
The precision score of random forest ensembler on label encoded data: TRAIN: 70.1% TEST: 67.44%
The precision score of simple decision tree on one hot encoded data : TRAIN: 78.26% TEST: 58.84%
The precision score of tuned decision tree on one hot encoded data : TRAIN: 68.06% TEST: 65.81%
The precision score of random forest ensembler on one hot encoded data: TRAIN: 70.34% TEST: 67.32%
Answer 0 (score: 1)
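You can think of it as a regularization effect: the model is simpler and therefore generalizes better, which gives you better performance.
Take the gender feature as an example: with label encoding, [male, female, other] becomes [0, 1, 2].
Now suppose there is a particular configuration of the other features that applies only to females: the tree needs two branches to select females, one selecting sex greater than zero and another selecting sex less than 2.
With one-hot encoding, by contrast, you need only one branch to make the selection, e.g. sex_female greater than zero.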
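The two-branch versus one-branch argument can be checked with a small sketch in plain Python. It does not use any tree library. The helper `single_split_isolates` and the toy `sexes` list are hypothetical names and data introduced only for illustration, and the 0/1/2 mapping is assumed to match the example above.

```python
# Assumed label encoding, as in the example: male=0, female=1, other=2.
LABELS = {"male": 0, "female": 1, "other": 2}


def single_split_isolates(values, target_mask):
    """Return True if one threshold test (value <= t) separates
    exactly the target rows from all other rows."""
    for t in sorted(set(values)):
        left = [v <= t for v in values]
        right = [not x for x in left]
        if left == target_mask or right == target_mask:
            return True
    return False


# Toy data (made up for illustration).
sexes = ["male", "female", "other", "female", "male", "other"]
is_female = [s == "female" for s in sexes]

label_encoded = [LABELS[s] for s in sexes]
one_hot_female = [1 if s == "female" else 0 for s in sexes]

# With label encoding, no single threshold isolates the middle value 1.
label_ok = single_split_isolates(label_encoded, is_female)

# With one-hot encoding, the test sex_female > 0 is enough.
one_hot_ok = single_split_isolates(one_hot_female, is_female)

# Two stacked tests (sex > 0, then sex < 2) do isolate females.
two_split = [(v > 0) and (v < 2) for v in label_encoded]

print("label encoding, one split:", label_ok)
print("one-hot encoding, one split:", one_hot_ok)
print("label encoding, two splits:", two_split == is_female)
```

Note that this only affects the category that lands in the middle of the label order; "male" (0) and "other" (2) can each be isolated with one threshold under label encoding.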