Question

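The way decision trees and random forests use splitting logic gave me the impression that label encoding would not be a problem for these models, since we split the column anyway. For example, if we label encode gender with the values "male", "female" and "other", it becomes 0, 1, 2, which is interpreted as 0 < 1 < 2. But since we are going to split the column, I thought this would not matter, because splitting on "male" or on "0" is the same thing. However, when I tried both label encoding and one-hot encoding on my dataset, one-hot encoding gave better accuracy and precision. Could you share your thoughts?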

The ACCURACY SCORE of various models on train and test are:

The accuracy score of simple decision tree on label encoded data :    TRAIN: 86.46%     TEST: 79.42%
The accuracy score of tuned decision tree on label encoded data :     TRAIN: 81.74%     TEST: 81.33%
The accuracy score of random forest ensembler on label encoded data:  TRAIN: 82.26%     TEST: 81.63%
The accuracy score of simple decision tree on one hot encoded data :  TRAIN: 86.46%     TEST: 79.74%
The accuracy score of tuned decision tree on one hot encoded data :   TRAIN: 82.04%     TEST: 81.46%
The accuracy score of random forest ensembler on one hot encoded data: TRAIN: 82.41%     TEST: 81.66%

The PRECISION SCORE of various models on train and test are:

The precision score of simple decision tree on label encoded data :             TRAIN: 78.26%   TEST: 57.92%
The precision score of tuned decision tree on label encoded data :              TRAIN: 66.54%   TEST: 64.6%
The precision score of random forest ensembler on label encoded data:           TRAIN: 70.1%    TEST: 67.44%
The precision score of simple decision tree on one hot encoded data :           TRAIN: 78.26%   TEST: 58.84%
The precision score of tuned decision tree on one hot encoded data :            TRAIN: 68.06%   TEST: 65.81%
The precision score of random forest ensembler on one hot encoded data:         TRAIN: 70.34%   TEST: 67.32%

Answer 1

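You can think of it as a regularization effect: your model is simpler, so it generalizes better. That is why you get better performance.

Take the sex feature as an example: with label encoding, [male, female, other] becomes [0, 1, 2].

Now suppose there is a particular configuration of the other features that applies only to females. The tree needs two branches to select females: one that selects sex greater than zero, and another that selects sex less than 2.

With one-hot encoding, by contrast, you need only one branch to make the selection, for example sex_female greater than zero.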
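
The point about branch counts can be checked with a small toy sketch. It is not a real decision tree and does not use any library; the sample rows, the `label_code` mapping, and the helper `single_split_isolates` are all hypothetical names made up for illustration. It only asks whether a single threshold rule of the form `value > t` can separate the "female" rows from everything else.

```python
# Toy illustration with made-up data: can ONE threshold split isolate "female"?
label_code = {"male": 0, "female": 1, "other": 2}  # label encoding from the example

samples = ["male", "female", "other", "female", "male", "other"]
is_female = [s == "female" for s in samples]


def single_split_isolates(values, target):
    """Return True if one rule `value > t` (or its complement) matches exactly the target rows."""
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    for t in thresholds:
        right = [v > t for v in values]
        left = [not r for r in right]
        if right == target or left == target:
            return True
    return False


# Label encoding: a single numeric column with values 0, 1, 2.
label_encoded = [label_code[s] for s in samples]
# One-hot encoding: only the sex_female indicator column is needed here.
sex_female = [1 if s == "female" else 0 for s in samples]

label_ok = single_split_isolates(label_encoded, is_female)
one_hot_ok = single_split_isolates(sex_female, is_female)

print("label encoding, one split enough:", label_ok)
print("one-hot encoding, one split enough:", one_hot_ok)
```

With the label-encoded column, "female" sits in the middle of the ordering, so no single threshold isolates it and a tree has to spend two splits. With the sex_female indicator, one split at 0.5 is enough.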
