python - Label encoding n-dimensional categorical values

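I came across the post Label encoding across multiple columns in scikit-learn, where one of the answers (https://stackoverflow.com/a/30267328/10058906) explains how each value of a given column is encoded as an integer in the range 0 to (n-1), where n is the number of distinct values in the column. This raises a question: when the values are encoded as red: 2, orange: 1 and green: 0, does that imply that green is closer to orange than to red, since 0 is closer to 1 than to 2? That is not true in reality. I previously thought that green gets the value 0 because it occurs the most often. But that does not hold for the fruit column, where orange occurs the maximum number of times and yet apple gets the value 0.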

I would like to summarize Label Encoding and One Hot Encoding:

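Indeed, Label Encoder simply assigns an integer representation to each cell value. For the dataset above, this means that label-encoding the categorical values would imply that green is closer to orange than to red, since 0 is closer to 1 than to 2, which is false.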
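A minimal sketch of where the integers come from, assuming scikit-learn is installed and using a made-up color list rather than the dataset from the linked post. LabelEncoder stores the distinct values in sorted order in its classes_ attribute and uses each value's position there as its code, so the numbering reflects alphabetical order, not frequency or similarity:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column; not the data from the linked post.
colors = ["red", "orange", "green", "green", "red", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the distinct labels in sorted order; the index is the code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
print(list(codes))
```

The same rule would explain why apple gets 0 in a fruit column: it sorts first alphabetically, regardless of how often it occurs.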

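One Hot Encoding, on the other hand, creates a separate column for each categorical value and fills it with 0 or 1, denoting the absence or presence of that value, respectively. The built-in function pd.get_dummies(dataframe) yields the same output.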
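As a rough illustration, assuming pandas is installed and using a made-up single-column DataFrame, pd.get_dummies produces one indicator column per distinct value, so no category ends up numerically "closer" to another:

```python
import pandas as pd

# Hypothetical sample data; the column name "color" is an assumption.
df = pd.DataFrame({"color": ["red", "orange", "green", "green"]})

# One 0/1 indicator column per distinct value of "color".
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches its original value.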

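Hence, if the given dataset contains categorical values that are ordinal in nature, it is wise to use Label Encoding; but if the given data is nominal, one should go with One Hot Encoding.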
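One caveat for the ordinal case: integer codes are only meaningful if they follow the intended order, and an alphabetical assignment would not do that for values such as low/medium/high. A possible sketch, assuming pandas and a hypothetical size column, is to state the order explicitly with an ordered Categorical:

```python
import pandas as pd

# Hypothetical ordinal column; the values and their order are assumptions.
sizes = pd.Series(["medium", "low", "high", "low"])

# Declare the intended order explicitly instead of relying on sorting.
ordered = pd.Categorical(sizes, categories=["low", "medium", "high"], ordered=True)

# codes gives each value's position in the declared category order.
size_codes = [int(c) for c in ordered.codes]

print(size_codes)
```

With this ordering, low maps to 0, medium to 1 and high to 2, so the numeric distance matches the ranking.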

https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/2


1 answer: