Question

The docs for sklearn.LabelEncoder start with

This transformer should be used to encode target values, i.e. y, and not the input X.

Why is this?

I post just one example of this recommendation being ignored in practice, although there seem to be many more. https://www.kaggle.com/matleonard/feature-generation contains

#(ks is the input data)

# Label encoding
cat_features = ['category', 'currency', 'country']
encoder = LabelEncoder()
encoded = ks[cat_features].apply(encoder.fit_transform)

Answer 1

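Changing the output values y is not a big deal, because the model simply relearns from them (for example, a regression that is fitted on the error).

The problem is with the input values X: changing them changes how the feature is weighted, which can make correct predictions impossible.

You can do it on X when there are not many options (for example, 2 categories, 2 currencies, or 2 cities encoded as ints); that does not change the game too much.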
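One way to read the two-option remark: with only two values, an integer code carries the same information as a one-hot encoding with one column dropped, so no artificial ordering is introduced. A minimal sketch of that reading; the `currency` array is made up for illustration and is not from the Kaggle notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical two-valued feature, made up for illustration.
currency = np.array(["USD", "EUR", "USD", "USD", "EUR"])

# Integer codes: classes are sorted, so EUR -> 0 and USD -> 1.
int_codes = LabelEncoder().fit_transform(currency)

# One-hot encoding with the first category dropped leaves a single 0/1 column.
one_hot = OneHotEncoder(drop="first").fit_transform(currency.reshape(-1, 1)).toarray()

print(int_codes)        # [1 0 1 1 0]
print(one_hot.ravel())  # [1. 0. 1. 1. 0.]
```

With three or more unordered values the two encodings stop being equivalent, which is where integer codes on X start to mislead a model.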

Answer 2

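Maybe because:

It doesn't naturally work on multiple columns at once.
It doesn't support a custom ordering. I.e., if your categories are ordinal, such as:

Awful, Bad, Average, Good, Excellent

LabelEncoder assigns codes following the sorted class labels (alphabetical for strings), not the logical order, which won't help your classifier.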
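A quick check of what LabelEncoder assigns to these five categories. This is a minimal sketch; as far as I know the encoder sorts its classes, so strings come out in alphabetical order regardless of how they appear in the data.

```python
from sklearn.preprocessing import LabelEncoder

quality = ["Awful", "Bad", "Average", "Good", "Excellent"]  # listed worst to best

enc = LabelEncoder().fit(quality)

# classes_ holds the learned labels; the integer code is the index in this list.
print(enc.classes_.tolist())
# ['Average', 'Awful', 'Bad', 'Excellent', 'Good']

print(enc.transform(quality))
# [1 2 0 4 3]
```

"Average" gets the lowest code and "Good" the highest, so the integers do not reflect the quality scale.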

In this case you can use either OrdinalEncoder or manual replacement.

1. OrdinalEncoder:

Encode categorical features as an integer array.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(data=[['Bad', 200], ['Awful', 100], ['Good', 350], ['Average', 300], ['Excellent', 1000]], columns=['Quality', 'Label'])
enc = OrdinalEncoder(categories=[['Awful', 'Bad', 'Average', 'Good', 'Excellent']])  # Use the 'categories' parameter to specify the desired order. Otherwise the order is inferred from the data.
enc.fit_transform(df[['Quality']])  # Can either fit on 1 feature, or multiple features at once.

Output:

array([[1.],
       [0.],
       [3.],
       [2.],
       [4.]])

Notice the logical order in the output.
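Fitting on several features at once works the same way, with one category list per column. A small sketch under assumptions: the `Size` column and its ordering are hypothetical and only there to show a second feature.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with two ordinal columns; 'Size' is made up for illustration.
df2 = pd.DataFrame({
    "Quality": ["Bad", "Awful", "Good"],
    "Size": ["L", "S", "M"],
})

# One list of categories per column, in the same order as the columns passed in.
enc2 = OrdinalEncoder(categories=[
    ["Awful", "Bad", "Average", "Good", "Excellent"],  # order for 'Quality'
    ["S", "M", "L"],                                   # order for 'Size'
])
encoded = enc2.fit_transform(df2[["Quality", "Size"]])

print(encoded)
# [[1. 2.]
#  [0. 0.]
#  [3. 1.]]
```

Each column is encoded independently against its own list, which is the per-feature behaviour that LabelEncoder lacks.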

2. Manual replacement:

scale_mapper = {'Awful': 0, 'Bad': 1, 'Average': 2, 'Good': 3, 'Excellent': 4}
df['Quality'].replace(scale_mapper)

Output:

0    1
1    0
2    3
3    2
4    4
Name: Quality, dtype: int64
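Two other ways to get the same integers with pandas alone, shown as a sketch rather than a recommendation. `Series.map` turns any value missing from the dictionary into NaN, which makes unexpected categories easy to spot, and an ordered categorical dtype keeps the ordering attached to the column. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Quality": ["Bad", "Awful", "Good", "Average", "Excellent"]})
order = ["Awful", "Bad", "Average", "Good", "Excellent"]  # worst to best

# Option A: map through an explicit dict; values not in the dict become NaN.
scale_mapper = {name: rank for rank, name in enumerate(order)}
mapped = df["Quality"].map(scale_mapper)

# Option B: an ordered categorical dtype; .cat.codes is the position in `order`.
quality_dtype = pd.CategoricalDtype(categories=order, ordered=True)
codes = df["Quality"].astype(quality_dtype).cat.codes

print(mapped.tolist())  # [1, 0, 3, 2, 4]
print(codes.tolist())   # [1, 0, 3, 2, 4]
```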

Answer 3

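I think they warn against using it for X (input data), because:

In most cases categorical input data is better encoded with one-hot encoding than as integers, since most of the time your categories have no natural order.
Secondly, another technical problem is that LabelEncoder is not designed to handle tables (for X, the encoding would have to be done column by column, i.e. per feature). LabelEncoder assumes that the data is just a flat list. That would be the problem.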

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

categories = [x for x in 'abcdabaccba']
categories
## ['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a']

categories_numerical = enc.fit_transform(categories)

categories_numerical
# array([0, 1, 2, 3, 0, 1, 0, 2, 2, 1, 0])

# so it makes out of categories numbers
# and can transform back

enc.inverse_transform(categories_numerical)
# array(['a', 'b', 'c', 'd', 'a', 'b', 'a', 'c', 'c', 'b', 'a'], dtype='<U1')
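Both points from this answer can be sketched on a small table. The array `X` is made up for illustration. The expectation, based on the scikit-learn versions I'm aware of, is that LabelEncoder rejects a multi-column input with a ValueError, while OneHotEncoder handles each column separately.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical two-column table of categorical inputs.
X = np.array([["a", "x"], ["b", "y"], ["c", "x"], ["a", "y"]])

# LabelEncoder expects a 1-D sequence, so a table with several columns is rejected.
try:
    LabelEncoder().fit_transform(X)
    label_encoder_failed = False
except ValueError as err:
    label_encoder_failed = True
    print("LabelEncoder:", err)

# OneHotEncoder works column by column and emits one 0/1 column per category.
ohe = OneHotEncoder()
X_onehot = ohe.fit_transform(X).toarray()  # default output is sparse

print(ohe.categories_)  # one array of categories per input column
print(X_onehot.shape)   # (4, 5): 3 categories in column 0, 2 in column 1
```

No category ends up numerically "larger" than another here, which is the reason one-hot encoding is usually preferred for unordered inputs.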

Why shouldn't the sklearn LabelEncoder be used to encode input data?
