Question

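I want the labels assigned by sklearn's LabelEncoder (i.e. 0, 1, 2, 3, ...) to follow a specific order of the possible values of a categorical variable (e.g. ['b', 'a', 'c', 'd']). I guess LabelEncoder assigns labels in lexicographic order, as this example shows: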

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
array(['a', 'b', 'c', 'd'], dtype='<U1')
le.transform(['a', 'b'])
array([0, 1])

How can I force the encoder to keep the order in which the values are first encountered in the .fit method (i.e. encode 'b' as 0, 'a' as 1, 'c' as 2, and 'd' as 3)?

Answer 1

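Note that there may now be a better way to do this at http://contrib.scikit-learn.org/categorical-encoding/ordinal.html. In particular, see the mapping parameter:

a mapping of class to label to use for the encoding, optional. the dict contains the keys 'col' and 'mapping'. the value of 'col' should be the feature name. the value of 'mapping' should be a dictionary of 'original_label' to 'encoded_label'. example mapping: [{'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}}]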
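
The mapping described above is a plain dict from original label to encoded label. As a minimal sketch of how such a dict could be built in first-seen order, using only the standard library (build_mapping is a hypothetical helper for illustration, not part of any encoder library):

```python
def build_mapping(values):
    """Map each distinct label to an integer code in first-seen order."""
    # dict.fromkeys drops duplicates while keeping insertion order (Python 3.7+)
    return {label: code for code, label in enumerate(dict.fromkeys(values))}


mapping = build_mapping(['b', 'a', 'c', 'd', 'a'])
encoded = [mapping[v] for v in ['a', 'b', 'd']]

print(mapping)  # {'b': 0, 'a': 1, 'c': 2, 'd': 3}
print(encoded)  # [1, 0, 3]
```

A dict like this could presumably be supplied under the 'mapping' key in the list format quoted above, with 'col' set to the feature name.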

Answer 2

You can't do that with the original LabelEncoder.

LabelEncoder.fit() uses numpy.unique, which always returns the data in sorted order, as given in the source:

def fit(...):
    y = column_or_1d(y, warn=True)
    self.classes_ = np.unique(y)
    return self
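
To see how the sorted result of numpy.unique fixes the labels, here is a small sketch. It mirrors the effect of fit followed by transform rather than reproducing scikit-learn's code, and the variable names are illustrative:

```python
import numpy as np

y = np.array(['b', 'a', 'c', 'd'])

# np.unique returns the distinct values sorted; return_inverse gives, for each
# input element, its position in that sorted array (i.e. its integer label)
classes, codes = np.unique(y, return_inverse=True)

print(classes)  # ['a' 'b' 'c' 'd']
print(codes)    # [1 0 2 3]
```

So 'b' gets label 1 rather than 0, regardless of where it appears in the input.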

So if you want to do this, you need to override the fit() function. Something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

Then you can do this:

le = MyLabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
#Output:  array(['b', 'a', 'c', 'd'], dtype=object)

Here I am using pandas.Series.unique() to get the unique classes. If you can't use pandas for any reason, refer to this question, which does the same with numpy:

numpy unique without sort
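
One common numpy-only way to do this, sketched here under the assumption that the input is one-dimensional (unique_in_order is a hypothetical helper name), is to ask numpy.unique for the index of each value's first occurrence and then restore the original order from those indices:

```python
import numpy as np


def unique_in_order(values):
    """Return the distinct values of a 1-D input in first-seen order."""
    arr = np.asarray(values)
    # return_index gives the index of the first occurrence of each unique value
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those indices recovers the order of first appearance
    return arr[np.sort(first_idx)]


print(unique_in_order(['b', 'a', 'c', 'd', 'a', 'b']))  # ['b' 'a' 'c' 'd']
```

The result could be assigned to self.classes_ in place of pd.Series(y).unique() in the overridden fit().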

Answer 3

Vivek Kumar's solution worked for me, but I had to do it this way:

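import numpy as np

class LabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        # ndarray.sort() sorts in place and returns None, so use np.sort here.
        # Note that sorting gives back the default lexicographic order.
        self.classes_ = np.sort(pd.Series(y).unique())
        return self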

Answer 4

Note: this is not a standard approach but a hack. I used the classes_ attribute to customize my mapping.

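import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical data, reconstructed from the output shown below
df_1 = pd.DataFrame({'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool']},
                    index=range(1, 7))

le_temp = preprocessing.LabelEncoder()
le_temp = le_temp.fit(df_1['Temp'])
print(df_1['Temp'])
# Overwrite the fitted classes so that Cool -> 0, Mild -> 1, Hot -> 2
le_temp.classes_ = np.array(['Cool', 'Mild', 'Hot'])
print("New classes sequence::", le_temp.classes_)
df_1['Temp'] = le_temp.transform(df_1['Temp'])
print(df_1['Temp'])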

My output looks like this:

1      Hot
2      Hot
3      Hot
4     Mild
5     Cool
6     Cool

Name: Temp, dtype: object
New classes sequence:: ['Cool' 'Mild' 'Hot']

1     2
2     2
3     2
4     1
5     0
6     0

Name: Temp, dtype: int32

Python sklearn: determining the encoding order of LabelEncoder
