Question

I have a DataFrame that contains numerical values, categorical values, and NaN values.

    customer_class  B    C
0   OM1             1    2.0
1   NaN             6    1.0
2   OM1             9    NaN
....

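I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.

So I would like to use this code to encode my DataFrame while keeping the NaN values.

Here is the code: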

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].get_values()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed D


col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col,y=None)

But I get this error:

    846
--> 847 raise ValueError('%s not contained in the index' % str(key[mask]))
    848 self._set_values(indexer, value)
    849

ValueError: ['OM1' 'OM1' 'OM1' ... 'other' 'EU' 'EUB'] not contained in the index

Any ideas on how to resolve this error?

Thanks

Answer 1

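Two things stood out when I tried to reproduce this:

Your code seems to expect a DataFrame to be passed to your class, but in your example you passed a Series. I fixed this by wrapping the Series in a DataFrame before passing it to your class: col = pd.DataFrame(data1['customer_class']).
In the __init__ method of your class, you seem to intend to iterate over a list of column names, but with a Series passed in, the loop actually iterates over the Series' values rather than over column names. I changed the corresponding line to self.col = col.columns.values.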
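
The second point can be checked directly. The following is a minimal sketch using the sample data from the question; the names series, frame, values_seen and names_seen are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

# A Series: iterating over it walks through the cell values
series = data1['customer_class']
values_seen = list(series)

# A one-column DataFrame: its .columns attribute holds the column names
frame = pd.DataFrame(series)
names_seen = list(frame.columns.values)

print(values_seen)  # the cell values, including the missing one
print(names_seen)   # ['customer_class']
```

With the original __init__, the loop over a Series would key le_dic by cell values such as 'OM1', and x[self.col] would then look those values up as labels, which appears consistent with the "not contained in the index" error above.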

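Below I have pasted my modifications to the __init__ and fit methods of your class (the only change to the transform method is that it now returns the modified DataFrame):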

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1,6,9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col.columns.values
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x = x.fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray copy of the current column
            #(Series.get_values() was removed in pandas 1.0)
            b = x[el].to_numpy(dtype=object, copy=True)
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        return x

I was able to run the following lines (also modified from your original usage) without errors:

col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)

I can then access the classes for the customer_class column from your example:

lenc.fit(x=col,y=None).le_dic['customer_class'].classes_

Which outputs:

array(['OM1'], dtype=object)

Finally, I can transform the column using the transform method of your class:

lenc.transform(x=col,y=None)

Which outputs the following:

    customer_class
0   0
1   NaN
2   0
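
Note that transform in the class above modifies the DataFrame it is given. As a possible alternative that leaves the input untouched, the same "encode everything except NaN" idea can be sketched with pandas categorical codes instead of LabelEncoder. This is a different technique from the class above, and encode_keep_nan is a hypothetical helper name used only for illustration.

```python
import numpy as np
import pandas as pd

def encode_keep_nan(series):
    """Integer-encode the non-missing values and leave missing values as NaN."""
    # Categorical codes number the sorted categories from 0;
    # missing values receive the placeholder code -1
    codes = series.astype('category').cat.codes
    # Convert to float so NaN can be stored, then mask the placeholders
    return codes.astype('float64').where(series.notna())

# Small made-up sample, not the asker's real data
sample = pd.Series(['OM1', np.nan, 'EU', 'OM1'], name='customer_class')
encoded = encode_keep_nan(sample)
print(encoded)
```

The result is a float column in which only the originally missing rows are NaN, so it can be handed to an imputer afterwards. Unlike a fitted LabelEncoder, this sketch does not remember the mapping for reuse on new data.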

LabelEncoder: encoding a DataFrame without encoding NaN (missing) values
