Label-encoding a DataFrame without encoding NaN missing values

Time: 2019-01-30 18:19:37

Tags: python pandas class dataframe

I have a DataFrame that contains numeric, categorical, and NaN values.

      customer_class  B    C
    0            OM1  1  2.0
    1            NaN  6  1.0
    2            OM1  9  NaN
    ....

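I need a LabelEncoder that leaves missing values as "NaN", so that I can use an Imputer afterwards.

So I would like to use the code below to encode my DataFrame while keeping the NaN values.

Here is the code: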

    class LabelEncoderByCol(BaseEstimator, TransformerMixin):
        def __init__(self,col):
            #List of column names in the DataFrame that should be encoded
            self.col = col
            #Dictionary storing a LabelEncoder for each column
            self.le_dic = {}
            for el in self.col:
                self.le_dic[el] = LabelEncoder()

        def fit(self,x,y=None):
            #Fill missing values with the string 'NaN'
            x[self.col] = x[self.col].fillna('NaN')
            for el in self.col:
                #Only use the values that are not 'NaN' to fit the Encoder
                a = x[el][x[el]!='NaN']
                self.le_dic[el].fit(a)
            return self

        def transform(self,x,y=None):
            #Fill missing values with the string 'NaN'
            x[self.col] = x[self.col].fillna('NaN')
            for el in self.col:
                #Only use the values that are not 'NaN' to fit the Encoder
                a = x[el][x[el]!='NaN']
                #Store an ndarray of the current column
                b = x[el].get_values()
                #Replace the elements in the ndarray that are not 'NaN'
                #using the transformer
                b[b!='NaN'] = self.le_dic[el].transform(a)
                #Overwrite the column in the DataFrame
                x[el]=b
            #return the transformed D


col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col,y=None)

But I get this error:

        846 ...
    --> 847 raise ValueError('%s not contained in the index' % str(key[mask]))
        848 self._set_values(indexer, value)
        849

    ValueError: ['OM1' 'OM1' 'OM1' ... 'Other' 'EU' 'EUB'] not contained in the index

Any ideas on how to fix this error?

Thanks

1 answer:

Answer 0: (score: 0)

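Two things stood out when I tried to reproduce this:

  1. Your code seems to expect a DataFrame to be passed to your class, but in your example you passed a Series. I fixed this by wrapping the Series in a DataFrame before passing it to your class: col = pd.DataFrame(data1['customer_class'])

  2. In your class's __init__ method, you seem to intend to loop over a list of column names, but the loop actually runs over the data that was passed in rather than over names. I changed the corresponding line to self.col = col.columns.values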
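To make the second point concrete, here is a small sketch of what each kind of object yields when you loop over it. It uses a toy frame that mirrors the example data; the names series, frame, series_items, frame_items and column_names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the example data in the question
data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

series = data1['customer_class']                # a Series, as passed in the question
frame = pd.DataFrame(data1['customer_class'])   # the same data wrapped in a DataFrame

# Looping over a Series walks through its values
series_items = list(series)

# Looping over a DataFrame (or its .columns) walks through the column labels
frame_items = list(frame)
column_names = list(frame.columns.values)

print(series_items)
print(frame_items)
print(column_names)
```

So a loop like `for el in self.col` only sees column names if `self.col` holds names (or a DataFrame), not a Series of data.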

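Below I have pasted my modifications to your class's __init__ and fit methods (the only change to the transform method is that it now returns the modified DataFrame):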

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1,6,9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col.columns.values
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x = x.fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only pass the values that are not 'NaN' to the Encoder
            a = x[el][x[el]!='NaN']
            #Store a writable ndarray copy of the current column
            #(Series.get_values() was removed in pandas 1.0)
            b = x[el].values.copy()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        return x

I was able to run the following lines (also modified compared to your original implementation) without errors:

col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)

I could then access the classes for the customer_class column from your example:

lenc.fit(x=col,y=None).le_dic['customer_class'].classes_

Which outputs:

array(['OM1'], dtype=object)

Finally, I could use your class's transform method to transform the column:

lenc.transform(x=col,y=None)

Which outputs the following:

  customer_class
0              0
1            NaN
2              0
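One thing to keep in mind: the class above overwrites the frame passed to transform and marks the gaps with the string 'NaN'. If you would rather keep real NaN values and leave the input untouched, something along these lines might work. This is only a sketch, not part of the class above; encode_keep_nan is a hypothetical helper name, and the extra 'EU' row is made up just to have two classes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(series):
    """Hypothetical helper: label-encode non-missing values, keep NaN as NaN."""
    mask = series.notna()
    encoder = LabelEncoder()
    # Start from an all-NaN float Series so missing entries stay real NaN
    encoded = pd.Series(np.nan, index=series.index, name=series.name)
    # Fit and transform only the rows that actually hold a value
    encoded.loc[mask] = encoder.fit_transform(series[mask])
    return encoded, encoder

# Illustrative data; the 'EU' row is an assumption added for the example
data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'EU', 'OM1']})
encoded, encoder = encode_keep_nan(data1['customer_class'])

print(encoded)
print(encoder.classes_)
```

The result is a float column, since NaN cannot live in an integer column, and the original data1 is not modified.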