I have a DataFrame that contains numerical, categorical, and NaN values.
  customer_class  B    C
0            OM1  1  2.0
1            NaN  6  1.0
2            OM1  9  NaN
....
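I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.
So I would like to use this code to encode my DataFrame while keeping the NaN values.
Here is the code: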
class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].get_values()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed D

col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col,y=None)
But I get this error:
    846
--> 847 raise ValueError('%s not contained in the index' % str(key[mask]))
    848 self._set_values(indexer, value)
    849
ValueError: ['OM1' 'OM1' 'OM1' ... 'other' 'EU' 'EUB'] not contained in the index
Any ideas on how to resolve this error?
Thanks
Answer 0 (score: 0)
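When I tried to reproduce this, two things stood out:
1. Your code appears to expect a DataFrame to be passed to your class, but in your example you passed a Series. I fixed this by wrapping the Series in a DataFrame before passing it to your class: col = pd.DataFrame(data1['customer_class']).
2. In your class's __init__ method, you apparently intended to iterate over a list of column names, but the loop was actually iterating over the data you passed in rather than over column names. I changed the corresponding line to self.col = col.columns.values.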
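To make the Series versus DataFrame distinction concrete, here is a minimal sketch. It reuses the small `data1` frame from this answer; the variable names `series`, `frame`, `series_items`, and `frame_items` are only illustrative. Iterating over a Series yields its values, while iterating over a DataFrame yields its column names, which is what the loop in `__init__` relies on.

```python
import numpy as np
import pandas as pd

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

# Single-bracket selection returns a Series.
series = data1['customer_class']
# Wrapping it (or selecting with a list, data1[['customer_class']]) gives a DataFrame.
frame = pd.DataFrame(series)

# Iterating a Series walks over its values ...
series_items = list(series)
# ... while iterating a DataFrame walks over its column names.
frame_items = list(frame)

print(type(series).__name__, series_items)
print(type(frame).__name__, frame_items)
```

With a Series, each value such as 'OM1' ends up being used as if it were a column label, which is consistent with the "not contained in the index" error in the question.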
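Below I have pasted my modifications to your class's __init__ and fit methods (the only change to the transform method is that it now returns the modified DataFrame):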
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # Names of the columns in the DataFrame that should be encoded
        self.col = col.columns.values
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN' (on a copy, so x is not modified)
        x = x.fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN' (this modifies x in place)
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only transform the values that are not 'NaN'
            a = x[el][x[el] != 'NaN']
            # Store a writable ndarray copy of the current column
            # (Series.get_values() was removed in pandas 1.0)
            b = x[el].to_numpy(dtype=object, copy=True)
            # Replace the elements in the ndarray that are not 'NaN'
            # using the fitted encoder
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
        return x
I was able to run the following lines (also modified from your original implementation) without errors:
col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)
I can then access the classes for the customer_class column from your example:
lenc.fit(x=col,y=None).le_dic['customer_class'].classes_
Which outputs:
array(['OM1'], dtype=object)
Finally, I can transform the column using your class's transform method:
lenc.transform(x=col,y=None)
Which outputs the following:
  customer_class
0              0
1            NaN
2              0
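
Note that the NaN in this output is the string 'NaN', not a real missing value, so an imputer may not recognize it as missing. If the goal is to impute afterwards, one possible variation (not part of the class above) is to encode only the non-missing entries and leave np.nan in place. The sketch below assumes a single column and uses a hypothetical helper name, encode_keep_nan. It also assumes scikit-learn's SimpleImputer with the most_frequent strategy is an acceptable stand-in for the Imputer mentioned in the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(series):
    """Label-encode the non-missing values of a Series, leaving NaN untouched.

    Hypothetical helper for illustration only.
    """
    mask = series.notna()
    # Fit only on the values that are actually present
    encoder = LabelEncoder().fit(series[mask].to_numpy(dtype=object))
    # Start from an all-NaN float column, then fill in the encoded labels
    encoded = pd.Series(np.nan, index=series.index, dtype=float)
    encoded.loc[mask] = encoder.transform(series[mask].to_numpy(dtype=object))
    return encoded, encoder

s = pd.Series(['OM1', np.nan, 'OM1', 'EU'], name='customer_class')
encoded, encoder = encode_keep_nan(s)
print(encoded)
print(encoder.classes_)

# Real NaN values can now be handled by an imputer (strategy chosen as an example)
imputer = SimpleImputer(strategy='most_frequent')
imputed = imputer.fit_transform(encoded.to_numpy().reshape(-1, 1))
print(imputed.ravel())
```

The encoded column is float rather than integer because NaN cannot be stored in a plain integer column; the fitted encoder is returned so that the labels can be mapped back with inverse_transform later.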