I have a DataFrame containing floats, strings, and strings that can be interpreted as dates.
Label encoding across multiple columns in scikit-learn
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # .values returns a NumPy array, so the column names are lost here
        return X[self.attribute_names].values
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # array of column names to encode
    def fit(self, X, y=None):
        return self  # not relevant here
    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # items() replaces iteritems(), which was removed in pandas 2.0
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
num_attributes = ["a", "b", "c"]
num_attributes = list(df_num_median)
str_attributes = list(df_str_only)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # replaces Imputer, which was removed in scikit-learn 0.22
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attributes)),  # transforming the Pandas DataFrame into a NumPy array
    ('imputer', SimpleImputer(strategy="median")),  # replacing missing values with the median
    ('std_scalar', StandardScaler()),  # scaling the features using standardization (subtract the mean, divide by the standard deviation)
])
from sklearn.preprocessing import LabelEncoder
str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)),  # transforming the Pandas DataFrame into a NumPy array
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    #("str_pipeline", str_pipeline)  # replaced by line below
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])
df_prepared = full_pipeline.fit_transform(df_combined)
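The num_pipeline part of the pipeline works fine. In the str_pipeline part, I get this error:
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices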
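This does not happen if I comment out MultiColumnLabelEncoder in str_pipeline. I also wrote some code that applies MultiColumnLabelEncoder to the dataset without a pipeline, and that works fine. Any ideas? As an additional step, I will have to create two separate pipelines for the strings and the date strings.
Edit: added the DataFrameSelector class.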
Answer 0 (score: 1)
The problem is not in MultiColumnLabelEncoder but in the DataFrameSelector that precedes it in the pipeline.
You are doing this:
str_pipeline = Pipeline([
    ('selector', DataFrameSelector(str_attributes)),  # transforming the Pandas DataFrame into a NumPy array
    ('encoding', MultiColumnLabelEncoder(str_attributes))
])
DataFrameSelector returns the .values attribute of the DataFrame, which is a NumPy array. So when MultiColumnLabelEncoder then does this:
...
...
if self.columns is not None:
    for col in self.columns:
        output[col] = LabelEncoder().fit_transform(output[col])
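the expression output[col] raises the error. output is a copy of X, and X is a NumPy array (DataFrameSelector has already converted it), so it carries no column names and cannot be indexed with a string.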
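The failure can be reproduced outside the pipeline. The following is a minimal sketch, assuming a small made-up DataFrame; the column names city and color are hypothetical and are not taken from the question's data.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo"],
    "color": ["red", "blue", "red"],
})

# This is what DataFrameSelector.transform hands to the next step
arr = df[["city", "color"]].values
print(type(arr).__name__)  # a NumPy ndarray, with no column names attached

error_message = None
try:
    arr["city"]  # same kind of lookup as output[col] in MultiColumnLabelEncoder
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```

Indexing the original DataFrame with the same column name works, which is consistent with the encoder running fine when it is applied without the selector step.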
Since you already pass 'str_attributes' to MultiColumnLabelEncoder, you do not need DataFrameSelector in that pipeline. Just do this:
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("str_pipeline", MultiColumnLabelEncoder(str_attributes))
])
I removed str_pipeline because, once DataFrameSelector is dropped, it contains only a single transformer.
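If a selector step is still wanted in front of the encoder, for example so that only the string columns appear in the encoded output, one option is a selector that returns a DataFrame instead of .values. The sketch below is not the code from the question: the class name DataFrameColumnSelector and the sample data are hypothetical, and the encoder is a compact rewrite of the same idea.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder


class DataFrameColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical selector that keeps the chosen columns as a DataFrame."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # No .values here, so the column names survive for the next step
        return X[self.attribute_names]


class MultiColumnLabelEncoder(TransformerMixin, BaseEstimator):
    """Label-encodes the given columns (all columns if none are given)."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        output = X.copy()
        columns = self.columns if self.columns is not None else output.columns
        for col in columns:
            output[col] = LabelEncoder().fit_transform(output[col])
        return output


# Hypothetical data, for illustration only
df = pd.DataFrame({
    "price": [1.5, 2.0, 3.25],
    "city": ["Oslo", "Rome", "Oslo"],
    "color": ["red", "blue", "red"],
})
str_attributes = ["city", "color"]

str_pipeline = Pipeline([
    ("selector", DataFrameColumnSelector(str_attributes)),
    ("encoding", MultiColumnLabelEncoder(str_attributes)),
])
encoded = str_pipeline.fit_transform(df)
print(encoded)
```

Because the selector passes a DataFrame along, output[col] in the encoder resolves by column name and the IndexError does not occur. Note that this encoder refits a LabelEncoder on every call to transform, as the one in the question does, so it is only a sketch for a single dataset and not something to reuse on separate training and test sets.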