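I have written a set of custom transformers in sklearn to clean the data in a pipeline. Each custom transformer takes two Pandas DataFrames as parameters for fit and transform, and transform also returns two DataFrames (see the example below). This works fine when there is only one transformer in the pipeline: DataFrames in and DataFrames out.
However, when two transformers are combined in a Pipeline, like this: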
pipeline = Pipeline ([
    ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
    ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
])
X, y = pipeline.fit_transform (X, y)
==> TypeError: tuple indices must be integers or slices, not Series
The class RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I swap the order of the transformers, like this:
pipeline = Pipeline ([
    ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
    ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
])
==> AttributeError: 'tuple' object has no attribute 'apply'
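the error occurs in the class RemoveAllMissing instead. In both cases the error message is marked with ==>, directly below the call that triggers it. I have done quite a bit of reading on what exactly might be happening, but I could not find anything on this topic. Can someone tell me what I am doing wrong? The code that reproduces the problem is below.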
import numpy as np
import pandas as pd
import random
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
def create_data (rows, cols, frac_nan, random_state=42):
    random.seed (random_state)
    X = pd.DataFrame (np.zeros ((rows, cols)),
                      columns=['col' + str(i) for i in range (cols)],
                      index=None)
    # Create dataframe of (rows * cols) with random floating points
    y = pd.DataFrame (np.zeros ((rows,)))
    for row in range(rows):
        for col in range(cols):
            X.iloc [row,col] = random.random()
        X.iloc [row,1] = np.nan # column 1 consists solely of NaN's
        y.iloc [row] = random.randint (0, 1)
    # Assign NaN's to a fraction of X
    n = int(frac_nan * rows * cols)
    for i in range (n):
        row = random.randint (0, rows-1)
        col = random.randint (0, cols-1)
        X.iloc [row, col] = np.nan
    # Same applies to y
    n = int(frac_nan * rows)
    for i in range (n):
        row = random.randint (0, rows-1)
        y.iloc [row,] = np.nan
    return X, y
class RemoveAllMissing (BaseEstimator, TransformerMixin):
    # remove columns containing NaN only
    def __init__ (self, requested_cols=[]):
        self.all_missing_data = requested_cols
    def fit (self, X, y=None):
        # find empty columns == columns with all missing data
        missing_cols = X.apply (lambda x: x.count (), axis=0)
        for idx in missing_cols.index:
            if missing_cols [idx] == 0:
                self.all_missing_data.append (idx)
        return self
    def transform (self, X, y=None):
        print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        for all_missing_predictor in self.all_missing_data:
            del X [all_missing_predictor]
        print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y
    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)
class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    # remove each row where target contains one or more NaN's
    def __init__ (self):
        self.missing_rows = []
    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return self
        self.missing_rows = np.array (y.notnull ()) # false = missing data
        return self
    def transform (self, X, y=None):
        print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return X, y
        X = X [self.missing_rows].reset_index ()
        del X ['index']
        y = y [self.missing_rows].reset_index ()
        del y ['index']
        print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y
    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)
pipeline = Pipeline ([
    ('RemoveAllMissing', RemoveAllMissing ()),
    ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget ()),
])
X, y = create_data (25, 10, 0.1)
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
X, y = pipeline.fit_transform (X, y)
#X, y = RemoveAllMissing ().fit_transform (X, y)
#X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)
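Edit: As requested by @Vivek, I have replaced the original code with code that isolates the problem and runs on its own. As written, the code crashes because a tuple is passed as a parameter instead of a DataFrame. The Pipeline changes the data type, and I cannot find that described in the documentation. If you comment out the call to the pipeline and uncomment the separate calls to the transformers, everything works fine, like so: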
#X, y = pipeline.fit_transform (X, y)
X, y = RemoveAllMissing ().fit_transform (X, y)
X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)
Answer 0 (score: 2)
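OK, now I can reproduce the error. Your classes return both X and y. The pipeline can take y as input and passes it along to its inner transformers, but it assumes that y stays constant throughout and is never returned by any transform() method. That is not the case in your code. If you can move that part somewhere else, it can work.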
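As a rough illustration of moving the target-based row removal out of the pipeline, here is a minimal sketch. The helper name drop_rows_with_missing_target and the tiny example frames are made up for this sketch and are not part of the question's code; it assumes y is a one-column DataFrame, as in the question.

```python
import pandas as pd


def drop_rows_with_missing_target(X, y):
    """Hypothetical helper: keep only the rows whose target value is present."""
    # Boolean mask that is True where every target column in the row is non-null.
    keep = y.notnull().all(axis=1).to_numpy()
    X_clean = X.loc[keep].reset_index(drop=True)
    y_clean = y.loc[keep].reset_index(drop=True)
    return X_clean, y_clean


# Small made-up data set: the second row has a missing target.
X = pd.DataFrame({"col0": [0.1, 0.2, 0.3, 0.4], "col1": [None, None, None, None]})
y = pd.DataFrame({0: [1.0, None, 0.0, 1.0]})

X, y = drop_rows_with_missing_target(X, y)
# After this step the pipeline only has to transform X, so each transformer
# can return X alone and y no longer needs to change inside the pipeline.
print(X.shape, y.shape)
```

With the rows filtered up front, the remaining transformers (such as RemoveAllMissing) would only need to return X from transform.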
See this line in the source code of pipeline:
if hasattr(transformer, 'fit_transform'):
    res = transformer.fit_transform(X, y, **fit_params)
else:
    res = transformer.fit(X, y, **fit_params).transform(X)
You are returning two values (X, y), but they are captured in a single variable, res, so it becomes a tuple. That tuple is then passed as X to your next transformer, which fails.
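To make the mechanism visible without sklearn, here is a dependency-free sketch. The function toy_fit_transform is only a simplified stand-in for the logic quoted above, not sklearn's actual implementation, and the two toy classes are hypothetical.

```python
class ReturnsXOnly:
    """Toy transformer that follows the usual contract: transform returns X."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return [row for row in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)


class ReturnsXAndY(ReturnsXOnly):
    """Toy transformer that returns two values, like the classes in the question."""

    def transform(self, X, y=None):
        return X, y


def toy_fit_transform(steps, X, y=None):
    """Simplified stand-in for the quoted Pipeline logic (not sklearn's real code)."""
    seen_types = []
    for transformer in steps:
        # Record what this step receives as X.
        seen_types.append(type(X).__name__)
        if hasattr(transformer, "fit_transform"):
            res = transformer.fit_transform(X, y)
        else:
            res = transformer.fit(X, y).transform(X)
        # The whole return value becomes the next step's X; y is never updated.
        X = res
    return X, seen_types


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]

_, ok_types = toy_fit_transform([ReturnsXOnly(), ReturnsXOnly()], X, y)
_, bad_types = toy_fit_transform([ReturnsXAndY(), ReturnsXOnly()], X, y)
print(ok_types)   # every step receives a list
print(bad_types)  # the second step receives a tuple
```

The second run shows the same symptom as in the question: the first step's (X, y) pair arrives at the second step as a single tuple in place of X.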
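You can handle such data by unpacking the tuple into X and y, like this:
class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    ...
    ...
    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        # a previous step may have returned (X, y) as a single tuple
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return self
    def transform (self, X, y=None):
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return X, y
    def fit_transform(self, X, y=None):
        # return the result so the pipeline receives it
        return self.fit(X, y).transform(X, y)
Make sure you do this for all subsequent transformers in the pipeline. However, I would recommend that you handle X and y separately.
Also, I found some related issues about transforming the target variable y in a pipeline, which you can look at: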