Question

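I have written a set of custom transformers in sklearn to clean the data in a pipeline. Each custom transformer takes two Pandas DataFrames as parameters to fit and transform, and transform also returns two DataFrames (see the example below). This works fine when there is only one transformer in the pipeline: DataFrames in, DataFrames out.

However, when two transformers are combined in a Pipeline like this: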

pipeline = Pipeline ([
        ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
        ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
        ])

X, y = pipeline.fit_transform (X, y)

==>TypeError: tuple indices must be integers or slices, not Series

the class RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I switch the positions of the transformers like this

pipeline = Pipeline ([
        ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
        ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
        ])

==> AttributeError: 'tuple' object has no attribute 'apply'

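the error occurs in the class RemoveAllMissing. In both cases the error message is marked with ==> and shown below the code that triggers it. I have tried to read up on what exactly is going on here, but I could not find anything on this topic. Can someone tell me what I am doing wrong? The code that reproduces the problem is below.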

import numpy as np
import pandas as pd
import random
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

def create_data (rows, cols, frac_nan, random_state=42):
    random.seed (random_state)
    X = pd.DataFrame (np.zeros ((rows, cols)), 
                      columns=['col' + str(i) for i in range (cols)], 
                      index=None)
    # Create dataframe of (rows * cols) with random floating points
    y = pd.DataFrame (np.zeros ((rows,)))
    for row in range(rows):
        for col in range(cols):
            X.iloc [row,col] = random.random()
        X.iloc [row,1] = np.nan # column 1 consists solely of NaNs
        y.iloc [row] = random.randint (0, 1)
    # Assign NaN's to a fraction of X
    n = int(frac_nan * rows * cols)
    for i in range (n):
        row = random.randint (0, rows-1)
        col = random.randint (0, cols-1)
        X.iloc [row, col] = np.nan
    # Same applies to y
    n = int(frac_nan * rows)
    for i in range (n):
        row = random.randint (0, rows-1)
        y.iloc [row,] = np.nan

    return X, y    

class RemoveAllMissing (BaseEstimator, TransformerMixin):
    # remove columns containing NaN only
    def __init__ (self, requested_cols=[]):
        self.all_missing_data = requested_cols

    def fit (self, X, y=None):
        # find empty columns == columns with all missing data
        missing_cols = X.apply (lambda x: x.count (), axis=0)
        for idx in missing_cols.index:
            if missing_cols [idx] == 0:
                self.all_missing_data.append (idx)

        return self

    def transform (self, X, y=None):
        print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        for all_missing_predictor in self.all_missing_data:
            del X [all_missing_predictor]

        print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    # remove each row where target contains one or more NaN's
    def __init__ (self):
        self.missing_rows = []

    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return self

        self.missing_rows = np.array (y.notnull ()) #  false = missing data

        return self

    def transform (self, X, y=None):
        print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return X, y

        X = X [self.missing_rows].reset_index ()
        del X ['index']
        y = y [self.missing_rows].reset_index ()
        del y ['index']  

        print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

pipeline = Pipeline ([
        ('RemoveAllMissing', RemoveAllMissing ()),
        ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget ()),
        ])

X, y = create_data (25, 10, 0.1)
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
X, y = pipeline.fit_transform (X, y) 
#X, y = RemoveAllMissing ().fit_transform (X, y)
#X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)

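Edit: As requested by @Vivek, I have replaced the original code with code in which the problem is isolated and which runs standalone. As it stands, the code crashes somewhere because a tuple is passed as a parameter instead of a DataFrame. The pipeline changes the data types, and I cannot find that in the documentation. When the call to the pipeline is commented out and the separate calls to the transformers are uncommented, everything works fine, like this: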

#X, y = pipeline.fit_transform (X, y) 
X, y = RemoveAllMissing ().fit_transform (X, y)
X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)

Answer 1

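OK, now I see the error. Your classes return X, y. The pipeline can take y as input (and passes it along to its inner transformers), but it assumes that y stays constant throughout and is never returned by any transform() method. That is not the case in your code. If you can separate that part out, it will work.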
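To make the mechanism concrete, here is a minimal pure-Python sketch of that chaining behaviour. It is a simplified stand-in, not sklearn's actual implementation, and the names `run_steps`, `ReturnsPair` and `RecordsInput` are hypothetical, introduced only for illustration.

```python
class ReturnsPair:
    """Toy step that returns both X and y, like the transformers in the question."""

    def fit_transform(self, X, y=None):
        return X, y


class RecordsInput:
    """Toy step that only records the type of the X it receives."""

    def fit_transform(self, X, y=None):
        self.received_type = type(X)
        return X


def run_steps(steps, X, y):
    """Simplified stand-in for pipeline chaining (an assumption, not sklearn code).

    Whatever a step returns is stored in a single variable and becomes
    the X of the next step; y is passed along unchanged.
    """
    for step in steps:
        X = step.fit_transform(X, y)
    return X


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]

second = RecordsInput()
result = run_steps([ReturnsPair(), second], X, y)
print(second.received_type)  # <class 'tuple'>
```

Because the first step's `(X, y)` return value is bound to one name, the second step sees a tuple where it expected a DataFrame, which matches both error messages in the question.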

See this line in the source code of pipeline:

if hasattr(transformer, 'fit_transform'):
    res = transformer.fit_transform(X, y, **fit_params)
else:
    res = transformer.fit(X, y, **fit_params).transform(X)

You are returning two values (X, y), but they are captured in a single variable, res, so it becomes a tuple. That tuple then fails in your next transformer.

You can handle such data by unpacking the tuple into X, y like this:

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    ...
    ...
    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return self

    def transform (self, X, y=None):
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return X, y

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)

Make sure you do this for all subsequent transformers in the pipeline. However, I would recommend handling X and y separately. Also, I found that there are some related issues about transforming the target variable y in a pipeline, which you may want to look into.
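The recommendation to handle X and y separately could look roughly like the sketch below. It makes several assumptions: y is a pandas Series rather than the one-column DataFrame used in the question, and `drop_rows_with_missing_target` and `DropAllMissingColumns` are hypothetical names, not part of sklearn. Rows with a missing target are removed before the pipeline runs, and the transformer inside the pipeline returns X only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


def drop_rows_with_missing_target(X, y):
    """Filter X and y together, outside the pipeline (hypothetical helper)."""
    keep = y.notna().to_numpy()  # True where the target is present
    X_clean = X.loc[keep].reset_index(drop=True)
    y_clean = y.loc[keep].reset_index(drop=True)
    return X_clean, y_clean


class DropAllMissingColumns(TransformerMixin, BaseEstimator):
    """Drop columns that contain only NaN; transform returns X alone."""

    def fit(self, X, y=None):
        # Learn which columns have no non-missing values at fit time.
        self.all_missing_cols_ = [c for c in X.columns if X[c].count() == 0]
        return self

    def transform(self, X):
        # Return a new DataFrame instead of mutating the input in place.
        return X.drop(columns=self.all_missing_cols_)


X = pd.DataFrame({"col0": [0.1, 0.2, 0.3, 0.4], "col1": [np.nan] * 4})
y = pd.Series([1.0, np.nan, 0.0, 1.0], name="target")

X_clean, y_clean = drop_rows_with_missing_target(X, y)

pipeline = Pipeline([("drop_all_missing", DropAllMissingColumns())])
X_out = pipeline.fit_transform(X_clean, y_clean)
print(type(X_out).__name__, X_out.shape, y_clean.shape)  # DataFrame (3, 1) (3,)
```

With this split, every step receives a DataFrame as X, so no tuple unpacking is needed inside the transformers.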
