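What data do sklearn transformations operate on?

Time: 2017-10-04 19:38:44

Tags: python pandas scikit-learn

I have written a set of custom transformers in sklearn to clean the data in a pipeline. Each custom transformer takes two pandas DataFrames as arguments to fit and transform, and transform also returns two DataFrames (see the example below). This works fine when there is only one transformer in the pipeline: DataFrames in, DataFrames out.

However, when two transformers are combined in one Pipeline, like this: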

pipeline = Pipeline ([
        ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
        ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
        ])

X, y = pipeline.fit_transform (X, y)

==>TypeError: tuple indices must be integers or slices, not Series

RemoveMissingRowsBasedOnTarget mysteriously receives a tuple as input. When I swap the positions of the transformers, like this

pipeline = Pipeline ([
        ('remove_rows_based_on_target', RemoveMissingRowsBasedOnTarget ()),
        ('remove_missing_columns', RemoveAllMissing (['mailing_address_str_number'])),
        ])

==> AttributeError: 'tuple' object has no attribute 'apply'

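the error occurs in the class RemoveAllMissing. In both cases the error message is marked with ==>, placed above the line where the error occurred. I have tried to read up on what exactly happens here, but I could not find anything on this topic. Can somebody tell me what I am doing wrong? Below you will find the code that reproduces the problem.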

import numpy as np
import pandas as pd
import random
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

def create_data (rows, cols, frac_nan, random_state=42):
    random.seed (random_state)
    X = pd.DataFrame (np.zeros ((rows, cols)), 
                      columns=['col' + str(i) for i in range (cols)], 
                      index=None)
    # Create dataframe of (rows * cols) with random floating points
    y = pd.DataFrame (np.zeros ((rows,)))
    for row in range(rows):
        for col in range(cols):
            X.iloc [row,col] = random.random()
        X.iloc [row,1] = np.nan # column 1 consists solely of NaNs
        y.iloc [row] = random.randint (0, 1)
    # Assign NaN's to a fraction of X
    n = int(frac_nan * rows * cols)
    for i in range (n):
        row = random.randint (0, rows-1)
        col = random.randint (0, cols-1)
        X.iloc [row, col] = np.nan
    # Same applies to y
    n = int(frac_nan * rows)
    for i in range (n):
        row = random.randint (0, rows-1)
        y.iloc [row,] = np.nan

    return X, y    

class RemoveAllMissing (BaseEstimator, TransformerMixin):
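    # remove columns containing NaN only
    def __init__ (self, requested_cols=None):
        # copy into a fresh list so instances never share a mutable default
        self.all_missing_data = list (requested_cols) if requested_cols else []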

    def fit (self, X, y=None):
        # find empty columns == columns with all missing data
        missing_cols = X.apply (lambda x: x.count (), axis=0)
        for idx in missing_cols.index:
            if missing_cols [idx] == 0:
                self.all_missing_data.append (idx)

        return self

    def transform (self, X, y=None):
        print (">RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        for all_missing_predictor in self.all_missing_data:
            del X [all_missing_predictor]

        print ("<RemoveAllMissing - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    # remove each row where target contains one or more NaN's
    def __init__ (self):
        self.missing_rows = []

    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return self

        self.missing_rows = np.array (y.notnull ()) #  false = missing data

        return self

    def transform (self, X, y=None):
        print (">RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        if y is None:
            print ('RemoveMissingRowsBasedOnTarget: target (y) cannot be None')
            return X, y

        X = X [self.missing_rows].reset_index ()
        del X ['index']
        y = y [self.missing_rows].reset_index ()
        del y ['index']  

        print ("<RemoveMissingRowsBasedOnTarget - X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
        return X, y

    def fit_transform (self, X, y=None):
        return self.fit (X, y).transform (X, y)

pipeline = Pipeline ([
        ('RemoveAllMissing', RemoveAllMissing ()),
        ('RemoveMissingRowsBasedOnTarget', RemoveMissingRowsBasedOnTarget ()),
        ])

X, y = create_data (25, 10, 0.1)
print ("X shape: " + str (X.shape), " y shape: " + str (y.shape), 'type (X):', type(X))
X, y = pipeline.fit_transform (X, y) 
#X, y = RemoveAllMissing ().fit_transform (X, y)
#X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)

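Edit: As @Vivek requested, I have replaced the original code with code in which the problem is isolated and which runs standalone. The code crashes somewhere because a tuple, rather than a DataFrame, is passed as the argument. The pipeline changes the data type, and I cannot find that in the documentation. When one comments out the pipeline call and uncomments the separate transformer calls, everything works fine, like this: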

#X, y = pipeline.fit_transform (X, y) 
X, y = RemoveAllMissing ().fit_transform (X, y)
X, y = RemoveMissingRowsBasedOnTarget ().fit_transform (X, y)

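1 Answer:

Answer 0: (score: 2)

OK, now I see the error. Your classes return X, y. A pipeline can accept y as input (and pass it along to its inner transformers), but it assumes that y stays constant throughout and is never returned by any transform() method. That is not the case in your code. If you can separate that part out, it will work.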
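As a rough sketch of what separating that part out could look like: the rows with a missing target are filtered before the pipeline runs, and the transformer inside the pipeline returns X only. The names DropAllMissingColumns and drop_rows_with_missing_target, and the tiny example data, are hypothetical and not taken from the question; this is one possible arrangement under those assumptions, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DropAllMissingColumns(TransformerMixin, BaseEstimator):
    """Hypothetical X-only variant of RemoveAllMissing: returns X, never y."""

    def fit(self, X, y=None):
        # Remember the columns that contain nothing but NaN.
        self.all_missing_cols_ = [c for c in X.columns if X[c].count() == 0]
        return self

    def transform(self, X):
        # Return a single DataFrame so the next step receives a DataFrame.
        return X.drop(columns=self.all_missing_cols_)


def drop_rows_with_missing_target(X, y):
    """Hypothetical helper: filter rows outside the pipeline, where y may change."""
    keep = y.notnull().to_numpy().ravel()
    return X[keep].reset_index(drop=True), y[keep].reset_index(drop=True)


# Made-up data: col1 is entirely NaN, and the second target value is missing.
X = pd.DataFrame({"col0": [0.1, 0.2, 0.3], "col1": [np.nan, np.nan, np.nan]})
y = pd.DataFrame({"target": [1.0, np.nan, 0.0]})

X, y = drop_rows_with_missing_target(X, y)
pipeline = Pipeline([("drop_all_missing", DropAllMissingColumns())])
X_clean = pipeline.fit_transform(X, y)
```

Because fit_transform here comes from TransformerMixin, it calls fit(X, y) and then transform(X), so every step hands a plain DataFrame to the next one and y is left untouched by the pipeline.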

See this line in the source code of pipeline:

if hasattr(transformer, 'fit_transform'):
    res = transformer.fit_transform(X, y, **fit_params)
else:
    res = transformer.fit(X, y, **fit_params).transform(X)

You return two values (X, y), but they are captured in a single variable, res, so it becomes a tuple. That tuple is then passed on as X and fails in your next transformer.
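The effect can be reproduced without scikit-learn. The following is a simplified stand-in for that fit_transform call, not the library's actual code; run_steps, returns_x_and_y and reports_input_type are hypothetical names used only for illustration.

```python
def run_steps(steps, X, y):
    """Simplified stand-in for the pipeline loop: one variable per step result."""
    for step in steps:
        res = step(X, y)  # whatever the step returns is stored as-is
        X = res           # and handed to the next step as its X
    return X


def returns_x_and_y(X, y):
    # Mimics a transform() that ends with `return X, y`.
    return X, y


def reports_input_type(X, y):
    # Stands in for the second transformer; reports what it received as X.
    return type(X).__name__


received = run_steps([returns_x_and_y, reports_input_type], [1.0, 2.0], [0, 1])
print(received)  # prints "tuple"
```

The second step never sees the original list: it receives the (X, y) pair packed into one object, which is why calls such as X.apply or X [mask] fail there.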

You can handle this kind of data by unpacking the tuple into X and y, like this:

class RemoveMissingRowsBasedOnTarget (BaseEstimator, TransformerMixin):
    ...
    ...
    def fit (self, X, y = None):
        # remove all rows where the target value is missing data
        print (type (X))
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return self

    def transform (self, X, y=None):
        if isinstance(X, tuple):
            y=X[1]
            X=X[0]
        ...
        ...
        return X, y

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)

Make sure to do this for all subsequent transformers in the pipeline. But I suggest you handle X and y separately. Also, I found some related questions about transforming the target variable in a pipeline, which you can look at: