Sklearn's SimpleImputer doesn't work in a pipeline?

Time: 2018-08-08 08:20:54

Tags: scikit-learn pipeline sklearn-pandas

I have a pandas DataFrame with some NaN values in one particular column:

1291   NaN
1841   NaN
2049   NaN
Name: some column, dtype: float64

To deal with this, I built the following pipeline:

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

scaler = StandardScaler(with_mean = True)
imputer = SimpleImputer(strategy = 'median')
logistic = LogisticRegression()

pipe = Pipeline([('imputer', imputer),
                 ('scaler', scaler), 
                 ('logistic', logistic)])

Now, when I pass this pipeline to RandomizedSearchCV, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The traceback is actually much longer than that; I can post the whole error in an edit if needed. In any case, I'm fairly sure this column is the only one that contains NaNs. Moreover, if I switch from SimpleImputer to the (now deprecated) Imputer in the pipeline, the pipeline works fine in my RandomizedSearchCV. I checked the documentation, but SimpleImputer seems to behave (almost) exactly like Imputer. What is the difference in behavior?

How can I use an imputer in a pipeline without using the deprecated Imputer?
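For reference, here is a minimal sketch of the setup described above, run on synthetic data, since the asker's DataFrame and search parameters are not shown. The column names, the target `y`, and the `logistic__C` search range are all assumptions made for illustration. It also counts infinities separately, because SimpleImputer with its default `missing_values=np.nan` does not replace them, and the error message mentions infinity as well as NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import uniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (hypothetical columns and target).
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["a", "b", "some column"])
y = (X["a"] + X["b"] > 0).astype(int)
X.loc[[5, 17, 42], "some column"] = np.nan  # a few missing values

# Count NaN and infinite entries per column before fitting anything.
nan_counts = X.isna().sum()
inf_counts = np.isinf(X).sum()

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True)),
    ("logistic", LogisticRegression()),
])

# Hypothetical search space: C drawn uniformly from [0.1, 10.1].
search = RandomizedSearchCV(
    pipe,
    param_distributions={"logistic__C": uniform(0.1, 10)},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
```

If a search like this fits cleanly on NaN-only data but the real data still fails, the per-column `nan_counts` and `inf_counts` are a reasonable first place to look, along with the target passed to `fit`.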

2 answers:

Answer 0: (score: 0)

I ran into the same problem, and this fixed it:

imputer = SimpleImputer(strategy = 'median', fill_value = 0)
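One caveat worth checking: per the scikit-learn documentation, `fill_value` is only used when `strategy='constant'`. The sketch below, on a tiny made-up array, compares the two strategies so the effect of the extra argument can be inspected directly; it is an illustration, not an explanation of why the change helped the answerer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up column with one missing value.
X = np.array([[1.0], [np.nan], [3.0], [10.0]])

# Same arguments as in the answer: median strategy plus fill_value.
median_imputer = SimpleImputer(strategy="median", fill_value=0)
# For comparison: the strategy that actually consumes fill_value.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)

median_result = median_imputer.fit_transform(X)
constant_result = constant_imputer.fit_transform(X)

print(median_result.ravel())    # missing entry replaced by the column median
print(constant_result.ravel())  # missing entry replaced by fill_value
```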

Answer 1: (score: 0)

SimpleImputer in make_pipeline

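from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ColumnSelector is not part of scikit-learn; the answer does not show where it comes from.
preprocess_pipeline = make_pipeline(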
    FeatureUnion(transformer_list=[
        ('Handle numeric columns', make_pipeline(
            ColumnSelector(columns=['Amount']),
            SimpleImputer(strategy='constant', fill_value=0),
            StandardScaler()
        )),
        ('Handle categorical data', make_pipeline(
            ColumnSelector(columns=['Type', 'Name', 'Changes']),
            SimpleImputer(strategy='constant', missing_values=' ', fill_value='missing_value'),
            OneHotEncoder(sparse=False)
        ))
    ])
)

SimpleImputer in Pipeline

# Fragment of a Pipeline's step list. TypeSelector is not part of scikit-learn;
# the answer does not show its definition.
('features', FeatureUnion([
    ('Numerics', Pipeline([
        ('Numeric Extractor', TypeSelector(np.number)),
        ('Impute Zero', SimpleImputer(strategy="constant", fill_value=0))
    ])),
    ('Cat Columns', Pipeline([
        ('Category Extractor', TypeSelector("category")),
        ('Impute Missing', SimpleImputer(strategy="constant", fill_value='missing'))
    ]))
]))
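Both snippets above depend on selector classes (ColumnSelector, TypeSelector) that are not defined in the answer. As a self-contained alternative, the same per-column imputation can be sketched with scikit-learn's own ColumnTransformer. This swaps in a different technique from the one the answer uses; the column names are borrowed from the answer, and the data values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one missing value per column.
df = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 40.0],
    "Type": pd.Series(["a", "b", np.nan, "a"], dtype=object),
})

# Numeric columns: fill missing values with 0, then standardize.
numeric = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler(),
)

# Categorical columns: fill missing values with a placeholder, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing_value"),
    OneHotEncoder(handle_unknown="ignore"),
)

# Route each column list to its own sub-pipeline.
preprocess = ColumnTransformer([
    ("numeric", numeric, ["Amount"]),
    ("categorical", categorical, ["Type"]),
])

features = preprocess.fit_transform(df)
print(features.shape)
```

The resulting `preprocess` object can be used as the first step of a Pipeline, followed by an estimator, in the same way as the FeatureUnion in the answer.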