Imputing categorical missing values in scikit-learn

Date: 2014-08-11 09:26:42

Tags: python pandas scikit-learn imputation

My pandas data contains some columns of text type, and there are some NaN values among them. What I am trying to do is replace those NaNs with sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is in the implementation. Suppose there is a Pandas dataframe df with 30 columns, 10 of which are of categorical nature. Once I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df) 

Python generates an error: 'could not convert string to float: run1', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be very welcome

11 Answers:

Answer 0 (score: 78)

To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; I imagine it might make sense to use the median for integer columns.

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value 
        in column.

        Columns of other types are imputed with mean of column.

        """
    def fit(self, X, y=None):

        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)

        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints,

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
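The answer above suggests that the median might make sense for integer columns. A minimal sketch of that variant, assuming pandas' nullable Int64 dtype for integer columns with missing values (the class name and the test data here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin

class DataFrameImputerMedian(TransformerMixin):
    """Variant of DataFrameImputer: mode for object columns,
    median for integer columns, mean for everything else."""
    def fit(self, X, y=None):
        def fill_for(s):
            if s.dtype == np.dtype('O'):
                return s.value_counts().index[0]   # most frequent value
            if pd.api.types.is_integer_dtype(s):
                return s.median()                  # median for integers
            return s.mean()                        # mean for floats
        self.fill = pd.Series({c: fill_for(X[c]) for c in X})
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

df = pd.DataFrame({
    'a': pd.array([1, 1, 4, pd.NA], dtype='Int64'),  # nullable integer column
    'b': ['x', None, 'x', 'y'],
})
out = DataFrameImputerMedian().fit_transform(df)
```

Here the missing integer gets the column median (1) and the missing string gets the most frequent value ('x').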

Answer 1 (score: 4)

You can use sklearn_pandas.CategoricalImputer for the categorical columns. In detail:

First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have sub-pipelines for the numerical and the string/categorical features, where each sub-pipeline's first transformer is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

You can then combine these sub-pipelines with sklearn.pipeline.FeatureUnion, for example:

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])

Now, in num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.

Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas
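Putting the pieces together, here is a runnable sketch of the two sub-pipelines. The column names age and city are made up, and SimpleImputer from modern scikit-learn stands in for both the legacy Imputer and CategoricalImputer (since sklearn_pandas may not be installed); most_frequent handles the string column:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns as a numpy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

num_pipeline = Pipeline([
    ("select", DataFrameSelector(["age"])),              # assumed column name
    ("impute", SimpleImputer(strategy="mean")),
])
cat_pipeline = Pipeline([
    ("select", DataFrameSelector(["city"])),             # assumed column name
    ("impute", SimpleImputer(strategy="most_frequent")),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

df = pd.DataFrame({"age": [20.0, np.nan, 30.0],
                   "city": ["rome", "rome", np.nan]})
out = full_pipeline.fit_transform(df)
```

The missing age becomes the column mean (25.0) and the missing city becomes the most frequent value ('rome').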

Answer 2 (score: 2)

Copying and modifying sveitser's answer, I made an imputer for pandas.Series objects

import numpy
import pandas 

from sklearn.base import TransformerMixin

class SeriesImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        If the Series is of dtype Object, then impute with the most frequent object.
        If the Series is not of dtype Object, then impute with the mean.  

        """
    def fit(self, X, y=None):
        if   X.dtype == numpy.dtype('O'): self.fill = X.value_counts().index[0]
        else                            : self.fill = X.mean()
        return self

    def transform(self, X, y=None):
       return X.fillna(self.fill)

To use it you would do:

# Make a series
s1 = pandas.Series(['k', 'i', 't', 't', 'e', numpy.nan])


a  = SeriesImputer()   # Initialize the imputer
a.fit(s1)              # Fit the imputer
s2 = a.transform(s1)   # Get a new series

Answer 3 (score: 2)

Inspired by the answers here, and wanting a go-to Imputer for all use cases, I ended up writing this. It supports four imputation strategies: mean, mode, median, fill, and works on both pd.DataFrame and pd.Series.

mean and median only work for numeric data; mode and fill work for both numeric and categorical data.

import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='mean', filler='NA'):
        self.strategy = strategy
        self.fill = filler

    def fit(self, X, y=None):
        if self.strategy in ['mean', 'median']:
            dtypes = X.dtypes if isinstance(X, pd.DataFrame) else pd.Series([X.dtype])
            if not all(np.issubdtype(d, np.number) for d in dtypes):
                raise ValueError('dtypes mismatch: np.number dtype is '
                                 'required for ' + self.strategy)
        if self.strategy == 'mean':
            self.fill = X.mean()
        elif self.strategy == 'median':
            self.fill = X.median()
        elif self.strategy == 'mode':
            self.fill = X.mode().iloc[0]
        elif self.strategy == 'fill':
            if type(self.fill) is list and type(X) is pd.DataFrame:
                self.fill = dict(zip(X.columns, self.fill))
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

Usage

>> df   
    MasVnrArea  FireplaceQu
Id  
1   196.0   NaN
974 196.0   NaN
21  380.0   Gd
5   350.0   TA
651 NaN     Gd


>> CustomImputer(strategy='mode').fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   Gd
974 196.0   Gd
21  380.0   Gd
5   350.0   TA
651 196.0   Gd

>> CustomImputer(strategy='fill', filler=[0, 'NA']).fit_transform(df)
MasVnrArea  FireplaceQu
Id      
1   196.0   NA
974 196.0   NA
21  380.0   Gd
5   350.0   TA
651 0.0     Gd 

Answer 4 (score: 2)

  • strategy='most_frequent' can only be used with quantitative features, not with qualitative ones. This custom imputer can be used for both qualitative and quantitative features. Also, with the scikit-learn Imputer we can either use it for the whole data frame (if all features are quantitative), or use a 'for loop' with a list of similar-typed features/columns (see the example below). But the custom imputer can be used with any combination.

        from sklearn.preprocessing import Imputer
        impute = Imputer(strategy='mean')
        for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
              xx[cols] = impute.fit_transform(xx[[cols]])
    
  • Custom imputer:

       import numpy as np
       import pandas as pd

       from sklearn.preprocessing import Imputer
       from sklearn.base import TransformerMixin
    
       class CustomImputer(TransformerMixin):
             def __init__(self, cols=None, strategy='mean'):
                   self.cols = cols
                   self.strategy = strategy
    
             def transform(self, df):
                   X = df.copy()
                   impute = Imputer(strategy=self.strategy)
                   if self.cols is None:
                          self.cols = list(X.columns)
                   for col in self.cols:
                          if X[col].dtype == np.dtype('O'):
                                 X[col].fillna(X[col].value_counts().index[0], inplace=True)
                          else:
                                 X[col] = impute.fit_transform(X[[col]])
    
                   return X
    
             def fit(self, *_):
                   return self
    
    
  • Dataframe:

          X = pd.DataFrame({'city': ['tokyo', np.nan, 'london', 'seattle',
                                     'san francisco', 'tokyo'],
                            'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
                            'ordinal_column': ['somewhat like', 'like', 'somewhat like',
                                               'like', 'somewhat like', 'dislike'],
                            'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})
    
    
                city              boolean   ordinal_column  quantitative_column
            0   tokyo             yes       somewhat like   1.0
            1   NaN               no        like            11.0
            2   london            NaN       somewhat like   -0.5
            3   seattle           no        like            10.0
            4   san francisco     no        somewhat like   NaN
            5   tokyo             yes       dislike         20.0
    
  • 1) Can be used with a list of similar-typed features.

     cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
     cci.fit_transform(X)
    
  • 2) Can be used with strategy = median.

     sd = CustomImputer(['quantitative_column'], strategy = 'median')
     sd.fit_transform(X)
    
  • 3) Can be used with the whole data frame; it will use the default mean (or we can change it to median). For qualitative features it uses strategy='most_frequent', and for quantitative ones mean/median.

     call = CustomImputer()
     call.fit_transform(X)   
    

Answer 5 (score: 1)

This code fills in a series with the most frequent category:

import pandas as pd
import numpy as np

# create fake data 
m = pd.Series(list('abca'))
m.iloc[1] = np.nan #artificially introduce nan

print('m = ')
print(m)

#make dummy variables, count and sort descending:
most_common = pd.get_dummies(m).sum().sort_values(ascending=False).index[0] 

def replace_most_common(x):
    if pd.isnull(x):
        return most_common
    else:
        return x

new_m = m.map(replace_most_common) #apply function to original data

print('new_m = ')
print(new_m)

Output:

m =
0      a
1    NaN
2      c
3      a
dtype: object

new_m =
0    a
1    a
2    c
3    a
dtype: object
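On the same fake data, the dummy-count trick above is equivalent to taking the top of value_counts (as in answer 0), which avoids building the dummies:

```python
import numpy as np
import pandas as pd

# same fake data as above
m = pd.Series(list('abca'))
m.iloc[1] = np.nan

# value_counts sorts by frequency, so index[0] is the most common value
most_common = m.value_counts().index[0]
new_m = m.fillna(most_common)
```

fillna then replaces every NaN with that single value, without needing the element-wise map.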

Answer 6 (score: 1)

There is a package sklearn-pandas which has an option for the imputation of categorical variables: https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer

>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)

Answer 7 (score: 0)

Similar. Modify Imputer to handle strategy='most_frequent' for strings:

class GeneralImputer(Imputer):
    def __init__(self, **kwargs):
        Imputer.__init__(self, **kwargs)

    def fit(self, X, y=None):
        if self.strategy == 'most_frequent':
            self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
            self.statistics_ = self.fills.values
            return self
        else:
            return Imputer.fit(self, X, y=y)

    def transform(self, X):
        if hasattr(self, 'fills'):
            return pd.DataFrame(X).fillna(self.fills).values.astype(str)
        else:
            return Imputer.transform(self, X)

where pandas.DataFrame.mode() finds the most frequent value for each column and pandas.DataFrame.fillna() fills the missing values with those. Other strategy values are still handled the same way by Imputer.

Answer 8 (score: 0)

You can try the following:

"%__APPDIR__%timeout.exe"

Answer 9 (score: 0)

Using sklearn.impute.SimpleImputer instead of Imputer easily solves this problem, since it can handle categorical variables.

Per the sklearn documentation: if 'most_frequent', replace missing using the most frequent value along each column. Can be used with strings or numeric data.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

from sklearn.impute import SimpleImputer

impute_size = SimpleImputer(strategy="most_frequent")
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()
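A self-contained sketch of the same idea, with a made-up Outlet_Size column (fit_transform returns a 2-D array, hence the ravel):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up example frame mimicking the column in the answer above
data = pd.DataFrame({'Outlet_Size': ['Small', 'Medium', np.nan, 'Medium']})

impute_size = SimpleImputer(strategy='most_frequent')
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()
```

The NaN is replaced with 'Medium', the most frequent value in the column.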

Answer 10 (score: 0)

Missforest can be used for the imputation of missing values in a categorical variable along with other categorical features. It works in an iterative fashion similar to IterativeImputer, with a random forest as the base model.

Below is the code to label-encode the features and the target variable, fit the model to impute the nan values, and encode the features back:

import sys
import sklearn.neighbors._base
# missingpy still imports the old sklearn.neighbors.base path, so alias it
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from missingpy import MissForest
from sklearn.preprocessing import LabelEncoder  # used by label_encoding below

def label_encoding(df, columns):
    """
    Label encodes the set of the features to be used for imputation
    Args:
        df: data frame (processed data)
        columns: list (features to be encoded)
    Returns: dictionary
    """
    encoders = dict()
    for col_name in columns:
        series = df[col_name]
        label_encoder = LabelEncoder()
        df[col_name] = pd.Series(
            label_encoder.fit_transform(series[series.notnull()]),
            index=series[series.notnull()].index
        )
        encoders[col_name] = label_encoder
    return encoders

# adding to be imputed global category along with features
features = ['feature_1', 'feature_2', 'target_variable']
# label encoding features
encoders = label_encoding(data, features)
# categorical imputation using random forest 
# parameters can be tuned accordingly
imp_cat = MissForest(n_estimators=50, max_depth=80)
data[features] = imp_cat.fit_transform(data[features], cat_vars=[0, 1, 2])
# decoding features
for variable in features:
    data[variable] = encoders[variable].inverse_transform(data[variable].astype(int))