Pandas SimpleImputer: preserving data types

Time: 2018-11-11 23:28:48

Tags: pandas

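I'm running into a simple error in the code below.

My goal is to use SimpleImputer to impute missing values across columns of different dtypes in one pass.

When I try this, fit_transform does not seem to work properly. Without the dtype argument the code runs, but the resulting DataFrame loses its dtype information. When I pass the list of dtypes in the dtype argument, I get the error below. You can copy and paste the code locally to reproduce it.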

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

import sklearn
print(sklearn.__version__)

0.21.dev0

data = [['Alex','NJ',21,5.10],['Mary','NY',20,np.nan],['Sam',np.nan,np.nan,6.3]]
df = pd.DataFrame(data,columns=['Name','State','Age','Height'])

df.dtypes
Name       object
State      object
Age       float64
Height    float64
dtype: object                 

imp = SimpleImputer(strategy="most_frequent")

#df = pd.DataFrame(imp.fit_transform(df),columns=df.columns)   <<<<----- This works just fine
#df
#Name   State   Age Height
#0  Alex    NJ  21  5.1
#1  Mary    NY  20  5.1
#2  Sam NJ  20  6.3
#df.dtypes
#Name      object
#State     object
#Age       object
#Height    object
#dtype: object

The following statement fails with the error below (I am trying to preserve the dtypes through the imputation):

df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
      7 
      8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    337             data = {}
    338         if dtype is not None:
--> 339             dtype = self._validate_dtype(dtype)
    340 
    341         if isinstance(data, DataFrame):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
    166 
    167         if dtype is not None:
--> 168             dtype = pandas_dtype(dtype)
    169 
    170             # a compound dtype

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
   2020     # which we safeguard against by catching them earlier and returning
   2021     # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022     if dtype in [object, np.object_, 'object', 'O']:
   2023         return npdtype
   2024     elif npdtype.kind == 'O':

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1574         raise ValueError("The truth value of a {0} is ambiguous. "
   1575                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576                          .format(self.__class__.__name__))
   1577 
   1578     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

1 answer:

Answer 0 (score: 1)

If you want to preserve the dtypes, I'd suggest finding the mode with pandas and then calling fillna:

df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
   Name State   Age  Height
0  Alex    NJ  21.0     5.1
1  Mary    NY  20.0     5.1
2   Sam    NJ  20.0     6.3

print(df.dtypes)
Name       object
State      object
Age       float64
Height    float64
dtype: object
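
As a variation on the same idea, the per-column mode can also be taken with DataFrame.mode() rather than agg with a lambda. This is only a sketch, assuming the same sample data as in the question; the names modes and filled are introduced here for illustration. Where several values tie for most frequent, mode() returns them sorted, so taking the first row picks the smallest one.

```python
import numpy as np
import pandas as pd

data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

# DataFrame.mode() returns one row per tied mode; row 0 holds the first mode of each column.
modes = df.mode().iloc[0]

# fillna with a Series matches its index labels against the column names,
# so each column is filled with its own mode and keeps its dtype.
filled = df.fillna(modes)

print(filled)
print(filled.dtypes)
```

Because nothing is converted to an object array along the way, no cast back to the original dtypes is needed with this route.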

Alternatively, use astype and pass a dictionary:

df = pd.DataFrame(
     imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())

print(df)
   Name State   Age  Height
0  Alex    NJ  21.0     5.1
1  Mary    NY  20.0     5.1
2   Sam    NJ  20.0     6.3

print(df.dtypes)
Name       object
State      object
Age       float64
Height    float64
dtype: object

The explicit astype call is needed because, per the documentation, only a single dtype can be passed to the pd.DataFrame constructor.

?pd.DataFrame
...
dtype : dtype, default None
 |      Data type to force. Only a single dtype is allowed.
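
If the astype route is needed in more than one place, it can be wrapped in a small helper. The following is a rough sketch, not part of scikit-learn or pandas: the function name impute_keep_dtypes is hypothetical, and it assumes that no column is entirely missing (depending on the scikit-learn version, SimpleImputer may drop such columns, in which case the column labels would no longer line up).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_keep_dtypes(frame, strategy="most_frequent"):
    """Hypothetical helper: impute, then cast each column back to its original dtype."""
    imputer = SimpleImputer(strategy=strategy)
    filled = pd.DataFrame(
        imputer.fit_transform(frame),  # returns a plain array, so dtypes are lost here
        columns=frame.columns,
        index=frame.index,  # keep the original row labels
    )
    # astype accepts a {column: dtype} mapping, unlike the DataFrame constructor
    return filled.astype(frame.dtypes.to_dict())


data = [["Alex", "NJ", 21, 5.10], ["Mary", "NY", 20, np.nan], ["Sam", np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=["Name", "State", "Age", "Height"])

result = impute_keep_dtypes(df)
print(result)
print(result.dtypes)
```

Passing index=frame.index is an addition over the snippet above; it only matters when the input does not use the default 0..n-1 index.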