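I'm running into a simple error in the code below.
My goal is to use SimpleImputer to impute missing values in columns of different data types in a single pass.
When I try this, fit_transform does not seem to work as expected. Without the dtype argument the code runs, but the resulting DataFrame loses its dtype information. When I pass the dtypes via the dtype argument, I get the error shown below. You can copy and paste the code locally to reproduce it.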
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import sklearn
print(sklearn.__version__)
0.21.dev0
data = [['Alex','NJ',21,5.10],['Mary','NY',20,np.nan],['Sam',np.nan,np.nan,6.3]]
df = pd.DataFrame(data,columns=['Name','State','Age','Height'])
df.dtypes
Name object
State object
Age float64
Height float64
dtype: object
imp = SimpleImputer(strategy="most_frequent")
#df = pd.DataFrame(imp.fit_transform(df),columns=df.columns) <<<<----- This works just fine
#df
#Name State Age Height
#0 Alex NJ 21 5.1
#1 Mary NY 20 5.1
#2 Sam NJ 20 6.3
#df.dtypes
#Name object
#State object
#Age object
#Height object
#dtype: object
The following statement fails with the error below (I am trying to preserve the dtypes during imputation):
df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
7
8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
337 data = {}
338 if dtype is not None:
--> 339 dtype = self._validate_dtype(dtype)
340
341 if isinstance(data, DataFrame):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
166
167 if dtype is not None:
--> 168 dtype = pandas_dtype(dtype)
169
170 # a compound dtype
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
2020 # which we safeguard against by catching them earlier and returning
2021 # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022 if dtype in [object, np.object_, 'object', 'O']:
2023 return npdtype
2024 elif npdtype.kind == 'O':
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574 raise ValueError("The truth value of a {0} is ambiguous. "
1575 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576 .format(self.__class__.__name__))
1577
1578 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Answer 0 (score: 1)
If you want to preserve the dtypes, I suggest finding the mode with pandas and then calling fillna:
df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
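The fillna approach above can also be written as a small helper. This is only a sketch: `fill_with_mode` is a hypothetical name, not a pandas function, and it assumes that the first mode of each column is an acceptable fill value (ties are resolved by whatever order `Series.mode` returns). It uses the sample data from the question.

```python
import numpy as np
import pandas as pd


def fill_with_mode(frame):
    """Fill missing values in each column with that column's first mode.

    Hypothetical helper for illustration; not part of pandas.
    """
    fill_values = {}
    for col in frame.columns:
        modes = frame[col].mode()  # NaN is dropped by default
        if not modes.empty:  # an all-NaN column has no mode; leave it untouched
            fill_values[col] = modes.iloc[0]
    # fillna with a dict fills each column with its own value
    return frame.fillna(fill_values)


data = [['Alex', 'NJ', 21, 5.10], ['Mary', 'NY', 20, np.nan], ['Sam', np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=['Name', 'State', 'Age', 'Height'])

filled = fill_with_mode(df)
print(filled)
print(filled.dtypes)
```

Because the fill happens column by column inside pandas, the data never passes through a single object array, which is why the numeric columns keep their float64 dtype.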
Alternatively, use astype and pass a dictionary:
df = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
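If this pattern is needed in more than one place, it could be wrapped in a function. The sketch below is an assumption-laden illustration: `impute_keep_dtypes` is a hypothetical name, and it assumes the imputer returns one output column per input column (SimpleImputer may drop columns that contain only missing values, which would break the column mapping). It also keeps the original index, which the one-liner above does not.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_keep_dtypes(frame, imputer):
    """Impute with a scikit-learn imputer, then restore the original dtypes.

    Hypothetical helper; assumes no column is dropped by the imputer.
    """
    imputed = imputer.fit_transform(frame)  # mixed input comes back as one object array
    result = pd.DataFrame(imputed, columns=frame.columns, index=frame.index)
    # Cast each column back using a {column name: dtype} mapping
    return result.astype(frame.dtypes.to_dict())


data = [['Alex', 'NJ', 21, 5.10], ['Mary', 'NY', 20, np.nan], ['Sam', np.nan, np.nan, 6.3]]
df = pd.DataFrame(data, columns=['Name', 'State', 'Age', 'Height'])

restored = impute_keep_dtypes(df, SimpleImputer(strategy="most_frequent"))
print(restored)
print(restored.dtypes)
```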
The explicit astype call is needed because, according to the documentation, only a single dtype can be passed to the pd.DataFrame constructor.
?pd.DataFrame ... dtype : dtype, default None | Data type to force. Only a single dtype is allowed.
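To make the difference concrete, here is a minimal sketch contrasting the two calls. The column names `a` and `b` and the sample values are made up for illustration; the object array stands in for the kind of output an imputer returns for mixed input.

```python
import numpy as np
import pandas as pd

# Object array, similar in spirit to what fit_transform returns for mixed columns
raw = np.array([[1, 2.5], [3, 4.5]], dtype=object)

# The constructor accepts one dtype, applied to every column
single = pd.DataFrame(raw, columns=['a', 'b'], dtype='float64')

# astype accepts a {column name: dtype} mapping, so each column can differ
per_column = pd.DataFrame(raw, columns=['a', 'b']).astype({'a': 'int64', 'b': 'float64'})

print(single.dtypes)
print(per_column.dtypes)
```

Passing `df.dtypes` (a Series) to the constructor's dtype argument asks pandas to interpret the whole Series as one dtype, which is consistent with the ValueError in the question's traceback.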