Question

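I am new to Python and pandas. I am trying to preprocess a large DataFrame with numeric and categorical features, where some columns contain NaN values. First, I try to get the feature matrix and then use Imputer to set the NaN values to the mean or median.

Here is the DataFrame: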

    MSSubClass MSZoning  LotFrontage  LotArea Street LotShape LandContour  \
0             60       RL         65.0     8450   Pave      Reg         Lvl   
1             20       RL         80.0     9600   Pave      Reg         Lvl   
2             60       RL         68.0    11250   Pave      IR1         Lvl   
3             70       RL         60.0     9550   Pave      IR1         Lvl   
4             60       RL         84.0    14260   Pave      IR1         Lvl   
5             50       RL         85.0    14115   Pave      IR1         Lvl   
6             20       RL         75.0    10084   Pave      Reg         Lvl   
7             60       RL          NaN    10382   Pave      IR1         Lvl   
8             50       RM         51.0     6120   Pave      Reg         Lvl   
9            190       RL         50.0     7420   Pave      Reg         Lvl   
10            20       RL         70.0    11200   Pave      Reg         Lvl   
11            60       RL         85.0    11924   Pave      IR1         Lvl

Code: just to replace the NaN values in LotFrontage (column index 2) with the column mean:

imputer = Imputer(missing_values='Nan',strategy="mean",axis=0)
features = reduced_data.iloc[:,:-1].values
imputer.fit(features[:,2])

When I run this, I get the error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

First: is my approach correct? Second: how do I deal with the error?

Thanks.

Answer 1

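I suspect that, because of the string 'Nan', your LotFrontage column is stored with object dtype. You can check with the line below; it will most likely report object/string.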

print(reduced_data.LotFrontage.values.dtype)

Imputer works only on floats.
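The dtype check and the conversion to float can be tried on a small stand-in before touching the real frame. This is only a sketch: the Series below is hypothetical data, not the asker's column, and it assumes the missing entry was stored as the string 'Nan'.

```python
import pandas as pd

# Hypothetical stand-in for the LotFrontage column: one missing entry was
# stored as the string 'Nan', which forces the whole column to object dtype.
lot_frontage = pd.Series([65.0, 80.0, "Nan", 60.0], name="LotFrontage")
print(lot_frontage.dtype)

# errors='coerce' turns anything that cannot be parsed as a number into NaN,
# and the result is a float column.
cleaned = pd.to_numeric(lot_frontage, errors="coerce")
print(cleaned.dtype)
print(cleaned.isna().sum())
```

If the first print already reports float64 on the real data, the column is not the source of the error and the missing_values argument is the more likely cause.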

First approach:

You can do the following: 1) convert the column to float, 2) compute the mean of the LotFrontage column, 3) fill the NaN values in the DataFrame with the pandas DataFrame method fillna.

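reduced_data.LotFrontage = pd.to_numeric(reduced_data.LotFrontage, errors='coerce')
m = reduced_data.LotFrontage.mean(skipna=True)
# fillna returns a new DataFrame, so assign the result back
reduced_data = reduced_data.fillna(m)

The code above fills every NaN in the DataFrame, in any column, with the LotFrontage mean. To fill only that column, call fillna on reduced_data.LotFrontage instead.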
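Because the frame mixes numeric and categorical columns, a per-column fill may be closer to what is wanted than a single value for the whole frame. A minimal sketch, assuming a hypothetical miniature of the data (the values and the choice of mode for categorical columns are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question.
df = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "LotArea": [8450, 9600, 10382, 9550],
    "LotShape": ["Reg", "Reg", None, "IR1"],
})

numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.columns.difference(numeric_cols)

# Each numeric column is filled with its own mean.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Each categorical column is filled with its most frequent value.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

Passing a Series of means to fillna matches values to columns by name, so no column receives another column's mean.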

Second approach:

reduced_data.LotFrontage = pd.to_numeric(reduced_data.LotFrontage, errors='coerce')
imputer = Imputer()
features = reduced_data.iloc[:,:-1].values
# slice with 2:3 so the column stays 2-D, which fit expects
imputer.fit(features[:,2:3])

Answer 2

Note the difference between the 'Nan' you used and 'NaN':

  imputer = Imputer(missing_values='NaN',strategy="mean",axis=0)

Replace 'Nan' with 'NaN' and you will not get this error.
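Why the spelling matters can be illustrated with a small sketch. It rests on an assumption about the legacy Imputer's internals, namely that it accepts the exact string "NaN" and otherwise passes missing_values to np.isnan; the helper below is hypothetical and only mimics that check.

```python
import numpy as np

def is_nan_marker(value_to_mask):
    # Hypothetical helper mimicking the assumed check: accept the exact
    # string "NaN", otherwise ask NumPy whether the value is NaN.
    return value_to_mask == "NaN" or np.isnan(value_to_mask)

print(is_nan_marker("NaN"))   # exact spelling is accepted
print(is_nan_marker(np.nan))  # the float NaN is accepted too

error_message = None
try:
    is_nan_marker("Nan")      # wrong case: falls through to np.isnan on a str
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Under that assumption, the TypeError in the question comes from the misspelled argument rather than from the data.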

Answer 3

In the missing_values parameter, use 'NaN' instead of 'Nan': imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

Answer 4

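Try this; it is an example of working code:

import numpy as np
from sklearn.preprocessing import Imputer
# X is assumed to be a numeric 2-D NumPy array of features
imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])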
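On scikit-learn 0.22 and later, sklearn.preprocessing.Imputer no longer exists; its replacement is sklearn.impute.SimpleImputer, which always works column-wise and takes no axis argument. The following is a sketch of the same fit/transform pattern with that class, using a hypothetical feature matrix rather than the asker's data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric feature matrix; column 1 has one missing value.
X = np.array([
    [60.0, 65.0, 8450.0],
    [20.0, 80.0, 9600.0],
    [60.0, np.nan, 10382.0],
    [50.0, 51.0, 6120.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform expects a 2-D array, so slice with 1:2 rather than index 1.
X[:, 1:2] = imputer.fit_transform(X[:, 1:2])

print(X)
```

Categorical columns would need a different strategy (for example "most_frequent") or a separate encoding step before a mean can be computed.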

Answer 5

This should work:

f_ = function(data, cond) {
    data %>%
        filter(b == !!cond)

 }

f_(d, cond = 2)
# A tibble: 1 x 2
#   cond     b
#  <dbl> <dbl>
#1     2     2

TypeError: ufunc 'isnan' not supported for the input types, when using Imputer for NaN values

5 answers: