Question

I am doing data preprocessing and want to drop the features/columns that have more than 10% missing values.

I wrote the following code:

df_missing=df.isna()
result=df_missing.sum()/len(df)
result

Default           0.010066
Income            0.142857
Age               0.109090
Name              0.047000
Gender            0.000000
Type of job       0.200000
Amt of credit     0.850090
Years employed    0.009003
dtype: float64

I want df to keep only the columns in which no more than 10% of the values are missing.

Expected output:

df

Default   Name   Gender   Years employed

(Columns with more than 10% missing values are dropped.)

I tried

result.iloc[:,0] 
IndexingError: Too many indexers

Please help.

Answer 1

Since a sum divided by the length is the mean, you can replace df_missing.sum()/len(df) with df_missing.mean():

result = df.isna().mean()
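
A quick check of that equivalence, as a sketch on a small made-up frame (the name `toy` and its values are illustrative assumptions, not the asker's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the equivalence
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],      # 1 of 4 values missing
    "b": [np.nan, np.nan, "x", "y"],   # 2 of 4 values missing
})

# Fraction of missing values per column, computed the long way
by_sum = toy.isna().sum() / len(toy)

# Same quantity: the mean of a boolean column is the fraction of True values
by_mean = toy.isna().mean()

print(by_mean)
```

Both expressions return a Series indexed by column name, which is also why result.iloc[:,0] in the question fails: a Series has only one axis to index.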

Then use DataFrame.loc, with : to select all rows and a boolean mask to filter the columns:

df = df.loc[:,result > .1]

Answer 2

The filter in Answer 1 should be df = df.loc[:,result < .1], because the asker only wants to keep the columns with fewer than 10% missing values.
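
Putting the two answers together, a minimal end-to-end sketch might look like the following. The data is made up (only the column names are borrowed from the question), and `threshold` is a name introduced here for illustration. It uses `<=` so that a column sitting at exactly 10% would be kept, which matches "more than 10%" in the question; switch to `<` to drop such a column as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows, column names borrowed from the question
n = 20
df = pd.DataFrame({
    "Default": np.zeros(n),                            # 0% missing
    "Income": [np.nan] * 3 + [50_000.0] * (n - 3),     # 15% missing
    "Gender": ["F", "M"] * (n // 2),                   # 0% missing
    "Years employed": [np.nan] + [5.0] * (n - 1),      # 5% missing
})

threshold = 0.1               # maximum allowed fraction of missing values
result = df.isna().mean()     # per-column fraction of missing values

# Keep all rows (:) and only the columns at or below the threshold
df = df.loc[:, result <= threshold]

print(df.columns.tolist())    # ['Default', 'Gender', 'Years employed']
```

The mask is computed before df is reassigned, so the filter always refers to the missing-value fractions of the original frame.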

Drop columns whose missing values exceed a threshold in pandas
