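Suppose I have a df in which one column is 50% missing values.
How can I drop some of the rows where that column is missing, so that the missing share falls by 10 percentage points?
In other words, how do I reduce the column's missing-value percentage from 50% to 40%?
Input (50% of the values missing (6/12)):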
0
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 1.0
8 NaN
9 1.0
10 NaN
11 1.0
Output (40% of the values missing (4/10)): the last 2 NaN rows, with IDs 8 and 10, have been dropped
0
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 1.0
9 1.0
11 1.0
Answer 0: (score: 0)
Try this:
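import numpy as np

# find the index labels of the NaN entries in the column
# ('your_column' is a placeholder for the actual column name)
nan_entries = df.index[df['your_column'].isna()]
# randomly choose NaN rows, 10% of the total row count, without repeats
drop_indices = np.random.choice(nan_entries, size=int(df.shape[0] * 0.1), replace=False)
# drop them; drop() returns a new DataFrame, so assign the result
df = df.drop(drop_indices)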
Answer 1: (score: 0)
To get an array of the index labels where the column is NaN, use:
nan_indices = df.index[df['your_column'].isna()]
To drop, for example, the first 20% of them, use:
df.drop(nan_indices[:int(len(nan_indices) * 0.2)])  # returns a new DataFrame; pass inplace=True to modify the original instead
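Note that a fixed fraction of the NaN rows does not map directly onto a target missing percentage, because dropping rows also shrinks the total row count. A minimal sketch of one way to work out the count, assuming the goal is the question's 50% to 40% case: with `n_nan` missing values out of `n_rows`, find the smallest `k` such that `(n_nan - k) / (n_rows - k) <= target`, then drop the last `k` NaN rows. The function name `drop_nan_rows_to_target` and its parameters are hypothetical, and `target` is assumed to be below 1.

```python
import math

import numpy as np
import pandas as pd


def drop_nan_rows_to_target(df, column, target):
    """Drop the last NaN rows of `column` until its missing fraction is <= target.

    Hypothetical helper for illustration; assumes 0 <= target < 1.
    """
    n_rows = len(df)
    nan_indices = df.index[df[column].isna()]
    n_nan = len(nan_indices)
    # Smallest integer k with (n_nan - k) / (n_rows - k) <= target.
    # The small tolerance guards against floating-point round-off before ceil.
    k = math.ceil((n_nan - target * n_rows) / (1 - target) - 1e-9)
    k = max(0, min(k, n_nan))
    # Slice from a non-negative start; [-0:] would select every NaN row.
    to_drop = nan_indices[n_nan - k:]
    return df.drop(to_drop)


# The input from the question: 6 of 12 values missing in column 0.
df = pd.DataFrame({0: [1.0, 1.0, np.nan, np.nan, np.nan, 1.0,
                       np.nan, 1.0, np.nan, 1.0, np.nan, 1.0]})
result = drop_nan_rows_to_target(df, column=0, target=0.4)
print(result.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 9, 11]
```

For the question's data this gives `k = 2`, so rows 8 and 10 are dropped and 4 of the remaining 10 values are missing. To drop random NaN rows instead of the last ones, `to_drop` could be sampled with `np.random.choice(nan_indices, size=k, replace=False)`.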