Most efficient way to randomly null out values in a DataFrame

Time: 2016-10-24 18:11:57

Tags: python pandas numpy

Consider df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))
df

How can I randomly set about 20% of these values to null?

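2 Answers:

Answer 0: (score: 9)

You can use numpy.random.choice to generate a boolean mask, then pass it to df.mask, which replaces the values with NaN wherever the mask is True: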

import numpy as np

mask = np.random.choice([True, False], size=df.shape, p=[.2,.8])

df.mask(mask)

In one line:

df.mask(np.random.choice([True, False], size=df.shape, p=[.2,.8]))
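Two details are worth keeping in mind here: df.mask returns a new DataFrame and leaves df itself unchanged, and each cell is drawn independently, so the realized share of nulls is only approximately 20%. The following is a minimal sketch of both points. It assumes NumPy's seeded Generator API (np.random.default_rng) in place of the legacy np.random.choice so that the run is reproducible; the seed and the names rng, nulled and realized are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

# Seeded generator (illustrative seed) so the same mask is drawn every run.
rng = np.random.default_rng(0)

# Each cell is True with probability 0.2, independently of the others.
mask = rng.choice([True, False], size=df.shape, p=[.2, .8])

# df.mask returns a new frame; df itself is not modified.
nulled = df.mask(mask)

# Realized fraction of nulls: close to 0.2, but not guaranteed to equal it.
realized = nulled.isna().to_numpy().mean()
print(realized)
```

To keep the result, assign it back (df = df.mask(mask)).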

Speed test with timeit, at about 770 μs per loop:

$ python -m timeit -n 10000 \
        -s "import pandas as pd;import numpy as np;df=pd.DataFrame(np.ones((10,10))*2)" \
        "df.mask(np.random.choice([True,False], size=df.shape, p=[.2,.8]))"
10000 loops, best of 3: 770 usec per loop
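If exactly 20% of the cells must be nulled rather than roughly 20%, one possible variation (not part of the original answer) is to pick that many distinct flat positions without replacement and build the mask from them. This is a sketch under that assumption; the seed and the names n_null, flat, exact_mask and exact are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

rng = np.random.default_rng(0)  # illustrative seed

# Number of cells to null: 20% of all cells, rounded to an integer.
n_null = int(round(0.2 * df.size))

# Choose n_null distinct positions in the flattened frame.
flat = np.zeros(df.size, dtype=bool)
flat[rng.choice(df.size, size=n_null, replace=False)] = True

# Reshape the flat mask back to the frame's shape and apply it.
exact_mask = flat.reshape(df.shape)
exact = df.mask(exact_mask)

print(exact.isna().sum().sum())  # 20
```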

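Answer 1: (score: 4)

You can use stack, sample and unstack together, where the fraction you sample is the fraction of non-null values you want in the resulting frame (i.e. one minus the fraction of nulls).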

df = df.stack().sample(frac=0.8).unstack()

Resulting output:

     A    B    C    D    E    F    G    H    I    J
a  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0  2.0  2.0
b  2.0  NaN  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0
c  2.0  NaN  NaN  2.0  2.0  2.0  NaN  2.0  2.0  2.0
d  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0
e  2.0  2.0  2.0  2.0  2.0  NaN  2.0  NaN  2.0  NaN
f  2.0  2.0  NaN  NaN  2.0  NaN  2.0  2.0  2.0  2.0
g  2.0  2.0  NaN  2.0  NaN  2.0  2.0  2.0  2.0  2.0
h  2.0  2.0  2.0  2.0  2.0  2.0  2.0  NaN  NaN  2.0
i  NaN  2.0  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0
j  2.0  2.0  2.0  2.0  NaN  2.0  2.0  2.0  2.0  2.0

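Edit:

Depending on how small the sampled fraction is, the method above can drop rows or columns entirely when all of their values end up NaN. If that is not what you want, the workaround is to add a reindex at the end: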

df = df.stack().sample(frac=0.8).unstack().reindex(index=df.index, columns=df.columns)
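To see why the reindex matters, the sketch below deliberately keeps only 5% of the cells, so most rows and columns have no surviving value and disappear after unstack. The random_state argument and the names kept, small and fixed are illustrative additions for reproducibility and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((10, 10)) * 2,
                  list('abcdefghij'), list('ABCDEFGHIJ'))

# Keep 5% of the cells (5 of 100); random_state makes the draw repeatable.
kept = df.stack().sample(frac=0.05, random_state=0)

# Without reindex: only rows/columns that still hold a value survive.
small = kept.unstack()
print(small.shape)

# With reindex: the original labels are restored and filled with NaN.
fixed = kept.unstack().reindex(index=df.index, columns=df.columns)
print(fixed.shape)
```

With only five surviving cells, small can have at most five rows and five columns, while fixed always has the shape of the original frame.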