Question

我有一个包含20列和10.000行的数据集。我的计划是将数据集中的某些数据替换为NaN。我的任务是观察缺失值对数据集的影响。

我的计划是使用Zipf分布生成每一列的缺失百分比，然后根据这些百分比将一些值替换为NaN。

例如，这里是我的代码：

import matplotlib.pyplot as plt
from scipy import special

import numpy as np
a = 1.01 # parameter
s = np.random.zipf(a, 200000)
count, bins, ignored = plt.hist(s[s<20], 20, density=True)

plt.show()

条形图如下所示：

是否可以获取条形图的百分比，所以我可以根据条形图的百分比替换每列中的某些值？例如，第一列丢失80％，第二列丢失40％，第三列丢失25％，等等。

Answer 1

您需要了解Zipf分布的定义。维基百科上对此进行了很好的解释。这是Wiki中的图片。

有两个重要的参数，分别是ffill和#dummy df df = pd.DataFrame({'temp': [36, 39, 24, 34 ,56, 42, 40, 38, 36, 37, 32, 36, 23]}) df['status'] = pd.Series('bad', index=df.index).where(df.temp.lt(35)|df.temp.gt(45))\ .ffill(limit=2).fillna('good') print (df) temp status 0 36 good 1 39 good 2 24 bad 3 34 bad 4 56 bad 5 42 bad #here it is 42 but the previous row is bad so still bad 6 40 bad #here it is 40 but the second previous row is bad so still bad 7 38 good #here it is good then 8 36 good 9 37 good 10 32 bad 11 36 bad 12 23 bad。参数a> 1对您的行有影响（在N上图中），a是行大小。

当您基于zipf分布生成数字时，您必须考虑频率。在您的代码中，您使用了a = s，这意味着您的柱高已被规范化，而无需使用此参数，您将看到准确的计数数字。

结果：

density=True

以确切的数字作图：

column_rank = list(range(1,21))
a = 2.
N = 200000
s = np.random.zipf(a, N)
for i in column_rank: 
    print(i, ((len(s[s==i]))/N)*100)

结果：

https://console.cloud.google.com/appengine/

从条形图Zipf分布中获取条形的百分比

1 个答案: