Question

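I would like to know whether there is a Pythonic way to fill nulls in categorical data by randomly sampling from the distribution of unique values. Basically, fill the categorical nulls proportionally/randomly, based on the existing distribution of values in the column.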

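- Below is an example of what I have already done

- I am using numbers as categories to save time; I am not sure how to randomly generate letters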

import numpy as np
import pandas as pd
np.random.seed([1])
df = pd.DataFrame(np.random.normal(10, 2, 20).round().astype(object))
df.rename(columns = {0 :  'category'}, inplace = True)
df.loc[::5] = np.nan
print(df)

   category
0       NaN
1        12
2         4
3         9
4        12
5       NaN
6        10
7        12
8        13
9         9
10      NaN
11        9
12       10
13       11
14        9
15      NaN
16       10
17        4
18        9
19        9

This is how I currently impute the values:

df.category.value_counts()

9     6
12    3
10    3
4     2
13    1
11    1

df.category.value_counts()/16

9     0.3750
12    0.1875
10    0.1875
4     0.1250
13    0.0625
11    0.0625

# to fill categorical info based on percentage
category_fill = np.random.choice((9, 12, 10, 4, 13, 11), size = 4, p = (.375, .1875, .1875, .1250, .0625, .0625))
df.loc[df.category.isnull(), "category"] = category_fill

The final output works; it just takes a while to write.

df.category.value_counts()

9     9
12    4
10    3
4     2
13    1
11    1

Is there a faster way to do this, or a function that serves this purpose?

Thanks for your help!

Answer 1

You can use stats.rv_discrete:

from scipy import stats

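# observed frequency of each category (NaN is excluded by default)
counts = df.category.value_counts()
# discrete distribution over the observed values, weighted by relative frequency
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
# draw one value per missing row
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values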
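
The same proportional fill can also be sketched without SciPy, by letting pandas compute the relative frequencies and passing them to NumPy's sampler. This is only a minimal sketch: the toy column, the seed, and the names rng, probs and mask are assumptions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical toy column; values and NaN positions are illustrative only.
df = pd.DataFrame({"category": [9, 12, np.nan, 9, 10, np.nan, 4, 9]})

# Relative frequency of each observed value; NaN is excluded by default.
probs = df["category"].value_counts(normalize=True)

# Draw one replacement per missing row, weighted by those frequencies.
mask = df["category"].isnull()
fill_values = rng.choice(probs.index.to_numpy(), size=int(mask.sum()), p=probs.to_numpy())
df.loc[mask, "category"] = fill_values
```

Because the labels are taken from probs.index rather than typed in by hand, nothing has to be rewritten when the column's values change.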

Edit: for general data (not restricted to integers), you can do:

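# sample integer positions 0..k-1 instead of the category labels themselves
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]),
                                 counts/counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
# map the sampled positions back to the original labels
df.loc[df.category.isnull(), "category"] = counts.iloc[fill_idxs].index.values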
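
For non-numeric labels such as letters, another option is to resample the observed values directly: drawing with replacement from the non-null entries picks each label in proportion to how often it already occurs. The following is a rough sketch under that assumption; the letter data and the names s, mask and draws are hypothetical and do not come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical letter categories; values and NaN positions are illustrative only.
s = pd.Series(["a", "b", np.nan, "a", "c", np.nan, "a", "b"],
              name="category", dtype=object)

mask = s.isnull()

# Sampling the observed values with replacement draws each label in
# proportion to how often it already appears in the column.
draws = s.dropna().sample(n=int(mask.sum()), replace=True, random_state=0)

# Assign by position; .to_numpy() discards the sampled rows' index so that
# pandas does not try to align on it.
s.loc[mask] = draws.to_numpy()
```

This avoids building an explicit probability table, at the cost of relying on pandas' own sampler rather than a reusable distribution object.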

Filling multiple nulls for categorical data
