I created a dataframe 'Pclass':
class deck weight
0 3 C 0.367568
1 3 B 0.259459
2 3 D 0.156757
3 3 E 0.140541
4 3 A 0.070270
5 3 T 0.005405
My initial dataframe 'df' looks like:
class deck
0 3 NaN
1 1 C
2 3 NaN
3 1 C
4 3 NaN
5 3 NaN
6 1 E
7 3 NaN
8 3 NaN
9 2 NaN
10 3 G
11 1 C
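I want to fill the null deck values in df by sampling a deck for each one according to the weights given in Pclass.
I have only managed to code the sampling step.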
np.random.choice(a=Pclass.deck,p=Pclass.weight)
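I can't work out how to fill the nulls by finding the null rows that belong to class 3 and picking a random deck value for each one (not the same value every time, so fillna with a single value won't do).
Note: I have another, similar question that is broader, about groupby objects and maximising efficiency, but I got no replies. Any help would be greatly appreciated!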
Edit: added rows to the Pclass dataframe:
1 F 0.470588
1 E 0.294118
1 D 0.235294
2 F 0.461538
2 G 0.307692
2 E 0.230769
Answer 0 (score: 1)
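This generates random selections from the deck column of the Pclass dataframe and assigns them to the deck column of the df dataframe (generating as many values as are needed). If you want to do this across different values of the class variable, you can put these commands in a list comprehension. I suggest avoiding class as a variable name, because it is a reserved keyword that Python uses to define new classes.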
import numpy as np
import pandas as pd
# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()
Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3],
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": normweights
})
df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})
# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]
# Generate new values
new_vals = np.random.choice(a=Pclass.deck.values,
                            p=Pclass.weight.values, size=len(missing_locs))
# Assign the new values to the dataframe
# (set_value was removed in pandas 1.0; .loc does the same job here)
df.loc[df.index[missing_locs], 'deck'] = new_vals
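The same fill can also be written with a boolean mask instead of positional indices, which avoids any mix-up between positions and index labels. The following is only a sketch under a few assumptions: it uses the class-3 weights printed in the question rather than random ones, a seeded np.random.default_rng generator (the seed 0 is arbitrary), and the names mask, rng and p are illustrative.

```python
import numpy as np
import pandas as pd

# Class-3 weights copied from the question's Pclass table
Pclass = pd.DataFrame({
    "cla": [3] * 6,
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": [0.367568, 0.259459, 0.156757, 0.140541, 0.070270, 0.005405],
})
df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan,
             "E", np.nan, np.nan, np.nan, "G", "C"],
})

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# True where deck is missing and the class is 3
mask = df["deck"].isna() & (df["cla"] == 3)

# Renormalise so the rounded weights sum to exactly 1
p = Pclass["weight"].to_numpy(dtype=float)
p = p / p.sum()

# One independent draw per missing row
df.loc[mask, "deck"] = rng.choice(Pclass["deck"].to_numpy(),
                                  size=int(mask.sum()), p=p)
```

Rows of other classes and rows that already had a deck are left untouched, because the assignment only targets the rows selected by mask.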
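If you want to run this across all levels of the class variable, you need to make sure you select the matching subset of the data in Pclass (just the class of interest). A list comprehension can be used to find the missing data for each level of the class, like so (I have updated the mock data further down):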
# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]
However, I think the code is easier to read if it is written as a loop:
# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()
normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()
Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
    "weight": np.concatenate((normweights3, normweights2))
})
df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C",
             np.nan, np.nan, "E", np.nan,
             np.nan, np.nan, "G", "C"]
})
class_levels = [1, 2, 3]
for i in class_levels:
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]
    if len(missing_locs) > 0:
        subset = Pclass[Pclass.cla == i]
        # Generate new values
        new_vals = np.random.choice(a=subset.deck.values,
                                    p=subset.weight.values, size=len(missing_locs))
        # Assign the new values to the dataframe
        # (set_value was removed in pandas 1.0; .loc does the same job here)
        df.loc[df.index[missing_locs], 'deck'] = new_vals
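Since the question also mentions groupby, the loop above could be wrapped in a function that iterates over the weight table grouped by class. This is a sketch rather than a definitive implementation: fill_deck_by_class is a hypothetical helper name, the weights are the ones printed in the question (including the rows from its edit), and the seed is arbitrary. The weights are renormalised per class because NumPy checks that p sums to 1 within a tight tolerance, and the class-2 weights as printed sum to 0.999999.

```python
import numpy as np
import pandas as pd

def fill_deck_by_class(df, weights, rng):
    """Return a copy of df with missing decks sampled per class from weights.

    Hypothetical helper: expects columns 'cla' and 'deck' in df, and
    'cla', 'deck' and 'weight' in weights.
    """
    out = df.copy()
    for cla, table in weights.groupby("cla"):
        mask = out["deck"].isna() & (out["cla"] == cla)
        n_missing = int(mask.sum())
        if n_missing == 0:
            continue  # nothing to fill for this class
        p = table["weight"].to_numpy(dtype=float)
        p = p / p.sum()  # rounded weights may not sum to exactly 1
        out.loc[mask, "deck"] = rng.choice(table["deck"].to_numpy(),
                                           size=n_missing, p=p)
    return out

# Weights as printed in the question, including the rows from its edit
Pclass = pd.DataFrame({
    "cla": [3] * 6 + [1] * 3 + [2] * 3,
    "deck": ["C", "B", "D", "E", "A", "T",
             "F", "E", "D",
             "F", "G", "E"],
    "weight": [0.367568, 0.259459, 0.156757, 0.140541, 0.070270, 0.005405,
               0.470588, 0.294118, 0.235294,
               0.461538, 0.307692, 0.230769],
})
df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan,
             "E", np.nan, np.nan, np.nan, "G", "C"],
})

filled = fill_deck_by_class(df, Pclass, np.random.default_rng(0))
```

Classes that appear in df but not in the weight table would simply keep their missing values, and the original df is not modified because the function works on a copy.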