Filling null values in a dataframe with different elements by sampling with pandas

Asked: 2017-02-23 00:27:37

Tags: python pandas

I have created a dataframe 'Pclass':

    class   deck    weight
0   3       C         0.367568
1   3       B         0.259459
2   3       D         0.156757
3   3       E         0.140541
4   3       A         0.070270
5   3       T         0.005405

My initial dataframe 'df' looks like this:

  class deck
0   3   NaN
1   1   C
2   3   NaN
3   1   C
4   3   NaN
5   3   NaN
6   1   E
7   3   NaN
8   3   NaN
9   2   NaN
10  3   G
11  1   C

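I want to fill the null deck values in df by drawing a sample from the decks given in Pclass, according to their weights.

So far I have only managed to code the sampling step: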

np.random.choice(a=Pclass.deck,p=Pclass.weight)

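I can't work out how to fill the nulls by finding the null rows that belong to class 3 and picking a random deck value for each one (not always the same value), so fillna with a single value won't do.

Note: I have another question similar to this one, but broader, involving groupby objects and maximising efficiency, and I didn't get any replies. Any help would be much appreciated!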

Edit: added rows to the dataframe Pclass
1       F             0.470588
1       E             0.294118
1       D             0.235294
2       F             0.461538
2       G             0.307692
2       E             0.230769  

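1 Answer:

Answer 0: (score: 1)

This generates random choices from the deck column of the Pclass dataframe and assigns them to the deck column of the df dataframe (generating as many values as are needed). If you want to do this across different values of the class variable, you can put these commands in a list comprehension. I'd suggest avoiding class as a variable name, since it is the reserved keyword used to define new classes in Python (the code below uses cla instead).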

import numpy as np
import pandas as pd

# Generate data and normalised weights
normweights = np.random.rand(6)
normweights /= normweights.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3],
    "deck": ["C", "B", "D", "E", "A", "T"],
    "weight": normweights
    })

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C", 
            np.nan, np.nan, "E", np.nan, 
            np.nan, np.nan, "G", "C"]
    })

# Find missing locations
missing_locs = np.where(df.deck.isnull() & (df.cla == 3))[0]

# Generate new values
new_vals =  np.random.choice(a = Pclass.deck.values, 
        p = Pclass.weight.values, size = len(missing_locs))

# Assign the new values to the dataframe
# (DataFrame.set_value was removed in pandas 1.0; use .loc with index labels)
df.loc[df.index[missing_locs], 'deck'] = new_vals
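As a sanity check on the sampling step above, the sketch below draws a large number of decks and compares the observed frequencies with the weights. It uses the class 3 weights from the question rather than the random weights generated above; the seed, the sample size, and the names `decks`, `weights`, `draws` and `comparison` are my own choices for illustration. The weights are renormalised first, since rounded values may not sum to exactly 1.

```python
import numpy as np
import pandas as pd

# Class 3 decks and weights as printed in the question
decks = ["C", "B", "D", "E", "A", "T"]
weights = np.array([0.367568, 0.259459, 0.156757, 0.140541, 0.070270, 0.005405])
weights = weights / weights.sum()  # guard against rounding in the printed values

# Seeded generator so the check is repeatable (the seed is arbitrary)
rng = np.random.default_rng(0)
draws = rng.choice(decks, size=100_000, p=weights)

# Observed relative frequency of each deck, in the same order as `decks`
observed = pd.Series(draws).value_counts(normalize=True).reindex(decks, fill_value=0.0)
comparison = pd.DataFrame({"weight": weights, "observed": observed.values}, index=decks)
print(comparison)
```

With this many draws the two columns should agree closely. Each draw is independent, which is why the missing rows end up with different values rather than one repeated fill value.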

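Running over multiple levels of the categorical variable

If you want to run this over all levels of the class variable, you need to make sure you select the matching subset of Pclass (only the class of interest). You can use a list comprehension to find the missing data for each level of the class, like so (I have updated the mock data below):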

# Find missing locations
missing_locs = [np.where(df.deck.isnull() & (df.cla == i))[0] for i in [1,2,3]]

However, I think the code is easier to read if it is written as a loop:

# Generate data and normalised weights
normweights3 = np.random.rand(6)
normweights3 /= normweights3.sum()

normweights2 = np.random.rand(3)
normweights2 /= normweights2.sum()

Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "X", "Y", "Z"],
    "weight": np.concatenate((normweights3, normweights2))
    })

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C", 
            np.nan, np.nan, "E", np.nan, 
            np.nan, np.nan, "G", "C"]
    })

class_levels = [1, 2, 3]
for i in class_levels:

    # Positions of rows in class i whose deck is missing
    missing_locs = np.where(df.deck.isnull() & (df.cla == i))[0]

    if len(missing_locs) > 0:
        subset = Pclass[Pclass.cla == i]

        # Generate new values
        new_vals = np.random.choice(a = subset.deck.values, 
            p = subset.weight.values, size = len(missing_locs))

        # Assign the new values to the dataframe
        # (DataFrame.set_value was removed in pandas 1.0; use .loc with index labels)
        df.loc[df.index[missing_locs], 'deck'] = new_vals
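The same loop can also be driven by the groups present in the weights table instead of a hard-coded list of levels. Below is a minimal sketch of that idea using the weights from the question, including the rows added in the edit. The function name `fill_missing_by_group`, its parameters, and the seed are hypothetical, not part of the original answer. Weights are renormalised within each class because the class 2 values as printed sum to 0.999999, and NumPy's choice rejects probabilities that do not sum to 1.

```python
import numpy as np
import pandas as pd

def fill_missing_by_group(df, weights, key="cla", target="deck", rng=None):
    """Return a copy of df with missing `target` values sampled per `key` group.

    Hypothetical helper: `weights` needs the columns `key`, `target`, 'weight'.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = df.copy()
    for level, subset in weights.groupby(key):
        # Boolean mask of rows in this class that still need a value
        mask = out[target].isna() & (out[key] == level)
        n_missing = int(mask.sum())
        if n_missing == 0:
            continue
        # Renormalise so the probabilities sum to 1 within the class
        p = subset["weight"].to_numpy(dtype=float)
        p = p / p.sum()
        # One independent draw per missing row
        out.loc[mask, target] = rng.choice(subset[target].to_numpy(), size=n_missing, p=p)
    return out

# Weights as printed in the question (original table plus the edit)
Pclass = pd.DataFrame({
    "cla": [3, 3, 3, 3, 3, 3, 1, 1, 1, 2, 2, 2],
    "deck": ["C", "B", "D", "E", "A", "T", "F", "E", "D", "F", "G", "E"],
    "weight": [0.367568, 0.259459, 0.156757, 0.140541, 0.070270, 0.005405,
               0.470588, 0.294118, 0.235294,
               0.461538, 0.307692, 0.230769],
})

df = pd.DataFrame({
    "cla": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1],
    "deck": [np.nan, "C", np.nan, "C", np.nan, np.nan,
             "E", np.nan, np.nan, np.nan, "G", "C"],
})

filled = fill_missing_by_group(df, Pclass, rng=np.random.default_rng(42))
print(filled)
```

Using a boolean mask with .loc avoids converting to positional indices, so it also works when df does not have the default integer index. Rows whose class has no entry in the weights table are left as NaN.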