How do I fill in missing values based on a column in pandas?

Asked: 2017-03-24 02:59:24

Tags: python pandas numpy

I have this DataFrame in pandas:

import pandas as pd

df = pd.DataFrame({
        "n": ["a", "b", "c", "a", "b", "x"],
        "t": [0, 0, 0, 1, 1, 1],
        "v": [10,20,30,40,50,60]
    })

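How can I fill in the missing values so that every value of column t has the same entries in column n? Each t value should contain entries for a, b, c, x, and any entry that is missing should be recorded as NaN: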

   n  t   v
   a  0  10
   b  0  20
   c  0  30
   x  NaN NaN
   a  1  40
   b  1  50
   c  NaN NaN
   x  1  60

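4 answers:

Answer 0 (score: 2)

Plan

  • Get the unique values of column 'n'. We will use these to reindex.
  • Apply f to each group of column 't'. Reindexing with idx ensures that we get every element of idx for each unique 't'.
  • Set the index to 'n' first so that we can reindex on it.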
idx = df.n.unique()
f = lambda x: x.reindex(idx)
df.set_index('n').groupby('t', group_keys=False).apply(f).reset_index()

   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0
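
The same reindex-per-group idea can also be sketched without groupby.apply, whose treatment of the grouping column can differ between pandas versions. This is a minimal sketch, not the answer's own code; the names parts and out are illustrative, and the groups are assumed to have no repeated 'n' values.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# Every label of 'n' that each 't' group should contain.
idx = pd.Index(df["n"].unique(), name="n")

# Reindex each group on 'n'; labels missing from a group become all-NaN rows.
parts = [group.set_index("n").reindex(idx) for _, group in df.groupby("t")]

# Stitch the groups back together and restore 'n' as a regular column.
out = pd.concat(parts).rename_axis("n").reset_index()
print(out)
```

Iterating over the groups keeps 't' as an ordinary column inside each piece, so the rows added by the reindex end up with NaN in both 't' and 'v', as in the output above.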

Answer 1 (score: 1)

If df contains no NaN values beforehand, you can build a MultiIndex with MultiIndex.from_product, reindex against it, and then set NaN in column t wherever column v is NaN:

cols = ["n", "t"]
df1 = df.set_index(cols)
mux = pd.MultiIndex.from_product(df1.index.levels, names=cols)
df1 = df1.reindex(mux).sort_index(level=[1,0]).reset_index()
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0

Another way to add the NaN rows is with unstack and stack:

cols = ["n", "t"]
df1 = df.set_index(cols)['v'].unstack().stack(dropna=False)
df1 = df1.sort_index(level=[1,0]).reset_index(name='v')
df1['t'] = df1['t'].mask(df1['v'].isnull())
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0  60.0

But if v already contains some NaN values, use groupby and select the unique values of column n within each group:

df = pd.DataFrame({"n": ["a", "b", "c", "a", "b", "x"],
                   "t": [0, 0, 0, 1, 1, 1],
                   "v": [10,20,30,40,50,np.nan]})
print (df)
   n  t     v
0  a  0  10.0
1  b  0  20.0
2  c  0  30.0
3  a  1  40.0
4  b  1  50.0
5  x  1   NaN

# reindex is used instead of .loc, which raises KeyError on missing labels in pandas >= 1.0
df1 = (df.set_index('n')
         .groupby('t', group_keys=False)
         .apply(lambda x: x.reindex(df.n.unique()))
         .reset_index())
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN

Or, equivalently, setting the index inside each group:

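# Parentheses make the multi-line method chain valid Python;
# reindex replaces .loc, which raises KeyError on missing labels in pandas >= 1.0
df1 = (df.groupby('t', group_keys=False)
         .apply(lambda x: x.set_index('n').reindex(df.n.unique()))
         .reset_index())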
print (df1)
   n    t     v
0  a  0.0  10.0
1  b  0.0  20.0
2  c  0.0  30.0
3  x  NaN   NaN
4  a  1.0  40.0
5  b  1.0  50.0
6  c  NaN   NaN
7  x  1.0   NaN
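
Masking t by v cannot tell a row that the reindex added from a row whose v was already NaN. One possible variant, sketched here as an assumption rather than as part of the answer, keeps the MultiIndex reindex but masks t according to whether the (n, t) pair existed in the original frame. The names indexed, present, and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, np.nan],  # the last v is NaN to begin with
})

cols = ["n", "t"]
indexed = df.set_index(cols)

# Every (n, t) combination, ordered by t and then by n.
mux = pd.MultiIndex.from_product(indexed.index.levels, names=cols)
out = indexed.reindex(mux).sort_index(level=[1, 0]).reset_index()

# True where the (n, t) pair was present in the original data.
present = pd.MultiIndex.from_frame(out[cols]).isin(indexed.index.tolist())

# Blank out t only on the rows that the reindex created.
out["t"] = out["t"].where(present)
print(out)
```

With this data the row for x at t = 1 keeps its t value even though its v is NaN, which matches the groupby result above.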

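Answer 2 (score: 1)

As I understand it, you want every value in "n" to appear evenly across the subgroups formed by grouping on "t". I am also assuming that those "n" values are not repeated within a subgroup.

If those assumptions hold, pd.pivot_table looks like a good fit for this use case. Here the values under "n" become the column names, "t" becomes the grouped index, and the contents of the DF are filled with the values under "v". The DF is then stacked while keeping the NaN entries, and the .loc accessor is used to set the corresponding cells in "t" to NaN.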

df1 = pd.pivot_table(df, "v", "t", "n", "first").stack(dropna=False).reset_index(name="v")
df1.loc[df1['v'].isnull(), "t"] = np.nan
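
To make the intermediate step visible, here is a minimal sketch that prints the wide table before reshaping it. It swaps stack for melt to get back to long form, which is a substitution of mine and not the answer's method; the names wide and long_df are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "n": ["a", "b", "c", "a", "b", "x"],
    "t": [0, 0, 0, 1, 1, 1],
    "v": [10, 20, 30, 40, 50, 60],
})

# One row per t, one column per n; absent (t, n) pairs show up as NaN.
wide = pd.pivot_table(df, values="v", index="t", columns="n", aggfunc="first")
print(wide)

# Back to long form; melt keeps the NaN cells.
long_df = wide.reset_index().melt(id_vars="t", var_name="n", value_name="v")
long_df = long_df.sort_values(["t", "n"]).reset_index(drop=True)

# Blank out t on the rows that had no value in the original data.
long_df["t"] = long_df["t"].mask(long_df["v"].isna())
print(long_df[["n", "t", "v"]])
```

As with the answer's code, this relies on v having no NaN values of its own, since t is masked wherever v is NaN.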


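Answer 3 (score: 0)

It seems you have this the wrong way round. Normally NaN values are either read in automatically or you specify them. If you have import numpy as np at the top, you can enter NaN values manually as np.nan. Alternatively, pandas stores numpy internally, so in older pandas versions you could also get a NaN through pandas.np.nan (that alias has since been deprecated).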
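
The three routes mentioned here can be sketched as follows. The CSV snippets and the "-" marker are made up for illustration; the names auto, custom, and manual are likewise hypothetical.

```python
import io

import numpy as np
import pandas as pd

# An empty field is read in as NaN automatically.
auto = pd.read_csv(io.StringIO("n,v\na,10\nx,\n"))

# A custom marker (here "-", chosen for illustration) can be declared as missing.
custom = pd.read_csv(io.StringIO("n,v\na,10\nx,-\n"), na_values=["-"])

# NaN can also be entered by hand with np.nan.
manual = pd.DataFrame({"n": ["a", "x"], "v": [10, np.nan]})

print(auto)
print(custom)
print(manual)
```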