Pandas DataFrame: insert/fill missing rows from previous dates

Time: 2016-10-17 19:13:59

Tags: python pandas dataframe

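I have a DataFrame consisting of a date, other columns and a numerical value, where some value combinations in the "other columns" may be missing, and I want to populate them from prior dates.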

Example: say the DataFrame looks like the one below. You can see that on 2016-01-01 we have data for (LN, A), (LN, B), (NY, A) and (NY, B) in the (location, band) columns.

        date  location  band  value
0 2016-01-01        LN     A   10.0
1 2016-01-01        LN     B    5.0
2 2016-01-01        NY     A    9.0
3 2016-01-01        NY     B    6.0
4 2016-01-02        LN     A   11.0
5 2016-01-02        NY     B    7.0
6 2016-01-03        NY     A   10.0
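
For anyone who wants to experiment, the table above can be rebuilt roughly like this. It is only a convenience sketch; parsing the dates with pd.to_datetime is my assumption, since the question does not say whether the date column holds strings or timestamps.

```python
import pandas as pd

# Rebuild the example table from the question.
# Assumption: dates are parsed to timestamps; plain strings would work as well.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01"] * 4 + ["2016-01-02"] * 2 + ["2016-01-03"]),
    "location": ["LN", "LN", "NY", "NY", "LN", "NY", "NY"],
    "band": ["A", "B", "A", "B", "A", "B", "A"],
    "value": [10.0, 5.0, 9.0, 6.0, 11.0, 7.0, 10.0],
})
print(df)
```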

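Then you notice that on 2016-01-02 we only have (LN, A) and (NY, B), while (LN, B) and (NY, A) are missing. Likewise, on 2016-01-03 only (NY, A) is available; the other three combinations are all missing.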
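
The gaps described here can also be listed programmatically. The following is a small sketch, assuming the dates are kept as plain strings; the names full_idx, present and missing are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-01-01"] * 4 + ["2016-01-02"] * 2 + ["2016-01-03"],
    "location": ["LN", "LN", "NY", "NY", "LN", "NY", "NY"],
    "band": ["A", "B", "A", "B", "A", "B", "A"],
    "value": [10.0, 5.0, 9.0, 6.0, 11.0, 7.0, 10.0],
})

# Every (date, location, band) combination that could exist.
levels = ["date", "location", "band"]
full_idx = pd.MultiIndex.from_product(
    [df[col].unique() for col in levels], names=levels
)

# Combinations from the full product that have no row in df.
present = pd.MultiIndex.from_frame(df[levels])
missing = full_idx.difference(present)
print(missing.to_frame(index=False))
```

For the example data this reports five combinations: two on 2016-01-02 and three on 2016-01-03.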

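What I want to do is populate the missing combinations for each date from its predecessor. For 2016-01-02, say, I would like to add two more rows "rolled over" from 2016-01-01: (LN, B, 5.0) and (NY, A, 9.0) for the columns (location, band, value). The same goes for 2016-01-03. The whole thing should then look like this: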

        date  location  band  value
 0 2016-01-01        LN     A   10.0
 1 2016-01-01        LN     B    5.0
 2 2016-01-01        NY     A    9.0
 3 2016-01-01        NY     B    6.0
 4 2016-01-02        LN     A   11.0
 5 2016-01-02        NY     B    7.0
 6 2016-01-03        NY     A   10.0
 7 2016-01-02        LN     B    5.0
 8 2016-01-02        NY     A    9.0
 9 2016-01-03        LN     A   11.0
10 2016-01-03        LN     B    5.0
11 2016-01-03        NY     B    7.0

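Note that rows 7-11 are populated from rows 1, 2, 4, 7 and 5, respectively. The order does not matter, because I can always sort afterwards as long as all the data I need is present.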

Can anyone help? Thanks a lot!

2 answers:

Answer 0: (score: 1)

You can use an unstack/stack approach to expose all the missing values, and then forward fill:

# Use unstack/stack to add missing locations.
df = df.set_index(['date', 'location', 'band']) \
       .unstack(level=['location', 'band']) \
       .stack(level=['location', 'band'], dropna=False)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

Alternatively, you can build a MultiIndex containing all combinations directly:

# Build the full MultiIndex, set the partial MultiIndex, and reindex.
levels = ['date', 'location', 'band']
full_idx = pd.MultiIndex.from_product([df[col].unique() for col in levels], names=levels)
df = df.set_index(levels).reindex(full_idx)

# Forward fill NaN values within ['location', 'band'] groups.
df = df.groupby(level=['location', 'band']).ffill().reset_index()

The resulting output for either method:

         date location band  value
0  2016-01-01       LN    A   10.0
1  2016-01-01       LN    B    5.0
2  2016-01-01       NY    A    9.0
3  2016-01-01       NY    B    6.0
4  2016-01-02       LN    A   11.0
5  2016-01-02       LN    B    5.0
6  2016-01-02       NY    A    9.0
7  2016-01-02       NY    B    7.0
8  2016-01-03       LN    A   11.0
9  2016-01-03       LN    B    5.0
10 2016-01-03       NY    A   10.0
11 2016-01-03       NY    B    7.0
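
For reuse, the reindex approach can be wrapped in a helper. This is a sketch rather than the code from the answer: the function name fill_missing_combinations is hypothetical, and sorting the unique values of each level is my addition so that the forward fill always runs in date order.

```python
import pandas as pd

def fill_missing_combinations(df, levels=("date", "location", "band")):
    """Add every missing combination of `levels`, carrying values forward by date.

    Hypothetical helper; the first entry of `levels` is treated as the date.
    """
    levels = list(levels)
    # Full product of the (sorted) unique values of each level.
    full_idx = pd.MultiIndex.from_product(
        [sorted(df[col].unique()) for col in levels], names=levels
    )
    # Missing combinations appear as NaN rows after reindexing.
    out = df.set_index(levels).reindex(full_idx)
    # Forward fill within each (location, band) group, in date order.
    out = out.groupby(level=levels[1:]).ffill()
    return out.reset_index()

df = pd.DataFrame({
    "date": ["2016-01-01"] * 4 + ["2016-01-02"] * 2 + ["2016-01-03"],
    "location": ["LN", "LN", "NY", "NY", "LN", "NY", "NY"],
    "band": ["A", "B", "A", "B", "A", "B", "A"],
    "value": [10.0, 5.0, 9.0, 6.0, 11.0, 7.0, 10.0],
})
result = fill_missing_combinations(df)
print(result)
```

One thing to keep in mind: a combination that is missing on the very first date has nothing to fill from, so it stays NaN.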

Answer 1: (score: 0)

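My solution, in summary: use a product operation to get all combinations in a MultiIndex, then some stacking and ffill().

Start by reindexing on the full product and unstacking: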

import pandas as pd

df = pd.DataFrame({'date': {0: '2016-01-01', 1: '2016-01-01', 2: '2016-01-01', 3: '2016-01-01', 4: '2016-01-02', 5: '2016-01-02', 6: '2016-01-03'}, 'band': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A'}, 'location': {0: 'LN', 1: 'LN', 2: 'NY', 3: 'NY', 4: 'LN', 5: 'NY', 6: 'NY'}, 'value': {0: 10, 1: 5, 2: 9, 3: 6, 4: 11, 5: 7, 6: 10}})
# Collect the dates before they move into the index.
unique_dates = df['date'].unique()
df.set_index(['date', 'location', 'band'], inplace=True)
# Full product of every date with every (location, band) pair.
idx = pd.MultiIndex.from_product([unique_dates, ['LN', 'NY'], ['A', 'B']])
# Reindexing adds the missing combinations as NaN rows.
df = df.reindex(idx)
# Move band and location into the columns, leaving the dates as rows.
df = df.unstack(level=[2, 1])

This gives the following wide table, where the missing combinations show up as nan:

             value                      
                 A      B       A      B
                LN     LN      NY     NY
2016-01-01 10.0000 5.0000  9.0000 6.0000
2016-01-02 11.0000    nan     nan 7.0000
2016-01-03     nan    nan 10.0000    nan
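
The summary of this answer mentions a forward fill and stacking, but the code for that last step does not appear here. A minimal sketch of what it might look like follows; it is my reconstruction, not the answerer's code. The names wide, filled and result are illustrative, and naming the index levels is an assumption made so that band and location can be referred to by name.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2016-01-01"] * 4 + ["2016-01-02"] * 2 + ["2016-01-03"],
    "location": ["LN", "LN", "NY", "NY", "LN", "NY", "NY"],
    "band": ["A", "B", "A", "B", "A", "B", "A"],
    "value": [10.0, 5.0, 9.0, 6.0, 11.0, 7.0, 10.0],
})

# Same wide table as above, but with named index levels (an assumption).
levels = ["date", "location", "band"]
idx = pd.MultiIndex.from_product(
    [df["date"].unique(), ["LN", "NY"], ["A", "B"]], names=levels
)
wide = df.set_index(levels).reindex(idx).unstack(level=["band", "location"])

# Forward fill down the date rows so each gap takes the previous date's value.
filled = wide.ffill()

# Move band and location back into rows and restore a flat, sorted table.
result = (
    filled.stack(level=["band", "location"])
    .reset_index()
    .sort_values(levels)
    .reset_index(drop=True)
)
print(result)
```

Filling in the wide layout works because each column holds exactly one (band, location) pair, so a plain ffill() down the rows never mixes values between combinations.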