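Pandas: merging resample results back gives missing rows

Time: 2018-07-09 09:08:57

Tags: python pandas date dataframe

Note: yesterday I asked a poorer version of this question, which I quickly deleted. However, @FlorianGD left a comment that gave me the answer I needed, so I am posting the question again anyway, together with his suggested solution.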

1. Preparing the starting dataframe

I have some dates:

date_dict = {0: '1/31/2010',
 1: '12/15/2009',
 2: '3/19/2010',
 3: '10/25/2009',
 4: '1/17/2009',
 5: '9/4/2009',
 6: '2/21/2010',
 7: '8/30/2009',
 8: '1/31/2010',
 9: '11/30/2008',
 10: '2/8/2009',
 11: '4/9/2010',
 12: '9/13/2009',
 13: '10/19/2009',
 14: '1/24/2010',
 15: '3/8/2009',
 16: '11/30/2008',
 17: '7/30/2009',
 18: '12/12/2009',
 19: '3/8/2009',
 20: '6/18/2010',
 21: '11/30/2008',
 22: '12/30/2009',
 23: '10/28/2009',
 24: '1/28/2010'}

Convert them to a dataframe and to datetime format:

import pandas as pd
from datetime import datetime
df = pd.DataFrame(list(date_dict.items()), columns=['Ind', 'Game_date'])
df['Date'] = df['Game_date'].apply(lambda x: datetime.strptime(x.strip(), "%m/%d/%Y"))
df.sort_values(by='Date', inplace=True)
df.reset_index(drop=True, inplace=True)
del df['Ind'], df['Game_date']
df['Count'] = 1

df
         Date  Count
0  2008-11-30      1
1  2008-11-30      1
2  2008-11-30      1
3  2009-01-17      1
4  2009-02-08      1
5  2009-03-08      1
6  2009-03-08      1
7  2009-07-30      1
8  2009-08-30      1
9  2009-09-04      1
10 2009-09-13      1
11 2009-10-19      1
12 2009-10-25      1
13 2009-10-28      1
14 2009-12-12      1
15 2009-12-15      1
16 2009-12-30      1
17 2010-01-24      1
18 2010-01-28      1
19 2010-01-31      1
20 2010-01-31      1
21 2010-02-21      1
22 2010-03-19      1
23 2010-04-09      1
24 2010-06-18      1
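As an aside, the string-to-datetime step can probably be written without a row-wise `apply`. The following is only a sketch on a three-date sample (the `sample` and `parsed` names are made up for illustration), assuming all strings follow the month/day/year pattern shown above:

```python
import pandas as pd

# A few of the date strings from above; the full date_dict works the same way.
sample = pd.Series(['1/31/2010', '12/15/2009', '9/4/2009'])

# Vectorized parsing with an explicit month/day/year format.
parsed = pd.to_datetime(sample.str.strip(), format='%m/%d/%Y')

# Sort chronologically and renumber, as the original code does.
print(parsed.sort_values().reset_index(drop=True))
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first strings.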

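What I want to do now is resample this dataframe to group the rows into weeks, and then bring that information back into the original dataframe.

2. Using resample() to group by week and return counts

I resample into weeks ending on each Tuesday: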

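# min_count=1 leaves empty weeks as NaN; pandas >= 0.22 would otherwise sum them to 0
# and dropna() below would not remove them.
c_index = df.set_index('Date', drop=True).resample('1W-TUE')['Count'].sum(min_count=1).reset_index()
# Keep only the weeks that actually contain rows.
c_index.dropna(subset=['Count'], axis=0, inplace=True)
c_index = c_index.reset_index(drop=True)
# 1-based running number of the non-empty weeks.
c_index['Index_Col'] = c_index.index + 1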

c_index
         Date  Count  Index_Col
0  2008-12-02    3.0          1
1  2009-01-20    1.0          2
2  2009-02-10    1.0          3
3  2009-03-10    2.0          4
4  2009-08-04    1.0          5
5  2009-09-01    1.0          6
6  2009-09-08    1.0          7
7  2009-09-15    1.0          8
8  2009-10-20    1.0          9
9  2009-10-27    1.0         10
10 2009-11-03    1.0         11
11 2009-12-15    2.0         12
12 2010-01-05    1.0         13
13 2010-01-26    1.0         14
14 2010-02-02    3.0         15
15 2010-02-23    1.0         16
16 2010-03-23    1.0         17
17 2010-04-13    1.0         18
18 2010-06-22    1.0         19

This shows, for each week in c_index, the number of rows in df that fall into it. For example, the week ending 2008-12-02 contains 3 rows.
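To make the week labels concrete, here is a small sketch that computes the week-ending Tuesday for a few of the dates directly. It uses weekly periods anchored on Tuesday instead of `resample`, and the `dates` and `week_end` names are only illustrative:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2008-11-30', '2009-12-15', '2009-12-30']))

# 'W-TUE' periods are weeks that end on a Tuesday; end_time is the last instant
# of that week, and normalize() trims it back to midnight of that Tuesday.
week_end = dates.dt.to_period('W-TUE').dt.end_time.dt.normalize()

print(pd.DataFrame({'Date': dates, 'Week_end': week_end}))
```

A date that is itself a Tuesday (2009-12-15) is labelled with its own date, while every other date is labelled with the following Tuesday. That matches the labels in c_index above.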

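3. Broadcasting the information back to the original df

Now I want to merge these columns back into the original df, essentially broadcasting the grouped data onto the individual rows.

This should give: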

    Date        Count_Raw       Count_Total     Index_Col
0   2008-11-30          1           3           1
1   2008-11-30          1           3           1
2   2008-11-30          1           3           1
3   2009-01-17          1           1           2
4   2009-02-08          1           1           3
5   2009-03-08          1           2           4
6   2009-03-08          1           2           4
7   2009-07-30          1           1           5
8   2009-08-30          1           1           6
9   2009-09-04          1           1           7
10  2009-09-13          1           1           8
11  2009-10-19          1           1           9
12  2009-10-25          1           1           10
13  2009-10-28          1           1           11
14  2009-12-12          1           2           12
15  2009-12-15          1           2           12
16  2009-12-30          1           1           13
17  2010-01-24          1           1           14
18  2010-01-28          1           3           15
19  2010-01-31          1           3           15
20  2010-01-31          1           3           15
21  2010-02-21          1           1           16
22  2010-03-19          1           1           17
23  2010-04-09          1           1           18
24  2010-06-18          1           1           19

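So Count_Total is the total number of rows in the group, and Index_Col tracks the order of the groups.

For example, in this case the group information for 2010-02-02 has been assigned to the rows 2010-01-28, 2010-01-31 and 2010-01-31.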
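To pin down what the two target columns mean, the sketch below computes them directly on a five-row sample by grouping on a week label. This is not the merge being asked about, only an illustration of the intended result; the `small` and `week` names are hypothetical:

```python
import pandas as pd

dates = pd.to_datetime(['2008-11-30', '2008-11-30', '2009-01-17',
                        '2009-12-12', '2009-12-15'])
small = pd.DataFrame({'Date': dates, 'Count': 1})

# Label each row with its week (weeks ending on a Tuesday).
week = small['Date'].dt.to_period('W-TUE')

# Per-row group total, and a 1-based counter over the weeks that actually occur.
small['Count_Total'] = small.groupby(week)['Count'].transform('sum')
small['Index_Col'] = small.groupby(week).ngroup() + 1

print(small)
```

Here `transform('sum')` returns one value per original row, which is the "broadcast" behaviour described above.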

To do this, I tried the following:

Failed attempt

df.merge(c_index, on='Date', how='left', suffixes=('_Raw', '_Total'))
         Date  Count_Raw  Count_Total  Index_Col
0  2008-11-30          1          NaN        NaN
1  2008-11-30          1          NaN        NaN
2  2008-11-30          1          NaN        NaN
3  2009-01-17          1          NaN        NaN
4  2009-02-08          1          NaN        NaN
5  2009-03-08          1          NaN        NaN
6  2009-03-08          1          NaN        NaN
7  2009-07-30          1          NaN        NaN
8  2009-08-30          1          NaN        NaN
9  2009-09-04          1          NaN        NaN
10 2009-09-13          1          NaN        NaN
11 2009-10-19          1          NaN        NaN
12 2009-10-25          1          NaN        NaN
13 2009-10-28          1          NaN        NaN
14 2009-12-12          1          NaN        NaN
15 2009-12-15          1          2.0       12.0
16 2009-12-30          1          NaN        NaN
17 2010-01-24          1          NaN        NaN
18 2010-01-28          1          NaN        NaN
19 2010-01-31          1          NaN        NaN
20 2010-01-31          1          NaN        NaN
21 2010-02-21          1          NaN        NaN
22 2010-03-19          1          NaN        NaN
23 2010-04-09          1          NaN        NaN
24 2010-06-18          1          NaN        NaN

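Why it fails: the two dataframes are only joined where a date in c_index also appears in df. In this example, the only week that contributes any information is 2009-12-15, because that is the only date the two dataframes have in common.

How can I merge better to get what I want?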
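The exact-match behaviour can be seen on a two-row sample. This is a minimal sketch with made-up frame names (`left`, `right`, `exact`), using the `indicator` option of `merge` to show which keys were found on both sides:

```python
import pandas as pd

left = pd.DataFrame({'Date': pd.to_datetime(['2009-12-12', '2009-12-15']),
                     'Count': [1, 1]})
right = pd.DataFrame({'Date': pd.to_datetime(['2009-12-15']),
                      'Count': [2.0]})

# indicator=True adds a _merge column saying where each row's key was found.
exact = left.merge(right, on='Date', how='left',
                   suffixes=('_Raw', '_Total'), indicator=True)

print(exact)
```

The 2009-12-12 row is marked `left_only` and gets NaN, even though it belongs to the week ending 2009-12-15, because an ordinary merge compares keys for equality only.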

1 answer:

Answer 0 (score: 0)

As @FlorianGD suggested, this can be achieved with pandas.merge_asof and the direction='forward' argument:

pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'), direction='forward')

         Date  Count_Raw  Count_Total  Index_Col
0  2008-11-30          1          3.0          1
1  2008-11-30          1          3.0          1
2  2008-11-30          1          3.0          1
3  2009-01-17          1          1.0          2
4  2009-02-08          1          1.0          3
5  2009-03-08          1          2.0          4
6  2009-03-08          1          2.0          4
7  2009-07-30          1          1.0          5
8  2009-08-30          1          1.0          6
9  2009-09-04          1          1.0          7
10 2009-09-13          1          1.0          8
11 2009-10-19          1          1.0          9
12 2009-10-25          1          1.0         10
13 2009-10-28          1          1.0         11
14 2009-12-12          1          2.0         12
15 2009-12-15          1          2.0         12
16 2009-12-30          1          1.0         13
17 2010-01-24          1          1.0         14
18 2010-01-28          1          3.0         15
19 2010-01-31          1          3.0         15
20 2010-01-31          1          3.0         15
21 2010-02-21          1          1.0         16
22 2010-03-19          1          1.0         17
23 2010-04-09          1          1.0         18
24 2010-06-18          1          1.0         19
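To see how the forward match picks a week, here is a small sketch on three rows and two weeks. The `rows`, `weeks` and `matched` names and the extra `Week_end` column are only for illustration:

```python
import pandas as pd

rows = pd.DataFrame({'Date': pd.to_datetime(['2010-01-24', '2010-01-28', '2010-01-31'])})
weeks = pd.DataFrame({'Date': pd.to_datetime(['2010-01-26', '2010-02-02']),
                      'Index_Col': [14, 15]})

# Copy the week-ending date so it is visible which right-hand row was matched.
weeks['Week_end'] = weeks['Date']

# Both frames must be sorted on the key. direction='forward' picks the first
# right-hand row whose Date is on or after the left-hand Date.
matched = pd.merge_asof(rows, weeks, on='Date', direction='forward')

print(matched)
```

The default is `direction='backward'`, which would attach each row to the previous week-ending Tuesday, so the argument matters here. With a forward match, a left-hand date later than every date on the right gets NaN. That does not happen in this question, since c_index is built from the same dates.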