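Note: I asked a poorer version of this question yesterday and quickly deleted it. However, @FlorianGD left a comment that gave me the answer I needed, so I am posting it again anyway, together with the solution he suggested.
I have some dates: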
date_dict = {0: '1/31/2010',
1: '12/15/2009',
2: '3/19/2010',
3: '10/25/2009',
4: '1/17/2009',
5: '9/4/2009',
6: '2/21/2010',
7: '8/30/2009',
8: '1/31/2010',
9: '11/30/2008',
10: '2/8/2009',
11: '4/9/2010',
12: '9/13/2009',
13: '10/19/2009',
14: '1/24/2010',
15: '3/8/2009',
16: '11/30/2008',
17: '7/30/2009',
18: '12/12/2009',
19: '3/8/2009',
20: '6/18/2010',
21: '11/30/2008',
22: '12/30/2009',
23: '10/28/2009',
24: '1/28/2010'}
Converted to a DataFrame and parsed into datetime format:
import pandas as pd
from datetime import datetime
df = pd.DataFrame(list(date_dict.items()), columns=['Ind', 'Game_date'])
df['Date'] = df['Game_date'].apply(lambda x: datetime.strptime(x.strip(), "%m/%d/%Y"))
df.sort_values(by='Date', inplace=True)
df.reset_index(drop=True, inplace=True)
del df['Ind'], df['Game_date']
df['Count'] = 1
df
Date
0 2008-11-30
1 2008-11-30
2 2008-11-30
3 2009-01-17
4 2009-02-08
5 2009-03-08
6 2009-03-08
7 2009-07-30
8 2009-08-30
9 2009-09-04
10 2009-09-13
11 2009-10-19
12 2009-10-25
13 2009-10-28
14 2009-12-12
15 2009-12-15
16 2009-12-30
17 2010-01-24
18 2010-01-28
19 2010-01-31
20 2010-01-31
21 2010-02-21
22 2010-03-19
23 2010-04-09
24 2010-06-18
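What I want to do now is resample this DataFrame to group the rows into weeks, and then bring that information back to the original DataFrame.
Using resample() to group weekly and return the counts
I resample by week, with each week ending on a Tuesday: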
c_index = df.set_index('Date', drop=True).resample('1W-TUE').sum()['Count'].reset_index()
c_index.dropna(subset=['Count'], axis=0, inplace=True)
c_index = c_index.reset_index(drop=True)
c_index['Index_Col'] = c_index.index + 1
c_index
Date Count Index_Col
0 2008-12-02 3.0 1
1 2009-01-20 1.0 2
2 2009-02-10 1.0 3
3 2009-03-10 2.0 4
4 2009-08-04 1.0 5
5 2009-09-01 1.0 6
6 2009-09-08 1.0 7
7 2009-09-15 1.0 8
8 2009-10-20 1.0 9
9 2009-10-27 1.0 10
10 2009-11-03 1.0 11
11 2009-12-15 2.0 12
12 2010-01-05 1.0 13
13 2010-01-26 1.0 14
14 2010-02-02 3.0 15
15 2010-02-23 1.0 16
16 2010-03-23 1.0 17
17 2010-04-13 1.0 18
18 2010-06-22 1.0 19
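A caveat on the resampling step: the dropna() call only removes empty weeks if their sum comes back as NaN. On recent pandas versions, as far as I know, the sum of an empty bin is 0 rather than NaN, so empty weeks would survive with Count 0. A minimal sketch of one way around that, assuming a pandas version where sum() accepts min_count (the four-row frame here is a shortened stand-in for df, not the full data):

```python
import pandas as pd

# Shortened stand-in for df: the first four dates from the question.
dates = pd.to_datetime(['2008-11-30', '2008-11-30', '2008-11-30', '2009-01-17'])
df = pd.DataFrame({'Date': dates, 'Count': 1})

# min_count=1 makes weeks with no rows sum to NaN instead of 0,
# so that dropna() can remove them.
weekly = df.set_index('Date')['Count'].resample('W-TUE').sum(min_count=1)

c_index = weekly.dropna().reset_index()
c_index['Index_Col'] = c_index.index + 1
print(c_index)
```

Only the two weeks that actually contain rows should remain, labelled by the Tuesday that closes each week.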
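In c_index, this shows the number of rows in df that fall in each week; for example, the week ending 2008-12-02 contains 3 rows of df.
Now I want to merge these columns back onto the original df, essentially broadcasting the grouped data onto the individual rows.
This should give: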
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3 1
1 2008-11-30 1 3 1
2 2008-11-30 1 3 1
3 2009-01-17 1 1 2
4 2009-02-08 1 1 3
5 2009-03-08 1 2 4
6 2009-03-08 1 2 4
7 2009-07-30 1 1 5
8 2009-08-30 1 1 6
9 2009-09-04 1 1 7
10 2009-09-13 1 1 8
11 2009-10-19 1 1 9
12 2009-10-25 1 1 10
13 2009-10-28 1 1 11
14 2009-12-12 1 2 12
15 2009-12-15 1 2 12
16 2009-12-30 1 1 13
17 2010-01-24 1 1 14
18 2010-01-28 1 3 15
19 2010-01-31 1 3 15
20 2010-01-31 1 3 15
21 2010-02-21 1 1 16
22 2010-03-19 1 1 17
23 2010-04-09 1 1 18
24 2010-06-18 1 1 19
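So Count_Total is the total number of rows in that group, and Index_Col tracks the order of the groups.
For example, in this case the group information for 2010-02-02 has been assigned to 2010-01-28, 2010-01-31, and 2010-01-31.
To do this, I tried the following:
Failed attempt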
df.merge(c_index, on='Date', how='left', suffixes=('_Raw', '_Total'))
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 NaN NaN
1 2008-11-30 1 NaN NaN
2 2008-11-30 1 NaN NaN
3 2009-01-17 1 NaN NaN
4 2009-02-08 1 NaN NaN
5 2009-03-08 1 NaN NaN
6 2009-03-08 1 NaN NaN
7 2009-07-30 1 NaN NaN
8 2009-08-30 1 NaN NaN
9 2009-09-04 1 NaN NaN
10 2009-09-13 1 NaN NaN
11 2009-10-19 1 NaN NaN
12 2009-10-25 1 NaN NaN
13 2009-10-28 1 NaN NaN
14 2009-12-12 1 NaN NaN
15 2009-12-15 1 2.0 12.0
16 2009-12-30 1 NaN NaN
17 2010-01-24 1 NaN NaN
18 2010-01-28 1 NaN NaN
19 2010-01-31 1 NaN NaN
20 2010-01-31 1 NaN NaN
21 2010-02-21 1 NaN NaN
22 2010-03-19 1 NaN NaN
23 2010-04-09 1 NaN NaN
24 2010-06-18 1 NaN NaN
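Why it fails: merge only matches rows where a date in c_index also appears in df. In this example, the only week that adds information is 2009-12-15, because that is the only date common to both DataFrames.
How can I do a better merge to get what I'm after?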
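The exact-match behaviour can be checked directly with merge's indicator argument. This is a small sketch on a three-row extract of the data (the names left, right and merged are mine, not from the question):

```python
import pandas as pd

# Three game dates from df, and the week-ending labels they fall under.
left = pd.DataFrame({'Date': pd.to_datetime(['2009-12-12', '2009-12-15', '2009-12-30']),
                     'Count': 1})
right = pd.DataFrame({'Date': pd.to_datetime(['2009-12-15', '2010-01-05']),
                      'Count': [2.0, 1.0],
                      'Index_Col': [12, 13]})

# indicator=True adds a _merge column saying where each key was found.
merged = left.merge(right, on='Date', how='left',
                    suffixes=('_Raw', '_Total'), indicator=True)
print(merged[['Date', 'Count_Total', '_merge']])
```

Only 2009-12-15 is marked as found in both frames. The other two dates belong to those weeks but do not equal the Tuesday label, so an equality join leaves them as NaN.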
Answer 0 (score: 0)
As @FlorianGD suggested, this can be done with pandas.merge_asof and the direction='forward' argument:
pd.merge_asof(left=df, right=c_index, on='Date', suffixes=('_Raw', '_Total'), direction='forward')
Date Count_Raw Count_Total Index_Col
0 2008-11-30 1 3.0 1
1 2008-11-30 1 3.0 1
2 2008-11-30 1 3.0 1
3 2009-01-17 1 1.0 2
4 2009-02-08 1 1.0 3
5 2009-03-08 1 2.0 4
6 2009-03-08 1 2.0 4
7 2009-07-30 1 1.0 5
8 2009-08-30 1 1.0 6
9 2009-09-04 1 1.0 7
10 2009-09-13 1 1.0 8
11 2009-10-19 1 1.0 9
12 2009-10-25 1 1.0 10
13 2009-10-28 1 1.0 11
14 2009-12-12 1 2.0 12
15 2009-12-15 1 2.0 12
16 2009-12-30 1 1.0 13
17 2010-01-24 1 1.0 14
18 2010-01-28 1 3.0 15
19 2010-01-31 1 3.0 15
20 2010-01-31 1 3.0 15
21 2010-02-21 1 1.0 16
22 2010-03-19 1 1.0 17
23 2010-04-09 1 1.0 18
24 2010-06-18 1 1.0 19
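Note that merge_asof expects both frames to be sorted on the key, which holds here because df was sorted by Date. As an alternative that skips the intermediate c_index frame, the weekly totals can probably be broadcast with groupby and transform instead. This is a sketch of a different technique, not the method from the answer above; it assumes that the 'W-TUE' period (Wednesday through Tuesday) lines up with the bins produced by resample('1W-TUE'), and the variable name week is mine:

```python
import pandas as pd

# The 25 sorted dates from the question.
dates = pd.to_datetime([
    '2008-11-30', '2008-11-30', '2008-11-30', '2009-01-17', '2009-02-08',
    '2009-03-08', '2009-03-08', '2009-07-30', '2009-08-30', '2009-09-04',
    '2009-09-13', '2009-10-19', '2009-10-25', '2009-10-28', '2009-12-12',
    '2009-12-15', '2009-12-30', '2010-01-24', '2010-01-28', '2010-01-31',
    '2010-01-31', '2010-02-21', '2010-03-19', '2010-04-09', '2010-06-18',
])
df = pd.DataFrame({'Date': dates, 'Count': 1})

# Map every date to the week (ending on Tuesday) that contains it.
week = df['Date'].dt.to_period('W-TUE')

# transform('sum') returns one value per original row: the total for its week.
df['Count_Total'] = df.groupby(week)['Count'].transform('sum')

# ngroup() numbers the observed weeks in sorted order, starting at 0.
df['Index_Col'] = df.groupby(week).ngroup() + 1

df = df.rename(columns={'Count': 'Count_Raw'})
print(df)
```

One side effect of this route is that Count_Total stays an integer column, since no NaN values are ever introduced.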