How to group pandas DataFrame by varying dates?

Time: 2016-10-19 13:38:33

Tags: python pandas numpy

I am trying to roll up daily data into fiscal quarter data. For example, I have a table with fiscal quarter end dates:

Company Period Quarter_End
M       2016Q1 05/02/2015
M       2016Q2 08/01/2015
M       2016Q3 10/31/2015
M       2016Q4 01/30/2016
WFM     2015Q2 04/12/2015
WFM     2015Q3 07/05/2015 
WFM     2015Q4 09/27/2015
WFM     2016Q1 01/17/2016

and a table of daily data:

Company Date       Price
M       06/20/2015 1.05
M       06/22/2015 4.05
M       07/10/2015 3.45
M       07/29/2015 1.86
M       08/24/2015 1.58
M       09/02/2015 8.64
M       09/22/2015 2.56
M       10/20/2015 5.42
M       11/02/2015 1.58
M       11/24/2015 4.58
M       12/03/2015 6.48
M       12/05/2015 4.56
M       01/03/2016 7.14
M       01/30/2016 6.34
WFM     06/20/2015 1.05
WFM     06/22/2015 4.05
WFM     07/10/2015 3.45
WFM     07/29/2015 1.86
WFM     08/24/2015 1.58
WFM     09/02/2015 8.64
WFM     09/22/2015 2.56
WFM     10/20/2015 5.42
WFM     11/02/2015 1.58
WFM     11/24/2015 4.58
WFM     12/03/2015 6.48
WFM     12/05/2015 4.56
WFM     01/03/2016 7.14
WFM     01/17/2016 6.34

And I would like to create the table below.

Company Period  Quarter_end Sum(Price)
M       2016Q2  8/1/2015    10.41
M       2016Q3  10/31/2015  18.2
M       2016Q4  1/30/2016   30.68
WFM     2015Q3  7/5/2015    5.1
WFM     2015Q4  9/27/2015   18.09
WFM     2016Q1  1/17/2016   36.1

However, I don't know how to group by varying dates without looping through each record. Any help is greatly appreciated.

Thanks!

2 answers:

Answer 0 (score: 6)

I think you can use merge_ordered:

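import pandas as pd

# df1 is the quarter table, df2 the daily table
# first convert the date columns to datetime (the sample dates are MM/DD/YYYY)
df1['Quarter_End'] = pd.to_datetime(df1['Quarter_End'], format='%m/%d/%Y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%m/%d/%Y')

# outer merge, ordered by Company and then by date
df = pd.merge_ordered(df1,
                      df2,
                      left_on=['Company', 'Quarter_End'],
                      right_on=['Company', 'Date'],
                      how='outer')
print(df)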
   Company  Period Quarter_End       Date  Price
0        M  2016Q1  2015-05-02        NaT    NaN
1        M     NaN         NaT 2015-06-20   1.05
2        M     NaN         NaT 2015-06-22   4.05
3        M     NaN         NaT 2015-07-10   3.45
4        M     NaN         NaT 2015-07-29   1.86
5        M  2016Q2  2015-08-01        NaT    NaN
6        M     NaN         NaT 2015-08-24   1.58
7        M     NaN         NaT 2015-09-02   8.64
8        M     NaN         NaT 2015-09-22   2.56
9        M     NaN         NaT 2015-10-20   5.42
10       M  2016Q3  2015-10-31        NaT    NaN
11       M     NaN         NaT 2015-11-02   1.58
12       M     NaN         NaT 2015-11-24   4.58
13       M     NaN         NaT 2015-12-03   6.48
14       M     NaN         NaT 2015-12-05   4.56
15       M     NaN         NaT 2016-01-03   7.14
16       M  2016Q4  2016-01-30 2016-01-30   6.34
17     WFM  2015Q2  2015-04-12        NaT    NaN
18     WFM     NaN         NaT 2015-06-20   1.05
19     WFM     NaN         NaT 2015-06-22   4.05
20     WFM  2015Q3  2015-07-05        NaT    NaN
21     WFM     NaN         NaT 2015-07-10   3.45
22     WFM     NaN         NaT 2015-07-29   1.86
23     WFM     NaN         NaT 2015-08-24   1.58
24     WFM     NaN         NaT 2015-09-02   8.64
25     WFM     NaN         NaT 2015-09-22   2.56
26     WFM  2015Q4  2015-09-27        NaT    NaN
27     WFM     NaN         NaT 2015-10-20   5.42
28     WFM     NaN         NaT 2015-11-02   1.58
29     WFM     NaN         NaT 2015-11-24   4.58
30     WFM     NaN         NaT 2015-12-03   6.48
31     WFM     NaN         NaT 2015-12-05   4.56
32     WFM     NaN         NaT 2016-01-03   7.14
33     WFM  2016Q1  2016-01-17 2016-01-17   6.34

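Then back-fill the NaN values in the Period and Quarter_End columns with bfill, group by those columns, and aggregate Price with sum. If you need to remove all NaN values, add Series.dropna and finally reset_index.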
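A minimal self-contained sketch of the whole approach, rebuilding the sample data from the question. The helper names (quarters_txt, dates, prices, last_date) are only for illustration. Two details are assumptions on my part rather than something stated above: the back-fill is done per Company so one company's quarter can never leak into another's rows, and sum(min_count=1) is used because in recent pandas versions the sum of an all-NaN group is 0 rather than NaN, so dropna would otherwise keep the empty quarters.

```python
import io
import pandas as pd

# Quarter-end table from the question.
quarters_txt = """Company Period Quarter_End
M 2016Q1 05/02/2015
M 2016Q2 08/01/2015
M 2016Q3 10/31/2015
M 2016Q4 01/30/2016
WFM 2015Q2 04/12/2015
WFM 2015Q3 07/05/2015
WFM 2015Q4 09/27/2015
WFM 2016Q1 01/17/2016
"""
df1 = pd.read_csv(io.StringIO(quarters_txt), sep=r"\s+")

# Daily table from the question: both companies share the same dates and
# prices, except for the date of the last row.
dates = ["06/20/2015", "06/22/2015", "07/10/2015", "07/29/2015", "08/24/2015",
         "09/02/2015", "09/22/2015", "10/20/2015", "11/02/2015", "11/24/2015",
         "12/03/2015", "12/05/2015", "01/03/2016"]
prices = [1.05, 4.05, 3.45, 1.86, 1.58, 8.64, 2.56,
          5.42, 1.58, 4.58, 6.48, 4.56, 7.14, 6.34]
last_date = {"M": "01/30/2016", "WFM": "01/17/2016"}
df2 = pd.concat(
    [pd.DataFrame({"Company": company, "Date": dates + [last], "Price": prices})
     for company, last in last_date.items()],
    ignore_index=True,
)

df1["Quarter_End"] = pd.to_datetime(df1["Quarter_End"], format="%m/%d/%Y")
df2["Date"] = pd.to_datetime(df2["Date"], format="%m/%d/%Y")

# Outer merge, ordered by Company and then by date.
df = pd.merge_ordered(df1, df2,
                      left_on=["Company", "Quarter_End"],
                      right_on=["Company", "Date"],
                      how="outer")

# Back-fill the quarter labels within each company, so every daily row
# picks up the next quarter end on or after its date.
cols = ["Period", "Quarter_End"]
df[cols] = df.groupby("Company")[cols].bfill()

# Sum prices per quarter; min_count=1 leaves quarters without any price
# as NaN so that dropna can remove them.
result = (
    df.groupby(["Company", "Period", "Quarter_End"])["Price"]
      .sum(min_count=1)
      .dropna()
      .reset_index()
)
print(result)
```

On the sample data this should give the six rows of the desired table in the question; the two quarters with no daily prices (M 2016Q1 and WFM 2015Q2) are dropped.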

Answer 1 (score: 4)

  • set_index
  • pd.concat to align the indexes
  • groupby and agg
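# period_df is the quarter table, price_df the daily table (dates already datetime)
prd_df = period_df.set_index(['Company', 'Quarter_End'])

# drop=False keeps Date as a column so it can be aggregated below
prc_df = price_df.set_index(['Company', 'Date'], drop=False)

# align on the index, then sort so each quarter end follows its daily rows
df = pd.concat([prd_df, prc_df], axis=1).sort_index()

df.groupby([df.index.get_level_values(0), df.Period.bfill()])  \
  .agg(dict(Date='last', Price='sum')).dropna()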
