Python Pandas Dataframe - Groupby and Average Based on a Condition

Time: 2015-10-19 16:44:19

Tags: python pandas group-by dataframe mean

I have a dataframe that looks like this:

id  start       end         diff mindiff
1   2015-01-02  2015-07-01  180 57
2   2015-02-03  2015-05-12  98  56
3   2015-01-15  2015-01-20  5   5
4   2015-02-04  2015-04-15  70  55
5   2015-03-15  2015-05-01  47  46
6   2015-02-22  2015-03-01  7   7
7   2015-03-21  2015-04-12  22  22
8   2015-04-11  2015-06-15  65  50
9   2015-04-11  2015-05-01  20  20
10  2015-03-30  2015-04-01  2   2
11  2015-04-28  2015-06-15  48  33
12  2015-05-01  2015-06-01  31  31
13  2015-05-10  2015-06-09  30  30
14  2015-05-19  2015-07-01  43  42
15  2015-06-01  2015-06-06  5   5
16  2015-06-02  2015-06-29  27  27
17  2015-04-29  2015-05-21  22  22
18  2015-05-25  2015-07-01  37  36
19  2015-06-04  2015-06-26  22  22
20  2015-06-21  2015-07-01  10  10
21  2015-05-30  2015-06-06  7   7
22  2015-06-30  2015-07-01  1   1

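The fields are id, start (date), end (date), diff (the number of days between start and end), and mindiff (the minimum of diff and the number of days from start to the last day of the month x months after start).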

In this case, x is 1 (so one month after the month of the "start" date).
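To make the definition of mindiff concrete, here is a minimal sketch of how it could be computed for a general x. The function name `compute_mindiff` and its parameters are hypothetical, not something from my actual code; the cutoff is taken as the day before the first day of the month x + 1 months after the start month.

```python
import pandas as pd

def compute_mindiff(start, end, x=1):
    """Hypothetical helper: min(diff, days from start to the last day of the month x months later)."""
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    # First day of the month that is x + 1 months after the start month
    first_of_following = (start.dt.to_period("M") + x + 1).dt.start_time
    # One day earlier is the last day of the month x months after the start month
    cutoff = first_of_following - pd.Timedelta(days=1)
    diff = (end - start).dt.days
    to_cutoff = (cutoff - start).dt.days
    # Row-wise minimum of the two day counts
    return pd.concat([diff, to_cutoff], axis=1).min(axis=1)

# ids 1, 2 and 3 from the table above
starts = pd.Series(["2015-01-02", "2015-02-03", "2015-01-15"])
ends = pd.Series(["2015-07-01", "2015-05-12", "2015-01-20"])
mindiff = compute_mindiff(starts, ends, x=1)
print(mindiff.tolist())  # [57, 56, 5]
```

These three values match the mindiff column for ids 1, 2 and 3.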

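What I want to accomplish is to find the average (mean) of mindiff, grouped by the "end" year/month, but including in each group only the records whose "start" year/month lies within x months (as defined above) up to the grouped-by month. As an example based on the dataset above, id 1 would only be averaged in year/month 2015/1 and 2015/1 + x (2015/2).

Here is a table marking, for each record, the months in which I want it to be averaged: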

    Months                      
id  1   2   3   4   5   6   7
1   1   1                   
2       1   1               
3   1                       
4       1   1               
5           1   1           
6       1   1               
7           1   1           
8               1   1       
9               1   1       
10          1   1           
11              1   1       
12                  1   1   
13                  1   1   
14                  1   1   
15                      1   
16                      1   
17              1   1       
18                  1   1   
19                      1   
20                      1   1
21                  1   1   
22                      1   1
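The rule behind the marks in this table can be sketched as follows: a record counts from its start month up to x months later, but never past its end month. The helper name `months_for_record` is hypothetical and only serves to illustrate the rule.

```python
import pandas as pd

def months_for_record(start, end, x=1):
    """Hypothetical helper: the year/months in which one record should be averaged."""
    first = pd.Period(start, freq="M")
    # Stop x months after the start month, or at the end month if that comes first
    last = min(first + x, pd.Period(end, freq="M"))
    return [str(p) for p in pd.period_range(first, last, freq="M")]

id_1 = months_for_record("2015-01-02", "2015-07-01")
id_3 = months_for_record("2015-01-15", "2015-01-20")
id_20 = months_for_record("2015-06-21", "2015-07-01")
print(id_1)   # ['2015-01', '2015-02']
print(id_3)   # ['2015-01']
print(id_20)  # ['2015-06', '2015-07']
```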

Here are the mindiff values I want, together with the AVG per month:

    Months                      
id  1   2   3   4   5   6   7
1   57  57                  
2       56  56              
3   5                       
4       55  55              
5           46  46          
6       7   7               
7           22  22          
8               50  50      
9               20  20      
10          2   2           
11              33  33      
12                  31  31  
13                  30  30  
14                  42  42  
15                      5   
16                      27  
17              22  22      
18                  36  36  
19                      22  
20                      10  10
21                  7   7   
22                      1   1
AVG 31  43.8    31.3    27.9    30.1    21.1    5.5

Finally, this is the dataframe I am looking for:

Month   Avg Diff Trailing x months
2015-01 31
2015-02 43.75
2015-03 31.33333333
2015-04 27.85714286
2015-05 30.11111111
2015-06 21.1
2015-07 5.5

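I know this can be done with loops, but my instinct says that groupby is more pythonic and probably more efficient. However, how can I get the specific rolling mindiff values for the "start" months so that they are averaged in the grouping by "end" year/month? Thanks for your help.

1 Answer:

Answer 0: (score: 2)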

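First, I created test data spanning different years and set the start of the last row to December. Then I converted the start and end columns to periods, stored in the columns periodS and periodE.

I group by the column months with groupby and compute the mean of the column Avg: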

g = df1.groupby('months')['Avg'].mean().reset_index()

import pandas as pd
import numpy as np
import io

temp=u"""id;start;end
1;2014-01-02;2014-07-01
2;2014-02-03;2014-05-12
3;2014-01-15;2014-01-20
4;2014-02-04;2014-04-15
5;2014-03-15;2014-05-01
6;2014-02-22;2014-03-01
7;2015-03-21;2015-04-12
8;2015-04-11;2015-06-15
9;2015-04-11;2015-05-01
10;2015-03-30;2015-04-01
11;2015-04-28;2015-06-15
12;2015-05-01;2015-06-01
13;2015-05-10;2015-06-09
14;2016-05-19;2016-07-01
15;2016-06-01;2016-06-06
16;2016-06-02;2016-06-29
17;2016-04-29;2016-05-21
18;2016-05-25;2016-07-01
19;2017-06-04;2017-06-26
20;2017-06-21;2017-07-01
21;2017-05-30;2017-06-06
22;2017-12-30;2018-02-01"""

df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0])
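print(df)

def last_day_of_next_month(any_day):
    # day 28 exists in every month, and 36 days later always falls in the month after next
    next_month = any_day.replace(day=28) + pd.Timedelta(days=36)  # this will never fail
    # stepping back by the day of the month gives the last day of the month after any_day's month
    return next_month - pd.Timedelta(days=next_month.day)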

# .dt.days gives whole days; astype('timedelta64[D]') is rejected by recent pandas versions
start_dt = pd.to_datetime(df['start'])
df['mindiff'] = (start_dt.apply(last_day_of_next_month) - start_dt).dt.days
df['diff'] = (pd.to_datetime(df['end']) - start_dt).dt.days
df['mindiff'] = df[['mindiff', 'diff']].min(axis=1)
#print(df)

# convert start and end to monthly periods
df['periodS'] =  pd.to_datetime(df['start']).dt.to_period('M')
df['periodE'] =  pd.to_datetime(df['end']).dt.to_period('M')

# if the end period is later than the start period, use start period + 1 month, else NaN
df['period'] = np.where(df['periodE'] > df['periodS'],df['periodS'] + 1, np.nan)
#print(df)
# take the subset of columns that is needed
df1 = df[['mindiff', 'periodS', 'period']]
# reshape from wide to long (columns to rows)
df1 = df1.set_index('mindiff').stack().reset_index()
# rename the columns
df1.columns = ['Avg', 'tmp', 'months']
# group by the column months and compute the mean of the column Avg
g = df1.groupby('months')['Avg'].mean().reset_index()
print(g)
#     months        Avg
#0   2014-01  31.000000
#1   2014-02  43.750000
#2   2014-03  41.000000
#3   2014-04  46.000000
#4   2015-03  12.000000
#5   2015-04  25.400000
#6   2015-05  32.800000
#7   2015-06  30.500000
#8   2016-04  22.000000
#9   2016-05  33.333333
#10  2016-06  27.500000
#11  2017-05   7.000000
#12  2017-06  13.000000
#13  2017-07  10.000000
#14  2017-12  32.000000
#15  2018-01  32.000000
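
The code above fixes x at 1. As a variation that is not part of the original answer, here is a minimal sketch that takes x as a parameter and uses `DataFrame.explode` to give each record one row per month it counts in. The function name `trailing_month_means` and the variable names are hypothetical. It is run on the question's own 22 records, with mindiff taken as given there, and assumes x = 1.

```python
import io
import pandas as pd

# The question's records (diff omitted; mindiff copied from the question).
raw = """id start end mindiff
1 2015-01-02 2015-07-01 57
2 2015-02-03 2015-05-12 56
3 2015-01-15 2015-01-20 5
4 2015-02-04 2015-04-15 55
5 2015-03-15 2015-05-01 46
6 2015-02-22 2015-03-01 7
7 2015-03-21 2015-04-12 22
8 2015-04-11 2015-06-15 50
9 2015-04-11 2015-05-01 20
10 2015-03-30 2015-04-01 2
11 2015-04-28 2015-06-15 33
12 2015-05-01 2015-06-01 31
13 2015-05-10 2015-06-09 30
14 2015-05-19 2015-07-01 42
15 2015-06-01 2015-06-06 5
16 2015-06-02 2015-06-29 27
17 2015-04-29 2015-05-21 22
18 2015-05-25 2015-07-01 36
19 2015-06-04 2015-06-26 22
20 2015-06-21 2015-07-01 10
21 2015-05-30 2015-06-06 7
22 2015-06-30 2015-07-01 1
"""
records = pd.read_csv(io.StringIO(raw), sep=" ", parse_dates=["start", "end"])

def trailing_month_means(frame, x=1):
    """Hypothetical helper: mean mindiff per month, each record counting for up to x + 1 months."""
    start_m = frame["start"].dt.to_period("M")
    end_m = frame["end"].dt.to_period("M")
    # For each record: start month .. min(start month + x, end month)
    months = [
        list(pd.period_range(s, min(s + x, e), freq="M"))
        for s, e in zip(start_m, end_m)
    ]
    long = frame[["id", "mindiff"]].assign(month=pd.Series(months, index=frame.index))
    # One row per (record, month) pair
    long = long.explode("month")
    out = long.groupby("month")["mindiff"].mean()
    out.index = out.index.astype(str)
    return out

result = trailing_month_means(records, x=1)
print(result)
```

With x = 1 this should reproduce the averages requested in the question (31, 43.75, 31.33, 27.86, 30.11, 21.1 and 5.5 for 2015-01 through 2015-07).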