我有一个如下所示的数据框:
id start end diff mindiff
1 2015-01-02 2015-07-01 180 57
2 2015-02-03 2015-05-12 98 56
3 2015-01-15 2015-01-20 5 5
4 2015-02-04 2015-04-15 70 55
5 2015-03-15 2015-05-01 47 46
6 2015-02-22 2015-03-01 7 7
7 2015-03-21 2015-04-12 22 22
8 2015-04-11 2015-06-15 65 50
9 2015-04-11 2015-05-01 20 20
10 2015-03-30 2015-04-01 2 2
11 2015-04-28 2015-06-15 48 33
12 2015-05-01 2015-06-01 31 31
13 2015-05-10 2015-06-09 30 30
14 2015-05-19 2015-07-01 43 42
15 2015-06-01 2015-06-06 5 5
16 2015-06-02 2015-06-29 27 27
17 2015-04-29 2015-05-21 22 22
18 2015-05-25 2015-07-01 37 36
19 2015-06-04 2015-06-26 22 22
20 2015-06-21 2015-07-01 10 10
21 2015-05-30 2015-06-06 7 7
22 2015-06-30 2015-07-01 1 1
字段是id,start(日期),end(日期),diff(开始和结束之间的天数),mindiff(min(差异和最后一天x起点后的几个月)。
在这种情况下,是1(所以比“开始日期”晚一个月)
我想要完成的是找到Mindiff的平均值(平均值),按“结束”的年/月分组,但只对每组记录具有“开始”年/月x的记录(上面定义的几个月,直到groupedby
个月。根据上述数据集的示例,id 1将仅以年/月2015/1和2015/1 + x(2015/2)进行平均。
这是一个标记每条记录的表格,以及我想要平均的月份:
Months
id 1 2 3 4 5 6 7
1 1 1
2 1 1
3 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 1
12 1 1
13 1 1
14 1 1
15 1
16 1
17 1 1
18 1 1
19 1
20 1 1
21 1 1
22 1 1
以下是我想要的精神和AVG /月:
Months
id 1 2 3 4 5 6 7
1 57 57
2 56 56
3 5
4 55 55
5 46 46
6 7 7
7 22 22
8 50 50
9 20 20
10 2 2
11 33 33
12 31 31
13 30 30
14 42 42
15 5
16 27
17 22 22
18 36 36
19 22
20 10 10
21 7 7
22 1 1
AVG 31 43.8 31.3 27.9 30.1 21.1 5.5
最后,这是我正在寻找的数据帧:
Month Avg Diff Trailing x months
2015-01 31
2015-02 43.75
2015-03 31.33333333
2015-05 27.85714286
2015-05 30.11111111
2015-06 21.1
2015-07 5.5
我知道这可以用循环,但我的直觉说GROUPBY更pythonic,可能更有效。但是,我如何才能获得“开始”月份的特定滚动思维值,以便在“年末/月”的组合中进行平均。谢谢您的帮助。
答案 0 :(得分:2)
首先,我创建了不同年份的测试数据,并将最后一行的开始设置为12月。然后,我将start
和end
列转换为句点 - periodS
和periodE
列。
我按列groupby
使用函数month
,并从列Avg
计算均值:
g = df1.groupby('months')['Avg'].mean().reset_index()
import pandas as pd
import numpy as np
import io
temp=u"""id;start;end
1;2014-01-02;2014-07-01
2;2014-02-03;2014-05-12
3;2014-01-15;2014-01-20
4;2014-02-04;2014-04-15
5;2014-03-15;2014-05-01
6;2014-02-22;2014-03-01
7;2015-03-21;2015-04-12
8;2015-04-11;2015-06-15
9;2015-04-11;2015-05-01
10;2015-03-30;2015-04-01
11;2015-04-28;2015-06-15
12;2015-05-01;2015-06-01
13;2015-05-10;2015-06-09
14;2016-05-19;2016-07-01
15;2016-06-01;2016-06-06
16;2016-06-02;2016-06-29
17;2016-04-29;2016-05-21
18;2016-05-25;2016-07-01
19;2017-06-04;2017-06-26
20;2017-06-21;2017-07-01
21;2017-05-30;2017-06-06
22;2017-12-30;2018-02-01"""
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0])
print df
def last_day_of_next_month(any_day):
next_month = any_day.replace(day=28) + pd.Timedelta(days=36) # this will never fail
return next_month - pd.Timedelta(days=next_month.day)
df['mindiff'] = (pd.to_datetime(df['start']).apply(last_day_of_next_month) - pd.to_datetime(df['start'])).astype('timedelta64[D]')
df['diff'] = (pd.to_datetime(df['end']) - pd.to_datetime(df['start'])).astype('timedelta64[D]')
df['mindiff'] = df[['mindiff', 'diff']].apply(lambda x: min(x), axis=1)
#print df
#set day of start and end to periodindex
df['periodS'] = pd.to_datetime(df['start']).dt.to_period('M')
df['periodE'] = pd.to_datetime(df['end']).dt.to_period('M')
#if period end is higher as period start, add one month else NaN
df['period'] = np.where(df['periodE'] > df['periodS'],df['periodS'] + 1, np.nan)
#print df
#df from subset
df1 = df[['mindiff', 'periodS', 'period']]
#pivot data (from rows to columns)
df1 = df1.set_index('mindiff').stack().reset_index()
#rename columns names
df1.columns = ['Avg', 'tmp', 'months']
#groupby by column month and count mean from column Avg
g = df1.groupby('months')['Avg'].mean().reset_index()
print g
# months Avg
#0 2014-01 31.000000
#1 2014-02 43.750000
#2 2014-03 41.000000
#3 2014-04 46.000000
#4 2015-03 12.000000
#5 2015-04 25.400000
#6 2015-05 32.800000
#7 2015-06 30.500000
#8 2016-04 22.000000
#9 2016-05 33.333333
#10 2016-06 27.500000
#11 2017-05 7.000000
#12 2017-06 13.000000
#13 2017-07 10.000000
#14 2017-12 32.000000
#15 2018-01 32.000000