根据以下输入,我们的目标是按平均值为每个日期使用平均值和总和函数对值进行分组。 按小时对其进行分组的解决方案是here,但它不考虑新的一天。
Date Time F1 F2 F3
21-01-16 8:11 5 2 4
21-01-16 9:25 9 8 2
21-01-16 9:39 7 3 2
21-01-16 9:53 6 5 1
21-01-16 10:07 4 6 7
21-01-16 10:21 7 3 1
21-01-16 10:35 5 6 7
21-01-16 11:49 1 2 1
21-01-16 12:03 3 3 1
22-01-16 9:45 6 5 1
22-01-16 9:20 4 6 7
22-01-16 12:10 7 3 1
预期产出:
Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1
答案 0 :(得分:4)
您可以在阅读csv文件时解析日期:
from __future__ import print_function # make it work with Python 2 and 3
df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
delim_whitespace=True)
print(df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum']))
输出:
F1 F2 F3
mean sum mean sum mean sum
Date Time
2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
9 7.333333 22 5.333333 16 1.666667 5
10 5.333333 16 5.000000 15 5.000000 15
11 1.000000 1 2.000000 2 1.000000 1
12 3.000000 3 3.000000 3 1.000000 1
2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
12 7.000000 7 3.000000 3 1.000000 1
一直到csv:
from __future__ import print_function
df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
delim_whitespace=True)
df2 = df.groupby([df.index, df.Time.dt.hour]).agg(['mean','sum'])
df3 = df2.reset_index()
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))
输出:
Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1
答案 1 :(得分:3)
您可以使用to_datetime
将time
转换为datetime
,然后使用groupby
转换agg
:
print df
Date Time F1 F2 F3
0 2016-01-21 8:11 5 2 4
1 2016-01-21 9:25 9 8 2
2 2016-01-21 9:39 7 3 2
3 2016-01-21 9:53 6 5 1
4 2016-01-21 10:07 4 6 7
5 2016-01-21 10:21 7 3 1
6 2016-01-21 10:35 5 6 7
7 2016-01-21 11:49 1 2 1
8 2016-01-21 12:03 3 3 1
9 2016-01-22 9:45 6 5 1
10 2016-01-22 9:20 4 6 7
11 2016-01-22 12:10 7 3 1
df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print df
Date Time F1 F2 F3
0 2016-01-21 1900-01-01 08:11:00 5 2 4
1 2016-01-21 1900-01-01 09:25:00 9 8 2
2 2016-01-21 1900-01-01 09:39:00 7 3 2
3 2016-01-21 1900-01-01 09:53:00 6 5 1
4 2016-01-21 1900-01-01 10:07:00 4 6 7
5 2016-01-21 1900-01-01 10:21:00 7 3 1
6 2016-01-21 1900-01-01 10:35:00 5 6 7
7 2016-01-21 1900-01-01 11:49:00 1 2 1
8 2016-01-21 1900-01-01 12:03:00 3 3 1
9 2016-01-22 1900-01-01 09:45:00 6 5 1
10 2016-01-22 1900-01-01 09:20:00 4 6 7
11 2016-01-22 1900-01-01 12:10:00 7 3 1
df = df.groupby([df['Date'], df['Time'].dt.hour]).agg(['mean','sum']).reset_index()
print df
Date Time F1 F2 F3
mean sum mean sum mean sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1
然后您可以按列表理解设置列名称:
levels = df.columns.levels
labels = df.columns.labels
df.columns = [ x + " " + y for x, y in zip(levels[0][labels[0]],df.columns.droplevel(0))]
print df
Date Time F1 mean F1 sum F2 mean F2 sum F3 mean F3 sum
0 2016-01-21 8 5.000000 5 2.000000 2 4.000000 4
1 2016-01-21 9 7.333333 22 5.333333 16 1.666667 5
2 2016-01-21 10 5.333333 16 5.000000 15 5.000000 15
3 2016-01-21 11 1.000000 1 2.000000 2 1.000000 1
4 2016-01-21 12 3.000000 3 3.000000 3 1.000000 1
5 2016-01-22 9 5.000000 10 5.500000 11 4.000000 8
6 2016-01-22 12 7.000000 7 3.000000 3 1.000000 1