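How to group by two fields in pandas?

Asked: 2016-01-22 10:17:14

Tags: python pandas aggregate

Given the input below, the goal is to group the values by hour within each date and compute the mean and the sum of each group. A solution that groups by hour is here, but it does not account for a new day starting.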

Date        Time    F1  F2  F3
21-01-16    8:11    5   2   4
21-01-16    9:25    9   8   2
21-01-16    9:39    7   3   2
21-01-16    9:53    6   5   1
21-01-16    10:07   4   6   7
21-01-16    10:21   7   3   1
21-01-16    10:35   5   6   7
21-01-16    11:49   1   2   1
21-01-16    12:03   3   3   1
22-01-16    9:45    6   5   1
22-01-16    9:20    4   6   7
22-01-16    12:10   7   3   1

Expected output:

Date,Time,SUM F1,SUM F2,SUM F3,AVG F1,AVG F2,AVG F3
21-01-16,8:00,5,2,4,5,2,4
21-01-16,9:00,22,16,5,7.3,5.3,1.6
21-01-16,10:00,16,15,15,5.3,5,5
21-01-16,11:00,1,2,1,1,2,1
21-01-16,12:00,3,3,1,3,3,1
22-01-16,9:00,10,11,8,5,5.5,4
22-01-16,12:00,7,3,1,7,3,1

2 Answers:

Answer 0 (score: 4)

You can parse the dates while reading the CSV file:

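from __future__ import print_function # make it work with Python 2 and 3

import pandas as pd

# Parse both the Date index and the Time column while reading.
df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 sep=r'\s+')
# Select F1..F3 explicitly so the datetime Time column is not aggregated.
print(df.groupby([df.index, df.Time.dt.hour])[['F1', 'F2', 'F3']].agg(['mean','sum']))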

Output:

                       F1            F2            F3    
                     mean sum      mean sum      mean sum
Date       Time                                          
2016-01-21 8     5.000000   5  2.000000   2  4.000000   4
           9     7.333333  22  5.333333  16  1.666667   5
           10    5.333333  16  5.000000  15  5.000000  15
           11    1.000000   1  2.000000   2  1.000000   1
           12    3.000000   3  3.000000   3  1.000000   1
2016-01-22 9     5.000000  10  5.500000  11  4.000000   8
           12    7.000000   7  3.000000   3  1.000000   1
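
For readers without the `f123_dates.csv` file, the same grouping can be reproduced from the rows in the question. This is only a sketch: the inline `RAW` string, the explicit `format` arguments, and the `Hour` name are assumptions made here for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the question (whitespace-separated).
RAW = """\
Date        Time    F1  F2  F3
21-01-16    8:11    5   2   4
21-01-16    9:25    9   8   2
21-01-16    9:39    7   3   2
21-01-16    9:53    6   5   1
21-01-16    10:07   4   6   7
21-01-16    10:21   7   3   1
21-01-16    10:35   5   6   7
21-01-16    11:49   1   2   1
21-01-16    12:03   3   3   1
22-01-16    9:45    6   5   1
22-01-16    9:20    4   6   7
22-01-16    12:10   7   3   1
"""

df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")

# Explicit formats remove any ambiguity about day-first dates.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%y")
hour = pd.to_datetime(df["Time"], format="%H:%M").dt.hour.rename("Hour")

# Group by date first, then by hour, so equal hours on different days stay apart.
result = df.groupby(["Date", hour])[["F1", "F2", "F3"]].agg(["mean", "sum"])
print(result)
```

Grouping on the date as well as the hour is what keeps 9:00 on 21-01-16 separate from 9:00 on 22-01-16.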

All the way to CSV:

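from __future__ import print_function

import pandas as pd

df = pd.read_csv('f123_dates.csv', index_col=0, parse_dates=[0, 1],
                 sep=r'\s+')
# Select F1..F3 explicitly so the datetime Time column is not aggregated.
df2 = df.groupby([df.index, df.Time.dt.hour])[['F1', 'F2', 'F3']].agg(['mean','sum'])
# Move the group keys back into columns, then flatten the two-level header.
df3 = df2.reset_index()
df3.columns = [' '.join(col).strip() for col in df3.columns.values]
print(df3.to_csv(columns=df3.columns, index=False))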

Output:

Date,Time,F1 mean,F1 sum,F2 mean,F2 sum,F3 mean,F3 sum
2016-01-21,8,5.0,5,2.0,2,4.0,4
2016-01-21,9,7.333333333333333,22,5.333333333333333,16,1.6666666666666667,5
2016-01-21,10,5.333333333333333,16,5.0,15,5.0,15
2016-01-21,11,1.0,1,2.0,2,1.0,1
2016-01-21,12,3.0,3,3.0,3,1.0,1
2016-01-22,9,5.0,10,5.5,11,4.0,8
2016-01-22,12,7.0,7,3.0,3,1.0,1 
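
The CSV above has the right numbers but not the exact layout requested in the question (sums before averages, headers such as `SUM F1`, hours written as `9:00`). One possible way to get closer is named aggregation, available in pandas 0.25 and later. The sketch below is an assumption-laden illustration: the inline `RAW` data, the `Hour` helper column, and the one-decimal `float_format` are choices made here, and rounding gives `1.7` where the question shows `1.6`.

```python
import io

import pandas as pd

# Sample rows copied from the question (whitespace-separated).
RAW = """\
Date        Time    F1  F2  F3
21-01-16    8:11    5   2   4
21-01-16    9:25    9   8   2
21-01-16    9:39    7   3   2
21-01-16    9:53    6   5   1
21-01-16    10:07   4   6   7
21-01-16    10:21   7   3   1
21-01-16    10:35   5   6   7
21-01-16    11:49   1   2   1
21-01-16    12:03   3   3   1
22-01-16    9:45    6   5   1
22-01-16    9:20    4   6   7
22-01-16    12:10   7   3   1
"""

df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%y")
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M").dt.hour

value_cols = ["F1", "F2", "F3"]
# Named aggregation: output column name -> (input column, function).
spec = {"SUM " + c: (c, "sum") for c in value_cols}
spec.update({"AVG " + c: (c, "mean") for c in value_cols})

out = df.groupby(["Date", "Hour"]).agg(**spec).reset_index()

# Restore the question's text formats only after grouping, so sorting stays chronological.
out["Date"] = out["Date"].dt.strftime("%d-%m-%y")
out["Time"] = out["Hour"].astype(str) + ":00"
out = out[["Date", "Time"] + list(spec)]

csv_text = out.to_csv(index=False, float_format="%.1f")
print(csv_text)
```

Formatting the date and hour as strings after the `groupby`, rather than before, avoids lexicographic ordering problems such as `10:00` sorting ahead of `8:00`.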

Answer 1 (score: 3)

You can convert the Time column to datetime with to_datetime, then use groupby with agg:

print(df)
         Date   Time  F1  F2  F3
0  2016-01-21   8:11   5   2   4
1  2016-01-21   9:25   9   8   2
2  2016-01-21   9:39   7   3   2
3  2016-01-21   9:53   6   5   1
4  2016-01-21  10:07   4   6   7
5  2016-01-21  10:21   7   3   1
6  2016-01-21  10:35   5   6   7
7  2016-01-21  11:49   1   2   1
8  2016-01-21  12:03   3   3   1
9  2016-01-22   9:45   6   5   1
10 2016-01-22   9:20   4   6   7
11 2016-01-22  12:10   7   3   1

df['Time'] = pd.to_datetime(df['Time'], format="%H:%M")
print(df)
         Date                Time  F1  F2  F3
0  2016-01-21 1900-01-01 08:11:00   5   2   4
1  2016-01-21 1900-01-01 09:25:00   9   8   2
2  2016-01-21 1900-01-01 09:39:00   7   3   2
3  2016-01-21 1900-01-01 09:53:00   6   5   1
4  2016-01-21 1900-01-01 10:07:00   4   6   7
5  2016-01-21 1900-01-01 10:21:00   7   3   1
6  2016-01-21 1900-01-01 10:35:00   5   6   7
7  2016-01-21 1900-01-01 11:49:00   1   2   1
8  2016-01-21 1900-01-01 12:03:00   3   3   1
9  2016-01-22 1900-01-01 09:45:00   6   5   1
10 2016-01-22 1900-01-01 09:20:00   4   6   7
11 2016-01-22 1900-01-01 12:10:00   7   3   1

# Select F1..F3 explicitly so the datetime Time column is not aggregated.
df = df.groupby([df['Date'], df['Time'].dt.hour])[['F1', 'F2', 'F3']].agg(['mean','sum']).reset_index()
print(df)
        Date Time        F1            F2            F3    
                       mean sum      mean sum      mean sum
0 2016-01-21    8  5.000000   5  2.000000   2  4.000000   4
1 2016-01-21    9  7.333333  22  5.333333  16  1.666667   5
2 2016-01-21   10  5.333333  16  5.000000  15  5.000000  15
3 2016-01-21   11  1.000000   1  2.000000   2  1.000000   1
4 2016-01-21   12  3.000000   3  3.000000   3  1.000000   1
5 2016-01-22    9  5.000000  10  5.500000  11  4.000000   8
6 2016-01-22   12  7.000000   7  3.000000   3  1.000000   1

Then you can set the column names with a list comprehension:

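levels = df.columns.levels
codes = df.columns.codes  # this attribute was named `labels` before pandas 0.24
# Join the top-level name with the aggregate name; strip() drops the trailing space on Date and Time.
df.columns = [(x + " " + y).strip() for x, y in zip(levels[0][codes[0]], df.columns.droplevel(0))]
print(df)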

        Date   Time   F1 mean  F1 sum   F2 mean  F2 sum   F3 mean  F3 sum
0 2016-01-21      8  5.000000       5  2.000000       2  4.000000       4
1 2016-01-21      9  7.333333      22  5.333333      16  1.666667       5
2 2016-01-21     10  5.333333      16  5.000000      15  5.000000      15
3 2016-01-21     11  1.000000       1  2.000000       2  1.000000       1
4 2016-01-21     12  3.000000       3  3.000000       3  1.000000       1
5 2016-01-22      9  5.000000      10  5.500000      11  4.000000       8
6 2016-01-22     12  7.000000       7  3.000000       3  1.000000       1
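
As a variation on the two answers, the Date and Time columns can be merged into a single timestamp and floored to the hour, which gives one grouping key that already distinguishes days. This is a sketch under stated assumptions, not the method of either answer: it uses only a four-row subset of the question's data, and the names `stamp` and `bucket` are hypothetical.

```python
import pandas as pd

# A small subset of the question's rows: the 9 o'clock hour on two different days.
df = pd.DataFrame({
    "Date": ["21-01-16", "21-01-16", "22-01-16", "22-01-16"],
    "Time": ["9:25", "9:39", "9:45", "9:20"],
    "F1": [9, 7, 6, 4],
})

# Combine the two text columns into one timestamp, then floor it to the hour.
stamp = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d-%m-%y %H:%M")
bucket = stamp.dt.floor("h").rename("Hour")

# One key is enough: 2016-01-21 09:00 and 2016-01-22 09:00 are different buckets.
result = df.groupby(bucket)["F1"].agg(["sum", "mean"])
print(result)
```

The result is indexed by full timestamps, so it can be split back into separate date and hour columns afterwards if the two-column layout is needed.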