Return the number of occurrences by hour - pandas

Time: 2019-03-27 03:35:44

Tags: python pandas datetime group-by

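I am trying to return the maximum values grouped by hour. I tried the approach below, but it produces several rows for the same hour (group). I want only the maximum value for each hour.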

import pandas as pd

d = ({
    'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],                 
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],                      
     })

df = pd.DataFrame(data = d)

df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S') 

# Count rows per (hour, People) pair; 'h' is the hourly alias ('H' is deprecated in recent pandas)
df = df.groupby([pd.Grouper(key='Time',freq='h'),df.People]).size().reset_index(name='count')

print(df)

                  Time  People  count
0  1900-01-01 08:00:00       1      1
1  1900-01-01 09:00:00       1      1
2  1900-01-01 10:00:00       2      1
3  1900-01-01 12:00:00       2      1
4  1900-01-01 12:00:00       3      1
5  1900-01-01 13:00:00       2      2
6  1900-01-01 13:00:00       3      2
7  1900-01-01 18:00:00       2      1
8  1900-01-01 18:00:00       3      3
9  1900-01-01 18:00:00       4      2
10 1900-01-01 19:00:00       2      1
11 1900-01-01 19:00:00       3      1
12 1900-01-01 21:00:00       3      1
13 1900-01-01 22:00:00       3      1
14 1900-01-01 22:00:00       4      2
15 1900-01-02 03:00:00       1      1
16 1900-01-02 03:00:00       2      2
17 1900-01-02 03:00:00       3      1

Expected output:

                  Time  People  count
0  1900-01-01 08:00:00       1      1
1  1900-01-01 09:00:00       1      1
2  1900-01-01 10:00:00       2      2
3  1900-01-01 12:00:00       2      3
4  1900-01-01 13:00:00       2      3
5  1900-01-01 18:00:00       2      4
6  1900-01-01 19:00:00       2      3
7  1900-01-01 21:00:00       3      3
8  1900-01-01 22:00:00       3      4
9  1900-01-02 03:00:00       1      3

2 answers:

Answer 0 (score: 1)

Use pandas.DataFrame.groupby. Given df (with Time already floored to the hour):

                   Time  People
0   1900-01-01 08:00:00       1
1   1900-01-01 09:00:00       1
2   1900-01-01 10:00:00       2
3   1900-01-01 12:00:00       2
4   1900-01-01 12:00:00       3
5   1900-01-01 13:00:00       2
6   1900-01-01 13:00:00       3
7   1900-01-01 18:00:00       2
8   1900-01-01 18:00:00       3
9   1900-01-01 18:00:00       4
10  1900-01-01 19:00:00       2
11  1900-01-01 19:00:00       3
12  1900-01-01 21:00:00       3
13  1900-01-01 22:00:00       3
14  1900-01-01 22:00:00       4
15  1900-01-02 03:00:00       1
16  1900-01-02 03:00:00       2
17  1900-01-02 03:00:00       3

df.groupby('Time')['People'].max() returns:

Time
1900-01-01 08:00:00    1
1900-01-01 09:00:00    1
1900-01-01 10:00:00    2
1900-01-01 12:00:00    3
1900-01-01 13:00:00    3
1900-01-01 18:00:00    4
1900-01-01 19:00:00    3
1900-01-01 21:00:00    3
1900-01-01 22:00:00    4
1900-01-02 03:00:00    3
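
The same idea can be applied to the raw rows directly, without building the intermediate count table first. The following is a minimal sketch, not the answerer's code: it assumes the timestamps have already been shifted by one day as in the question, floors each one to its hour with Series.dt.floor, and aggregates People per hour. Reading the expected output as "smallest People value" and "largest People value" per hour is an interpretation of the question, not something the asker states; the names raw and hourly are made up for illustration.

```python
import pandas as pd

# Raw observations from the question, with the day already shifted by one
# ('0/1/1900' becomes 1900-01-01 and '1/1/1900' becomes 1900-01-02).
times = [
    '1900-01-01 08:00', '1900-01-01 09:59', '1900-01-01 10:00',
    '1900-01-01 12:29', '1900-01-01 12:30', '1900-01-01 13:00',
    '1900-01-01 13:02', '1900-01-01 13:15', '1900-01-01 13:20',
    '1900-01-01 18:10', '1900-01-01 18:15', '1900-01-01 18:20',
    '1900-01-01 18:25', '1900-01-01 18:45', '1900-01-01 18:50',
    '1900-01-01 19:05', '1900-01-01 19:07', '1900-01-01 21:57',
    '1900-01-01 22:00', '1900-01-01 22:30', '1900-01-01 22:35',
    '1900-01-02 03:00', '1900-01-02 03:05', '1900-01-02 03:20',
    '1900-01-02 03:25',
]
people = [1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 3, 3, 4, 4, 3, 3, 2, 2, 1]

raw = pd.DataFrame({'Time': pd.to_datetime(times), 'People': people})

# Floor every timestamp to the start of its hour, then take the smallest
# and largest People value observed within each hour.
hourly = (
    raw.groupby(raw['Time'].dt.floor('h'))['People']
       .agg(['min', 'max'])
       .reset_index()
)

print(hourly)
```

With this data the min and max columns line up with the People and count columns of the expected output in the question, which suggests (but does not prove) that this is what the asker was after.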

Answer 1 (score: 1)

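For more control over the individual items, you can iterate over the unique keys of df, take the max() of the other columns for each key, modify the values as needed, and then rebuild the DataFrame. This should work: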

import pandas as pd

d = ({
    'Time' : ['0/1/1900 8:00:00','0/1/1900 9:59:00','0/1/1900 10:00:00','0/1/1900 12:29:00','0/1/1900 12:30:00','0/1/1900 13:00:00','0/1/1900 13:02:00','0/1/1900 13:15:00','0/1/1900 13:20:00','0/1/1900 18:10:00','0/1/1900 18:15:00','0/1/1900 18:20:00','0/1/1900 18:25:00','0/1/1900 18:45:00','0/1/1900 18:50:00','0/1/1900 19:05:00','0/1/1900 19:07:00','0/1/1900 21:57:00','0/1/1900 22:00:00','0/1/1900 22:30:00','0/1/1900 22:35:00','1/1/1900 3:00:00','1/1/1900 3:05:00','1/1/1900 3:20:00','1/1/1900 3:25:00'],
    'People' : [1,1,2,2,3,3,2,2,3,3,4,4,3,3,2,2,3,3,4,4,3,3,2,2,1],
     })

df = pd.DataFrame(data = d)

df['Time'] = ['/'.join([str(int(x.split('/')[0])+1)] + x.split('/')[1:]) for x in df['Time']]
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S')


# Count rows per (hour, People) pair; 'h' is the hourly alias ('H' is deprecated in recent pandas)
df = df.groupby([pd.Grouper(key='Time',freq='h'),df.People]).size().reset_index(name='count')

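# Unique hour buckets, sorted so the result comes out in chronological order
# (iterating over a bare set gives an arbitrary order)
single_times = sorted(set(df['Time']))
p, c = [], []
for v in single_times:
    rows = df.loc[df['Time'] == v]    # all rows for this hour
    c.append(rows['count'].max())     # largest count within the hour
    p.append(rows['People'].max())    # largest People value within the hour

# Adjust c / p here if needed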

dfdata = {
    'Time' : list(single_times),
    'People' : p,
    'Count' : c
}
df2 = pd.DataFrame(data = dfdata)

print(df2)

There may be a faster way.
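
One possible faster alternative to the loop is a single groupby with named aggregation, which computes both maxima per hour in one pass. This is a sketch under assumptions rather than part of the original answer: it requires pandas 0.25 or later for the named-aggregation syntax, it rebuilds the counted table from the output printed in the question, and the names rows, counted and result are made up for illustration.

```python
import pandas as pd

# Counted table as printed in the question: (hour bucket, People, count).
rows = [
    ('1900-01-01 08:00', 1, 1),
    ('1900-01-01 09:00', 1, 1),
    ('1900-01-01 10:00', 2, 1),
    ('1900-01-01 12:00', 2, 1),
    ('1900-01-01 12:00', 3, 1),
    ('1900-01-01 13:00', 2, 2),
    ('1900-01-01 13:00', 3, 2),
    ('1900-01-01 18:00', 2, 1),
    ('1900-01-01 18:00', 3, 3),
    ('1900-01-01 18:00', 4, 2),
    ('1900-01-01 19:00', 2, 1),
    ('1900-01-01 19:00', 3, 1),
    ('1900-01-01 21:00', 3, 1),
    ('1900-01-01 22:00', 3, 1),
    ('1900-01-01 22:00', 4, 2),
    ('1900-01-02 03:00', 1, 1),
    ('1900-01-02 03:00', 2, 2),
    ('1900-01-02 03:00', 3, 1),
]
counted = pd.DataFrame(rows, columns=['Time', 'People', 'count'])
counted['Time'] = pd.to_datetime(counted['Time'])

# One row per hour: the largest People value and the largest count seen
# in that hour. groupby sorts the keys, so the rows come out in time order.
result = (
    counted.groupby('Time', as_index=False)
           .agg(People=('People', 'max'), Count=('count', 'max'))
)

print(result)
```

As in the loop version, the two maxima are taken independently, so the People and Count values in a row do not necessarily come from the same original row.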