How to group by and sum in Python without counting duplicate values

Time: 2019-04-16 09:47:26

Tags: python pandas

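I want to convert times such as 12:45 from strings to a datetime type while keeping that display format, and then compute how long each activity lasted (stored in activity_duration). Second, I want to sum activity_duration grouped by activity_station.

I converted the times to datetime, but the values picked up an arbitrary year, month, and day. I know how to group, but not how to eliminate duplicates when applying the group-by.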

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Shift_id' :[ 123,123,123,123,123,123,123,123,123,123,123,123,123,123,123,
                345,345,345,345,345,345,345,345,345,345,345,345,345,345,345,345],
    'activity_id' : [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,
                    6,7,8,9,6,7,8,9,6,7,8,9,6,7,8,9],
    'activity_begin_time' : ['09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25',
                            '09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30'],
    'activity_end_time' : ['09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30',
                          '09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25'],
    'activity_station' : ['None', 'Za','None','Ba','None','None', 'Za','None','Ba','None','None', 'Za','None','Ba','None',
                         'None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra']
})


df['activity_begin_time'] = pd.to_datetime(df['activity_begin_time'])
df['activity_end_time'] = pd.to_datetime(df['activity_end_time'])
df['activity_duration'] = df['activity_end_time'] - df['activity_begin_time']
df['activity_duration'] = df['activity_duration']/np.timedelta64(1,'h')

I want to sum activity_duration grouped by activity_station while eliminating the duplicate values.

1 answer:

Answer 0: (score: 2)

Here is my solution:

import pandas as pd

df = pd.DataFrame({
    'Shift_id' :[ 123,123,123,123,123,123,123,123,123,123,123,123,123,123,123,
                345,345,345,345,345,345,345,345,345,345,345,345,345,345,345,345],
    'activity_id' : [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,
                    6,7,8,9,6,7,8,9,6,7,8,9,6,7,8,9],
    'activity_begin_time' : ['09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25','09:00','09:05','12:00','12:30','17:25',
                            '09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30','09:00','09:05','12:00','12:30'],
    'activity_end_time' : ['09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30','09:05','12:00','12:30', '17:25','17:30',
                          '09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25','09:05','12:00','12:30', '17:25'],
    'activity_station' : ['None', 'Za','None','Ba','None','None', 'Za','None','Ba','None','None', 'Za','None','Ba','None',
                         'None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra','None','Za','Ba','Ra']
})

Drop the duplicate rows:

df = df.drop_duplicates()
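To see what this step removes, here is a minimal sketch on a small hypothetical frame (the `sample` data and variable names are made up for illustration and are not the question's full table). With no arguments, `drop_duplicates` compares all columns, so only rows that repeat in full are dropped.

```python
import pandas as pd

# Hypothetical three-row sample: the third row repeats the first exactly.
sample = pd.DataFrame({
    'Shift_id': [123, 123, 123],
    'activity_id': [1, 2, 1],
    'activity_station': ['None', 'Za', 'None'],
})

# duplicated() marks every repeat of an earlier row as True.
n_repeats = int(sample.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
unique_rows = sample.drop_duplicates().reset_index(drop=True)

print(n_repeats)
print(unique_rows)
```

If only some columns should define a duplicate, `drop_duplicates` also accepts a `subset` argument listing those columns.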

Use pandas.to_timedelta, appending ':00' so that each string has the form HH:MM:SS:

df['activity_begin_time'] = pd.to_timedelta(df['activity_begin_time']+':00')
df['activity_end_time'] = pd.to_timedelta(df['activity_end_time']+':00')
df['activity_duration'] = df['activity_end_time'] - df['activity_begin_time']
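The following sketch contrasts the two conversions on hypothetical sample values, which may help explain the unexpected date in the question. A datetime always carries a date, so a bare time gets one filled in (with an explicit `format='%H:%M'` it is 1900-01-01), whereas a timedelta holds only the elapsed time. The variable names here are assumptions for illustration.

```python
import pandas as pd

times = pd.Series(['09:00', '12:30'])  # hypothetical sample values

# A datetime needs a date; with an explicit format the missing part
# defaults to 1900-01-01.
as_datetime = pd.to_datetime(times, format='%H:%M')

# Formatting back to text recovers the original HH:MM display.
as_text = as_datetime.dt.strftime('%H:%M')

# A timedelta has no date component at all.
as_timedelta = pd.to_timedelta(times + ':00')
elapsed = as_timedelta.iloc[1] - as_timedelta.iloc[0]

# Dividing by a one-hour Timedelta gives the duration as a float in hours.
elapsed_hours = elapsed / pd.Timedelta(hours=1)

print(as_datetime)
print(as_text.tolist())
print(elapsed_hours)
```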

Then you can apply a specific aggregation to each column with groupby:

df.groupby('activity_station').agg({'activity_duration': 'sum'})

Which yields:

                   activity_duration
activity_station    
Ba                 05:25:00
None               00:45:00
Ra                 04:55:00
Za                 05:50:00
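The question's own attempt expressed durations in hours rather than as timedeltas. As a rough sketch of one way to get there, the summed timedeltas can be converted with `.dt.total_seconds()`. The small `durations` frame below is hypothetical sample data, not the full table.

```python
import pandas as pd

# Hypothetical, already de-duplicated durations per station.
durations = pd.DataFrame({
    'activity_station': ['Za', 'Ba', 'Za'],
    'activity_duration': pd.to_timedelta(['02:55:00', '04:55:00', '02:55:00']),
})

# Sum the timedeltas per station.
total = durations.groupby('activity_station')['activity_duration'].sum()

# Convert each summed timedelta to a float number of hours.
total_hours = total.dt.total_seconds() / 3600

print(total_hours)
```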