Pandas groupby: aggregating multiple columns

Time: 2020-02-04 16:32:01

Tags: python-3.x aggregate pandas-groupby

Given a dataframe...

import pandas as pd
import numpy as np

# create some data
data = [['A1','234','1','12/17/2013','5/1/2014','1'],
        ['A1','234','2','2/13/2014','2/13/2014','1'],
        ['A2','532','1','8/4/2015','8/4/2015','2'],
        ['A3','123','1','4/6/2017','8/28/2017','3'],
        ['A4','754','1','4/11/2019','4/11/2019','4'],
        ['A5','754','1','9/20/2019','9/20/2019','5'],
        ['A5','754','2','9/20/2019','9/25/2019','5'],
        ['A5','754','3','9/24/2019','9/24/2019','5'],
        ['A5','754','4','9/25/2019','9/25/2019','5'],
        ['A5','754','5','9/25/2019','9/26/2019','5'],
        ['A5','754','6','9/26/2019','9/26/2019','5'],
        ['A5','754','7','9/27/2019','9/29/2019','5'],
        ['A5','754','8','9/29/2019','10/2/2019','5']]

# create dataframe
df = pd.DataFrame(data,columns=['MemberID','OrgID','RowID','StartDate','StopDate','Group'])

# format as datetime
df["StartDate"] = pd.to_datetime(df["StartDate"],errors ="coerce")
df["StopDate"] = pd.to_datetime(df["StopDate"],errors ="coerce")

I want to return a new dataframe by grouping the rows on Group and using agg to keep the earliest StartDate and the latest StopDate, which should produce...

MemberID    OrgID   RowID   StartDate   StopDate    Group
A1           234    1       12/17/2013  5/1/2014     1
A2           532    1       8/4/2015    8/4/2015     2
A3           123    1       4/6/2017    8/28/2017    3
A4           754    1       4/11/2019   4/11/2019    4
A5           754    1       9/20/2019   10/2/2019    5

After many attempts, the closest I have gotten is...

# groupby
gb = df.groupby(['Group'], as_index=False,group_keys=False)

# aggregate by min and max date
result_df = gb.agg({'StartDate': 'min',
                    'StopDate': 'max'})

However, the above drops all of the other columns:

  Group  StartDate     StopDate
     1   2013-12-17   2014-05-01
     2   2015-08-04   2015-08-04
     3   2017-04-06   2017-08-28
     4   2019-04-11   2019-04-11
     5   2019-09-20   2019-10-02

I know I can drop the date columns from the original dataframe and merge the result back on Group:

# copy old df and remove date columns
old_df = df.copy()
del old_df['StartDate']
del old_df['StopDate']

# remove duplicates
old_df.drop_duplicates(subset = ['Group'], keep = 'first', inplace = True)

# merge with agg result
final_df = pd.merge(result_df, old_df, on = "Group", how = "outer")

But this is clearly verbose and does not scale well. I will be running this on dataframes with many thousands of rows.
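One way the copy-and-merge step might be avoided is to give agg a function for every column, not only the two date columns. The sketch below is a minimal illustration on a reduced copy of the sample data, not a definitive solution. It assumes that taking the first value per group is the right choice for MemberID, OrgID and RowID, which matches the expected output above. The names cols and agg_spec are introduced here for illustration only.

```python
import pandas as pd

# Reduced copy of the sample data from the question (two groups only).
data = [['A1', '234', '1', '12/17/2013', '5/1/2014', '1'],
        ['A1', '234', '2', '2/13/2014', '2/13/2014', '1'],
        ['A5', '754', '1', '9/20/2019', '9/20/2019', '5'],
        ['A5', '754', '8', '9/29/2019', '10/2/2019', '5']]
cols = ['MemberID', 'OrgID', 'RowID', 'StartDate', 'StopDate', 'Group']
df = pd.DataFrame(data, columns=cols)
df['StartDate'] = pd.to_datetime(df['StartDate'], errors='coerce')
df['StopDate'] = pd.to_datetime(df['StopDate'], errors='coerce')

# Assumption: every non-key column keeps its first value per group,
# except the two date columns, which take the min and max respectively.
agg_spec = {col: 'first' for col in df.columns if col != 'Group'}
agg_spec.update({'StartDate': 'min', 'StopDate': 'max'})

# One groupby pass; the trailing [cols] restores the original column order.
result_df = (df.groupby('Group', as_index=False, sort=False)
               .agg(agg_spec)[cols])
print(result_df)
```

Note that the 'first' aggregation returns the first non-null value in each group, so a group whose leading row has a missing MemberID or OrgID would pick up the value from a later row. If the original row order matters, sorting by StartDate before grouping may be needed.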

0 answers:

No answers yet