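pandas: select the rows with the maximum value of certain columns for each distinct value of another column

Time: 2018-06-14 19:59:57

Tags: python pandas

I have a dataframe in pandas like this: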

    id  some_type   some_date   some_data
0   1   A           19/12/1995  X
1   2   A           10/04/1997  Y
2   2   B           05/03/2013  Z
3   2   B           09/05/2017  W
4   2   B           09/05/2017  R
5   3   A           01/07/1998  M
6   3   B           09/08/2009  N

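For each value of id, I need the rows that have the maximum some_type and some_date, without dropping any value of some_data.

In other words, what I need is the following: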

    id  some_type   some_date   some_data
0   1   A           19/12/1995  X
3   2   B           09/05/2017  W
4   2   B           09/05/2017  R
6   3   B           09/08/2009  N

2 answers:

Answer 0: (score: 2)

You can use sort_values, groupby, and apply to keep the rows whose some_type and some_date match the last values in each group:

# sorting is only chronological once some_date is a datetime (the dates are day-first)
df['some_date'] = pd.to_datetime(df['some_date'], format='%d/%m/%Y')

df_output = (df.sort_values(by=['some_type','some_date']).groupby('id')
                .apply(lambda df_g: df_g[(df_g['some_type'] == df_g['some_type'].iloc[-1]) &
                                          (df_g['some_date'] == df_g['some_date'].iloc[-1])])
                  .reset_index(0,drop=True))

The output is:

   id some_type  some_date some_data
0   1         A 1995-12-19         X
3   2         B 2017-05-09         W
4   2         B 2017-05-09         R
6   3         B 2009-08-09         N

Edit: if you don't care about the index, you can also use merge:

# first get each id's last (some_type, some_date) pair after sorting
df_last = (df.sort_values(['some_type','some_date'])
             .groupby('id')[['some_type','some_date']].last()
             .reset_index())  # keep id as a column so it takes part in the merge
# now merge with inner to keep only the rows you want
df_output = df.merge(df_last, how='inner')

You get the same result, except for the index.
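For anyone who wants to run the merge variant end to end, here is a minimal sketch. The sample frame is rebuilt from the question; the explicit on= list and the name keys are my own additions for illustration, and I am assuming the dates are day-first.

```python
import pandas as pd

# Sample data copied from the question (dates are day/month/year).
df = pd.DataFrame({
    "id": [1, 2, 2, 2, 2, 3, 3],
    "some_type": ["A", "A", "B", "B", "B", "A", "B"],
    "some_date": ["19/12/1995", "10/04/1997", "05/03/2013", "09/05/2017",
                  "09/05/2017", "01/07/1998", "09/08/2009"],
    "some_data": ["X", "Y", "Z", "W", "R", "M", "N"],
})
df["some_date"] = pd.to_datetime(df["some_date"], format="%d/%m/%Y")

# Hypothetical helper name: the columns that identify "the last pair of an id".
keys = ["id", "some_type", "some_date"]

# One row per id: the last (some_type, some_date) pair after sorting.
df_last = (df.sort_values(["some_type", "some_date"])
             .groupby("id")[["some_type", "some_date"]]
             .last()
             .reset_index())

# The inner merge keeps every original row that matches its id's last pair,
# so ties such as W and R both survive. The original index is not preserved.
df_output = df.merge(df_last, on=keys, how="inner")
print(df_output)
```

Spelling out on= makes it visible that id is part of the match, so a pair that happens to be the last one for a different id cannot pull in extra rows.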

Answer 1: (score: 2)

Create a mask using max() and apply it. But first convert to datetime:

df['some_date'] = pd.to_datetime(df['some_date'], format='%d/%m/%Y')
m = df.groupby('id')[['some_type','some_date']].transform(lambda x: x == x.max()).all(axis=1)
df = df[m]

Full example:

import io

import pandas as pd

text = '''\
id  some_type   some_date   some_data
1   A           19/12/1995  X
2   A           10/04/1997  Y
2   B           05/03/2013  Z
2   B           09/05/2017  W
2   B           09/05/2017  R
3   A           01/07/1998  M
3   B           09/08/2009  N'''

fileobj = io.StringIO(text)
df = pd.read_csv(fileobj, sep=r'\s+')

# the dates are day-first, so state the format explicitly
df['some_date'] = pd.to_datetime(df['some_date'], format='%d/%m/%Y')

# True where a row holds its id's maximum in both columns
m = df.groupby('id')[['some_type','some_date']].transform(lambda x: x == x.max()).all(axis=1)

df = df[m]

print(df)

Returns:

   id some_type  some_date some_data
0   1         A 1995-12-19         X
3   2         B 2017-05-09         W
4   2         B 2017-05-09         R
6   3         B 2009-08-09         N
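One caveat worth checking against your own data: the mask compares each column with its own per-id maximum, while the sort-based answer keeps each id's last (some_type, some_date) pair. The two agree on the question's data, but they can differ when the largest some_type and the latest some_date sit on different rows. The sketch below uses a made-up two-row frame (not from the question) to show the difference; the names cols, mask_rows and sorted_rows are only illustrative.

```python
import pandas as pd

# Hypothetical data: the largest some_type (B) and the latest date
# are on different rows of the same id.
df = pd.DataFrame({
    "id": [1, 1],
    "some_type": ["A", "B"],
    "some_date": pd.to_datetime(["2020-01-01", "2019-01-01"]),
    "some_data": ["X", "Y"],
})

cols = ["some_type", "some_date"]

# Column-wise mask: each column is compared with its own group maximum.
m = df.groupby("id")[cols].transform(lambda x: x == x.max()).all(axis=1)
mask_rows = df[m]

# Sort-based: order by (some_type, some_date) and keep each id's last pair.
last_pair = df.sort_values(cols).groupby("id")[cols].last().reset_index()
sorted_rows = df.merge(last_pair, on=["id"] + cols, how="inner")

print(len(mask_rows))                     # 0: no row is maximal in both columns
print(sorted_rows["some_data"].tolist())  # ['Y']: the row with the largest some_type
```

If an id must never disappear from the result, the sort-based version is probably the safer choice; if a row should only count when it is the maximum in both columns at once, the mask expresses that directly.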
