pandas: grouping and selecting a row based on a criterion

Date: 2016-03-15 18:59:57

Tags: python pandas

I'm looking for a sequence of pandas functions that takes input like this:

 id   label    rank
aab   quz         2
aaa   foo         1
aac   bar         4
aad   foo         4
aac   foo         2
aac   baz         3
aab   baz         3
aaa   bar         5

It should group the rows by id and select the record with the lowest rank in each group.

Output:

id   label    rank
aaa  foo         1
aab  quz         2
aac  foo         2
aad  foo         4

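Assume the input data is unordered.

2 answers:

Answer 0: (score: 2)

I think you can groupby id and apply the function idxmin to find the index of the row with the minimum value in column rank within each group. Then use loc to select those rows: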

print(df.groupby('id')['rank'].idxmin())
id
aaa    1
aab    0
aac    4
aad    3
Name: rank, dtype: int64

print(df.loc[df.groupby('id')['rank'].idxmin(),:])
    id label  rank
1  aaa   foo     1
0  aab   quz     2
4  aac   foo     2
3  aad   foo     4

Or, without the trailing column slice:

print(df.loc[df.groupby('id')['rank'].idxmin()])
    id label  rank
1  aaa   foo     1
0  aab   quz     2
4  aac   foo     2
3  aad   foo     4
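
For anyone who wants to run the snippets above end to end, here is a minimal self-contained sketch. The DataFrame construction is my own reconstruction of the question's input table (the variable names idx and result are illustrative, not from the original answer); the idxmin and loc calls are the ones used above.

```python
import pandas as pd

# Input table from the question, rebuilt by hand (default integer index 0..7).
df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

# For each id, idxmin returns the index label of the row with the smallest rank.
idx = df.groupby("id")["rank"].idxmin()

# loc then pulls those complete rows out of the original frame.
result = df.loc[idx]
print(result)
```

Note that idxmin returns index labels, so this relies on the index being unique. If the frame has duplicate index labels, calling reset_index(drop=True) first is one way to make the lookup unambiguous.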

Timings:

len(df) = 8

In [153]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.26 ms per loop

In [154]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
1000 loops, best of 3: 1.67 ms per loop

In [155]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
1000 loops, best of 3: 1.52 ms per loop

len(df) = 8k

In [157]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
100 loops, best of 3: 3.55 ms per loop

In [158]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
100 loops, best of 3: 2.24 ms per loop

In [159]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.12 ms per loop
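
The %timeit lines above need IPython. Outside IPython, a rough equivalent can be sketched with the standard timeit module. How the 8k-row frame was built is not stated above, so repeating the 8-row input 1000 times is purely an assumption here, and the names big, candidates and timings are hypothetical. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

# Assumption: an 8k-row frame made by repeating the sample data 1000 times.
big = pd.concat([df] * 1000, ignore_index=True)

candidates = {
    "sort_values + first": lambda: big.sort_values("rank").groupby("id").first().reset_index(),
    "idxmin + loc": lambda: big.loc[big.groupby("id")["rank"].idxmin()],
}

# Best of 3 repeats, 10 calls each, reported as time per call.
timings = {
    name: min(timeit.repeat(fn, number=10, repeat=3)) / 10
    for name, fn in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per loop")
```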

Answer 1: (score: 1)

The simplest approach is probably to sort by rank, group by id, and take the first element of each group:

> df.sort_values('rank').groupby('id').first().reset_index()

#     id label  rank
# 0  aaa   foo     1
# 1  aab   quz     2
# 2  aac   foo     2
# 3  aad   foo     4
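
A runnable sketch of the sort-then-take-first idea follows, using sort_values (the spelling used in the timings of the other answer). The DataFrame construction is reconstructed from the question. The drop_duplicates variant is an alternative I am adding for comparison, not part of the original answer: groupby(...).first() takes the first non-null value per column, so with missing values it can combine fields from different rows, whereas drop_duplicates always keeps whole rows.

```python
import pandas as pd

df = pd.DataFrame({
    "id":    ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank":  [2, 1, 4, 4, 2, 3, 3, 5],
})

# Sort so the lowest rank comes first, then take the first row per id.
by_first = df.sort_values("rank").groupby("id").first().reset_index()

# Alternative (my addition): keep the first occurrence of each id after sorting,
# then order by id so the result lines up with the groupby version.
by_dedup = (
    df.sort_values("rank")
      .drop_duplicates("id")
      .sort_values("id")
      .reset_index(drop=True)
)

print(by_first)
print(by_dedup)
```

On this input, which has no missing values, both variants select the same rows.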