I'm looking for a sequence of pandas functions that takes input like this:
id label rank
aab quz 2
aaa foo 1
aac bar 4
aad foo 4
aac foo 2
aac baz 3
aab baz 3
aaa bar 5
groups it by id, and selects the record with the lowest rank in each group. The output should look like this:
id label rank
aaa foo 1
aab quz 2
aac foo 2
aad foo 4
Assume the input data is unordered.
Answer 0 (score: 2)
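I think you can groupby the id column and apply idxmin to find the index of the row with the smallest rank value in each group. Then use loc to select those rows: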
print(df.groupby('id')['rank'].idxmin())
id
aaa 1
aab 0
aac 4
aad 3
Name: rank, dtype: int64
print(df.loc[df.groupby('id')['rank'].idxmin(), :])
id label rank
1 aaa foo 1
0 aab quz 2
4 aac foo 2
3 aad foo 4
Or, equivalently, without the column slice:
print(df.loc[df.groupby('id')['rank'].idxmin()])
id label rank
1 aaa foo 1
0 aab quz 2
4 aac foo 2
3 aad foo 4
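For anyone who wants to reproduce this, here is a minimal self-contained sketch that rebuilds the sample frame from the question and applies the idxmin approach. The names idx and result are only illustrative, and it assumes the frame has a default integer index with unique labels (with duplicated index labels, loc could return more rows than expected).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank": [2, 1, 4, 4, 2, 3, 3, 5],
})

# Index label of the minimum-rank row within each id group
idx = df.groupby("id")["rank"].idxmin()

# Select those rows; dropping the old index is optional
result = df.loc[idx].reset_index(drop=True)
print(result)
```

If the original row labels matter, skip the reset_index call and the selected rows keep their index values (1, 0, 4, 3 for this data).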
Timings:
len(df) = 8:
In [153]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.26 ms per loop
In [154]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
1000 loops, best of 3: 1.67 ms per loop
In [155]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
1000 loops, best of 3: 1.52 ms per loop
len(df) = 8k:
In [157]: %timeit df.sort_values('rank').groupby('id').first().reset_index()
100 loops, best of 3: 3.55 ms per loop
In [158]: %timeit df.loc[df.groupby('id')['rank'].idxmin(),:]
100 loops, best of 3: 2.24 ms per loop
In [159]: %timeit df.loc[df.groupby('id')['rank'].idxmin()]
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 2.12 ms per loop
Answer 1 (score: 1)
The simplest approach is probably to sort by rank, group by id, and select the first element of each group:
> df.sort_values('rank').groupby('id').first().reset_index()
# id label rank
# 0 aaa foo 1
# 1 aab quz 2
# 2 aac foo 2
# 3 aad foo 4
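A runnable sketch of the sort-then-first idea, assuming a recent pandas where DataFrame.sort has been replaced by sort_values. The second variant, using drop_duplicates, is not from the answer above; it is a swapped-in alternative that keeps whole rows, which may matter because groupby().first() takes the first non-null value per column and so could combine values from different rows when the data contains NaN. The names by_first and by_dedup are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": ["aab", "aaa", "aac", "aad", "aac", "aac", "aab", "aaa"],
    "label": ["quz", "foo", "bar", "foo", "foo", "baz", "baz", "bar"],
    "rank": [2, 1, 4, 4, 2, 3, 3, 5],
})

# Sort by rank, then take the first value per column in each id group
by_first = df.sort_values("rank").groupby("id").first().reset_index()

# Alternative (an assumption, not the answer's method): after sorting,
# keep only the first full row seen for each id
by_dedup = (
    df.sort_values("rank")
      .drop_duplicates("id")
      .sort_values("id")
      .reset_index(drop=True)
)

print(by_first)
print(by_dedup)
```

On this sample both variants give the same four rows; they can differ only when a group contains missing values.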