What are the top two C values for each A, based on the values in B?
import pandas as pd

df = pd.DataFrame({
'A': ["first","second","second","first",
"second","first","third","fourth",
"fifth","second","fifth","first",
"first","second","third","fourth","fifth"],
'B': [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
'C': ["a", "b", "c", "d",
"e", "f", "g", "h",
"i", "j", "k", "l",
"m", "n", "o", "p", "q"]})
I am trying:
x = df.groupby(['A'])['B'].nlargest(2)
A
fifth   16    7
        10    4
first   12    6
        11    5
fourth  15    7
        7     3
second  13    6
        9     4
third   14    6
        6     3
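But this drops column C, which holds the actual values I need.
I want C in the result, not the row index of the original df. Do I have to join? I would even settle for just a list of the C values...
For each A, I need to act on the top 2 C values (ranked by B).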
Answer 0: (score: 5)
IIUC:
In [42]: df.groupby(['A'])[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']))
Out[42]:
           B  C
A
fifth  16  7  q
       10  4  k
first  12  6  m
       11  5  l
fourth 15  7  p
       7   3  h
second 13  6  n
       9   4  j
third  14  6  o
       6   3  g
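If only the C values are needed, one possible alternative (not part of the answer above) is to reuse the row labels that the question's nlargest call already returns and look them up in df. This is a minimal sketch, assuming the DataFrame from the question; the names rows, top_c and c_by_a are introduced here purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ["first", "second", "second", "first", "second", "first",
          "third", "fourth", "fifth", "second", "fifth", "first",
          "first", "second", "third", "fourth", "fifth"],
    'B': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'C': list("abcdefghijklmnopq")})

# Series indexed by (A, original row label), as in the question
x = df.groupby("A")["B"].nlargest(2)

# The second index level holds the original row labels of df
rows = x.index.get_level_values(1)

# Look those rows up in df to recover C, then collect one list per A
top_c = df.loc[rows, ["A", "C"]]
c_by_a = top_c.groupby("A")["C"].agg(list)
print(c_by_a)
```

This avoids an explicit join, although it depends on the row labels of df being unique.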
Answer 1: (score: 0)
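I just ran into the same problem and used @MaxU's solution (and upvoted it). However, it is slow, because apply actually creates many new sub-DataFrames and then merges them back together. Here is another approach that combines sort_values with tail: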
df.sort_values(["A", "B"]).groupby("A").tail(2)
         A  B  C
10   fifth  4  k
16   fifth  7  q
11   first  5  l
12   first  6  m
7   fourth  3  h
15  fourth  7  p
9   second  4  j
13  second  6  n
6    third  3  g
14   third  6  o
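This solution returns the same rows, only in a different order, which I don't think matters for your example. If it does matter, you can add a few extra calls to get the same order (note that set_index replaces the original row index with A):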
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
        B  C
A
fifth   7  q
fifth   4  k
first   6  m
first   5  l
fourth  7  p
fourth  3  h
second  6  n
second  4  j
third   6  o
third   3  g
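The claim that both approaches pick the same rows can be checked directly. The following is a rough sketch, assuming the DataFrame from the question; via_apply, via_sort and same_rows are names made up for this check. When B contains ties inside a group, the two approaches may legitimately pick different rows, so the comparison is only meaningful for data like this example:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ["first", "second", "second", "first", "second", "first",
          "third", "fourth", "fifth", "second", "fifth", "first",
          "first", "second", "third", "fourth", "fifth"],
    'B': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    'C': list("abcdefghijklmnopq")})

# apply + nlargest: result is indexed by (A, original row label)
via_apply = df.groupby("A")[["B", "C"]].apply(
    lambda g: g.nlargest(2, columns=["B"]))

# sort_values + tail: result keeps the original row labels as its index
via_sort = df.sort_values(["A", "B"]).groupby("A").tail(2)

# Compare the selected row labels, ignoring order
apply_rows = set(via_apply.index.get_level_values(1))
sort_rows = set(via_sort.index)
same_rows = apply_rows == sort_rows
print(same_rows)
```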
Here are the benchmarks:
%%timeit
df.sort_values(["A", "B"]).groupby("A").tail(2)
1.9 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
2.4 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(['A'])[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']))
10.1 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The solution using sort_values is about 5 times faster. I expect the gap to grow on real (larger) datasets.
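To see how the gap behaves on larger data, the timings could be repeated on a synthetic DataFrame. This is only a sketch under stated assumptions: the sizes, the random data, and the helper names with_apply and with_sort are invented here, and the numbers will vary by machine and pandas version. B is built as a permutation so that it has no ties and both approaches select the same rows:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 200  # arbitrary sizes chosen for illustration

big = pd.DataFrame({
    "A": rng.integers(0, n_groups, size=n_rows),
    "B": rng.permutation(n_rows),  # unique values, so no ties in B
    "C": np.arange(n_rows),
})

def with_apply(frame):
    # apply + nlargest, as in the first answer
    return frame.groupby("A")[["B", "C"]].apply(
        lambda g: g.nlargest(2, columns=["B"]))

def with_sort(frame):
    # sort_values + tail, as in this answer
    return frame.sort_values(["A", "B"]).groupby("A").tail(2)

# Sanity check: both approaches select the same row labels
same_rows = (set(with_apply(big).index.get_level_values(1))
             == set(with_sort(big).index))

# Total time for 3 runs of each approach
t_apply = timeit.timeit(lambda: with_apply(big), number=3)
t_sort = timeit.timeit(lambda: with_sort(big), number=3)
print(f"apply: {t_apply:.3f}s  sort: {t_sort:.3f}s  same rows: {same_rows}")
```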