Pandas DataFrame: group by A, take nlargest of B, output C

Time: 2017-05-03 14:22:00

Tags: python pandas dataframe

Based on the values in B, what are the top two C values for each A?

    import pandas as pd

    df = pd.DataFrame({
            'A': ["first","second","second","first",
                        "second","first","third","fourth",
                        "fifth","second","fifth","first",
                        "first","second","third","fourth","fifth"],
            'B': [1,1,1,2,2,3,3,3,3,4,4,5,6,6,6,7,7],
            'C': ["a", "b", "c", "d",
                     "e", "f", "g", "h",
                     "i", "j", "k", "l",
                     "m", "n", "o", "p", "q"]})

This is what I am trying:

    x = df.groupby(['A'])['B'].nlargest(2)

    A
    fifth   16    7
            10    4
    first   12    6
            11    5
    fourth  15    7
            7     3
    second  13    6
            9     4
    third   14    6
            6     3

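But this drops column C, which holds the actual values I need.

I want C in the result, not the row index of the original df. Do I have to join? I would even settle for just a list of the C values...

For each A, I need to act on the top 2 C values (ranked by B).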

2 Answers:

Answer 0: (score: 5)

IIUC:

In [42]: df.groupby(['A'])[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']))
Out[42]:
           B  C
A
fifth  16  7  q
       10  4  k
first  12  6  m
       11  5  l
fourth 15  7  p
       7   3  h
second 13  6  n
       9   4  j
third  14  6  o
       6   3  g
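On the question of whether a join is needed: a minimal sketch of one alternative, assuming the last index level of the `nlargest` result carries the original row labels (as the output in the question shows). The names `rows` and `top_c` are illustrative and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["first", "second", "second", "first",
          "second", "first", "third", "fourth",
          "fifth", "second", "fifth", "first",
          "first", "second", "third", "fourth", "fifth"],
    "B": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "C": list("abcdefghijklmnopq"),
})

# Top two B values per group; the index keeps the original row labels.
x = df.groupby("A")["B"].nlargest(2)

# Assumption: the last index level holds the row labels of df.
rows = x.index.get_level_values(-1)

# Look the rows up in the original frame instead of joining.
top_c = df.loc[rows, ["A", "B", "C"]]
print(top_c)
```

This selects the same ten rows as the `apply` version above, with A as an ordinary column rather than an index level.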

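Answer 1: (score: 0)

I just ran into the same problem and used @MaxU's solution (and upvoted it). However, it is slow, because apply creates many new sub-DataFrames and then merges them back together. Here is another approach that combines sort_values with tail: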

df.sort_values(["A", "B"]).groupby("A").tail(2)

    A       B   C
10  fifth   4   k
16  fifth   7   q
11  first   5   l
12  first   6   m
7   fourth  3   h
15  fourth  7   p
9   second  4   j
13  second  6   n
6   third   3   g
14  third   6   o

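This solution returns the same rows in a different order, which I don't think matters for your example. If the order does matter, a few extra calls give the same ordering, indexed by A: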

df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")

        B   C
A       
fifth   7   q
fifth   4   k
first   6   m
first   5   l
fourth  7   p
fourth  3   h
second  6   n
second  4   j
third   6   o
third   3   g
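Since the question asks to act on the top two C values per A, the sorted result can also be collapsed into one list per group. This is a minimal sketch using the question's data; `top2` and `c_by_a` are illustrative names, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["first", "second", "second", "first",
          "second", "first", "third", "fourth",
          "fifth", "second", "fifth", "first",
          "first", "second", "third", "fourth", "fifth"],
    "B": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7],
    "C": list("abcdefghijklmnopq"),
})

# Sort by group, then by B descending, and keep the first two rows per group.
top2 = (df.sort_values(["A", "B"], ascending=[True, False])
          .groupby("A")
          .head(2))

# Collapse the C values into one list per group, largest B first.
c_by_a = top2.groupby("A")["C"].apply(list)

for a, c_values in c_by_a.items():
    print(a, c_values)
```

Each entry of `c_by_a` can then be handed to whatever per-group processing is needed.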

Here are the benchmarks:

%%timeit
df.sort_values(["A", "B"]).groupby("A").tail(2)
1.9 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.sort_values(["A", "B"], ascending=[True, False]).groupby("A").head(2).set_index("A")
2.4 ms ± 62.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.groupby(['A'])[['B','C']].apply(lambda x: x.nlargest(2, columns=['B']))
10.1 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The sort_values solutions are roughly 4 to 5 times faster here. I expect the gap to widen on real (larger) datasets.
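To check that expectation on a larger frame, something like the following could be used. It is only a sketch: the synthetic data, the sizes `n_rows` and `n_groups`, and the helper names `with_apply` and `with_sort` are all assumptions made for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_groups = 20_000, 200  # hypothetical sizes, adjust as needed

big = pd.DataFrame({
    "A": rng.integers(0, n_groups, size=n_rows),
    "B": rng.permutation(n_rows),  # unique values, so there are no ties
    "C": np.arange(n_rows),
})

def with_apply(frame):
    # groupby + apply(nlargest), as in the first answer
    return frame.groupby("A")[["B", "C"]].apply(
        lambda g: g.nlargest(2, columns="B"))

def with_sort(frame):
    # sort_values + groupby + tail, as in this answer
    return frame.sort_values(["A", "B"]).groupby("A").tail(2)

# Both approaches should select the same rows (compared via the C column).
same_rows = set(with_apply(big)["C"]) == set(with_sort(big)["C"])
print("same rows:", same_rows)

for fn in (with_apply, with_sort):
    seconds = timeit.timeit(lambda: fn(big), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```

B is built from a permutation so that ties cannot make the two approaches pick different rows; with tied B values the selected rows may legitimately differ.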