Pandas: How to group by category (and sum) and keep information about the subcategories

Time: 2018-06-29 08:21:25

Tags: python pandas

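This is a follow-up question to Pandas: How to subset (and sum) top N observations within subcategories?, which showed how to find, for each year in this dataframe, the sum of the three months with the most passengers:

Sample dataframe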

    year      month   passengers
0    1949    January         112
1    1949   February         118
2    1949      March         132
3    1949      April         129
4    1949        May         121
5    1949       June         135
.
.
.
137  1960       June         535
138  1960       July         622
139  1960     August         606
140  1960  September         508
141  1960    October         461
142  1960   November         390
143  1960   December         432

So that you end up with:

    year  passengers
0   1949         432
1   1950         498
2   1951         582
3   1952         690
4   1953         779
5   1954         859
6   1955        1026
7   1956        1192
8   1957        1354
9   1958        1431
10  1959        1579
11  1960        1763

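The number 432 for 1949 is the sum of 148+148+136 for the months July, August and September. My question is this:

Is it possible to do the same calculation and, at the same time, keep the corresponding subcategories as a list in a column of their own?

Desired output

(I only verified the actual sum for 1949. The months for 1950 are made up):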

        year  passengers  months
    0   1949         432  July, August, September 
    1   1950         498  August, September, December
    2   1951         582  .
    3   1952         690  .
    4   1953         779  .
    5   1954         859  .
    6   1955        1026  .
    7   1956        1192  .
    8   1957        1354  .
    9   1958        1431  .
    10  1959        1579  .
    11  1960        1763  .

Reproducible code and data:

import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())

df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())

df:

   year     month  passengers
0  1949   January         112
1  1949  February         118
2  1949     March         132
3  1949     April         129
4  1949       May         121

df2:

   year  passengers
0  1949         432
1  1950         498
2  1951         582
3  1952         690
4  1953         779
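
As a quick check on the 432 figure for 1949, here is a small sketch that does not need seaborn. The first six values are the ones in the sample above; the remaining six are filled in from the flights data as I understand it, so treat them as an assumption:

```python
import pandas as pd

# Monthly passenger counts for 1949, January..December.
# January-June appear in the sample above; July-December are assumed values.
passengers_1949 = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=["January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December"],
)

# nlargest keeps the three highest values together with their month labels.
top3 = passengers_1949.nlargest(3)
print(sorted(top3.index))
print(top3.sum())  # 148 + 148 + 136 = 432
```

The month labels travel with the values, which is exactly the information that gets lost once only `.sum()` is returned.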

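Thanks for any suggestions!

4 answers:

Answer 0: (score: 3)

Use a custom function with GroupBy.apply. The idea is to first sort by passengers in descending order with sort_values, then call head inside the function to get the top rows of each group: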

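def f(x):
    # rows arrive sorted by passengers (descending), so head(3) is the top 3
    x = x.head(3)
    names = ['passengers','months']
    # total passengers plus the matching month names joined into one string
    return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)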

df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
   year  passengers                   months
0  1949         432  July, August, September
1  1950         498  July, August, September
2  1951         582  July, August, September
3  1952         690       August, July, June
4  1953         779       August, July, June
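
A variant of the same sort-then-head idea without a custom function, in case it helps. This is only a sketch: it swaps in GroupBy.head plus named aggregation (available in pandas 0.25 and later), and the `toy` frame is made-up data, not the flights dataset:

```python
import pandas as pd

# Hypothetical toy data: two years with four months each.
toy = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "month": ["January", "February", "March", "April"] * 2,
    "passengers": [10, 40, 30, 20, 5, 15, 25, 35],
})

# Sort so the largest values come first, then keep the first 3 rows per year.
top = (
    toy.sort_values("passengers", ascending=False)
       .groupby("year")
       .head(3)
)

# Aggregate the kept rows: sum the passengers, join the month names.
result = (
    top.groupby("year")
       .agg(passengers=("passengers", "sum"),
            months=("month", ", ".join))
       .reset_index()
)
print(result)
```

Because the summing is done by a built-in aggregation rather than inside a mixed-type Series, the `passengers` column keeps an integer dtype here.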

Answer 1: (score: 1)

You can do:

In [69]: df.groupby('year').apply(lambda x: 
           x.nlargest(3, 'passengers').agg(
              {'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
             )).reset_index()
Out[69]:
    year  passengers                    month
0   1949         432  July, August, September
1   1950         498  July, August, September
2   1951         582  July, August, September
3   1952         690       August, July, June
4   1953         779       August, July, June
5   1954         859       July, August, June
6   1955        1026       July, August, June
7   1956        1192       July, August, June
8   1957        1354       August, July, June
9   1958        1431       August, July, June
10  1959        1579       August, July, June
11  1960        1763       July, August, June
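
The months above come out in descending passenger order (for example August, July, June) because nlargest sorts by value. If calendar order is preferred, one option is to restore the original row order with sort_index before joining. A minimal sketch, assuming the rows of each year are already in calendar order (as they are in the flights data); the `toy` frame and the helper name `top_n_calendar` are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, rows already in calendar order within each year.
toy = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "month": ["June", "July", "August", "September"] * 2,
    "passengers": [30, 40, 50, 10, 5, 60, 20, 70],
})

def top_n_calendar(group, n=3):
    # Pick the n rows with the most passengers, then put them back
    # in their original (calendar) row order.
    top = group.nlargest(n, "passengers").sort_index()
    return pd.Series({
        "passengers": top["passengers"].sum(),
        "months": ", ".join(top["month"]),
    })

result = (
    toy.groupby("year")[["month", "passengers"]]
       .apply(top_n_calendar)
       .reset_index()
)
print(result)
```

Selecting `[["month", "passengers"]]` before apply keeps the grouping column out of the function's input, which avoids the warning that newer pandas versions raise about that.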

Answer 2: (score: 1)

Here is one solution using nlargest.

def largest(x, k):
    vals = x.nlargest(n=k, columns=['passengers'])
    return [vals['passengers'].sum(), vals['month'].tolist()]

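# each group returns [sum, list of months]; after reset_index it sits in a column named 0
g = df.groupby('year').apply(largest, k=3).reset_index()
# expand that column of two-element lists into two proper columns
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])

# drop the packed column and attach the expanded ones
res = g.drop(0, axis=1).join(joiner)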

print(res)

   year  passengers               months
0  1949         382  [March, April, May]

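Note that the output above appears to come from only the first five rows of the sample data (132 + 129 + 121 = 382); on the full dataset the 1949 row should be 432 with July, August, September, as in the other answers.

I deliberately kept months as a list; you can convert it to a comma-separated string if needed.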
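
Converting the list column to a comma-separated string could look like this. A small sketch only: the `res` frame below is rebuilt by hand with the 1949 figures from the question so that it runs on its own:

```python
import pandas as pd

# Hand-built stand-in for the result frame, with months stored as lists.
res = pd.DataFrame({
    "year": [1949],
    "passengers": [432],
    "months": [["July", "August", "September"]],
})

# Join each list of month names into a single string.
res["months"] = res["months"].apply(", ".join)
print(res)
```

Keeping the list form until the end makes it easier to filter or count months first, and the join is a one-liner when a display string is needed.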

Answer 3: (score: 1)

Alternatively: group, then apply pd.DataFrame.nlargest directly instead of a custom function or lambda, then regroup on the index and apply a suitable agg, e.g.:

new_df = (
    df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
    .groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
    # optionally reset index
    # .reset_index()
)

That will give you:

      passengers                    month
year                                     
1949         432  July, August, September
1950         498  July, August, September
1951         582  July, August, September
1952         690       August, July, June
1953         779       August, July, June
1954         859       July, August, June
...

It seems that year makes sense as the index of the resulting frame, but if not, apply .reset_index().
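
To make the difference between the two result shapes concrete, here is a minimal sketch of the same two-step grouping on a made-up `toy` frame (not the flights data). It uses a lambda and an explicit column selection rather than passing pd.DataFrame.nlargest directly, which is my own substitution to keep it quiet on recent pandas versions:

```python
import pandas as pd

# Hypothetical toy data: two years with four months each.
toy = pd.DataFrame({
    "year": [2000] * 4 + [2001] * 4,
    "month": ["January", "February", "March", "April"] * 2,
    "passengers": [10, 40, 30, 20, 5, 15, 25, 35],
})

by_year = (
    toy.groupby("year")[["month", "passengers"]]
       .apply(lambda g: g.nlargest(3, "passengers"))  # index: (year, original row)
       .groupby(level=0)                              # regroup on the year level
       .agg({"passengers": "sum", "month": ", ".join})
)
print(by_year)   # year is the index

flat = by_year.reset_index()
print(flat)      # year is an ordinary column again
```

With year as the index, lookups such as `by_year.loc[2000]` work directly; the flat form is more convenient for merging or exporting.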