This is a follow-up to Pandas: How to subset (and sum) top N observations within subcategories?, which shows how to find the sum of the top three months of each year in this dataframe:
Sample dataframe
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
.
.
.
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432
so that you end up with:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
5 1954 859
6 1955 1026
7 1956 1192
8 1957 1354
9 1958 1431
10 1959 1579
11 1960 1763
The number 432 for 1949 is the sum 148+148+136 for the months July, August and September.
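As a quick check of that arithmetic, here is a minimal sketch that rebuilds only the 1949 rows by hand. The first six values and the July to September figures come from this post; the October to December values (119, 104, 118) and the name df_1949 are assumptions filled in from memory of the public dataset.

```python
import pandas as pd

# 1949 only. January-June and July-September values appear in the post;
# October-December (119, 104, 118) are assumed, not taken from the post.
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
passengers = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]
df_1949 = pd.DataFrame({"year": 1949, "month": months, "passengers": passengers})

# Pick the three rows with the highest passenger counts and sum them.
top3 = df_1949.nlargest(3, "passengers")
total_1949 = int(top3["passengers"].sum())
top_months = top3["month"].tolist()
print(total_1949, top_months)
```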
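My question is this:
Is it possible to do the same calculation and, at the same time, keep the corresponding subcategories as a list in a column of their own?
Desired output
(I've only checked the actual sum for 1949. The months for 1950 are made up):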
year passengers months
0 1949 432 July, August, September
1 1950 498 August, September, December
2 1951 582 .
3 1952 690 .
4 1953 779 .
5 1954 859 .
6 1955 1026 .
7 1956 1192 .
8 1957 1354 .
9 1958 1431 .
10 1959 1579 .
11 1960 1763 .
Reproducible code and data:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())
df:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
df2:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
Thank you for any suggestions!
Answer 0 (score: 3)
Use a custom function with GroupBy.apply. The idea is to sort first with sort_values, then call head to get the top values in each group:
def f(x):
    # rows arrive sorted by passengers (descending), so head(3) is the top 3
    x = x.head(3)
    names = ['passengers','months']
    return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)
df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
year passengers months
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
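The same sort-then-head idea can also be written without a custom function, using GroupBy.head followed by a named aggregation. This is only a sketch under assumptions: df_demo is a small hand-typed subset (four months for each of 1949 and 1950, with the 1950 values recalled from the public dataset rather than taken from this post), and the names top3 and result are made up for illustration.

```python
import pandas as pd

# Hand-typed subset; the 1950 passenger values are assumed, not from the post.
df_demo = pd.DataFrame({
    "year": [1949] * 4 + [1950] * 4,
    "month": ["June", "July", "August", "September"] * 2,
    "passengers": [135, 148, 148, 136, 149, 170, 170, 158],
})

# Sort so the largest values come first, then keep the first 3 rows per year.
top3 = (
    df_demo.sort_values("passengers", ascending=False, kind="stable")
    .groupby("year")
    .head(3)
)

# Aggregate once per year: sum the passengers and join the month names.
result = (
    top3.groupby("year")
    .agg(
        passengers=("passengers", "sum"),
        months=("month", lambda s: ", ".join(s)),
    )
    .reset_index()
)
print(result)
```

Splitting the work into head and agg keeps each step a built-in groupby operation, which may be easier to read than returning a pd.Series from apply.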
Answer 1 (score: 1)
You can do:
In [69]: df.groupby('year').apply(lambda x:
x.nlargest(3, 'passengers').agg(
{'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
)).reset_index()
Out[69]:
year passengers month
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
5 1954 859 July, August, June
6 1955 1026 July, August, June
7 1956 1192 July, August, June
8 1957 1354 August, July, June
9 1958 1431 August, July, June
10 1959 1579 August, July, June
11 1960 1763 July, August, June
Answer 2 (score: 1)
Here is one solution using nlargest.
def largest(x, k):
    # top-k rows of the group by passengers
    vals = x.nlargest(n=k, columns=['passengers'])
    return [vals['passengers'].sum(), vals['month'].tolist()]
g = df.groupby('year').apply(largest, k=3).reset_index()
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])
res = g.drop(0, axis=1).join(joiner)
print(res)
year passengers months
0 1949 382 [March, April, May]
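I deliberately kept months as a list; you can convert it to a comma-separated string if needed. Note that the output above appears to come from the five-row sample in the question, where March, April and May (132 + 129 + 121 = 382) are the top three; on the full dataset 1949 gives 432.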
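For the conversion to a comma-separated string, one option is Series.str.join, which also works when the elements are lists of strings. A minimal sketch, assuming a frame shaped like res above (res_demo is a hypothetical stand-in built by hand):

```python
import pandas as pd

# Hypothetical stand-in for `res`, with months stored as a Python list.
res_demo = pd.DataFrame({
    "year": [1949],
    "passengers": [382],
    "months": [["March", "April", "May"]],
})

# Join each list into a single comma-separated string.
res_demo["months"] = res_demo["months"].str.join(", ")
print(res_demo)
```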
Answer 3 (score: 1)
Alternatively, group, apply pd.DataFrame.nlargest directly instead of a custom function/lambda, then regroup on the index and apply a suitable agg, e.g.:
new_df = (
    df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
    .groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
    # optionally reset index
    # .reset_index()
)
That will give you:
passengers month
year
1949 432 July, August, September
1950 498 July, August, September
1951 582 July, August, September
1952 690 August, July, June
1953 779 August, July, June
1954 859 July, August, June
...
Having year as the index seems to make sense in the resulting frame, but if not, apply .reset_index().