I am trying to create a cumulative table from a DataFrame. I compute the sum per category and then the cumsum with the following code:
g=df.groupby(['Category'])['VALUE'].agg(['sum']).sort_values(by="sum", ascending=False)
index = g.index.values
cum_sum = g['sum'].cumsum()
data = g['sum']
g['cum_sum'] = g['sum'].cumsum()
cum_max = g['cum_sum'].max()
But I want to create a new DataFrame like this:
Category  Sum  CumSum  Percentage  CumPercentage
A         5    5       1           1
B         4    9       2           3
I also tried to get the cumsum through aggregation, but without success:
g=df.groupby(['Category'])['VALUE'].agg(['sum', 'cumsum']).sort_values(by="sum", ascending=False)
The code above does not work for me. Here is a sample of the data:
Category VALUE ST_CITY ST_STATE ST_POSTAL ST_COUNTRY
Category1 72 CHARLESTON SC 29403 USA
Category1 36 BROOKLYN NY 11235 USA
Category2 9 FAIRFIELD CT 6824 USA
Category2 9 FAIRFIELD CT 6824 USA
Category3 12 CHARLESTON SC 29412 USA
Category1 30 WALLINGFORD PA 19086 USA
Category2 9 GRAND RAPIDS MI 49506 USA
Category3 12 ALAMEDA CA 94502 USA
Category2 9 DANVILLE CA 94526 USA
Category2 9 LONGWOOD FL 32779 USA
Category3 12 NEW YORK NY 10022 USA
Category1 36 ALAMOGORDO NM 88310 USA
Category3 12 BRECKSVILLE OH 44141 USA
Category3 12 CHICAGO IL 60657 USA
Category2 9 BIRMINGHAM AL 35242 USA
Category2 9 BIRMINGHAM AL 35242 USA
Category2 9 GREENWOOD VILLAGE CO 80121 USA
Category1 36 GREENWOOD VILLAGE CO 80121 USA
Category1 36 CHICAGO IL 60615 USA
Category2 9 MANASSAS VA 20112 USA
Category2 9 OTIS ORCHARDS WA 99027 USA
Category2 9 ARNOLD MD 21012 USA
Category2 9 MOUNTAIN VIEW CA 94043 USA
Category1 36 DEL MAR CA 92014 USA
Category1 36 NEW YORK NY 10023 USA
Category1 36 NEW YORK NY 10128 USA
Category3 12 OJAI CA 93023 USA
Category2 9 BROOKLYN NY 11201 USA
Category2 9 ST. CHARLES MO 63301 USA
Answer 0 (score: 1)
It seems you need:
df = df.groupby('Category')['VALUE'].sum().reset_index(name='Sum')
df['cum_sum'] = df['Sum'].cumsum()
df['Percentage'] = df['Sum'] / df['Sum'].sum() * 100
print (df)
Category Sum cum_sum Percentage
0 Category1 354 354 64.130435
1 Category2 126 480 22.826087
2 Category3 72 552 13.043478
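The question also asks for a descending sort and a CumPercentage column, which the snippet above does not produce. A minimal sketch of the full table follows, assuming a small made-up sample in place of the real data; the name `summary` and the sample rows are illustrative only.

```python
import pandas as pd

# Hypothetical sample; the real DataFrame has more rows and columns.
df = pd.DataFrame({
    "Category": ["Category1", "Category2", "Category3", "Category1", "Category2"],
    "VALUE": [72, 9, 12, 36, 9],
})

# One row per category, largest total first.
summary = (
    df.groupby("Category")["VALUE"]
      .sum()
      .sort_values(ascending=False)
      .reset_index(name="Sum")
)

# Running total and shares are computed on the already-sorted totals.
summary["CumSum"] = summary["Sum"].cumsum()
summary["Percentage"] = summary["Sum"] / summary["Sum"].sum() * 100
summary["CumPercentage"] = summary["Percentage"].cumsum()

print(summary)
```

Sorting before calling cumsum matters here, because the running total follows row order.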
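But it is not possible to aggregate with sum and cumsum together, because sum returns one value per group while cumsum returns one value per original row, so the two results do not line up: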
print (df.groupby('Category')['VALUE'].agg(['sum','cumsum']))
sum cumsum
0 NaN 72.0
1 NaN 108.0
2 NaN 9.0
3 NaN 18.0
4 NaN 12.0
5 NaN 138.0
6 NaN 27.0
7 NaN 24.0
8 NaN 36.0
9 NaN 45.0
10 NaN 36.0
11 NaN 174.0
12 NaN 48.0
13 NaN 60.0
14 NaN 54.0
15 NaN 63.0
16 NaN 72.0
17 NaN 210.0
18 NaN 246.0
19 NaN 81.0
20 NaN 90.0
21 NaN 99.0
22 NaN 108.0
23 NaN 282.0
24 NaN 318.0
25 NaN 354.0
26 NaN 72.0
27 NaN 117.0
28 NaN 126.0
Category1 354.0 NaN
Category2 126.0 NaN
Category3 72.0 NaN
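The mismatch is easier to see when the two operations are run separately. The sketch below uses a small made-up sample (not the data from the question) and only compares the shapes and indexes of the two results.

```python
import pandas as pd

# Hypothetical sample; the real DataFrame has more rows and columns.
df = pd.DataFrame({
    "Category": ["Category1", "Category2", "Category3", "Category1", "Category2"],
    "VALUE": [72, 9, 12, 36, 9],
})

grouped = df.groupby("Category")["VALUE"]

totals = grouped.sum()      # reduction: one value per category, indexed by Category
running = grouped.cumsum()  # row-wise: one value per original row, indexed like df

print(totals)
print(running)
print(len(totals), len(running))
```

Since `totals` is indexed by category and `running` by the original row labels, putting them side by side leaves gaps wherever one index has no counterpart in the other.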
But if you really need both, use transform to return the aggregated values as a new column in the original df:
df['cumsum'] = df.groupby('Category')['VALUE'].cumsum()
df['sum'] = df.groupby('Category')['VALUE'].transform('sum')
print (df[['cumsum','sum']])
cumsum sum
0 72 354
1 108 354
2 9 126
3 18 126
4 12 72
5 138 354
6 27 126
7 24 72
8 36 126
9 45 126
10 36 72
11 174 354
12 48 72
13 60 72
14 54 126
15 63 126
16 72 126
17 210 354
18 246 354
19 81 126
20 90 126
21 99 126
22 108 126
23 282 354
24 318 354
25 354 354
26 72 72
27 117 126
28 126 126
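With both columns aligned to the original rows, a per-row cumulative percentage within each category can be derived by dividing one by the other. This is only a sketch on a made-up sample; the column name `cum_pct_in_group` is an assumption, not something from the question.

```python
import pandas as pd

# Hypothetical sample; the real DataFrame has more rows and columns.
df = pd.DataFrame({
    "Category": ["Category1", "Category2", "Category3", "Category1", "Category2"],
    "VALUE": [72, 9, 12, 36, 9],
})

grouped = df.groupby("Category")["VALUE"]
df["cumsum"] = grouped.cumsum()          # running total within each category
df["sum"] = grouped.transform("sum")     # category total repeated on every row

# Share of the category total reached at each row (illustrative column name).
df["cum_pct_in_group"] = df["cumsum"] / df["sum"] * 100

print(df)
```

The last row of each category reaches 100, since the running total equals the category total there.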