Sum, Cumsum, Percentage, and Cumulative Percentage from a Pandas DataFrame

Date: 2017-11-23 11:42:03

Tags: python pandas

I am trying to build a cumulative table from a DataFrame. I can get the sum per category, and then the cumsum, with the following code.

g=df.groupby(['Category'])['VALUE'].agg(['sum']).sort_values(by="sum", ascending=False)
index = g.index.values
cum_sum = g['sum'].cumsum()
data = g['sum']
g['cum_sum'] = g['sum'].cumsum()
cum_max = g['cum_sum'].max()

But I want to create a new DataFrame like this:

Category  Sum  CumSum  Percentage  CumPercentage
A           5       5           1              1
B           4       9           2              3

I also tried to get the cumsum through the aggregation itself, without success:

g=df.groupby(['Category'])['VALUE'].agg(['sum', 'cumsum']).sort_values(by="sum", ascending=False)

The code above does not work for me.

Category    VALUE   ST_CITY ST_STATE    ST_POSTAL   ST_COUNTRY
Category1   72  CHARLESTON  SC  29403   USA
Category1   36  BROOKLYN    NY  11235   USA
Category2   9   FAIRFIELD   CT  6824    USA
Category2   9   FAIRFIELD   CT  6824    USA
Category3   12  CHARLESTON  SC  29412   USA
Category1   30  WALLINGFORD PA  19086   USA
Category2   9   GRAND RAPIDS    MI  49506   USA
Category3   12  ALAMEDA CA  94502   USA
Category2   9   DANVILLE    CA  94526   USA
Category2   9   LONGWOOD    FL  32779   USA
Category3   12  NEW YORK    NY  10022   USA
Category1   36  ALAMOGORDO  NM  88310   USA
Category3   12  BRECKSVILLE OH  44141   USA
Category3   12  CHICAGO IL  60657   USA
Category2   9   BIRMINGHAM  AL  35242   USA
Category2   9   BIRMINGHAM  AL  35242   USA
Category2   9   GREENWOOD VILLAGE   CO  80121   USA
Category1   36  GREENWOOD VILLAGE   CO  80121   USA
Category1   36  CHICAGO IL  60615   USA
Category2   9   MANASSAS    VA  20112   USA
Category2   9   OTIS ORCHARDS   WA  99027   USA
Category2   9   ARNOLD  MD  21012   USA
Category2   9   MOUNTAIN VIEW   CA  94043   USA
Category1   36  DEL MAR CA  92014   USA
Category1   36  NEW YORK    NY  10023   USA
Category1   36  NEW YORK    NY  10128   USA
Category3   12  OJAI    CA  93023   USA
Category2   9   BROOKLYN    NY  11201   USA
Category2   9   ST. CHARLES MO  63301   USA

1 Answer:

Answer 0 (score: 1)

It seems you need:

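# Total VALUE per category, returned as a DataFrame with a 'Sum' column
df = df.groupby('Category')['VALUE'].sum().reset_index(name='Sum')
# Running total, and each category's share of the grand total
df['cum_sum'] = df['Sum'].cumsum()
df['Percentage'] = df['Sum'] / df['Sum'].sum() * 100
print(df)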
    Category  Sum  cum_sum  Percentage
0  Category1  354      354   64.130435
1  Category2  126      480   22.826087
2  Category3   72      552   13.043478
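The table requested in the question also sorts by the sum in descending order and includes a cumulative percentage. A minimal sketch of one way to get all four columns is shown here. The small `df` is made-up illustration data (only the column names `Category` and `VALUE` come from the question), and `summary` and `total` are hypothetical names.

```python
import pandas as pd

# Made-up sample data; only the column names match the question.
df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A"],
    "VALUE": [3, 4, 2, 1, 0, 5],
})

# One row per category, largest sum first.
summary = (
    df.groupby("Category")["VALUE"]
      .sum()
      .sort_values(ascending=False)
      .reset_index(name="Sum")
)

total = summary["Sum"].sum()

# Running total down the sorted rows, then both columns as a share of the total.
summary["CumSum"] = summary["Sum"].cumsum()
summary["Percentage"] = summary["Sum"] / total * 100
summary["CumPercentage"] = summary["CumSum"] / total * 100

print(summary)
```

Sorting before calling `cumsum` matters, because the running total follows row order. With the largest category first, the last row of `CumPercentage` reaches 100.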

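However, `sum` and `cumsum` cannot be combined in a single `agg` call: `sum` returns one value per group, while `cumsum` returns one value per original row, so the result is misaligned: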

print (df.groupby('Category')['VALUE'].agg(['sum','cumsum']))

             sum  cumsum
0            NaN    72.0
1            NaN   108.0
2            NaN     9.0
3            NaN    18.0
4            NaN    12.0
5            NaN   138.0
6            NaN    27.0
7            NaN    24.0
8            NaN    36.0
9            NaN    45.0
10           NaN    36.0
11           NaN   174.0
12           NaN    48.0
13           NaN    60.0
14           NaN    54.0
15           NaN    63.0
16           NaN    72.0
17           NaN   210.0
18           NaN   246.0
19           NaN    81.0
20           NaN    90.0
21           NaN    99.0
22           NaN   108.0
23           NaN   282.0
24           NaN   318.0
25           NaN   354.0
26           NaN    72.0
27           NaN   117.0
28           NaN   126.0
Category1  354.0     NaN
Category2  126.0     NaN
Category3   72.0     NaN
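The misalignment comes from the two operations returning results of different shapes. A small sketch on made-up data (the names `grouped`, `group_sum`, and `running` are hypothetical) makes the difference visible:

```python
import pandas as pd

# Made-up sample data; only the column names match the question.
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "VALUE": [1, 2, 3, 4],
})

grouped = df.groupby("Category")["VALUE"]

# Reduction: one value per group, indexed by the Category labels.
group_sum = grouped.sum()

# Transformation: one value per input row, indexed like the original df.
running = grouped.cumsum()

print(group_sum)
print(running)
```

Because the two results have different indexes (category labels versus row positions), placing them side by side leaves `NaN` wherever one index has no match in the other.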

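But if you really need both, use `transform` so that the aggregated values are returned as new columns in the original df (the row-level DataFrame, not the aggregated one assigned to `df` above):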

df['cumsum'] = df.groupby('Category')['VALUE'].cumsum()
df['sum'] = df.groupby('Category')['VALUE'].transform('sum')
print (df[['cumsum','sum']])
    cumsum  sum
0       72  354
1      108  354
2        9  126
3       18  126
4       12   72
5      138  354
6       27  126
7       24   72
8       36  126
9       45  126
10      36   72
11     174  354
12      48   72
13      60   72
14      54  126
15      63  126
16      72  126
17     210  354
18     246  354
19      81  126
20      90  126
21      99  126
22     108  126
23     282  354
24     318  354
25     354  354
26      72   72
27     117  126
28     126  126
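With both columns on the row-level frame, a running percentage within each category follows by dividing one by the other. This is a sketch on made-up data, not part of the original answer, and the `cum_pct` column name is an assumption.

```python
import pandas as pd

# Made-up sample data; only the column names match the question.
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B"],
    "VALUE": [1, 2, 3, 6],
})

grouped = df.groupby("Category")["VALUE"]

# Compute both row-aligned results before adding any columns to df.
running = grouped.cumsum()          # running total within each category
totals = grouped.transform("sum")   # category total repeated on every row

df["cumsum"] = running
df["sum"] = totals

# Share of the category total reached at each row (hypothetical column name).
df["cum_pct"] = df["cumsum"] / df["sum"] * 100

print(df)
```

Both `cumsum` and `transform('sum')` keep the original row index, which is why they can be assigned straight back to `df` without a merge.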