python每组和每组内的累计运行百分比按降序排列

时间:2018-08-10 04:17:10

标签: python python-3.x pandas

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,'id': 
       [1,2,3,4,5,6]*2 ,'sales': [np.random.randint(100000, 999999) for _ in 
        range(12)]})

这是df的输出:

    id    sales    state
0    1    847754    CA
1    2    362532    WA
2    3    615849    CO
3    4    376480    AZ
4    5    381286    CA
5    6    411001    WA
6    1    946795    CO
7    2    857435    AZ
8    3    928087    CA
9    4    675593    WA
10   5    371339    CO
11   6    440285    AZ

我无法按降序计算每个组的累积百分比。我想要这样的输出:

    id    sales   state  cumsum      run_pct
0    2    857435    AZ    857435     0.5121460996296738
1    6    440285    AZ    1297720    0.7751284195436626
2    4    376480    AZ    1674200    1.0
3    3    928087    CA    928087     0.43024216932985404
4    1    847754    CA    1775841    0.8232436013271356
5    5    381286    CA    2157127    1.0
6    1    946795    CO    946795     0.48955704367618535
7    3    615849    CO    1562644    0.807992624547372
8    5    371339    CO    1933983    1.0
9    4    675593    WA    675593     0.46620721731581655
10   6    411001    WA    1086594    0.7498271371847582
11   2    362532    WA    1449126    1.0

1 个答案:

答案 0 :(得分:1)

一个可能的解决方案是首先对数据进行排序,计算总和,然后最后计算百分比。 按升序排列和销售额递减排序:

df = df.sort_values(['state', 'sales'], ascending=[True, False])

计算总和:

df['cumsum'] = df.groupby('state')['sales'].cumsum()

和百分比:

df['run_pct'] = df.groupby('state')['sales'].apply(lambda x: (x/x.sum()).cumsum())

这将给出:

    id  sales   state   cumsum  run_pct
0   4   846079  AZ  846079  0.608566
1   2   312708  AZ  1158787 0.833491
2   6   231495  AZ  1390282 1.000000
3   3   790291  CA  790291  0.506795
4   1   554631  CA  1344922 0.862467
5   5   214467  CA  1559389 1.000000
6   1   983878  CO  983878  0.388139
7   5   779497  CO  1763375 0.695650
8   3   771486  CO  2534861 1.000000
9   6   794407  WA  794407  0.420899
10  2   587843  WA  1382250 0.732355
11  4   505155  WA  1887405 1.000000