
Question

I have an unordered DataFrame:

df
     A   B  Moves
0   E1  E2     10
1   E1  E3     20
2   E1  E4     15
3   E2  E1      9
4   E2  E3      8
5   E2  E4      7
6   E3  E1     30
7   E3  E2     32
8   E3  E4     40
9   E4  E1      5
10  E4  E2     20
11  E4  E3      3
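
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above. The values are copied from the table; the construction itself is only one of several ways to do it.

```python
import pandas as pd

# Rebuild the example frame; values are copied from the table in the question.
df = pd.DataFrame({
    "A": ["E1"] * 3 + ["E2"] * 3 + ["E3"] * 3 + ["E4"] * 3,
    "B": ["E2", "E3", "E4", "E1", "E3", "E4", "E1", "E2", "E4", "E1", "E2", "E3"],
    "Moves": [10, 20, 15, 9, 8, 7, 30, 32, 40, 5, 20, 3],
})
print(df)
```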

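For each group of A, I want to return rows, taking the highest Moves first, until their cumulative sum reaches a minimum percentage of that group's total Moves.

Once the % threshold has been reached, I stop adding rows to the cumulative sum. The procedure must be "greedy": if a row takes the cumulative sum over the required %, that row is included.

If the minimum percentage of the total is 50%, then I would first like to return:

Desired output

A   B   Moves
E1  E3     20
E1  E4     15
E2  E1      9
E2  E3      8
E3  E4     40
E3  E2     32
E4  E2     20

Then I want to extract the row names for each group, using df.groupby(...).apply(list) from this question:

A     Most_Moved
E1      [E3, E4]
E2      [E1, E3]
E3      [E4, E2]
E4          [E2]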

Things I've tried:

Using cumsum, as in this question and this question, I can return the ordered cumulative totals of Moves:

df.groupby(by=['A','B']).sum().groupby(level=[0]).cumsum()[::-1]

       Moves
A  B        
E4 E3     28
   E2     25
   E1      5
E3 E4    102
   E2     62
   E1     30
E2 E4     24
   E3     17
   E1      9
E1 E4     45
   E3     30
   E2     10

I can also return the total moves (sum) for each group:

df.groupby(by="A").sum()

    Moves
A        
E1     45
E2     24
E3    102
E4     28

From this question and this question, I can return each row as a percentage of the total for its category:

df.groupby(by=["A"])["Moves"].apply(lambda x: 100 * x / float(x.sum()))

0     22.222222
1     44.444444
2     33.333333
3     37.500000
4     33.333333
5     29.166667
6     29.411765
7     31.372549
8     39.215686
9     17.857143
10    71.428571
11    10.714286

What doesn't work

However, if I combine these, the percentage is evaluated over all of the rows:

df.groupby(by=["A", "B"])["Moves"].agg({"Total_Moves":sum}).sort_values("Total_Moves", ascending=False).apply(lambda x: 100 * x / float(x.sum()))

       Total_Moves
A  B              
E3 E4    20.100503
   E2    16.080402
   E1    15.075377
E1 E3    10.050251
E4 E2    10.050251
E1 E4     7.537688
   E2     5.025126
E2 E1     4.522613
   E3     4.020101
   E4     3.517588
E4 E1     2.512563
   E3     1.507538

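This evaluates the percentage across the whole DataFrame rather than within each individual group.

I can't figure out how to piece it all together to get my desired output.

Any help is appreciated.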

Answer 1

You can use groupby.apply with a custom function:

def select(group, pct=50):
    # print(group)
    # Sort this group's moves from largest to smallest
    moves = group['Moves'].sort_values(ascending=False)
    cumsum = moves.cumsum() / moves.sum()
    # print(cumsum)
    # `cumsum` is the cumulative share of the group total, largest moves first
    # (dividing by 100.0 avoids integer division on Python 2)
    n_rows = len(cumsum[cumsum < pct / 100.0]) + 1
    # print(n_rows)
    # `n_rows` counts the rows still below `pct`, plus the one row that reaches or crosses it
    idx = moves.index[:n_rows]
    # print(idx)
    # `idx` holds the index labels of the rows needed to reach a cumulative share of `pct`
    # print(group.loc[idx])
    # Return a Series of Moves for the selected rows, indexed by column `B`
    return group.loc[idx].set_index(['B'], drop=True)['Moves']

df.groupby('A').apply(select)
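
As a possible alternative to apply, here is a sketch that makes the same greedy selection with grouped cumsum and transform, then collects the Most_Moved lists the question asks for. The helper name `most_moved` and the variable `share_before` are hypothetical, and the frame is rebuilt from the question's table so the block runs on its own.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "A": ["E1"] * 3 + ["E2"] * 3 + ["E3"] * 3 + ["E4"] * 3,
    "B": ["E2", "E3", "E4", "E1", "E3", "E4", "E1", "E2", "E4", "E1", "E2", "E3"],
    "Moves": [10, 20, 15, 9, 8, 7, 30, 32, 40, 5, 20, 3],
})


def most_moved(frame, pct=50):
    """Hypothetical helper: per group of A, list the B values of the largest
    moves needed to reach at least `pct` percent of the group's total."""
    # Largest Moves first within each A.
    ordered = frame.sort_values(["A", "Moves"], ascending=[True, False])
    grouped = ordered.groupby("A")["Moves"]
    # Share of the group total accumulated *before* each row.
    share_before = (grouped.cumsum() - ordered["Moves"]) / grouped.transform("sum")
    # A row is kept while the threshold was not yet reached before it,
    # which also keeps the row that crosses the threshold (the greedy rule).
    kept = ordered[share_before < pct / 100.0]
    return kept.groupby("A")["B"].apply(list)


result = most_moved(df)
print(result)
```

On the example data with pct=50 this gives [E3, E4], [E1, E3], [E4, E2] and [E2] for E1 to E4, which matches the desired output in the question.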

Edit

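I added some comments to the code. To make it clearer what it does, I also added (commented-out) print statements for the intermediate variables. If you uncomment them, don't be surprised if the first group is printed twice.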
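
To follow the intermediate values without uncommenting the prints, the steps of select can be traced by hand for a single group. This is a minimal sketch for group E1 with a 50% threshold; the values come from the question's table, and the variable names `share`, `n_keep` and `selected` are made up for this illustration.

```python
import pandas as pd

# Moves of group E1, indexed by B (values from the question), largest first.
moves = pd.Series([10, 20, 15], index=["E2", "E3", "E4"]).sort_values(ascending=False)

# Cumulative share of the group total: E3 -> 20/45, E4 -> 35/45, E2 -> 45/45.
share = moves.cumsum() / moves.sum()

# One row (E3) is still below 50%; adding 1 includes the row that crosses it (E4).
n_keep = int((share < 0.5).sum()) + 1

# Labels of the selected rows.
selected = moves.index[:n_keep].tolist()
print(share.round(3).to_dict(), n_keep, selected)
```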

Pandas groupby - how to get rows up to a percentage of the cumulative sum?
