Question

I want to allocate some "units" across each group of a DataFrame that looks like this:

       limit  allocation  spaceLeft
Group                              
A        5.0         0.0        5.0
A        3.0         0.0        3.0
A        7.0         0.0        7.0
B        1.0         0.0        1.0
B        2.0         0.0        2.0
B        4.0         0.0        4.0
B        6.0         0.0        6.0

...which can be created with:

import pandas as pd

df = pd.DataFrame(data=[('A', 5.0, 0.0),
                        ('A', 3.0, 0.0),
                        ('A', 7.0, 0.0),
                        ('B', 1.0, 0.0),
                        ('B', 2.0, 0.0),
                        ('B', 4.0, 0.0),
                        ('B', 6.0, 0.0)],
                  columns=('Group', 'limit', 'allocation')).set_index('Group')
df['spaceLeft'] = df['limit'] - df['allocation']

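The constraint is that the units must be spread as evenly as possible across the rows of each group, without exceeding any row's limit. So, for example, if we have 10 units, the correct final allocation for group A would be: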

       limit  allocation  spaceLeft
Group                              
A        5.0         3.5        1.5
A        3.0         3.0        0.0
A        7.0         3.5        3.5

I wrote a recursive function to do this:

unitsToAllocate = 10.0
def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        # "Quantum" is the space left in the smallest bin with space remaining
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        # Distribute only as much as will fill next bin to its limit
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        f(g)
    else:
        return g

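If I manually and iteratively run the logic inside f on a single group, such as group = df.groupby('Group').get_group('A'), it works. (That is, it produces the correct result for A shown above.)

However, if I call df.groupby('Group').apply(f), which is how f was designed to be used, it fails with:

ValueError: cannot reindex from a duplicate axis

What is going wrong?

Also, is there a more idiomatic pandas way to implement this algorithm?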

Answer 1

Silly mistake in the recursion logic: both branches of f(g) must return a group.
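To see why the missing return matters, here is a minimal sketch that has nothing to do with pandas. The function names are made up for illustration. When the recursive branch does not return the result of the recursive call, the outermost call hands back None, which is what the original f gives to apply for any group that still has units to place.

```python
def countdown_broken(n):
    # The recursive branch drops the result, so the caller receives None.
    if n > 0:
        countdown_broken(n - 1)
    else:
        return n


def countdown_fixed(n):
    # Both branches return a value, so the result propagates back up.
    if n > 0:
        return countdown_fixed(n - 1)
    else:
        return n


print(countdown_broken(3))  # None
print(countdown_fixed(3))   # 0
```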

The following code works:

def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        return f(g)  # <-- FIXED THIS LINE
    else:
        return g
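As for a less recursive way to write it, one possible sketch is a plain loop over NumPy arrays that applies the same "fill the smallest open bin, then repeat" idea. This is not the asker's code. The name allocate_evenly, the tol parameter, and keeping Group as an ordinary column instead of the index (so the row labels stay unique) are all assumptions made for this example. The loop also stops when every bin is full, a case in which the recursive version would never reach its base case.

```python
import numpy as np
import pandas as pd


def allocate_evenly(limits, units, tol=1e-12):
    """Spread `units` as evenly as possible over bins capped at `limits`."""
    limits = np.asarray(limits, dtype=float)
    allocation = np.zeros_like(limits)
    remaining = float(units)
    while remaining > tol:
        space_left = limits - allocation
        open_bins = space_left > tol
        n_open = int(open_bins.sum())
        if n_open == 0:
            break  # every bin is full; the leftover units cannot be placed
        # Equal share per open bin, capped by the smallest remaining space.
        share = min(remaining / n_open, space_left[open_bins].min())
        allocation[open_bins] += share
        remaining -= share * n_open
    return allocation


# Assumption: Group is a regular column here, so the index has no duplicates.
df = pd.DataFrame({
    "Group": list("AAABBBB"),
    "limit": [5.0, 3.0, 7.0, 1.0, 2.0, 4.0, 6.0],
})
df["allocation"] = df.groupby("Group")["limit"].transform(
    lambda s: pd.Series(allocate_evenly(s, 10.0), index=s.index)
)
df["spaceLeft"] = df["limit"] - df["allocation"]
print(df)  # group A rows get 3.5, 3.0, 3.5, matching the table in the question
```

Passing the number of units as an argument, rather than reading a global such as unitsToAllocate, also makes it easier to give each group a different budget.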

Recursion applied to DataFrame groups resulting in reindexing error
