Recursive apply on DataFrame groups causes a reindexing error

Time: 2019-06-04 21:41:20

Tags: python pandas dataframe recursion pandas-groupby

I want to distribute some "units" across each group of a DataFrame that looks like this:

       limit  allocation  spaceLeft
Group                              
A        5.0         0.0        5.0
A        3.0         0.0        3.0
A        7.0         0.0        7.0
B        1.0         0.0        1.0
B        2.0         0.0        2.0
B        4.0         0.0        4.0
B        6.0         0.0        6.0

...which can be created with:

import pandas as pd

df = pd.DataFrame(data=[('A', 5.0, 0.0),
                        ('A', 3.0, 0.0),
                        ('A', 7.0, 0.0),
                        ('B', 1.0, 0.0),
                        ('B', 2.0, 0.0),
                        ('B', 4.0, 0.0),
                        ('B', 6.0, 0.0)],
                  columns=('Group', 'limit', 'allocation')).set_index('Group')
# Remaining capacity in each row
df['spaceLeft'] = df['limit'] - df['allocation']

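The constraint is that the units must be spread as evenly as possible across the rows of each group, without exceeding any row's limit. So, for example, with 10 units to allocate, the correct final allocation for group A would be: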

       limit  allocation  spaceLeft
Group                              
A        5.0         3.5        1.5
A        3.0         3.0        0.0
A        7.0         3.5        3.5
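One way to read the constraint is as a common cap: every row receives min(limit, cap), and the cap is chosen so that the allocations sum to the available units. The sketch below checks that reading against the example above. The helper name water_level is hypothetical and does not come from the question, and the approach (sorting the limits) is a different technique from the recursive function in the question.

```python
def water_level(limits, units):
    """Return the cap c with sum(min(l, c) for l in limits) == units.

    Returns None when the limits cannot absorb all the units.
    Hypothetical helper for illustration only.
    """
    remaining = units
    ordered = sorted(limits)
    for i, lim in enumerate(ordered):
        rows_left = len(ordered) - i
        share = remaining / rows_left  # even split among rows not yet capped
        if lim >= share:
            return share               # every remaining row can take the even share
        remaining -= lim               # this row is filled to its own limit
    return None


limits_a = [5.0, 3.0, 7.0]
cap = water_level(limits_a, 10.0)
allocation_a = [min(lim, cap) for lim in limits_a]
print(cap, allocation_a)
```

For group A the smallest limit (3.0) is below the even share of 10/3, so that row is filled completely and the remaining 7 units are split evenly over the other two rows.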

I wrote a recursive function to do this:

unitsToAllocate = 10.0
def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        # "Quantum" is the space left in the smallest bin with space remaining
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        # Distribute only as much as will fill next bin to its limit
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        f(g)
    else:
        return g

If I run the logic inside f by hand, iteratively, on a single group such as group = df.groupby('Group').get_group('A'), it works. (That is, it produces the correct result for A shown above.)

But if I call f the way it was designed to be used, as df.groupby('Group').apply(f), it fails with:

  

ValueError: cannot reindex from a duplicate axis

What is going wrong?

Also, is there a more idiomatic pandas way to implement this algorithm?

1 Answer:

Answer 0: (score: 0)

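A silly mistake in the recursion logic: both branches of f(g) must return a group. In the original, the recursive branch called f(g) without returning its result, so every call that took that branch returned None.

The following code works: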
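The effect of the missing return can be seen without pandas at all. The sketch below uses two hypothetical toy functions (countdown_broken and countdown_fixed, not part of the original code) with the same branch structure as f.

```python
def countdown_broken(n):
    if n > 0:
        countdown_broken(n - 1)  # result of the recursive call is discarded
    else:
        return "done"


def countdown_fixed(n):
    if n > 0:
        return countdown_fixed(n - 1)  # pass the result back up the call chain
    else:
        return "done"


print(countdown_broken(0))  # base case is hit directly
print(countdown_broken(3))  # recursive branch falls off the end of the function
print(countdown_fixed(3))
```

Only the innermost call of the broken version reaches a return statement; every outer call falls through and implicitly returns None, which is what groupby's apply received from f for each group.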

def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        return f(g)  # <-- FIXED THIS LINE
    else:
        return g
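On the second part of the question, here is a minimal sketch of the same round-by-round logic written as a loop instead of recursion. It is not from the original answer. The function name allocate_evenly, the units parameter (replacing the global unitsToAllocate), and the tol tolerance are assumptions introduced for illustration. It works on NumPy arrays and a copy of the group, so it does not mutate its input, and it stops when no row has space left, a case in which the recursive version would never terminate.

```python
import pandas as pd


def allocate_evenly(group, units, tol=1e-12):
    """Spread `units` over the rows of `group`, never exceeding `limit`.

    Hypothetical helper; `tol` is an assumed floating-point tolerance.
    """
    limit = group["limit"].to_numpy(dtype=float)
    alloc = group["allocation"].to_numpy(dtype=float, copy=True)
    while True:
        space = limit - alloc
        has_space = space > tol
        units_left = units - alloc.sum()
        if units_left <= tol or not has_space.any():
            break
        # Give each open row an equal share, capped so that the
        # smallest open row is filled exactly to its limit.
        step = min(units_left / has_space.sum(), space[has_space].min())
        alloc[has_space] += step
    out = group.copy()
    out["allocation"] = alloc
    out["spaceLeft"] = limit - alloc
    return out


df = pd.DataFrame(data=[('A', 5.0, 0.0), ('A', 3.0, 0.0), ('A', 7.0, 0.0),
                        ('B', 1.0, 0.0), ('B', 2.0, 0.0), ('B', 4.0, 0.0),
                        ('B', 6.0, 0.0)],
                  columns=('Group', 'limit', 'allocation')).set_index('Group')

# Process each group separately and stitch the pieces back together.
result = pd.concat([allocate_evenly(g, 10.0)
                    for _, g in df.groupby(level="Group")])
print(result)
```

Iterating over the groups and concatenating is used here instead of apply only to keep the sketch independent of how different pandas versions assemble apply results; passing the function to apply should also be possible.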