I want to distribute some "units" across each group of a DataFrame that looks like this:
       limit  allocation  spaceLeft
Group
A        5.0         0.0        5.0
A        3.0         0.0        3.0
A        7.0         0.0        7.0
B        1.0         0.0        1.0
B        2.0         0.0        2.0
B        4.0         0.0        4.0
B        6.0         0.0        6.0
...which can be created with:
import pandas as pd

# One row per "bin"; Group becomes the (non-unique) index
df = pd.DataFrame(data=[('A', 5.0, 0.0),
                        ('A', 3.0, 0.0),
                        ('A', 7.0, 0.0),
                        ('B', 1.0, 0.0),
                        ('B', 2.0, 0.0),
                        ('B', 4.0, 0.0),
                        ('B', 6.0, 0.0)],
                  columns=('Group', 'limit', 'allocation')).set_index('Group')
# Remaining capacity of each row
df['spaceLeft'] = df['limit'] - df['allocation']
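The constraint is that the units must be allocated as evenly as possible across the rows of each group, without exceeding any row's limit. For example, with 10 units, the correct final allocation for group A would be: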
       limit  allocation  spaceLeft
Group
A        5.0         3.5        1.5
A        3.0         3.0        0.0
A        7.0         3.5        3.5
I wrote a recursive function to do this:
unitsToAllocate = 10.0

def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        # "Quantum" is the space left in the smallest bin with space remaining
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        # Distribute only as much as will fill next bin to its limit
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        f(g)
    else:
        return g
If I run the inner logic of f manually and iteratively on a single group, such as group = df.groupby('Group').get_group('A'), it works. (That is, it produces the correct result shown above for A.)
However, if I call f as designed, via df.groupby('Group').apply(f), it fails with:

ValueError: cannot reindex from a duplicate axis

What is going wrong?
Also, is there a more idiomatic, pandas-style way to implement this algorithm?
Answer 0 (score: 0)
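It was a silly mistake in the recursion logic: both branches of f(g) must return a group. In the original, the recursive branch called f(g) without returning its result, so that branch returned None.
The following code works: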
def f(g):
    allocated = g['allocation'].sum()
    unitsLeft = unitsToAllocate - allocated
    if unitsLeft > 0:
        g['spaceLeft'] = g['limit'] - g['allocation']
        quantum = g[g['spaceLeft'] > 0]['spaceLeft'].min()
        alloc = min(unitsLeft / g[g['spaceLeft'] > 0]['spaceLeft'].count(), quantum)
        g.loc[g['spaceLeft'] > 0, 'allocation'] = g[g['spaceLeft'] > 0]['allocation'] + alloc
        return f(g)  # <-- FIXED THIS LINE
    else:
        return g
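
On the second question (a less recursive way to write this), one possible sketch is an iterative "water-filling" loop over NumPy arrays. This is not part of the original answer: the function name allocate_evenly, the tolerance eps, and the choice to keep Group as an ordinary column (for example after reset_index()) instead of a non-unique index are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def allocate_evenly(limits, units, eps=1e-12):
    """Hypothetical helper: spread `units` as evenly as possible over rows,
    never exceeding each row's limit. Returns a NumPy array of allocations."""
    limits = np.asarray(limits, dtype=float)
    alloc = np.zeros_like(limits)
    remaining = float(units)
    while remaining > eps:
        # Rows that still have space left
        open_rows = (limits - alloc) > eps
        n_open = int(open_rows.sum())
        if n_open == 0:
            break  # every row is full; leftover units stay unallocated
        space = limits[open_rows] - alloc[open_rows]
        # Either split what is left evenly, or fill the smallest open row
        step = min(remaining / n_open, space.min())
        alloc[open_rows] += step
        remaining -= step * n_open
    return alloc

# Assumption: Group is a regular column here, not the index
df = pd.DataFrame({'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'limit': [5.0, 3.0, 7.0, 1.0, 2.0, 4.0, 6.0]})
df['allocation'] = df.groupby('Group')['limit'].transform(
    lambda s: allocate_evenly(s.to_numpy(), 10.0))
df['spaceLeft'] = df['limit'] - df['allocation']
print(df)
```

Each pass either fills at least one row to its limit or uses up the remaining units, so the loop runs at most once per row plus one final pass. Unlike the recursive version, this sketch also stops cleanly when a group's limits add up to less than the units on offer.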