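Let's say I have a pandas DataFrame with some number of columns and rows. What I want to do is find the combination of 5 rows that yields the highest score in a particular column, given a certain threshold. Below is a small toy example to illustrate it better:
Below is a simplified example of my code. I would like to know whether this "brute force" approach is a sensible way to tackle the problem. Is there any chance to do this more efficiently, either with other Python libraries or with tricks that make it run faster? (I thought about Cython, but I think itertools is already implemented in C, so there would not be much benefit?) Also, I don't know how to use multiprocessing here, since the itertools functions produce their results lazily, like a generator. I welcome any discussion and ideas!
Thanks!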
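EDIT: Sorry, I forgot to mention the second constraint. The combination of rows must also meet certain category criteria. For example, . So, to summarize the problem: I want to find the combination of k rows that maximizes the score s, subject to the k rows belonging to certain categories and their sum in the constraint column not exceeding a given threshold.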
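Stated as an optimization problem, the summary above might look like the following. The symbols are assumptions introduced here for illustration only: $x_i$ is 1 if row $i$ is selected and 0 otherwise, $s_i$ and $c_i$ are its score and constraint values, $T$ is the threshold, and $k_j$ is the number of rows required from category $C_j$ (so that $\sum_j k_j = k$).

```latex
\begin{aligned}
\max_{x \in \{0,1\}^n} \quad & \sum_{i=1}^{n} s_i x_i \\
\text{subject to} \quad & \sum_{i=1}^{n} c_i x_i \le T, \\
& \sum_{i \in C_j} x_i = k_j \quad \text{for each category } j.
\end{aligned}
```

Written this way, the problem is a variant of the 0/1 knapsack problem with extra cardinality constraints per category.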
from itertools import chain, combinations, product

# based on the suggested answer:
# sort by best score per constraint ratio:
r = df['score_column'] / df['constraint_column']
r = r.sort_values(ascending=False)
df = df.loc[r.index]

df_a = df[df['col1'] == some_criterion]  # rows from category a
df_b = df[df['col2'] == some_criterion]  # rows from category b
df_c = df[df['col3'] == some_criterion]  # rows from category c

score = 0.0
# one row from category a, two from b, two from c
for i in product(
        combinations(df_a.index, r=1),
        combinations(df_b.index, r=2),
        combinations(df_c.index, r=2)):
    # flatten the three tuples of index labels into one list of rows
    indexes = list(set(chain.from_iterable(i)))
    df_cur = df.loc[indexes]
    constraint_sum = df_cur['constraint_column'].values.sum()
    if constraint_sum > some_threshold:
        continue  # over the threshold, skip this combination
    new_score = df_cur['score_column'].values.sum()
    if new_score > score:
        score = new_score
    # based on the suggested answer:
    # break here, since it can't get any better if the threshold is exactly
    # matched since we sorted by the best score/constraint ratio previously.
    if constraint_sum == some_threshold:
        break
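One possible way to speed up the brute-force loop without changing what it searches is to pull the two relevant columns into NumPy arrays once and index them by position, so that no DataFrame lookup happens inside the loop. The sketch below assumes the column names `score_column` and `constraint_column` from the code above; the function name `best_combination`, its `groups` argument, and the `category` column in the toy data are hypothetical and only serve the illustration.

```python
from itertools import combinations, product

import numpy as np
import pandas as pd


def best_combination(df, groups, threshold):
    """Exhaustive search that sums NumPy arrays instead of slicing the DataFrame.

    groups: list of (boolean mask, number of rows to pick) pairs, one per category.
    Returns (best score, list of index labels), or (None, None) if nothing fits.
    """
    scores = df["score_column"].to_numpy()
    costs = df["constraint_column"].to_numpy()
    # Positional row numbers of each category, combined k at a time.
    pools = [combinations(np.flatnonzero(mask.to_numpy()), k) for mask, k in groups]

    best_score, best_rows = None, None
    for parts in product(*pools):
        rows = [p for part in parts for p in part]
        if len(set(rows)) < len(rows):
            continue  # the same row was picked for two categories
        if costs[rows].sum() > threshold:
            continue  # over the threshold
        current = scores[rows].sum()
        if best_score is None or current > best_score:
            best_score, best_rows = current, rows
    if best_rows is None:
        return None, None
    return best_score, list(df.index[best_rows])


# Hypothetical toy data, not taken from the question.
toy = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c", "c", "c"],
    "score_column": [10, 8, 7, 6, 5, 9, 4, 3],
    "constraint_column": [5, 2, 4, 3, 2, 6, 2, 1],
})
groups = [
    (toy["category"] == "a", 1),
    (toy["category"] == "b", 2),
    (toy["category"] == "c", 2),
]
best_score, best_rows = best_combination(toy, groups, threshold=12)
```

The number of combinations is unchanged, so this only reduces the cost per iteration. It also checks every combination rather than stopping early, which keeps the result exact.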
Answer 0 (score: 1)
I think you can solve this by greedily taking the best rows according to a "score per constraint" metric:
constraint = 6  # whatever value you want here
df['s_per_c'] = df.score / df.constraint
# best score-per-constraint ratio first
df = df.sort_values('s_per_c', ascending=False)
total = 0
for i, r in df.iterrows():
    if r.constraint > constraint:
        continue  # cannot afford this row with the remaining budget
    constraint -= r.constraint
    total += r.score
    if constraint == 0:
        break  # budget used up exactly
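My logic is that every time I take a score, I want to make sure that I can afford it ("constraint") and that I am getting the best bang for the buck ("s_per_c").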
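Picking by the best ratio is a heuristic and is not guaranteed to find the optimal combination. As a minimal sketch of that caveat, the snippet below wraps the greedy pass in a function and compares it with an exhaustive check over all subsets. The function names and the three-row table are hypothetical, and the sketch ignores the row-count and category requirements from the question.

```python
from itertools import combinations

import pandas as pd


def greedy_total(df, budget):
    """Greedy pass: take rows in order of best score/constraint ratio."""
    ranked = df.assign(s_per_c=df["score"] / df["constraint"])
    ranked = ranked.sort_values("s_per_c", ascending=False)
    total = 0
    for row in ranked.itertuples():
        if row.constraint > budget:
            continue  # cannot afford this row
        budget -= row.constraint
        total += row.score
        if budget == 0:
            break
    return total


def exhaustive_total(df, budget):
    """Try every subset of rows; only practical for very small inputs."""
    rows = list(zip(df["score"], df["constraint"]))
    best = 0
    for k in range(1, len(rows) + 1):
        for combo in combinations(rows, k):
            if sum(c for _, c in combo) <= budget:
                best = max(best, sum(s for s, _ in combo))
    return best


# Hypothetical toy data: one row with a good ratio that blocks two others.
toy = pd.DataFrame({"score": [5, 3, 3], "constraint": [4, 3, 3]})
greedy_result = greedy_total(toy, budget=6)
exhaustive_result = exhaustive_total(toy, budget=6)
```

With a budget of 6, the greedy pass takes the first row (ratio 1.25) and then cannot afford either of the others, giving a total of 5, while the two remaining rows together fit the budget exactly and score 6. The greedy result is therefore best treated as a quick lower bound rather than a proof of optimality.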