How to efficiently sample combinations of rows in a pandas DataFrame

Time: 2015-01-08 18:35:30

Tags: python pandas itertools

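Let's say I have a pandas DataFrame with a certain number of columns and rows. What I want to do is find the combination of 5 rows that yields the highest score in a particular column, subject to a certain threshold. Here is a small toy example to illustrate it: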

[image: toy example]

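Below is a simplified example of my code. I would like to know whether this "brute force" approach is a sensible way to tackle the problem. Is there any chance of doing this more efficiently, for example with other Python libraries, or are there tricks to make it run faster? (I thought about Cython, but I believe itertools is already implemented in C, so there wouldn't be much benefit?) Also, I don't know how to use multiprocessing here, since itertools produces the combinations lazily, like a generator. I welcome any discussion and ideas!

Thanks!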

EDIT: Sorry, I forgot to mention a second constraint: the combination of rows must also satisfy certain category criteria. For example:

  • 1x category a
  • 2x category b
  • 2x category c

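So, to summarize the problem: I want to find the combination of k rows that maximizes the score s, given that the k rows belong to certain categories and do not exceed a certain threshold in the constraint column.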

from itertools import chain, combinations, product

# based on the suggested answer:
# sort by best score per constraint ratio:
r = df['score_column'] / df['constraint_column']
r = r.sort_values(ascending=False)  # Series.sort() no longer exists in pandas
df = df.loc[r.index]                # .ix has been removed; use .loc for labels


df_a = df[df['col1'] == some_criterion] # rows from category a
df_b = df[df['col2'] == some_criterion] # rows from category b
df_c = df[df['col3'] == some_criterion] # rows from category c

score = 0.0

for i in product(
            combinations(df_a.index, r=1), 
            combinations(df_b.index, r=2), 
            combinations(df_c.index, r=2)):

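    # flatten the three per-category tuples into one list of row labels
    # (.loc expects a list-like; recent pandas rejects a set here)
    indexes = list(chain.from_iterable(i))

    df_cur = df.loc[indexes]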

    if df_cur['constraint_column'].values.sum() > some_threshold:
        continue


    new_score = df_cur['score_column'].values.sum()
    if new_score > score:
        score = new_score


    # based on the suggested answer:
    # break here, since it can't get any better if the threshold is exactly
    # matched since we sorted by the best score/constraint ratio previously.

    if df_cur['constraint_column'].values.sum() == some_threshold:
        break 
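
For reference, the brute-force search above can be written as a self-contained sketch that runs end to end. The toy data, the single `category` column, the threshold value, and the helper name `best_combination` are all assumptions made for illustration; they are not part of the original post. The one deliberate change is that the score and constraint columns are pulled out into plain dictionaries once, so the inner loop does no DataFrame indexing, which is usually the slowest part of a loop like this.

```python
from itertools import chain, combinations, product

import pandas as pd

# Hypothetical toy data; column names mirror the placeholders used in the question.
df = pd.DataFrame({
    'category':          ['a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'score_column':      [5.0, 3.0, 4.0, 2.0, 6.0, 1.0, 7.0, 3.0],
    'constraint_column': [4.0, 1.0, 3.0, 1.0, 5.0, 1.0, 6.0, 2.0],
})
some_threshold = 12.0  # assumed budget for the constraint column


def best_combination(df, picks, threshold):
    """Exhaustive search taking picks[cat] rows from each category."""
    # Look the two columns up once; the loop then only touches plain dicts.
    scores = df['score_column'].to_dict()
    costs = df['constraint_column'].to_dict()
    # One combinations() iterator per category, e.g. 1 of 'a', 2 of 'b', 2 of 'c'.
    per_category = [
        combinations(df.index[df['category'] == cat], r=k)
        for cat, k in picks.items()
    ]
    best_score, best_rows = None, None
    for combo in product(*per_category):
        rows = tuple(chain.from_iterable(combo))
        if sum(costs[i] for i in rows) > threshold:
            continue  # exceeds the constraint threshold
        total = sum(scores[i] for i in rows)
        if best_score is None or total > best_score:
            best_score, best_rows = total, rows
    return best_score, best_rows


best_score, best_rows = best_combination(
    df, {'a': 1, 'b': 2, 'c': 2}, some_threshold
)
print(best_score, best_rows)
```

This version checks every combination and has no early exit, so it still scales with the product of the per-category combination counts; it only reduces the cost of each iteration.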

1 Answer:

Answer 0 (score: 1)

I think you can solve this greedily, by repeatedly taking the best remaining row according to a "score per constraint" metric:

constraint = 6 #whatever value you want here
df['s_per_c'] = df.score / df.constraint
df.sort_values('s_per_c', inplace=True, ascending=False)

total = 0
for i, r in df.iterrows():
    if r.constraint > constraint:
        continue
    constraint -= r.constraint
    total += r.score
    if constraint == 0:
        break

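My logic is that each time I add to the score, I want to make sure I can afford it (the "constraint") and that I'm getting the best bang for the buck ("s_per_c").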
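
The greedy idea in this answer can be tried out with a small self-contained sketch. The toy data, the `budget` parameter, and the function name `greedy_pick` are assumptions for illustration only. Note that, like the answer's code, this sketch does not limit the number of rows picked and does not enforce the category criteria from the question.

```python
import pandas as pd

# Hypothetical toy data using the column names from the answer.
df = pd.DataFrame({
    'score':      [5.0, 3.0, 4.0, 2.0, 6.0],
    'constraint': [4.0, 1.0, 3.0, 1.0, 5.0],
})


def greedy_pick(df, budget):
    """Take rows in descending score/constraint order while they fit the budget."""
    ranked = (
        df.assign(s_per_c=df['score'] / df['constraint'])
          .sort_values('s_per_c', ascending=False)
    )
    total, chosen = 0.0, []
    for idx, row in ranked.iterrows():
        if row['constraint'] > budget:
            continue  # cannot afford this row; try the next-best ratio
        budget -= row['constraint']
        total += row['score']
        chosen.append(idx)
        if budget == 0:
            break  # budget fully spent
    return total, chosen


total, chosen = greedy_pick(df, budget=6)
print(total, chosen)
```

Greedy selection by ratio is a heuristic and is not guaranteed to find the best combination. On this toy data, for instance, rows 0, 1 and 3 together also fit a budget of 6 and score higher than what the greedy pass selects, so the result is best treated as a fast approximation or a starting bound.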