Question

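I have a potentially long list of namedtuples (currently it can reach 10,000 rows, but it may grow in the future).

I need to compare several elements of each namedtuple against all the other namedtuples in the list. I am looking for an efficient and generic way to do this.

For simplicity, I will use an analogy with cakes, which should make the problem easier to understand.

There is a list of namedtuples, where each namedtuple is a cake: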

from collections import namedtuple

Cake = namedtuple('Cake',
                  ['cake_id',
                   'ingredient1', 'ingredient2', 'ingredient3',
                   'baking_time', 'cake_price'])

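Both cake_price and baking_time matter. Among cakes with the same ingredients, I want to remove the irrelevant ones from the list: any cake that is equally or more expensive and takes equally long or longer to bake than another cake with the same ingredients is irrelevant (there is a detailed example below).

What is the best way to do this?

Approach

What I have done so far is sort the list of namedtuples by cake_price and baking_time: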

sorted_cakes = sorted(list_of_cakes, key=lambda c: (c.cake_price, c.baking_time))

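and then build a new list, adding each cake as long as no previously added cake has the same ingredients and is at least as cheap and at least as fast to bake.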

list_of_good_cakes = []
for cake in sorted_cakes:
    if interesting_cake(cake, list_of_good_cakes):
        list_of_good_cakes.append(cake)

def interesting_cake(current_cake, list_of_good_cakes):
    is_interesting = True
    if list_of_good_cakes: #first cake to be directly appended
        for included_cake in list_of_good_cakes:
            if (current_cake.ingredient1 == included_cake.ingredient1 and
                current_cake.ingredient2 == included_cake.ingredient2 and
                current_cake.ingredient3 == included_cake.ingredient3 and
                current_cake.baking_time >= included_cake.baking_time):

                if current_cake.cake_price >= included_cake.cake_price:
                    is_interesting = False

    return is_interesting

(I know that a nested loop is far from optimal, but I cannot think of any other way to do this...)

Example

Given

list_of_cakes = [cake_1, cake_2, cake_3, cake_4, cake_5]

where

cake_1 = Cake(cake_id=1,
              ingredient1='dark chocolate',
              ingredient2='cookies',
              ingredient3='strawberries',
              baking_time=60, cake_price=20)

cake_2 = Cake(cake_id=2,
              ingredient1='dark chocolate',
              ingredient2='cookies',
              ingredient3='strawberries',
              baking_time=80, cake_price=20)

cake_3 = Cake(cake_id=3,
              ingredient1='white chocolate',
              ingredient2='bananas',
              ingredient3='strawberries',
              baking_time=150, cake_price=100)

cake_4 = Cake(cake_id=4,
              ingredient1='dark chocolate',
              ingredient2='cookies',
              ingredient3='strawberries',
              baking_time=40, cake_price=30)

cake_5 = Cake(cake_id=5,
              ingredient1='dark chocolate',
              ingredient2='cookies',
              ingredient3='strawberries',
              baking_time=10, cake_price=80)

The expected result would be:

list_of_relevant_cakes = [cake_1, cake_3, cake_4, cake_5]

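cake_1 is the cheapest (and the fastest among cakes of equal price) -> IN
cake_2 has the same price as cake_1 and takes longer to bake -> OUT
cake_3 is a different kind of cake -> IN
cake_4 is more expensive than cake_1, but bakes faster -> IN
cake_5 is more expensive than cake_1 and cake_4, but bakes faster -> IN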
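
The rule above can be checked by brute force on this small example. The following is only a minimal sketch, not the code from the question: the helper names `dominates` and `relevant_ids` are made up for illustration, and the tie-break on `cake_id` for exact duplicates is an assumption. It compares every cake with every other cake, so it is meant for verifying small inputs, not for large lists.

```python
from collections import namedtuple

Cake = namedtuple('Cake', ['cake_id',
                           'ingredient1', 'ingredient2', 'ingredient3',
                           'baking_time', 'cake_price'])

dark = ('dark chocolate', 'cookies', 'strawberries')
list_of_cakes = [
    Cake(1, *dark, 60, 20),
    Cake(2, *dark, 80, 20),
    Cake(3, 'white chocolate', 'bananas', 'strawberries', 150, 100),
    Cake(4, *dark, 40, 30),
    Cake(5, *dark, 10, 80),
]

def dominates(a, b):
    """Return True if cake a makes cake b irrelevant (hypothetical helper)."""
    if a is b:
        return False
    if (a.ingredient1, a.ingredient2, a.ingredient3) != \
            (b.ingredient1, b.ingredient2, b.ingredient3):
        return False
    if (a.baking_time, a.cake_price) == (b.baking_time, b.cake_price):
        # Assumption: of two exact duplicates, keep the one with the lower id.
        return a.cake_id < b.cake_id
    return a.baking_time <= b.baking_time and a.cake_price <= b.cake_price

def relevant_ids(cakes):
    """Ids of all cakes that no other cake makes irrelevant."""
    return [c.cake_id for c in cakes
            if not any(dominates(other, c) for other in cakes)]

print(relevant_ids(list_of_cakes))  # [1, 3, 4, 5]
```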

Answer 1

The running time of your approach will be roughly proportional to

len(list_of_cakes) * len(list_of_relevant_cakes)

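...which can get very large if you have a lot of cakes and many of them are relevant.

We can improve on this by exploiting the fact that each cluster of cakes with the same ingredients is likely to be much smaller. First, we need a function that checks a new cake against an existing, already optimised cluster of cakes with the same ingredients: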

from copy import copy

def update_cluster(cakes, new):
    for c in copy(cakes):
        if c.baking_time <= new.baking_time and c.cake_price <= new.cake_price:
            break
        elif c.baking_time >= new.baking_time and c.cake_price >= new.cake_price:
            cakes.discard(c)
    else:
        cakes.add(new)

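What this does is check the new cake against each cake c in a copy of cakes, and then:

If its baking time and price are both greater than or equal to those of an existing cake, exit immediately (you could return rather than break, but I prefer to be explicit about control flow).
If its baking time and price are both less than or equal to those of an existing cake, remove that existing cake from the cluster.
If it makes it past all the existing cakes (and therefore reaches the else clause of the for statement), add it to the cluster.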
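
To make the three cases concrete, here is a small sketch that repeats update_cluster from above and feeds it a few cakes one at a time. The shortened `Cake` definition (ingredients omitted, since a cluster shares them anyway) and cake 6 are hypothetical, added only for illustration.

```python
from collections import namedtuple
from copy import copy

# Simplified cake: ingredients are left out because a cluster shares them.
Cake = namedtuple('Cake', ['cake_id', 'baking_time', 'cake_price'])

def update_cluster(cakes, new):
    for c in copy(cakes):
        if c.baking_time <= new.baking_time and c.cake_price <= new.cake_price:
            break                      # an existing cake is at least as good
        elif c.baking_time >= new.baking_time and c.cake_price >= new.cake_price:
            cakes.discard(c)           # the new cake is at least as good as c
    else:
        cakes.add(new)                 # no existing cake beat the new one

cluster = {Cake(2, 80, 20)}
update_cluster(cluster, Cake(1, 60, 20))  # beats cake 2: cake 2 discarded, cake 1 added
update_cluster(cluster, Cake(4, 40, 30))  # pricier but faster than cake 1: added
update_cluster(cluster, Cake(6, 70, 25))  # hypothetical cake, worse than cake 1: break
print(sorted(c.cake_id for c in cluster))  # [1, 4]
```

Copying the set before iterating matters here, because discarding from a set while looping over it directly raises a RuntimeError.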

Once we have that, we can use it to filter the cakes:

def select_from(cakes):
    clusters = {}
    for cake in cakes:
        key = cake.ingredient1, cake.ingredient2, cake.ingredient3
        if key in clusters:
            update_cluster(clusters[key], cake)
        else:
            clusters[key] = {cake}
    return [c for v in clusters.values() for c in v]

Here it is in action:

>>> select_from(list_of_cakes)
[Cake(cake_id=1, ingredient1='dark chocolate', ingredient2='cookies', ingredient3='strawberries', baking_time=60, cake_price=20),
 Cake(cake_id=4, ingredient1='dark chocolate', ingredient2='cookies', ingredient3='strawberries', baking_time=40, cake_price=30),
 Cake(cake_id=5, ingredient1='dark chocolate', ingredient2='cookies', ingredient3='strawberries', baking_time=10, cake_price=80),
 Cake(cake_id=3, ingredient1='white chocolate', ingredient2='bananas', ingredient3='strawberries', baking_time=150, cake_price=100)]

The running time of this solution is roughly proportional to the number of cakes multiplied by the size of a typical cluster, because each cake is only checked against the cakes currently kept in its own cluster.

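I did a little testing with random cakes, each using a selection from five different ingredients and random prices and baking times, and found that:

this approach consistently produces the same results as yours (albeit unsorted)
it runs considerably faster: 0.2 seconds on my machine for 100,000 random cakes.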
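
The test code itself is not shown in the answer, so the following is only a guess at what such a check might look like. The ingredient names, the value ranges, the seed and the list size are all assumptions, and `question_approach` is a condensed rewrite of the sort-and-filter loop from the question rather than the original code.

```python
import random
from collections import namedtuple
from copy import copy

Cake = namedtuple('Cake', ['cake_id',
                           'ingredient1', 'ingredient2', 'ingredient3',
                           'baking_time', 'cake_price'])

def update_cluster(cakes, new):
    for c in copy(cakes):
        if c.baking_time <= new.baking_time and c.cake_price <= new.cake_price:
            break
        elif c.baking_time >= new.baking_time and c.cake_price >= new.cake_price:
            cakes.discard(c)
    else:
        cakes.add(new)

def select_from(cakes):
    clusters = {}
    for cake in cakes:
        key = cake.ingredient1, cake.ingredient2, cake.ingredient3
        if key in clusters:
            update_cluster(clusters[key], cake)
        else:
            clusters[key] = {cake}
    return [c for v in clusters.values() for c in v]

def question_approach(cakes):
    """Condensed version of the sort-then-filter loop from the question."""
    good = []
    for cake in sorted(cakes, key=lambda c: (c.cake_price, c.baking_time)):
        beaten = any(g[1:4] == cake[1:4]                      # same ingredients
                     and g.baking_time <= cake.baking_time
                     and g.cake_price <= cake.cake_price
                     for g in good)
        if not beaten:
            good.append(cake)
    return good

# Assumed test data: five made-up ingredients, three picked per cake.
random.seed(0)
ingredients = ['chocolate', 'cookies', 'strawberries', 'bananas', 'cream']
cakes = [Cake(i, *random.sample(ingredients, 3),
              random.randint(1, 100), random.randint(1, 100))
         for i in range(2000)]

# Compare as sets, because select_from does not return a sorted list.
same = set(select_from(cakes)) == set(question_approach(cakes))
print(same)
```

Timing either function with `time.perf_counter()` around the call would give a rough speed comparison, though the numbers depend on the machine and on how many distinct ingredient combinations the data contains.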

Answer 2

Untested code, but it should help point toward a better approach:

import collections
import operator

equivalence_fields = operator.attrgetter('ingredient1', 'ingredient2', 'ingredient3')
relevant_fields = operator.attrgetter('baking_time', 'cake_price')

def irrelevant(cake1, cake2):
    """cake1 is irrelevant if it is both
       more expensive and takes longer to bake.
    """
    return cake1.cake_price > cake2.cake_price and cake1.baking_time > cake2.baking_time

# Group equivalent cakes together (cakes is the input list of Cake namedtuples)
equivalent_cakes = collections.defaultdict(list)
for cake in cakes:
    feature = equivalence_fields(cake)
    equivalent_cakes[feature].append(cake)

# Weed-out irrelevant cakes within an equivalence class.
# Note: each cake is only compared (strictly) against the single "best" cake of its group.
for feature, group in equivalent_cakes.items():
    best = min(group, key=relevant_fields)
    group[:] = [cake for cake in group if not irrelevant(cake, best)]
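
One way to carry the grouping idea through to the rule stated in the question is a sort-and-scan inside each group. This is a different technique from the snippet above, offered only as a sketch: the function name `relevant_cakes` is made up, and it assumes that a cake with the same price and baking time as an earlier one counts as irrelevant.

```python
from collections import defaultdict, namedtuple
from operator import attrgetter

Cake = namedtuple('Cake', ['cake_id',
                           'ingredient1', 'ingredient2', 'ingredient3',
                           'baking_time', 'cake_price'])

equivalence_fields = attrgetter('ingredient1', 'ingredient2', 'ingredient3')

def relevant_cakes(cakes):
    """Keep, per ingredient group, the cakes no other cake beats (hypothetical name)."""
    groups = defaultdict(list)
    for cake in cakes:
        groups[equivalence_fields(cake)].append(cake)

    result = []
    for group in groups.values():
        fastest = None  # shortest baking time among the cakes seen so far
        # Cheapest first; equal prices are ordered by shorter baking time.
        for cake in sorted(group, key=attrgetter('cake_price', 'baking_time')):
            # Every earlier cake is at least as cheap, so this one is only
            # relevant if it bakes strictly faster than all of them.
            if fastest is None or cake.baking_time < fastest:
                result.append(cake)
                fastest = cake.baking_time
    return result

dark = ('dark chocolate', 'cookies', 'strawberries')
list_of_cakes = [
    Cake(1, *dark, 60, 20),
    Cake(2, *dark, 80, 20),
    Cake(3, 'white chocolate', 'bananas', 'strawberries', 150, 100),
    Cake(4, *dark, 40, 30),
    Cake(5, *dark, 10, 80),
]
print(sorted(c.cake_id for c in relevant_cakes(list_of_cakes)))  # [1, 3, 4, 5]
```

Because each group is sorted once and then scanned once, the cost per group is dominated by the sort rather than by pairwise comparisons.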

Comparing namedtuples
