Is set union faster, or checking the whole list for duplicates?

Time: 2011-01-12 21:22:58

Tags: python set

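Sorry for the poorly worded title, but I previously asked a question about getting a list of unique items from two lists. People told me to convert the lists to sets and then take the union.

So now I'm wondering which is faster:

  1. When adding one item to a list, scan the whole list for duplicates.
  2. Make that one item a set and then union the sets.

In hindsight, I should probably just read up on sets...

    This is in Python, by the way. Sorry for not clarifying.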
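The two options in the question can be sketched as follows. This is only an illustration; the function names add_by_scanning and add_by_union are hypothetical and do not come from the question.

```python
def add_by_scanning(items, new_item):
    """Option 1: append to a list only if a linear scan finds no duplicate."""
    if new_item not in items:      # scans the list, O(len(items))
        items.append(new_item)
    return items


def add_by_union(items, new_item):
    """Option 2: wrap the item in a one-element set and union it in."""
    return items | {new_item}      # returns a new set; items is left unchanged


unique_list = add_by_scanning([1, 2, 3], 2)   # 2 is already present
unique_set = add_by_union({1, 2, 3}, 4)       # 4 is new
```

Note that option 2 builds a new set on every call; when a set is already at hand, calling its add method in place avoids that copy.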

4 answers:

Answer 0: (score: 20)

As you can see below, extending one list by another and then removing duplicates by making a set is the fastest way (at least in Python ;))

>>> def foo():
...     """
...     extending one list by another and then removing duplicates by making a set
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     l1.extend(l2)
...     set(l1)
... 
>>> def bar():
...     """
...     checking if an element is in one list and adding it only if not
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     for elem in l2:
...         if elem not in l1:
...             l1.append(elem)
... 
>>> def baz():
...     """
...     making sets from both lists and then union from them
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     set(l1) | set(l2)
... 
>>> from timeit import Timer
>>> Timer(foo).timeit(10000)
0.265153169631958
>>> Timer(bar).timeit(10000)
7.921358108520508
>>> Timer(baz).timeit(10000)
0.3845551013946533
>>> 

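Answer 1: (score: 9)

I really like the approach virhilo took, but he was testing with a pretty specific set of data. In all of these, don't just test the functions; test them the way you will actually use them. I put together a much more exhaustive set of tests. It runs each function you specify (with just a small decorator) through a list of comparisons, and works out how long each function takes and therefore how much slower it is. The result is that, without knowing more about the size, overlap and type of your data, it isn't always clear which function you should use.

Here is my test program; the output follows below.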

from timeit import Timer
from copy import copy
import itertools  # needed by chain_then_set below
import random
import sys

funcs = []

class timeMe(object):
    def __init__(self, f):
        funcs.append(f)
        self.f = f

    def __call__(self, *args, **kwargs):
        return self.f(*args, **kwargs)

@timeMe
def extend_list_then_set(input1, input2):
    """
    extending one list by another and then removing duplicates by making a set
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1.extend(l2)
    set(l1)

@timeMe
def per_element_append_to_list(input1, input2):
    """
    checking if an element is in one list and adding it only if not
    """
    l1 = copy(input1)
    l2 = copy(input2)
    for elem in l2:
        if elem not in l1:
            l1.append(elem)

@timeMe
def union_sets(input1, input2):
    """
    making sets from both lists and then union from them
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(l1) | set(l2)

@timeMe
def set_from_one_add_from_two(input1, input2):
    """
    make a set from list 1, then add the elements of list 2 one by one
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1 = set(l1)
    for element in l2:
        l1.add(element)

@timeMe
def set_from_one_union_two(input1, input2):
    """
    make set from list 1, then union list 2
    """
    l1 = copy(input1)
    l2 = copy(input2)
    x = set(l1).union(l2)

@timeMe
def chain_then_set(input1, input2):
    """
    chain l1 & l2, then make a set out of that
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(itertools.chain(l1, l2))

def run_results(l1, l2, times):
    for f in funcs:
        t = Timer('%s(l1, l2)' % f.__name__,
            'from __main__ import %s; l1 = %s; l2 = %s' % (f.__name__, l1, l2))
        yield (f.__name__, t.timeit(times))    

test_datasets = [
    ('original, small, some overlap', list(range(200)), list(range(150, 250)), 10000),
    ('no overlap: l1 = [1], l2 = [2..100]', [1], list(range(2, 100)), 10000),
    ('lots of overlap: l1 = [1], l2 = [1]*100', [1], [1]*100, 10000),
    ('50 random ints below 2000 in each', [random.randint(0, 2000) for x in range(50)], [random.randint(0, 2000) for x in range(50)], 10000),
    ('50 elements in each, no overlap', list(range(50)), list(range(51, 100)), 10000),
    ('50 elements in each, total overlap', list(range(50)), list(range(50)), 10000),
    ('500 random ints below 500 in each', [random.randint(0, 500) for x in range(500)], [random.randint(0, 500) for x in range(500)], 1000),
    ('500 random ints below 2000 in each', [random.randint(0, 2000) for x in range(500)], [random.randint(0, 2000) for x in range(500)], 1000),
    ('500 random ints below 200000 in each', [random.randint(0, 200000) for x in range(500)], [random.randint(0, 200000) for x in range(500)], 1000),
    ('500 elements in each, no overlap', list(range(500)), list(range(501, 1000)), 10000),
    ('500 elements in each, total overlap', list(range(500)), list(range(500)), 10000),
    ('10000 random ints below 200000 in each', [random.randint(0, 200000) for x in range(10000)], [random.randint(0, 200000) for x in range(10000)], 50),
    ('10000 elements in each, no overlap', list(range(10000)), list(range(10001, 20000)), 10),
    ('10000 elements in each, total overlap', list(range(10000)), list(range(10000)), 10),
    ('original lists 100 times', list(range(200))*100, list(range(150, 250))*100, 10),
]

fullresults = []
for description, l1, l2, times in test_datasets:
    print("Now running %s times: %s" % (times, description))
    results = list(run_results(l1, l2, times))
    speedresults = sorted(results, key=lambda x: x[1])
    for name, speed in results:
        finish = speedresults.index((name, speed)) + 1
        timesslower = speed / speedresults[0][1]
        fullresults.append((description, name, speed, finish, timesslower))
        print('\t%s %s %s %s' % (finish, ('%.2fx' % timesslower).ljust(10), name.ljust(40), speed))

print('')
import csv
out = csv.writer(sys.stdout)
out.writerow(('Test', 'Function', 'Speed', 'Place', 'timesslower'))
out.writerows(fullresults)

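Results

My point is to encourage you to test with your own data, so I don't want to dwell on specifics. However... the first method, extend_list_then_set, is the fastest on average, but set_from_one_union_two (x = set(l1).union(l2)) wins a few times. You can get more details if you run the script yourself.

The numbers I report are how many times slower each function is than the fastest function in that test. If it was the fastest, the number is 1.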

                                            Functions                                                                                                                           
Tests                                       extend_list_then_set     per_element_append_to_list    set_from_one_add_from_two  set_from_one_union_two     union_sets      chain_then_set
original, small, some overlap               1                          25.04                        1.53                        1.18                       1.39           1.08
no overlap: l1 = [1], l2 = [2..100]         1.08                       13.31                        2.10                        1                          1.27           1.07
lots of overlap: l1 = [1], l2 = [1]*100     1.10                        1.30                        2.43                        1                          1.25           1.05
50 random ints below 2000 in each           1                           7.76                        1.35                        1.20                       1.31           1   
50 elements in each, no overlap             1                           9.00                        1.48                        1.13                       1.18           1.10
50 elements in each, total overlap          1.08                        4.07                        1.64                        1.04                       1.41           1   
500 random ints below 500 in each           1.16                       68.24                        1.75                        1                          1.28           1.03
500 random ints below 2000 in each          1                         102.42                        1.64                        1.43                       1.81           1.20
500 random ints below 200000 in each        1.14                      118.96                        1.99                        1.52                       1.98           1   
500 elements in each, no overlap            1.01                      145.84                        1.86                        1.25                       1.53           1   
500 elements in each, total overlap         1                          53.10                        1.95                        1.16                       1.57           1.05          
10000 random ints below 200000 in each      1                        2588.99                        1.73                        1.35                       1.88           1.12
10000 elements in each, no overlap          1                        3164.01                        1.91                        1.26                       1.65           1.02
10000 elements in each, total overlap       1                        1068.67                        1.89                        1.26                       1.70           1.05
original lists 100 times                    1.11                     2068.06                        2.03                        1                          1.04           1.17

                                 Average    1.04                      629.25                       1.82                         1.19                       1.48           1.06
                      Standard Deviation    0.05                     1040.76                       0.26                         0.15                       0.26           0.05
                                     Max    1.16                     3164.01                       2.43                         1.52                       1.98           1.20
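The relative figures in the table are each timing divided by the fastest timing for that test. A minimal sketch of that calculation follows; the timings used here are made up for illustration and are not measured results.

```python
# Hypothetical timings in seconds, for illustration only (not measured).
timings = {"func_a": 0.5, "func_b": 2.0, "func_c": 1.0}

fastest = min(timings.values())

# Ratio of each timing to the fastest one; the fastest function gets 1.0.
times_slower = {name: t / fastest for name, t in timings.items()}
```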

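Answer 2: (score: 1)

The fastest thing you can do is build two sets from the lists and take their union. Both set construction from a list and set union are implemented in the runtime, in highly optimized C, so it is very fast.

In code, if the lists are l1 and l2, you can do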

unique_elems = set(l1) | set(l2)

Edit: as @kriss points out, extending l1 with l2 is faster. However, this code does not modify l1, and it also works if l1 and l2 are generic iterables.
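A small sketch of those two properties, using a generator as the second input; the variable name gen and the sample values are assumptions made for the example.

```python
l1 = [1, 2, 3]
gen = (n * 2 for n in range(4))   # a generator yielding 0, 2, 4, 6

# set() accepts any iterable, so the generator works as well as a list.
unique_elems = set(l1) | set(gen)

# l1 is still the original list; nothing was appended to it.
```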

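Answer 3: (score: 1)

It all depends on what you have as input and what you want as output.

If you have a list li at the start and want to end up with a modified list, the faster method is if elt not in li: li.append(elt). The problem is that converting the initial list to a set and then back to a list is far too slow.

But if you can work with a set s the whole time (you do not care about the order of the list, and the methods receiving it only need some iterable), simply doing s.add(elt) is faster.

If you have two lists at the start and want a list at the end, managing the uniqueness of items with a set is faster, even with the final conversion from list to set to list. You can also easily check, from the example in @virhilo's answer, that concatenating the two lists with extend and then converting the result to a set is faster than converting both lists to sets and performing a union.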
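As a rough sketch of the list-in, list-out case described above (the helper name merge_unique is hypothetical, and the order of the result is not guaranteed):

```python
def merge_unique(first, second):
    """Return a list of the unique items found in either input list."""
    combined = list(first)        # copy so the caller's list is untouched
    combined.extend(second)       # concatenate with extend
    return list(set(combined))    # deduplicate; element order is arbitrary


merged = merge_unique([1, 2, 3], [3, 4])
```

If the original order matters, this approach is not enough on its own, since a set does not keep insertion order.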

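I don't know exactly what the constraints of your program are, but if uniqueness is as important as it seems, and if keeping insertion order is not needed, I would advise you to use sets all the time and never convert them to lists. Most algorithms will work on them anyway thanks to duck typing, since both are just different kinds of iterables.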
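To illustrate the duck-typing point, here is a minimal sketch in which the same function accepts either a list or a set; the function total_length and the sample words are assumptions made for the example.

```python
def total_length(words):
    """Sum the lengths of the strings in any iterable (list, set, ...)."""
    return sum(len(w) for w in words)


seen = set()
for word in ["spam", "eggs", "spam"]:
    seen.add(word)                 # adding a duplicate leaves the set unchanged

from_set = total_length(seen)                        # iterates the set
from_list = total_length(["spam", "eggs", "spam"])   # iterates the list
```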