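Sorry for the poorly worded title, but I asked a question earlier about getting a unique list of items from two lists. People told me to turn the lists into sets and then take the union.
So now I'm wondering whether it's faster to:
In hindsight, I should probably have read up on this...
In Python, by the way. Sorry for not clarifying.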
Answer 0 (score: 20)
As you can see, extending one list with the other and then removing duplicates by building a set is the fastest way (at least in Python ;))
>>> def foo():
...     """
...     extending one list by another and then removing duplicates by making a set
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     l1.extend(l2)
...     set(l1)
...
>>> def bar():
...     """
...     checking whether an element is in the first list and adding it only if not
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     for elem in l2:
...         if elem not in l1:
...             l1.append(elem)
...
>>> def baz():
...     """
...     making sets from both lists and then taking their union
...     """
...     l1 = list(range(200))
...     l2 = list(range(150, 250))
...     set(l1) | set(l2)
...
>>> from timeit import Timer
>>> Timer(foo).timeit(10000)
0.265153169631958
>>> Timer(bar).timeit(10000)
7.921358108520508
>>> Timer(baz).timeit(10000)
0.3845551013946533
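Answer 1 (score: 9)
I really like virhilo's approach, but he is testing a very specific set of data. In all of this, don't just test the functions; test them the way you will actually be using them. I put together a much more exhaustive set of tests. It runs each function you register (with just a small decorator) against a list of datasets, measures how long each function takes, and from that works out how much slower it is than the fastest one. The upshot is that, without knowing more about the size, overlap, and type of your data, it is not always clear which function you should use.
Here is my test program, with the output below.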
from timeit import Timer
from copy import copy
import csv
import itertools
import random
import sys

# Registry of every function decorated with @timeMe
funcs = []


class timeMe(object):
    """Decorator that registers a function for benchmarking."""

    def __init__(self, f):
        funcs.append(f)
        self.f = f

    def __call__(self, *args, **kwargs):
        return self.f(*args, **kwargs)


@timeMe
def extend_list_then_set(input1, input2):
    """
    extending one list by another and then removing duplicates by making a set
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1.extend(l2)
    set(l1)


@timeMe
def per_element_append_to_list(input1, input2):
    """
    checking whether an element is in the first list and adding it only if not
    """
    l1 = copy(input1)
    l2 = copy(input2)
    for elem in l2:
        if elem not in l1:
            l1.append(elem)


@timeMe
def union_sets(input1, input2):
    """
    making sets from both lists and then taking their union
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(l1) | set(l2)


@timeMe
def set_from_one_add_from_two(input1, input2):
    """
    make a set from list 1, then add the elements of list 2
    """
    l1 = copy(input1)
    l2 = copy(input2)
    l1 = set(l1)
    for element in l2:
        l1.add(element)


@timeMe
def set_from_one_union_two(input1, input2):
    """
    make a set from list 1, then union it with list 2
    """
    l1 = copy(input1)
    l2 = copy(input2)
    x = set(l1).union(l2)


@timeMe
def chain_then_set(input1, input2):
    """
    chain l1 & l2, then make a set out of that
    """
    l1 = copy(input1)
    l2 = copy(input2)
    set(itertools.chain(l1, l2))


def run_results(l1, l2, times):
    for f in funcs:
        # The lists are embedded in the setup string through their repr
        t = Timer('%s(l1, l2)' % f.__name__,
                  'from __main__ import %s; l1 = %s; l2 = %s' % (f.__name__, l1, l2))
        yield (f.__name__, t.timeit(times))


# Each entry: (description, list 1, list 2, number of timeit repetitions)
test_datasets = [
    ('original, small, some overlap', list(range(200)), list(range(150, 250)), 10000),
    ('no overlap: l1 = [1], l2 = [2..100]', [1], list(range(2, 100)), 10000),
    ('lots of overlap: l1 = [1], l2 = [1]*100', [1], [1]*100, 10000),
    ('50 random ints below 2000 in each', [random.randint(0, 2000) for x in range(50)], [random.randint(0, 2000) for x in range(50)], 10000),
    ('50 elements in each, no overlap', list(range(50)), list(range(51, 100)), 10000),
    ('50 elements in each, total overlap', list(range(50)), list(range(50)), 10000),
    ('500 random ints below 500 in each', [random.randint(0, 500) for x in range(500)], [random.randint(0, 500) for x in range(500)], 1000),
    ('500 random ints below 2000 in each', [random.randint(0, 2000) for x in range(500)], [random.randint(0, 2000) for x in range(500)], 1000),
    ('500 random ints below 200000 in each', [random.randint(0, 200000) for x in range(500)], [random.randint(0, 200000) for x in range(500)], 1000),
    ('500 elements in each, no overlap', list(range(500)), list(range(501, 1000)), 10000),
    ('500 elements in each, total overlap', list(range(500)), list(range(500)), 10000),
    ('10000 random ints below 200000 in each', [random.randint(0, 200000) for x in range(10000)], [random.randint(0, 200000) for x in range(10000)], 50),
    ('10000 elements in each, no overlap', list(range(10000)), list(range(10001, 20000)), 10),
    ('10000 elements in each, total overlap', list(range(10000)), list(range(10000)), 10),
    ('original lists 100 times', list(range(200))*100, list(range(150, 250))*100, 10),
]

fullresults = []
for description, l1, l2, times in test_datasets:
    print("Now running %s times: %s" % (times, description))
    results = list(run_results(l1, l2, times))
    # Sort by elapsed time so index 0 is the fastest function
    speedresults = sorted(results, key=lambda x: x[1])
    for name, speed in results:
        finish = speedresults.index((name, speed)) + 1
        timesslower = speed / speedresults[0][1]
        fullresults.append((description, name, speed, finish, timesslower))
        print('\t', finish, ('%.2fx' % timesslower).ljust(10), name.ljust(40), speed)
    print()

# Dump all results as CSV on standard output
out = csv.writer(sys.stdout)
out.writerow(('Test', 'Function', 'Speed', 'Place', 'timesslower'))
out.writerows(fullresults)
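My point is mainly to encourage you to test with your own data, so I don't want to dwell on specifics. However... the first method (extend) is the fastest on average, but set_from_one_union_two (`x = set(l1).union(l2)`) wins a few times. You can get more details by running the script yourself.
The numbers I report are how many times slower each function is than the fastest function on that test. A function that was the fastest gets 1.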
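To make the "times slower" figure concrete, here is a minimal sketch of how such a ratio can be derived from raw timings. The helper name `relative_slowdown` and the sample timings are hypothetical and chosen only for illustration; they are not measurements from the script.

```python
def relative_slowdown(timings):
    """Map each function name to (its time / the fastest time in this test)."""
    fastest = min(timings.values())
    return {name: elapsed / fastest for name, elapsed in timings.items()}


# Hypothetical timings in seconds, for illustration only.
sample = {"extend_list_then_set": 0.2, "union_sets": 0.3, "chain_then_set": 0.2}
ratios = relative_slowdown(sample)
print(ratios["extend_list_then_set"])  # the fastest entry always gets 1.0
```

A value of 1.5 for `union_sets` would then read as "one and a half times slower than the fastest function on this dataset".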
| Tests \ Functions | extend_list_then_set | per_element_append_to_list | set_from_one_add_from_two | set_from_one_union_two | union_sets | chain_then_set |
|---|---|---|---|---|---|---|
| original, small, some overlap | 1 | 25.04 | 1.53 | 1.18 | 1.39 | 1.08 |
| no overlap: l1 = [1], l2 = [2..100] | 1.08 | 13.31 | 2.10 | 1 | 1.27 | 1.07 |
| lots of overlap: l1 = [1], l2 = [1]*100 | 1.10 | 1.30 | 2.43 | 1 | 1.25 | 1.05 |
| 50 random ints below 2000 in each | 1 | 7.76 | 1.35 | 1.20 | 1.31 | 1 |
| 50 elements in each, no overlap | 1 | 9.00 | 1.48 | 1.13 | 1.18 | 1.10 |
| 50 elements in each, total overlap | 1.08 | 4.07 | 1.64 | 1.04 | 1.41 | 1 |
| 500 random ints below 500 in each | 1.16 | 68.24 | 1.75 | 1 | 1.28 | 1.03 |
| 500 random ints below 2000 in each | 1 | 102.42 | 1.64 | 1.43 | 1.81 | 1.20 |
| 500 random ints below 200000 in each | 1.14 | 118.96 | 1.99 | 1.52 | 1.98 | 1 |
| 500 elements in each, no overlap | 1.01 | 145.84 | 1.86 | 1.25 | 1.53 | 1 |
| 500 elements in each, total overlap | 1 | 53.10 | 1.95 | 1.16 | 1.57 | 1.05 |
| 10000 random ints below 200000 in each | 1 | 2588.99 | 1.73 | 1.35 | 1.88 | 1.12 |
| 10000 elements in each, no overlap | 1 | 3164.01 | 1.91 | 1.26 | 1.65 | 1.02 |
| 10000 elements in each, total overlap | 1 | 1068.67 | 1.89 | 1.26 | 1.70 | 1.05 |
| original lists 100 times | 1.11 | 2068.06 | 2.03 | 1 | 1.04 | 1.17 |
| Average | 1.04 | 629.25 | 1.82 | 1.19 | 1.48 | 1.06 |
| Standard Deviation | 0.05 | 1040.76 | 0.26 | 0.15 | 0.26 | 0.05 |
| Max | 1.16 | 3164.01 | 2.43 | 1.52 | 1.98 | 1.20 |
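Answer 2 (score: 1)
The fastest thing you can do is build two sets from the lists and take their union. Both building a set from a list and set union are implemented in the runtime, in highly optimized C, so this is very fast.
In code, if the lists are `l1` and `l2`, you can do
unique_elems = set(l1) | set(l2)
EDIT: as @kriss points out, extending `l1` with `l2` is faster. However, this code does not modify `l1`, and it also works if `l1` and `l2` are generic iterables.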
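A small sketch of the two properties mentioned in the edit, assuming Python 3: the union leaves `l1` untouched, and the same expression accepts arbitrary iterables such as generators. The sample values are made up for illustration.

```python
l1 = [1, 2, 3, 3]
l2 = [3, 4, 5]

unique_elems = set(l1) | set(l2)
print(sorted(unique_elems))  # [1, 2, 3, 4, 5]
print(l1)                    # [1, 2, 3, 3], the input list is unchanged

# The same expression works for any iterable, here two generator expressions.
gen1 = (x * 2 for x in range(3))   # yields 0, 2, 4
gen2 = (x for x in range(3, 6))    # yields 3, 4, 5
from_generators = set(gen1) | set(gen2)
print(sorted(from_generators))     # [0, 2, 3, 4, 5]
```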
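Answer 3 (score: 1)
It all depends on what you have as input and what you want as output.
If you start with a list `li` and want a modified list at the end, the faster method is `if elt not in li: li.append(elt)`. The problem is that converting the initial list to a set and then back to a list is far too slow.
But if you can work with a set `s` the whole time (you don't care about the order of the list, and the code receiving it only needs some iterable), simply calling `s.add(elt)` is faster.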
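The two single-element cases described here, side by side. This is only a sketch: `li`, `s`, and `elt` follow the names used in this answer, and the sample values are arbitrary.

```python
li = [3, 1, 2]
s = {3, 1, 2}

for elt in (2, 4):
    # List: linear scan for membership, append only if the element is missing.
    if elt not in li:
        li.append(elt)
    # Set: add() silently ignores elements that are already present.
    s.add(elt)

print(li)                 # [3, 1, 2, 4], the original order is kept
print(s == {1, 2, 3, 4})  # True
```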
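If you start with two lists and want a list at the end, managing uniqueness with sets is faster even with the final conversion from set back to list. And as you can easily check from the example in @virhilo's answer, concatenating the two lists with extend and then converting the result to a set is faster than converting both lists to sets and taking the union.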
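As a sketch of the list-in, list-out case: concatenate with `extend`, deduplicate through a set, and convert back to a list. The helper name `merge_unique` is hypothetical, and the order of the resulting list is not guaranteed.

```python
def merge_unique(l1, l2):
    """Return a list of the distinct elements of l1 and l2 (order not guaranteed)."""
    combined = list(l1)      # copy, so the caller's list is not modified
    combined.extend(l2)
    return list(set(combined))


merged = merge_unique([1, 2, 3], [3, 4])
print(sorted(merged))  # [1, 2, 3, 4]
```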
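I don't know exactly what the constraints of your program are, but if uniqueness is as important as it seems and you don't need to preserve insertion order, you would be well advised to use sets throughout and never convert them to lists. Thanks to duck typing, most algorithms will work anyway, since both are just different kinds of iterables.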
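To illustrate the duck-typing point, a minimal sketch: a function that only iterates over its argument works unchanged with a list or a set. The function `total_length` is a made-up example.

```python
def total_length(words):
    """Sum the lengths of all strings in any iterable of strings."""
    return sum(len(word) for word in words)


as_list = ["spam", "eggs", "ham"]
as_set = set(as_list)

print(total_length(as_list))  # 11
print(total_length(as_set))   # 11
```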