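Suppose there are two lists, where a, b, c, d are ints.
l1 = [[a,b],[c,d],...[x,y]] # len(l1) = 1 million
l2 = [[a,b],[c,d],...[x,y]] # len(l2) = 10k
I want:
l3 = [[c,d],...[x,y]] # items [c,d] in l1 but not in l2
Use case: l1 holds my paired test results and l2 holds the false positives. I want to remove the false positives so that I am left with the true positives.
My attempt 1: a double loop, slow, about 30 minutes.
My attempt 2: a double loop, but in the inner loop, when an item matches, remove it from l2 so that the next iteration has one fewer item to search in l2. Still slow, about 28 minutes.
My attempt 3: set does not work.
Any better-performing approach is welcome.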
Answer 0 (score: 3)
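You can simply convert the inner lists to tuples. Python lists are not hashable, so you cannot put them in a set. I suppose that is why your third attempt did not work.
If your inner elements are hashable (e.g. int, float or str) and your inner lists are converted to tuples, you can convert the outer lists to sets and compute set1 - set2.
You lose the order, but the operation is very fast:
import random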
l1 = [[random.randrange(10), random.randrange(10)] for _ in range(100)]
l2 = [[random.randrange(10), random.randrange(10)] for _ in range(100)]
# set(l1)
## TypeError: unhashable type: 'list'
set1 = {tuple(l) for l in l1}
set2 = {tuple(l) for l in l2}
print(l1)
# [[0, 9], [9, 7], [8, 7], [6, 7], [1, 4], [2, 7], [8, 9], [8, 7], [5, 0], [3, 8], [8, 1], [6, 3], [5, 2], [0, 5], [2, 0], [2, 4], [7, 8], [2, 3], [4, 6], [4, 4], [3, 1], [7, 5], [2, 6], [8, 5], [6, 0], [0, 0], [4, 8], [5, 2], [1, 8], [6, 8], [9, 7], [0, 8], [5, 5], [4, 6], [0, 7], [0, 8], [7, 8], [5, 3], [2, 4], [1, 0], [8, 8], [6, 5], [8, 9], [7, 0], [8, 0], [1, 1], [1, 3], [2, 6], [3, 8], [7, 2], [6, 8], [3, 9], [1, 9], [9, 8], [3, 8], [1, 2], [1, 1], [2, 5], [7, 8], [3, 9], [0, 6], [9, 4], [4, 6], [9, 6], [8, 9], [7, 2], [4, 6], [9, 0], [0, 7], [0, 1], [5, 6], [5, 1], [1, 5], [9, 1], [8, 9], [4, 5], [4, 0], [4, 2], [1, 7], [9, 7], [4, 7], [1, 6], [9, 2], [7, 0], [9, 8], [3, 7], [9, 9], [9, 9], [0, 7], [3, 0], [0, 4], [4, 7], [9, 9], [0, 4], [9, 1], [2, 9], [7, 7], [5, 6], [6, 4], [7, 4]]
print(l2)
# [[0, 4], [2, 0], [1, 2], [9, 0], [8, 0], [2, 0], [5, 6], [6, 2], [2, 5], [0, 1], [9, 7], [8, 1], [3, 5], [3, 5], [3, 1], [0, 4], [4, 1], [1, 1], [3, 3], [0, 8], [3, 3], [5, 8], [1, 3], [0, 9], [6, 6], [4, 4], [6, 9], [0, 4], [5, 5], [0, 8], [4, 5], [4, 1], [0, 8], [2, 2], [2, 9], [1, 1], [7, 2], [8, 3], [6, 3], [1, 0], [6, 0], [4, 8], [1, 4], [8, 2], [9, 7], [5, 9], [6, 3], [7, 2], [9, 7], [8, 3], [8, 6], [3, 6], [7, 8], [9, 4], [1, 2], [6, 1], [1, 7], [5, 0], [8, 6], [7, 5], [0, 0], [6, 9], [1, 3], [0, 0], [8, 9], [6, 2], [4, 6], [0, 9], [2, 8], [7, 1], [3, 1], [0, 9], [1, 5], [7, 8], [3, 6], [8, 6], [1, 2], [0, 6], [5, 2], [9, 3], [0, 6], [3, 2], [8, 6], [3, 1], [8, 6], [9, 6], [6, 2], [8, 4], [7, 3], [7, 9], [4, 9], [1, 3], [2, 2], [9, 2], [8, 4], [6, 8], [7, 6], [8, 9], [5, 2], [6, 4]]
print(set1 - set2)
# {(4, 7), (9, 1), (3, 0), (9, 8), (7, 7), (0, 7), (1, 6), (3, 7), (5, 1), (8, 5), (4, 0), (6, 7), (2, 6), (3, 9), (0, 5), (2, 3), (8, 7), (1, 9), (4, 2), (6, 5), (5, 3), (2, 7), (7, 0), (9, 9), (3, 8), (7, 4), (1, 8), (8, 8), (2, 4)}
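With len(l1) = 1 million and len(l2) = 10k, the subtraction takes a few seconds.
If you want to keep the order and the duplicates, you can iterate over l1 and keep each inner list whose tuple is not in set2. It should be much faster than a double loop.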
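The order-preserving filter described above can be sketched as follows. This is a minimal illustration, assuming l1 and l2 are lists of int pairs as in the question; the sample values and the loop variable name pair are made up for the example.

```python
# Illustrative sample data; in the question l1 has about 1 million pairs
# and l2 about 10k.
l1 = [[1, 2], [3, 4], [1, 2], [5, 6], [7, 8]]
l2 = [[3, 4], [7, 8], [9, 9]]

# Build the lookup set once; membership tests on a set are O(1) on average.
set2 = {tuple(pair) for pair in l2}

# Keep every pair of l1 (original order, duplicates included) not found in l2.
l3 = [pair for pair in l1 if tuple(pair) not in set2]

print(l3)
# [[1, 2], [1, 2], [5, 6]]
```

Only l2 needs to become a set here, so l1 is traversed once and its order and duplicates are left untouched.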
Answer 1 (score: 0)
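Since the double loop is O(X*Y), where X and Y are the lengths of the lists, an alternative is to sort both lists, which costs O(X*log(X)) + O(Y*log(Y)), and then walk through the two sorted lists together in a single pass, like this: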
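# X and Y are the two lists, already sorted
i = 0
j = 0
# stop before indexing past the end of either list
while i < len(X) and j < len(Y):
    x_el = X[i]
    y_el = Y[j]
    # compare x_el with y_el, act on the result, and increase i and/or j accordingly
# any elements still left in X at this point have no match in Y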
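One way the loop body could be filled in for the "in l1 but not in l2" case is sketched below. The function name sorted_difference and the sample inputs are hypothetical, and the inner lists are converted to tuples and sorted inside the function, so the result comes back in sorted order rather than in the original order of the first list.

```python
def sorted_difference(xs, ys):
    """Return the pairs of xs that do not occur in ys, in sorted order.

    Hypothetical helper: xs and ys are lists of int pairs.
    """
    X = sorted(map(tuple, xs))
    Y = sorted(map(tuple, ys))
    result = []
    i = j = 0
    while i < len(X) and j < len(Y):
        if X[i] < Y[j]:
            # Y has already moved past X[i], so X[i] cannot be in Y.
            result.append(X[i])
            i += 1
        elif X[i] == Y[j]:
            # Match found: drop X[i]; j stays put so duplicates in X are dropped too.
            i += 1
        else:
            # Y[j] is smaller than X[i]; advance in Y.
            j += 1
    # Y is exhausted: everything left in X has no match.
    result.extend(X[i:])
    return [list(pair) for pair in result]


print(sorted_difference([[5, 6], [1, 2], [3, 4], [1, 2]], [[3, 4], [7, 8]]))
# [[1, 2], [1, 2], [5, 6]]
```

The merge itself is linear in len(X) + len(Y), so the two sorts dominate the running time.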