Efficient algorithm for intersecting two large lists

Time: 2017-11-08 07:51:51

Tags: python algorithm

Suppose I have two lists of pairs, where a, b, c, d are ints.

l1 = [[a,b],[c,d],...[x,y]] # len(l1) = 1 million
l2 = [[a,b],[c,d],...[x,y]] # len(l2) = 10k

l3 = [[c,d],...[x,y]] # items [c,d] in l1 but not l2

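Usage: l1 holds my pairwise test results and l2 holds the false positives. I want to remove the false positives so that I am left with the true positives.

My attempt 1: a double loop, slow, about 30 minutes.

My attempt 3: using a set, which did not work.

My attempt 2: a double loop, but in the inner loop, when an item matches, I remove it from l2 so that the next iteration searches one item fewer in l2. Still slow, 28 min-ish.

Any better-performing approach is welcome.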

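2 answers:

Answer 0: (score: 3)

You can simply convert the inner lists to tuples. Python lists are not hashable, so you cannot put them in a set. I guess that is why your third attempt did not work.

Set subtraction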
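As a small illustration of the hashability point above, here is a minimal sketch. The helper name is_hashable is made up for this example and is not part of the original answer.

```python
def is_hashable(value):
    """Return True if value can be used as a set element or dict key."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


pair = [1, 2]
print(is_hashable(pair))         # False: lists are mutable and unhashable
print(is_hashable(tuple(pair)))  # True: a tuple of ints is hashable

# Once converted, the pair can be stored in a set and looked up directly.
seen = {tuple(pair)}
print((1, 2) in seen)            # True
```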

If your inner elements are hashable (e.g. int, float, string) and your inner lists are converted to tuples, you can convert the outer lists to sets and compute set1 - set2.

You lose the order, but the operation is very fast:

import random

l1 = [[random.randrange(10), random.randrange(10)] for _ in range(100)]
l2 = [[random.randrange(10), random.randrange(10)] for _ in range(100)]

# set(l1)
## TypeError: unhashable type: 'list'

set1 = {tuple(l) for l in l1}
set2 = {tuple(l) for l in l2}

print(l1)
# [[0, 9], [9, 7], [8, 7], [6, 7], [1, 4], [2, 7], [8, 9], [8, 7], [5, 0], [3, 8], [8, 1], [6, 3], [5, 2], [0, 5], [2, 0], [2, 4], [7, 8], [2, 3], [4, 6], [4, 4], [3, 1], [7, 5], [2, 6], [8, 5], [6, 0], [0, 0], [4, 8], [5, 2], [1, 8], [6, 8], [9, 7], [0, 8], [5, 5], [4, 6], [0, 7], [0, 8], [7, 8], [5, 3], [2, 4], [1, 0], [8, 8], [6, 5], [8, 9], [7, 0], [8, 0], [1, 1], [1, 3], [2, 6], [3, 8], [7, 2], [6, 8], [3, 9], [1, 9], [9, 8], [3, 8], [1, 2], [1, 1], [2, 5], [7, 8], [3, 9], [0, 6], [9, 4], [4, 6], [9, 6], [8, 9], [7, 2], [4, 6], [9, 0], [0, 7], [0, 1], [5, 6], [5, 1], [1, 5], [9, 1], [8, 9], [4, 5], [4, 0], [4, 2], [1, 7], [9, 7], [4, 7], [1, 6], [9, 2], [7, 0], [9, 8], [3, 7], [9, 9], [9, 9], [0, 7], [3, 0], [0, 4], [4, 7], [9, 9], [0, 4], [9, 1], [2, 9], [7, 7], [5, 6], [6, 4], [7, 4]]
print(l2)
# [[0, 4], [2, 0], [1, 2], [9, 0], [8, 0], [2, 0], [5, 6], [6, 2], [2, 5], [0, 1], [9, 7], [8, 1], [3, 5], [3, 5], [3, 1], [0, 4], [4, 1], [1, 1], [3, 3], [0, 8], [3, 3], [5, 8], [1, 3], [0, 9], [6, 6], [4, 4], [6, 9], [0, 4], [5, 5], [0, 8], [4, 5], [4, 1], [0, 8], [2, 2], [2, 9], [1, 1], [7, 2], [8, 3], [6, 3], [1, 0], [6, 0], [4, 8], [1, 4], [8, 2], [9, 7], [5, 9], [6, 3], [7, 2], [9, 7], [8, 3], [8, 6], [3, 6], [7, 8], [9, 4], [1, 2], [6, 1], [1, 7], [5, 0], [8, 6], [7, 5], [0, 0], [6, 9], [1, 3], [0, 0], [8, 9], [6, 2], [4, 6], [0, 9], [2, 8], [7, 1], [3, 1], [0, 9], [1, 5], [7, 8], [3, 6], [8, 6], [1, 2], [0, 6], [5, 2], [9, 3], [0, 6], [3, 2], [8, 6], [3, 1], [8, 6], [9, 6], [6, 2], [8, 4], [7, 3], [7, 9], [4, 9], [1, 3], [2, 2], [9, 2], [8, 4], [6, 8], [7, 6], [8, 9], [5, 2], [6, 4]]
print(set1 - set2)
# {(4, 7), (9, 1), (3, 0), (9, 8), (7, 7), (0, 7), (1, 6), (3, 7), (5, 1), (8, 5), (4, 0), (6, 7), (2, 6), (3, 9), (0, 5), (2, 3), (8, 7), (1, 9), (4, 2), (6, 5), (5, 3), (2, 7), (7, 0), (9, 9), (3, 8), (7, 4), (1, 8), (8, 8), (2, 4)}

For len(l1) = 1 million and len(l2) = 10k, the subtraction takes a few seconds.
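To check the timing on your own machine, something like the following rough sketch could be used. The function name timed_set_difference and its parameters (n1, n2, max_value, seed) are assumptions made for this example, and the measured time will depend on hardware and Python version.

```python
import random
import time


def timed_set_difference(n1, n2, max_value=1000, seed=0):
    """Build two random lists of int pairs and time the set subtraction.

    Hypothetical helper: all names and defaults are illustrative only.
    """
    rng = random.Random(seed)
    l1 = [[rng.randrange(max_value), rng.randrange(max_value)] for _ in range(n1)]
    l2 = [[rng.randrange(max_value), rng.randrange(max_value)] for _ in range(n2)]

    start = time.perf_counter()
    # Convert the inner lists to tuples so they can be stored in sets.
    set1 = {tuple(pair) for pair in l1}
    set2 = {tuple(pair) for pair in l2}
    difference = set1 - set2
    elapsed = time.perf_counter() - start
    return l1, l2, difference, elapsed
```

Calling timed_set_difference(1_000_000, 10_000) reproduces the list sizes from the question. The last element of the returned tuple is the elapsed time in seconds for the conversion plus the subtraction.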

List comprehension with a set membership test

If you want to keep the order and the duplicates of l1, you can iterate over l1 and keep each inner list only if its tuple is not in set2.

It should be much faster than a double loop.
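The membership-test approach described above can be sketched as follows. The code for this step is missing from the answer as it appears here, so this is a reconstruction under that assumption, and the function name remove_false_positives is invented for the example.

```python
def remove_false_positives(l1, l2):
    """Return the items of l1 that are not in l2, keeping order and duplicates."""
    # Only l2 needs to become a set; each lookup is then O(1) on average.
    set2 = {tuple(pair) for pair in l2}
    return [pair for pair in l1 if tuple(pair) not in set2]


l1 = [[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]]
l2 = [[3, 4], [7, 8]]
print(remove_false_positives(l1, l2))  # [[1, 2], [1, 2], [5, 6]]
```

Because only the smaller list is turned into a set, the overall cost is roughly one pass over l2 plus one pass over l1.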

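Answer 1: (score: 0)

Since the double loop is O(X*Y), where X and Y are the lengths of the lists, a simpler solution is to sort both lists, which costs O(X*log(X)) + O(Y*log(Y)), and then iterate over the two sorted lists together like this: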

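i = 0
j = 0
# X and Y are the two sorted lists; checking the bounds before indexing
# avoids an IndexError once either list is exhausted
while i < len(X) and j < len(Y):
  x_el = X[i]
  y_el = Y[j]
  # do something with it and increase i and/or j accordingly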
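One way the "do something with it" step could be filled in for the difference l1 minus l2 is sketched below. This is an assumption about what the answer intends rather than its actual code; the function name sorted_difference is hypothetical, and the result comes back in sorted order rather than in the original order of l1.

```python
def sorted_difference(l1, l2):
    """Return the items of l1 that are not in l2 by merging two sorted lists."""
    X = sorted(tuple(pair) for pair in l1)
    Y = sorted(tuple(pair) for pair in l2)
    result = []
    i = 0
    j = 0
    while i < len(X) and j < len(Y):
        if X[i] < Y[j]:
            # X[i] is smaller than everything left in Y, so it is not in Y.
            result.append(list(X[i]))
            i += 1
        elif X[i] == Y[j]:
            # Match found: drop X[i]; keep j so duplicates in X are dropped too.
            i += 1
        else:
            # Y[j] is smaller than X[i]; move on in Y.
            j += 1
    # Y is exhausted: everything left in X cannot be in Y.
    result.extend(list(pair) for pair in X[i:])
    return result


l1 = [[3, 4], [1, 2], [5, 6], [1, 2], [7, 8]]
l2 = [[1, 2], [7, 8], [9, 9]]
print(sorted_difference(l1, l2))  # [[3, 4], [5, 6]]
```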