Running on Python 3.3
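I am trying to create an efficient algorithm to pull all of the similar elements between two lists. The problem is twofold. First, I cannot seem to find any algorithm for this online. Second, there should be a more efficient way.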
By 'similar elements', I mean two elements that are equal in value (whether string, int, or some other type).
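Currently, I am using a greedy approach: since largeList and smallList are sorted, we can save the last index that was visited (largeIndex) and continue from there. At the moment, the running time seems to be O(n log(n)) on average. This can be seen by running the test cases listed after this code block.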
Right now, my code looks like this:
def compare(small,large,largeStart,largeEnd):
    for i in range(largeStart, largeEnd):
        if small==large[i]:
            return [1,i]
        if small<large[i]:
            if i!=0:
                return [0,i-1]
            else:
                return [0, i]
    return [0,largeStart]
def determineLongerList(aList, bList):
    if len(aList)>len(bList):
        return (aList, bList)
    elif len(aList)<len(bList):
        return (bList, aList)
    else:
        return (aList, bList)
def compareElementsInLists(aList, bList):
    import time
    startTime = time.time()
    holder = determineLongerList(aList, bList)
    sameItems = []
    iterations = 0
    ##########################################
    smallList = sorted(holder[1])
    smallLength = len(smallList)
    smallIndex = 0
    largeList = sorted(holder[0])
    largeLength = len(largeList)
    largeIndex = 0
    while (smallIndex<smallLength):
        boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
        if boolean[0]==1:
            #`compare` returns 1 as True
            sameItems.append(smallList[smallIndex])
            oldIndex = largeIndex
            largeIndex = boolean[1]
        else:
            #else no match and possible new index
            oldIndex = largeIndex
            largeIndex = boolean[1]
        smallIndex+=1
        iterations =largeIndex-oldIndex+iterations+1
    print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
    print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
    return sameItems
And here are some test cases:
def testLargest():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,1000000)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
#testLargest()
'''
One rendition of testLargest:
******************************************
CREATING LISTS TOOK: 21.009342908859253
******************************************
RAN 999998 OUT OF 1000000000000 POSSIBLE
RATIO: 9.99998e-07
COMPARING LISTS TOOK: 13.99990701675415
NUMBER OF SAME ITEMS: 632328
******************************************
'''
def testLarge():
    import time
    from random import randint
    print('\n\n******************************************\n')
    start_time = time.time()
    lis = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis.append(ran)
    lis2 = []
    for i in range(0,1000000):
        ran = randint(0,100)
        lis2.append(ran)
    timeTaken = time.time()-start_time
    print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
    print('\n******************************************')
    start_time = time.time()
    c = compareElementsInLists(lis, lis2)
    timeTaken = time.time()-start_time
    print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
    print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
    print('\n******************************************')
testLarge()
Answer 0 (score: 1)
I timed this with the IPython %%timeit magic, and it does not compare favorably with a plain set() intersection.
Setup:
import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]
Using compareElementsInLists:
%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop
Versus set intersection:
%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop
Just to make sure we get the same results:
>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}
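For readers without IPython, roughly the same comparison can be sketched with the standard timeit module. This is only an illustration: sorted_merge_intersection is a hypothetical helper written for this sketch (a simple sort-then-scan stand-in, not the asker's compareElementsInLists), and the fixed seed is an assumption made so that repeated runs use the same data. Actual timings will vary by machine.

```python
import random
import timeit


def sorted_merge_intersection(a_list, b_list):
    """Hypothetical helper: distinct common values via sorting plus one linear scan."""
    a_sorted, b_sorted = sorted(a_list), sorted(b_list)
    i = j = 0
    common = []
    while i < len(a_sorted) and j < len(b_sorted):
        if a_sorted[i] < b_sorted[j]:
            i += 1
        elif a_sorted[i] > b_sorted[j]:
            j += 1
        else:
            # Record the value once, even if it is repeated in either list.
            if not common or common[-1] != a_sorted[i]:
                common.append(a_sorted[i])
            i += 1
            j += 1
    return common


random.seed(0)  # assumption: fixed seed so every run sees the same lists
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]

# Both approaches should agree on the distinct common values.
assert sorted_merge_intersection(alist, blist) == sorted(set(alist) & set(blist))

merge_time = timeit.timeit(lambda: sorted_merge_intersection(alist, blist), number=1000)
set_time = timeit.timeit(lambda: set(alist) & set(blist), number=1000)
print('sorted merge:     {:.3f} s per 1000 runs'.format(merge_time))
print('set intersection: {:.3f} s per 1000 runs'.format(set_time))
```

Note that the helper reports each common value once, which is what the set intersection does as well.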
Answer 1 (score: 1)
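If you are just searching for all elements that appear in both lists, you should use a data type designed for this kind of task. In this case, a set or a bag is appropriate. Both are represented internally by a hashing mechanism, which is more efficient than searching in sorted lists. (collections.Counter is a suitable bag.)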
If you do not care about duplicated elements, then a set is fine.
a = set(listA)
print(a.intersection(listB))  # print() call form, since the question targets Python 3.3
This prints every element that occurs in both listA and listB. (Elements duplicated in the input are not duplicated in the output.)
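As a small illustration of the duplicate handling, here is a sketch with made-up data; list_a and list_b are hypothetical example inputs, not taken from the question.

```python
# Hypothetical inputs mixing ints and strings, with repeated values.
list_a = ['apple', 3, 3, 'pear', 7]
list_b = [3, 'pear', 'pear', 10]

# intersection() accepts any iterable, so only one side needs to be a set.
common = set(list_a).intersection(list_b)
print(common)

# 3 and 'pear' each appear once in the result, however often they repeat.
assert common == {3, 'pear'}
```

The print order of a set is not guaranteed, so the result is checked by equality rather than by its printed form.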
import collections
a = collections.Counter(listA)
b = collections.Counter(listB)
print(a & b)  # print() call form, since the question targets Python 3.3
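This prints, for each element found in both lists, how many times it occurs in both (the smaller of its two counts).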
I have not done any measurements, but I am pretty sure these solutions are faster than your own attempt.
To convert a counter c back into a list of all the elements it represents, you can use list(c.elements()).
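Putting the bag approach together, a minimal sketch might look like the following. The inputs list_a and list_b are hypothetical example data, and the result is sorted only to make the output order predictable.

```python
import collections

# Hypothetical inputs with repeated values.
list_a = [1, 2, 2, 3, 3, 3]
list_b = [2, 2, 2, 3, 4]

# & keeps each element with the minimum of its two counts.
shared = collections.Counter(list_a) & collections.Counter(list_b)

# elements() repeats each element as many times as its count.
common_with_repeats = sorted(shared.elements())
print(common_with_repeats)  # [2, 2, 3]
```

Here 2 occurs twice in list_a and three times in list_b, so it appears twice in the result; 3 occurs only once in list_b, so it appears once.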