List comparison algorithm: how can it be made better?

Date: 2015-04-05 23:45:49

Tags: python algorithm python-3.x

Running on Python 3.3

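I am trying to create an efficient algorithm that pulls all the similar elements out of two lists. The problem is twofold. First, I cannot seem to find any algorithm for this online. Second, there should be a more efficient way than what I have.

By 'similar elements', I mean two elements that are equal in value (whether string, int, or anything else).

Currently, I am using a greedy approach: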

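  1. Sort the lists that are to be compared,
  2. compare each element in the shorter list with the elements in the larger list,
  3. since largeList and smallList are sorted, save the last index that was visited,
  4. and continue from that previous index (largeIndex).
  5. Currently, the running time seems to average O(n log(n)). This can be seen by running the test cases listed after the code block.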

    Right now, my code looks like this:

      def compare(small,large,largeStart,largeEnd):
            for i in range(largeStart, largeEnd):
                  if small==large[i]:
                        return [1,i]
                  if small<large[i]:
                        if i!=0:
                              return [0,i-1]
                        else:
                              return [0, i]
            return [0,largeStart]
    
      def determineLongerList(aList, bList):
        if len(aList)>len(bList):
            return (aList, bList)
        elif len(aList)<len(bList):
            return (bList, aList)
        else:
            return (aList, bList)
    
      def compareElementsInLists(aList, bList):
            import time
            startTime   = time.time()
            holder      = determineLongerList(aList, bList)
            sameItems   = []
            iterations  = 0
            ##########################################
            smallList   = sorted(holder[1])
            smallLength = len(smallList)
            smallIndex  = 0
            largeList   = sorted(holder[0])
            largeLength = len(largeList)
            largeIndex  = 0
            while (smallIndex<smallLength):
                  boolean = compare(smallList[smallIndex],largeList,largeIndex,largeLength)
                  if boolean[0]==1:
                        #`compare` returns 1 as True
                        sameItems.append(smallList[smallIndex])
                        oldIndex    = largeIndex
                        largeIndex  = boolean[1]
                  else:
                        #else no match and possible new index
                        oldIndex    = largeIndex
                        largeIndex  = boolean[1]
                  smallIndex+=1
                  iterations =largeIndex-oldIndex+iterations+1
            print('RAN {it} OUT OF {mathz} POSSIBLE'.format(it=iterations, mathz=smallLength*largeLength))
            print('RATIO:\t\t'+str(iterations/(smallLength*largeLength))+'\n')
            return sameItems
    

    And here are some test cases:

      def testLargest():
            import time
            from random import randint
            print('\n\n******************************************\n')
            start_time  = time.time()
            lis   = []
            for i in range(0,1000000):
                  ran   = randint(0,1000000)
                  lis.append(ran)
            lis2  = []
            for i in range(0,1000000):
                  ran   = randint(0,1000000)
                  lis2.append(ran)
            timeTaken   = time.time()-start_time     
            print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
            print('\n******************************************')
            start_time  = time.time()
            c           = compareElementsInLists(lis, lis2)
            timeTaken   = time.time()-start_time     
            print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
            print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
            print('\n******************************************')
    
      #testLargest()
    
      '''
      One rendition of testLargest:
            ******************************************
    
            CREATING LISTS TOOK:        21.009342908859253
    
            ******************************************
            RAN 999998 OUT OF 1000000000000 POSSIBLE
            RATIO:      9.99998e-07
    
            COMPARING LISTS TOOK:       13.99990701675415
            NUMBER OF SAME ITEMS:       632328
    
            ******************************************
      '''
    
      def testLarge():
            import time
            from random import randint
            print('\n\n******************************************\n')
            start_time  = time.time()
            lis   = []
            for i in range(0,1000000):
                  ran   = randint(0,100)
                  lis.append(ran)
            lis2  = []
            for i in range(0,1000000):
                  ran   = randint(0,100)
                  lis2.append(ran)
            timeTaken   = time.time()-start_time     
            print('CREATING LISTS TOOK:\t\t'+str(timeTaken))
            print('\n******************************************')
            start_time  = time.time()
            c           = compareElementsInLists(lis, lis2)
            timeTaken   = time.time()-start_time     
            print('COMPARING LISTS TOOK:\t\t'+str(timeTaken))
            print('NUMBER OF SAME ITEMS:\t\t'+str(len(c)))
            print('\n******************************************')
    
      testLarge()
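
For comparison, the sort-then-scan idea in steps 1 to 4 is often written as a two-pointer merge. The following is only a minimal sketch and not the code above: `sorted_intersection` is a hypothetical name, and unlike `compareElementsInLists` it reports each common value once, even if it is repeated in the inputs. Sorting costs O(n log n + m log m), and the scan afterwards advances at least one pointer per step, so it is O(n + m).

```python
def sorted_intersection(a_list, b_list):
    """Return the distinct values present in both lists, in ascending order.

    Hypothetical helper for illustration; values must be mutually comparable.
    """
    a, b = sorted(a_list), sorted(b_list)
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1          # a[i] is too small to match anything left in b
        elif a[i] > b[j]:
            j += 1          # b[j] is too small to match anything left in a
        else:
            # Equal values: record once, then skip duplicates on both sides.
            value = a[i]
            common.append(value)
            while i < len(a) and a[i] == value:
                i += 1
            while j < len(b) and b[j] == value:
                j += 1
    return common


print(sorted_intersection([3, 1, 2, 2, 5], [2, 5, 5, 7]))  # [2, 5]
```

Because neither pointer ever moves backwards, no element of either sorted list is visited more than once.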
    

2 Answers:

Answer 0 (score: 1)

Timed using IPython magic; your function does not compare favourably with a plain set() intersection.
Setup:

import random
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]

Compare elements:

%%timeit -n 1000
compareElementsInLists(alist, blist)
1000 loops, best of 3: 1.9 ms per loop

Versus set intersection:

%%timeit -n 1000
set(alist) & set(blist)
1000 loops, best of 3: 104 µs per loop

Just to make sure we get the same results:

>>> compareElementsInLists(alist, blist)
[8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791]
>>> set(alist) & set(blist)
{8282, 29521, 43042, 47193, 48582, 74173, 96216, 98791}
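
Outside IPython, a similar measurement can be made with the standard-library `timeit` module. This is a rough sketch: `seconds_per_call` is a hypothetical helper, the fixed seed is an assumption added only for repeatability, and the figure is an average over all calls rather than the "best of 3" that `%%timeit` reports.

```python
import random
import timeit


def seconds_per_call(func, number=1000):
    """Average wall-clock seconds for one call of func(), over `number` calls."""
    return timeit.timeit(func, number=number) / number


random.seed(0)  # assumption: fixed seed, only so that runs are repeatable
alist = [random.randint(0, 100000) for _ in range(1000)]
blist = [random.randint(0, 100000) for _ in range(1000)]

set_time = seconds_per_call(lambda: set(alist) & set(blist))
print('set intersection: {:.1f} us per call'.format(set_time * 1e6))
```

The function from the question can be timed the same way by passing `lambda: compareElementsInLists(alist, blist)`, provided it is defined in the same session.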

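Answer 1 (score: 1)

If you are just searching for all elements that are in both lists, you should use a data type meant to handle this kind of task. In this case, sets or bags are appropriate. Internally these are represented by hashing mechanisms, which are more efficient than searching in sorted lists.

(collections.Counter represents a suitable bag.)

If you do not care about duplicated elements, then a set is fine.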

a = set(listA)
print(a.intersection(listB))

This prints all elements that are in both listA and listB. (Elements duplicated in the input are not duplicated in the output.)

import collections

a = collections.Counter(listA)
b = collections.Counter(listB)

print(a & b)

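This prints, for each element that is in both lists, how many times it occurs in both (the smaller of its two counts).

I did not take any measurements, but I am pretty sure these solutions are faster than your own attempt.

To convert the counter back into a list of all the elements it represents, you can use list(c.elements()).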
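
A small worked example of both variants, using made-up input lists chosen only for illustration, shows the difference between the set result and the bag result:

```python
from collections import Counter

# Made-up sample data for illustration only.
listA = [1, 2, 2, 3, 3, 3]
listB = [2, 2, 2, 3, 4]

# Set intersection: each common value appears once.
print(sorted(set(listA) & set(listB)))   # [2, 3]

# Bag (multiset) intersection: keeps the smaller count of each common value.
common = Counter(listA) & Counter(listB)
print(common)                            # Counter({2: 2, 3: 1})

# Expand the counter back into a flat list of elements.
print(sorted(common.elements()))         # [2, 2, 3]
```

Here 2 occurs twice in listA and three times in listB, so it is kept twice; 3 occurs three times and once, so it is kept once.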