Not understanding the difference in computational complexity between two algorithms

Asked: 2017-01-26 18:24:25

Tags: python algorithm performance time-complexity

I'm currently working through some coding problems on leetcode.com, and I'm stumped by the difference in computation time between two algorithms that produce the same result.

Problem: Given an array of strings words, find the maximum value of length(word[i]) * length(word[j]) where the two words do not share any common letters. You may assume that each word contains only lowercase letters. If no such pair of words exists, return 0.

At a minimum, every combination of string pairs has to be compared to compute the answer. I realize a bitwise approach would be faster/more efficient, but I've chosen to set it aside for this question.
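(For context, here is a rough sketch of the bitwise idea I'm setting aside, just to show what I mean by it:

# Sketch only: each word becomes a 26-bit mask, one bit per letter,
# so "no shared letters" reduces to a single bitwise AND.
masks = [sum(1 << (ord(c) - ord('a')) for c in set(w)) for w in words]
curr_max = 0
for i in range(len(masks)):
    for j in range(i):
        if masks[i] & masks[j] == 0:
            curr_max = max(curr_max, len(words[i]) * len(words[j]))
print(curr_max)

)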

You can find the test-case word list I'm using at the bottom of this post.

Algorithm 1:

curr_max=0
for i in range(0,len(words)):
    for j in range(0,len(words)):
        if i<j:
            curr_word=set(words[i])
            other_word=words[j]
            for char in curr_word:
                if char in other_word:
                    break
            else:
                curr_max=max(curr_max,len(words[i])*len(other_word))
print(curr_max)

This algorithm is essentially a nested for loop comparing every possible combination of words; without the if i<j check it would double-count pairs and compare words against themselves.

Algorithm 2:

curr_max = 0
while words:
    curr_word = set(words[0])
    curr_len = len(words[0])
    words = words[1:]
    for word in words:
        for char in curr_word:
            if char in word:
                break
        else:
            curr_max = max(curr_max, curr_len*len(word))
print(curr_max)

This algorithm basically compares the first word in the list against every other word, then removes it from the list and repeats the process. It takes about 0.02 seconds, roughly 4x faster. I'm having a hard time understanding why the difference is so large. Thanks in advance for any help!

The input I'm using to compare the times is:

words=["eedfdddadbcc","acdadecebafaebaec","dfde","ececbefe","bbafebbeccbddddbd","eafeffddbbbf","cbd","abadeaddfbcfbadb","ffdcacaebbeaa","fbadcfeede","bdcefdbfec","bbadfccdfebefd","dbefdfabededb","cbbccdecfbbe","abaeeecdbecebafedbfee","fbefbdfc","fffafb","bfadbdefbfedbddff","cbccbfdadbfe","bacaafecbbfaae","fcbffdbefcfbccd","fefaadfaaafdfbdaff","ecabaff","ccbdefcdcfac","bbbfafbffcefbc","edecefa","bcdfbcebabae","aaefecfcbbccfaeeaf","beaabaaeaebef","adfac","acedfdabccebc","efbfbef","bfccadbcbcfcdabfa","ffcaddbcf","dfae","ccadeeebeaabddebcadec","babaa","ebbdbaabddfdddad","fafaddbaebdaa","eeeeddadedfe","effbca","abcddfa","cbadcfeffeaaeecbbfe","ceaabcfaaaefeeadf","acecadddde","ece","dc","bfafdefbbdafacdcfb","fbdcad","dbaaffcdbcbdea","baaee","bebed","beaedceceaa","eacbcfdbcefbaddffcac","acddaedacfeedffad","efebff","efcbf","cdfffffaacfacafb","adacaceea","fceeffededbcfbfaaf","eafaeffcbfde","debadcddbdbabefdbe","ef","eeeeabfbaabddaecb","eeadcdcdaacaabe","ebcbffdefafdcbcebec","eb","adedefbaabfcefbea","ddceabfddaaefcea","ddffb","fdadfac","de","cbcdcbddcdabeb","ccffeeafbbbf","ccba","dab","bbdbeefdbef","cbec","ffcbefdbfdacdbdbf","adfad","ceacdcbfbdbbaebbd","ecfeaefff","ddbbdaefddeebd","eeeee","abdadc","eafecdbdef","aeedaeeaebaaeecdd","dddeebcbdea","bcaadedacb","ebdeadddcafa","ecbdbcbfccbdffaef","fddcffbfffa","accbdcfcedeabeab","cfbbefbddcaeecfbfacc","efffdaacbafeecdad","aaadfa","efeccbabdefaf","defebaddaafdcd","ecebcaacdaccaddcfeee","fdbbfecaffeafaa","bafccdbea","caa","deedefdeccead","bbfbfeaeddacfacea","daaefbbcbcdbfbfdda","aceed","cfeadadadbcff","eaefcdca","cefebdbaafeabdbdeaafd","abec","aeececad","cfeabcbaeaebdbcaada","aac","ebabffeb","fa","cf","dcebedefc","dbaedceecf","ffebaedafccceb","faefbeacaddefbe","eeadbfabfbbbfaeffaeea","affdecaca","ccfdcbdefcdfaddbbeaed","bc","feafaaabaceade","bebdfbbad","eeacaefaddacac","fff","aeddcd","ccffbabbdfc","ecddbcdeecdfbbb","debdbcdcafdcd","cfaebaeddbbd","efdada","becdccaeffeedcadbdd","feedbacc","cbbeebcdad","bfdcbdfdbcceadded","cfdbfdddafadadddcba","bcedeaeeaac","ffdcfccffdfffaebf","afffceaaadbbedfdd","faaeebdfbfefddebed","eedafbddeeaaadcdeaccc","eeceadafa","ebcfaccabea","eebdbbedcaedcbdcfaba","ecfcadaebacbdfdccebe","cbbabdadaee","cfeea","dec","cfedcbaabbaef","aacdabcbf","dfdbacadbebeedcd","bccccfdcdfe","cfcacdbcdccddcadce","dafafeccfaccaadeabbf","eaffaaffefccde","bbfbddccfda","fbdbbbbfbe","eafbcafbdbead","edbcdcefdc","fe","aafdcabce","ddafedceddcdcbfbcafe","dabcafbcfafeeadbbbef","beeaacd","cadeabebdbcfbbdfe","ecfefbfbbfa","fedacafcc","bcdcefecbcebaeeccdbd","fefde","cafba","bdabeaabbdbbbccecebda","dfeaadbeaaeefdfbed","dbaecde","cfbdfffbdeeeeb","fc","decadcacfaabca","cebbdff","badabbddcfed","fcce","fedfadefcf","acfccfbfcda","debfc","bebafeaeffe","ceaefbbcefacbbacb","cebbaeb","cadedfdafecdfb","bfefdfbaceddfcbade","cefeefaeddafbbdcade","faceadcefbffadb","cfbacafae","dfbfadfdccedbcbeaae","dbbccdddaf","ebbcbcebdddcedcfdcfaa","ccedffbcdbaedfaeb","ccfeaceaaaaeee","faade","afaaacaecbffdbadcbcd","cebfbbefbbdabbbffea","cdaadba","bbefdcacaaadbbbdedec","adabfbebdb","fcfefadcbadaacbdcfdbb","adddadebfc","fb","ecfebaacbdabece","dabacfdecfe","eeeecc","eabbe","fcdffababd","aafdbbcfdecbccca","efebaaadfecccecaa","cffefdbf","bcbdd","eaaccdcfdbbbcf"]

Update to the OP to help people assist me with this! I'm now generating the words with:

from random import choice, randrange

words = [''.join([choice('abcdefghijklmnopqrstuvwxyz') for _ in range(randrange(2, 22))]) for _ in range(250)]

The first algorithm I'm running:

import time

t1=time.time()
curr_max=0
for i in range(0,len(words)):
    for j in range(0,i):
        curr_word=set(words[i])
        other_word=words[j]
        for char in curr_word:
            if char in other_word:
                break
        else:
            curr_max=max(curr_max,len(words[i])*len(other_word))
print(curr_max)
t0=time.time()
print(t0-t1)

The times I'm seeing are in the 0.1 second range.

The second algorithm I'm using:

t1=time.time()
curr_max = 0
while words:
    curr_word = set(words[0])
    curr_len = len(words[0])
    words = words[1:]
    for word in words:
        for char in curr_word:
            if char in word:
                break
        else:
            curr_max = max(curr_max, curr_len*len(word))
print(curr_max)
t0=time.time()
print(t0-t1)

The times I'm seeing are 0.04-0.05 seconds. Can anyone reproduce this?

1 Answer:

Answer (score: 5):

Both algorithms only look like they do the same amount of work; both re-invent the itertools.combinations() function. But the first version re-creates sets far more often, and runs the i < j test a full n ** 2 times (for n words)!
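For reference, the pair iteration both versions re-implement is just this pattern (the iteration only, not a full solution):

from itertools import combinations

# visits every unordered pair of distinct words exactly once
for word_a, word_b in combinations(words, 2):
    pass  # compare word_a and word_b here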

You are generating the combinations of n items taken 2 at a time, of which there are n! / (2! (n - 2)!) (see the binomial coefficient on Wikipedia), which is substantially smaller than n ** 2:

>>> import math
>>> n = 250
>>> math.factorial(n) / (math.factorial(2) * math.factorial(n - 2))
31125.0
>>> n ** 2
62500
>>> n ** 2 - (math.factorial(n) / (math.factorial(2) * math.factorial(n - 2)))
31375.0

So for your specific case, algorithm #1 runs the inner test more than twice as often as algorithm #2 does. As n grows, n ** 2 divided by the number of combinations approaches 2 (since n ** 2 / (n (n - 1) / 2) = 2n / (n - 1)), so algorithm #1 always does at least about twice the work.
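You can verify that ratio numerically; a quick sketch:

# ratio of n ** 2 index pairs to the C(n, 2) real combinations; tends to 2
for n in (10, 100, 1000, 10000):
    combos = n * (n - 1) // 2
    print(n, n ** 2 / combos)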

Next, in algorithm #2 you create the set for words[0] just once per outer iteration, but in algorithm #1 a set is created on every inner-loop iteration:

# algorithm #1
for ...  # current word loop
    for ...   # other word loop
        set(words[i])

# algorithm #2
while    # current word loop
    set(words[0])
    for ...   # other word loop

It's those differences that make it slower; creating C(n, 2) sets rather than just n sets is probably costing you the bulk of the performance difference here.
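You can estimate that cost in isolation; a rough sketch (exact numbers will vary by machine):

from timeit import timeit

# approximate total time spent just building sets, for n vs C(n, 2) creations
per_set = timeit("set(w)", setup="w = 'abadeaddfbcfbadb'", number=10000) / 10000
n = 250
print('n sets:     ', per_set * n)
print('C(n,2) sets:', per_set * (n * (n - 1) // 2))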

To make a proper comparison, you should use the timeit module, which repeats the test many times, makes sure the most accurate clock is used to measure the time taken, and disables the Python garbage collector (so it won't interfere). I've included a random word list; because the destructive algorithm (your #2) consumes the list, I have to clone it for every run, and I compensate for that by subtracting the time taken by the same number of bare list copies.

The script I ran:

from random import choice, randrange
from timeit import timeit

def naive_loop(words):
    curr_max=0
    for i in range(0,len(words)):
        for j in range(0,len(words)):
            if i<j:
                curr_word=set(words[i])
                other_word=words[j]
                for char in curr_word:
                    if char in other_word:
                        break
                else:
                    curr_max=max(curr_max,len(words[i])*len(other_word))
    return curr_max

def destructive_loop(words):
    curr_max = 0
    while words:
        curr_word = set(words[0])
        curr_len = len(words[0])
        words = words[1:]
        for word in words:
            for char in curr_word:
                if char in word:
                    break
            else:
                curr_max = max(curr_max, curr_len*len(word))
    return curr_max

def reduced_set_calls_loop(words):
    curr_max=0
    for i in range(0,len(words)):
        curr_word=set(words[i])
        for j in range(0,len(words)):
            if i<j:
                other_word=words[j]
                for char in curr_word:
                    if char in other_word:
                        break
                else:
                    curr_max=max(curr_max,len(words[i])*len(other_word))
    return curr_max


words = [''.join([choice('abcdef') for _ in range(randrange(2, 22))]) for _ in range(250)]
number = 100
print('Naive:', timeit('naive_loop(words)', 'from __main__ import naive_loop, words', number=number))
print('Destructive:',  # don't include time to copy a list
    timeit('destructive_loop(words[:])', 'from __main__ import destructive_loop, words', number=number) -
    timeit('words[:]', 'from __main__ import naive_loop, words', number=number))
print('Reduced set calls:', timeit('reduced_set_calls_loop(words)', 'from __main__ import reduced_set_calls_loop, words', number=number))

And the results:

Naive: 1.8516130640055053
Destructive: 0.3646556100138696
Reduced set calls: 0.5927464940032223

So moving the set() call out of the inner loop, as in reduced_set_calls_loop(), improves the first version considerably. Reducing the number of iterations by replacing the if i < j test with a for j in range(i): loop narrows the gap further:

>>> def reduced_iteration_loop(words):
...     curr_max=0
...     for i in range(0,len(words)):
...         curr_word=set(words[i])
...         for j in range(i):
...             other_word=words[j]
...             for char in curr_word:
...                 if char in other_word:
...                     break
...             else:
...                 curr_max=max(curr_max,len(words[i])*len(other_word))
...     return curr_max
...
>>> print('Reduced iteration:', timeit('reduced_iteration_loop(words)', 'from __main__ import reduced_iteration_loop, words', number=number))
Reduced iteration: 0.44450017900089733

To my surprise, your destructive loop is faster than using itertools.combinations():

>>> from itertools import combinations
>>> def destructive_loop_empty(words):
...     while words:
...         curr_word, words = words[0], words[1:]
...         for word in words:
...             pass
...
>>> def empty_combinations(words):
...     for a, b in combinations(words, 2):
...         pass
...
>>> timeit('destructive_loop_empty(words[:])', 'from __main__ import destructive_loop_empty, words', number=1000)
0.324253979997593
>>> timeit('empty_combinations(words[:])', 'from __main__ import empty_combinations, words', number=1000)
0.5626872480061138

We can make your algorithm #2 faster still by testing whole sets for disjointness (set.isdisjoint()) instead of testing each character separately. And because the same words are tested repeatedly, it makes sense to create the sets up front and store them in a dictionary, acting as a cache we can draw from while testing.

Finally, we can make a version that doesn't destroy the input list by storing the lengths in that dictionary as well and simply looping over its values (we consume the dictionary instead):

def nondestructive_loop(words):
    curr_max = 0
    words = {w: (set(w), len(w)) for w in words}
    while words:
        curr_word, curr_word_length = words.popitem()[1]
        for other, other_length in words.values():
            if curr_word.isdisjoint(other):
                curr_max = max(curr_max, curr_word_length * other_length)
    return curr_max

This is the fastest I could get it:

>>> print('Nondestructive:', timeit('nondestructive_loop(words)', 'from __main__ import nondestructive_loop, words', number=number))
Nondestructive: 0.2944725830020616

That shaves off another 20%.

So, in summary, iterating directly over a list is faster than generating indices from range() and then indexing into the list, and the difference is big enough that destroying the list (or dictionary) is worth it!
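You can see the indexing overhead in isolation; a quick sketch:

from timeit import timeit

# direct iteration vs. range()-and-index over the same list
setup = "words = ['word'] * 250"
print(timeit('for w in words: pass', setup=setup, number=10000))
print(timeit('for i in range(len(words)): words[i]', setup=setup, number=10000))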

This is also the reason itertools.combinations() is slower here; it works with indices internally, because it has to support combinations of more than 2 elements (which means it can't remove items from the input sequence as it goes).
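You can see that index bookkeeping in the pure-Python equivalent given in the itertools documentation (reproduced here from memory, lightly commented):

def combinations(iterable, r):
    # Docs recipe: tracks a list of indices into the pool
    # rather than consuming the input sequence.
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        # find the rightmost index that can still be advanced
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        # reset all indices to the right of it
        for j in range(i + 1, r):
            indices[j] = indices[j - 1] + 1
        yield tuple(pool[i] for i in indices)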