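Efficiently checking permutations of a small string against a dictionary

Time: 2016-11-13 00:45:30

Tags: python performance

I'm trying to find which words from a dictionary can be made from a scrambled string. I've already had some help with the long-string case, but I think my short-string case is what drags my program into the 20-second range. I'm testing with 1000 scrambles and a dictionary of about 170,000 words.

For short scrambled words, I thought it would be more efficient to create every permutation of the string and compare those against the dictionary entries, like this: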

from itertools import permutations

wordStore = {
    7:[],
    8:['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc',
       'zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff',
       'dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee',
       'iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds',
       'uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv',
       'fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw',
       'ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred',
       'kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']
}

scrambles = set([''.join(p) for p in permutations('acowbtec',8)])
for x in scrambles.intersection(wordStore[8]):
    print('Found ', x)

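I created a small, simple set here to test against.

As you can see, it's fairly straightforward, but far too slow. Here is the relevant part of the cProfile output from my test on the large data set.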

ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
1        9.324    9.324    29.804   29.804   wordplayer.py:2(<module>)
990      9.053    0.009    16.147   0.016    wordplayer.py:28(<listcomp>)
990      2.205    0.002    2.205    0.002    {method 'intersection' of 'set' objects}
39916800 7.093    0.000    7.093    0.000    {method 'join' of 'str' objects}

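I don't fully understand the cProfile results. On a per-call basis none of these look too slow, but in total they take too much time. Any ideas on how to speed this up?

Update

With Dan's help I've sped up my program dramatically. But I have this initialization that doesn't seem right. How should it be done?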

with open(file1) as f:
    for line in f:
        line = line.rstrip()
        try:
            wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)
        except:
            wordStore[len(line)] = {}
            wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)

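1 Answer:

Answer 0: (score: 1)

Instead of generating permutations, normalize each string by sorting its characters, then search for the normalized string. Start with a linear search, then move to a hash index: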

>>> eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc',
...        'zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff',
...        'dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee',
...        'iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds',
...        'uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv',
...        'fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw',
...        'ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred',
...        'kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']

>>> [''.join(sorted(s)) for s in eight]
['abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'abcceotw', 'dfssvvvz', 'bdfffrsv', 'dffjjjss', 'hjjnnsss', 'adhhnnrs', 'bddffhsx', 'ddffgszz', 'cdffgnsz', 'bddffsvz', 'bbddfgtz', 'drrsvvzz', 'dnrrrvvz', 'bcchnnny', 'ddmmmnny', 'deefswwz', 'eeeffswz', 'ddeeeefs', 'eeeeeffs', 'fgghijuy', 'bbjmntuy', 'bfghtuvy', 'cdfgrtvy', 'cefgrtvy', 'cdertvxy', 'cerstvxy', 'defgrtwy', 'defgrstw', 'ddfgrtuy', 'eghnrtuy', 'bdefgrty', 'bcdfghjv', 'bhmnrtvy', 'beghrtvy', 'rstuwwyz', 'bemnrtxx', 'begrtuvy', 'dfffggvw', 'cdeefrsw', 'bcdemnsv', 'bfghjknv', 'ddffghsw', 'dehjqruy', 'hnrstwxy', 'dfjrsvwx', 'degjmrtw', 'degjrtuw', 'degjstuw', 'deghsvyy', 'defghxyy', 'hjjjkkuy', 'dddeffrw', 'ddefgrst', 'ertuwwwy', 'cddefrtv', 'bfghjknv', 'bcdrtuvy', 'dffgghhj', 'addffggg', 'bdemnrtv', 'befghjrw', 'ddefghrt', 'ddffgggs', 'ddfggggg']

>>> ''.join(sorted('acowbtec'))
'abcceotw'

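A linear search is fast enough for this data set, but you can also use a dictionary and index the strings by their alphabetically sorted form.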

>>> [v for v in eight if ''.join(sorted(v)) == ''.join(sorted('acowbtec'))]
['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc']

timeit reports that this linear search takes:

>>> timeit.timeit(setup="eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc','zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff','dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee','iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds','uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv','fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw','ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred','kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']",stmt="[v for v in eight if ''.join(sorted(v)) == ''.join(sorted('acowbtec'))]",number=1000)
0.22520709037780762

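About 0.2 seconds for 1000 iterations.

Building an index of {sorted: [unsorted]} and looking up the sorted query string in that dictionary handles multiple queries faster than running a separate linear search for each query.

Building that index is simply: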

>>> index = {}
>>> for v in eight:
...     index.setdefault(''.join(sorted(v)), []).append(v)
... 
>>> index
{'hjjnnsss': ['sjnshsnj'], 'bbddfgtz': ['dbgtbzdf'], 'ddffgggs': ['dsfgdfgg'], 'defghxyy': ['yhygdfex'], 'begrtuvy': ['uytrebgv'], 'dffjjjss': ['sdjfjsjf'], 'cefgrtvy': ['ytregfcv'], 'dddeffrw': ['rffdddwe'], 'befghjrw': ['jhgfrewb'], 'eeeeeffs': ['sefeefee'], 'ddfgrtuy': ['uytrgfdd'], 'cdfgrtvy': ['ytrgfdcv'], 'deefswwz': ['zswewedf'], 'cerstvxy': ['ytrevcxs'], 'bdemnrtv': ['mnbvtred'], 'bbjmntuy': ['uytmjnbb'], 'ddmmmnny': ['mmmnyndd'], 'ddfggggg': ['dfgdgggg'], 'bcchnnny': ['nhcncnby'], 'ddeeeefs': ['sefdedee'], 'bcdfghjv': ['jhgfdbvc'], 'dfffggvw': ['fgfgfvdw'], 'bemnrtxx': ['mnbtrexx'], 'bhmnrtvy': ['mnbvyhtr'], 'cdeefrsw': ['werfdcse'], 'dnrrrvvz': ['zdrvrvrn'], 'cdertvxy': ['ytrevcxd'], 'bdefgrty': ['ytrebgfd'], 'dffgghhj': ['jhgfhgfd'], 'ddffgszz': ['zsdfgzdf'], 'cdffgnsz': ['cnzsdfgf'], 'fgghijuy': ['iuygfjhg'], 'hjjjkkuy': ['kjjkjuhy'], 'bddffhsx': ['sdfbhxdf'], 'ddefgrst': ['esrdtfgd'], 'degjrtuw': ['ujrtgedw'], 'bcdemnsv': ['mnbvcdes'], 'bfghjknv': ['kjhgfnbv', 'kjhgfnbv'], 'defgrtwy': ['ytrewgfd'], 'rstuwwyz': ['uytrwwsz'], 'bdfffrsv': ['sdffbrfv'], 'ddefghrt': ['hgfdtred'], 'bfghtuvy': ['uythbgvf'], 'eeeffswz': ['zeswffee'], 'drrsvvzz': ['zsvrvrdz'], 'ddffghsw': ['sdfhgfdw'], 'abcceotw': ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc'], 'dfjrsvwx': ['jfrvsdxw'], 'eghnrtuy': ['uytrenhg'], 'addffggg': ['adfgdfgg'], 'cddefrtv': ['vfcdtred'], 'bcdrtuvy': ['uytrbvcd'], 'degjmrtw': ['jmrtgedw'], 'bddffsvz': ['sdbdzvff'], 'adhhnnrs': ['adhnsrhn'], 'ertuwwwy': ['uytrewww'], 'degjstuw': ['ujtgedws'], 'dfssvvvz': ['zsdfvsvv'], 'hnrstwxy': ['wsxrtyhn'], 'beghrtvy': ['ytrehbgv'], 'deghsvyy': ['yhvedsgy'], 'defgrstw': ['trewgfds'], 'dehjqruy': ['yujhredq']}

timeit says this takes:

>>> timeit.timeit(setup="eight = ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc','zsdfvsvv', 'sdffbrfv', 'sdjfjsjf', 'sjnshsnj', 'adhnsrhn', 'sdfbhxdf', 'zsdfgzdf', 'cnzsdfgf', 'sdbdzvff','dbgtbzdf', 'zsvrvrdz', 'zdrvrvrn', 'nhcncnby', 'mmmnyndd', 'zswewedf', 'zeswffee', 'sefdedee', 'sefeefee','iuygfjhg', 'uytmjnbb', 'uythbgvf', 'ytrgfdcv', 'ytregfcv', 'ytrevcxd', 'ytrevcxs', 'ytrewgfd', 'trewgfds','uytrgfdd', 'uytrenhg', 'ytrebgfd', 'jhgfdbvc', 'mnbvyhtr', 'ytrehbgv', 'uytrwwsz', 'mnbtrexx', 'uytrebgv','fgfgfvdw', 'werfdcse', 'mnbvcdes', 'kjhgfnbv', 'sdfhgfdw', 'yujhredq', 'wsxrtyhn', 'jfrvsdxw', 'jmrtgedw','ujrtgedw', 'ujtgedws', 'yhvedsgy', 'yhygdfex', 'kjjkjuhy', 'rffdddwe', 'esrdtfgd', 'uytrewww', 'vfcdtred','kjhgfnbv', 'uytrbvcd', 'jhgfhgfd', 'adfgdfgg', 'mnbvtred', 'jhgfrewb', 'hgfdtred', 'dsfgdfgg', 'dfgdgggg']",stmt="index={}\nfor v in eight:index.setdefault(''.join(sorted(v)), []).append(v)",number=1000)
0.14768695831298828

About 0.15 seconds for 1000 iterations.

Querying it is then:

>>> index[''.join(sorted('acowbtec'))]
['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc']
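
Note that plain subscripting raises a KeyError when a scramble has no anagram in the index. A minimal sketch of a lookup that returns an empty list instead, assuming a hypothetical helper name `find_anagrams` and a tiny sample index (neither is from the original post):

```python
def find_anagrams(index, scramble):
    """Return the words in `index` that use exactly the letters of `scramble`.

    `find_anagrams` is a hypothetical helper name; `index` is assumed to map
    sorted-letter keys to lists of words, like the index built above.
    """
    key = ''.join(sorted(scramble))
    # dict.get avoids a KeyError when no word shares this set of letters.
    return index.get(key, [])


# Small sample index for illustration only.
sample_index = {'abcceotw': ['acowbtec', 'caowbtec'], 'dfssvvvz': ['zsdfvsvv']}
print(find_anagrams(sample_index, 'wotbceca'))  # ['acowbtec', 'caowbtec']
print(find_anagrams(sample_index, 'zzzzzzzz'))  # []
```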

timeit says this takes:

>>> timeit.timeit(setup="index = {'hjjnnsss': ['sjnshsnj'], 'bbddfgtz': ['dbgtbzdf'], 'ddffgggs': ['dsfgdfgg'], 'defghxyy': ['yhygdfex'], 'begrtuvy': ['uytrebgv'], 'dffjjjss': ['sdjfjsjf'], 'cefgrtvy': ['ytregfcv'], 'dddeffrw': ['rffdddwe'], 'befghjrw': ['jhgfrewb'], 'eeeeeffs': ['sefeefee'], 'ddfgrtuy': ['uytrgfdd'], 'cdfgrtvy': ['ytrgfdcv'], 'deefswwz': ['zswewedf'], 'cerstvxy': ['ytrevcxs'], 'bdemnrtv': ['mnbvtred'], 'bbjmntuy': ['uytmjnbb'], 'ddmmmnny': ['mmmnyndd'], 'ddfggggg': ['dfgdgggg'], 'bcchnnny': ['nhcncnby'], 'ddeeeefs': ['sefdedee'], 'bcdfghjv': ['jhgfdbvc'], 'dfffggvw': ['fgfgfvdw'], 'bemnrtxx': ['mnbtrexx'], 'bhmnrtvy': ['mnbvyhtr'], 'cdeefrsw': ['werfdcse'], 'dnrrrvvz': ['zdrvrvrn'], 'cdertvxy': ['ytrevcxd'], 'bdefgrty': ['ytrebgfd'], 'dffgghhj': ['jhgfhgfd'], 'ddffgszz': ['zsdfgzdf'], 'cdffgnsz': ['cnzsdfgf'], 'fgghijuy': ['iuygfjhg'], 'hjjjkkuy': ['kjjkjuhy'], 'bddffhsx': ['sdfbhxdf'], 'ddefgrst': ['esrdtfgd'], 'degjrtuw': ['ujrtgedw'], 'bcdemnsv': ['mnbvcdes'], 'bfghjknv': ['kjhgfnbv', 'kjhgfnbv'], 'defgrtwy': ['ytrewgfd'], 'rstuwwyz': ['uytrwwsz'], 'bdfffrsv': ['sdffbrfv'], 'ddefghrt': ['hgfdtred'], 'bfghtuvy': ['uythbgvf'], 'eeeffswz': ['zeswffee'], 'drrsvvzz': ['zsvrvrdz'], 'ddffghsw': ['sdfhgfdw'], 'abcceotw': ['acowbtec', 'acowbtce', 'acowbetc', 'aocwbtec', 'acwobetc', 'acotbecw', 'caowbtec', 'caowbtce', 'caowbetc'], 'dfjrsvwx': ['jfrvsdxw'], 'eghnrtuy': ['uytrenhg'], 'addffggg': ['adfgdfgg'], 'cddefrtv': ['vfcdtred'], 'bcdrtuvy': ['uytrbvcd'], 'degjmrtw': ['jmrtgedw'], 'bddffsvz': ['sdbdzvff'], 'adhhnnrs': ['adhnsrhn'], 'ertuwwwy': ['uytrewww'], 'degjstuw': ['ujtgedws'], 'dfssvvvz': ['zsdfvsvv'], 'hnrstwxy': ['wsxrtyhn'], 'beghrtvy': ['ytrehbgv'], 'deghsvyy': ['yhvedsgy'], 'defgrstw': ['trewgfds'], 'dehjqruy': ['yujhredq']}",stmt="index[''.join(sorted('acowbtec'))]",number=1000)
0.0015790462493896484

About 0.002 seconds for 1000 iterations.

Both steps are very efficient.
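
To see why the sorted-key approach scales so much better than generating permutations, it can help to count the work per query. The sketch below uses only `math.factorial`; the helper name `permutation_count` is made up for illustration. The 39916800 calls to `join` in the question's profile equal 990 × 8!, which would match 990 list comprehensions each permuting 8 letters.

```python
from math import factorial


def permutation_count(n):
    """Number of orderings itertools.permutations yields for n letters."""
    return factorial(n)


# 8 letters already mean tens of thousands of joined strings per scramble.
print(permutation_count(8))         # 40320
# 990 eight-letter scrambles give the join count seen in the profile.
print(990 * permutation_count(8))   # 39916800
# The sorted key needs one sort of 8 characters per scramble instead.
print(''.join(sorted('acowbtec')))  # abcceotw
```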

Ways to remove the try-except from this code:

wordStore = {}
with open(file1) as f:
    for line in f:
        line = line.rstrip()
        try:
            wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)
        except:
            wordStore[len(line)] = {}
            wordStore[len(line)].setdefault(''.join(sorted(line)), []).append(line)

Use setdefault twice:

wordStore = {}
with open(file1) as f:
    for line in f:
        line = line.rstrip()
        wordStore.setdefault(len(line), {}).setdefault(''.join(sorted(line)), []).append(line)
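
A sketch of how the nested structure can be used end to end, with an in-memory word list standing in for the file. The list `words` and the function names `build_store` and `lookup` are hypothetical, not from the post:

```python
def build_store(words):
    """Group words by length, then by their sorted letters."""
    store = {}
    for word in words:
        key = ''.join(sorted(word))
        store.setdefault(len(word), {}).setdefault(key, []).append(word)
    return store


def lookup(store, scramble):
    """Return all stored words that are anagrams of `scramble`."""
    key = ''.join(sorted(scramble))
    # Chained .get calls never raise and never modify the store.
    return store.get(len(scramble), {}).get(key, [])


words = ['listen', 'silent', 'enlist', 'google', 'cat', 'act']
store = build_store(words)
print(lookup(store, 'tinsel'))  # ['listen', 'silent', 'enlist']
print(lookup(store, 'tac'))     # ['cat', 'act']
print(lookup(store, 'xyz'))     # []
```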

Another option is to use defaultdict, but this requires an import:

from collections import defaultdict

wordStore = defaultdict(lambda: defaultdict(list))
with open(file1) as f:
    for line in f:
        line = line.rstrip()
        wordStore[len(line)][''.join(sorted(line))].append(line)

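The lines are shorter, but some people find the defaultdict initialization harder to understand than setdefault, and the subscripting hides the magic that setdefault makes explicit. Every access creates an entry if one does not already exist.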
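
The entry-creating behavior is easy to trip over when querying. A small sketch, using a hypothetical variable name `word_store` and made-up sample data, of how a read through subscripting inserts keys while `.get` does not:

```python
from collections import defaultdict

word_store = defaultdict(lambda: defaultdict(list))
word_store[3]['act'].append('cat')

# Merely reading a missing key inserts it.
print(len(word_store))      # 1
_ = word_store[8]['abcceotw']
print(len(word_store))      # 2: length 8 now exists with an empty list
print(dict(word_store[8]))  # {'abcceotw': []}

# Using .get (or an `in` test) reads without inserting.
print(word_store.get(5))    # None
print(len(word_store))      # still 2
```

For lookups against a large dictionary, reading with `.get` or testing with `in` keeps failed queries from growing the store.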