Integer comparison in Python slows everything down to a crawl

Time: 2010-02-22 09:44:20

Tags: python loops integer

The following code is giving me a massive headache:

def extract_by_letters(letters, dictionary):

    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)

    return dictionary

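First: the 'if word in dictionary' line. Why can't I leave it out? I get an error saying ValueError: list.remove(x): x not in list

But obviously it is.

Second: dictionary is a file of about 50,000 words separated by line breaks. The code above takes about 2 minutes to run... Wayyy too long. I played with the code for a bit, and I found that this line: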

if word.count(letter) != letters.count(letter):

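seems to be causing all my problems. If I take that line out (which totally screws up the output), the function takes about 2 seconds to go through the dictionary.

I thought it was the count functions, but it's not.

If I change the if statement to something like: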

print word.count(letter) 
print letters.count(letter)

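the function takes about 3 seconds to run.

I'm convinced it's the comparison. Any other suggestions? Is there a better way to do this?

Thanks in advance!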

5 Answers:

Answer 0: (score: 4)

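The reason you get the exception is that if the letter count differs for more than one letter, you try to remove the word more than once.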
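A minimal sketch of that failure, using a made-up two-word list (the names `words`, `letters`, `word` and `errors` are illustrative, not from the question). Both "a" and "b" have the wrong count in "xyz", so the second `remove` call finds nothing left to delete:

```python
# Illustrative data only: "xyz" has the wrong count for both "a" and "b".
words = ["xyz", "ab"]
letters = "ab"
word = "xyz"

errors = []
for letter in letters:
    if word.count(letter) != letters.count(letter):
        try:
            words.remove(word)  # succeeds for "a", raises for "b"
        except ValueError as exc:
            errors.append(str(exc))

print(words)   # "xyz" was removed exactly once
print(errors)  # the second attempt raised ValueError
```

The `if word in dictionary` guard in the question hides this error, at the cost of one more linear scan of the list per mismatching letter.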

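The reason it is so slow is that you have loops inside loops inside loops.

If you wrote a sentence or two about what the function is supposed to do, it would be a lot easier to refactor it. In the meantime, this will stop you from checking whether a word needs to be removed once you have already removed it.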

def extract_by_letters(letters, dictionary):
    for word in dictionary[:]:  # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary

If dictionary is a set you should get some speedup. If dictionary is a list this should give a huge speedup.
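To illustrate why the loop above iterates over `dictionary[:]` rather than `dictionary` itself, here is a small sketch with hypothetical helper names and toy data. Removing items from a list while iterating over it shifts the remaining items left, so some are never visited:

```python
def remove_short_in_place(words):
    # Mutates the list while iterating over it: after each removal the
    # remaining items shift left, so the next item is skipped.
    for word in words:
        if len(word) < 3:
            words.remove(word)
    return words

def remove_short_from_copy(words):
    # Iterates over a shallow copy (words[:]) and mutates the original.
    for word in words[:]:
        if len(word) < 3:
            words.remove(word)
    return words

sample = ["a", "b", "cat", "d", "e"]
buggy = remove_short_in_place(list(sample))   # leaves "b" and "e" behind
fixed = remove_short_from_copy(list(sample))  # only "cat" survives
print(buggy, fixed)
```

This is also why the output of the question's original function cannot be trusted even when it runs without an exception.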

Answer 1: (score: 2)

Try building the output instead of deleting from it:

def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
       for letter in letters:
           if word.count(letter)>0:
               d.append(word)
               break
    return d

Or, you could use regular expressions:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
       if regex.search(word):
           d.append(word)
    return d

Or, perhaps the simplest way:

import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]

This last one takes no noticeable time to scan /usr/share/dict/words on my Mac, which is a list of 234936 words.
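One caveat, sketched here under the assumption that `letters` might contain characters that are special inside a regex character class (such as `^`, `-` or `]`): passing the string through `re.escape` first keeps them literal. The function name `extract_by_letters_escaped` and the sample words are made up for illustration. Like the versions above, it keeps every word that contains at least one of the letters, and it assumes `letters` is non-empty.

```python
import re

def extract_by_letters_escaped(letters, dictionary):
    # re.escape backslash-escapes regex metacharacters, so "^" and "-"
    # are matched literally instead of negating the class or forming a range.
    regex = re.compile('[' + re.escape(letters) + ']')
    return [word for word in dictionary if regex.search(word)]

words = ["apple", "kiwi", "a-b", "x^y", "plum"]
matches = extract_by_letters_escaped("^-", words)
print(matches)
```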

Answer 2: (score: 2)

Here's a function that should provide a major speedup:

def extract_by_letters(letters,dictionary):
    # Map each distinct letter to its count; zip() alone yields pairs, not a dict.
    letdict = dict((let, letters.count(let)) for let in set(letters))
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr

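Here are the optimizations I made:

  1. Made a dictionary of the letters with their corresponding frequencies. This way, you aren't calling letters.count over and over when you only need to do it once and store the results.

  2. Rather than removing words from dictionary, I add them to an array that is returned from the function. If you have a huge dictionary, chances are that only a few words will match. Also, if the dictionary variable is an array (which I suspect), then every time you called remove it first had to search for the word in dictionary (linearly, starting at the beginning) and then remove it. It is faster to remove it by popping with the index of the word to be removed.

  3. Breaking out of the loop that checks the letter counts as soon as a mismatch is found. This prevents needless checks when we already have our answer.

  4. I wasn't sure whether you had repeated letters in the letters variable, so I made sure it could handle that by using letdict. If letters did contain repeats before, you were checking the counts of those letters in each word repeatedly.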
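The optimizations above can also be sketched with `collections.Counter`, which was added in Python 2.7 and so is a swapped-in technique rather than what this answer used. The function name `extract_by_letter_counts` and the sample words are illustrative assumptions:

```python
from collections import Counter

def extract_by_letter_counts(letters, dictionary):
    # Count each distinct letter once, up front (optimizations 1 and 4).
    wanted = Counter(letters)
    # Build a new list instead of removing from the input (optimization 2).
    return [
        word for word in dictionary
        # all() stops at the first mismatching letter (optimization 3).
        if all(word.count(let) == n for let, n in wanted.items())
    ]

words = ["same", "mesa", "seam", "sameness", "samehood"]
result = extract_by_letter_counts("same", words)
print(result)
```

As in the original logic, letters that do not appear in `letters` are ignored, so "samehood" is kept while "sameness" (three "s", two "e") is rejected.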

Answer 3: (score: 1)

import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = list(input_list)  # shallow copy; Python 2 lists have no .copy()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               if word in dictionary: #I cant leave this line out
                   dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
       for letter in letters:
           if word.count(letter) != letters.count(letter):
               ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))

# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)

print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries

temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases

tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]

Output:

lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']

anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})

anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}

sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []

same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]

mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]

sameness [('sameness', True)]
sameness []

samehood [('samehood', True)]
samehood []

xsame []
xsame []

samex []
samex []
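For readers on newer interpreters, the anagram-index idea from the script above can be sketched without Python 2 only constructs such as the `print` statement and `iteritems`. The helper name `build_anagram_index` is hypothetical, and the word list is the same toy lexicon:

```python
from collections import defaultdict

def build_anagram_index(words):
    # Group words by their sorted letters; anagrams share the same key.
    index = defaultdict(set)
    for word in words:
        index[''.join(sorted(word))].add(word)
    # Drop trivial groups that contain a single word.
    return dict((key, group) for key, group in index.items() if len(group) > 1)

lexicon = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
""".split()

index = build_anagram_index(lexicon)
print(index)
```

Building the index costs one pass over the lexicon. After that, each query is a single dictionary lookup on the sorted letters of the query word.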

Answer 4: (score: 0)

Are you trying to find all anagrams of 'letters'?

def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result
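A short usage sketch, assuming the words come straight from a newline-separated file, so each entry still carries its trailing newline. The function is repeated so the snippet runs on its own, and the sample lines are made up:

```python
def anagrams(letters, words):
    # Same function as above, repeated so this snippet is self-contained.
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result

# Lines as they would come from iterating over an open file.
lines = ["same\n", "mesa\n", "sameness\n", "seam\n"]
found = anagrams("same", lines)
print(found)
```

Note that the comparison uses the stripped word, but the returned entries are the original strings, newline included.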