The following code is giving me a real headache:
def extract_by_letters(letters, dictionary):
    for word in dictionary:
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                if word in dictionary: #I cant leave this line out
                    dictionary.remove(word)
    return dictionary
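First: the 'if word in dictionary' line. Why can't I leave it out? I get an error saying ValueError: list.remove(x): x not in list
But it clearly is.
Second: the dictionary is a file of about 50,000 words separated by line breaks. The code above takes about 2 minutes to run... Wayyy too long. I played with the code for a bit, and I found that the line:
    if word.count(letter) != letters.count(letter):
seems to be causing all my problems. If I take that line out (which totally screws up the output), the function takes about 2 seconds to go through the dictionary.
I thought it was the count functions, but it's not.
If I change the if statement to something like:
    print word.count(letter)
    print letters.count(letter)
the function takes about 3 seconds to run.
I'm convinced it's the comparison. Any other suggestions? Is there a better way to do this?
Thanks in advance!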
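Answer 0: (score: 4)
The reason you get the exception is that if the letter counts differ for more than one letter, you try to remove the word more than once.
The reason it is so slow is that you have loops inside loops inside loops: both count and remove scan their sequence from the start on every call.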
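A tiny reproduction of the exception, assuming a two-word list and the query "ab" (both made up for illustration). The word "xyz" fails the count check for "a" and again for "b", so the second remove call has nothing left to remove:

```python
words = ["ab", "xyz"]  # hypothetical sample data
letters = "ab"

word = "xyz"
errors = []
for letter in letters:
    if word.count(letter) != letters.count(letter):
        try:
            words.remove(word)  # succeeds for "a", raises ValueError for "b"
        except ValueError as exc:
            errors.append(str(exc))

print(words)   # ['ab']
print(errors)  # one message from the failed second remove
```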
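If you were to write a sentence or two about what the function is supposed to do, it would be a lot easier to refactor it. In the meantime, the version below saves you from having to check whether a word has already been removed.
def extract_by_letters(letters, dictionary):
    for word in dictionary[:]: # bad idea to change this while you iterate over it
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.remove(word)
                break
    return dictionary
If dictionary is a set, you should get some speedup. If dictionary is a list, this should give a huge speedup.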
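For the set case, one possible sketch follows. It is an assumption about how the suggestion might look, not code from the original answer, and the name extract_by_letters_set is made up. A set cannot be sliced, so the loop iterates over a list snapshot instead:

```python
def extract_by_letters_set(letters, dictionary):
    """Remove from the set `dictionary` every word whose count of any
    letter in `letters` differs from that letter's count in `letters`."""
    # Iterate over a snapshot: a set must not change size while iterated.
    for word in list(dictionary):
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                dictionary.discard(word)  # average O(1), unlike list.remove
                break
    return dictionary

words = {"same", "mesa", "seam", "sameness", "samehood", "the"}
print(sorted(extract_by_letters_set("same", words)))
# ['mesa', 'same', 'samehood', 'seam']
```

The break means each word is discarded at most once, so no membership check is needed.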
Answer 1: (score: 2)
Try building the output instead of deleting from it:
def extract_by_letters(letters, dictionary):
    d = []
    for word in dictionary:
        for letter in letters:
            if word.count(letter)>0:
                d.append(word)
                break
    return d
Alternatively, you can use a regular expression:
import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    d=[]
    for word in dictionary:
        if regex.search(word):
            d.append(word)
    return d
Or, perhaps the simplest way:
import re
def extract_by_letters(letters, dictionary):
    regex = re.compile('['+letters+']')
    return [word for word in dictionary if regex.search(word)]
This last one took no noticeable time to scan /usr/share/dict/words on my Mac, which is a list of 234936 words.
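Note that the character-class versions above keep any word containing at least one of the letters; they do not compare counts. If letters might contain regex metacharacters such as ] or ^, one hedged variant is to escape the input first. The function name contains_any_letter is hypothetical:

```python
import re

def contains_any_letter(letters, dictionary):
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    regex = re.compile('[' + re.escape(letters) + ']')
    return [word for word in dictionary if regex.search(word)]

print(contains_any_letter("xz", ["zebra", "apple", "box"]))  # ['zebra', 'box']
print(contains_any_letter("^]", ["a^b", "abc"]))             # ['a^b']
```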
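Answer 2: (score: 2)
Here is a function that should offer a major speedup:
def extract_by_letters(letters, dictionary):
    # Map each distinct letter to its required count, computed once.
    # (A bare zip() is not indexable by letter, so build a real dict.)
    letdict = dict((let, letters.count(let)) for let in set(letters))
    outarr = []
    for word in dictionary:
        goodword = True
        for letter in letdict:
            if word.count(letter) != letdict[letter]:
                goodword = False
                break
        if goodword:
            outarr.append(word)
    return outarr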
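Here are the optimizations I made:
Made a dictionary of the letters with their corresponding frequencies. This way, you aren't calling letters.count over and over again when you only need to do this once and store the results.
Rather than removing words from dictionary, I add them to an array that is returned from the function. If you have a huge dictionary, chances are that only a few words will match. Also, if the dictionary variable is an array (which I suspect), then every call to remove first has to search for the word in dictionary (linearly, starting at the beginning) and then remove it. It is faster to remove by popping with the index of the word to be removed.
Breaking out of the loop that checks the letter counts as soon as a mismatch is found. This prevents needless checks when we already have our answer.
I wasn't sure whether you had repeated letters in the letters variable, so I made sure it could handle that by using letdict. If you did have repeated letters in the letters variable before, then you were checking the counts of those letters in each word repeatedly.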
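The frequency table and the early exit described above can also be sketched with collections.Counter from the standard library, which builds the letter-to-count mapping in one pass. This is an alternative formulation under the same matching rule, not the answerer's code, and extract_with_counter is a made-up name:

```python
from collections import Counter

def extract_with_counter(letters, dictionary):
    required = Counter(letters)  # letter -> required count, computed once
    matches = []
    for word in dictionary:
        # all() stops at the first letter whose count differs.
        if all(word.count(letter) == n for letter, n in required.items()):
            matches.append(word)
    return matches

words = ["same", "mesa", "seam", "sameness", "samehood", "the"]
print(extract_with_counter("same", words))
# ['same', 'mesa', 'seam', 'samehood']
```

As with the function above, letters that do not appear in the query are ignored, which is why "samehood" is kept.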
Answer 3: (score: 1)
import pprint
from collections import defaultdict

# This is a best approximation to what Bryan is trying to do.
# However the results are meaningless because the list is being
# mutated during iteration over it. So I haven't shown the output.

def extract_by_letters_0(letters, input_list):
    dictionary = input_list[:]  # Python 2 lists have no .copy() method
    for word in dictionary:
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                if word in dictionary: #I cant leave this line out
                    dictionary.remove(word)
    return dictionary

# This avoids the mutation.
# The results are anagrams PLUS letters that don't occur
# in the query. E.g. "same" produces "samehood" but not "sameness"
# ("sameness" has 3*"s" and 2*"e" instead of 1 of each)

def extract_by_letters_1(letters, input_list):
    dictionary = set(input_list)
    ripouts = set()
    for word in dictionary:
        for letter in letters:
            if word.count(letter) != letters.count(letter):
                ripouts.add(word)
    return dictionary - ripouts

def anagram_key(strg):
    return ''.join(sorted(list(strg)))

def check_anagrams(str1, str2):
    return sorted(list(str1)) == sorted(list(str2))
# Advice: try algorithms like this out on a SMALL data set first.
# Get it working correctly. Use different test cases. Have test code
# however primitive that check your results.
# Then if it runs slowly, helpers
# don't have to guess what you are doing.

raw_text = """
twas brillig and the slithy toves
did gyre and gimble in the wabe
same mesa seam sameness samehood
"""

lexicon = sorted(set(raw_text.split()))
print "\nlexicon:", lexicon
#
# Assuming we want anagrams:
#
# Build an anagram dictionary
#
anagram_dict = defaultdict(set)
for word in lexicon:
    anagram_dict[anagram_key(word)].add(word)
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# now purge trivial entries
temp = {}
for k, v in anagram_dict.iteritems():
    if len(v) != 1:
        temp[k] = v
anagram_dict = temp
print "\nanagram_dict (len == %d):" % len(anagram_dict)
pprint.pprint(anagram_dict)

# Test cases
tests = "sam same mesa sameness samehood xsame samex".split()
default_set = frozenset()
for test in tests:
    print
    results = extract_by_letters_1(test, lexicon)
    print test, [(result, check_anagrams(test, result)) for result in results]
    # In the following statement, you can use set([test]) as the default
    # if that produces a more useful or orthogonal result.
    results = anagram_dict.get(anagram_key(test), default_set)
    print test, [(result, check_anagrams(test, result)) for result in results]
Output:
lexicon: ['and', 'brillig', 'did', 'gimble', 'gyre', 'in', 'mesa', 'same', 'samehood', 'sameness', 'seam', 'slithy', 'the', 'toves', 'twas', 'wabe']
anagram_dict (len == 14):
defaultdict(<type 'set'>, {'abew': set(['wabe']), 'eht': set(['the']), 'egry': set(['gyre']), 'begilm': set(['gimble']), 'hilsty': set(['slithy']), 'aems': set(['mesa', 'seam', 'same']), 'bgiillr': set(['brillig']), 'ddi': set(['did']), 'eostv': set(['toves']), 'adehmoos': set(['samehood']), 'in': set(['in']), 'adn': set(['and']), 'aeemnsss': set(['sameness']), 'astw': set(['twas'])})
anagram_dict (len == 1):
{'aems': set(['mesa', 'same', 'seam'])}
sam [('mesa', False), ('samehood', False), ('seam', False), ('same', False)]
sam []
same [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
same [('mesa', True), ('seam', True), ('same', True)]
mesa [('mesa', True), ('samehood', False), ('seam', True), ('same', True)]
mesa [('mesa', True), ('seam', True), ('same', True)]
sameness [('sameness', True)]
sameness []
samehood [('samehood', True)]
samehood []
xsame []
xsame []
samex []
samex []
Answer 4: (score: 0)
Do you want to find all anagrams of 'letters'?
def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result
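A short usage sketch for the function above, assuming the words come straight from a file read line by line and so still carry their trailing newlines (the sample list is invented). The strip() call is what lets those lines compare equal to the query:

```python
def anagrams(letters, words):
    letters = sorted(letters)
    result = []
    for word in words:
        if sorted(word.strip()) == letters:
            result.append(word)
    return result

# Lines as they might come from file.readlines(); hypothetical sample data.
lines = ["same\n", "mesa\n", "seam\n", "sameness\n", "samehood\n"]
found = anagrams("same", lines)
print([w.strip() for w in found])  # ['same', 'mesa', 'seam']
```

Unlike the count-based functions earlier in the thread, this rejects "samehood", because every letter of the word has to be accounted for.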