Question

I have a list of strings and a dictionary mapping words to their replacements:

titles = ['The cat in the hat', 'Horton hears a who',
          'Green eggs and ham', 'The butter battle book', 'My book about me']
wlist = {'cat': 'word1', 'hat': 'word2', 'Horton': 'word3',
         'eggs': 'word4', 'butter': 'word5', 'book': 'word6'}

Whenever a word from the dictionary occurs in a string, I need to replace it with the corresponding value.
So far I have the following code:

for i, book in enumerate(titles):
    for k, v in wlist.items():
        if k in book:
            book = book.replace(k, v)
            titles[i] = book

This gives me the output:

['The word1 in the word2',
 'word3 hears a who',
 'Green word4 and ham',
 'The word5 battle word6',
 'My word6 about me']

Is there a more efficient (faster) way to do this, perhaps without the two for loops? The list I actually have is very large!

Thanks a lot!

Answer 1

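Here are some ideas (and my reasoning behind them). Please measure them against your own data!

The main idea is to run less Python code in favor of Python functions implemented in C, and to combine the data with a sentinel so that you get some help from the cache.

The first idea is to join all the strings into a single string, separated by some sentinel value that does not occur in any of them, do the replacements, and then split them apart again. I think this might be faster for three reasons. Strings are immutable, so Python will not have to keep reallocating them as replacements happen (though I'm sure it keeps some kind of cache that you would probably be reusing). You benefit from whatever algorithmic advantages .replace may have when it works on one long string. And you only traverse a single string, so you may get cache speedups. Of course, you pay the cost of joining and splitting all the strings, and you have to make sure your sentinel is not in the data.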
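The first idea on its own, using only str.replace, could look roughly like the sketch below. The name SENTINEL is my own choice for illustration, and the sketch assumes that '\n' never appears in the titles or in the replacement values. Whether it is actually faster is something to measure, not something I can promise.

```python
# Sketch of the "join, replace, split" idea using only str.replace.
# SENTINEL is an assumption: it must not occur in any title or replacement.
SENTINEL = '\n'

titles = ['The cat in the hat', 'Horton hears a who',
          'Green eggs and ham', 'The butter battle book', 'My book about me']
wlist = {'cat': 'word1', 'hat': 'word2', 'Horton': 'word3',
         'eggs': 'word4', 'butter': 'word5', 'book': 'word6'}

# Fail early if the sentinel would corrupt the split step.
assert not any(SENTINEL in t for t in titles)

combined = SENTINEL.join(titles)
for old, new in wlist.items():
    # One C-level pass over the combined string per dictionary key.
    combined = combined.replace(old, new)

result = combined.split(SENTINEL)
print(result)
```

Note that this still loops over the dictionary in Python; it only removes the loop over the titles.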

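The second idea (stolen from this article) is to do the replacement with a regular expression, so the regex library only has to traverse the string once, using whatever C-implemented magic it uses.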
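Taken by itself, the regex idea might be sketched as a small helper like the one below. The function name replace_words is hypothetical, and two details are my own additions rather than part of the idea as stated: keys are passed through re.escape in case they contain regex metacharacters, and longer keys are tried first so that a key which is a prefix of another does not shadow it.

```python
import re

def replace_words(text, mapping):
    """Replace every key of `mapping` found in `text` in a single regex pass."""
    # Longest keys first so a longer key wins over a key that is its prefix;
    # re.escape guards against keys containing regex metacharacters.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # The callback looks up the matched text in the dictionary.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

wlist = {'cat': 'word1', 'hat': 'word2', 'Horton': 'word3',
         'eggs': 'word4', 'butter': 'word5', 'book': 'word6'}
print(replace_words('The butter battle book', wlist))
```

For a large list you would compile the pattern once outside the function and reuse it, since the helper above recompiles it on every call.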

So, combining the two ideas:

import re

# I'm using '\n' to join the strings. Of course, if you can control your input,
# you can just load the list into this format instead of converting it
titles = '\n'.join(['The cat in the hat', 'Horton hears a who',
                    'Green eggs and ham', 'The butter battle book', 'My book about me'])

wlist = {'cat': 'word1', 'hat': 'word2', 'Horton': 'word3',
         'eggs': 'word4', 'butter': 'word5', 'book': 'word6'}

# One alternation of all keys; re.escape keeps keys with regex
# metacharacters (such as '.' or '+') from being misinterpreted
robj = re.compile('|'.join(map(re.escape, wlist)))
# The callback looks up each matched word in the dictionary
result = robj.sub(lambda m: wlist[m.group(0)], titles)
result = result.split('\n')  # uncombine
print(result)

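Again, this is really just guesswork. One, both, or neither of these ideas might help, or I might be completely out in left field. Once you have tested them, I would love to see the numbers!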
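If it helps with the measuring, here is a rough timeit harness. Everything in it is an assumption for illustration: the function names nested_loops and joined_regex are mine, and the data is just the five sample titles repeated 1000 times, which may behave very differently from a real list. It prints timings but makes no claim about which variant wins.

```python
import re
import timeit

# Synthetic data: the sample titles repeated; replace with your real list.
titles = ['The cat in the hat', 'Horton hears a who',
          'Green eggs and ham', 'The butter battle book', 'My book about me'] * 1000
wlist = {'cat': 'word1', 'hat': 'word2', 'Horton': 'word3',
         'eggs': 'word4', 'butter': 'word5', 'book': 'word6'}

def nested_loops(titles, wlist):
    """The approach from the question, returning a new list."""
    out = []
    for book in titles:
        for k, v in wlist.items():
            if k in book:
                book = book.replace(k, v)
        out.append(book)
    return out

def joined_regex(titles, wlist):
    """Join with '\\n', substitute with one compiled regex, split again."""
    robj = re.compile('|'.join(map(re.escape, wlist)))
    joined = '\n'.join(titles)
    return robj.sub(lambda m: wlist[m.group(0)], joined).split('\n')

# Both variants must agree before their timings mean anything.
assert nested_loops(titles, wlist) == joined_regex(titles, wlist)

for fn in (nested_loops, joined_regex):
    seconds = timeit.timeit(lambda: fn(titles, wlist), number=20)
    print(f'{fn.__name__}: {seconds:.4f} s for 20 runs')
```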
