Question

What is the most efficient way to replace compound words in an array of strings?

text = ['San', 'Francisco', 'is', 'foggy', '.','Viva', 'Las', 'Vegas','.']


replacements = {'san_francisco':['San Francisco'],
                'las_vegas': ['Las Vegas'],
                }

text2= ' '.join(text)

for key, value in replacements.items():
    text2=text2.replace(value[0],key)

final=text2.split(' ')

print(final)

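So this approach rebuilds the whole string, loops over the dictionary, and replaces the text. Sublime Text reports that this takes 0.2 seconds. Is there a more efficient way to do this?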
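The figure reported by the editor likely covers the whole script run, including interpreter start-up, so it may say little about the replacement itself. A minimal sketch of timing only the join/replace/split step with the standard `timeit` module follows; the function name `join_replace_split` and the repeat count are assumptions made for illustration.

```python
import timeit

text = ['San', 'Francisco', 'is', 'foggy', '.', 'Viva', 'Las', 'Vegas', '.']
replacements = {'san_francisco': ['San Francisco'],
                'las_vegas': ['Las Vegas']}

def join_replace_split(tokens, table):
    # Same approach as above: join the tokens, replace phrases, split again.
    joined = ' '.join(tokens)
    for key, value in table.items():
        joined = joined.replace(value[0], key)
    return joined.split(' ')

result = join_replace_split(text, replacements)
print(result)

# Repeat the call many times and report the average time per call.
# The repeat count is arbitrary.
runs = 10000
seconds = timeit.timeit(lambda: join_replace_split(text, replacements), number=runs)
print(seconds / runs)
```

The absolute number depends on the machine, so it is mainly useful for comparing alternatives on the same input.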

Answer 1

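I have not profiled this on a larger data set, but it may be more efficient. A lot of the "heavy lifting" in your solution is done by the replace method, so which approach is faster will depend largely on how well optimized CPython's replace method is (i.e. it may use some clever tricks to run very fast).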

text = ['San', 'Francisco', 'is', 'foggy', '.','Viva', 'Las', 'Vegas','.', "wild", "wild", "west"]

replacements = {
'San':  {'Francisco': 'san_francisco'},
'Las': {'Vegas': 'las_vegas'},
'wild': {'wild': {'west': 'wild_wild_west'}}
}

for i in range(0, len(text)-1):

    if text[i] is None:
        continue

    replacement_value = replacements.get(text[i])
    if replacement_value is None:
        continue

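    # Walk the nested dictionaries, counting how many following words match.
    number_of_items_to_delete = 0
    while isinstance(replacement_value, dict):
        number_of_items_to_delete += 1
        next_index = i + number_of_items_to_delete
        # Stop if the text ends before the compound word is complete.
        if next_index >= len(text):
            replacement_value = None
            break
        replacement_value = replacement_value.get(text[next_index])

    if replacement_value is None:
        # Only part of a compound word matched, so leave these words alone.
        continue

    text[i] = replacement_value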

    for j in range(i+1, i+1 + number_of_items_to_delete):
        text[j] = None

text = [n for n in text if n is not None]
print(text)

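We now use a nested dictionary as the lookup table. Note that I have "flipped" the lookup table so that its keys are the words that appear in the word list, since those are what we want to look up in the table.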
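If the replacements already exist in the flat form used in the question, the nested table could be derived from it rather than written by hand. The following is a sketch only; the helper name `build_nested_table` is hypothetical, and it assumes that no phrase is a prefix of another phrase (for example, "New York" together with "New York City" would not work as written).

```python
def build_nested_table(flat):
    """Convert {'san_francisco': ['San Francisco']} into
    {'San': {'Francisco': 'san_francisco'}}."""
    nested = {}
    for replacement, phrases in flat.items():
        for phrase in phrases:
            words = phrase.split(' ')
            node = nested
            # Create one dictionary level per word except the last.
            for word in words[:-1]:
                node = node.setdefault(word, {})
            # The last word maps to the replacement string itself.
            node[words[-1]] = replacement
    return nested

flat = {'san_francisco': ['San Francisco'],
        'las_vegas': ['Las Vegas'],
        'wild_wild_west': ['wild wild west']}
nested = build_nested_table(flat)
print(nested)
```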

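The algorithm can be described as follows:

Iterate over the word list.
If a given word is found in the lookup table, look up its value. If that value is another dictionary, check whether the next word in the word list is in the nested dictionary we just retrieved. Keep track of how many words ahead in the list we have looked.
When the item retrieved from the lookup table is no longer a dictionary (that is, when we have found the actual replacement string), replace the current word with the replacement string. Then, for however many words we advanced to reach the end of the lookup table, replace those indices with None.
Once we are done iterating, remove all instances of None from the word list.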
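The steps above can also be packaged as a function. This sketch is a variant, not the exact code from the answer: instead of marking consumed words with None and filtering afterwards, it builds a new list and skips ahead past the words it has consumed. The function name `replace_compounds` is an assumption for illustration, and the replacement values are assumed to be strings.

```python
def replace_compounds(tokens, table):
    result = []
    i = 0
    while i < len(tokens):
        node = table.get(tokens[i])
        advanced = 0
        # Follow nested dictionaries while there are words left to check.
        while isinstance(node, dict) and i + advanced + 1 < len(tokens):
            advanced += 1
            node = node.get(tokens[i + advanced])
        if isinstance(node, str):
            # A full compound matched: emit the replacement, skip its words.
            result.append(node)
            i += advanced + 1
        else:
            # No match, or only a partial match: keep the word unchanged.
            result.append(tokens[i])
            i += 1
    return result

table = {
    'San': {'Francisco': 'san_francisco'},
    'Las': {'Vegas': 'las_vegas'},
    'wild': {'wild': {'west': 'wild_wild_west'}},
}
tokens = ['San', 'Francisco', 'is', 'foggy', '.', 'Viva', 'Las', 'Vegas', '.',
          'wild', 'wild', 'west']
print(replace_compounds(tokens, table))
```

Because the input list is not modified, a partial match such as 'San' followed by 'Diego' leaves both words in place.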
