Question

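Hi, I have a large document saved as a single sentence (one string), and a list of proper names that may appear in the document.

I would like to replace every instance of a name from the list with the tag [PERSON].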

ex: sentence = "John and Marie went to school today....."

list = ["Maria", "John"....]

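result = "[PERSON] and [PERSON] went to school today....."

As you can see, there may be variations of a name that I still want to catch, such as Maria and Marie, since they are spelled differently but are close.

I know I could use a loop, but since both the list and the sentence are large, there may be a more efficient way to do this. Thanks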

Answer 1

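Use fuzzywuzzy to check whether each word in the sentence closely matches one of the names (a match score of at least 80%; Marie against Maria scores exactly 80), and if it does, replace it with [PERSON].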

>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
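Since the question is about large inputs, here is a minimal sketch of the same per-word idea that scores each distinct word only once. It is an illustration under assumptions, not the answer's method: it swaps fuzzywuzzy for the standard library's difflib.SequenceMatcher, and the helper names is_name and tag_people_fuzzy, the 0.8 cutoff, and the punctuation set are all hypothetical choices.

```python
from difflib import SequenceMatcher
from functools import lru_cache

names = ["Maria", "John"]
lowered_names = [n.lower() for n in names]

@lru_cache(maxsize=None)
def is_name(word, cutoff=0.8):
    """Return True if the word is at least `cutoff` similar to any name.

    Results are cached, so a word that repeats in a large text is scored once.
    """
    # Ignore surrounding punctuation and case when comparing (an assumption).
    core = word.strip(".,;:!?\"'()").lower()
    if not core:
        return False
    return any(
        SequenceMatcher(None, core, name).ratio() >= cutoff
        for name in lowered_names
    )

def tag_people_fuzzy(sentence):
    # Note: a matched word is replaced whole, including attached punctuation.
    return " ".join("[PERSON]" if is_name(w) else w for w in sentence.split())

print(tag_people_fuzzy("John and Marie went to school today....."))
```

The cache helps only when words repeat; each new word is still compared against every name, so a very long name list may need a faster matcher or an index.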

Answer 2

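You can use regular expressions in the input list to match words with spelling variations. For example, if you need to match both Marie and Maria, you can use Mari(e|a) as the regular expression. Here is code that does this: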

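import re

mySentence = "John and Marie and Maria went to school today....."
# Each entry is a regular expression; Mari(e|a) covers both Marie and Maria.
myList = ["Mari(e|a)", "John"]

# Join the entries into one alternation so the sentence is scanned only once.
myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)

print(myNewSentence)  # [PERSON] and [PERSON] and [PERSON] went to school today.....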
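If the list holds plain names rather than hand-written patterns, the combined pattern can be built automatically. The following is a rough sketch under that assumption: it escapes each name with re.escape and adds \b word boundaries so that John does not match inside Johnson. The name tag_people_regex and the sample list are hypothetical, and spelling variants must be listed explicitly because this approach does no fuzzy matching.

```python
import re

names = ["Maria", "Marie", "John"]

# Escape each name so characters like "." are treated literally, and try
# longer names before shorter ones that might be their prefix.
alternation = "|".join(
    re.escape(name) for name in sorted(names, key=len, reverse=True)
)

# \b anchors the match to whole words only.
pattern = re.compile(r"\b(?:" + alternation + r")\b")

def tag_people_regex(sentence):
    return pattern.sub("[PERSON]", sentence)

print(tag_people_regex("John and Marie and Maria met Johnson today."))
```

Compiling the pattern once and reusing it keeps the replacement to a single pass over the text, whatever the number of names.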

Efficient way to replace substrings from a list
