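Hi, I have a large document stored as a single string (sentence), and a list of proper names that may appear in the document.
I would like to replace every occurrence of a name from the list with the tag [PERSON].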
ex: sentence = "John and Marie went to school today....."
list = ["Maria", "John"....]
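result = "[PERSON] and [PERSON] went to school today....."
As you can see, I also want to catch names such as Maria and Marie, since they are spelled differently but are very close.
I know I could use a loop, but since both the list and the sentence are large, there may be a more efficient way. Thanks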
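Answer 0 (score: 1)
Use fuzzywuzzy.
Check whether each word in the sentence closely matches one of the names (a match score of 80 or more out of 100), and if it does, replace it with [PERSON].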
>>> from fuzzywuzzy import process, fuzz
>>> names = ["Maria", "John"]
>>> sentence = "John and Marie went to school today....."
>>>
>>> match = lambda word: process.extractOne(word, names, scorer=fuzz.ratio, score_cutoff=80)
>>> ' '.join('[PERSON]' if match(word) else word for word in sentence.split())
'[PERSON] and [PERSON] went to school today.....'
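If installing fuzzywuzzy is not an option, roughly the same idea can be sketched with the standard library's difflib. This is only an assumption-laden variant, not the answer's own method: the helper names is_name and tag_people are made up for illustration, difflib's ratio is case-sensitive (fuzzywuzzy's default processor lowercases first), and the cache is there so each distinct word is compared against the name list only once, which may help on a large document.

```python
import re
from difflib import get_close_matches
from functools import lru_cache

names = ["Maria", "John"]

@lru_cache(maxsize=None)
def is_name(word):
    # True when `word` has a difflib similarity ratio of at least 0.8
    # to some entry in `names` (hypothetical helper, case-sensitive).
    return bool(get_close_matches(word, names, n=1, cutoff=0.8))

def tag_people(sentence):
    # Visit each run of letters, leaving punctuation and spacing untouched,
    # and swap in the tag when the word fuzzily matches a name.
    return re.sub(
        r"[^\W\d_]+",
        lambda m: "[PERSON]" if is_name(m.group()) else m.group(),
        sentence,
    )

print(tag_people("John and Marie went to school today....."))
```

Because the substitution works on letter runs rather than on sentence.split(), trailing punctuation such as "Marie," does not lower the similarity score.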
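Answer 1 (score: 0)
You can put regular expressions in the input list to match words with spelling variations. For example, if you need to match both Marie and Maria, you can use Mari(e|a) as the pattern. Here is code that does this: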
import re
mySentence = "John and Marie and Maria went to school today....."
myList = ["Mari(e|a)", "John"]
myNewSentence = re.compile("|".join(myList)).sub('[PERSON]', mySentence)
print(myNewSentence) # [PERSON] and [PERSON] and [PERSON] went to school today.....
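One thing to watch with the plain alternation is that it also matches inside longer words, so "Johnson" would become "[PERSON]son". A possible refinement, sketched here with the hypothetical names person_re and tag_people_regex, is to wrap the alternation in word boundaries and compile it once so that the whole document is scanned in a single pass:

```python
import re

# Each entry is itself a regular expression; plain literal names could be
# passed through re.escape() first if they might contain special characters.
patterns = ["Mari(?:e|a)", "John"]

# \b on both sides restricts matches to whole words.
person_re = re.compile(r"\b(?:" + "|".join(patterns) + r")\b")

def tag_people_regex(sentence):
    # Replace every whole-word match of any pattern with the tag.
    return person_re.sub("[PERSON]", sentence)

print(tag_people_regex("John and Marie and Maria went to school today....."))
```

This still requires writing the spelling variants by hand, so it suits a short, known list of variants better than open-ended fuzzy matching.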