Removing a large number of strings from a text

Time: 2018-12-10 01:25:30

Tags: regex python-3.x

Suppose

txt='Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'

is a large piece of text, and I want to remove a long list of strings from it, for example

removalLists=['Daniel Johnson','Ana Hickman']

That is, I want to replace every element of the list with

' '

I know I can do this easily with a loop such as

for string in removalLists:
    txt=re.sub(string,' ',txt)

Is there a faster way to do this?

1 Answer:

Answer 0: (score: 3)

One approach is to build a single regex pattern that is an alternation of the terms to remove. For your example, I suggest the following pattern:

\bDaniel Johnson\b|\bAna Hickman\b

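To generate it, we first wrap each term in word boundaries (\b). Then we collapse the list into a single string, using | as the separator. Finally, we use re.sub to replace every occurrence of any term with a single space.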

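import re

txt = 'Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'
removalLists = ['Daniel Johnson', 'Ana Hickman']

# Wrap each term in word boundaries, then join the terms into one alternation
regex = '|'.join([r'\b' + s + r'\b' for s in removalLists])
# A single pass over the text replaces every occurrence of any term
output = re.sub(regex, " ", txt)

print(output)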

  and   are friends. They know each other for a long time.   is a professor and   is writer.
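
If the removal list is long or comes from outside the program, a slightly more defensive variant of the same idea may be worth considering. The sketch below is not part of the original answer: the helper name build_removal_pattern and the sample terms are made up for illustration. It assumes every term starts and ends with a word character, since \b does not behave as a boundary next to punctuation. It escapes each term so regex metacharacters match literally, tries longer terms first so a short term cannot pre-empt a longer one that starts the same way, and compiles the pattern once so it can be reused across many texts.

```python
import re

def build_removal_pattern(terms):
    """Compile one alternation that matches any of the terms as a whole word."""
    # Longest terms first, so 'Ana Hickman' is tried before the shorter 'Ana'.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape makes characters such as '.' or '+' match literally.
    alternation = '|'.join(re.escape(term) for term in ordered)
    # One shared pair of word boundaries around a non-capturing group.
    return re.compile(r'\b(?:' + alternation + r')\b')

# Hypothetical sample data, chosen to show the ordering and boundary behavior.
pattern = build_removal_pattern(['Ana', 'Ana Hickman', 'Daniel Johnson'])
cleaned = pattern.sub(' ', 'Ana Hickman met Daniel Johnson and Anabel.')
print(repr(cleaned))
```

Here 'Anabel' is left alone because there is no word boundary after its first three letters, and 'Ana Hickman' is removed as a whole rather than leaving ' Hickman' behind. Whether this is actually faster than the loop depends on the size of the list and the text, so it is worth timing on real data.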