Regex to replace certain periods with commas in customer reviews

Time: 2016-10-08 22:54:26

Tags: python regex

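In some patient reviews of medications, I need a regular expression that replaces '.' with ','. The patients should have used a comma after each side effect they mention, but some of them used a period instead. For example: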

text = "the drug side-effects are: night mare. nausea. night sweat. bad dream. dizziness. severe headache.  I suffered. she suffered. she told I should change it."

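I wrote two regex substitutions that look for one word (such as headache) or two words (such as bad dream) enclosed between two periods.

To detect a single word between two periods:

text = re.sub(r'(\.)(\s*\w+\s*\.)', r',\2 ', text)

To detect two words between two periods:

text = re.sub(r'(\.)(\s*\w+\s\w+\s*\.)', r',\2 ', text)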

This is the output:

the drug side-effects are: night mare, nausea,  night sweat.  bad dream, dizziness,  severe headache.   I suffered, she suffered.  she told I should change it.

But it should be:

the drug side-effects are: night mare, nausea,  night sweat,  bad dream, dizziness,  severe headache.   I suffered. she suffered.  she told I should change it.

My code does not replace the period after night sweat with ','. In addition, if a sentence starts with a subject pronoun (such as I or she), I do not want the period before it changed to a comma, even when the sentence has only two words (such as I suffered). I do not know how to add this condition to my code.

Any suggestions? Thanks!

1 answer:

Answer 0: (score: 1)

You can use the following pattern:

\.(\s*(?!(?:i|she)\b)\w+(?:\s+\w+)?\s*)(?=[^\w\s]|$)

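This matches a period and then captures one or two words, where the first word is not one of the pronouns you mentioned (you will most likely need to extend that list). What follows must be either a character that is neither a word character nor whitespace (for example . ! : ,) or the end of the string. Because that final check is a lookahead, the closing period is not consumed and can start the next match.

You then replace the match with ,\1

In Python: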

import re
text = "the drug side-effects are: night mare. nausea. night sweat. bad dream. dizziness. severe headache.  I suffered. she suffered. she told I should change it."
text = re.sub(r'\.(\s*(?!(?:i|she)\b)\w+(?:\s+\w+)?\s*)(?=[^\w\s]|$)', r',\1', text, flags=re.I)
print(text)

Output:

the drug side-effects are: night mare, nausea, night sweat, bad dream, dizziness, severe headache.  I suffered. she suffered. she told I should change it.
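
The reason night sweat was skipped in the question appears to be that the original patterns consume the closing period, so it is no longer available as the opening period of the next match. A minimal sketch of the difference, using a shortened sample string that is made up for illustration:

```python
import re

# Hypothetical shortened sample, not taken from the question.
sample = "nausea. dizziness. fatigue. insomnia."

# Consuming variant: the closing period is part of the match, so the
# item that follows it cannot begin a new match and is skipped.
consuming = re.sub(r'\.(\s*\w+\s*)\.', r',\1.', sample)

# Lookahead variant: the closing period is only checked, not consumed,
# so it can serve as the opening period of the next match.
lookahead = re.sub(r'\.(\s*\w+\s*)(?=\.)', r',\1', sample)

print(consuming)  # nausea, dizziness. fatigue, insomnia.
print(lookahead)  # nausea, dizziness, fatigue, insomnia.
```

With the consuming variant every second period survives, which matches the pattern of misses in the question's output.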

This is probably not completely safe, and you may need to extend the pattern to handle some edge cases.
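
One possible way to extend the pronoun list is to build the alternation from a Python list instead of editing the pattern by hand. This is only a sketch: the `PRONOUNS` list and the `dots_to_commas` helper are illustrative names, not part of the original answer, and the list would need adjusting to the actual reviews.

```python
import re

# Hypothetical pronoun list; extend or trim it to suit the data.
PRONOUNS = ["i", "you", "he", "she", "it", "we", "they"]

# Join the escaped words into an alternation for the negative lookahead.
alternation = "|".join(map(re.escape, PRONOUNS))

# Same structure as the pattern above, with the pronoun list plugged in.
PATTERN = re.compile(
    r'\.(\s*(?!(?:' + alternation + r')\b)\w+(?:\s+\w+)?\s*)(?=[^\w\s]|$)',
    flags=re.I,
)

def dots_to_commas(text):
    """Replace a period with a comma when one or two non-pronoun words follow."""
    return PATTERN.sub(r',\1', text)

print(dots_to_commas(
    "side-effects: nausea. bad dream. headache. They stopped. we agreed."
))
# side-effects: nausea, bad dream, headache. They stopped. we agreed.
```

The `\b` after the alternation keeps words that merely begin with a pronoun, such as itching, from being excluded.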