在某些患者对药物的评论中,我需要编写一个正则表达式,将'.'
替换为','
。他们应该在提到副作用后使用逗号,但其中一些使用了点。例如:
text = "the drug side-effects are: night mare. nausea. night sweat. bad dream. dizziness. severe headache. I suffered. she suffered. she told I should change it."
我写了一个正则表达式代码,用两个点来检测一个单词(如头疼)或两个单词(如坏梦):
text= re.sub (r'(\.)(\s*\w+\s*\.)',r',\2 ', text )
text = re.sub (r'(\.)(\s*\w+\s\w+\s*\.)',r',\2 ', text11 )
这是输出:
the drug side-effects are: night mare, nausea, night sweat. bad dream, dizziness, severe headache. I suffered, she suffered. she told I should change it.
但它应该是:
the drug side-effects are: night mare, nausea, night sweat, bad dream, dizziness, severe headache. I suffered. she suffered. she told I should change it.
我的代码在dot
之后没有替换night sweat to ','
。另外,if a sentence starts with a subject pronoun (such as I and she) I do not want to change dot to comma after it, even if it has two words (such as, I suffered)
。我不知道如何将这个条件添加到我的代码中。
有什么建议吗?谢谢!
答案 0 :(得分:1)
您可以使用以下模式:
\.(\s*(?!(?:i|she)\b)\w+(?:\s+\w+)?\s*)(?=[^\w\s]|$)
这匹配一个点,然后捕获一个或两个单词,其中第一个单词不是您提到的代词(您最需要扩展该列表)。接下来必须是一个既不是单词字符也不是空格的字符(例如.
!
:
,
)或字符串的结尾。
然后,您必须将其替换为,\1
在python中
import re
text = "the drug side-effects are: night mare. nausea. night sweat. bad dream. dizziness. severe headache. I suffered. she suffered. she told I should change it."
text = re.sub(r'\.(\s*(?!(?:i|she)\b)\w+(?:\s+\w+)?\s*)(?=[^\w\s]|$)', r',\1', text, flags=re.I)
print(text)
输出
the drug side-effects are: night mare, nausea, night sweat, bad dream, dizziness, severe headache. I suffered. she suffered. she told I should change it.
这可能不是绝对安全的,您可能需要扩展某些边缘情况的模式。