How do I replace some periods with a space, but not all of them?

Time: 2019-02-04 00:29:58

Tags: python regex string

How can I replace certain periods with a space, but not every period?

For example:

this_string = 'Man is weak.So they die'
that_string = 'I have a Ph.d'

Here, this is the result I want:

this_string = 'Man is weak So they die'
that_string = 'I have a Phd'

I want titles such as Ph.d to stay as a single word, while a period that joins two sentences is replaced by a space.


This is what I have so far:

re.sub(r'[^A-Za-z0-9\s]+', ' ', this_string)

This replaces every period with a space, including the one inside Ph.d.

Any ideas on how to improve this?
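To make the problem concrete, here is a small check of that attempt on both sample strings. The helper name `strip_punctuation` is made up for illustration; the pattern is the one from the question, written as a raw string.

```python
import re

def strip_punctuation(text):
    # Replace each run of characters that are not letters, digits,
    # or whitespace with a single space.
    return re.sub(r'[^A-Za-z0-9\s]+', ' ', text)

print(strip_punctuation('Man is weak.So they die'))  # Man is weak So they die
print(strip_punctuation('I have a Ph.d'))            # I have a Ph d
```

The first string comes out as wanted, but the title is split into two words.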

2 Answers:

Answer 0: (score: 0)

You can use two regular expressions as rules to rewrite the text:

import re

text = 'Man is weak.So they die. I have a Ph.d.'

text = re.sub(r'([A-Za-z ]{1})(\.)([A-Z]{1})', r'\g<1>. \g<3>', text)  # remove the dot in r'\g<1>. \g<3>' to get '...weak So...'
print(text)  # Man is weak. So they die. I have a Ph.d.

text = re.sub(r'([A-Za-z ]{1})(\.)([a-z]{1})', r'\g<1>\g<3>', text)
print(text)  # Man is weak. So they die. I have a Phd.

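In the end, this is not reliable, because it is a rule-based transformation. Something like Ph.D does not work: the first rule treats the period before the uppercase D as a sentence boundary and produces Ph. D.

Answer 1: (score: 0)

You can first replace all of the problematic periods with a new marker, and then split on that marker: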
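The limitation can be reproduced with a short sketch. The function name `apply_rules` is hypothetical, and the two patterns are simplified versions of the ones in this answer (the period is no longer captured as its own group).

```python
import re

def apply_rules(text):
    # Rule 1: a period directly followed by an uppercase letter is
    # treated as a sentence boundary, so a space is inserted after it.
    text = re.sub(r'([A-Za-z ])\.([A-Z])', r'\g<1>. \g<2>', text)
    # Rule 2: a period directly followed by a lowercase letter is dropped.
    return re.sub(r'([A-Za-z ])\.([a-z])', r'\g<1>\g<2>', text)

print(apply_rules('I have a Ph.d.'))  # I have a Phd.
print(apply_rules('I have a Ph.D.'))  # I have a Ph. D.
```

Because the rules only look at the case of the next letter, they cannot tell an abbreviation from a sentence boundary when the abbreviation contains a capital after the period.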

import re

abbreviations = ["Dr.", "Mrs.", "Mr.", "Ph.d"]
# escape the abbreviations so that their dots only match a literal "."
rx = re.compile(r'''(?:{})|((?<=[a-z])\.(?=\s*[A-Z]))'''.format("|".join(map(re.escape, abbreviations))))

data = "Man is weak.So they die. I have a Ph.d"

# substitute them first
def repl(match):
    if match.group(1) is not None:
        return "#!#"
    return match.group(0)

data = rx.sub(repl, data)
for sent in re.split(r"#!#\s*", data):
    print(sent.replace(".", ""))

This produces

Man is weak
So they die
I have a Phd

See a demo on ideone.com.
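This answer prints one sentence per line, whereas the question asked for a single string in which sentence-joining periods become spaces. A minimal sketch of that variant follows, reusing the same idea of matching abbreviations first. The names `ABBREVIATIONS`, `PATTERN`, and `normalize_periods` are illustrative assumptions, and the abbreviation list has to be maintained by hand.

```python
import re

# Assumed list of abbreviations to protect; extend it for your own data.
ABBREVIATIONS = ["Dr.", "Mrs.", "Mr.", "Ph.d"]

# Abbreviations are tried first; any other period is captured in group 1.
PATTERN = re.compile(
    r"(?:{})|(\.)".format("|".join(map(re.escape, ABBREVIATIONS)))
)

def normalize_periods(text):
    def repl(match):
        if match.group(1) is not None:
            return " "                          # ordinary period -> space
        return match.group(0).replace(".", "")  # abbreviation -> drop its periods
    # Collapse repeated whitespace left behind by ". " and trim the ends.
    return re.sub(r"\s+", " ", PATTERN.sub(repl, text)).strip()

print(normalize_periods("Man is weak.So they die"))  # Man is weak So they die
print(normalize_periods("I have a Ph.d"))            # I have a Phd
```

As with any list-based approach, an abbreviation that is not in the list (for example Ph.D with a capital D) is treated as an ordinary period.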