Regex to add a character to a matched string

Date: 2017-03-11 06:07:02

Tags: python regex nlp

I have a long string that is a paragraph, but there are no spaces after some of the periods. For example:

para = "I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.It is in black and white but saves the colour for one shocking shot.At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.Avoid."

I am trying to fix this with re.sub, but the output is not what I expected.

This is what I did:

re.sub("(?<=\.).", " \1", para)

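I am matching the first character of each sentence, and I want to put a space in front of it. My match pattern is (?<=\.)., which (supposedly) matches any character that comes right after a period. I learned from other Stack Overflow questions that \1 refers to the last matched pattern, so I wrote the replacement as " \1": a space followed by the previously matched string.

This is the output: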

"I saw this film about 20 years ago and remember it as being particularly nasty. \x01I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women. \x01t is in black and white but saves the colour for one shocking shot. \x01t the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. \x01void. \x01

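Instead of matching any character preceded by a period and adding a space before it, re.sub replaced the matched character with \x01. Why? How do I add a character before the matched string?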

5 Answers:

Answer 0: (score: 8)

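(?<=a)b is a positive lookbehind. It matches a b that follows an a. The a is not captured. So in your expression, I'm not sure what the value of \1 is in this case, but it is not what is inside (?<=...).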
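The \x01 in the output can be reproduced without a lookbehind at all. The sketch below uses a made-up sample string (`text` is an illustrative name, not from the question) and suggests two things: in a non-raw string literal, "\1" is an octal escape for the control character chr(1), and once the replacement is written as a raw string, re complains that group 1 does not exist. Using \g<0>, which refers to the whole match, is one way to get the intended behavior.

```python
import re

# In an ordinary (non-raw) string literal, "\1" is an octal escape for chr(1),
# so re.sub never sees a backreference at all.
assert " \1" == " \x01"

text = "one.two"  # illustrative sample, not from the question

# With a raw string the replacement really is a backreference to group 1,
# but the pattern has no capturing group, so re raises an error.
try:
    re.sub(r"(?<=\.).", r" \1", text)
    raised = False
except re.error:
    raised = True

# \g<0> refers to the entire match, so no capturing group is needed.
fixed = re.sub(r"(?<=\.).", r" \g<0>", text)
print(raised, fixed)
```

This still adds a space when one is already present, so it only explains the \x01; it is not a complete fix.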

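Your current approach has another flaw: it adds a space after the . even when there already is one.

To add the missing space after a ., I suggest a different strategy: replace every . that is followed by a character that is neither a space nor a dot with a . and a space: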

re.sub(r'\.(?=[^ .])', '. ', para)
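A small check of that pattern on a shortened, made-up sample (the strings and variable names below are illustrative assumptions) shows which periods it touches. It also shows one caveat worth knowing: a decimal point is followed by a "non-space, non-dot" character too.

```python
import re

pattern = r'\.(?=[^ .])'

# Periods followed by a space, by another period, or by the end of the
# string are left alone; only "women.It" gets a space.
spaced = re.sub(pattern, '. ', "women.It is... fine. Avoid.")
print(spaced)

# Caveat: the period in a decimal number is split as well.
decimal = re.sub(pattern, '. ', "rated 3.5")
print(decimal)
```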

Answer 1: (score: 2)

You may use the following regex (using positive look-behind and negative look-ahead assertions):

(?<=\.)(?!\s)

Python

re.sub(r"(?<=\.)(?!\s)", " ", para)
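One detail to be aware of, sketched below on a made-up sample: the negative look-ahead (?!\s) also succeeds at the very end of the string, so a trailing space is appended after the final period. Swapping in a positive look-ahead (?=\S) is one possible variant (my assumption, not part of the answer above) that requires an actual non-whitespace character to follow.

```python
import re

sample = "kills various women.It is confused. Avoid."

# (?!\s) is satisfied at end of string, so a space is added after "Avoid."
with_negative = re.sub(r"(?<=\.)(?!\s)", " ", sample)

# (?=\S) needs a real non-whitespace character, so the end is left alone.
with_positive = re.sub(r"(?<=\.)(?=\S)", " ", sample)

print(repr(with_negative))
print(repr(with_positive))
```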


Answer 2: (score: 2)

A slightly modified version of your regex also works:

print(re.sub(r"([\.])([^\s])", r"\1 \2", para))

# I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women. It is in black and white but saves the colour for one shocking shot. At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. Avoid.

Answer 3: (score: 1)

I think this is what you are trying to do. You can pass a function to perform the replacement.

import re

def my_replace(match):
    return " " + match.group()

my_string = "dhd.hd hd hs fjs.hello"
print(re.sub(r'(?<=\.).', my_replace, my_string))

Prints:

dhd. hd hd hs fjs. hello

As @Seanny123 pointed out, this adds a space even when there is already a space after the period.
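One way to work around that, as a rough sketch: let the callback return the matched character unchanged when it is already whitespace. The function name `add_space_if_missing` and the sample string are made up for illustration.

```python
import re

def add_space_if_missing(match):
    """Prefix the matched character with a space unless it is whitespace."""
    ch = match.group()
    return ch if ch.isspace() else " " + ch

my_string = "dhd.hd. hd fjs.hello"
result = re.sub(r'(?<=\.).', add_space_if_missing, my_string)
print(result)
```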

Answer 4: (score: 0)

The simplest regex replacement you could use is:

re.sub(r'\.(?=\w)', '. ', para)

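It simply matches each period, uses the lookahead (?=\w) to make sure that a word character comes next (so there is not already a space after the period), and replaces the period with a period followed by a space.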
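Since \w also matches digits, this pattern splits decimal numbers. If that matters for your text, a letters-only lookahead is one possible variant; it is my assumption rather than part of the answer above, and the sample string is made up.

```python
import re

sample = "Rated 3.5 stars.Avoid."

# \w includes digits, so "3.5" becomes "3. 5".
word_lookahead = re.sub(r'\.(?=\w)', '. ', sample)

# Restricting the lookahead to ASCII letters leaves the decimal intact.
letter_lookahead = re.sub(r'\.(?=[A-Za-z])', '. ', sample)

print(word_lookahead)
print(letter_lookahead)
```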