I am trying to split a text into sentences correctly in Python, following normal grammar rules.
The text I want to split is
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
Expected output:
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
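To achieve this I am using a regular expression. After a lot of searching I came across the following regex, which does the trick. new_str is just s with some \n removed.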
import re

new_str = s.replace('\n', '')  # assumption: s with its line breaks removed
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
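The way I understand the regex is that we first select
1) all the character sequences like i.e.
2) from the whitespace left after the first step, we keep only the positions that are not preceded by titles such as Mr., Mrs. and so on.
3) from what is left after the second step, we keep only the positions where a dot or a question mark is followed by whitespace.
So I tried to change the order as follows:
1) first filter out all the titles.
2) from what is left, select the positions where a dot or question mark is followed by whitespace.
3) remove all phrases like i.e.
But when I do that, the whitespace after i.e. is split on as well: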
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', new_str)
for i in m:
    print(i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step in the modified regex be able to recognise phrases like i.e.? Why does it fail to detect them?
Answer 0 (score: 1)
First, the last . in (?<!\w\.\w.) looks suspicious. If you need to match a literal dot, escape it: (?<!\w\.\w\.).
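To see why the unescaped dot matters, here is a small sketch. The sample string "Is it 1.5? Yes." is made up for illustration and does not come from the question. Because an unescaped . also matches ?, the lookbehind treats "1.5?" like an abbreviation and suppresses a split that should happen.

```python
import re

# Hypothetical sample text (not from the question), chosen so that a
# question mark follows a "digit.digit" sequence.
text = "Is it 1.5? Yes."

# Pattern from the question: the last "." in the first lookbehind is unescaped.
unescaped = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
# Same pattern with the final dot escaped so it matches only a literal ".".
escaped = r'(?<!\w\.\w\.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

unescaped_parts = re.split(unescaped, text)
escaped_parts = re.split(escaped, text)

print(unescaped_parts)  # "1.5?" matches \w\.\w. so the split is blocked
print(escaped_parts)    # "1.5?" does not end in a literal dot, so it splits
```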
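Coming back to the question: when you use the regex r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)', the last negative lookbehind checks whether the position after the whitespace is not immediately preceded by a word char, a dot, a word char and any char (since the . is unescaped). That condition holds, because the four characters before that position are a dot, e, another . and a space, so nothing blocks the split.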
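The check can be reproduced by hand. The sketch below takes the four characters that end right after the whitespace and the four that end right before it, and tests each against the lookbehind body \w\.\w. on its own. The shortened sample string and the variable names are illustrative only.

```python
import re

# Shortened excerpt of the question's text, for illustration.
new_str = "for 1.5 million dollars,i.e. he paid a lot for it."

# Index of the whitespace character that follows "i.e."
space = new_str.index("i.e. ") + len("i.e.")

# Four characters ending right AFTER the whitespace
# (what the lookbehind sees when it is placed after \s).
after_ws = new_str[space - 3:space + 1]
# Four characters ending right BEFORE the whitespace
# (what the lookbehind sees when it is placed before \s).
before_ws = new_str[space - 4:space]

body = r'\w\.\w.'  # the body of the lookbehind (?<!\w\.\w.)

print(repr(after_ws), bool(re.fullmatch(body, after_ws)))
print(repr(before_ws), bool(re.fullmatch(body, before_ws)))
```

Only the window that ends before the whitespace matches the body, which is why the negative lookbehind blocks the split in the original pattern but not in the reordered one.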
To make the lookbehind behave the same way as it did when it came before \s, put the \s inside the lookbehind pattern as well:
(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo.
Another enhancement is to use a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
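Putting these suggestions together, a minimal sketch might look like the following. Two things here are assumptions rather than part of the answer: the line breaks in s are collapsed into single spaces (the question's new_str dropped them instead, which glued some words together), and the final dot in the last lookbehind is escaped.

```python
import re

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

# Assumption: collapse every run of whitespace (including line breaks)
# into a single space before splitting.
new_str = " ".join(s.split())

# \s moved inside the last lookbehind, [.?] used as a character class,
# and the final dot escaped so it matches only a literal ".".
pattern = r'(?<![A-Z][a-z]\.)(?<=[.?])\s(?<!\w\.\w\.\s)'

sentences = re.split(pattern, new_str)
for sentence in sentences:
    print(sentence)
```

This remains a heuristic. For instance, (?<![A-Z][a-z]\.) only covers titles made of one capital and one lowercase letter, so a longer title such as "Mrs." would still be split on.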