How do I split a string into sentences when there are spaces between words and punctuation marks?

Time: 2015-01-07 04:42:02

Tags: python regex

I have the following string:

string = "Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . In any case , this isn't true ... Well , with a probability of  . 9 it isn't . What a great site ! I really loved it !!! Did you ???"

I need to split it into sentences like this:

Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it . 
Did he mind ? 
Steve jones jr . thinks he didn't .
In any case , this isn't true ...
Well , with a probability of  . 9 it isn't . 
What a great site !
I really loved it !!!
Did you ???

and store them in a list of sentences.

I used the following code:

import re
# input_doc2 holds the text to split
sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s", input_doc2)
print(sents)

The output I get is:

    ['mr .', 'smith bought cheapsite .', 'com for 1 .', '5 million dollars , i .', 'e .', 'he paid a lot for it .', 'did he mind ?', 'adam jones jr .', "thinks he didn't .", "in any case , this isn't true ...", 'well , with a probability of  .', "9 it isn't .", 'what a great movie !', 'i loved it .', 'i loved it !!!', 'did you ???', 'i did .!?', 'not really it was bad !', '']

This is wrong, and I can't seem to find a way around it. Is there a way to solve this?
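
For reference, here is a minimal sketch that reproduces the over-splitting on the example string itself (the list above appears to come from a slightly different input). The name `text` is only an illustrative stand-in for the `string` shown earlier. Because every punctuation mark is preceded by a space, the two negative lookbehinds, which seem to be written for forms such as `i.e.` and `Mr.`, never match here, so every whitespace after `.`, `?` or `!` becomes a split point:

```python
import re

# The example string from the question, wrapped across lines for readability.
text = ("Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . "
        "he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . "
        "In any case , this isn't true ... Well , with a probability of  . 9 it isn't . "
        "What a great site ! I really loved it !!! Did you ???")

# Same pattern as in the code above: split on whitespace that follows
# '.', '?' or '!', unless one of the two negative lookbehinds matches.
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s"

over_split = re.split(pattern, text)
print(len(over_split))   # many more pieces than the 8 wanted sentences
print(over_split[:3])    # the first pieces are cut at "Mr ." and "greatsite ."
```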

Thanks in advance.

1 answer:

Answer 0: (score: 1)

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])

Try this. See the demo.

https://regex101.com/r/sH8aR8/3

import re
# Same split as before, plus a lookahead requiring an uppercase letter next
sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])", input_doc2)
print(sents)
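
As a self-contained check, the sketch below applies this pattern to the example string from the question. The names `text` and `SENTENCE_BOUNDARY` are illustrative only and are not part of the original post; the pattern is copied unchanged.

```python
import re

# The example string from the question, wrapped across lines for readability.
text = ("Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . "
        "he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . "
        "In any case , this isn't true ... Well , with a probability of  . 9 it isn't . "
        "What a great site ! I really loved it !!! Did you ???")

# Split on a whitespace character that
#   - follows '.', '?' or '!'        (positive lookbehind), and
#   - is followed by an uppercase letter (positive lookahead),
# unless one of the two negative lookbehinds matches.
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])"
)

sents = SENTENCE_BOUNDARY.split(text)
for sentence in sents:
    print(sentence)
```

On this input the split yields the eight sentences listed in the question. The added lookahead `(?=[A-Z])` assumes that every new sentence starts with an uppercase ASCII letter, so it would also split after an abbreviation followed by a capitalized word (for example `Mr . John`), and it would miss sentences that begin with a lowercase letter or a digit.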