How to split a string on a delimiter but exclude other strings

Asked: 2016-01-05 04:58:55

Tags: regex python-2.7 delimiter

I want to split this string on periods:

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

This is the result I want:

['you can get it cheaper than $20.99. ', 'shop at amazon.com.', ' hurry before prices go up.']

I split on any period that comes right after a lowercase letter, and on any period followed by a space that comes after a digit:

import re

x = []
# Split on a lowercase letter + period, or a digit + period + space.
# The capture group keeps the delimiters in the result list.
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]
# Re-attach each ending to the text that precedes it.
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        x.append(sentences[position - 1] + sentences[position])

Printing x gives me:

['you can get it cheaper than $20.99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.']

I want "amazon.com" to stay as a single string, so I tried telling the regex to ignore ".com" with re.split(r'([a-z]\.|\d\.\s)[^.com]', j), but that doesn't give me the result I want either. What is the best way to do this?

2 Answers:

Answer 0 (score: 3)

A simple regex to split on a period followed by whitespace would be \.\s

You can use a lookbehind to keep the period with each piece of the split: (?<=\.)\s
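
For illustration (my addition, not part of the original answer), applying that lookbehind split to the j from the question should keep "amazon.com" intact, since the period inside it is not followed by whitespace:

>>> import re
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
>>> re.split(r'(?<=\.)\s', j)
['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.']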

If you want to use the split approach to get "amazon.com" out of your string, you can try .*(?=amazon.com)|(?<=amazon.com).*
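
A rough sketch of how that could be used (my reading of the answer, not shown in it): the lookahead branch consumes everything before "amazon.com" and the lookbehind branch consumes everything after it, so filtering out the empty pieces that re.split leaves behind should isolate just "amazon.com":

>>> import re
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
>>> [s for s in re.split(r'.*(?=amazon.com)|(?<=amazon.com).*', j) if s]
['amazon.com']

Note that the dots in amazon.com are left unescaped here, exactly as written in the answer, so they match any character; escaping them as amazon\.com would be stricter.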

Answer 1 (score: 1)

A non-regex option is to use nltk.sent_tokenize():

>>> import nltk
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
>>> nltk.sent_tokenize(j)
['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.']