I want to split this string at the periods:
j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
This is the result I want:
['you can get it cheaper than $20.99. ', 'shop at amazon.com.', ' hurry before prices go up.']
I am splitting on a period that follows any lowercase letter, and on any digit that is followed by a period and a space.
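import re

x = []
# The capturing group keeps each sentence ending in the split result.
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        # Re-attach each ending to the text that precedes it.
        x.append(sentences[position - 1] + sentences[position])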
Printing x gives me:
['you can get it cheaper than $20.99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.']
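I want "amazon.com" to stay together as one string, so I tried telling the regex to ignore ".com" with re.split(r'([a-z]\.|\d\.\s)[^.com]', j)
But that did not give me the result I want. What is the best way to do this?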
Answer 0: (score: 3)
A simple regex for splitting on a period followed by whitespace could be \.\s
You can use a lookbehind to keep the period in the split pieces: (?<=\.)\s
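A minimal sketch of the two patterns above, applied to the string from the question. The variable names without_periods and with_periods are only for illustration:

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# Splitting on "period + whitespace" consumes the period of every sentence
# except the last one, which has no whitespace after it.
without_periods = re.split(r'\.\s', j)

# The lookbehind only asserts that a period precedes the whitespace,
# so the period itself is not consumed and stays in each piece.
with_periods = re.split(r'(?<=\.)\s', j)

print(without_periods)
print(with_periods)
```

Neither pattern splits inside "$20.99" or "amazon.com", because those periods are not followed by whitespace. Both drop the whitespace between sentences, so the pieces do not keep the leading or trailing spaces shown in the question's desired output.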
If you want to use the split approach to get "amazon.com" out of your string, you could try .*(?=amazon.com)|(?<=amazon.com).*
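A rough sketch of how that last pattern might be used with re.split. Two things here are assumptions rather than part of the answer: the dots in amazon.com are escaped so they match only a literal period, and empty strings in the result are filtered out.

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# Everything before "amazon.com" and everything after it act as separators.
# Assumption: the dots are escaped here; the pattern as posted leaves them bare.
pattern = r'.*(?=amazon\.com)|(?<=amazon\.com).*'

# re.split can leave empty strings around the separators, so drop them.
parts = [part for part in re.split(pattern, j) if part]

print(parts)
```

This isolates the domain rather than splitting the text into sentences, so it is probably only useful when "amazon.com" is the one piece you need.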
Answer 1: (score: 1)
A non-regex option is to use nltk.sent_tokenize():
>>> import nltk
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
>>> nltk.sent_tokenize(j)
['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.']