Question

I want to split this string at the periods:

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

This is the result I want:

['you can get it cheaper than $20.99. ', 'shop at amazon.com.', ' hurry before prices go up.']

I am splitting on every lowercase letter followed by a period, and on every digit followed by a period and a space.

import re

x = []
# The capturing group makes re.split keep each delimiter in the result list.
sentences = re.split(r'([a-z]\.|\d\.\s)', j)
sentence_endings = sentences[1::2]
for position in range(len(sentences)):
    if sentences[position] in sentence_endings:
        x.append(sentences[position - 1] + sentences[position])

Printing x gives me:

['you can get it cheaper than $20.99. ', 'shop at amazon.', 'com.', ' hurry before prices go up.']

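I want "amazon.com" to stay as a single string, so I tried to tell the regex to ignore ".com" with re.split(r'([a-z]\.|\d\.\s)[^.com]', j), but that did not give me the result I wanted. What is the best way to do this?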

Answer 1

A simple regex for splitting on a period followed by whitespace is \.\s.

You can use a lookbehind to keep the period in each piece of the split: (?<=\.)\s
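As a rough sketch of the difference between those two patterns, the snippet below runs both against the string from the question. The variable names without_periods and with_periods are only illustrative and do not come from the original post.

```python
import re

j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'

# Splitting on "period + whitespace" consumes the period along with the space,
# so every piece except the last loses its trailing period.
without_periods = re.split(r'\.\s', j)

# The lookbehind only asserts that a period precedes the whitespace,
# so just the whitespace is consumed and each sentence keeps its period.
with_periods = re.split(r'(?<=\.)\s', j)

print(without_periods)
print(with_periods)
```

Neither pattern splits inside "20.99" or "amazon.com", because those periods are not followed by whitespace. Note that both patterns drop the space between sentences, whereas the desired output in the question keeps it.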

If you want to use the split method to get "amazon.com" out of your string, you could try .*(?=amazon.com)|(?<=amazon.com).*

Answer 2

A non-regex option is to use nltk.sent_tokenize():

>>> import nltk
>>> j = 'you can get it cheaper than $20.99. shop at amazon.com. hurry before prices go up.'
>>> nltk.sent_tokenize(j)
['you can get it cheaper than $20.99.', 'shop at amazon.com.', 'hurry before prices go up.']

How to split a string on a delimiter but exclude other strings

2 answers: