
I have a string of text that I want to split into individual sentences. The text has many subtleties that make it hard to split reliably, and I cannot use the nltk library. My current regex does the best job of everything I have tried, but it misses sentences that start on a new line (that is, a new paragraph). Is there an easy way to modify the expression so that it also splits on a newline?

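import re
# Read the whole file; the with-block closes it automatically.
with open('data.txt', 'r') as file:
    text = file.read()
# Split on whitespace that follows '.' or '?', skipping abbreviations such as "e.g." or "Mr."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)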


Essentially, I need to modify the expression so that it also splits wherever there is a newline.
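One possible way to do this, offered as a minimal sketch rather than a definitive answer: add a second alternative to the pattern that matches a newline (with any surrounding whitespace), and let the original branch consume a whole run of whitespace with `\s+` so that a blank line after a period does not produce an empty entry. The function name `split_sentences` and the sample text are assumptions made for illustration; the lookbehinds are the ones from the question.

```python
import re

# Hypothetical helper name; the lookbehinds are unchanged from the question.
SENTENCE_BREAK = re.compile(
    r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?])\s+'  # original rule, whitespace run after '.' or '?'
    r'|\s*\n\s*'                                   # added rule: any newline, even without punctuation
)

def split_sentences(text):
    """Split text on sentence-ending punctuation and on every newline."""
    parts = SENTENCE_BREAK.split(text)
    # Drop empty pieces, e.g. from a leading or trailing newline.
    return [part for part in parts if part]

# Assumed sample input, not taken from the original data.txt.
sample = "Mr. Smith went home. He slept\nNew paragraph here. Is it done?\n\nYes."
for sentence in split_sentences(sample):
    print(sentence)
```

With this pattern a line break always ends a sentence, so an abbreviation that happens to sit at the end of a line (for example "Mr." followed by a newline) would also be split there; whether that is acceptable depends on how the text is wrapped.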

Need to modify my regexp to split string at every new line
