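To address one of the comments: my overall goal is to understand how to write a regex that uses a word boundary inside a positive or negative lookbehind, since a lookbehind apparently cannot contain quantifiers.
For my specific case, I want to check whether the word before a period ('.') is a capitalized word. I can see two different ways to approach this:
1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, I get an error about the zero-width positive lookbehind (it needs a fixed width), so I can't use the '+' quantifier, as in: (?<=[^A-Z][a-z]+)
2) A negative lookbehind asserting that the word before the '.' starts with an uppercase letter, like this: (?<![A-Z][a-z])
I would prefer to keep improving option 1, since it makes more sense to me, but I'm open to other suggestions. Can I use a word boundary here?
I'm using this to eventually split a paragraph into its sentences, and I'd like to stick with regex rather than use nltk. The main problem is handling abbreviated names and initials.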
CURRENT REGEX:
(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)
INPUT:
Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.
DESIRED OUTPUT:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
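For reference, here is a minimal sketch of the two problems described above, assuming Python's built-in re module (the one the answers below use). The variable names are only for illustration. It shows that option 1 fails to compile, and that the current regex compiles but also splits after "Jon.", because its lookbehind only inspects the two characters before the period.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Option 1: a quantifier inside a lookbehind is rejected by Python's re module.
try:
    re.compile(r'(?<=[^A-Z][a-z]+)\.')
    option1_error = None
except re.error as exc:
    option1_error = str(exc)
print(option1_error)

# The current regex compiles because its lookbehind is exactly two characters wide,
# but "on" in "Jon." satisfies [^A-Z][a-z], so the abbreviation is treated as a sentence end.
current = re.compile(r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)')
parts = current.split(text)
print(parts)
```

The initial "C." is not split, since "C" fails the [a-z] test; only multi-letter abbreviations such as "Jon." slip through.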
Answer 0 (score: 3)
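I suggest the pattern (\b[a-z]+\.\s*(?!$)) for your specific case. It simplifies your regex a lot, and you don't need a lookbehind at all, since those come with many restrictions (they have to be fixed-width, and so on).
Code:
import re
# text holds the INPUT paragraph from the question.
# Pass the flag by keyword: the fourth positional argument of re.sub is count, not flags.
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))
Output:
Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
Regex details:
( # first capture group
\b # word boundary, so the whole word has to be lowercase
[a-z]+ # one or more lowercase letters a-z
\. # literal period
\s* # any whitespace after the period (included for cosmetic effect)
(?!$) # negative lookahead: don't insert a newline at the very end of the text (or of a line, with re.M)
)
The match is replaced with:
\1 # reference to the first capture group
\n # a newline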
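As a rough sketch of how this substitution could feed a list of sentences, assuming Python 3 and the INPUT paragraph from the question: the helper name split_sentences is made up for illustration, and \s* is moved outside the capture group (a small variation on the suggested pattern) so that no trailing space is kept on each line.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

def split_sentences(paragraph):
    """Hypothetical helper: mark sentence ends with newlines, then split on them."""
    # Only an all-lowercase word followed by a period counts as a sentence end,
    # so "Jon." and "C." are left alone. The whitespace after the period is
    # matched but not captured, which drops it from the result.
    marked = re.sub(r'(\b[a-z]+\.)\s*(?!$)', r'\1\n', paragraph.strip())
    return marked.split('\n')

sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

Note that a sentence ending in a capitalized word or a number would not be split by this pattern.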
Answer 1 (score: 1)
Try:
import re
mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)
If the text spans multiple lines, use the following instead:
lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)
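Both return one string per sentence. The output below has an extra break after "advanced", which appears to come from running the multi-line version on input with a line break at that point: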
['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']
Breakdown of '.+?\b(?![A-Z])\w+\.':
.+? #as few characters as possible after the end of the previous match, so that each sentence becomes its own match
\b #word boundary
(?![A-Z]) #negative lookahead => the word after \b must not start with [A-Z] => skip capitalized words
\w+ #the whole word
\. #followed by a dot
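A minimal sketch of the single-line version in use, assuming Python 3. The strip() call is an addition for illustration, since every match after the first starts with the space that followed the previous period.

```python
import re

mystr = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

# Each match runs lazily up to the first non-capitalized word followed by a dot.
raw_matches = re.findall(r'.+?\b(?![A-Z])\w+\.', mystr)

# Remove the leading space carried over from the gap between sentences.
sentences = [match.strip() for match in raw_matches]

for sentence in sentences:
    print(sentence)
```

Because \w also matches digits and underscores, a sentence ending in a number such as "in 1999." would be treated as a sentence end here as well.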
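Answer 2 (score: 0)
If you want to build a list of sentences, here is another option:
import re
# Split just before the last word of each sentence; the capture group keeps that word in the result
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed in Python 3 so that temp can be indexed below
temp = list(filter(bool, temp))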
['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']
# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]
['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
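The two steps above can be folded into one function. This is only a sketch, assuming Python 3; the name split_by_last_word is hypothetical, and pairing the pieces with zip replaces the index arithmetic without changing the idea.

```python
import re

text = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

def split_by_last_word(paragraph):
    """Hypothetical helper: split before each lowercase last word, then rejoin pairs."""
    # The capture group keeps the delimiter (" word.") in the result list;
    # empty strings, such as the one after the final period, are dropped.
    pieces = [piece for piece in re.split(r'( [a-z]+\.)', paragraph) if piece]
    # Even positions hold the sentence bodies, odd positions the last words.
    return [(body + last_word).strip()
            for body, last_word in zip(pieces[0::2], pieces[1::2])]

sentences = split_by_last_word(text)
print(sentences)
```

This pairing assumes the paragraph ends with a sentence that the pattern matches. Any unmatched trailing text would be dropped by zip.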