Question

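To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative lookbehind, since a lookbehind apparently cannot contain quantifiers.

For my specific case, I want to check that the word before a period ('.') is not a capitalized word. I can see two different ways to approach this:

1) A positive lookbehind asserting that the word before the '.' is all lowercase. However, I get an error that the lookbehind must be fixed-width, so I can't use the quantifier '+' like so: (?<=[^A-Z][a-z]+)

2) A negative lookbehind asserting that the word before the '.' does not start with a capital letter, like so: (?<![A-Z][a-z])

I would prefer to keep improving option 1, since it makes more sense to me, but I am open to other suggestions. Can I use a word boundary here?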
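To make the restriction concrete, here is a minimal sketch of how Python's re module treats the two kinds of lookbehind. The helper try_compile and the fixed word length {3} are made up for illustration only. A word boundary \b consumes no characters, so it is allowed inside a lookbehind as long as everything else in it has a fixed width.

```python
import re

def try_compile(pattern):
    """Return None if the pattern compiles, otherwise the re.error message."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)

# '+' gives the lookbehind a variable width, which the re module rejects.
variable_width = r'(?<=[^A-Z][a-z]+)\.'
# \b consumes no characters and {3} is a fixed count, so this one compiles.
fixed_width = r'(?<=\b[a-z]{3})\.'

variable_error = try_compile(variable_width)
fixed_error = try_compile(fixed_width)
print(variable_error)
print(fixed_error)

# Only the period after the three-letter lowercase word "bus" is matched.
hits = [m.start() for m in re.finditer(fixed_width, "the bus. Jon. C. it.")]
print(hits)
```

The fixed count only covers words of exactly that length, which is why a lookbehind is awkward for this task.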

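Ultimately I am using this to split paragraphs into their sentences, and I would like to stick with regex rather than use nltk. The main problem is handling initials or abbreviations of names.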

CURRENT REGEX:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

INPUT:

Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.

DESIRED OUTPUT:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
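As a quick diagnostic, the following sketch lists the word that precedes every period matched by the CURRENT REGEX on the INPUT above. The function name words_before_matches is hypothetical. It is only meant to show where the two-character lookbehind goes wrong, not to propose a fix.

```python
import re

TEXT = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

CURRENT = re.compile(r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)')

def words_before_matches(pattern, text):
    """Collect the word that ends right before each matched period."""
    words = []
    for match in pattern.finditer(text):
        # Everything up to the period; the last run of word characters is the word.
        words.append(re.search(r'\w+$', text[:match.start()]).group())
    return words

words = words_before_matches(CURRENT, TEXT)
print(words)
```

The lookbehind only inspects the last two characters, so "Jon." passes the test because "on" is lowercase, while "C." is correctly skipped.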

Answer 1

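For your specific case I suggest print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M)). Note that re.M is passed by keyword, because the fourth positional argument of re.sub is count, not flags. This simplifies your regex a lot, and you don't need a lookbehind, which comes with many restrictions (it must be fixed-width, and so on).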

Code

import re

# text holds the INPUT paragraph from the question
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex details

(         # first capture group
\b        # word boundary
[a-z]+    # one or more lowercase letters a-z
\.        # literal period
\s*       # any following whitespace (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline at the end of the text (or of a line, with re.M)
)

This pattern is replaced with:

\1        # reference to the first capture group
\n        # a newline
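The substitution from this answer can be wrapped into a small runnable sketch. The function name split_sentences and the final strip() are additions for illustration. The flags are passed by keyword because the fourth positional parameter of re.sub is count.

```python
import re

TEXT = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

def split_sentences(paragraph):
    """Insert a newline after each lowercase word + period, then split on it."""
    broken = re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', paragraph, flags=re.M)
    # The captured \s* leaves trailing whitespace on each line; strip it.
    return [line.strip() for line in broken.splitlines()]

sentences = split_sentences(TEXT)
for sentence in sentences:
    print(sentence)
```

"Jon." and "C." are left alone because \b[a-z]+ requires the whole word before the period to be lowercase.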

Answer 2

Try:

import re

mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)

If the input spans multiple lines, use the following instead:

lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)

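On the single-line mystr above, both patterns return the five sentences, each one after the first keeping its leading space. The list below is what the multiline version returns when the input has a line break after "advanced":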

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

Explanation of '.+?\b(?![A-Z])\w+\.':

.+?       #as few characters as possible after the end of the previous match, so that we get as many distinct sentences as possible
\b        #word boundary
(?![A-Z]) #negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       #the whole word
\.        #followed by a dot
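The two findall patterns from this answer can be compared side by side with the sketch below. The variable names and the wrapped input (a line break inserted after "advanced") are assumptions made for the comparison.

```python
import re

single_line = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)
# Same text, but broken onto a second line after "advanced".
wrapped = single_line.replace("advanced to", "advanced\nto")

simple = r'.+?\b(?![A-Z])\w+\.'
multiline = r'.+?(?:$|\b(?![A-Z])\w+\b\.)'

simple_single = re.findall(simple, single_line)
multi_single = re.findall(multiline, single_line, flags=re.M)
multi_wrapped = re.findall(multiline, wrapped, flags=re.M)

print(simple_single)
print(multi_wrapped)
```

Because '.' does not match a newline, the multiline pattern ends a match at each line end, so a sentence that wraps across lines comes back as two pieces.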


Answer 3

If you want to build a list of sentences, here is another option:

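import re

# Split into sentences (the last word of each sentence is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() is needed on Python 3, where filter() returns an iterator
temp = list(filter(bool, temp))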

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
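The split-and-rejoin steps above can be combined into one function, sketched here for Python 3. The name split_and_rejoin is hypothetical. Pairing the pieces with zip is a small variation on the index arithmetic, and it assumes every sentence ends in a space, a lowercase word and a period.

```python
import re

TEXT = (
    "Koehler rides the bus. Bowman was passed into the first grade; "
    "Koehler advanced to third grade. Jon. Williams walked down the road "
    "to school. Bowman decided to go fishing; Koehler did not. "
    "C. Robinson asked to go to recess, and the teacher said no."
)

def split_and_rejoin(text):
    """Split on ' word.' (kept via the capture group), then glue pairs back."""
    # The capture group makes re.split keep each ' word.' delimiter as its own piece.
    pieces = [piece for piece in re.split(r'( [a-z]+\.)', text) if piece]
    # Even indexes hold sentence bodies, odd indexes hold the final ' word.'.
    return [(body + last).strip() for body, last in zip(pieces[0::2], pieces[1::2])]

sentences = split_and_rejoin(TEXT)
for sentence in sentences:
    print(sentence)
```

If the text ended with a fragment that has no closing lowercase word and period, zip would silently drop it, so this sketch is only suitable for input shaped like the example.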

Split a string in Python using a regex with a positive lookbehind

3 answers: