Splitting a string in Python using a regex with a positive lookbehind

Time: 2017-09-18 03:43:39

Tags: python regex string regex-lookarounds

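To address one of the comments: my overall goal is to understand how to write a regex that lets me use a word boundary inside a positive or negative lookbehind, since a lookbehind apparently cannot contain quantifiers.

So, for my specific case, I want to be able to check that the word before a period ('.') is not a capitalized word. I can see two different ways to approach this: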

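1) A positive lookbehind that checks the word before the '.' is all lowercase. However, I get an error because a lookbehind must be fixed width ("look-behind requires fixed-width pattern"), so I can't use the '+' quantifier like this: (?<=[^A-Z][a-z]+)

2) A negative lookbehind that checks whether the word before the '.' starts with a capital letter, like this: (?<![A-Z][a-z])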

I would prefer to keep refining option 1, since it makes more sense to me, but I'm open to other suggestions. Can I use a word boundary here?
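As a minimal sketch of the restriction described above: Python's re module rejects a lookbehind whose width can vary, but it accepts \b inside a lookbehind, because \b is itself zero-width. The helper name compiles and the {3} pattern are made up for illustration; the {3} variant only covers three-letter words, so it demonstrates what is allowed rather than solving the problem.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Option 1 from the question: '+' makes the lookbehind variable width.
variable_width = r'(?<=[^A-Z][a-z]+)\.'
# Option 2 from the question: always exactly two characters wide.
fixed_width = r'(?<![A-Z][a-z])\.'
# Hypothetical variant: \b adds no width, and {3} is a fixed repeat count.
with_boundary = r'(?<=\b[a-z]{3})\.'

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
print(compiles(with_boundary))   # True
```

So a word boundary is allowed inside a lookbehind; it is the open-ended quantifier that is not.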

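Ultimately I'm using this to split a paragraph into its sentences, and I'd like to stick with regex rather than use nltk. The problem is mainly in handling abbreviated names and initials.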

CURRENT REGEX:

(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)

INPUT:

Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no.

DESIRED OUTPUT:

Koehler rides the bus.
Bowman was passed into the first grade; Koehler advanced to third grade.
Jon. Williams walked down the road to school.
Bowman decided to go fishing; Koehler did not.
C. Robinson asked to go to recess, and the teacher said no.
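For reference, here is a sketch of how the current regex behaves on the input. Calling it through re.split is an assumption, since the question does not show the actual call. The two-character lookbehind accepts the "on" in "Jon", so the text is also split after the abbreviation, and re.split drops the matched periods.

```python
import re

text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
        "asked to go to recess, and the teacher said no.")

# The CURRENT REGEX from the question, unchanged.
current = r'(?<=[^A-Z][a-z])\.(?=\s[A-Z]+)'

pieces = re.split(current, text)
for piece in pieces:
    print(repr(piece))
```

The third piece comes out as ' Jon', which is the unwanted split; "C." is left alone because the character before the period is uppercase.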

3 Answers:

Answer 0: (score: 3)

I suggest the pattern (\b[a-z]+\.\s*(?!$)) for your specific case. It simplifies your regex a lot, and you don't need a lookbehind at all, since those come with many restrictions (they must be fixed width, and so on).

Code

import re

# text holds the INPUT paragraph from the question
print(re.sub(r'(\b[a-z]+\.\s*(?!$))', r'\1\n', text, flags=re.M))

Output

Koehler rides the bus. 
Bowman was passed into the first grade; Koehler advanced to third grade. 
Jon. Williams walked down the road to school. 
Bowman decided to go fishing; Koehler did not. 
C. Robinson asked to go to recess, and the teacher said no.

Regex details

(         # first capture group
\b        # word boundary
[a-z]+    # one or more lowercase letters a-z
\.        # literal period
\s*       # any following whitespace characters (added for cosmetic effect)
(?!$)     # negative lookahead - don't insert a newline at the end of the string (or of a line, with re.M)
)

This pattern is replaced with:

\1        # reference to the first capture group 
\n        # a newline
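If a list of sentences is more convenient than printed text, the same idea can be adapted. This is a sketch of a variation, not the answer's exact pattern: the capture group is narrowed to the word and its period so that the whitespace after it is dropped, and the result is split on the inserted newlines.

```python
import re

text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
        "asked to go to recess, and the teacher said no.")

# Keep the lowercase word and its period, discard the whitespace after it,
# and leave the final period (at the end of the string) untouched.
marked = re.sub(r'\b([a-z]+\.)\s*(?!$)', r'\1\n', text)
sentences = marked.split('\n')

for sentence in sentences:
    print(sentence)
```

On the question's input this yields the five desired sentences. Text that ends in trailing whitespace would need stripping first.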

Answer 1: (score: 1)

Try

import re

mystr="Koehler rides the bus. Bowman was passed into the first grade; Koehler advanced to third grade. Jon. Williams walked down the road to school. Bowman decided to go fishing; Koehler did not. C. Robinson asked to go to recess, and the teacher said no."
lst=re.findall(r'.+?\b(?![A-Z])\w+\.',mystr)

If the text spans multiple lines, use the following instead:

lst=re.findall(r'.+?(?:$|\b(?![A-Z])\w+\b\.)',mystr,re.M)

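On the single-line mystr above, both return one string per sentence. The output below appears to come from an input with a line break after "advanced", where the second pattern returns that sentence in two pieces: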

['Koehler rides the bus.', ' Bowman was passed into the first grade; Koehler advanced', 'to third grade.', ' Jon. Williams walked down the road to school.', ' Bowman decided to go fishing; Koehler did not.', ' C. Robinson asked to go to recess, and the teacher said no.']

Explanation of '.+?\b(?![A-Z])\w+\.':

.+?       # as few characters as possible after the end of the previous match, so we get as many distinct sentences as possible
\b        # word boundary
(?![A-Z]) # negative lookahead => don't follow \b with [A-Z] => skip capitalized words
\w+       # the whole word
\.        # followed by a dot
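The pattern explained above can be tried end to end as follows. This is a minimal sketch on the single-line input from the question; the strip() call is an addition that removes the leading space each match after the first carries.

```python
import re

mystr = ("Koehler rides the bus. Bowman was passed into the first grade; "
         "Koehler advanced to third grade. Jon. Williams walked down the road "
         "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
         "asked to go to recess, and the teacher said no.")

# A match ends at the first period that follows a word not starting with an
# uppercase letter, so "Jon." and "C." do not end a sentence.
raw = re.findall(r'.+?\b(?![A-Z])\w+\.', mystr)
sentences = [s.strip() for s in raw]

for sentence in sentences:
    print(sentence)
```

Because . does not match a newline by default, a sentence broken across lines would not be returned whole by this version.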

Test the regex here. Test the code here.

Answer 2: (score: 0)

If you want to build a list of sentences, here is another option:

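import re

# Split into sentences (the last word of each sentence is split off too)
temp = re.split(r'( [a-z]+\.)', text)
# Drop empty strings; list() keeps the result indexable on Python 3
temp = list(filter(bool, temp))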

['Koehler rides the', ' bus.', ' Bowman was passed into the first grade; Koehler advanced to third', ' grade.', ' Jon. Williams walked down the road to', ' school.', ' Bowman decided to go fishing; Koehler did', ' not.', ' C. Robinson asked to go to recess, and the teacher said', ' no.']

# Join the pieces back together
sentences = [''.join([temp[i], temp[i + 1]]).strip() for i in range(0, len(temp), 2)]

['Koehler rides the bus.', 'Bowman was passed into the first grade; Koehler advanced to third grade.', 'Jon. Williams walked down the road to school.', 'Bowman decided to go fishing; Koehler did not.', 'C. Robinson asked to go to recess, and the teacher said no.']
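The split-and-rejoin approach above can also be written without index arithmetic. The sketch below is a variation, not the answer's exact code: it pairs each sentence body with the word that was split off using zip, and the names parts, body and tail are illustrative.

```python
import re

text = ("Koehler rides the bus. Bowman was passed into the first grade; "
        "Koehler advanced to third grade. Jon. Williams walked down the road "
        "to school. Bowman decided to go fishing; Koehler did not. C. Robinson "
        "asked to go to recess, and the teacher said no.")

# The capture group makes re.split keep each " word." delimiter in the result.
parts = [p for p in re.split(r'( [a-z]+\.)', text) if p]

# Even positions hold sentence bodies, odd positions the split-off last words.
sentences = [(body + tail).strip() for body, tail in zip(parts[0::2], parts[1::2])]

print(sentences)
```

Like the original, this assumes bodies and delimiters strictly alternate, which holds for this input but not for text that starts with a matching word or has two matches back to back.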