使用包含缩写的正则表达式在Python中拆分段落

时间:2011-08-10 20:16:13

标签: python regex split

尝试在包含3个字符串和缩写的段落上使用此功能。

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multile characters

    #   regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

删除下一个缩写句子的第一个字符,

O/p Recieved:
 While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
 more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
 is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.

因此,字符串只被调整到2个字符串,下一个句子的第一个字符被消除了。也可以看到一些奇怪的字符,我猜python不能转换泛音。

如果我将正则表达式更改为[.!?][\s]{1,2}

While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s

因此即使缩写也会被分割。

1 个答案:

答案 0 :(得分:2)

你想要的正则表达式是:

[.!?][\s]{1,2}(?=[A-Z])

你想要一个肯定的先行断言,这意味着如果它后面跟一个大写字母但不匹配大写字母,你想要匹配该模式。

只有第一个匹配的原因是你在第二个时期之后没有空格。