What is the Python way of doing a \G anchored parsing loop?

Time: 2015-12-06 19:35:42

Tags: python regex

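Below is a Perl function I wrote many years ago. It is a smart tokenizer that recognizes some cases where things have been stuck together that probably should not be. For example, given the input on the left, it splits the string as shown on the right: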

abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def    (e.g. CamelCase)
ABCDef  -> ABC|Def    
1stabc  -> 1st|abc    (recognize valid ordinals)
1ndabc  -> 1|ndabc    (but not invalid ordinals)
11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc

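I am now doing some machine learning experiments, and I would like to run some of them with this tokenizer. But first, I need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, something that I hear does not exist in Python. I have tried googling for how this is done in Python, but I am not sure what exactly to search for, so I am having trouble finding an answer.

How would you write this function in Python?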

sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters, 
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character 
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not second
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not third
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}

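I did try using re.split() with a variation of the above. However, split() refuses to split on a zero-width match (a capability that ought to be available to someone who really knows what they are doing).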
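For readers on a newer interpreter: to my knowledge, Python 3.7 and later do let re.split() split on zero-width matches, so simple boundaries can be written as pure lookarounds. The following is a minimal sketch under that assumption. BOUNDARY and split_tokens are illustrative names, and the boundary set is deliberately reduced (no ordinal rules), so it is not a replacement for the full tokenizer.

```python
import re

# Reduced, hypothetical boundary set for illustration only (no ordinal handling).
BOUNDARY = re.compile(r'''
      (?<=[a-z])(?=[A-Z])        # lower followed by upper
    | (?<=[a-zA-Z])(?=[0-9])     # letter followed by digit
    | (?<=[0-9])(?=[a-zA-Z])     # digit followed by letter
''', re.X)

def split_tokens(s):
    # Assumes Python 3.7+; earlier versions did not split on empty matches.
    return BOUNDARY.split(s)

for sample in ('abc123', 'AbcDef', '123abc'):
    print(sample, '->', '|'.join(split_tokens(sample)))
```

Every alternative here is zero-width, so nothing is consumed and the pieces always join back into the original string.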

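I did find a solution for this specific problem, but not for the general problem of "how do I do \G based parsing" - I have some sample code that runs regular expressions anchored with \G inside a loop, and then in the loop body uses another match anchored at \G to decide which way to proceed with the parse. So I am still looking for an answer.

That said, here is my final working code that translates the above into Python: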

import re

IsA                 = lambda s: '['  + s + ']'
IsNotA              = lambda s: '[^' + s + ']'

Upper               = IsA( 'A-Z' )
Lower               = IsA( 'a-z' )
Letter              = IsA( 'a-zA-Z' )
Digit               = IsA( '0-9' )
AlphaNumeric        = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric     = IsNotA( 'a-zA-Z0-9' ) 

EndOfString         = '$'
OR                  = '|'

ZeroOrMore          = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore           = lambda s: s + '+'
OneOrMoreNonGreedy  = lambda s: s + '+?'

StartsWith          = lambda s: '^' + s
Capture             = lambda s: '('    + s + ')'
PreceededBy         = lambda s: '(?<=' + s + ')'
FollowedBy          = lambda s: '(?='  + s + ')'
NotFollowedBy       = lambda s: '(?!'  + s + ')'
StopWhen            = lambda s: s
CaseInsensitive     = lambda s: '(?i:' + s + ')'

ST                  = '(?:st|ST)'
ND                  = '(?:nd|ND)'
RD                  = '(?:rd|RD)'
TH                  = '(?:th|TH)'

def OneOf( *args ):
  return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + \
  OneOf( 
    # ABC | !!! - break at whitespace or non-alpha-numeric boundary
    PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
    PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

    # ABC | Abc - break at what looks like the start of a word or sentence
    FollowedBy( Upper + Lower ),

    # abc | ABC - break when a lower-case letter is followed by an upper case
    PreceededBy( Lower )  + FollowedBy( Upper ),

    # abc | 123 - break between words and digits
    PreceededBy( Letter ) + FollowedBy( Digit ),

    # 1st | oak - recognize when the string starts with an ordinal
    PreceededBy( StartsWith( '1' + ST ) ),
    PreceededBy( StartsWith( '2' + ND ) ),
    PreceededBy( StartsWith( '3' + RD ) ),

    # 1st | abc - contains an ordinal
    PreceededBy( IsNotA( '1' ) + '1' + ST ),
    PreceededBy( IsNotA( '1' ) + '2' + ND ),
    PreceededBy( IsNotA( '1' ) + '3' + RD ),
    PreceededBy( '1' + IsA( '123' )  + TH ),
    PreceededBy( IsA( '04-9' )       + TH ),

    # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
    PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
    PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

    # abcde | $ - end of the string
    FollowedBy( EndOfString )
  )

matcher = re.compile( pattern )

def tokenize( s ):
  return matcher.findall( s )

1 Answer:

Answer 0 (score: 2)

Emulating \G at the beginning of a regex with re.RegexObject.match

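You can emulate the effect of \G at the beginning of a regex with the re module by keeping track of the starting position yourself and passing it to re.RegexObject.match, which forces the match to begin at the position specified in pos.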

def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although index != m.end() checks for a zero-length match, it is more of
    # a guard against an accidental infinite loop.
    # Don't expect a regex that can match the empty string to work.
    # See the Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o
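The same pos argument also covers the more general case raised in the question, where the loop body tries further \G-anchored matches to decide how to continue. The following is a minimal sketch, not taken from the original post: the three token patterns and the names scan and PATTERNS are hypothetical, chosen only to show several compiled patterns being tried at one shared position.

```python
import re

# Hypothetical token classes for illustration; each one consumes at least one
# character, so the loop always makes progress.
PATTERNS = (
    ('number', re.compile(r'[0-9]+')),
    ('word',   re.compile(r'[A-Za-z]+')),
    ('other',  re.compile(r'[^A-Za-z0-9]+')),
)

def scan(text):
    pos = 0          # plays the role of Perl's pos() / the \G anchor
    tokens = []
    while pos < len(text):
        for kind, pattern in PATTERNS:
            # match(text, pos) only succeeds if the match starts exactly at pos
            m = pattern.match(text, pos)
            if m:
                tokens.append((kind, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError('no pattern matches at position %d' % pos)
    return tokens

print(scan('abc123!!'))
```

Because each branch sees the same pos, the body can dispatch on whichever pattern matches first, which is roughly what chained m/\G.../gc tests do in Perl.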

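Caveat

A caveat of this method is that it does not play well with a regex whose main match can be an empty string, since Python has no facility to force the regex to retry the match while preventing a zero-length match.

As an example, re.findall(r'(.??)', 'abc') returns an array of 4 empty strings ['', '', '', ''], whereas in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th and 6th matches start at the same indices as the 1st, 3rd and 5th matches respectively. The additional matches in PCRE are found by retrying at the same index with a flag that prevents an empty-string match.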

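I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code could not have worked.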
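To make the caveat concrete, here is a small sketch that runs the same loop shape with a regex that is allowed to match the empty string. The helper name tokenize_with and the pattern may_be_empty are made up for this illustration.

```python
import re

def tokenize_with(matcher, w):
    # Same loop shape as tokenize() above, parameterized on the regex and
    # also returning the index at which the loop stopped.
    index = 0
    out = []
    m = matcher.match(w, index)
    while m and index != m.end():
        out.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return out, index

# Hypothetical pattern that can match the empty string.
may_be_empty = re.compile(r'([a-z]*)')

tokens, stopped_at = tokenize_with(may_be_empty, 'abc123')
print(tokens, stopped_at)
```

At index 3 the pattern matches the empty string, so the guard ends the loop and '123' is left unparsed rather than the loop spinning forever.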

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) as (.+?), as done in the question, avoids this issue, though you may want to use the re.S flag.
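The reason re.S may matter is that, by default, the dot does not match a newline, so a (.+?) token cannot extend across one. A minimal illustration (the sample text is made up):

```python
import re

text = 'ab\ncd'

# Without re.S the dot stops at the newline.
default_match = re.match(r'(.+)', text).group(1)

# With re.S (DOTALL) the dot also matches '\n'.
dotall_match = re.match(r'(.+)', text, re.S).group(1)

print(repr(default_match), repr(dotall_match))
```

Whether newlines should be swallowed into a token depends on the input you expect, so treat the flag as a choice rather than a requirement.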

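Other comments on the regex

Since the case-insensitive flag in Python affects the whole pattern, the case-insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if that is what your requirements call for.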
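The difference between the two rewrites only shows up on mixed-case suffixes. The sketch below compares them. It also includes a scoped-flag variant, which rests on the assumption (as far as I know) that Python 3.6 and later accept (?i:...) groups; that was not an option when this answer was written. The variable names are illustrative.

```python
import re

strict = re.compile(r'1(?:st|ST)')     # only all-lower or all-upper
per_char = re.compile(r'1[sS][tT]')    # any mix of case, like Perl's (?i:st)

# Assumption: scoped inline flags are available (Python 3.6+).
scoped = re.compile(r'1(?i:st)')

for suffix in ('st', 'ST', 'St', 'sT'):
    s = '1' + suffix
    print(s,
          bool(strict.fullmatch(s)),
          bool(per_char.fullmatch(s)),
          bool(scoped.fullmatch(s)))
```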

Since Python supports free-spacing mode via the re.X flag, you can write your regex much as you did in the Perl code:

import re

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not second
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not third
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)
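As a sanity check, the pattern above can be combined with the position-tracking loop and run against the examples listed in the question. The sketch repeats the regex (comments trimmed) only so that it is self-contained. The CASES table is copied from the question's examples, and the check at the end is an addition for illustration.

```python
import re

matcher = re.compile(r'''
    (.+?)
    (?:
        (?=[^a-zA-Z0-9])
    |   (?=[A-Z][a-z])
    |   (?<=[a-z])(?=[A-Z])
    |   (?<=[a-zA-Z])(?=[0-9])
    |   (?<=^1[sS][tT])
    |   (?<=[^1][1][sS][tT])
    |   (?<=^2[nN][dD])
    |   (?<=[^1]2[nN][dD])
    |   (?<=^3[rR][dD])
    |   (?<=[^1]3[rR][dD])
    |   (?<=1[123][tT][hH])
    |   (?<=[04-9][tT][hH])
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])
    |   (?=$)
    )
''', re.X)

def tokenize(w):
    # Emulate \G by restarting each match at the end of the previous one.
    index = 0
    out = []
    m = matcher.match(w, index)
    while m and index != m.end():
        out.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return out

# Input -> expected tokens, taken from the examples in the question.
CASES = {
    'abc123': ['abc', '123'],
    'abcABC': ['abc', 'ABC'],
    'ABC123': ['ABC', '123'],
    '123abc': ['123', 'abc'],
    '123ABC': ['123', 'ABC'],
    'AbcDef': ['Abc', 'Def'],
    'ABCDef': ['ABC', 'Def'],
    '1stabc': ['1st', 'abc'],
    '1ndabc': ['1', 'ndabc'],
    '11thabc': ['11th', 'abc'],
    '11stabc': ['11', 'stabc'],
}

for text, expected in CASES.items():
    print(text, '->', '|'.join(tokenize(text)), tokenize(text) == expected)
```

Note that ^ inside the lookbehinds still refers to the real start of the string, not to pos, which is why the "starts with an ordinal" branches keep working inside the loop.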