What is the Python way to do a \G-anchored parsing loop?

Date: 2015-12-06 19:35:42

Tags: python regex


abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def    (e.g. CamelCase)
ABCDef  -> ABC|Def    
1stabc  -> 1st|abc    (recognize valid ordinals)
1ndabc  -> 1|ndabc    (but not invalid ordinals)
11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc

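I'm currently doing some machine learning experiments, and I'd like to run some of them with this tokenizer. But first I need to port it from Perl to Python. The key to this code is a loop that uses the \G anchor, which I've heard does not exist in Python. I've tried googling how to do this in Python, but I'm not sure exactly what to search for, so I'm having trouble finding an answer.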


sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters,
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not second
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not third
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}


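I did find a solution to this particular problem, but not to the general question of "how do I do \G-based parsing". I have other sample code that runs a regex in a \G-anchored loop and then, in the loop body, uses another \G-anchored match to decide which way to continue parsing. So I'm still looking for an answer.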


import re

IsA                 = lambda s: '['  + s + ']'
IsNotA              = lambda s: '[^' + s + ']'

Upper               = IsA( 'A-Z' )
Lower               = IsA( 'a-z' )
Letter              = IsA( 'a-zA-Z' )
Digit               = IsA( '0-9' )
AlphaNumeric        = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric     = IsNotA( 'a-zA-Z0-9' ) 

EndOfString         = '$'
OR                  = '|'

ZeroOrMore          = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore           = lambda s: s + '+'
OneOrMoreNonGreedy  = lambda s: s + '+?'

StartsWith          = lambda s: '^' + s
Capture             = lambda s: '('    + s + ')'
PreceededBy         = lambda s: '(?<=' + s + ')'
FollowedBy          = lambda s: '(?='  + s + ')'
NotFollowedBy       = lambda s: '(?!'  + s + ')'
StopWhen            = lambda s: s
CaseInsensitive     = lambda s: '(?i:' + s + ')'

ST                  = '(?:st|ST)'
ND                  = '(?:nd|ND)'
RD                  = '(?:rd|RD)'
TH                  = '(?:th|TH)'

def OneOf( *args ):
  return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + OneOf(
    # ABC | !!! - break at whitespace or non-alpha-numeric boundary
    PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
    PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

    # ABC | Abc - break at what looks like the start of a word or sentence
    FollowedBy( Upper + Lower ),

    # abc | ABC - break when a lower-case letter is followed by an upper case
    PreceededBy( Lower )  + FollowedBy( Upper ),

    # abc | 123 - break between words and digits
    PreceededBy( Letter ) + FollowedBy( Digit ),

    # 1st | oak - recognize when the string starts with an ordinal
    PreceededBy( StartsWith( '1' + ST ) ),
    PreceededBy( StartsWith( '2' + ND ) ),
    PreceededBy( StartsWith( '3' + RD ) ),

    # 1st | abc - contains an ordinal
    PreceededBy( IsNotA( '1' ) + '1' + ST ),
    PreceededBy( IsNotA( '1' ) + '2' + ND ),
    PreceededBy( IsNotA( '1' ) + '3' + RD ),
    PreceededBy( '1' + IsA( '123' )  + TH ),
    PreceededBy( IsA( '04-9' )       + TH ),

    # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
    PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
    PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

    # abcde | $ - end of the string
    FollowedBy( EndOfString )
)

matcher = re.compile( pattern )

def tokenize( s ):
  return matcher.findall( s )
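
One difference to keep in mind when a \G loop is replaced by findall: findall searches forward from where the previous match ended, so it can silently skip text that does not match, whereas a \G-anchored match has to succeed exactly at the current position. The following is a minimal sketch of that difference. The DIGITS pattern and the two helper functions are made up for illustration and are not part of the tokenizer above.

```python
import re

DIGITS = re.compile(r'\d+')

def unanchored(text):
    # findall searches forward and skips over anything that does not match.
    return DIGITS.findall(text)

def anchored(text):
    # match(text, pos) must succeed exactly at pos, like a leading \G in Perl.
    tokens, pos = [], 0
    while pos < len(text):
        m = DIGITS.match(text, pos)
        if m is None:
            break              # stop at the first position that does not match
        tokens.append(m.group())
        pos = m.end()
    return tokens, pos

print(unanchored('12ab34'))    # letters are skipped
print(anchored('12ab34'))      # scanning stops at the letters
```

For the tokenizer pattern above this should not matter in practice, since it starts with (.+?) and therefore accepts any character at the current position.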

1 answer:

Answer 0 (score: 2)




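def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although the index != m.end() check catches zero-length matches, it is
    # more of a guard against an accidental infinite loop.
    # Don't expect a regex which can match the empty string to work.
    # See the Caveat section.
    while m and index != m.end():
        # The boundary alternatives are zero-width, so the whole match is the token.
        o.append(m.group())
        index = m.end()
        # Passing index as pos forces the next match to start there, like \G.
        m = matcher.match(w, index)
    return o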
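
For the more general case raised in the question, where the loop body itself runs a second match that continues from wherever the first one stopped, the same idea appears to carry over: keep a single position variable and pass it to every match call. The sketch below is only an illustration under that assumption. The WORD and ASSIGN patterns, the parse function, and the toy "name = 123" input format are invented for this example and do not come from the Perl code.

```python
import re

WORD = re.compile(r'\s*([A-Za-z_]\w*)')   # hypothetical: an identifier
ASSIGN = re.compile(r'\s*=\s*(\d+)')      # hypothetical: "= <number>"

def parse(text):
    """Parse items of the form 'name' or 'name = 123', left to right."""
    pos = 0                  # plays the role of Perl's pos($value)
    items = []
    while True:
        m = WORD.match(text, pos)          # like m/\G\s*(\w+)/gc
        if m is None:
            break
        pos = m.end()
        a = ASSIGN.match(text, pos)        # second \G-style match in the body
        if a is not None:
            items.append((m.group(1), int(a.group(1))))
            pos = a.end()                  # continue after the assignment
        else:
            items.append((m.group(1), None))
            # pos is left untouched, much like a failed /gc match in Perl
    return items, pos

print(parse('a = 1 b c=22'))
```

Returning the final position alongside the items lets the caller check whether the whole input was consumed.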



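Caveat: for example, re.findall(r'(.??)', 'abc') returns an array of 4 empty strings ['', '', '', ''], whereas in PCRE you can find 7 matches ['', 'a', '', 'b', '', 'c', ''], where the 2nd, 4th, and 6th matches start at the same indices as the 1st, 3rd, and 5th matches respectively. PCRE finds the additional matches by retrying at the same index with a flag that prevents an empty-string match. Note that this describes the re module at the time of writing; since Python 3.7, non-empty matches can start just after a previous empty match, so newer versions behave differently here.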





Since Python supports free-spacing mode with the re.X flag, you can write the regex much like the one in the Perl code:

matcher = re.compile(r'''
    (.+?)             # the token: shortest run of characters up to a boundary
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not second
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not third
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)
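
The ^ anchors and lookbehinds in this pattern depend on one detail of the pos argument: passing a start position is not the same as slicing the string. With pos, ^ still refers to the real beginning of the string and a lookbehind can still see the characters before pos. A small sketch of the difference follows; the after_ordinal pattern is a cut-down, made-up example and not the tokenizer regex itself.

```python
import re

text = '1stabc'
# Hypothetical pattern: 'abc' directly after an ordinal at the very start.
after_ordinal = re.compile(r'(?<=^1st)abc')

# Passing pos keeps the whole string visible to ^ and to the lookbehind.
with_pos = after_ordinal.match(text, 3)

# Slicing throws the left context away, so the lookbehind cannot succeed.
with_slice = after_ordinal.match(text[3:])

print(with_pos is not None, with_slice is not None)
```

This is why the loop should advance an index and pass it to match rather than repeatedly chopping off the front of the string.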