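Adding named groups to the pyparsing regex inverter

Time: 2012-06-21 18:41:34

Tags: pyparsing

Hopefully Paul McGuire will spot this and rescue me...

I grabbed the 'regex inverter' example script: http://pyparsing.wikispaces.com/file/view/invRegex.py

I am trying to hack in support for Python named groups, e.g. (?P<blob_key>[a-zA-Z0-9-_=]+)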
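For readers unfamiliar with the syntax, here is a minimal sketch of what a named group does in Python's standard `re` module. It uses the test pattern that appears in the log further down; the sample path is made up for illustration.

```python
import re

# The test pattern from the question; blob_key is the named group.
pattern = re.compile(r"serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/")

# Hypothetical sample path, invented for this example.
match = pattern.match("serve_blob/GB/abc-123_XYZ=/")

blob_key = match.group("blob_key")  # look the group up by name
by_index = match.group(1)           # the same group is also group number 1
print(blob_key, match.groupdict())
```

Apart from the name, the group behaves like an ordinary parenthesised group, which is why an inverter only needs to recognise the `?P<name>` prefix and can otherwise treat the contents as a normal group.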

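I am new to pyparsing, and I realise that a regex parser is probably not the best way to learn it (I am just trying to get something practical done with the result).

I have edited the parser function as follows: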

def parser():
    global _parser
    if _parser is None:
        lbrack = Literal("[")
        rbrack = Literal("]")
        lbrace = Literal("{")
        rbrace = Literal("}")
        lparen = Literal("(")
        rparen = Literal(")")
        pyspec = Literal("?P")
        langle = Literal("<")
        rangle = Literal(">")

        reMacro = Combine("\\" + oneOf(list("dws")))
        escapedChar = ~reMacro + Combine("\\" + oneOf(list(printables)))
        reLiteralChar = "".join(c for c in printables if c not in r"\[]{}().*?+|")

        reRange = Combine(lbrack + SkipTo(rbrack,ignore=escapedChar) + rbrack)
        reLiteral = ( escapedChar | oneOf(list(reLiteralChar)) )
        reDot = Literal(".")
        repetition = (
            ( lbrace + Word(nums).setResultsName("count") + rbrace ) |
            ( lbrace + Word(nums).setResultsName("minCount")+","+ Word(nums).setResultsName("maxCount") + rbrace ) |
            oneOf(list("*+?")) 
            )

        reNamedGroup = Combine(lparen + pyspec + langle + SkipTo(rangle) + rangle
                               + SkipTo(rparen, include=True) + rparen)

        reNamedGroup.setParseAction(handleNamedGroup)
        reRange.setParseAction(handleRange)
        reLiteral.setParseAction(handleLiteral)
        reMacro.setParseAction(handleMacro)
        reDot.setParseAction(handleDot)

        reTerm = ( reLiteral | reNamedGroup | reRange | reMacro | reDot )
        reExpr = operatorPrecedence( reTerm,
            [
            (repetition, 1, opAssoc.LEFT, handleRepetition),
            (None, 2, opAssoc.LEFT, handleSequence),
            (Suppress('|'), 2, opAssoc.LEFT, handleAlternative),
            ]
        )
        _parser = reExpr

    return _parser

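When I run this against my test regex, reNamedGroup appears to find and process the named group correctly (I added some logging inside SkipTo and other methods...), yet it seems to play no part in the output, and my handleNamedGroup function is never called.

The log output is as follows: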

invert(r'serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/')
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 12
DEBUG:root: *** 15, A-Z
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 12
DEBUG:root: *** 15, A-Z
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 24
DEBUG:root: *** 32, blob_key
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 33
DEBUG:root: * 49, [')'], [a-zA-Z0-9-_=]+
DEBUG:root: ** ['[a-zA-Z0-9-_=]+', ')']
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 24
DEBUG:root: *** 32, blob_key
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 33
DEBUG:root: * 49, [')'], [a-zA-Z0-9-_=]+
DEBUG:root: ** ['[a-zA-Z0-9-_=]+', ')']
DEBUG:root: handleLiteral: ['s']
DEBUG:root: handleLiteral: ['e']
DEBUG:root: handleLiteral: ['r']
DEBUG:root: handleLiteral: ['v']
DEBUG:root: handleLiteral: ['e']
DEBUG:root: handleLiteral: ['_']
DEBUG:root: handleLiteral: ['b']
DEBUG:root: handleLiteral: ['l']
DEBUG:root: handleLiteral: ['o']
DEBUG:root: handleLiteral: ['b']
DEBUG:root: handleLiteral: ['/']
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 12
DEBUG:root: *** 15, A-Z
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 12
DEBUG:root: *** 15, A-Z
DEBUG:root: handleRange: ['[A-Z]']
DEBUG:root: handleRepetition: [[[ABCDEFGHIJKLMNOPQRSTUVWXYZ], '{', '2', '}']]
DEBUG:root: handleLiteral: ['/']
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 24
DEBUG:root: *** 32, blob_key
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 33
DEBUG:root: * 49, [')'], [a-zA-Z0-9-_=]+
DEBUG:root: ** ['[a-zA-Z0-9-_=]+', ')']
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 24
DEBUG:root: *** 32, blob_key
DEBUG:root: serve_blob/[A-Z]{2}/(?P<blob_key>[a-zA-Z0-9-_=]+)/, 33
DEBUG:root: * 49, [')'], [a-zA-Z0-9-_=]+
DEBUG:root: ** ['[a-zA-Z0-9-_=]+', ')']
DEBUG:root: handleSequence: [[Lit:s, Lit:e, Lit:r, Lit:v, Lit:e, Lit:_, Lit:b, Lit:l, Lit:o, Lit:b, Lit:/, <libs.exreg.exreg.GroupEmitter object at 0x34cfa30>, Lit:/]]

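The lines prefixed with ** are the values SkipTo returns as skipRes... they look correct to me. What I cannot work out is why they are being ignored.

I am keenly aware that I am just blindly copying and pasting things... I tried to carefully copy what works for reRange... but ranges work, and my similar bit does not.

My guess is that the surrounding parentheses might be 'hiding' the parsed named group from the output at a later stage of parsing, but I am at a loss as to how.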
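One detail in the log can be reproduced in isolation. The `**` lines show SkipTo returning both the skipped text and the `)`, which is what `include=True` does. The sketch below is not taken from the original post; it assumes only the standard behaviour of `SkipTo(..., include=True)` and suggests one possible reason the combined expression never completes: once SkipTo has consumed the `)`, the trailing `+ rparen` needs a second one.

```python
from pyparsing import Literal, ParseException, SkipTo

rparen = Literal(")")

# include=True makes SkipTo consume the ")" and return it as a second token.
skip_incl = SkipTo(rparen, include=True)
tokens = skip_incl.parseString("[a-z]+)/").asList()

# Same shape as the tail of reNamedGroup: SkipTo(..., include=True) + rparen.
tail = skip_incl + rparen
try:
    tail.parseString("[a-z]+)/")
    tail_matched = True
except ParseException:
    tail_matched = False  # no second ")" is left for rparen to match

# Without include=True, the ")" is left for the following rparen.
tail_ok = (SkipTo(rparen) + rparen).parseString("[a-z]+)/").asList()
print(tokens, tail_matched, tail_ok)
```

If this is what happens in the full parser, reNamedGroup fails after the SkipTo logging has already fired, so its parse action would never run.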

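1 answer:

Answer 0: (score: 1)

You don't want to do anything with the parens in your reNamedGroup expression. Note that the grammar defines no separate syntax for ordinary re groups enclosed in parens, yet they parse without trouble. In this parser, parens are handled as part of the operatorPrecedence expression. Just change the definition of reNamedGroup to: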

reNamedGroup = pyspec + langle + SkipTo(rangle) + rangle

and let operatorPrecedence handle all of the paren grouping.

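[Edit by OP]
The above change on its own works, but all of the output for named groups was prefixed with P?, so the pyspec part was somehow leaking into the output. In the end I did not need to rewrite it in stack form (see the comments); the following additional changes got it working: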

reTerm = ( reLiteral | reRange | reMacro | reDot )
reExpr = operatorPrecedence( reTerm,
    [
    (reNamedGroup.suppress(), 1, opAssoc.RIGHT, handleNamedGroup),
    (repetition, 1, opAssoc.LEFT, handleRepetition),
    (None, 2, opAssoc.LEFT, handleSequence),
    (Suppress('|'), 2, opAssoc.LEFT, handleAlternative),
    ]
)
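The effect of `.suppress()` on a prefix operator can be seen in a cut-down grammar. This is a sketch rather than the original script: `term` is a stand-in for reTerm, `show` is a hypothetical handler that only reports the tokens it receives, and `infixNotation` is used because, to my knowledge, that is the name under which later pyparsing releases expose `operatorPrecedence`.

```python
from pyparsing import Literal, SkipTo, Word, alphanums, infixNotation, opAssoc

# Stand-in operand for illustration; the real parser uses reTerm here.
term = Word(alphanums)

# Named-group prefix without parentheses, as suggested in the answer.
name_prefix = Literal("?P") + Literal("<") + SkipTo(Literal(">")) + Literal(">")

def show(tokens):
    # tokens[0] is the group of operator tokens followed by the operand.
    return "|".join(tokens[0].asList())

# Prefix used as-is: its tokens reach the handler together with the operand.
leaky = infixNotation(term, [(name_prefix, 1, opAssoc.RIGHT, show)])

# Prefix suppressed: only the operand reaches the handler.
clean = infixNotation(term, [(name_prefix.suppress(), 1, opAssoc.RIGHT, show)])

# The parentheses are consumed by infixNotation itself in both grammars.
leaky_out = leaky.parseString("(?P<key>abc)", parseAll=True)[0]
clean_out = clean.parseString("(?P<key>abc)", parseAll=True)[0]
print(leaky_out)
print(clean_out)
```

With the prefix suppressed, the handler no longer sees the group name. That is acceptable for inversion, which only needs the group's contents, but a handler that wants the name would have to keep that token instead of suppressing the whole prefix.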