In Pyparsing

Time: 2017-03-24 08:46:27

Tags: python parsing pyparsing

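I have some sentences that I need to convert into regex code, and I am trying to use Pyparsing for this. The sentences are essentially search rules that say what to search for.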

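Examples of the sentences -

  1. LINE_CONTAINS this is a phrase - an example search rule saying that the line being searched should contain the phrase this is a phrase

  2. LINE_STARTSWITH However we - an example search rule saying that the line being searched should start with the phrase However we

  3. Rules can also be combined, for example - LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we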
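To make the target concrete, here is a minimal sketch of the regex each directive might translate to. The helper names `line_contains`, `line_startswith` and `all_of` are hypothetical, and expressing AND as a chain of lookaheads is only one possible reading of the rules; BEFORE and AFTER are not covered.

```python
import re

def line_contains(phrase):
    # The phrase may appear anywhere in the line.
    return re.escape(phrase)

def line_startswith(phrase):
    # Anchor the phrase at the start of the line.
    return "^" + re.escape(phrase)

def all_of(*patterns):
    # AND (assumed semantics): one lookahead per sub-pattern,
    # all of them checked from the start of the line.
    return "^" + "".join("(?=.*?(?:%s))" % p for p in patterns)

rule = all_of(line_contains("phrase one"), line_startswith("However we"))

for line in ["However we saw phrase one twice",
             "We saw phrase one",
             "However we left"]:
    print(line, "->", bool(re.search(rule, line)))
```

Only the first sample line satisfies both sub-rules, so it is the only one expected to match.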

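Now, I am trying to parse these sentences and then convert them into regex code. Every line starts with one of the 2 symbols mentioned above (call them line_directives). I want to be able to take these line_directives into account, parse them appropriately, and do the same for the phrases that follow them, although those are parsed differently. With help from Paul McGuire (here) and my own input, I have the following code -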

    from pyparsing import *
    import re
    
    UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
    keyword = UPTO | WORDS | AND | OR
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()
    
    LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
    word = ~keyword + Word(alphas)
    phrase = Group(OneOrMore(word))
    upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
    
    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens
    
        def generate(self):
            pass
    
    class LiteralNode(Node):
        def generate(self):
            print (self.tokens[0], 20)
            for el in self.tokens[0]:
                print (el,type(el), 19)
            print (type(self.tokens[0]), 18)
            return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
        def __repr__(self):
            return repr(self.tokens[0])
    
    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex
    
        def __repr__(self):
            return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])
    
    
    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '|'.join(t.generate() for t in tokens[::2])
        def __repr__(self):
            return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])
    
    
    class UpToNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = tokens[0].generate()
            print (123123)
            word_re = r"\s+\S+"
            space_re = r"\s+"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
            print ret
            return ret
    
    
        def __repr__(self):
            tokens = self.tokens[0]
            ret = repr(tokens[0])
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
            return ret
    
    phrase_expr = infixNotation(phrase,
                                [
                                 ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                                 (AND, 2, opAssoc.LEFT,),
                                 (OR, 2, opAssoc.LEFT),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators
    line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                      Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
    line_contents_expr = infixNotation(line_term,
                                       [(AND, 2, opAssoc.LEFT,),
                                        (OR, 2, opAssoc.LEFT),
                                        ]
                                       ) # grammar for the entire rule/sentence
    
    phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
            [
            (upto_expr, 2, opAssoc.LEFT, UpToNode),
            (AND, 2, opAssoc.LEFT, AndNode),
            (OR, 2, opAssoc.LEFT, OrNode),
            ])
    
    tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()        
    for t in tests1:
        t = t.strip()
        if not t:
            continue
    #    print(t, 12)
        try:
            parsed = phrase_expr.parseString(t)
        except ParseException as pe:
            print(' '*pe.loc + '^')
            print(pe)
            continue
    print (parsed[0], 14)
    print (type(parsed[0]))
    print(parsed[0].generate(), 15)
    

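This simple code gives the following error when run -

    ((['LINE_CONTAINS',([(['overexpressing','gene'],{})],{})],   {'phrase':[(([(['overexpressing','gene'],{})],{}),1)],   'line_directive':[('LINE_CONTAINS',0)]}),14)
    ((['LINE_CONTAINS',([(['overexpressing','gene'],{})],{})],   {'phrase':[(([(['overexpressing','gene'],{})],{}),1)],   'line_directive':[('LINE_CONTAINS',0)]}),20)
    ('LINE_CONTAINS',<,19)
    (([(['overexpressing','gene'],{})],{}),,, 19)
    (,18)
    TypeError: sequence item 1: expected string, ParseResults found (line 29)

(The error output above is not entirely accurate, because the blockquote here does not support angle brackets.)

So my question is: even though I have written the grammar (using infixNotation) so that LINE_CONTAINS is treated as a line_directive and the rest of the line is parsed accordingly, why does it fail to parse properly? What would be a good way to parse such lines?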
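The message is what `str.join` raises when one of its items is not a string, and the debug prints tagged 20, 19 and 18 all come from `LiteralNode.generate`, so the failing call is most likely `' '.join(self.tokens[0])`: the group holds the string `'LINE_CONTAINS'` followed by a nested group for the phrase. One possible direction, sketched below for Python 3, is to walk the nesting instead of joining it directly. Plain nested lists stand in for the parse results here (pyparsing's `ParseResults.asList()` returns lists of this shape), and `flatten_words` and `line_term_to_regex` are hypothetical names, not part of pyparsing.

```python
import re

def flatten_words(tokens):
    """Yield every string found in an arbitrarily nested list of tokens."""
    for tok in tokens:
        if isinstance(tok, str):
            yield tok
        else:
            yield from flatten_words(tok)

def line_term_to_regex(term):
    """Turn [directive, nested phrase tokens...] into a regex string."""
    directive, phrase_tokens = term[0], term[1:]
    body = re.escape(" ".join(flatten_words(phrase_tokens)))
    if directive == "LINE_STARTSWITH":
        return "^" + body
    if directive == "LINE_ENDSWITH":
        return body + "$"
    return body  # LINE_CONTAINS: phrase anywhere in the line

# Same shape as the group printed in the question.
term = ["LINE_CONTAINS", [["overexpressing", "gene"]]]

try:
    " ".join(term)  # item 1 is a list, not a string
except TypeError as exc:
    print("join failed:", exc)

pattern = line_term_to_regex(term)
print(bool(re.search(pattern, "cells overexpressing gene X")))
```

In the grammar itself, the same idea would mean giving the phrase and the line directive their own parse actions, rather than attaching `LiteralNode` to the whole `line_contents_expr`.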

0 answers:

No answers