Question

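I have some sentences that I need to convert into regex code, and I am trying to use Pyparsing to do it. The sentences are basically search rules that tell us what to search for.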

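Examples of sentences -

LINE_CONTAINS this is a phrase - an example search rule saying that the line being searched should contain the phrase this is a phrase
LINE_STARTSWITH However we - an example search rule saying that the line being searched should start with the phrase However we
Rules can also be combined, e.g. - LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we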
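
To make the goal concrete, here is a minimal sketch of the regex I would expect for the two simple rules. The helper name directive_to_regex and the exact anchoring are my own assumptions for illustration, not something the grammar below produces yet.

```python
import re

# Hypothetical helper (not part of the pyparsing code in this question):
# turn one line directive plus a literal phrase into a regex string.
def directive_to_regex(directive, phrase):
    # Escape each word and allow any run of whitespace between words.
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    if directive == "LINE_CONTAINS":
        return body
    if directive == "LINE_STARTSWITH":
        return r"^\s*" + body
    if directive == "LINE_ENDSWITH":
        return body + r"\s*$"
    raise ValueError("unknown directive: %s" % directive)

contains_re = directive_to_regex("LINE_CONTAINS", "this is a phrase")
starts_re = directive_to_regex("LINE_STARTSWITH", "However we")
```

With these patterns, re.search(contains_re, line) should succeed for any line that contains the phrase, and re.search(starts_re, line) only for lines that begin with it.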

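Now I am trying to parse these sentences and then convert them into regex code. Every line starts with one of the 2 symbols mentioned above (call them line_directives). I want to be able to recognize these line_directives, parse them appropriately, and do the same for the phrases that follow them, although those are parsed differently. With help from Paul McGuire (here) and my own additions, I have the following code -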

from pyparsing import *
import re

UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()

LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
word = ~keyword + Word(alphas)
phrase = Group(OneOrMore(word))
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        print (self.tokens[0], 20)
        for el in self.tokens[0]:
            print (el,type(el), 19)
        print (type(self.tokens[0]), 18)
        return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])


class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])
    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])


class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        print (123123)
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        print(ret)
        return ret


    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

phrase_expr = infixNotation(phrase,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            ) # structure of a single phrase with its operators
line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
                                   [(AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   ) # grammar for the entire rule/sentence

phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
        [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
        ])

tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()        
for t in tests1:
    t = t.strip()
    if not t:
        continue
#    print(t, 12)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
print (parsed[0], 14)
print (type(parsed[0]))
print(parsed[0].generate(), 15)
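
For reference, the pattern that UpToNode.generate builds can be checked with re alone. This is only a sketch: the helper name upto, the two phrases, and the limit of 3 words are made-up values, and each phrase is wrapped in a group the same way LiteralNode.generate does.

```python
import re

# Same building blocks as in UpToNode.generate above.
word_re = r"\s+\S+"
space_re = r"\s+"

# Hypothetical stand-in for UpToNode.generate with two literal phrases.
def upto(left, right, numberofwords):
    gap = "((%s){0,%d}%s)" % (word_re, numberofwords, space_re)
    return "(%s)" % left + gap + "(%s)" % right

pattern = upto("phrase one", "phrase two", 3)
```

The resulting pattern should match when at most 3 words sit between the two phrases and fail when there are more.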

When run, the pyparsing script above prints the following and then fails with an error -

((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})], {'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 14)



((['LINE_CONTAINS', ([(['overexpressing', 'gene'], {})], {})], {'phrase': [(([(['overexpressing', 'gene'], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 20)

('LINE_CONTAINS', <, 19)

(([(['overexpressing', 'gene'], {})], {}), , 19)

(, 18)

TypeError: sequence item 1: expected string, ParseResults found (line 29)

(The error output above is not exactly right, because the blockquote here does not support angle brackets.)
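
The TypeError itself can be reproduced without pyparsing. The sketch below assumes that the tokens reaching LiteralNode.generate are a string followed by a nested group, as the printed output suggests; plain lists stand in for ParseResults, and the flatten helper is hypothetical.

```python
# Stand-in for self.tokens[0]: plain nested lists instead of ParseResults (assumption).
tokens = ["LINE_CONTAINS", [["overexpressing", "gene"]]]

# str.join only accepts strings, so the nested group at index 1 makes it fail.
try:
    " ".join(tokens)
    error = None
except TypeError as exc:
    error = exc

# Hypothetical helper: walk the nested structure and yield only the strings.
def flatten(items):
    for item in items:
        if isinstance(item, str):
            yield item
        else:
            for sub in flatten(item):
                yield sub

flat = " ".join(flatten(tokens))
```

Here the join on the raw tokens raises TypeError, while joining the flattened tokens gives a single string. Flattening only hides the symptom, though; it does not explain why the directive and its phrase reach LiteralNode as one group.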

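So my question is: even though I have written a grammar (using infixNotation) that treats LINE_CONTAINS as a line_directive and parses the rest of the line accordingly, why does it fail to parse properly? And what would be a good way to parse lines like these?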


0 answers: