Question

I have a bunch of sentences that need to be parsed and converted into the corresponding regex search code. Examples of my sentences -

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

- This means that somewhere in the line, phrase one comes before phrase2 and phrase3. In addition, the line must start with Therefore we

LINE_CONTAINS abc {upto 4 words} xyz {upto 3 words} pqr

- This means I need to allow up to 4 words between the first two phrases and up to 3 words between the last two phrases
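To make the second example concrete, here is a minimal sketch of the kind of regular expression such a rule might translate to. The helper name upto_words_regex and the exact pattern shape are assumptions for illustration; they are not part of the original question.

```python
import re

def upto_words_regex(phrases, gaps):
    """Join phrases, allowing at most gaps[i] extra words between
    phrases[i] and phrases[i + 1]. Hypothetical helper for illustration."""
    pattern = re.escape(phrases[0])
    for gap, phrase in zip(gaps, phrases[1:]):
        # (?:\s+\S+){0,N} matches up to N whitespace-separated words;
        # the trailing \s+ separates them from the next phrase.
        pattern += r"(?:\s+\S+){0,%d}\s+" % gap + re.escape(phrase)
    return pattern

pattern = upto_words_regex(["abc", "xyz", "pqr"], [4, 3])
print(pattern)

# Two words between abc and xyz, one between xyz and pqr: within the limits.
print(bool(re.search(pattern, "abc one two xyz three pqr")))
# Five words between abc and xyz: over the limit of four.
print(bool(re.search(pattern, "abc 1 2 3 4 5 xyz pqr")))
```

The sketch does not add word boundaries, so abc would also match at the end of a longer word; whether that matters depends on the data.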

With help from Paul McGuire (here), I wrote the following grammar -

from pyparsing import (CaselessKeyword, Word, alphanums, nums, MatchFirst, quotedString, 
    infixNotation, Combine, opAssoc, Suppress, pyparsing_common, Group, OneOrMore, ZeroOrMore)

LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())

NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

lpar=Suppress('{') 
rpar=Suppress('}')

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
                      BEFORE, AFTER, JOIN]) # declaring all keywords and assigning order for all further use

phrase_word = ~keyword + (Word(alphanums + '_'))

upto_N_words = Group(lpar + 'upto' + pyparsing_common.integer('numberofwords') + 'words' + rpar)

phrase_term = Group(OneOrMore(phrase_word) + ZeroOrMore((upto_N_words) + OneOrMore(phrase_word)))



phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
                             (NOT, 1, opAssoc.RIGHT,),
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            ) # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase
line_contents_expr = infixNotation(line_term,
                                   [(NOT, 1, opAssoc.RIGHT,),
                                    (AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   ) # grammar for the entire rule/sentence

sample1 = """
LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we
"""
sample2 = """
LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else
"""

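My question now is - how do I access the parsed elements in order to convert the sentences into my regex code? For this I tried the following -

import pprint

parsed = line_contents_expr.parseString(sample1)  # and likewise for sample2
print(parsed[0].asDict())
print(parsed)
pprint.pprint(parsed)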

The result of the above code for sample1 is -

{}

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]

([([(['LINE_CONTAINS', ([([(['phrase', 'one'], {}), 'BEFORE', ([(['phrase2'], {}), 'AND', (['phrase3'], {})], {})], {})], {})], {'phrase': [(([([(['phrase', 'one'], {}), 'BEFORE', ([(['phrase2'], {}), 'AND', (['phrase3'], {})], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]}), 'AND', (['LINE_STARTSWITH', ([(['Therefore', 'we'], {})], {})], {'phrase': [(([(['Therefore', 'we'], {})], {}), 1)], 'line_directive': [('LINE_STARTSWITH', 0)]})], {})], {})

The result of the above code for sample2 is -

{'phrase': [[['abcd', {'numberofwords': 4}, 'xyzw', {'numberofwords': 3}, 'pqrs'], 'BEFORE', ['something', 'else']]], 'line_directive': 'LINE_CONTAINS'}

[['LINE_CONTAINS', [[['abcd', ['upto', 4, 'words'], 'xyzw', ['upto', 3, 'words'], 'pqrs'], 'BEFORE', ['something', 'else']]]]]

([(['LINE_CONTAINS', ([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {})], {'phrase': [(([([(['abcd', (['upto', 4, 'words'], {'numberofwords': [(4, 1)]}), 'xyzw', (['upto', 3, 'words'], {'numberofwords': [(3, 1)]}), 'pqrs'], {}), 'BEFORE', (['something', 'else'], {})], {})], {}), 1)], 'line_directive': [('LINE_CONTAINS', 0)]})], {})

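My questions based on the above output are -

Why does pprint (pretty print) show a more detailed parse than a normal print?
Why does the asDict() method give no output for sample1 but does give output for sample2?
Whenever I try to access the parsed elements using print (parsed.numberofwords) or parsed.line_directive or parsed.line_term, I get nothing back. How can I access these elements so that I can use them to build my regex code?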

Answer 1

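To answer your printing questions: 1) pprint is there to pretty-print a nested list of tokens, without showing any results names - it is essentially a wrapper around calling pprint.pprint(results.asList()). 2) asDict() is there to convert your parsed results to an actual Python dict, so it only shows the results names (with nesting if you have names within names).

To view the contents of your parsed output, you are best off using print(result.dump()). dump() shows both the nesting of the results and any named results along the way.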

result = line_contents_expr.parseString(sample2)
print(result.dump())
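As a small illustration of how these views differ, and of why attribute access on the top-level result comes back empty, here is a sketch using a miniature grammar. The grammar is an assumption made up for this example and is much smaller than the one in the question; the point is only that a results name has to be read from the Group that owns it.

```python
from pyparsing import CaselessKeyword, Group, OneOrMore, Word, alphas

# Miniature stand-in for the question's line_term: a directive plus a phrase.
directive = CaselessKeyword("LINE_CONTAINS")("line_directive")
phrase = Group(OneOrMore(Word(alphas)))("phrase")
line_term = Group(directive + phrase)

result = line_term.parseString("LINE_CONTAINS hello world")

# The names are attached inside the Group, so the top level has none:
# asDict() is empty and a missing name reads as an empty string.
print(result.asDict())
print(repr(result.line_directive))

# Index into the Group first, then use the names.
term = result[0]
print(term.line_directive)
print(term.phrase.asList())
print(term.asDict())

# dump() shows the nesting and the names together.
print(result.dump())
```

With the grammar from the question, the same idea applies one level deeper for sample1, because infixNotation wraps the two line terms and the AND between them in a further group.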

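I would also suggest using expr.runTests, which gives you the dump() output as well as any exceptions and exception locators. With your code, you could most easily do this using: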

line_contents_expr.runTests([sample1, sample2])

But I would also suggest you step back and think about what this {upto n words} business is all about. Look at your samples and draw rectangles around the line terms, then within the line terms draw circles around the phrase terms. (This would be a good exercise leading up to writing a BNF description of this grammar for yourself, a step I always recommend for getting your head around the problem.) What if you treated the upto expressions as another operator? To see this, change phrase_term back to the way you had it:

phrase_term = Group(OneOrMore(phrase_word))

And then change the first precedence entry in the definition of the phrase expression to:

((BEFORE | AFTER | JOIN | upto_N_words), 2, opAssoc.LEFT,),

Or give some thought to having the upto operator at a higher or lower precedence than BEFORE, AFTER, and JOIN, and adjust the precedence list accordingly.

With this change, I get this output from calling runTests on your samples:

LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH Therefore we

[[['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]]
[0]:
  [['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]], 'AND', ['LINE_STARTSWITH', [['Therefore', 'we']]]]
  [0]:
    ['LINE_CONTAINS', [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]]
      [0]:
        [['phrase', 'one'], 'BEFORE', [['phrase2'], 'AND', ['phrase3']]]
        [0]:
          ['phrase', 'one']
        [1]:
          BEFORE
        [2]:
          [['phrase2'], 'AND', ['phrase3']]
          [0]:
            ['phrase2']
          [1]:
            AND
          [2]:
            ['phrase3']
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', [['Therefore', 'we']]]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: [['Therefore', 'we']]
      [0]:
        ['Therefore', 'we']

LINE_CONTAINS abcd {upto 4 words} xyzw {upto 3 words} pqrs BEFORE something else

[['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]]
[0]:
  ['LINE_CONTAINS', [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]]
  - line_directive: 'LINE_CONTAINS'
  - phrase: [[['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]]
    [0]:
      [['abcd'], ['upto', 4, 'words'], ['xyzw'], ['upto', 3, 'words'], ['pqrs'], 'BEFORE', ['something', 'else']]
      [0]:
        ['abcd']
      [1]:
        ['upto', 4, 'words']
        - numberofwords: 4
      [2]:
        ['xyzw']
      [3]:
        ['upto', 3, 'words']
        - numberofwords: 3
      [4]:
        ['pqrs']
      [5]:
        BEFORE
      [6]:
        ['something', 'else']

You could iterate over these results and pick them apart, but you are quickly reaching the point where you should build executable nodes from the different precedence levels - see the SimpleBool.py example on the pyparsing wiki for how to do this.

EDIT: Please review this pared-down version of a parser for phrase_expr, and how it creates Node instances that themselves generate the output. See how numberofwords is accessed on the operator in the UpToNode class. See how "xyz abc" gets interpreted as "xyz AND abc" with an implicit AND operator.

from pyparsing import *
import re

UPTO, WORDS, AND, OR = map(CaselessKeyword, "upto words and or".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()

word = ~keyword + Word(alphas)
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)

class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        return "(%s)" % re.escape(self.tokens[0])
    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

IMPLICIT_AND = Empty().setParseAction(replaceWith("AND"))

phrase_expr = infixNotation(word.setParseAction(LiteralNode),
        [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND | IMPLICIT_AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
        ])

tests = """\
        xyz
        xyz abc
        xyz {upto 4 words} def""".splitlines()

for t in tests:
    t = t.strip()
    if not t:
        continue
    print(t)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print(parsed)
    print(parsed[0].generate())
    print()

Prints:

xyz
['xyz']
(xyz)

xyz abc
['xyz' AND 'abc']
(xyz).*(abc)

xyz {upto 4 words} def
['xyz' {0-4 WORDS} 'def']
(xyz)((\s+\S+){0,4}\s+)(def)

Expand on this to support your LINE_xxx expressions.
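One possible direction for that extension, sketched with the standard re module only: wrap the regex generated for a phrase according to its line directive, and treat AND between line terms as "every wrapped pattern must match". The helper names line_regex and line_matches, the anchoring choices, and the hand-written phrase patterns below are all assumptions for illustration; they are not part of the answer's parser.

```python
import re

def line_regex(directive, phrase_regex):
    """Wrap a phrase-level regex according to a line directive.
    Hypothetical helper; the anchoring rules are an assumption."""
    directive = directive.upper()
    if directive == "LINE_STARTSWITH":
        return r"^(?:%s)" % phrase_regex
    if directive == "LINE_ENDSWITH":
        return r"(?:%s)$" % phrase_regex
    if directive == "LINE_CONTAINS":
        return r"(?:%s)" % phrase_regex
    raise ValueError("unknown line directive: %r" % directive)

def line_matches(line, rules):
    """True when every (directive, phrase_regex) pair matches the line,
    i.e. AND semantics between line terms."""
    return all(re.search(line_regex(d, p), line) for d, p in rules)

# Hand-written stand-ins for what generate() might return for each phrase.
rules = [
    ("LINE_CONTAINS", r"(phrase)\s+(one).*(phrase2)"),
    ("LINE_STARTSWITH", r"(Therefore)\s+(we)"),
]

print(line_matches("Therefore we put phrase one before phrase2", rules))
print(line_matches("We put phrase one before phrase2", rules))
```

In a Node-based design like the one above, the same wrapping could live in the generate() method of a node class for line terms, with separate node classes for AND, OR and NOT at the line level.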
