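I have some sentences that I need to convert to regex code, and I am trying to use Pyparsing to do it. The sentences are essentially search rules that tell us what to search for.
Examples of the sentences -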
LINE_CONTAINS this is a phrase
- This is an example search rule stating that the line being searched should contain the phrase this is a phrase
LINE_STARTSWITH However we
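- This is an example search rule stating that the line being searched should start with the phrase However we
Rules can also be combined, for example - LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we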
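To make the target concrete, here is a rough sketch of the kind of regex the two simple directives might translate to, written by hand without Pyparsing. The helper names (`line_contains`, `line_startswith`, `all_of`) are hypothetical, and the semantics are assumptions: phrases are matched literally, and AND between line-level rules is modelled with lookaheads anchored at the start of the line. BEFORE and the braces are not covered.

```python
import re

def line_contains(phrase):
    # Assumption: "contains" means the literal phrase occurs anywhere in the line.
    return r"(?=.*%s)" % re.escape(phrase)

def line_startswith(phrase):
    # Assumption: the line must begin with the literal phrase (no whitespace skipping).
    return r"(?=%s)" % re.escape(phrase)

def all_of(*parts):
    # AND between line-level rules: every lookahead must succeed at position 0.
    return "^" + "".join(parts)

pattern = re.compile(
    all_of(line_contains("this is a phrase"), line_startswith("However we"))
)

matches_both = pattern.search("However we found this is a phrase here") is not None
matches_start_only = pattern.search("However we found nothing") is not None
matches_contains_only = pattern.search("So this is a phrase") is not None
```

Only the first line satisfies both rules, so only `matches_both` is true.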
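Now, I am trying to parse these sentences and then convert them to regex code. Every line starts with one of the 2 symbols mentioned above (call them line_directives). I want to be able to recognise these line_directives, parse them appropriately, and do the same for the phrase that follows them, although that part is parsed differently. With help from Paul McGuire (here) and my own inputs, I have the following code -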
from pyparsing import *
import re
UPTO, AND, OR, WORDS = map(Literal, "upto AND OR words".split())
keyword = UPTO | WORDS | AND | OR
LBRACE,RBRACE = map(Suppress, "{}")
integer = pyparsing_common.integer()
LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(Literal,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())  # LINE_ENDSWITH is included as an option; users may use it, I don't at present
BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
word = ~keyword + Word(alphas)
phrase = Group(OneOrMore(word))
upto_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE)
class Node(object):
    def __init__(self, tokens):
        self.tokens = tokens

    def generate(self):
        pass

class LiteralNode(Node):
    def generate(self):
        print (self.tokens[0], 20)
        for el in self.tokens[0]:
            print (el,type(el), 19)
        print (type(self.tokens[0]), 18)
        return "(%s)" %(' '.join(self.tokens[0])) # here, merged the elements, so that re.escape does not have to do an escape for the entire list

    def __repr__(self):
        return repr(self.tokens[0])

class AndNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '.*'.join(t.generate() for t in tokens[::2]) # change this to the correct form of AND in regex

    def __repr__(self):
        return ' AND '.join(repr(t) for t in self.tokens[0].asList()[::2])

class OrNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        return '|'.join(t.generate() for t in tokens[::2])

    def __repr__(self):
        return ' OR '.join(repr(t) for t in self.tokens[0].asList()[::2])

class UpToNode(Node):
    def generate(self):
        tokens = self.tokens[0]
        ret = tokens[0].generate()
        print (123123)
        word_re = r"\s+\S+"
        space_re = r"\s+"
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += "((%s){0,%d}%s)" % (word_re, op.numberofwords, space_re) + operand.generate()
        print(ret)
        return ret

    def __repr__(self):
        tokens = self.tokens[0]
        ret = repr(tokens[0])
        for op, operand in zip(tokens[1::2], tokens[2::2]):
            # op contains the parsed "upto" expression
            ret += " {0-%d WORDS} " % (op.numberofwords) + repr(operand)
        return ret

phrase_expr = infixNotation(phrase,
    [
        ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,), # (opExpr, numTerms, rightLeftAssoc, parseAction)
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ],
    lpar=Suppress('{'), rpar=Suppress('}')
    ) # structure of a single phrase with its operators

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") +
                  Group(phrase_expr)("phrase")) # basically giving structure to a single sub-rule having line-term and phrase

line_contents_expr = infixNotation(line_term,
    [
        (AND, 2, opAssoc.LEFT,),
        (OR, 2, opAssoc.LEFT),
    ]
    ) # grammar for the entire rule/sentence

phrase_expr = infixNotation(line_contents_expr.setParseAction(LiteralNode),
    [
        (upto_expr, 2, opAssoc.LEFT, UpToNode),
        (AND, 2, opAssoc.LEFT, AndNode),
        (OR, 2, opAssoc.LEFT, OrNode),
    ])

tests1 = """LINE_CONTAINS overexpressing gene AND other things""".splitlines()
for t in tests1:
    t = t.strip()
    if not t:
        continue
    # print(t, 12)
    try:
        parsed = phrase_expr.parseString(t)
    except ParseException as pe:
        print(' '*pe.loc + '^')
        print(pe)
        continue
    print (parsed[0], 14)
    print (type(parsed[0]))
    print(parsed[0].generate(), 15)

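Running this simple code produces the following error -
((['LINE_CONTAINS',([(['overexpressing','gene'],{})],{})], {'phrase':[(([(['overexpressing','gene'],{})],{}),1)], 'line_directive':[('LINE_CONTAINS',0)]}),14)
((['LINE_CONTAINS',([(['overexpressing','gene'],{})],{})], {'phrase':[(([(['overexpressing','gene'],{})],{}),1)], 'line_directive':[('LINE_CONTAINS',0)]}),20)
('LINE_CONTAINS',<,19)
(([(['overexpressing','gene'],{})],{}),,, 19)
(,18)
TypeError: sequence item 1: expected string, ParseResults found (line 29)
(The error output above is not exactly accurate, because the blockquote here does not support angle brackets.)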
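So my question is: even though I have written the grammar (using infixNotation) so that it treats LINE_CONTAINS as a line_directive and parses the rest of the line accordingly, why does it fail to parse correctly? And what would be a good way to parse these lines?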
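For what it is worth, the TypeError itself can be reproduced without Pyparsing. `str.join` accepts only strings, and the token list that reaches `LiteralNode.generate` appears to hold the directive string followed by a nested group (item 1 in the error message). A minimal sketch, using plain lists as a stand-in for the nested ParseResults; the `flatten` helper is hypothetical and not part of Pyparsing:

```python
def flatten(tokens):
    """Yield every string in an arbitrarily nested list, depth first."""
    for tok in tokens:
        if isinstance(tok, str):
            yield tok
        else:
            yield from flatten(tok)

# Plain lists standing in for the nested ParseResults printed in the output.
tokens = ["LINE_CONTAINS", [["overexpressing", "gene"]]]

try:
    " ".join(tokens)          # item 1 is a list, not a string
    join_error = None
except TypeError as exc:
    join_error = str(exc)

flat = list(flatten(tokens))  # directive first, then the phrase words
joined = " ".join(flat[1:])   # join only the phrase words
```

Flattening only hides the symptom. Since `LiteralNode` is attached to `line_contents_expr` rather than to `phrase`, a more structural change would probably be to give the directive its own node class and keep `LiteralNode` for plain phrases, though that is untested here.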