How do I handle nested groups in RegEx?

Asked: 2017-07-20 14:38:03

Tags: python regex

Suppose I have text like this:

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC 

It is part of a PDF file. The line

[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ

contains the text "(You got it)". So I first need to match the text line with

^\[(.*)\]TJ$

Once I have the capture group from that match, I can apply \(((.*?)\)[-0-9]*) to it and replace every match with \2

Is it possible to do this in one step?
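
For reference, the two-step approach described above can be sketched with the standard re module. This is only a sketch under assumptions: the square brackets in the line pattern are escaped, chunk_pat is an escape-aware variant of the second pattern rather than the exact one quoted above, and the names line_pat, chunk_pat and extract_text are made up for illustration.

```python
import re

# Sample content stream from the question (raw string keeps the backslashes).
s = r"""/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC"""

# Step 1: find the TJ line and capture what sits between the square brackets.
line_pat = re.compile(r"^\[(.*)\]TJ$", re.MULTILINE)

# Step 2: one parenthesised string (backslash escapes allowed) followed by an
# optional kerning number. Assumed variant, not the pattern from the question.
chunk_pat = re.compile(r"\(((?:[^)\\]|\\.)*)\)[-0-9.]*")


def extract_text(stream):
    """Return the concatenated string pieces of the first TJ line, or None."""
    m = line_pat.search(stream)
    if m is None:
        return None
    # Replace every "(text)number" chunk with just its text.
    return chunk_pat.sub(r"\1", m.group(1))


result = extract_text(s)
print(result)
```

On the sample this prints \(You gotit\); the escaped parentheses are left exactly as they appear in the stream.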

2 answers:

Answer 0: (score: 0)

With the regex module, you can use this pattern:

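import regex  # third-party module; the stdlib re does not support \G

pat=r'(?:\G(?!\A)\)|\[(?=[^]]*]))[^](]*\(([^)\\]*(?:\\.[^)\\]*)*)(?:\)[^(]*]TJ)?'
regex.sub(pat, r'\1', s)  # s holds the text from the question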


Pattern details:

(?: # two possible starts
    \G     # contiguous to a previous match
    (?!\A) # not at the start of the string
    \)     # a literal closing round bracket
  | # OR
    \[          # an opening square bracket
     (?=[^]]*]) # followed by a closing square bracket
)
[^](]* # all that isn't a closing square bracket or an opening round bracket
\(     # a literal opening round bracket
(      # capture group 1
    [^)\\]* # all characters except a closing round bracket or a backslash
    (?:\\.[^)\\]*)* # to deal with possible escaped characters
)
(?: \) [^(]* ] TJ )? # optional end of the square-bracket part
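
The escape-handling body of capture group 1 does not depend on \G, so it can be tried on its own with the standard re module. The sketch below wraps it in literal parentheses and runs it over the TJ line. The names string_pat, pieces and unescaped are assumptions for illustration, and the unescaping step is simplified: PDF escapes such as \n or octal codes are not handled.

```python
import re

# Body of capture group 1 from the pattern above, wrapped in literal parentheses:
# any run of non-")" / non-backslash characters, interleaved with "\x" escapes.
string_pat = re.compile(r"\(([^)\\]*(?:\\.[^)\\]*)*)\)")

line = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ"

# findall returns the content of the single capture group for every match.
pieces = string_pat.findall(line)
print(pieces)

# Simplified unescaping: drop the backslash in front of any escaped character.
unescaped = re.sub(r"\\(.)", r"\1", "".join(pieces))
print(unescaped)
```

The kerning numbers between the strings are simply never matched, so only the string contents end up in pieces.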

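Answer 1: (score: 0)

Parsing nested groups with regular expressions can be difficult, unreadable, or impossible.

One way to handle nested groups is to use a parsing grammar. Below is a 3-step example using Erik Rose's parsimonious library.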

import itertools as it

import parsimonious as pars

# Given a source text*
source  = """\
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""

# 1. Define a Grammar
rules = r"""

    root            = line line message end

    line            = ANY NEWLINE
    message         = _ TEXT (_ TEXT*)* NEWLINE
    end             = "EMC" NEWLINE*

    TEXT            = ~r"[a-zA-Z ]+" 
    NEWLINE         = ~r"\n"
    ANY             = ~r"[^\n\r]*"

    _               = meaninglessness*
    meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"    

"""

# 2. Parse source text and Build an AST
grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)

# 3. Resolve the AST
class Translator(pars.NodeVisitor):

    def visit_root(self, node, children):
        return children

    def visit_line(self, node, children):
        return node.text

    def visit_message(self, node, children):
        _, s, remaining, nl = children
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)

    def visit_end(self, node, children):
        return node.text

    def visit_meaninglessness(self, node, children):
        return children

    def visit__(self, node, children):
        return children[0]

    def visit_(self, node, children):
        return children

    def visit_TEXT(self, node, children):
        return node.text

    def visit_NEWLINE(self, node, children):
        return node.text

    def visit_ANY(self, node, children):
        return node.text

tr = Translator().visit(tree)
print("".join(tr))

Output

/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC

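Steps

  1. Rather than a strict (and sometimes unreadable) regex, we define a set of regex/EBNF-like grammar rules (see the docs for details). Once the grammar is defined, it is easier to adjust when needed.
    • Note: the original text was modified by adding a space in -2(t ) (line 3), because it was believed to be missing from the OP.
  2. The parsing step is simple: just parse the source text with the grammar. If the grammar is sufficiently defined, the resulting AST has a set of nodes that mirror the structure of the parsed components of the source. Having an AST is what makes this approach flexible.
  3. Define what to do when each node is visited. Any desired technique can be used to resolve the AST. Here we demonstrate the Visitor Pattern by subclassing NodeVisitor from parsimonious.
  4. Now, for new or unexpected text encountered in the PDF, simply modify the grammar and parse again.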
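
If adding a grammar library is not an option, genuinely nested parentheses can also be handled by a small hand-written scanner that tracks nesting depth. The following is a minimal sketch and not part of parsimonious: extract_strings is a hypothetical helper, and it treats a backslash as meaning that the next character is kept verbatim, which ignores PDF escapes such as \n or octal codes.

```python
def extract_strings(array_body):
    """Return the literal strings found in the body of a TJ array."""
    strings, current, depth = [], [], 0
    i = 0
    while i < len(array_body):
        ch = array_body[i]
        if depth and ch == "\\" and i + 1 < len(array_body):
            # Escaped character inside a string: keep the next character as is.
            current.append(array_body[i + 1])
            i += 2
            continue
        if ch == "(":
            depth += 1
            if depth > 1:
                current.append(ch)   # nested "(" belongs to the string
        elif ch == ")" and depth:
            depth -= 1
            if depth == 0:
                strings.append("".join(current))  # outermost string closed
                current = []
            else:
                current.append(ch)   # nested ")" belongs to the string
        elif depth:
            current.append(ch)       # ordinary character inside a string
        # Anything outside a string (kerning numbers) is skipped.
        i += 1
    return strings


body = r"(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))"
message = "".join(extract_strings(body))
print(message)
```

Unlike a single non-recursive regex, the depth counter copes with balanced, unescaped parentheses inside a string, for example extract_strings("(a(b)c)12(d)") yields the two strings a(b)c and d.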