Suppose I have text like this:
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
EMC
It is part of a PDF file. The line
[(\()-2(Y)7(o)7(u've )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ
contains the text "(You've got it)". So I first need to match the text line with
^\[(.*)\]TJ$
With the contents of that capture group, I can apply
\(((.*?)\)[-0-9]*)
and replace every match with \2.
Is it possible to do this in one step?
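For reference, the two steps described in the question can be folded into a single re.sub call by passing a replacement function. This is only a minimal sketch using the standard library re module. The names LINE_RE, CHUNK_RE and extract_tj_text are hypothetical, and the inner pattern is an escape-aware variant of the one in the question, not the original.

```python
import re

# Step 1: a whole "[ ... ]TJ" line, capturing what sits between the brackets.
LINE_RE = re.compile(r"^\[(.*)\]TJ$", re.MULTILINE)
# Step 2: one "(string)" operand plus any kerning number that follows it.
# The string body may contain backslash escapes such as \( and \).
CHUNK_RE = re.compile(r"\(((?:[^()\\]|\\.)*)\)[-0-9.]*")

def extract_tj_text(content):
    """Replace every [ ... ]TJ line with the concatenated string operands."""
    def join_chunks(line_match):
        # A function replacement keeps backslashes in the text untouched.
        return CHUNK_RE.sub(lambda m: m.group(1), line_match.group(1))
    return LINE_RE.sub(join_chunks, content)

sample = "\n".join([
    "/TT0 1 Tf",
    r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ",
    "EMC",
])
print(extract_tj_text(sample))  # second line becomes: \(You gotit\)
```

The escaped parentheses are left escaped in the result; decoding PDF escape sequences would be a separate step.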
Answer 0: (score: 0)
With the regex module, you can use this pattern:
import regex  # third-party module; the standard library re does not support \G
pat = r'(?:\G(?!\A)\)|\[(?=[^]]*]))[^](]*\(([^)\\]*(?:\\.[^)\\]*)*)(?:\)[^(]*]TJ)?'
regex.sub(pat, r'\1', s)
Pattern details:
(?: # two possible starts
\G # contiguous to a previous match
(?!\A) # not at the start of the string
\) # a literal closing round bracket
| # OR
\[ # an opening square bracket
(?=[^]]*]) # followed by a closing square bracket
)
[^](]* # all that isn't a closing square bracket or an opening round bracket
\( # a literal opening round bracket
( # capture group 1
[^)\\]* # all characters except a closing round bracket or a backslash
(?:\\.[^)\\]*)* # to deal with eventual escaped characters
)
(?: \) [^(]* ] TJ )? # eventual end of the square bracket parts
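The key idea in the pattern above is contiguity: after the opening square bracket, each match must start exactly where the previous one ended. As a rough illustration of that idea only, the same chaining can be imitated with the standard library by calling a compiled pattern's match method with an explicit start position. This is a sketch under assumptions, not the answer's method; the names OPEN_RE, ITEM_RE and tj_strings are hypothetical.

```python
import re

OPEN_RE = re.compile(r"\[")
# Anything that is not "]" or "(", then one "(string)" operand whose body
# may contain backslash-escaped characters.
ITEM_RE = re.compile(r"[^\](]*\(([^)\\]*(?:\\.[^)\\]*)*)\)")

def tj_strings(line):
    """Return the string operands that follow the first "[" contiguously."""
    start = OPEN_RE.search(line)
    if start is None:
        return []
    pos = start.end()
    found = []
    while True:
        # match() with a position only succeeds if the item starts exactly
        # at pos, which plays the role of the \G anchor.
        m = ITEM_RE.match(line, pos)
        if m is None:
            break
        found.append(m.group(1))
        pos = m.end()
    return found

line = r"[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t)4(i)-3(t)(\))]TJ"
print("".join(tj_strings(line)))  # \(You gotit\)
```

The loop stops at the closing square bracket because no further operand can start there.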
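Answer 1: (score: 0)
Using regular expressions to parse nested groups can be difficult, illegible, or impossible.
One way to handle nested groups is to use a parsing grammar. Below is a 3-step example using Erik Rose's parsimonious library.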
import itertools as it
import parsimonious as pars
# Given a source text*
source = """\
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
[(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))]TJ
EMC"""
# 1. Define a Grammar
rules = r"""
root = line line message end
line = ANY NEWLINE
message = _ TEXT (_ TEXT*)* NEWLINE
end = "EMC" NEWLINE*
TEXT = ~r"[a-zA-Z ]+"
NEWLINE = ~r"\n"
ANY = ~r"[^\n\r]*"
_ = meaninglessness*
meaninglessness = ~r"(TJ)*[^a-zA-Z\n\r]*"
"""
# 2. Parse source text and Build an AST
grammar = pars.grammar.Grammar(rules)
tree = grammar.parse(source)
# print(tree)
# 3. Resolve the AST
class Translator(pars.NodeVisitor):
    def visit_root(self, node, children):
        return children
    def visit_line(self, node, children):
        return node.text
    def visit_message(self, node, children):
        _, s, remaining, nl = children
        return (s + "".join(it.chain.from_iterable(i[1] for i in remaining)) + nl)
    def visit_end(self, node, children):
        return node.text
    def visit_meaninglessness(self, node, children):
        return children
    def visit__(self, node, children):
        return children[0]
    def visit_(self, node, children):
        return children
    def visit_TEXT(self, node, children):
        return node.text
    def visit_NEWLINE(self, node, children):
        return node.text
    def visit_ANY(self, node, children):
        return node.text
tr = Translator().visit(tree)
print("".join(tr))
Output
/TT0 1 Tf
0.002 Tc -0.002 Tw 11.04 0 0 11.04 221.16 707.04 Tm
You got it
EMC
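* A space was added to 2(t ) on line 3 of the source text, because it was believed to be missing in the OP.
Steps
1. Define a grammar.
2. Parse the source text against the grammar with parse, which builds an AST. If the grammar is sufficiently defined, the resulting AST has a set of nodes that reflect the structure of the parsed components of the source. Having an AST is what makes this approach flexible.
3. Resolve the AST with a subclass of NodeVisitor from parsimonious. Now, for new or unexpected text encountered in the PDF, simply modify the grammar and parse again.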
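The nesting problem can also be sketched without a third-party dependency. The hand-written scanner below is an illustration under assumptions, not part of the answer above: the function name tj_text is hypothetical, it only tracks parenthesis depth, and it keeps an escaped character literally instead of decoding escapes such as \n or octal codes.

```python
def tj_text(array_body):
    """Collect the text of every (string) operand in a TJ array body."""
    out = []
    depth = 0  # current parenthesis nesting level
    i = 0
    while i < len(array_body):
        ch = array_body[i]
        if ch == "\\" and depth > 0:
            # Escaped character inside a string: keep the next char as-is.
            if i + 1 < len(array_body):
                out.append(array_body[i + 1])
            i += 2
            continue
        if ch == "(":
            depth += 1
            if depth > 1:
                out.append(ch)  # a nested parenthesis is part of the text
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth > 0:
                out.append(ch)
        elif depth > 0:
            out.append(ch)  # ordinary character inside a string
        # Anything at depth 0 (kerning numbers, spaces) is skipped.
        i += 1
    return "".join(out)

body = r"(\()-2(Y)7(o)7(u )-3(g)7(o)-2(t )4(i)-3(t)(\))"
print(tj_text(body))  # (You got it)
```

Because the scanner counts depth instead of pattern matching, balanced parentheses inside a string, such as (a(b)c), come through intact.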