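I need a way, in Python, to take a string of text and split its contents into a list, separating three kinds of pieces: outermost square brackets, outermost parentheses, and plain text, while preserving the original syntax.
For example, given the string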
(([a] b) c ) [d] (e) f
the expected output would be this list:
['(([a] b) c )', '[d]', '(e)', ' f']
I tried a few things with regular expressions, such as
\[.+?\]|\(.+?\)|[\w+ ?]+
which gave me
>>> re.findall(r'\[.+?\]|\(.+?\)|[\w+ ?]+', '(([a] b) c ) [d] (e) f')
['(([a] b)', ' c ', ' ', '[d]', ' ', '(e)', ' f']
(the item c ends up in the wrong list element)
I also tried the greedy version of it,
\[.+\]|\(.+\)|[\w+ ?]+
but it falls short when the string contains several separate groups of the same kind:
>>> re.findall(r'\[.+\]|\(.+\)|[\w+ ?]+', '(([a] b) c ) [d] (e) f')
['(([a] b) c ) [d] (e)', ' f']
Then I moved on from regular expressions and used a stack instead:
>>> def parenthetic_contents(string):
...     stack = []
...     for i, c in enumerate(string):
...         if c == '(' or c == '[':
...             stack.append(i)
...         elif c == ')' or c == ']':
...             start = stack.pop()
...             yield (len(stack), string[start:i + 1])
This works well for brackets and parentheses, except that I have no way to get the flat text (or maybe I do, and I just don't know how?):
>>> list(parenthetic_contents('(([a] b) c ) [d] (e) f'))
[(2, '[a]'), (1, '([a] b)'), (0, '(([a] b) c )'), (0, '[d]'), (0, '(e)')]
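I am not familiar with pyparsing. At first it looked as if nestedExpr() would do the trick, but it takes only one kind of delimiter (() or [], but not both), which does not work for me.
I am all out of ideas now. Any suggestions would be welcome.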
Answer 0 (score: 1)
I managed to do this with a simple parser that uses a level variable to keep track of how deep you are in the nesting.
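import string

def get_string_items(s):
    in_object = False   # True while inside a bracketed group
    level = 0           # current nesting depth
    current_item = ''
    for char in s:
        # Letters are always part of the current item.
        if char in string.ascii_letters:
            current_item += char
            continue
        # Outside any group, spaces are skipped.
        if not in_object:
            if char == ' ':
                continue
        if char in ('(', '['):
            in_object = True
            level += 1
        elif char in (')', ']'):
            level -= 1
        current_item += char
        # Back at the top level: the group is complete.
        if level == 0:
            yield current_item
            current_item = ''
            in_object = False
    # Emit trailing plain text, but not an empty string.
    if current_item:
        yield current_item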
Output:
list(get_string_items(s))
Out[4]: ['(([a] b) c )', '[d]', '(e)', 'f']
list(get_string_items('(hi | hello) world'))
Out[12]: ['(hi | hello)', 'world']
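The level-counting idea can also be written with indices into the original string, so that the plain text between groups is kept verbatim, including spaces and punctuation. The following is only a minimal sketch under the assumption that the input is balanced; the name split_top_level and its opens/closes parameters are hypothetical and not part of the answer above.

```python
def split_top_level(text, opens="([", closes=")]"):
    """Yield top-level bracketed groups and the plain text between them.

    Assumption: brackets in `text` are balanced. Bracket types are not
    checked against each other; only the nesting depth is tracked.
    """
    depth = 0   # current nesting depth
    start = 0   # index where the current piece begins
    for i, ch in enumerate(text):
        if ch in opens:
            if depth == 0 and start < i:
                # Plain text that preceded this top-level group.
                yield text[start:i]
                start = i
            depth += 1
        elif ch in closes:
            depth -= 1
            if depth == 0:
                # The outermost group just closed.
                yield text[start:i + 1]
                start = i + 1
    if start < len(text):
        # Trailing plain text after the last group.
        yield text[start:]


print(list(split_top_level('(([a] b) c ) [d] (e) f')))
# ['(([a] b) c )', ' ', '[d]', ' ', '(e)', ' f']
```

Unlike the parser above, this variant keeps the whitespace between groups as separate items, so filter those out if they are not wanted.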
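Answer 1 (score: 1)
Only very lightly tested (and the output includes whitespace). As with @Marius' answer (and the general rule that paren matching requires a PDA), I use a stack. However, I have a little extra paranoia built into mine.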
def paren_matcher(string, opens, closes):
    """Yield (in order) the parts of a string that are contained
    in matching parentheses.  That is, upon encountering an "open
    parenthesis" character (one in <opens>), we require a
    corresponding "close parenthesis" character (the corresponding
    one from <closes>) to close it.

    If there are embedded <open>s they increment the count and
    also require corresponding <close>s.  If an <open> is closed
    by the wrong <close>, we raise a ValueError.
    """
    stack = []
    if len(opens) != len(closes):
        raise TypeError("opens and closes must have the same length")
    # could make sure that no closes[i] is present in opens, but
    # won't bother here...

    result = []
    for char in string:
        # If it's an open parenthesis, push corresponding closer onto stack.
        pos = opens.find(char)
        if pos >= 0:
            if result and not stack:  # yield accumulated pre-paren stuff
                yield ''.join(result)
                result = []
            result.append(char)
            stack.append(closes[pos])
            continue
        result.append(char)
        # If it's a close parenthesis, match it up.
        pos = closes.find(char)
        if pos >= 0:
            if not stack or stack[-1] != char:
                raise ValueError("unbalanced parentheses: %s" %
                                 ''.join(result))
            stack.pop()
            if not stack:  # final paren closed
                yield ''.join(result)
                result = []
    if stack:
        raise ValueError("unclosed parentheses: %s" % ''.join(result))
    if result:
        yield ''.join(result)

print(list(paren_matcher('(([a] b) c ) [d] (e) f', '([', ')]')))
print(list(paren_matcher('foo (bar (baz))', '(', ')')))
Answer 2 (score: 1)
You can still use nestedExpr; you just need to create several expressions, one for each kind of delimiter:
from pyparsing import nestedExpr, Word, printables, quotedString, OneOrMore
parenList = nestedExpr('(', ')')
brackList = nestedExpr('[', ']')
printableWord = Word(printables, excludeChars="()[]")
expr = OneOrMore(parenList | brackList | quotedString | printableWord)
sample = """(([a] b) c ")" ) [d] (e) f "(a quoted) [string] with ()'s" """
import pprint
pprint.pprint(expr.parseString(sample).asList())
prints:
[[['[a]', 'b'], 'c', '")"'],
['d'],
['e'],
'f',
'"(a quoted) [string] with ()\'s"']
Note that by default, nestedExpr returns the parsed content as a nested structure. To preserve the original text, wrap the expressions in originalTextFor:
# preserve nested expressions as their original strings
from pyparsing import originalTextFor
parenList = originalTextFor(parenList)
brackList = originalTextFor(brackList)
expr = OneOrMore(parenList | brackList | quotedString | printableWord)
pprint.pprint(expr.parseString(sample).asList())
prints:
['(([a] b) c ")" )', '[d]', '(e)', 'f', '"(a quoted) [string] with ()\'s"']
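Answer 3 (score: 0)
Well, once you give up the idea that parsing nested expressions has to work at unlimited depth, you can use regular expressions just fine by specifying a maximum depth in advance. Here is how: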
import re

def nested_matcher(n):
    # poor man's matched paren scanning, gives up after n+1 levels.
    # Matches any string with balanced parens or brackets inside; add
    # the outer parens yourself if needed.  Nongreedy.  Does not
    # distinguish parens and brackets as that would cause the
    # expression to grow exponentially rather than linearly in size.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n

p = re.compile('[^][()]+|[([]' + nested_matcher(10) + '[])]')
print(p.findall('(([a] b) c ) [d] (e) f'))
This will output
['(([a] b) c )', ' ', '[d]', ' ', '(e)', ' f']
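This is not quite what you asked for above, but then your description and example do not make it clear what you intend to do with the spaces.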
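If the whitespace-only pieces are simply unwanted, one possible reading of the question is to drop them after matching. The sketch below reuses the same depth-limited pattern and filters the result; the helper name split_without_blank_pieces is hypothetical, and the choice to keep leading spaces on non-blank pieces (such as ' f') is an assumption based on the expected output shown in the question.

```python
import re

def nested_matcher(n):
    # Same depth-limited pattern as above: balanced parens/brackets
    # nested up to n levels below the outer pair.
    return "[^][()]*?(?:[([]" * n + "[^][()]*?" + "[])][^][()]*?)*?" * n

PATTERN = re.compile("[^][()]+|[([]" + nested_matcher(10) + "[])]")

def split_without_blank_pieces(text):
    # Hypothetical helper: keep every match except whitespace-only ones.
    return [piece for piece in PATTERN.findall(text) if piece.strip()]

print(split_without_blank_pieces('(([a] b) c ) [d] (e) f'))
# ['(([a] b) c )', '[d]', '(e)', ' f']
```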