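This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be anything more general. What I'm looking for is a way to split a string at commas that are not inside quotes or pairs of delimiters. For example: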
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
should be split into a list of three elements:
['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']
The problem gets more complicated because the pairs <> and () can themselves be nested:
s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'
should be split into:
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
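The naive solution without regex is to parse the string by looking for the characters , < and (. When < or ( is found, we start tracking a nesting counter (called parity here). We may split at a comma only when the parity is zero. For example, to split s2 we start with parity = 0; when we reach s2[3] we encounter <, which increases the parity by 1. The parity increases whenever < or ( is encountered and decreases only when > or ) is encountered. While the parity is not 0, we simply ignore commas and do not split.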
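The counting scheme described above can be sketched roughly as follows. This is only a minimal illustration that assumes well-formed input; the name split_top_level is hypothetical, and quoted strings are deliberately not handled here.

```python
def split_top_level(text):
    """Split text at commas that are not nested inside <> or ()."""
    parts = []
    depth = 0   # the "parity" counter from the description above
    start = 0   # index where the current field begins
    for i, ch in enumerate(text):
        if ch in '<(':
            depth += 1
        elif ch in '>)':
            depth -= 1
        elif ch == ',' and depth == 0:
            # A comma at nesting depth zero ends the current field.
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])  # whatever follows the last top-level comma
    return parts


s = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5)'
print(split_top_level(s))
# ['obj<1, sub<6, 7>, 3>', ' x(4, y(8, 9), 5)']
```

A single counter does not check that each closer matches its opener, so an input such as 'a<1, 2)' would be accepted silently.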
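The question here is: is there a way to do this quickly with regex? I have been looking into this solution, but it doesn't seem to cover the examples I have given.
A more general function would look something like this: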
def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""
and it would be used like this:
split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])
Can regex handle this, or is it necessary to write a specialized parser?
Answer 0: (score: 8)
While this cannot be done with a regular expression, the following simple code will achieve the desired result:
def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    result = []
    buff = ""
    level = 0          # current nesting depth of brackets
    is_quoted = False  # True while inside a quoted string

    for char in text:
        # Split only at a delimiter that is outside all brackets and quotes.
        if char in delimiter and level == 0 and not is_quoted:
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in opens:
                level += 1
            if char in closes:
                level -= 1
            if char in quotes:
                is_quoted = not is_quoted

    # Keep the trailing field unless it is empty.
    if buff:
        result.append(buff)

    return result
Running it in the interpreter:
>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
#=> ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
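The same idea can be adapted to the split_at(text, delimiter, exceptions) signature proposed in the question. The sketch below is one possible variant, not part of the answer above: the name split_at_pairs is hypothetical, the delimiter is assumed to be a single character, and a stack of expected closers replaces the single counter so that brackets inside quotes are treated as literal text.

```python
def split_at_pairs(text, delimiter, exceptions):
    """Split text at delimiter unless it lies inside one of the exception pairs."""
    closer_of = dict(exceptions)                      # opener -> closer
    symmetric = {o for o, c in exceptions if o == c}  # quote-like pairs
    result, buff, stack = [], [], []                  # stack holds expected closers
    for ch in text:
        if stack and ch == stack[-1]:
            stack.pop()                    # closes the innermost open region
        elif stack and stack[-1] in symmetric:
            pass                           # inside quotes: everything is literal
        elif ch in closer_of:
            stack.append(closer_of[ch])    # opens a new region
        elif ch == delimiter and not stack:
            result.append(''.join(buff))   # top-level delimiter ends the field
            buff = []
            continue
        buff.append(ch)
    result.append(''.join(buff))
    return result


pairs = [('<', '>'), ('(', ')'), ('"', '"')]
print(split_at_pairs('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', pairs))
# ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
print(split_at_pairs('"a<b", c', ',', pairs))
# ['"a<b"', ' c']
```

Unlike the version above, this sketch always appends the final field, so a trailing delimiter produces an empty last element, as str.split does.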
Answer 1: (score: 5)
Using iterators and generators:
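from collections import defaultdict
from operator import length_hint

def tokenize(txt, delim=',', pairs={'"': '"', '<': '>', '(': ')'}):
    fst, snd = set(pairs.keys()), set(pairs.values())
    it = iter(txt)

    def loop():
        # Yield the characters of one token, consuming the shared iterator.
        cnt = defaultdict(int)
        for ch in it:
            if ch == delim and not any(cnt[x] for x in snd):
                return
            elif ch in fst and ch in snd:
                cnt[ch] ^= 1          # a quote both opens and closes, so toggle
            elif ch in fst:
                cnt[pairs[ch]] += 1
            elif ch in snd:
                cnt[ch] -= 1
            yield ch

    # Keep producing tokens while unread characters remain.
    while length_hint(it):
        yield ''.join(loop())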
and
>>> txt = 'obj<1, sub<6, 7>, 3>,x(4, y(8, 9), 5),"msg, with comma"'
>>> [x for x in tokenize(txt)]
['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']
Answer 2: (score: 4)
If you have recursively nested expressions, you can split on the commas and validate that the delimiters match by using pyparsing:
import pyparsing as pp

def CommaSplit(txt):
    '''Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
    com_lok = []
    comma = pp.Suppress(',')
    # note the location of each comma outside an ignored expression:
    comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
    ident = pp.Word(pp.alphas + "_", pp.alphanums + "_")   # python identifier
    ex1 = (ident + pp.nestedExpr(opener='<', closer='>'))  # Ignore everything inside nested '< >'
    ex2 = (ident + pp.nestedExpr())                        # Ignore everything inside nested '( )'
    ex3 = pp.Regex(r'("|\').*?\1')                         # Ignore everything inside "'" or '"'
    atom = ex1 | ex2 | ex3 | comma
    expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma + atom)
    try:
        result = expr.parseString(txt)
    except pp.ParseException:
        return [txt]
    else:
        return [txt[st:end] for st, end in zip([0] + [e + 1 for e in com_lok], com_lok + [len(txt)])]
tests='''\
obj<1, 2, 3>, x(4, 5), "msg, with comma"
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3>
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
'''
for te in tests.splitlines():
    result = CommaSplit(te)
    print(te, '==>\n\t', result)
Prints:
obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', ' ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]
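The current behavior is similar to '(something does not split), b, "in quotes", c'.split(','), including keeping the leading spaces and the quotes. Stripping the quotes and leading spaces from the fields is trivial.
Add a strip_fields argument to CommaSplit (for example strip_fields=False; the flag is not defined in the code above) and change the else under try to: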
    else:
        rtr = [txt[st:end] for st, end in zip([0] + [e + 1 for e in com_lok], com_lok + [len(txt)])]
        if strip_fields:
            rtr = [e.strip().strip('\'"') for e in rtr]
        return rtr
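The stripping step can also be kept separate from the parser, so that it works on the output of any of the splitters in this thread. A minimal sketch, where the helper name clean_fields is an assumption and not part of the answer:

```python
def clean_fields(fields, quotes='\'"'):
    """Strip surrounding whitespace, then surrounding quote characters."""
    return [f.strip().strip(quotes) for f in fields]


raw = ['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
print(clean_fields(raw))
# ['obj<1, 2, 3>', 'x(4, 5)', 'msg, with comma']
```

Note that str.strip removes every leading and trailing quote character, not just one matching pair, so a field that legitimately starts or ends with a quote would lose it.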