My format looks like this:
line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
I want to extract these values into a dictionary, to get a result like this:
{
'A': '15',
'B': '8',
'C': 'false',
'D': '[somevar, a=0.1, b=77, c=true]'
}
If it weren't for the D value, I could use a simple approach like this:
result = dict(e.split('=') for e in line.split(', '))
But since D itself contains ', ' as a separator, I get a mess:
{
'A': '15',
'B': '8',
'C': 'false',
'D': '[somevar',
'a': '0.1',
'b': '77',
'c': 'true]'
}
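I'd appreciate any suggestions. I haven't tried regexps yet, but this has to be fast, because there are tens of GB of such lines and I'm worried that regexping will slow things down a lot...
I've wrapped most of the answers below in functions and measured the execution time with ipython's %timeit magic function.
The test file was created on a tmpfs in RAM, simply with: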
for i in {1..1000000}; do echo 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]' >> test_file; done
This is what the whole test program looks like:
import shlex
import re

def kalgasnik(line):
    lexer = shlex.shlex(line)
    lexer.wordchars += '.'
    values = [['']]
    stack = [values]
    for token in lexer:
        if token == ',':
            stack[-1] += [['']]
        elif token == '=':
            stack[-1][-1] += ['']
        elif token == '[':
            v = [['']]
            stack[-1][-1][-1] = v
            stack += [v]
        elif token == ']':
            sub = stack.pop()
            stack[-1][-1][-1] = {v[0]: v[1] if len(v) > 1 else None for v in sub}
        else:
            stack[-1][-1][-1] += token
    values = {v[0]: v[1] if len(v) > 1 else None for v in values}
    return values

def roberto(myline):
    mydict = {}
    parsecheck = {'(':1, '[':1, '{':1, ')':-1, ']':-1, '}':-1}
    parsecount = 0
    chargroup = ''
    myline = myline + ','
    for thischar in myline:
        parsecount += parsecheck.get(thischar, 0)
        if parsecount == 0:
            if thischar == '=':
                thiskey = chargroup.strip()
                chargroup = ''
            elif thischar == ',':
                mydict[thiskey] = chargroup
                chargroup = ''
            else:
                chargroup += thischar
        else:
            chargroup += thischar
    return mydict

def xavier(line):
    regexp = r'(\w*)=(\[[^\]]*\]|[^,]*),?\s*'
    outdict = dict((match.group(1),match.group(2)) for match in re.finditer(regexp,line))
    return outdict

def wim(line):
    outdict = dict(x.split('=', 1) for x in shlex.split(line.replace("[", "'[").replace("]", "]'")))
    return outdict

def gorkypl(line):
    outdict = dict(e.split('=') for e in line.split(', '))
    return outdict

def run_test(method):
    with open('test_file', 'r') as infile:
        for line in infile:
            method(line)
Here are the results:
%timeit run_test(kalgasnik)
1 loops, best of 3: 3min 52s per loop
%timeit run_test(roberto)
1 loops, best of 3: 30.2 s per loop
%timeit run_test(xavier)
1 loops, best of 3: 12.1 s per loop
%timeit run_test(wim)
1 loops, best of 3: 2min 41 s per loop
For comparison, here is the original idea based purely on split (which does not work correctly):
%timeit run_test(gorkypl)
1 loops, best of 3: 8.27 s per loop
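So, basically, Xavier's regex-based solution is not only the most flexible but also the fastest of the working solutions (12.1 s), and not much slower than the naive split()-based approach (8.27 s).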
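For anyone without IPython, a similar comparison can be sketched with the standard timeit module. This is only a rough sketch under a few assumptions: it times an in-memory list instead of the tmpfs file, it uses far fewer lines, and the names PAIR_RE, parse_regex, parse_split and time_parser are made up for illustration. Compiling the pattern up front is my own choice here, not something that was measured above.

```python
import re
import timeit

# Same pattern as in xavier(), compiled once up front (an illustrative choice).
PAIR_RE = re.compile(r'(\w*)=(\[[^\]]*\]|[^,]*),?\s*')

def parse_regex(line):
    # Each match has exactly two groups: the key and the value.
    return dict(m.groups() for m in PAIR_RE.finditer(line))

def parse_split(line):
    # Naive baseline: fast, but it also splits inside the brackets.
    return dict(e.split('=') for e in line.split(', '))

def time_parser(parser, lines):
    """Return the seconds taken to run parser over every line once."""
    return timeit.timeit(lambda: [parser(l) for l in lines], number=1)

sample = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
lines = [sample] * 100000  # much smaller than the real 1,000,000-line file

for parser in (parse_regex, parse_split):
    print(parser.__name__, time_parser(parser, lines))
```

The absolute numbers will differ from the %timeit figures above, so only the ratio between the two parsers is worth looking at.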
Thank you very much!
Answer 0 (score: 4)
As long as there are no nested brackets (and only then), this is a very good fit for a regular expression.
import re
line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
regexp = r'(\w*)=(\[[^\]]*\]|[^,]*),?\s*'
print(dict((match.group(1),match.group(2)) for match in re.finditer(regexp,line)))
Output
{'A': '15', 'C': 'false', 'B': '8', 'D': '[somevar, a=0.1, b=77, c=true]'}
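As for your fear that it won't be fast: don't assume. Since regular expressions run in optimized C, you have little chance of doing better yourself (apart from a few pathological cases).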
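To make the "no nested brackets" caveat concrete, here is a small sketch that runs the same pattern over a flat line and over a nested one. The nested sample line and the helper name parse are invented for illustration.

```python
import re

regexp = r'(\w*)=(\[[^\]]*\]|[^,]*),?\s*'

def parse(line):
    return dict((m.group(1), m.group(2)) for m in re.finditer(regexp, line))

flat = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
nested = 'A=1, D=[x, c=[y, z=2]]'  # hypothetical nested input

print(parse(flat))
# [^\]]* stops at the first ']', so the outer closing bracket is lost:
print(parse(nested)['D'])  # [x, c=[y, z=2]
```

If nesting can occur in the real data, a depth-counting approach is the safer choice.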
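Answer 1 (score: 1)
Pass over the input string once and check for list segments.
If lists can be nested, keep track of the depth with a counter.
This would turn
line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
into
line = 'A=15, B=8, C=false, D=[somevar! a?0.1! b?77! c?true]'
Only after generating the result, replace ? and ! with = and , again.
Edit: don't use ordinary characters as placeholders; use control characters instead to avoid collisions.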
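A minimal sketch of this idea, assuming the control characters '\x00' and '\x01' never occur in the input. The names COMMA_MASK, EQUALS_MASK, mask_brackets, unmask and parse_masked are made up for illustration.

```python
# Placeholder control characters; assumed never to appear in the data.
COMMA_MASK = '\x00'
EQUALS_MASK = '\x01'

def mask_brackets(line):
    """Replace ',' and '=' inside [...] with placeholders, tracking depth."""
    depth = 0
    out = []
    for ch in line:
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif depth > 0 and ch == ',':
            ch = COMMA_MASK
        elif depth > 0 and ch == '=':
            ch = EQUALS_MASK
        out.append(ch)
    return ''.join(out)

def unmask(text):
    """Put the original ',' and '=' back."""
    return text.replace(COMMA_MASK, ',').replace(EQUALS_MASK, '=')

def parse_masked(line):
    masked = mask_brackets(line)
    # Now the simple split is safe: no top-level separators are hidden in brackets.
    pairs = (item.split('=', 1) for item in masked.split(', '))
    return {key: unmask(value) for key, value in pairs}

line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
print(parse_masked(line))
```

Because of the depth counter, nested lists such as D=[x, c=[y, z=2]] stay in one piece as well.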
Answer 2 (score: 1)
As a sample of unnecessary complexity:
import shlex

line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=[A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]]]'
lexer = shlex.shlex(line)
lexer.wordchars += '.'
values = [['']]
stack = [values]
for token in lexer:
    if token == ',':
        stack[-1] += [['']]
    elif token == '=':
        stack[-1][-1] += ['']
    elif token == '[':
        v = [['']]
        stack[-1][-1][-1] = v
        stack += [v]
    elif token == ']':
        sub = stack.pop()
        stack[-1][-1][-1] = {v[0]: v[1] if len(v) > 1 else None for v in sub}
    else:
        stack[-1][-1][-1] += token
values = {v[0]: v[1] if len(v) > 1 else None for v in values}
Result:
>>> line
'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=[A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]]]'
>>> values
{'A': '15',
 'B': '8',
 'C': 'false',
 'D': {'a': '0.1',
       'b': '77',
       'c': {'A': '15',
             'B': '8',
             'C': 'false',
             'D': {'a': '0.1', 'b': '77', 'c': 'true', 'somevar': None}},
       'somevar': None}}
Answer 3 (score: 0)
How about reading it as csv, using '=' as the delimiter:
>>> line = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'
>>> mod_line = line.replace('[','"') #replace [ and ] with " so it can be used as a csv quote char
>>> mod_line = mod_line.replace(']','"')
>>> lines_list = []
>>> lines_list.append(mod_line) #put line into an iterable object for csv reader
>>> import csv
>>> reader = csv.reader(lines_list, delimiter='=', quotechar='"')
>>> for row in reader:
...     print(row) # or you could call a function that will turn the returned list into the dictionary you are after
...
['A', '15, B', '8, C', 'false, D', 'somevar, a=0.1, b=77, c=true']
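The function hinted at in the comment could look something like the sketch below. row_to_dict is a hypothetical helper, and it assumes every inner field has the shape 'value, next_key'. Note that the brackets are gone from the D value because they were replaced by quote characters, so put them back if you need them.

```python
def row_to_dict(row):
    """Turn ['A', '15, B', ..., 'last value'] into {'A': '15', ...}."""
    result = {}
    key = row[0]
    for field in row[1:-1]:
        # Each inner field holds the current value and the next key.
        value, next_key = field.rsplit(', ', 1)
        result[key] = value
        key = next_key
    # The final field is just the last value.
    result[key] = row[-1]
    return result

# The row exactly as printed by the csv reader above.
row = ['A', '15, B', '8, C', 'false, D', 'somevar, a=0.1, b=77, c=true']
print(row_to_dict(row))
```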
Answer 4 (score: 0)
This may not be very pretty, but it works. Maybe it can serve as a starting point for something more Pythonic?
myline = 'A=15, B=8, C=false, D=[somevar, a=0.1, b=77, c=true]'

def separate(myline):
    mydict = {}
    # +1 for every opening bracket, -1 for every closing one
    parsecheck = {'(':1, '[':1, '{':1, ')':-1, ']':-1, '}':-1}
    parsecount = 0
    chargroup = ''
    myline = myline + ',' # So all the entries end with a ','
    for thischar in myline:
        parsecount += parsecheck.get(thischar, 0)
        # Only treat '=' and ',' as separators outside of any brackets
        if parsecount == 0 and thischar in '=,':
            if thischar == '=':
                thiskey = chargroup.strip()
            elif thischar == ',':
                mydict[thiskey] = chargroup
            chargroup = ''
        else:
            chargroup += thischar
    return mydict

print(separate(myline))
[Edited to clean up the code]