I'm looking for advice on how to build a data structure by parsing a file. Here is the list from my file:
'01bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'01bpar( 3)= 0.00000000E+00',
'02epar( 1)= 0.49998963E+02',
'02epar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'02epar( 3)= 0.00000000E+00',
'02epar( 4)= 0.17862340E-01 half_life= 0.3880495E+02 relax_time= 0.5598371E+02',
'02bpar( 1)= 0.49998962E+02',
'02bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
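What I need to do is build a data structure like the one shown here:
http://img11.imageshack.us/img11/7645/datastructure.gif
(I can't post the image inline because of the new-user restrictions.)
I've managed to write all the regex filters that extract what I need, but I can't work out how to build the structure. Any ideas?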
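Answer 0 (score: 3)
In theory you could have pyparsing build the whole structure using parse actions, but if you simply name the individual fields as I have done below, building the structure isn't too bad. And if you want to convert to using REs, this example should give you a first look at how things would be laid out: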
source = """\
'01bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'01bpar( 3)= 0.00000000E+00',
'02epar( 1)= 0.49998963E+02',
'02epar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'02epar( 3)= 0.00000000E+00',
'02epar( 4)= 0.17862340E-01 half_life= 0.3880495E+02 relax_time= 0.5598371E+02',
'02bpar( 1)= 0.49998962E+02',
'02bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02', """
from pyparsing import Literal, Regex, Word, alphas, nums, oneOf, OneOrMore, quotedString, removeQuotes
EQ = Literal('=').suppress()
scinotationnum = Regex(r'\d\.\d+E[+-]\d+')
dataname = Word(alphas+'_')
key = Word(nums,exact=2) + oneOf("bpar epar")
index = '(' + Word(nums) + ')'
keyedValue = dataname + EQ + scinotationnum
# define an item in the source - suppress values with keys, just want the unkeyed ones
item = key('key') + index + EQ + OneOrMore(keyedValue.suppress() | scinotationnum)('data')
# initialize summary structure
from collections import defaultdict
results = defaultdict(lambda : {'epar':[], 'bpar':[]})
# extract quoted strings from list
quotedString.setParseAction(removeQuotes)
for raw in quotedString.searchString(source):
    parts = item.parseString(raw[0])
    num, par = parts.key
    results[num][par].extend(parts.data)
# dump out results, or do whatever
from pprint import pprint
pprint(dict(results))
Prints:
{'01': {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []},
 '02': {'bpar': ['0.49998962E+02', '0.23103878E-01'],
        'epar': ['0.49998963E+02',
                 '0.23103878E-01',
                 '0.00000000E+00',
                 '0.17862340E-01']}}
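Note that the collected values are still strings. If you need numbers, a minimal sketch of a follow-up conversion step might look like the code below. The `to_floats` helper is a hypothetical name of my own, not part of pyparsing, and the `results` literal is just the output shown above copied in so the snippet runs on its own. It relies only on the built-in `float()`, which accepts this E-notation.

```python
# Hypothetical post-processing step: convert the collected strings to floats.
# The literal below is the structure printed above, copied in for illustration.
results = {
    '01': {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []},
    '02': {'bpar': ['0.49998962E+02', '0.23103878E-01'],
           'epar': ['0.49998963E+02', '0.23103878E-01',
                    '0.00000000E+00', '0.17862340E-01']},
}


def to_floats(nested):
    """Return a copy of the num -> field -> [str] mapping with float values."""
    return {
        num: {field: [float(v) for v in values]
              for field, values in fields.items()}
        for num, fields in nested.items()
    }


numeric = to_floats(results)
print(numeric['02']['epar'][0])
```

The original `results` is left untouched, so you can keep the exact text from the file alongside the numeric copy.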
Answer 1 (score: 1)
Consider using a dict of dicts.
#!/usr/bin/env python
import re
import pprint
raw = """'01bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'01bpar( 3)= 0.00000000E+00',
'02epar( 1)= 0.49998963E+02',
'02epar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'02epar( 3)= 0.00000000E+00',
'02epar( 4)= 0.17862340E-01 half_life= 0.3880495E+02 relax_time= 0.5598371E+02',
'02bpar( 1)= 0.49998962E+02',
'02bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',"""
datastruct = {}
pattern = re.compile(r"""\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)""")
for line in raw.splitlines():
    result = pattern.search(line)
    parts = result.groupdict()
    if parts['digits'] not in datastruct:
        datastruct[parts['digits']] = {'epar': [], 'bpar': []}
    datastruct[parts['digits']][parts['field']].append(parts['number'])
pprint.pprint(datastruct, depth=4)
Produces:
{'01': {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []},
 '02': {'bpar': ['0.49998962E+02', '0.23103878E-01'],
        'epar': ['0.49998963E+02',
                 '0.23103878E-01',
                 '0.00000000E+00',
                 '0.17862340E-01']}}
Revised version, based on the comments:
from collections import defaultdict
pattern = re.compile(r"""\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)""")
# every new key starts out with an empty list for each field
default = lambda: {'epar': [], 'bpar': []}
datastruct = defaultdict(default)
for line in raw.splitlines():
    result = pattern.search(line)
    parts = result.groupdict()
    datastruct[parts['digits']][parts['field']].append(parts['number'])
pprint.pprint(list(datastruct.items()))
Produces (the key order may vary with the Python version):
[('02',
  {'bpar': ['0.49998962E+02', '0.23103878E-01'],
   'epar': ['0.49998963E+02',
            '0.23103878E-01',
            '0.00000000E+00',
            '0.17862340E-01']}),
 ('01', {'bpar': ['0.23103878E-01', '0.00000000E+00'], 'epar': []})]
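One caveat with the loops above: `pattern.search(line)` returns `None` for any line that does not match, for example a blank line or a header, and calling `.groupdict()` on `None` raises an `AttributeError`. A minimal sketch of a guarded variant follows. It assumes that non-matching lines can simply be skipped, and the `build` function name and the `sample` lines are mine, made up for illustration.

```python
import re
from collections import defaultdict

# Same regex as above: numeric prefix, field name, first number after '='.
pattern = re.compile(
    r"\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)"
)


def build(lines):
    """Group the first value on each matching line by prefix and field."""
    datastruct = defaultdict(lambda: {'epar': [], 'bpar': []})
    for line in lines:
        result = pattern.search(line)
        if result is None:
            # Assumption: lines that do not match carry no data and are skipped.
            continue
        parts = result.groupdict()
        datastruct[parts['digits']][parts['field']].append(parts['number'])
    return dict(datastruct)


# Illustrative input with two lines that do not match the pattern.
sample = [
    "'01bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',",
    "",
    "# a comment line that does not match",
    "'02epar( 1)= 0.49998963E+02',",
]
print(build(sample))
```

If silently skipping lines is too lenient for your file, you could raise or log inside the `if` instead.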
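Answer 2 (score: 0)
Your top-level structure is positional, so a list is a perfect fit for it. Since a list can hold arbitrary items, a named tuple works well for each entry. Each field of the tuple can then hold a list containing the values for that element.
So your code should look something like this pseudocode: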
from collections import namedtuple
data = []
newTuple = namedtuple('stuff', ['epar', 'bpar'])
for line in theFile:
    # placeholders: plug in your own regexes here
    eparVals = regexToGetThemFromString()
    bparVals = regexToGetThemFromString()
    t = newTuple(eparVals, bparVals)
    data.append(t)
You said you can already iterate over the file and have the various regexes to pull out the data, so I didn't bother adding all the details.
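For what it's worth, here is one way the pseudocode could be fleshed out into something runnable. This is only a sketch under a few assumptions of mine: the regex is borrowed from the other answer rather than being your own filters, lines are grouped by their two-digit prefix so that each prefix becomes one list entry, and the names `Stuff` and `build_list` are hypothetical.

```python
import re
from collections import namedtuple

# One entry per two-digit prefix; each field holds a list of value strings.
Stuff = namedtuple('Stuff', ['epar', 'bpar'])

# Assumption: regex borrowed from the other answer, not the asker's own filters.
pattern = re.compile(
    r"\D(?P<digits>\d+)(?P<field>[eb]par)[^=]+=\D+(?P<number>\d+\.\d+E[+-]\d+)"
)


def build_list(lines):
    """Return a list of Stuff tuples ordered by their two-digit prefix."""
    grouped = {}
    for line in lines:
        m = pattern.search(line)
        if m is None:
            continue  # skip lines that carry no data
        entry = grouped.setdefault(m.group('digits'), Stuff(epar=[], bpar=[]))
        # The tuple itself is immutable, but the lists inside it can grow.
        getattr(entry, m.group('field')).append(m.group('number'))
    return [grouped[key] for key in sorted(grouped)]


raw = """'01bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'01bpar( 3)= 0.00000000E+00',
'02epar( 1)= 0.49998963E+02',
'02epar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',
'02epar( 3)= 0.00000000E+00',
'02epar( 4)= 0.17862340E-01 half_life= 0.3880495E+02 relax_time= 0.5598371E+02',
'02bpar( 1)= 0.49998962E+02',
'02bpar( 2)= 0.23103878E-01 half_life= 0.3000133E+02 relax_time= 0.4328278E+02',"""

data = build_list(raw.splitlines())
print(data[1].epar)
```

With this layout `data[0]` holds the values for prefix `01` and `data[1]` those for `02`, so the list position stands in for the prefix. If the prefixes in your file are not contiguous, a dict keyed by prefix is probably the safer choice.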