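I have a set of 500-600 files that I want to search and extract data from. I am trying to use pyparsing, with very limited success. The files contain only three kinds of content: (1) comments, (2) simple assignments, and (3) nested assignments. The nesting goes about 6 levels deep.
My goal is to look at a particular third-level field and, if it has a specific value, extract the value of another third-level field that belongs to the same second-level field.
First, am I using an appropriate tool? If not, are there other suggestions?
I know how to build the list of files and iterate over them. Let me show a sample file and then the code I am trying.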
# TOP_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TOP_OBJECT=
(
    obj_fmt=
    (
        obj_name="foo"
        obj_cre_date=737785182 # = Tue May 18 23:19:42 1993
        opj_data=
        (
            a="continue"
            b="quit"
        )
        obj_version=264192 # = Version 4.8.0
    )
    # LEVEL1_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    LEVEL1_OBJECT=
    (
        OBJ_part=
        (
            obj_type=1005
            obj_size=120
        )
        # LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        LEVEL2_OBJECT_A=
        (
            OBJ_part=
            (
                obj_type=3001
                obj_size=128
            )
            Another_part=
            (
                another_attr=
                (
                    another_style=0
                    another_param=2
                )
            )
        ) ### End of LEVEL2_OBJECT_A ###
        # LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        LEVEL2_OBJECT_B=
        (
            OBJ_part=
            (
                obj_type=3005
                obj_size=128
            )
            Another_part=
            (
                another_attr=
                (
                    another_style=0
                    another_param=8
                )
            )
        ) ### End of LEVEL2_OBJECT_B ###
    ) ### End of LEVEL1 OBJECT
) ### End of TOP_OBJECT ###
My code to digest the file is as follows:
from pyparsing import *

def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal( '(' ).suppress()
    rpar = Literal( ')' ).suppress()
    num = Word(nums)
    var = Word(alphas + "_")
    simpleAssign = var + eq
    nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    expr = Forward()
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr

def main():
    expr = Syntax()
    results = expr.parseFile( "for_show.asc" )
    print(results)

if __name__ == '__main__':
    main()
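My results do not descend into the nesting: ['TOP_OBJECT', '=']
For now I am not handling quoted strings or numbers; I am only trying to understand how to parse nested lists.
Answer 0 (score: 1)
Mostly there were just a few gaps in your parser: var has to accept digits after the first character, simpleAssign needs a value on the right-hand side, and nestedAssign has to recurse through expr. Compare the commented-out original lines with the current code: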
def Syntax():
    comment = Group("#" + restOfLine).suppress()
    eq = Literal('=')
    lpar = Literal( '(' ).suppress()
    rpar = Literal( ')' ).suppress()
    num = Word(nums)
    #~ var = Word(alphas + "_")
    var = Word(alphas + "_", alphanums+"_")
    #~ simpleAssign = var + eq
    expr = Forward()
    simpleAssign = var + eq + (num | quotedString)
    #~ nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar)
    nestedAssign = var + eq + Group(lpar + OneOrMore(expr) + rpar)
    atom = nestedAssign | simpleAssign
    expr << atom
    expr.ignore(comment)
    return expr
This gives:
['TOP_OBJECT',
'=',
['obj_fmt',
'=',
['obj_name',
'=',
'"foo"',
'obj_cre_date',
'=',
'737785182',
'opj_data',
'=',
['a', '=', '"continue"', 'b', '=', '"quit"'],
'obj_version',
'=',
'264192'],
'LEVEL1_OBJECT',
'=',
['OBJ_part',
'=',
['obj_type', '=', '1005', 'obj_size', '=', '120'],
'LEVEL2_OBJECT_A',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3001', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '2']]],
'LEVEL2_OBJECT_B',
'=',
['OBJ_part',
'=',
['obj_type', '=', '3005', 'obj_size', '=', '128'],
'Another_part',
'=',
['another_attr',
'=',
['another_style', '=', '0', 'another_param', '=', '8']]]]]]
If you wrap expr in a Group inside the OneOrMore of nestedAssign, as in

    nestedAssign = var + eq + Group(lpar + OneOrMore(Group(expr)) + rpar)

then I think you will get a better structure for your repeated nested assignments:
['TOP_OBJECT',
'=',
[['obj_fmt',
'=',
[['obj_name', '=', '"foo"'],
['obj_cre_date', '=', '737785182'],
['opj_data', '=', [['a', '=', '"continue"'], ['b', '=', '"quit"']]],
['obj_version', '=', '264192']]],
['LEVEL1_OBJECT',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
['LEVEL2_OBJECT_A',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3001'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '2']]]]]]],
['LEVEL2_OBJECT_B',
'=',
[['OBJ_part',
'=',
[['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
['Another_part',
'=',
[['another_attr',
'=',
[['another_style', '=', '0'], ['another_param', '=', '8']]]]]]]]]]]
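With every assignment in its own group, the lookup described in the question (test one field, then read a sibling field under the same parent) becomes a short recursive walk. The following is only a sketch under some assumptions: `to_dict` and `sibling_values` are hypothetical helper names, not part of pyparsing; field names are assumed to be unique within a block (duplicates would overwrite each other in the dict); and the input is a hand-copied fragment of the grouped output above in plain-list form, which is what `results.asList()` should give you from a real parse.

```python
def to_dict(assignments):
    """Turn a list of [name, '=', value] groups into a nested dict."""
    result = {}
    for name, _eq, value in assignments:
        # A list value is a nested block; anything else is a leaf string.
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result


def sibling_values(tree, test_key, test_value, want_key, path=()):
    """Collect (path, value of want_key) for every block whose test_key
    equals test_value."""
    found = []
    if tree.get(test_key) == test_value and want_key in tree:
        found.append((path, tree[want_key]))
    for name, value in tree.items():
        if isinstance(value, dict):
            found.extend(sibling_values(value, test_key, test_value,
                                        want_key, path + (name,)))
    return found


# Fragment of the grouped output, trimmed to the LEVEL1_OBJECT branch.
parsed = ['LEVEL1_OBJECT', '=', [
    ['OBJ_part', '=', [['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
    ['LEVEL2_OBJECT_A', '=', [
        ['OBJ_part', '=', [['obj_type', '=', '3001'], ['obj_size', '=', '128']]],
    ]],
    ['LEVEL2_OBJECT_B', '=', [
        ['OBJ_part', '=', [['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
    ]],
]]

# The outermost assignment is not wrapped in a group, so wrap it once here.
tree = to_dict([parsed])

# All parsed values are strings, so compare against '3005', not 3005.
hits = sibling_values(tree, 'obj_type', '3005', 'obj_size')
print(hits)
```

Each hit carries the path of enclosing names, so you can tell which LEVEL2 object a value came from.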
Also, the code you originally posted contained TABs. I find them more trouble than they are worth; it is better to use 4-space indentation.
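As for whether another tool would do: pyparsing is a reasonable fit, but a format this small can also be handled by a hand-written recursive-descent parser using only the standard library. Here is a minimal sketch, not a drop-in replacement. It assumes values are only unsigned integers or double-quoted strings without escapes, that names are unique within a block, and that any character the token pattern does not recognise can be silently skipped. `TOKEN`, `tokenize`, `parse_block` and `parse` are names made up for this example.

```python
import re

TOKEN = re.compile(r'''
    \#[^\n]*                   # comment to end of line (no named group)
  | (?P<str>"[^"\n]*")         # double-quoted string, no escapes
  | (?P<num>\d+)               # unsigned integer
  | (?P<name>[A-Za-z_]\w*)     # identifier
  | (?P<punct>[=()])           # the three punctuation tokens
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, text) pairs, dropping comments and whitespace."""
    for match in TOKEN.finditer(text):
        if match.lastgroup:  # comments match no named group
            yield match.lastgroup, match.group()


def parse_block(tokens, pos=0):
    """Parse name=value items starting at pos; return (dict, next_pos)."""
    block = {}
    while pos < len(tokens) and tokens[pos] != ('punct', ')'):
        if (pos + 2 >= len(tokens) or tokens[pos][0] != 'name'
                or tokens[pos + 1] != ('punct', '=')):
            raise ValueError('expected name=value at token %d' % pos)
        name = tokens[pos][1]
        kind, text = tokens[pos + 2]
        if (kind, text) == ('punct', '('):
            # Nested assignment: recurse, then step over the closing ')'.
            value, pos = parse_block(tokens, pos + 3)
            if pos >= len(tokens):
                raise ValueError('missing ")" for %s' % name)
            pos += 1
        elif kind in ('num', 'str'):
            value, pos = text, pos + 3
        else:
            raise ValueError('unexpected value %r for %s' % (text, name))
        block[name] = value
    return block, pos


def parse(text):
    """Parse a whole file's text into nested dicts of strings."""
    tokens = list(tokenize(text))
    tree, pos = parse_block(tokens)
    if pos != len(tokens):
        raise ValueError('unbalanced ")" at token %d' % pos)
    return tree


sample = '''
# LEVEL2_OBJECT ~~~~
LEVEL2_OBJECT_B=
(
    OBJ_part=
    (
        obj_type=3005
        obj_size=128   # trailing comment
    )
    obj_name="foo"
) ### End of LEVEL2_OBJECT_B ###
'''
tree = parse(sample)
print(tree)
```

The trade-off is that you maintain the tokenizer and error handling yourself, whereas the pyparsing grammar stays close to a description of the format.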