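I have a very large configuration file composed of blocks delimited by #start <some-name> ... #end <some-name>.
The some-name must be the same at the start and end of a block. A block can appear multiple times but is never contained within itself. Some blocks may only appear inside certain other blocks. I am not interested in these containing blocks, only in the blocks at the second level.
In the real file the names do not start with blockX; they are very different from one another.
An example: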
#start block1
#start block2
/* string but no more name2 or name1 in here */
#end block2
#start block3
/* configuration data */
#end block3
#end block1
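This used to be parsed with regular expressions, which is quite fast when running without a debugger attached: 0.23 s for a 2.7 MB file, with simple rules such as: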
blocks2 = re.findall(r'#start block2\s+(.*?)#end block2', contents, re.DOTALL)
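A note on the rule above: in Python's `re` module, `.` does not match a newline unless `re.DOTALL` is passed, so a multi-line block body is only captured with that flag. The following is a minimal sketch run against the small example from the question; the variable names `sample`, `single_line`, and `multi_line` are made up for illustration.

```python
import re

# The small example from the question, as a string.
sample = """#start block1
#start block2
/* string but no more name2 or name1 in here */
#end block2
#start block3
/* configuration data */
#end block3
#end block1
"""

pattern = r'#start block2\s+(.*?)#end block2'

# Without re.DOTALL, "." stops at the newline before "#end block2".
single_line = re.findall(pattern, sample)

# With re.DOTALL, the lazy group can span lines up to the first "#end block2".
multi_line = re.findall(pattern, sample, re.DOTALL)

print(single_line)
print(multi_line)
```

The captured body keeps its trailing newline; call `.strip()` on it if that matters.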
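I tried parsing it with pyparsing, but it is very slow even without a debugger attached: 16 s for the same file.
My approach was to write pyparsing code that mimics the simple parsing done by the regex, so that I can keep using some of the other code for now and avoid having to parse every block right away. The grammar is quite extensive.
Here is what I tried: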
block = [Group(Keyword(x) + SkipTo(Keyword('#end') + Keyword(x)) + Keyword('#end') - x )(x + '*') for x in ['block3', 'block4', 'block5', 'block6', 'block7', 'block8']]
blocks = Keyword('#start') + block
x = OneOrMore(blocks).searchString(contents) # I also tried parseString() but the results were similar.
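What am I doing wrong? How can I optimize this to get anywhere near the speed of the regex implementation?
Edit: The previous example was too simple compared to the real data, so I have now created a proper one: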
/* all comments are C comments */
VERSION 1 0
#start PROJECT project_name "what is it about"
/* why not another comment here too! */
#start SECTION where_the_wild_things_are "explain this section"
/* I need all sections at this level */
/* In the real data there are about 10k of such blocks.
There are around 10 different names (types) of blocks */
#start INTERFACE_SPEC
There can be anything in the section. Not Really but i want to skip anything until the matching (hash)end.
/* can also have comments */
#end INTERFACE_SPEC
#start some_other_section
name 'section name'
#start with_inner_section
number_of_points 3 /* can have comments anywhere */
#end with_inner_section
#end some_other_section /* basically comments can be anywhere */
#start some_other_section
name 'section name'
other_section_attribute X
ref_to_section another_section
#end some_other_section
#start another_section
degrees
#start section_i_do_not_care_about_at_the_moment
ref_to some_other_section
/* of course can have comments */
#end section_i_do_not_care_about_at_the_moment
#end another_section
#end SECTION
#end PROJECT
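For this I had to extend your original suggestion. I hard-coded the two outer blocks (PROJECT and SECTION) because they must exist.
With this version the time is still ~16 s: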
def test_parse(f):
    import pyparsing as pp
    import io
    comment = pp.cStyleComment
    start = pp.Literal("#start")
    end = pp.Literal("#end")
    ident = pp.Word(pp.alphas + "_", pp.printables)
    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)
    version = pp.Literal('VERSION') - pp.Word(pp.nums)('major_version') - pp.Word(pp.nums)('minor_version')
    project = pp.Keyword('#start') - pp.Keyword('PROJECT') - pp.Word(pp.alphas + "_", pp.printables)(
        'project_name') - pp.dblQuotedString + pp.ZeroOrMore(comment) - \
        pp.Keyword('#start') - pp.Keyword('SECTION') - pp.Word(pp.alphas, pp.printables)(
        'section_name') - pp.dblQuotedString + pp.ZeroOrMore(comment) - \
        pp.OneOrMore(inner_block) + \
        pp.Keyword('#end') - pp.Keyword('SECTION') + \
        pp.ZeroOrMore(comment) - pp.Keyword('#end') - pp.Keyword('PROJECT')
    grammar = pp.ZeroOrMore(comment) - version.ignore(comment) - project.ignore(comment)
    with io.open(f) as ff:
        return grammar.parseString(ff.read())
Edit: Fixed a typo. I had written 2k, but it is a 2.7 MB file.
Answer 0 (score: 1)
First of all, this code as posted does not work for me:
blocks = Keyword('#start') + block
Changing it to:
blocks = Keyword('#start') + MatchFirst(block)
at least runs against your sample text.
Rather than hard-coding all the keywords, you can try using one of pyparsing's adaptive expressions, matchPreviousLiteral:
(EDITED)
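To make the MatchFirst fix concrete, here is a minimal sketch run against the small block1/block2/block3 example from the question. It is a simplified variant of the posted expression, not the exact original: the helper `block_expr`, the results name `body`, and the variable names are illustrative assumptions.

```python
import pyparsing as pp

sample = """#start block1
#start block2
/* string but no more name2 or name1 in here */
#end block2
#start block3
/* configuration data */
#end block3
#end block1
"""

start = pp.Keyword("#start")
end = pp.Keyword("#end")

def block_expr(name):
    # Hypothetical helper: "<name> ...body... #end <name>" for one hard-coded name.
    closing = end + pp.Keyword(name)
    return pp.Group(pp.Keyword(name) + pp.SkipTo(closing)("body") + closing)

# A plain Python list cannot be added to a parser element,
# so the alternatives are wrapped in MatchFirst.
blocks = start + pp.MatchFirst([block_expr(n) for n in ("block2", "block3")])

# Each match is ['#start', [name, body, '#end', name]].
found = [(m[1][0], m[1]["body"].strip()) for m in blocks.searchString(sample)]
print(found)
```

Only the names listed in the tuple are searched for, so `#start block1` is passed over and the two second-level blocks are reported.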
def grammar():
    import pyparsing as pp
    comment = pp.cStyleComment
    start = pp.Keyword("#start")
    end = pp.Keyword('#end')
    ident = pp.Word(pp.alphas + "_", pp.printables)
    integer = pp.Word(pp.nums)
    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)
    VERSION, PROJECT, SECTION = map(pp.Keyword, "VERSION PROJECT SECTION".split())
    version = VERSION - pp.Group(integer('major_version') + integer('minor_version'))
    project = (start - PROJECT + ident('project_name') + pp.dblQuotedString
               + start + SECTION + ident('section_name') + pp.dblQuotedString
               + pp.OneOrMore(inner_block)('blocks')
               + end + SECTION
               + end + PROJECT)
    grammar = version + project
    grammar.ignore(comment)
    return grammar
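You only need to call ignore() on the topmost expression in your grammar; it propagates down to all internal expressions. Also, once you have called ignore(), it should be unnecessary to sprinkle ZeroOrMore(comment) throughout the grammar.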
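The propagation of ignore() can be checked on a toy grammar. The sketch below is an assumption-laden illustration, not part of the grammar above: the block layout (a name, one integer, a closing name) and the names `block`, `toy_grammar`, and `text` are invented for the demonstration.

```python
import pyparsing as pp

start = pp.Keyword("#start")
end = pp.Keyword("#end")
ident = pp.Word(pp.alphas + "_", pp.alphanums + "_")
integer = pp.Word(pp.nums)

# Toy block: "#start <name> <integer> #end <name>"
block = pp.Group(start + ident("name") + integer("value") + end + ident)
toy_grammar = pp.OneOrMore(block)

# A single ignore() call on the outermost expression.
toy_grammar.ignore(pp.cStyleComment)

text = """
/* leading comment */
#start alpha /* inline comment */ 1 #end alpha
#start beta 2 /* trailing comment */ #end beta
"""

result = toy_grammar.parseString(text)
names = [b["name"] for b in result]
values = [b["value"] for b in result]
print(names, values)
```

The comments sit between tokens deep inside the Group, yet no inner expression mentions them; the one ignore() call on the top-level expression is enough for them to be skipped.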
I parsed a 2 MB input string (containing 10,000 inner blocks) in about 16 seconds, so a 2K file should take only about 1/1000 as long.
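As a baseline for the speed comparison the question asks about, the name-matching idea behind matchPreviousLiteral can also be written as a plain regex with a backreference. This is only a sketch under stated assumptions: the outer PROJECT and SECTION wrappers have already been stripped, no comment contains a stray `#end <name>`, and (as the question states) a block never contains a block of the same name. `BLOCK` and `section_body` are illustrative names.

```python
import re

# Assumed input: the body of one SECTION, outer wrappers already removed.
section_body = """
#start INTERFACE_SPEC
anything here /* can also have comments */
#end INTERFACE_SPEC
#start some_other_section
name 'section name'
#start with_inner_section
number_of_points 3
#end with_inner_section
#end some_other_section /* trailing comment */
#start another_section
degrees
#end another_section
"""

# \1 requires the "#end" name to equal the captured "#start" name;
# (?!\S) keeps a longer name with the same prefix from matching.
BLOCK = re.compile(r'#start\s+(\S+)(.*?)#end\s+\1(?!\S)', re.DOTALL)

blocks = [(m.group(1), m.group(2).strip()) for m in BLOCK.finditer(section_body)]
names = [name for name, _ in blocks]
print(names)
```

Because each match consumes everything up to its own closing line, the nested with_inner_section block stays inside the body of some_other_section and is not reported as a separate block.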