Speeding up parsing of delimited configuration blocks

Date: 2016-07-29 15:32:19

Tags: python pyparsing

I have a very large configuration file consisting of blocks delimited by

#start <some-name> ... #end <some-name>

where <some-name> must be the same for a given block. A block can occur multiple times, but is never nested inside itself. Only certain other blocks may occur inside some blocks. I am not interested in those contained blocks, only in the blocks at the second level.

In the real file the names do not start with blockX; they are quite different from one another.

An example:

#start block1

  #start block2

    /* string but no more name2 or name1 in here */
  #end block2

  #start block3
   /* configuration data */
  #end block3

#end block1

This is currently parsed with regular expressions, and runs very fast when no debugger is attached: 0.23 s for the 2k file, using simple rules like:

blocks2 = re.findall(r'#start block2\s+(.*?)#end block2', contents, re.DOTALL)
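For reference, a self-contained sketch of this regex baseline, with the findAll typo fixed to findall, a raw string, and re.DOTALL added so a block body can span several lines. The sample input is the example from the question:

```python
import re

# The example input from the question.
SAMPLE = """\
#start block1

  #start block2

    /* string but no more name2 or name1 in here */
  #end block2

  #start block3
   /* configuration data */
  #end block3

#end block1
"""

def find_blocks(name, text):
    # re.DOTALL lets '.' match newlines, so the body may span lines;
    # the non-greedy '(.*?)' stops at the first matching '#end <name>'.
    pattern = r'#start %s\s+(.*?)#end %s' % (name, name)
    return re.findall(pattern, text, re.DOTALL)

print(find_blocks('block2', SAMPLE))
```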

I tried parsing it with pyparsing, but even with no debugger attached it is very slow: 16 seconds for the same file.

My approach was to write a pyparsing grammar that mimics the simple regex parsing, so that I can use the rest of my code now and avoid fully parsing each block for the time being. The grammar is very permissive.

Here is what I tried:

block = [Group(Keyword(x) + SkipTo(Keyword('#end') + Keyword(x)) + Keyword('#end') - x)(x + '*')
         for x in ['block3', 'block4', 'block5', 'block6', 'block7', 'block8']]

blocks = Keyword('#start') + block

x = OneOrMore(blocks).searchString(contents)  # I also tried parseString() but the results were similar.

What am I doing wrong? How can I optimize this to get anywhere near the speed of the regex implementation?

Edit: the previous example was simple compared to the real data, so I have now created a proper one:

/* all comments are C comments */
VERSION 1 0
#start PROJECT project_name "what is it about"
    /* why not another comment here too! */
    #start SECTION where_the_wild_things_are "explain this section"


        /* I need all sections at this level */

        /* In the real data there are about 10k of such blocks.
           There are around 10 different names (types) of blocks */


        #start INTERFACE_SPEC
         There can be anything in the section. Not Really but i want to skip anything until the matching (hash)end.
         /* can also have comments */

        #end INTERFACE_SPEC

        #start some_other_section
            name 'section name'

            #start with_inner_section
              number_of_points 3 /* can have comments anywhere */
            #end with_inner_section
        #end some_other_section /* basically comments can be anywhere */

        #start some_other_section
            name 'section name'
            other_section_attribute X
            ref_to_section another_section
        #end some_other_section

        #start another_section
            degrees
            #start section_i_do_not_care_about_at_the_moment
                ref_to some_other_section
                /* of course can have comments */
            #end section_i_do_not_care_about_at_the_moment
        #end another_section

    #end SECTION
#end PROJECT

For this I had to extend your original suggestion. I hardcoded the two outer blocks (PROJECT and SECTION) because they must be present.

With this version the time is still ~16 s:

def test_parse(f):
    import io
    import pyparsing as pp

    comment = pp.cStyleComment

    start = pp.Literal("#start")
    end = pp.Literal("#end")
    ident = pp.Word(pp.alphas + "_", pp.printables)

    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)

    version = pp.Literal('VERSION') - pp.Word(pp.nums)('major_version') - pp.Word(pp.nums)('minor_version')

    project = (pp.Keyword('#start') - pp.Keyword('PROJECT')
               - pp.Word(pp.alphas + "_", pp.printables)('project_name')
               - pp.dblQuotedString + pp.ZeroOrMore(comment)
               - pp.Keyword('#start') - pp.Keyword('SECTION')
               - pp.Word(pp.alphas, pp.printables)('section_name')
               - pp.dblQuotedString + pp.ZeroOrMore(comment)
               - pp.OneOrMore(inner_block)
               + pp.Keyword('#end') - pp.Keyword('SECTION')
               + pp.ZeroOrMore(comment) - pp.Keyword('#end') - pp.Keyword('PROJECT'))

    grammar = pp.ZeroOrMore(comment) - version.ignore(comment) - project.ignore(comment)

    with io.open(f) as ff:
        return grammar.parseString(ff.read())
Edit: typo above - I wrote 2k, but it is a 2.7 MB file.

1 Answer

Answer (score: 1):

First of all, this code as posted does not work for me:

blocks = Keyword('#start') + block

Change it to:

blocks = Keyword('#start') + MatchFirst(block)

at least to run against your sample text.
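A minimal self-contained version of that corrected snippet, run against the first sample from the question, might look like this (only block2 and block3 are listed, since those are the names that actually occur; the trailing `- x` of the original is spelled out as `+ pp.Keyword(x)` here, an assumed simplification for symmetry):

```python
import pyparsing as pp

# The first example input from the question.
SAMPLE = """\
#start block1

  #start block2

    /* string but no more name2 or name1 in here */
  #end block2

  #start block3
   /* configuration data */
  #end block3

#end block1
"""

names = ['block2', 'block3']  # only the inner block names we care about
block = [
    pp.Group(pp.Keyword(x)
             + pp.SkipTo(pp.Keyword('#end') + pp.Keyword(x))
             + pp.Keyword('#end') + pp.Keyword(x))(x + '*')  # '*' collects all matches
    for x in names
]
blocks = pp.Keyword('#start') + pp.MatchFirst(block)

# searchString skips over text (like '#start block1') that does not match.
result = pp.OneOrMore(blocks).searchString(SAMPLE)
print(result[0]['block2'])
```

The outer `#start block1` is never matched (block1 is not in the name list), so the scan lands directly on the second-level blocks, which is the behavior the question asks for.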

Instead of hardcoding all the keywords, you can try using one of pyparsing's adaptive expressions, matchPreviousLiteral:

(EDITED to work with your updated example)

def grammar():
    import pyparsing as pp
    comment = pp.cStyleComment

    start = pp.Keyword("#start")
    end = pp.Keyword('#end')
    ident = pp.Word(pp.alphas + "_", pp.printables)
    integer = pp.Word(pp.nums)

    inner_ident = ident.copy()
    inner_start = start + inner_ident
    inner_end = end + pp.matchPreviousLiteral(inner_ident)
    inner_block = pp.Group(inner_start + pp.SkipTo(inner_end) + inner_end)

    VERSION, PROJECT, SECTION = map(pp.Keyword, "VERSION PROJECT SECTION".split())

    version = VERSION - pp.Group(integer('major_version') + integer('minor_version'))

    project = (start - PROJECT + ident('project_name') + pp.dblQuotedString
               + start + SECTION + ident('section_name') + pp.dblQuotedString
               + pp.OneOrMore(inner_block)('blocks')
               + end + SECTION
               + end + PROJECT)

    grammar = version + project
    grammar.ignore(comment)

    return grammar

You only need to call ignore() on the topmost expression in your grammar - it will propagate down to all the inner expressions. Also, there is no need to sprinkle ZeroOrMore(comment) through the grammar if you have already called ignore().
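As a small illustration of that propagation (a toy grammar, not the one above): a single ignore() call on the outermost expression is enough for comments to be skipped between any two tokens.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)
pair = pp.Group(word + word)
expr = pp.OneOrMore(pair)

# One ignore() on the outermost expression propagates down to
# 'pair' and 'word'; no ZeroOrMore(comment) needed anywhere.
expr.ignore(pp.cStyleComment)

result = expr.parseString("foo /* skipped */ bar baz qux")
print(result.asList())  # [['foo', 'bar'], ['baz', 'qux']]
```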

I parsed a 2 MB input string (containing 10,000 inner blocks) in about 16 seconds, so a 2K file should take only about 1/1000th as long.
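(Not part of the original answer, but related to pyparsing performance:) pyparsing also offers a general-purpose speedup, packrat memoization, which caches intermediate match results and often helps grammars that backtrack a lot. Whether it helps a particular grammar has to be measured; it must be enabled before the grammar is used:

```python
import pyparsing as pp

# Enable packrat memoization once, before any parsing is done.
pp.ParserElement.enablePackrat()

# Any grammar built/used afterwards benefits from the result cache.
expr = pp.Word(pp.alphas) + pp.Word(pp.nums)
print(expr.parseString("abc 123").asList())  # ['abc', '123']
```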