Editing pyparsing parse results

Time: 2018-07-11 20:14:13

Tags: python parsing pyparsing logfile

This is similar to a question I've asked before.

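I've written a pyparsing grammar, logparser, for a text file that contains multiple logs. A log documents every function call and every function completion. The underlying process is multi-threaded, so it is possible that a slow function A is called, then a fast function B is called and finishes almost immediately, and only after that does function A finish and give us its return value. Because of this, the log file is very difficult to read by hand, since the call information and the return-value information of a single function can be thousands of lines apart.

My parser is able to parse the function calls (called input_blocks from now on) and their return values (called output_blocks from now on). My parse results (logparser.searchString(logfile)) look like this: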

[0]:                            # first log
  - input_blocks:
    [0]:
      - func_name: 'Foo'
      - parameters: ...
      - thread: '123'
      - timestamp_in: '12:01'
    [1]:
      - func_name: 'Bar'
      - parameters: ...
      - thread: '456'
      - timestamp_in: '12:02'
  - output_blocks:
    [0]:
      - func_name: 'Bar'
      - func_time: '1'
      - parameters: ...
      - thread: '456'
      - timestamp_out: '12:03'
    [1]:
      - func_name: 'Foo'
      - func_time: '3'
      - parameters: ...
      - thread: '123'
      - timestamp_out: '12:04'
[1]:                            # second log
    - input_blocks:
    ...

    - output_blocks:
    ...
...                             # n-th log

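I want to solve the problem of the input and output information of one function call being separated. So I want to put an input_block and the corresponding output_block into a function_block. My final parse results should look like this: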

[0]:                            # first log
  - function_blocks:
    [0]:
        - input_block:
            - func_name: 'Foo'
            - parameters: ...
            - thread: '123'
            - timestamp_in: '12:01'
        - output_block:
            - func_name: 'Foo'
            - func_time: '3'
            - parameters: ...
            - thread: '123'
            - timestamp_out: '12:04'
    [1]:
        - input_block:
            - func_name: 'Bar'
            - parameters: ...
            - thread: '456'
            - timestamp_in: '12:02'
        - output_block:
            - func_name: 'Bar'
            - func_time: '1'
            - parameters: ...
            - thread: '456'
            - timestamp_out: '12:03'
[1]:                            # second log
    - function_blocks:
    [0]: ...
    [1]: ...
...                             # n-th log

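To achieve this, I defined a function rearrange that iterates through input_blocks and output_blocks and checks whether func_name, thread, and the timestamps match. Moving the matching blocks into one function_block, however, is the part I'm missing. I then set this function as the parse action for the log grammar: logparser.setParseAction(rearrange)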

def rearrange(log_token):
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                # modify log_token
                pass  # placeholder: this is the part I am missing
    return log_token
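
The question never shows check_timestamp, so here is a minimal sketch of what such a helper might look like. It assumes (and this is only a guess based on the sample data above) that timestamps are 'HH:MM' strings on the same day and that func_time is a whole number of minutes, so that timestamp_out minus func_time should equal timestamp_in.

```python
from datetime import datetime, timedelta


def check_timestamp(timestamp_out, func_time, timestamp_in):
    """Return True if timestamp_out minus func_time equals timestamp_in.

    Assumptions (not stated in the question): timestamps are 'HH:MM'
    strings on the same day, and func_time is a whole number of minutes.
    """
    fmt = '%H:%M'
    t_out = datetime.strptime(timestamp_out, fmt)
    t_in = datetime.strptime(timestamp_in, fmt)
    return t_out - timedelta(minutes=int(func_time)) == t_in


# Values taken from the 'Foo' blocks in the sample parse results
print(check_timestamp('12:04', '3', '12:01'))  # True
# Mismatched pair: 'Bar' output against the 'Foo' input
print(check_timestamp('12:03', '1', '12:01'))  # False
```

A real log would probably need a finer time resolution and some tolerance, so treat the exact-equality comparison as a placeholder.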

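My question is: how do I put the matching output_block and input_block into a function_block in such a way that I still enjoy the convenient access methods of pyparsing.ParseResults?

My idea was as follows: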

def rearrange(log_token):
    # define a new ParseResults object in which I store matching input & output blocks
    function_blocks = pp.ParseResults(name='function_blocks')

    # find matching blocks
    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if (output_block.func_name == input_block.func_name
                and output_block.thread == input_block.thread
                and check_timestamp(output_block.timestamp_out,
                                    output_block.func_time,
                                    input_block.timestamp_in)):
                # output_block and input_block match -> put them in a function_block
                function_blocks.append(input_block.pop() + output_block.pop())  # this addition causes a maximum recursion error?
    log_token.append(function_blocks)
    return log_token

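But this doesn't work. The addition causes a maximum recursion error, and .pop() doesn't work as expected. It doesn't pop the whole block; it only pops the last entry in that block. It also doesn't actually remove that entry: the entry is removed from the list but is still accessible by its results name.

It's also possible that some input_blocks have no corresponding output_block (for example, if the process crashes before all functions finish). So my parse results should have the attributes input_blocks, output_blocks (for the spare blocks), and function_blocks (for the matched blocks).

Thanks for your help!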
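
To make the desired outcome concrete, here is a rough sketch of the pairing logic on plain dicts rather than ParseResults. The name pair_blocks and the sample 'Baz' block are made up for illustration, and the timestamp check is left out to keep it short. It only shows the bookkeeping for matched and spare blocks, not how to build nested ParseResults.

```python
def pair_blocks(input_blocks, output_blocks):
    """Pair each input block with the first unused output block that has
    the same func_name and thread.

    Returns (pairs, spare_inputs, spare_outputs). Hypothetical helper
    working on plain dicts; the timestamp check is omitted.
    """
    used = set()          # indexes of output blocks already paired
    pairs = []
    spare_inputs = []
    for inp in input_blocks:
        for idx, out in enumerate(output_blocks):
            if idx in used:
                continue
            if (out['func_name'] == inp['func_name']
                    and out['thread'] == inp['thread']):
                pairs.append({'input_block': inp, 'output_block': out})
                used.add(idx)
                break
        else:
            # loop finished without break: no output block matched
            spare_inputs.append(inp)
    spare_outputs = [out for idx, out in enumerate(output_blocks)
                     if idx not in used]
    return pairs, spare_inputs, spare_outputs


inputs = [
    {'func_name': 'Foo', 'thread': '123'},
    {'func_name': 'Bar', 'thread': '456'},
    {'func_name': 'Baz', 'thread': '789'},  # never finishes (made-up example)
]
outputs = [
    {'func_name': 'Bar', 'thread': '456'},
    {'func_name': 'Foo', 'thread': '123'},
]

pairs, spare_inputs, spare_outputs = pair_blocks(inputs, outputs)
print([p['input_block']['func_name'] for p in pairs])  # ['Foo', 'Bar']
print(spare_inputs)   # [{'func_name': 'Baz', 'thread': '789'}]
print(spare_outputs)  # []
```

Tracking used output indexes avoids mutating the lists while iterating over them, which is one way to sidestep the .pop() trouble described above.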

Edit:

I've made a simple example to demonstrate my problem. I also experimented and found a solution that works, but it is a bit messy. I must admit it involved a lot of trial and error, because I neither found documentation on ParseResults nor could I make sense of its inner workings and how to properly create my own nested ParseResults structure.

from pyparsing import *


def main():
    log_data = '''\
Func1_in
Func2_in
Func2_out
Func1_out
Func3_in'''

    ParserElement.inlineLiteralsUsing(Suppress)
    input_block = Group(Word(alphanums)('func_name') + '_in').setResultsName('input_blocks', listAllMatches=True)
    output_block = Group(Word(alphanums)('func_name') + '_out').setResultsName('output_blocks', listAllMatches=True)
    log = OneOrMore(input_block | output_block)

    parse_results = log.parseString(log_data)
    print('***** before rearranging *****')
    print(parse_results.dump())
    parse_results = rearrange(parse_results)
    print('***** after rearranging *****')
    print(parse_results.dump())


def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        for output_block in log_token.output_blocks:
            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and delete them from their original positions in log_token

                # I have to do both __setitem__ and .append so it shows up in the dict and in the list
                # and .copy() is necessary because I delete the original objects later
                tmp_function_block = ParseResults()
                tmp_function_block.__setitem__('input', input_block.copy())
                tmp_function_block.append(input_block.copy())
                tmp_function_block.__setitem__('output', output_block.copy())
                tmp_function_block.append(output_block.copy())
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
                del function_block['input'], function_block['output']  # remove duplicate data
                function_blocks.append(function_block)

                # delete from original position in log_token
                input_block.clear()
                output_block.clear()

    log_token.__setitem__('function_blocks', sum(function_blocks))
    return log_token


if __name__ == '__main__':
    main()

Output:

***** before rearranging *****
[['Func1'], ['Func2'], ['Func2'], ['Func1'], ['Func3']]
- input_blocks: [['Func1'], ['Func2'], ['Func3']]
  [0]:
    ['Func1']
    - func_name: 'Func1'
  [1]:
    ['Func2']
    - func_name: 'Func2'
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [['Func2'], ['Func1']]
  [0]:
    ['Func2']
    - func_name: 'Func2'
  [1]:
    ['Func1']
    - func_name: 'Func1'
***** after rearranging *****
[[], [], [], [], ['Func3']]
- function_blocks: [['Func1'], ['Func1'], ['Func2'], ['Func2'], [], []]   # why is this duplicated? I just want the inner function_blocks!
  - function_blocks: [[['Func1'], ['Func1']], [['Func2'], ['Func2']], [[], []]]
    [0]:
      [['Func1'], ['Func1']]
      - input: ['Func1']
        - func_name: 'Func1'
      - output: ['Func1']
        - func_name: 'Func1'
    [1]:
      [['Func2'], ['Func2']]
      - input: ['Func2']
        - func_name: 'Func2'
      - output: ['Func2']
        - func_name: 'Func2'
    [2]:                              # where does this come from?
      [[], []]
      - input: []
      - output: []
- input_blocks: [[], [], ['Func3']]
  [0]:                                # how do I delete these indexes?
    []                                #  I think I only cleared their contents
  [1]:
    []
  [2]:
    ['Func3']
    - func_name: 'Func3'
- output_blocks: [[], []]
  [0]:
    []
  [1]:
    []


1 Answer:

Answer 0 (score: 1)

This version of rearrange addresses most of the issues I saw in your example:

def rearrange(log_token):
    function_blocks = list()

    for input_block in log_token.input_blocks:
        # look for match among output blocks that have not been cleared
        for output_block in filter(None, log_token.output_blocks):

            if input_block.func_name == output_block.func_name:
                # found two matching blocks! now put them in a function_block
                # and clear them from their original positions in log_token

                # create rearranged block, first with a list of the two blocks
                # instead of append()'ing, just initialize with a list containing
                # the two block copies
                tmp_function_block = ParseResults([input_block.copy(), output_block.copy()])

                # now assign the blocks by name
                # x.__setitem__(key, value) is the same as x[key] = value
                tmp_function_block['input'] = tmp_function_block[0]
                tmp_function_block['output'] = tmp_function_block[1]

                # wrap that all in another ParseResults, as if we had matched a Group
                function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                              modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output

                del function_block['input'], function_block['output']  # remove duplicate name references

                function_blocks.append(function_block)
                # clear blocks in their original positions in log_token, so they won't be matched any more
                input_block.clear()
                output_block.clear()

                # match found, no need to keep going looking for a matching output block 
                break

    # find all input blocks that weren't cleared (i.e. had no matching output block) and append as input-only blocks
    for input_block in filter(None, log_token.input_blocks):
        # no matching output for this input
        tmp_function_block = ParseResults([input_block.copy()])
        tmp_function_block['input'] = tmp_function_block[0]
        function_block = ParseResults(name='function_blocks', toklist=tmp_function_block, asList=True,
                                      modal=False)  # I have no idea what modal and asList do, this was trial-and-error until I got acceptable output
        del function_block['input']  # remove duplicate data
        function_blocks.append(function_block)
        input_block.clear()

    # clean out log_token, and reload with rearranged function blocks
    log_token.clear()
    log_token.extend(function_blocks)
    log_token['function_blocks'] =  sum(function_blocks)

    return log_token

Since this takes the input tokens and returns the rearranged tokens, you can make it a parse action as-is:

    # trailing '*' on the results name is equivalent to listAllMatches=True
    input_block = Group(Word(alphanums)('func_name') + '_in')('input_blocks*')
    output_block = Group(Word(alphanums)('func_name') +'_out')('output_blocks*')
    log = OneOrMore(input_block | output_block)
    log.addParseAction(rearrange)

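Since rearrange updates log_token in place, the final return statement is unnecessary if you make it a parse action.

It is interesting how you were able to update the list in place by clearing the blocks you found matches for. Very clever.

In general, assembling tokens into ParseResults is an internal function, so the docs are light on this topic. I was just looking through the module docs and I don't really see a good home for this topic.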
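
The clear-then-filter trick is easy to see with ordinary lists standing in for ParseResults. This is only an analogy, on the assumption that an emptied block is falsy just as an empty list is: clearing a block empties it in place, and filter(None, ...) then skips it on later passes.

```python
# Plain lists stand in for the grouped blocks (an analogy, not ParseResults)
blocks = [['Func1'], ['Func2'], ['Func3']]

# "Consume" the second block in place; the outer list keeps its slot
blocks[1].clear()

# filter(None, ...) drops falsy items, i.e. the emptied block
remaining = list(filter(None, blocks))

print(blocks)     # [['Func1'], [], ['Func3']]
print(remaining)  # [['Func1'], ['Func3']]
```

The emptied slot is still present in the outer list, which is why the rearranged version above ends by clearing log_token and reloading it.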