pyparsing - performance tips for parallel log processing

Date: 2011-08-29 17:59:59

Tags: multiprocessing pyparsing

I'm using a pool of 2 processes to parse several log files in parallel:

from multiprocessing import Pool

po = Pool(processes=2)
pool_object = po.apply_async(log_parse, (hostgroup_sender_dir, hostname, host_depot_dir,
                                         synced_log, prev_last_pos, get_report_rate))

(curr_last_pos, remote_report_datetime, report_gen_rate) = pool_object.get()

However, the initial run is slow: about 16 minutes for files of roughly 12-20 MB.

It's less of a problem on subsequent iterations, since I only parse the newly appended bytes every 2 or 3 minutes, but there is definitely room for improvement on that first run. Would pre-splitting the log into several smaller slices (so that pyparsing doesn't have to allocate the whole log in memory) speed things up?
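For reference, the incremental runs boil down to remembering the file offset between passes; a minimal sketch (the function and path handling are illustrative, while prev_last_pos/curr_last_pos mirror the variables in the snippet above):

def read_new_bytes(log_path, prev_last_pos):
    # Skip everything parsed on earlier runs and read only the appended bytes.
    with open(log_path) as input_log:
        input_log.seek(prev_last_pos)
        new_data = input_log.read()
        curr_last_pos = input_log.tell()
    return new_data, curr_last_pos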

I'm also running this on a dual-core development VM, but will soon have to migrate to a quad-core physical server (and I'm trying to get an extra quad-core CPU for it), where it may need to manage ~50 logs.

A slice from the log:

log_splice = """
# XX_MAIN     (23143) Report at 2011-08-30 20:00:00.003    Type:  Periodic     #
# Report number 1790                                        State: Active      #
################################################################################
# Running since                  : 2011-08-12 04:40:06.153                     #
# Total execution time           :  18 day(s) 15:19:53.850                     #
# Last report date               : 2011-08-30 19:45:00.002                     #
# Time since last periodic report:   0 day(s) 00:15:00.000                     #
################################################################################
                            ----------------------------------------------------
                            |       Periodic        |          Global          |
----------------------------|-----------------------|--------------------------|
Simultaneous Accesses       |  Curr  Max Cumulative |      Max    Cumulative   |
--------------------------- |  ---- ---- ---------- |     ---- -------------   |
Accesses                    |     1    5          - |      180             -   |
- in start/stop state       |     1    5      12736 |      180      16314223   |
-------------------------------------------------------------------------------|
Accesses per Second         |    Max   Occurr. Date |      Max Occurrence Date |
--------------------------- | ------ -------------- |   ------ --------------- |
Accesses per second         |  21.00 08-30 19:52:33 |    40.04  08-16 20:19:18 |
-------------------------------------------------------------------------------|
Service Statistics          |  Success    Total  %  |   Success      Total  %  |
--------------------------- | -------- -------- --- | --------- ---------- --- |
Services accepted accesses  |    17926    17927  99 |  21635954   21637230 -98 |
- 98: NF                    |     7546     7546 100 |  10992492   10992492 100 |
- 99: XFC                   |    10380    10380 100 |  10643462   10643462 100 |
 ----------------------------------------------------------------------------- |
Services succ. terminations |    12736    12736 100 |  16311566   16314222  99 |
- 98: NF                    |     7547     7547 100 |  10991401   10992492  99 |
- 99: XFC                   |     5189     5189 100 |   5320165    5321730  99 |
 ----------------------------------------------------------------------------- |
""" 

Using pyparsing:

from pyparsing import Combine, SkipTo, Suppress, White, Word, nums

unparsed_log_data = input_log.read()

#------------------------------------------------------------------------
# Define Grammars
#------------------------------------------------------------------------
integer = Word( nums )

# XX_MAIN     ( 4801) Report at 2010-01-25 06:55:00
binary_name = "# XX_MAIN"
pid = "(" + Word(nums) + ")"
report_id = Suppress(binary_name) + Suppress(pid)

# Word as a contiguous set of characters found in the string nums
year = Word(nums, max=4)
month = Word(nums, max=2)
day = Word(nums, max=2)
# 2010-01-25 grammar
yearly_day_bnf = Combine(year + "-" + month + "-" + day)
# 06:55:00. grammar
clock24h_bnf = Combine(Word(nums, max=2) + ":" + Word(nums, max=2) + ":" +
                       Word(nums, max=2) + Suppress("."))
timestamp_bnf = Combine(yearly_day_bnf + White(' ') + clock24h_bnf)("timestamp")

report_bnf = report_id + Suppress("Report at ") + timestamp_bnf

# Service Statistics          |  Success    Total  %  | 
# Services succ. terminations |       40       40 100 |   3494775    3497059  99 |
partial_report_ignore = Suppress(SkipTo("Services succ. terminations", include=True))
succ_term_bnf = Suppress("|") + integer("succTerms") + integer("totalTerms")
terminations_report_bnf = report_bnf + partial_report_ignore + succ_term_bnf

# Apply the BNF to the unparsed data
terms_parsing = terminations_report_bnf.searchString(unparsed_log_data)

1 Answer:

Answer (score: 2):

I would structure the parser around parsing a single log entry. This accomplishes two things:

  1. It breaks the problem up into easily parallelizable chunks
  2. It positions your parser to handle incremental log processing once you have worked through the backlog of log data

Your parallelizing chunk size then becomes a nicely packaged single item, and each process can parse that item separately (assuming you don't need to carry any state or elapsed-time information forward from one log message to the next). A sketch of this approach follows.
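A minimal sketch of that per-entry approach (the banner-based splitting regex and the worker function are illustrative assumptions, not code from this answer; the grammar names come from the question):

    import re
    from multiprocessing import Pool

    # Each report begins with a "# XX_MAIN" banner, so carve the raw text
    # into one string per report (re.S lets '.' span newlines).
    entry_re = re.compile(r"# XX_MAIN.*?(?=# XX_MAIN|\Z)", re.S)

    def parse_entry(entry):
        # One self-contained unit of work per process; no state carries over
        # from one log message to the next.
        tokens = terminations_report_bnf.parseString(entry)
        return (tokens.timestamp, tokens.succTerms, tokens.totalTerms)

    if __name__ == "__main__":
        entries = entry_re.findall(unparsed_log_data)
        po = Pool(processes=2)
        results = po.map(parse_entry, entries)
        po.close()
        po.join()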

EDIT (this question has turned more into a topic on pyparsing tuning...)

I find it works better to define low-level primitives that get built up with Combine(lots+of+expressions+here) using pyparsing Regex expressions instead. This usually applies to expressions like real numbers or timestamps, e.g.:

    # 2010-01-25 grammar
    yearly_day_bnf = Combine(year + "-" + month + "-" + day)
    yearly_day_bnf = Regex(r"\d{4}-\d{2}-\d{2}")
    
    # 06:55:00. grammar
    clock24h_bnf = Combine(Word(nums, max=2) + ":" + 
                           Word(nums, max=2) + ":" + 
                           Word(nums, max=2) + Suppress("."))
    clock24h_bnf = Regex(r"\d{2}:\d{2}:\d{2}\.")
    clock24h_bnf.setParseAction(lambda tokens:tokens[0][:-1])
    
    timestamp_bnf = Combine(yearly_day_bnf + White(' ') + clock24h_bnf)
    timestamp_bnf = Regex(r"\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}:\d{2}")
    

No need to go overboard, though: something like integer=Word(nums) is already generating an RE under the covers.
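If in doubt about where that crossover lies, a quick micro-benchmark settles it; this sketch (the timing harness is mine, and absolute numbers will vary by machine and pyparsing version) compares the Combine-built timestamp against the single-Regex version:

    import timeit
    from pyparsing import Combine, Regex, White, Word, nums

    # Combine-built timestamp, as in the original grammar (minus the
    # trailing-dot handling, to keep the two expressions equivalent).
    date_part = Combine(Word(nums, max=4) + "-" + Word(nums, max=2) + "-" + Word(nums, max=2))
    time_part = Combine(Word(nums, max=2) + ":" + Word(nums, max=2) + ":" + Word(nums, max=2))
    combine_ts = Combine(date_part + White(' ') + time_part)

    # Single-Regex equivalent.
    regex_ts = Regex(r"\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}:\d{2}")

    sample = "2011-08-30 20:00:00"
    for name, expr in [("Combine", combine_ts), ("Regex", regex_ts)]:
        elapsed = timeit.timeit(lambda: expr.parseString(sample), number=10000)
        print(name, elapsed)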

Note that I also took the results name off of timestamp_bnf. I usually leave results names off of the primitive definitions, and add them as I assemble the primitives into higher-level expressions, so that I can reuse the same primitive several times under different names, as in:

    summary = ("Started:" + timestamp_bnf("startTime") + 
               "Ended:" + timestamp_bnf("endTime"))
    

I find this also helps me organize my parsed structures.
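For instance (the input string here is made up for illustration), the named fields then read straight off the parse results:

    from pyparsing import Regex

    timestamp_bnf = Regex(r"\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}:\d{2}")
    summary = ("Started:" + timestamp_bnf("startTime") +
               "Ended:" + timestamp_bnf("endTime"))

    tokens = summary.parseString("Started: 2011-08-12 04:40:06 Ended: 2011-08-30 20:00:00")
    print(tokens.startTime)   # -> 2011-08-12 04:40:06
    print(tokens.endTime)     # -> 2011-08-30 20:00:00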

Moving the results name up to the higher-level expression also lets me give the field a more descriptive name:

    report_bnf = report_id + Suppress("Report at ") + timestamp_bnf("reportTime")
    

Looking at your grammar, you aren't really cracking all of this report information, just extracting the report time from this line:

    # XX_MAIN     (23143) Report at 2011-08-30 20:00:00.003
    

and the 2 integer fields from this line:

    Services succ. terminations |    12736    12736 100 |  16311566   16314222  99 |
    

Try this instead:

    report_bnf = report_id + Suppress("Report at") + timestamp_bnf("reportTime")
    succ_term_bnf = (Suppress("Services succ. terminations") + Suppress("|") + 
                            integer("succTerms") + integer("totalTerms"))
    log_data_sources_bnf = report_bnf | succ_term_bnf
    
    extractLogData = lambda logentry : sum(log_data_sources_bnf.searchString(logentry))
    
    print(extractLogData(log_splice).dump())
    

Pyparsing is always going to be slower than REs, and it may be that the pyparsing parser in your case is just a prototyping stepping stone. I'm sure you won't get 500X performance with a pyparsing parser, and you may just have to use an RE-based solution to process MBs' worth of log files.
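For completeness, an RE-based fallback might look something like this sketch (the patterns are assumptions keyed to the sample slice above, not code from this answer):

    import re

    # Anchor each pattern to distinctive literal text in the report.
    report_time_re = re.compile(r"# XX_MAIN\s+\(\s*\d+\) Report at "
                                r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
    succ_terms_re = re.compile(r"Services succ\. terminations \|\s+(\d+)\s+(\d+)")

    def extract_log_data_re(logentry):
        report_time = report_time_re.search(logentry)
        succ_terms = succ_terms_re.search(logentry)
        return {
            "reportTime": report_time.group(1) if report_time else None,
            "succTerms": int(succ_terms.group(1)) if succ_terms else None,
            "totalTerms": int(succ_terms.group(2)) if succ_terms else None,
        }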