How to filter a defined pattern that is spread across multiple, separated lines

Time: 2016-03-21 05:24:03

Tags: regex bash shell sed

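This is the simple general convention for my logs:

  • A request arrives; log ...[XXXHandler] coming time...
  • The lock is acquired and the transaction starts; log ...[XXXHandler] [ UID ] start time...
  • The business logic completes and the lock is released; log ...[XXXHandler] [ UID ] spend time...

In practice, a large number of requests pour in, each with its own UID, and the three-line patterns end up interleaved with one another. Here is part of such a log: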

~ cat sample.log
[240] [DeleteAllLettersHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [DeleteAllLettersHandler] [13497] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [DeleteAllLettersHandler] [13497] spend time [1] dbs 1 dbu 1 | {}
[240] [StartBiddingAllianceBossAuctionHandler] [1495] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetMazeMainInfoHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [1495] spend time [1] dbs 1 dbu 0 | {}
[240] [GetMazeMainInfoHandler] [8941] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetResHarvestInfoHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetResHarvestInfoHandler] [1807] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [RCHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]     ## gotcha
[240] [GetMazeMainInfoHandler] [8941] spend time [10] dbs 27 dbu 2 | {}
[240] [GetResHarvestInfoHandler] [1807] spend time [5] dbs 15 dbu 4 | {}
[240] [StartBiddingAllianceBossAuctionHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [18052] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [18052] spend time [1] dbs 1 dbu 0 | {}
[240] [GetResourceAmount] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetResourceAmount] [29063] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetResourceAmount] [29063] spend time [1] dbs 3 dbu 0 | {}

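My requirement is to filter the log, removing every complete (but interleaved) three-line pattern, so that I can see which handlers are hanging (a coming time was logged but no start time).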
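As a quick first check of that requirement, the number of "coming" lines can be compared with the number of "start" lines per handler. The following is only a minimal sketch: `hung_summary` is a hypothetical helper name, and it assumes handler names contain no spaces and that every line follows the three shapes listed above.

```shell
#!/bin/sh
# hung_summary (hypothetical helper): read a log on stdin and print each
# handler that has more "coming" lines than "start" lines, with the surplus.
# Assumes whitespace-separated fields: [id] [Handler] ... as in sample.log.
hung_summary() {
    awk '
    # "[240] [Handler] coming time ..."
    $3 == "coming" && $4 == "time" { coming[$2]++ }
    # "[240] [Handler] [UID] start time ..."
    $4 == "start"  && $5 == "time" { started[$2]++ }
    END {
        for (h in coming) {
            n = coming[h] - started[h]
            if (n > 0) print h, n
        }
    }' | sort
}
# Usage: hung_summary < sample.log
```

This only counts; it does not show the surviving log lines themselves, so it complements rather than replaces the filtering below.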

Here is my solution:

~ cat process.sh

sed -r '
    $!N
    $!N
    $!N
    s/(([^\n]*\n)*)[^\n]*\[([^\n]*)\] coming time[^\n]*\n(([^\n]*\n)*)[^\n]*\[\3\] \[([^\n]*)\] start time[^\n]*\n(([^\n]*\n)*)[^\n]*\[\3\] \[\6\] spend time[^\n]*(.*)/\1\4\7\9/
    t print     
    P
    D

    :print
' |

grep -v '^ *$'

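This filters out some of the patterns, but not all of them, because sed can only handle a pattern that is spread over three or four lines (adding more `$!N` rounds to the sed script might catch more).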

~ ./process.sh < sample.log
[240] [StartBiddingAllianceBossAuctionHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [1495] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetMazeMainInfoHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [1495] spend time [1] dbs 1 dbu 0 | {}
[240] [GetMazeMainInfoHandler] [8941] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [RCHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]     ## gotcha
[240] [GetMazeMainInfoHandler] [8941] spend time [10] dbs 27 dbu 2 | {}
[240] [StartBiddingAllianceBossAuctionHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [18052] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [StartBiddingAllianceBossAuctionHandler] [18052] spend time [1] dbs 1 dbu 0 | {}

Using the filtered log as the seed and filtering it again and again, I can get the result I want:

~ ./process.sh < sample.log | ./process.sh
[240] [GetMazeMainInfoHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [GetMazeMainInfoHandler] [8941] start time [Fri Mar 18 05:00:00 GMT-06:00 2016]
[240] [RCHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]     ## gotcha
[240] [GetMazeMainInfoHandler] [8941] spend time [10] dbs 27 dbu 2 | {}

~ ./process.sh < sample.log | ./process.sh | ./process.sh
[240] [RCHandler] coming time [Fri Mar 18 05:00:00 GMT-06:00 2016]     ## gotcha

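It seems I only need to filter a few times to get the final result I need. So I asked another question, "shell pipe process repeat", and @tripleee's answer was very useful to me. After about five rounds of filtering, I can get the final result for each log.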
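The referenced answer is not reproduced here. One possible way to repeat a filter until it has nothing left to remove is a fixed-point loop, sketched below. `repeat_until_stable` is a hypothetical helper, not the code from that answer, and it assumes the filter reads stdin, writes stdout, and only ever removes lines (otherwise the loop may not terminate).

```shell
#!/bin/sh
# repeat_until_stable CMD [ARGS...]  (hypothetical helper)
# Reads stdin, runs it through CMD again and again, and stops as soon as
# one pass leaves the data unchanged. The stable result goes to stdout.
repeat_until_stable() {
    prev=$(mktemp) || return 1
    next=$(mktemp) || { rm -f "$prev"; return 1; }
    cat > "$prev"
    while :; do
        # The exit status of CMD is ignored on purpose: a filter ending in
        # grep -v exits non-zero when nothing is left, which is not an error.
        "$@" < "$prev" > "$next"
        if cmp -s "$prev" "$next"; then
            break
        fi
        mv "$next" "$prev"
    done
    cat "$prev"
    rm -f "$prev" "$next"
}
# Usage: repeat_until_stable ./process.sh < sample.log
```

This removes the need to guess the number of rounds, but every round still rescans the whole log, so it does not address the running time.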

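But this takes too long: a 10K-line log usually takes about 10 minutes to filter.

So my question is: can you find a better way to do this? Or how can I improve my approach so that it runs faster?

Thanks for your time!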

1 answer:

Answer 0: (score: 0)

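I don't think bash is up to your problem.

I would suggest you try perl. Parse the log and save [Handler Name, question, start, finish] 4-tuples into a hash table; then you can scan the hash table to find the hanging handlers. This is a more scalable solution, IMHO.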
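The same hash-table idea can be sketched in a single pass with awk instead of perl, which keeps it inside a shell script. This is only a sketch under assumptions, not the answerer's code: `filter_hung` is a hypothetical name, handler names are assumed to contain no spaces, and because "coming" lines carry no UID, each "start" line is paired with the oldest unmatched "coming" line of the same handler.

```shell
#!/bin/sh
# filter_hung (hypothetical helper): read a log on stdin and print, in the
# original order, every line that is not part of a complete
# coming/start/spend triple.
filter_hung() {
    awk '
    { line[NR] = $0 }                      # remember every line by number

    # "[240] [Handler] coming time ...": queue it under the handler name
    $3 == "coming" && $4 == "time" {
        tail[$2]++
        pending[$2, tail[$2]] = NR
        next
    }
    # "[240] [Handler] [UID] start time ...": pair with the oldest coming line
    $4 == "start" && $5 == "time" {
        key = $2 SUBSEP $3
        started[key] = NR
        if (head[$2] + 0 < tail[$2] + 0) {
            head[$2]++
            came[key] = pending[$2, head[$2]]
        }
        next
    }
    # "[240] [Handler] [UID] spend time ...": the triple is complete, drop it
    $4 == "spend" && $5 == "time" {
        key = $2 SUBSEP $3
        if (key in started) {
            delete line[started[key]]
            if (key in came) delete line[came[key]]
            delete line[NR]
            delete started[key]
            delete came[key]
        }
        next
    }
    END {
        for (i = 1; i <= NR; i++)
            if (i in line) print line[i]
    }'
}
# Usage: filter_hung < sample.log
```

On the sample.log shown in the question this should leave only the RCHandler line. Each line is examined once, so the running time grows linearly with the log size, at the cost of holding the whole log in memory.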