I need to parse several hundred megabytes of application logs that look like this:
2016/05/26 13:07:48 UTC - 15:07:48 Rear gear disengaged
2016/05/26 13:08:13 UTC - 15:08:13 RMCB : Backend in unknown position
2016/05/26 13:08:14 UTC - 15:08:14 OVERPRESSURE ALARM STATUS : no alarm
2016/05/26 13:08:14 UTC - 15:08:14 PRESSURE STATUS : Equipment Off
2016/05/26 13:08:14 UTC - 15:08:14 OVERPRESSURE LINE STATUS : line failure
2016/05/26 13:08:14 UTC - 15:08:14 FILTER EQUIPMENT STATUS : Equipment Off
2016/05/26 13:08:14 UTC - 15:08:14 FILTER LINE STATUS : line failure
2016/05/26 13:08:15 UTC - 15:08:15 RMCB : Backend closed
2016/05/26 13:08:20 UTC - 15:08:20 OVERPRESSURE ALARM STATUS : value=3
2016/05/26 13:08:20 UTC - 15:08:20 OVERPRESSURE ALARM STATUS : alarm Overpressure
2016/05/26 13:08:20 UTC - 15:08:20 PRESSURE STATUS : OK
2016/05/26 13:08:20 UTC - 15:08:20 OVERPRESSURE LINE STATUS : OK
2016/05/26 13:08:20 UTC - 15:08:20 FILTER EQUIPMENT STATUS : OK
2016/05/26 13:08:20 UTC - 15:08:20 FILTER LINE STATUS : OK
2016/05/26 13:08:20 UTC - 15:08:20 [COMMANDER] open wizard view
2016/05/26 13:08:20 UTC - 15:08:20 [DRIVER] open wizard view
2016/05/26 13:08:20 UTC - 15:08:20 [OP2] open wizard view
2016/05/26 13:08:28 UTC - 15:08:28 Acknowledge Alarm : alarm Overpressure
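As you can see, apart from the timestamp the lines have no fixed structure, but I need to extract individual key/value attributes from them.
Take this line, for example: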
FILTER EQUIPMENT STATUS : OK
It is a status event related to the filter's equipment, so I need to parse it into the following key/value pairs:
EventType: Status
SourceContext: FILTER (could also be OVERPRESSURE etc.)
StatusType: EQUIPMENT (could also be LINE)
StatusValue: OK (could also be line failure, if it's a line status)
And so on. The same goes for a line like this:
[COMMANDER] open wizard view
Here we have:
EventType: Instruction
Sender: COMMANDER
Instruction: open wizard view
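I don't need hundreds of different types. A fixed event type plus a dictionary of key/value pairs would be fine, for example, but I need a way to correctly identify the individual attributes and map them into that dictionary.
I first tried Regex capture groups, but apart from serious performance problems I ended up with hundreds of different patterns, some of them so loose that the number of false matches was too high. Then I tried parsing the lines by hand, looking for certain indicators in the string (square brackets, for example), but that produced a huge wall of code handling many special cases, with a high risk of missing log events or misidentifying them.
Is there a pattern or technique better suited to this kind of problem?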
Answer 0 (score: 0)
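Sometimes named capture groups can help when parsing complex data like this. This example may not cover all of your needs, but hopefully it is a good starting point: https://regex101.com/r/8oq5lL/1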
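To make the idea concrete, here is a minimal Python sketch of named capture groups feeding a fixed event type plus a key/value dictionary. It is not the pattern behind the regex101 link. The rule list, the Message fallback, and the field names Date, UtcTime, LocalTime and Message are my own assumptions; the other keys come from the question. The rules are tried in order from most to least specific, and the first match wins.

```python
import re

# Timestamp prefix shared by every sample line, e.g.
# "2016/05/26 13:07:48 UTC - 15:07:48 ". Group names are assumptions.
PREFIX = (
    r"(?P<Date>\d{4}/\d{2}/\d{2}) (?P<UtcTime>\d{2}:\d{2}:\d{2}) UTC - "
    r"(?P<LocalTime>\d{2}:\d{2}:\d{2}) "
)

# Ordered from most to least specific; the first matching rule wins.
# Compiling once up front avoids re-parsing the patterns for every line.
RULES = [
    # "FILTER EQUIPMENT STATUS : OK" or "PRESSURE STATUS : OK"
    ("Status", re.compile(
        PREFIX
        + r"(?P<SourceContext>[A-Z]+)(?: (?P<StatusType>[A-Z]+))? STATUS : "
        + r"(?P<StatusValue>.+)$")),
    # "[COMMANDER] open wizard view"
    ("Instruction", re.compile(
        PREFIX + r"\[(?P<Sender>[^\]]+)\] (?P<Instruction>.+)$")),
    # Hypothetical catch-all so unrecognised lines are kept, not dropped.
    ("Message", re.compile(PREFIX + r"(?P<Message>.+)$")),
]


def parse_line(line):
    """Return a dict with EventType plus the named groups, or None."""
    for event_type, pattern in RULES:
        match = pattern.match(line)
        if match:
            return {"EventType": event_type, **match.groupdict()}
    return None  # the line does not even start with a timestamp


if __name__ == "__main__":
    sample = "2016/05/26 13:08:20 UTC - 15:08:20 FILTER EQUIPMENT STATUS : OK"
    print(parse_line(sample))
```

Because every pattern is anchored on the timestamp prefix and the catch-all comes last, a handful of strict rules plus one loose fallback may be enough, and anything that lands in Message can be reviewed later to decide whether it deserves a rule of its own.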