How do I merge regular expressions with a "filler" between them?

Asked: 2019-05-02 23:41:25

Tags: python regex

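Background

I want to develop a program that extracts fields from unstructured log data. I am using grok to identify the regular expressions that match an input string. With the identification part done, I now want to merge the identified regular expressions into one so that it matches the whole string.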

Example:

Consider this Cisco PIX log line:

Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53

For the log line above, I identified the following regular expressions:

CISCOTIMESTAMP - \b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?>\d\d){1,2})? (?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])

CISCOTAG - [A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)

CISCOACTION - Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted

IPV4 - (?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

URIPATH - (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*)?


Problem

Now I want to merge these regular expressions, but I also need to allow for a filler between them. For example:

Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted

This regex matches the word Built in the log line, and:

(?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9])

This identifies the first IP address, 198.207.223.240.

However, when I merge them on regex101.com like this:

(Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted) ((?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))

Clearly they do not glue together, because some words sit between the two matches (UDP connection for faddr), which I call the "filler".
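To make the gap concrete, here is a minimal Python sketch using the standard re module. The variable names are illustrative, and IPV4 is the same pattern as above, only assembled from a repeated OCTET piece to keep the lines short. It runs the two patterns separately and prints the text that lies between the two matches.

```python
import re

LINE = ("Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for "
        "faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53")

CISCOACTION = (r"Built|Teardown|Deny|Denied|denied|requested|permitted|"
               r"denied by ACL|discarded|est-allowed|Dropping|created|deleted")
# Same IPV4 pattern as in the question, built from one repeated octet piece.
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"
IPV4 = r"(?<![0-9])(?:" + r"[.]".join([OCTET] * 4) + r")(?![0-9])"

# Joining with a single literal space fails: more than a space separates them.
naive = re.compile("(" + CISCOACTION + ") (" + IPV4 + ")")
print(naive.search(LINE))  # None

# Matching each pattern on its own shows what sits in between.
action = re.search(CISCOACTION, LINE)
ip = re.compile(IPV4).search(LINE, action.end())
filler = LINE[action.end():ip.start()]
print(repr(filler))  # ' UDP connection for faddr '
```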

I want to combine the captured regular expressions while allowing for an arbitrary "filler" between them.

Is there a way to do this?


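My approach

I tried (.*) and (.*?) as the filler, but they are too powerful: they take over the other patterns and match the entire rest of the line.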


Can someone help me reach the desired result?

The ideal result is:

CISCOTIMESTAMP + [FILLER REGEX] + CISCOTAG + [FILLER REGEX] + CISCOACTION + [FILLER REGEX] + IPv4 + URIPATH + and so on.

1 Answer:

Answer 0 (score: 0)

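URIPATH does not seem to work on regex101 as written, because the '/' characters are not escaped. Once they are escaped, it works. (The escape matters on regex101 because / is its default pattern delimiter; Python's re module accepts / unescaped.)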

URIPATH: ((?:\/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&\/=:;_?\-\[\]<>]*)?)

The rest works fine with .* as the filler regex.
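As a hedged illustration of the assembly step in Python, the sketch below joins the five patterns from the question with a filler between them. Three things here are assumptions rather than part of the answer: the helper named() and the group names are made up for the example, the filler is the lazy .*? instead of .* so that each field matches at its first occurrence, and the atomic group (?>\d\d) is rewritten as (?:\d\d) because Python's re only accepts atomic groups from version 3.11 onward.

```python
import re

LINE = ("Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for "
        "faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53")

# Patterns copied from the question; only (?>\d\d) was changed to (?:\d\d).
CISCOTIMESTAMP = (
    r"\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?"
    r"|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?"
    r"|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +"
    r"(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?:\d\d){1,2})? "
    r"(?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])"
    r"(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])"
)
CISCOTAG = r"[A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)"
CISCOACTION = (r"Built|Teardown|Deny|Denied|denied|requested|permitted|"
               r"denied by ACL|discarded|est-allowed|Dropping|created|deleted")
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"
IPV4 = r"(?<![0-9])(?:" + r"[.]".join([OCTET] * 4) + r")(?![0-9])"
URIPATH = (r"(?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+"
           r"(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]<>]*)?")

FILLER = r".*?"  # lazy: consume as little as possible between fields


def named(name, pattern):
    """Wrap a pattern in a named group so a top-level `|` stays contained."""
    return "(?P<" + name + ">" + pattern + ")"


combined = re.compile(
    named("timestamp", CISCOTIMESTAMP) + FILLER
    + named("tag", CISCOTAG) + FILLER
    + named("action", CISCOACTION) + FILLER
    + named("ip", IPV4) + named("path", URIPATH)  # no filler: path follows ip
)

match = combined.search(LINE)
print(match.groupdict())
```

On the sample line this should yield timestamp Mar 29 2004 09:54:18, tag PIX-6-302005, action Built, ip 198.207.223.240 and path /53337. Wrapping every part in its own group matters most for CISCOACTION, whose bare alternation would otherwise split the whole combined pattern at each `|`.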

CISCOTIMESTAMP + [FILLER REGEX] + CISCOTAG + [FILLER REGEX] + CISCOACTION + [FILLER REGEX] + IPv4 + URIPATH

The full regex is below:

(\b(?:Jan(?:uary|uar)?|Feb(?:ruary|ruar)?|M(?:a|ä)?r(?:ch|z)?|Apr(?:il)?|Ma(?:y|i)?|Jun(?:e|i)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|O(?:c|k)?t(?:ober)?|Nov(?:ember)?|De(?:c|z)(?:ember)?)\b +(?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])(?: (?>\d\d){1,2})? (?!<[0-9])(?:2[0123]|[01]?[0-9]):(?:[0-5][0-9])(?::(?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?))(?![0-9])).*([A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)).*(Built|Teardown|Deny|Denied|denied|requested|permitted|denied by ACL|discarded|est-allowed|Dropping|created|deleted).*((?<![0-9])(?:(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])[.](?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5]))(?![0-9]))((?:\/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+(?:\?[A-Za-z0-9$.+!*'|(){},~@#%&\/=:;_?\-\[\]<>]*)?)
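One caveat worth checking before relying on the greedy .* filler: a greedy filler backtracks from the end of the line, so the field after it matches at the latest position that still works, not the first. The reduced sketch below is an illustration under assumptions (the helper build() is hypothetical, and only CISCOTAG and IPV4 are used, with a leading filler standing in for whatever precedes the tag). It compares .* with the lazy .*? on the sample line; the full regex above would be expected to behave like the greedy variant.

```python
import re

LINE = ("Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for "
        "faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53")

CISCOTAG = r"[A-Z0-9]+-(?:[+-]?(?:[0-9]+))-(?:[A-Z0-9_]+)"
# Same IPV4 pattern as in the question, built from one repeated octet piece.
OCTET = r"(?:[0-1]?[0-9]{1,2}|2[0-4][0-9]|25[0-5])"
IPV4 = r"(?<![0-9])(?:" + r"[.]".join([OCTET] * 4) + r")(?![0-9])"


def build(filler):
    # The leading filler stands in for whatever precedes the tag.
    return re.compile(
        "^" + filler + "(" + CISCOTAG + ")" + filler + "(" + IPV4 + ")"
    )


greedy = build(r".*").search(LINE).groups()
lazy = build(r".*?").search(LINE).groups()
print(greedy)  # ('X-6-302005', '192.168.0.2')
print(lazy)    # ('PIX-6-302005', '198.207.223.240')
```

With the greedy filler the tag loses its leading "PI" and the last IP address on the line is captured; the lazy filler keeps the full tag and the first address, which is what the question asks for.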

Here is a link to the demo.