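I need help from the regex wizards. I'm trying to write a simple parser that tokenizes the option list of Snort rules (Snort being the IDS/IPS software). The problem is that I can't find a workable pattern that splits the individual rule options on their terminating semicolons. Every pattern I've come up with lumps all the options inside the parentheses into a single capture group.
I've been using the excellent RegExr tool on the GSkinner site with some sample rule options from Emerging Threats (I've already parsed off the rule header; that part was easy to tokenize):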
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot|0d 0a|"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2003409; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2003409; rev:4;)
(msg:"ET DELETED Majestic-12 Spider Bot User-Agent Inbound (MJ12bot)"; flow:to_server,established; content:"|0d 0a|User-Agent\: MJ12bot"; classtype:trojan-activity; reference:url,www.majestic12.co.uk/; reference:url,doc.emergingthreats.net/2007762; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Majestic-12; sid:2007762; rev:4;)
(msg:"ET POLICY McAfee Update User Agent (McAfee AutoUpdate)"; flow:to_server,established; content:"User-Agent|3a| "; http_header; nocase; content:"McAfee AutoUpdate"; http_header; pcre:"/User-Agent\x3a[^\n]+McAfee AutoUpdate/i"; classtype:not-suspicious; reference:url,doc.emergingthreats.net/2003381; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_McAffee; sid:2003381; rev:6;)
(msg:"ET DELETED Metacafe.com family filter off"; flow:established,to_server; content:"POST"; http_method; content:"Host|3a| www.metacafe.com"; http_header; fast_pattern:6,16; content:"submit=Continue+-+I%27m+over+18"; classtype:policy-violation; reference:url,doc.emergingthreats.net/2006367; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/POLICY/POLICY_Metacafe; sid:2006367; rev:7;)
Here is the pattern:
([a-zA-Z0-9_:]+(?:[\w\s.,\-/=<>+!\[\]\(\)\{\}\"|\\;'?`~@#$%^&*])+;)
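The problem is that it doesn't handle colons inside option values, so the first two rules above (whose "content" values contain \:) won't have their "content" options parsed correctly. Apart from that, RegExr highlights each option in blue, including the terminating semicolon but not the space after it. If I feed this to .NET, I should be able to use Regex.Split and get all the tokens split correctly.
If I add the colon to the character class, RegExr highlights the whole rule set as a single blob of text, which isn't what I want. Further attempts to tweak the pattern crashed Adobe Flash, which suggests I've hit a bug in either Flash or RegExr.
I'm not ruling out writing my own string tokenizer, but I was hoping a regex would save me from dealing with things like counting open quotes, escape characters, whitespace, and so on.
Snort rule options generally take the following forms: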
option:value;
option:"string value";
option:!"negated string value";
option:>num;
option:param1,param2,param3;
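Some options, however, have more "exotic" formats for their values, such as byte_test. And everyone's favorite is 'pcre', the option that runs a Perl-compatible regular expression. So any such tokenizer must not get confused when it hits a 'pcre' keyword whose value is itself a regex.
Thoughts?
Edit:
The following comes very close: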
([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)
However, according to RegExr, it gets tripped up by pcre syntax:
(msg:"ET WEB_SPECIFIC_APPS Horde 3.0.9-3.1.0 Help Viewer Remote PHP Exploit"; flow:established,to_server; content:"/services/help/"; nocase; http_uri; pcre:"/module=[^\;]*\;.*\"/UGi"; classtype:web-application-attack; reference:url,www.milw0rm.com/exploits/1660; reference:cve,2006-1491; reference:bugtraq,17292; reference:url,doc.emergingthreats.net/2002867; reference:url,www.emergingthreats.net/cgi-bin/cvsweb.cgi/sigs/WEB_SPECIFIC_APPS/WEB_Horde; sid:2002867; rev:9; http_method;)
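In the above, every option is highlighted as a separate group except for ]*\;.*\"/ . I assumed \x00-\xff would catch all of it, but the cause appears to be that I'm using lazy matching, which stops at the first semicolon. Greedy matching grabs everything, including all the whitespace between options, which I don't want. So I need to modify the regex somehow so that it tokenizes the pcre text correctly.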
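The lazy-match failure described above can be reproduced outside RegExr. The following is a minimal sketch in Python rather than .NET (an assumption made for illustration; the pattern is the one from the first edit, and the sample string is a trimmed-down version of the Horde rule's options). It shows the pcre option being cut at the first semicolon, even though that semicolon is escaped.

```python
import re

# The "very close" pattern from the edit above, unchanged.
LAZY = re.compile(r'([\w]+:?(?:[\x20]|)?(?:[\x00-\xff])*?;)')

# Trimmed sample: three options, the middle one being the troublesome pcre.
sample = r'nocase; pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867;'

tokens = LAZY.findall(sample)
for token in tokens:
    print(token)
# The lazy quantifier stops at the first ';', so the pcre value is split
# and the fragment ]*\;.*\"/ is not part of any token.
```

The second token ends at the escaped semicolon inside the character class, and the leftover `UGi";` is picked up as if it were an option of its own.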
Edit 2: This does the trick:
([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)
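I had to experiment with some sample regexes that use quoted strings. I finally realized that what I needed was a negative lookbehind so that escaped quotes are not treated as terminators. This also seems to take care of any other escaped characters, since escaped characters only appear inside unescaped quotes.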
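As a quick check of the Edit 2 pattern, here is a small sketch in Python (an assumption for illustration; the question targets .NET, but this pattern uses only syntax common to both engines). The sample string is a trimmed-down version of the Horde rule's options.

```python
import re

# The Edit 2 pattern, unchanged: the (?<!\\) lookbehinds refuse to end an
# option at a quote or semicolon that is preceded by a backslash.
EDIT2 = re.compile(r'([\w]+:?(?:[\x20]|)?(?<!\\)\"?.*?(?<!\\)\"?;)')

sample = (r'nocase; http_uri; pcre:"/module=[^\;]*\;.*\"/UGi"; '
          r'classtype:web-application-attack; sid:2002867;')

tokens = EDIT2.findall(sample)
for token in tokens:
    print(token)
```

On this input the pcre option comes back as a single token. Note that the pattern relies on every semicolon inside a value being backslash-escaped; an unescaped semicolon inside a quoted string would still end the token early.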
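Answer 0 (score: 3)
There's no need to hunt around for a solution. Just write the regex carefully so that it matches precisely what you need. Writing it in verbose free-spacing mode makes it much clearer (and easier to maintain), although VB.NET syntax makes that cumbersome: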
' Requires: Imports System.Text.RegularExpressions
Dim RegexObj As New Regex(
    "# Match set of Snort rule options enclosed within parentheses." & chr(10) & _
    "\(                        # Literal opening parenthesis." & chr(10) & _
    "(?:                       # Group for one or more options." & chr(10) & _
    "  \w+                     # Required option name." & chr(10) & _
    "  (?:                     # Group for optional option value." & chr(10) & _
    "    :                     # Option name/value separated by :" & chr(10) & _
    "    (?:                   # Group for option value alternatives." & chr(10) & _
    "      ""                  # Either a double quoted string," & chr(10) & _
    "      [^""\\]*            # {normal} Use ""Unrolling the Loop""." & chr(10) & _
    "      (?:                 # Begin {(special normal*)*} construct." & chr(10) & _
    "        \\.               # {special} == escaped anything." & chr(10) & _
    "        [^""\\]*          # More {normal*} non-quote, non-escapes." & chr(10) & _
    "      )*                  # Finish {(special normal*)*} construct." & chr(10) & _
    "      ""                  # Closing quote." & chr(10) & _
    "    | '[^'\\]*(?:\\.[^'\\]*)*'  # or a single quoted string," & chr(10) & _
    "    | [^;]+               # or one or more non semi-colons." & chr(10) & _
    "    )                     # End group for option value alternatives." & chr(10) & _
    "  )?                      # Option value is optional." & chr(10) & _
    "  ; \s*                   # Option ends with ;, optional ws." & chr(10) & _
    ")+                        # One or more options." & chr(10) & _
    "\)                        # Literal closing parenthesis.",
    RegexOptions.IgnorePatternWhitespace)
Dim MatchResults As Match = RegexObj.Match(SubjectString)
While MatchResults.Success
' matched text: MatchResults.Value
' match start: MatchResults.Index
' match length: MatchResults.Length
MatchResults = MatchResults.NextMatch()
End While
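This regex demonstrates Jeffrey Friedl's "Unrolling the Loop" efficiency technique for correctly matching quoted strings that may contain escaped characters. (See: MRE3, i.e. Mastering Regular Expressions, 3rd Edition.)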
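The regex above matches the whole parenthesized option list as one match. To get the individual tokens the question asks for, one possibility is to apply just the per-option part of the pattern repeatedly. The sketch below is a Python port, not the answer's VB.NET code: the names OPTION and tokenize are hypothetical, the capture groups around the name and value are an addition, and the sample string is a trimmed-down version of the Horde rule.

```python
import re

# Per-option subpattern taken from the answer's regex, with capture groups
# added around the name and the value (the groups are an assumption).
OPTION = re.compile(r'''
    (\w+)                             # option name
    (?:
        :                             # name/value separator
        (                             # option value alternatives:
            "[^"\\]*(?:\\.[^"\\]*)*"  #   double-quoted string (unrolled loop)
          | '[^'\\]*(?:\\.[^'\\]*)*'  #   or single-quoted string
          | [^;]+                     #   or one or more non-semicolons
        )
    )?                                # the value is optional
    ;\s*                              # terminating semicolon, optional ws
''', re.VERBOSE)


def tokenize(rule_options):
    """Return (name, value) pairs; value is None for bare keywords."""
    body = rule_options.strip()
    if body.startswith('(') and body.endswith(')'):
        body = body[1:-1]             # drop the enclosing parentheses
    return [(m.group(1), m.group(2)) for m in OPTION.finditer(body)]


sample = (r'(msg:"Horde (test)"; flow:established,to_server; nocase; '
          r'pcre:"/module=[^\;]*\;.*\"/UGi"; sid:2002867; rev:9;)')

tokens = tokenize(sample)
for name, value in tokens:
    print(name, value)
```

Because the quoted-string alternative consumes escaped characters in pairs, the escaped semicolons and the escaped quote inside the pcre value do not end the token. One caveat of this sketch: a negated value such as option:!"..." does not begin with a quote, so it falls through to the [^;]+ alternative and would still be cut at a semicolon inside the string.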
Oh yes, and one more thing... Icarus found you!