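Structuring a regular expression in Perl

Time: 2015-03-25 18:33:21

Tags: regex bash perl

I am comfortable with high-level languages such as Scala and Java and their libraries, and I rarely have trouble understanding regular expressions at a high level, but I have been tasked with parsing some logs using Perl and regex.

Perl seems simple enough, but it is not a language I am fluent in. The data has some fields that may or may not be wrapped in quotes, while the other fields are separated by spaces. In addition, one quote-wrapped field needs to be broken up into subfields.

Sample data: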

"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - xxxxxxx SUB\xxxxxxx "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)" 1.255.255.255 text/htm;%20charset=utf-8 hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw 200 OBSERVED "Category1;Category 2" "none" "none" TCP_MSG - 0 99 512 512 www.google.com "GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1"
"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - - - "agent" 1.255.255.123 - - 200 OBSERVED "none" "none" "none" TCP_MSG - - 99 256 128 www.google.com "CONNECT hxxp://www.google.com:443 HTTP/1.0"

For the first line, I would like the fields to contain the following:

[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
xxxxxxx
SUB\xxxxxxx
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)
1.255.255.255
text/htm;%20charset=utf-8
hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw
200
OBSERVED
Category1;Category 2
none
none
TCP_MSG
-
0
99
512
512
www.google.com
GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1

Ideally, I would also like to break up the last field:

GET
hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0
HTTP/1.1

For the second line:

[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
-
-
agent
1.255.255.123
-
-
200
OBSERVED
none
none
none
TCP_MSG
-
-
99
256
128
www.google.com
CONNECT hxxp://www.google.com:443 HTTP/1.0

Again, I would like to break up the last field:

CONNECT
hxxp://www.google.com:443
HTTP/1.0

Thanks for your help!

2 answers:

Answer 0: (score: 0)

Here is a Perl script that does what you want. I tested it with your first line of data.

use strict;
use warnings;

my $line = '"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - xxxxxxx SUB\xxxxxxx "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)" 1.255.255.255 text/htm;%20charset=utf-8 hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw 200 OBSERVED "Category1;Category 2" "none" "none" TCP_MSG - 0 99 512 512 www.google.com "GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1"';

my @fields;
while ($line ne "") {
    # Try a double-quoted field first, then fall back to a bare token.
    if (($line =~ /^"(?<field>[^"]*)"(\s+|$)/p) or ($line =~ /^(?<field>\S+)(\s+|$)/p)) {
        push @fields, $+{field};
        $line = ${^POSTMATCH};    # continue with the unparsed remainder
    }
    else {
        die "Parse error when line is '$line'\n";
    }
}

my $last_field = pop @fields;
push @fields, split /\s+/, $last_field;
print join("\n", @fields) . "\n";
exit 0;

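Notes:

  1. 'use strict;' is important for catching typos and other silly mistakes.
  2. In the two regular expressions in the while loop I used a named capture, because I think a reference to $+{field} is more readable than $1.
  3. I ended both regular expressions with /p so that I could refer to ${^POSTMATCH} instead of $POSTMATCH (the English-module name for $'). ${^POSTMATCH} is more efficient, at least on Perl versions before 5.20, where any use of $POSTMATCH slows down every regex match in the program.
  4. If the first regular expression matches $line, the second one is not evaluated.
  5. I wrote the code above quickly; it is not industrial strength. For example, the code will fail if the input data begins with whitespace.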
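
The limitation in note 5 can be worked around with a small variation. What follows is only a minimal sketch, not the code from this answer: the `parse_fields` name and the sample line are made up for illustration, and it assumes a quoted field never contains an embedded double quote. It walks the string with `\G` and `/gc` rather than copying the remainder after each match, and it skips leading whitespace before the first field.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): split one log line
# into fields, where a field is either "double quoted" or a run of non-spaces.
sub parse_fields {
    my ($line) = @_;
    my @fields;

    $line =~ /\G\s+/gc;    # skip leading whitespace, if any
    while ((pos($line) // 0) < length $line) {
        # \G anchors each attempt at the current position; /c keeps that
        # position when the quoted alternative fails, so the bare-token
        # pattern is tried from the same place.
        if ($line =~ /\G"([^"]*)"(?:\s+|\z)/gc or $line =~ /\G(\S+)(?:\s+|\z)/gc) {
            push @fields, $1;
        }
        else {
            # Defensive only: a non-space character should always match above.
            die "Parse error at offset " . (pos($line) // 0) . " in '$line'\n";
        }
    }
    return @fields;
}

# Made-up sample line with leading whitespace and a trailing newline.
my @fields = parse_fields(qq{  "a b" c - "GET hxxp://example.invalid/ HTTP/1.1"\n});
print "$_\n" for @fields;
```

The sample call yields four fields; the last one is the request string, still unsplit, so it can be passed to `split` exactly as in the script above.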

Answer 1: (score: 0)

With a question like this, you can even get an answer like this:

perl -nlE 'my @a; push @a, $+ while /\s*(?:"(.*?)"|(\S+))/g; splice @a, -1, 1, split / /, $a[-1]; $, = $\; say @a, ""' your_log.txt

and the result:

[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
xxxxxxx
SUB\xxxxxxx
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)
1.255.255.255
text/htm;%20charset=utf-8
hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw
200
OBSERVED
Category1;Category 2
none
none
TCP_MSG
-
0
99
512
512
www.google.com
GET
hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0
HTTP/1.1


[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
-
-
agent
1.255.255.123
-
-
200
OBSERVED
none
none
none
TCP_MSG
-
-
99
256
128
www.google.com
CONNECT
hxxp://www.google.com:443
HTTP/1.0
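
For readers who find the one-liner hard to follow, the same idea can be spelled out as a short script. This is a sketch under assumptions rather than a drop-in replacement: the `split_record` name and the shortened sample record are illustrative, and the input is taken from a string instead of `your_log.txt`. It matches unquoted fields with `\S+` so that the global match cannot end with an empty match at the end of the line, which would otherwise become the "last field".

```perl
use strict;
use warnings;

# Hypothetical helper: return the fields of one record, with the final
# quoted request field broken into method, URL and protocol.
sub split_record {
    my ($line) = @_;
    my @fields;

    # Each match is either a quoted string ($1) or a bare token ($2).
    while ($line =~ /\s*(?:"(.*?)"|(\S+))/g) {
        push @fields, defined $1 ? $1 : $2;
    }
    return unless @fields;

    # Replace the last field with its whitespace-separated parts.
    my $request = pop @fields;
    push @fields, split ' ', $request;
    return @fields;
}

# Shortened record built from values in the question's second sample line.
my $sample = '"device" 255.255.255.1 - "Category1;Category 2" '
           . '"CONNECT hxxp://www.google.com:443 HTTP/1.0"';
print "$_\n" for split_record($sample);
```

Using `split ' '` instead of `split / /` also tolerates runs of several spaces inside the request field.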