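I am comfortable with high-level languages and their libraries, such as Scala and Java, and I have few problems understanding regular expressions at a high level, but I have been tasked with parsing some logs using Perl and regex.
(Perl seems simple enough, but it is not native to me.) Some fields in the data may or may not be wrapped in quotes, while others are simply space-delimited. In addition, one quote-wrapped field needs to be broken up into sub-fields.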
Sample data:
"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - xxxxxxx SUB\xxxxxxx "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)" 1.255.255.255 text/htm;%20charset=utf-8 hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw 200 OBSERVED "Category1;Category 2" "none" "none" TCP_MSG - 0 99 512 512 www.google.com "GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1"
"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - - - "agent" 1.255.255.123 - - 200 OBSERVED "none" "none" "none" TCP_MSG - - 99 256 128 www.google.com "CONNECT hxxp://www.google.com:443 HTTP/1.0"
The fields I want from the first line would look like this:
[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
xxxxxxx
SUB\xxxxxxx
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)
1.255.255.255
text/htm;%20charset=utf-8
hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw
200
OBSERVED
Category1;Category 2
none
none
TCP_MSG
-
0
99
512
512
www.google.com
GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1
Ideally, I would also like to break up the last field:
GET
hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0
HTTP/1.1
Second line:
[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
-
-
agent
1.255.255.123
-
-
200
OBSERVED
none
none
none
TCP_MSG
-
-
99
256
128
www.google.com
CONNECT hxxp://www.google.com:443 HTTP/1.0
Again, I would like to break up the last field:
CONNECT
hxxp://www.google.com:443
HTTP/1.0
Thanks for your help!
Answer 0: (score: 0)
Here is a Perl script that does what you want. I tested it with your first line of data.
use strict;
use warnings;

my $line = '"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - xxxxxxx SUB\xxxxxxx "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)" 1.255.255.255 text/htm;%20charset=utf-8 hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw 200 OBSERVED "Category1;Category 2" "none" "none" TCP_MSG - 0 99 512 512 www.google.com "GET hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0 HTTP/1.1"';

my @fields;
while ($line ne "")
{
    # Take either a double-quoted field or a run of non-space characters
    # from the front of the line; /p makes ${^POSTMATCH} available.
    if (($line =~ /^"(?<field>[^"]*)"(\s+|$)/p) or ($line =~ /^(?<field>\S+)(\s+|$)/p))
    {
        push @fields, $+{field};
        $line = ${^POSTMATCH};    # continue with the rest of the line
    } # if
    else {die "Parse error when line is '$line'\n";}
} # while

# Split the last field (the request) on whitespace into its sub-fields.
my $last_field = pop @fields;
push @fields, split /\s+/, $last_field;

print join("\n", @fields) . "\n";
exit 0;
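Caveats:
I wrote the code above quickly, and it is not industrial strength. For example, the code will fail if the input data begins with whitespace.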
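One way to get around the leading-whitespace caveat is to scan the line with `\G` and the `/gc` modifiers instead of trimming it after each match. The following is only a sketch under that assumption; `parse_line` is a hypothetical helper name and the sample line is made up for illustration, not taken from the log.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): split one log line
# into fields, treating "..." as one field and ignoring extra whitespace.
sub parse_line {
    my ($line) = @_;
    my @fields;

    # \G anchors each match where the previous one ended; /c keeps that
    # position when a match fails, so the next alternative can be tried.
    while ($line !~ /\G\s*\z/gc) {
        if ($line =~ /\G\s*"([^"]*)"(?=\s|\z)/gc) {
            push @fields, $1;    # quoted field, quotes removed
        }
        elsif ($line =~ /\G\s*(\S+)/gc) {
            push @fields, $1;    # bare field (may contain embedded quotes)
        }
    }

    # Break the final field (for example "GET url HTTP/1.1") into sub-fields.
    if (@fields) {
        my $last_field = pop @fields;
        push @fields, split ' ', $last_field;
    }
    return @fields;
}

# Made-up sample line with leading and trailing whitespace.
print join("\n", parse_line('   "a b" c "GET /x HTTP/1.1"  ')), "\n";
```

Because whitespace is skipped at the start of every match, leading, trailing, and repeated spaces are all treated the same way. A field with an unbalanced quote simply falls through to the bare-field branch.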
Answer 1: (score: 0)
With a question like this, you could even get an answer like this:
perl -nlE'my@a;push@a,$+while/\s*(?:"(.*?)"|(\S+))/g;splice@a,-1,1,split/ /,$a[-1];$,=$\;say@a,""' your_log.txt
And the result:
[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
xxxxxxx
SUB\xxxxxxx
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW54; Trident/5.0)
1.255.255.255
text/htm;%20charset=utf-8
hxxp://www.google.com/path?query&somevar=1&anothervar="DOUBLEQUOTESHERE"&thirdvar=jlk;asdfhlkjahgjkdfgerw
200
OBSERVED
Category1;Category 2
none
none
TCP_MSG
-
0
99
512
512
www.google.com
GET
hxxp://www.google.com/somestring.php?href=http%3A%2F%2Fsomesite.banana.com%2Fquery%2Fv=1&somevar=1&fin=0
HTTP/1.1
[31/01/2015:00:00:00GMT]
device
255.255.255.1
2015-01-31
00:00:00
1231231234
-
-
-
agent
1.255.255.123
-
-
200
OBSERVED
none
none
none
TCP_MSG
-
-
99
256
128
www.google.com
CONNECT
hxxp://www.google.com:443
HTTP/1.0
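For readers newer to Perl, the one-liner can be unpacked into an ordinary script roughly as follows. This is a sketch rather than an exact transcription: `split_fields` is a hypothetical name, bare fields are matched with `\S+` so that no empty field is produced at the end of the line, and the test line is the second sample line from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: the same regex idea as the one-liner, spelled out.
sub split_fields {
    my ($line) = @_;
    my @fields;

    # Each match skips whitespace, then captures either the text inside
    # double quotes ($1) or a run of non-space characters ($2).
    while ($line =~ /\s*(?:"(.*?)"|(\S+))/g) {
        push @fields, defined $1 ? $1 : $2;
    }

    # Replace the last field with its space-separated parts.
    splice @fields, -1, 1, split / /, $fields[-1] if @fields;
    return @fields;
}

my $sample = '"[31/01/2015:00:00:00GMT]" "device" 255.255.255.1 2015-01-31 00:00:00 1231231234 - - - "agent" 1.255.255.123 - - 200 OBSERVED "none" "none" "none" TCP_MSG - - 99 256 128 www.google.com "CONNECT hxxp://www.google.com:443 HTTP/1.0"';

print "$_\n" for split_fields($sample);
```

To process a whole file, the same function could be called on each chomped line read from the log.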