Perl Regex Capture

Time: 2012-07-15 02:39:02

Tags: regex perl

I have the following text:

GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images  /buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1  

and the following Perl regular expression:

while ($1 =~m/((GET|PUT|POST|CONNECT)\s+\S+)(?:(?!GET|PUT|POST|CONNECT\s+\S+).)*?Host:\s([^\n]+).*?User-Agent:\s([^\n]+).*?Referer:\s([^\n]+).*?Connection:/msg) {
    # do something
}

It matches this fine:

GET /mac/_base_v1/modules/button/images/buttonlarge_yellownormal.png  
www.microsoft.com  
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
http://www.microsoft.com/mac/css  

However, I also need it to handle the following text:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
Host: i.ytimg.com  
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1  
Accept-Language: en-us, *;q=0.5  
Gdata-Version: 2  
X-Gdata-Client: ytapi-apple-ipad  
Accept: */*  
Accept-Encoding: gzip, deflate  
Connection: keep-alive  
Q2J}  

and match the following:

GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1  
i.ytimg.com  
Apple iPad v4.3.5 YouTube v1.0.0.8L1  

while still being able to match the text that was matched successfully before.

2 Answers:

Answer 0 (score: 2)

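So, if I understand your question correctly, you need the Referer header to be optional. You can do that by wrapping the part of the regex that matches that header in non-capturing parentheses and adding a question mark after the closing parenthesis: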

(?:Referer:\s([^\n]+))?

If any other headers are optional, you can do the same for them.
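One caveat worth noting: when the optional group sits directly after a lazy `.*?`, the engine can settle for the empty alternative and leave the capture undefined even if the header is present. A minimal sketch of one way around that, assuming a single request per string and that `Referer` (when present) appears before `Connection`, is to give the optional group its own bounded scan. The names `$request_re` and `parse_request` are hypothetical and used only for illustration.

```perl
use strict;
use warnings;

# Assumed layout: the optional group scans forward on its own, but never past
# "Connection:", so it is tried in full before the engine falls back to skipping it.
my $request_re = qr{
    ( (?:GET|PUT|POST|CONNECT) \s+ \S+ )                     # method and path
    .*? Host:       \s ([^\n]+)                              # Host value
    .*? User-Agent: \s ([^\n]+)                              # User-Agent value
    (?: (?: (?!Connection:) . )*? Referer: \s ([^\n]+) )?    # optional Referer value
    .*? Connection:
}xms;

# Hypothetical helper: returns a hash reference, or nothing if there is no match.
sub parse_request {
    my ($text) = @_;
    return unless $text =~ $request_re;
    return { request => $1, host => $2, user_agent => $3, referer => $4 };
}

my $without_referer = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept: */*
Connection: keep-alive
END_OF_STR

my $with_referer = <<'END_OF_STR';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Accept: image/png,image/*;q=0.8,*/*;q=0.5
Referer: http://www.microsoft.com/mac/base-css
DNT: 1
Connection: keep-alive
END_OF_STR

for my $text ( $without_referer, $with_referer ) {
    my $req = parse_request($text) or next;
    print "$req->{request}\n$req->{host}\n$req->{user_agent}\n";
    print "$req->{referer}\n" if defined $req->{referer};
    print "\n";
}
```

The fourth capture is simply undefined when the header is missing, so callers should check it with `defined` before use.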

Edit: The data stops being captured after the first missing header.

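This isn't perfect yet, because it doesn't work if there are multiple HTTP requests in a single data file, but it should get you going in the right direction: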

use warnings;
use strict;

my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR

my @lines = split m/[\n]/xms, $str;

# Build the regex to match the HTTP methods we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;

# Skip to the first request line and print it.
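# Group the alternation so that \A applies to every method, not just the first,
# and stop if we run out of lines.
while ( @lines && $lines[0] !~ m/ \A (?:$methods_re) \b /xms ) {
    shift @lines;
}
die "No request line found\n" if !@lines;
print "$lines[0]\n";
shift @lines;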

# Build the regex to match the headers we care about.
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;

# Find the headers that we matched.
for my $line (@lines) {
    if ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}

exit;

I will add another update shortly that accounts for multiple HTTP requests in a single file.

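Edit: This solution correctly prints the values you are looking for, but it only prints them. If you want to get the values for each specific request, you'll need something more complex.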

use warnings;
use strict;

my $str = <<'END_OF_STR';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept-Language: en-us, *;q=0.5
Gdata-Version: 2
X-Gdata-Client: ytapi-apple-ipad
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
END_OF_STR

my @lines = split m/[\n]/xms, $str;

# Build the regexes to match the HTTP methods and headers we care about.
my @methods = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta $_ } @methods;
my @headers = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta $_ } @headers;

for my $line (@lines) {
    if ( $line =~ m/ \A (?:$methods_re) \b /xms ) {
        print "$line\n";
    }
    elsif ( $line =~ m/ \A (?:$headers_re):\s*(.*) /xms ) {
        print "$1\n";
    }
}

exit;
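As a rough sketch of the "something more complex" mentioned above, the values can be grouped per request by starting a new hash whenever a request line is seen. This assumes a request line ends in an `HTTP/x.y` version token (which also lets it be found when it is glued to the end of a previous line, as in the question's sample), and that a line starting with `HTTP/` is a response status line whose headers should be ignored. The name `collect_requests` is hypothetical.

```perl
use strict;
use warnings;

my @methods    = qw(GET PUT POST CONNECT);
my $methods_re = join '|', map { quotemeta } @methods;
my @headers    = qw(Host User-Agent Referer Connection);
my $headers_re = join '|', map { quotemeta } @headers;

# Hypothetical helper: returns one hash reference per request found in $text.
sub collect_requests {
    my ($text) = @_;
    my @requests;
    my $in_request = 0;
    for my $line ( split /\r?\n/, $text ) {
        if ( $line =~ m{ ( (?:$methods_re) \s+ \S+ ) \s+ HTTP/\d\.\d }x ) {
            # A request line starts a new record.
            push @requests, { request => $1 };
            $in_request = 1;
        }
        elsif ( $line =~ m{ \A HTTP/ }x ) {
            # A status line: the headers that follow belong to a response.
            $in_request = 0;
        }
        elsif ( $in_request && $line =~ m{ \A ($headers_re) : \s* (.*?) \s* \z }x ) {
            # Keep the first value seen for each header of interest.
            $requests[-1]{$1} //= $2;
        }
    }
    return @requests;
}

my $str = <<'END_OF_STR';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0
Referer: http://www.microsoft.com/mac/base-css
Connection: keep-alive
HTTP/1.1 200 OK
Cache-Control: max-age=900
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Connection: keep-alive
END_OF_STR

for my $req ( collect_requests($str) ) {
    my @values = grep { defined } @{$req}{ qw(request Host User-Agent Referer) };
    print join( "\n", @values ), "\n\n";
}
```

Missing headers simply have no key in the hash, so an absent `Referer` does not stop the other values from being collected.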

Answer 1 (score: 2)

Parsing HTTP request and response headers isn't as simple as one might expect. For example, the following are all equivalent:

Accept-Encoding: gzip, deflate

Accept-Encoding: gzip,
    deflate

Accept-Encoding: gzip
Accept-Encoding: deflate
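To make that equivalence concrete, here is a minimal sketch in core Perl that reduces all three forms to the same value. It assumes continuation lines begin with a space or tab and that repeated headers may be combined with a comma and a space. The helper name `normalize_headers` is hypothetical, and this is an illustration of the problem rather than a replacement for a real parser.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a raw header block into a hash of lowercased
# header names mapped to their combined values.
sub normalize_headers {
    my ($raw) = @_;

    # Unfold continuation lines: a line break followed by spaces or tabs
    # becomes a single space.
    $raw =~ s/\r?\n[ \t]+/ /g;

    my %headers;
    for my $line ( split /\r?\n/, $raw ) {
        next unless $line =~ /\A([^\s:]+):\s*(.*?)\s*\z/;
        my ( $name, $value ) = ( lc $1, $2 );

        # Repeated headers are joined with a comma and a space.
        $headers{$name} = exists $headers{$name} ? "$headers{$name}, $value" : $value;
    }
    return \%headers;
}

my @forms = (
    "Accept-Encoding: gzip, deflate\n",
    "Accept-Encoding: gzip,\n    deflate\n",
    "Accept-Encoding: gzip\nAccept-Encoding: deflate\n",
);

for my $form (@forms) {
    print normalize_headers($form)->{'accept-encoding'}, "\n";
}
```

A single regex over the raw text would have to account for each of these layouts, which is where hand-rolled patterns tend to go wrong.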

So I recommend using an existing parser:

use strict;
use warnings;
use feature qw( say );
use HTTP::Request qw( );

my $s = <<'__EOI__';
GET /mac/_base_v1/images/chrome/background_repeat.jpg HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0  
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/base-css  
DNT: 1  
Connection: keep-alive  
HTTP/1.1 200 OK  
Cache-Control: max-age=900  
Content-Type: image/jpegGET /mac/_base_v1/modules/button/images  /buttonlarge_yellownormal.png HTTP/1.1  
Host: www.microsoft.com  
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:13.0) Gecko/20100101 Firefox/13.0   
Accept: image/png,image/*;q=0.8,*/*;q=0.5  
Accept-Language: en-us,en;q=0.5  
Accept-Encoding: gzip, deflate  
Referer: http://www.microsoft.com/mac/css  
DNT: 1  
__EOI__

my ($raw_req, $raw_resp) = split qr{(?=^HTTP/)}m, $s;
my $req = HTTP::Request->parse($raw_req);
say $req->method;
say $req->url;
say $req->user_agent;
say $req->header('User-Agent');  # Same as previous
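With the parser, an optional header such as `Referer` needs no special handling: `header` returns `undef` in scalar context when the header is absent. A minimal sketch, assuming `HTTP::Request` (from the HTTP-Message distribution) is installed and that the string holds a single request:

```perl
use strict;
use warnings;
use HTTP::Request ();

my $raw = <<'END_OF_REQ';
GET /vi/k_dbVP4r4V4/hqdefault.jpg HTTP/1.1
Host: i.ytimg.com
User-Agent: Apple iPad v4.3.5 YouTube v1.0.0.8L1
Accept: */*
Connection: keep-alive
END_OF_REQ

my $req = HTTP::Request->parse($raw);

print join( ' ', $req->method, $req->uri ), "\n";

for my $name (qw(Host User-Agent Referer)) {
    # Assign to a scalar so that a missing header yields undef
    # rather than an empty list.
    my $value = $req->header($name);
    print "$name: ", ( defined $value ? $value : '(not present)' ), "\n";
}
```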