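I have HTTP header request and reply data in tab-separated form, with each GET/POST and each reply on a separate line. A single TCP stream can contain multiple GETs, POSTs and REPLYs. In those cases I need to pick only the first valid GET - REPLY pair. A (simplified) example: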
ID Source Dest Bytes Type Content-Length host lines....
1 A B 10 GET NA yahoo.com 2
1 A B 10 REPLY 10 NA 2
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 20 NA 4
2 C D 40 GET NA google.com 4
2 C D 40 REPLY 30 NA 4
3 A B 250 POST NA mail.yahoo.com 5
3 A B 250 REPLY NA NA 5
3 A B 250 REPLY 15 NA 5
3 A B 250 GET NA yimg.com 5
3 A B 250 REPLY 35 NA 5
4 G H 415 REPLY 10 NA 6
4 G H 415 POST NA facebook.com 6
4 G H 415 REPLY NA NA 6
4 G H 415 REPLY NA NA 6
4 G H 415 GET NA photos.facebook.com 6
4 G H 415 REPLY 50 NA 6
....
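So basically I need to get one request - reply pair for each ID and write them to a new file.
For '1', there is just one pair, so it is easy. But there are also erroneous cases where both lines are GET, POST or REPLY. Those cases should be ignored.
For '2', I would pick the first GET - REPLY pair.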
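For '3', I would pick the first request (the POST), but the second REPLY, since Content-Length is missing from the first one (which makes the subsequent REPLY a better candidate).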
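For '4', I would pick the first POST (or GET), since the first header cannot be a REPLY. Even though the Content-Length is missing after the POST, I would not pick the REPLY after the second GET, because that REPLY comes after the second request. So I would just pick the first REPLY.
So after picking the best request and reply pair, I need to pair them up on a single line. For example, the output would be: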
ID Source Dest Bytes Type Content-Length host ....
1 A B 10 GET 10 yahoo.com
2 C D 40 GET 20 google.com
3 A B 250 POST 15 mail.yahoo.com
4 G H 415 POST NA facebook.com
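There are many other headers in the actual data, but this example shows pretty much what I need. How can I do this in Perl? I am pretty much stuck: so far I can only read the file one line at a time.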
my $file = 'file.txt';
# Use low-precedence "or" so that die actually fires when open fails
open my $fh, '<', $file or die "Cannot open $file: $!";
while (<$fh>) {
    chomp;
    my @line = split /\t/;
    # get the valid pairs for cases with multiple request - replies
    # get the paired up data together
}
close $fh;
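* Edit: I added an extra column giving the number of HTTP header lines for each ID. This may help in knowing how many subsequent lines to check. I also modified ID '4' so that the first header line is a REPLY. *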
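One way the extra column might be used is to read all rows for an ID in one go. The following is only a minimal sketch: `group_by_count` is a hypothetical helper, the rows are assumed to be already split into fields, and the last field is assumed to hold the row count for that ID, as in the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: split rows into per-ID groups using the trailing
# "lines" column, assumed to hold the number of rows for that ID.
sub group_by_count {
    my @rows = @_;    # each row is an array ref of fields
    my @groups;
    while (@rows) {
        my $n = $rows[0][-1];                    # count from the last field
        last unless $n =~ /^\d+$/ and $n > 0;    # stop on malformed input
        push @groups, [ splice @rows, 0, $n ];
    }
    return @groups;
}

# Sample rows for IDs 1 and 2, taken from the example data
my @rows = map { [ split ' ' ] } (
    "1 A B 10 GET NA yahoo.com 2",
    "1 A B 10 REPLY 10 NA 2",
    "2 C D 40 GET NA google.com 4",
    "2 C D 40 REPLY 20 NA 4",
    "2 C D 40 GET NA google.com 4",
    "2 C D 40 REPLY 30 NA 4",
);

my @groups = group_by_count(@rows);
printf "ID %s: %d rows\n", $_->[0][0], scalar @$_ for @groups;
```

This relies on the count being correct for every ID; grouping on a change of ID instead would not need the extra column.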
Answer 0 (score: 3)
The following program does what you need.
It is commented, and I think it is fairly clear. Please ask if anything is unclear.
use strict;
use warnings;
use List::Util 'max';
my $file = $ARGV[0] // 'file.txt';
open my $fh, '<', $file or die qq(Unable to open "$file" for reading: $!);
# Read the field names from the first line to index the hashes
# Remember where the data in the file starts so we can get back here
#
my @fields = split ' ', <$fh>;
my $start = tell $fh;
# Build a format to print the accumulated data
# Create a hash that relates column headers to their widths
#
my @headers = qw/ ID Source Dest Bytes Type Content-Length host /;
my %len = map { $_ => length } @headers;
# Read through the file to find the maximum data width for each column
#
while (<$fh>) {
    my %data;
    @data{@fields} = split;
    next unless $data{ID} =~ /^\d/;
    $len{$_} = max($len{$_}, length $data{$_}) for @headers;
}
# Build a format string using the values calculated
#
my $format = join ' ', map sprintf('%%%ds', $_), @len{@headers};
$format .= "\n";
# Go back to the start of the data
# Print the column headers
#
seek $fh, $start, 0;
printf $format, @headers;
# Build transaction data hashes into $record and print them
# Ignore any events before the first request
# Ignore the second request and anything after it
# Update the stored Content-Length field if a value other than NA appears
#
my $record;
my $nreq = 0;
while (<$fh>) {
    my %data;
    @data{@fields} = split;
    my ($id, $type) = @data{ qw/ ID Type / };
    next unless $id =~ /^\d/;
    if ($record and $id ne $record->{ID}) {
        printf $format, @{$record}{@headers};
        undef $record;
        $nreq = 0;
    }
    if ($type eq 'GET' or $type eq 'POST') {
        $record = \%data if $nreq == 0;
        $nreq++;
    }
    elsif ($nreq == 1) {
        if ($record->{'Content-Length'} eq 'NA' and $data{'Content-Length'} ne 'NA') {
            $record->{'Content-Length'} = $data{'Content-Length'};
        }
    }
}
printf $format, @{$record}{@headers} if $record;
Output
Given the data in the question, this program produces
ID Source Dest Bytes Type Content-Length host
1 A B 10 GET 10 yahoo.com
2 C D 40 GET 20 google.com
3 A B 250 POST 15 mail.yahoo.com
4 G H 415 POST NA facebook.com
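The selection rule applied by the second loop can also be looked at in isolation. The following is a minimal sketch, assuming the rows for a single ID are already in memory; `pick_pair` and the three-field row layout (Type, Content-Length, host) are hypothetical and not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: given the rows for one ID, return a hash ref for
# the first request with the best Content-Length, or undef if the rows
# contain no request at all.
sub pick_pair {
    my @rows = @_;
    my ($request, $nreq) = (undef, 0);
    for my $row (@rows) {
        my ($type, $length, $host) = @$row;
        if ($type eq 'GET' or $type eq 'POST') {
            # Only the first request is kept; later ones are just counted
            $request = { Type => $type, Length => $length, Host => $host }
                if $nreq == 0;
            $nreq++;
        }
        elsif ($nreq == 1) {
            # A reply to the first request: keep the first real length seen
            $request->{Length} = $length
                if $request->{Length} eq 'NA' and $length ne 'NA';
        }
        # Replies before any request, or after a second request, are ignored
    }
    return $request;
}

# The rows for ID 3 from the question
my $pair = pick_pair(
    [ POST  => 'NA', 'mail.yahoo.com' ],
    [ REPLY => 'NA', 'NA' ],
    [ REPLY => 15,   'NA' ],
    [ GET   => 'NA', 'yimg.com' ],
    [ REPLY => 35,   'NA' ],
);
print join(' ', @{$pair}{qw(Type Length Host)}), "\n";   # POST 15 mail.yahoo.com
```

The request counter is what stops the REPLY with length 35 from being used: by the time it arrives, a second request has already been seen.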
Answer 1 (score: 1)
This seems to work on the given data:
#!/usr/bin/env perl
use strict;
use warnings;
# Shape of input records
use constant ID => 0;
use constant Source => 1;
use constant Dest => 2;
use constant Bytes => 3;
use constant Type => 4;
use constant Length => 5;
use constant Host => 6;
use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-6s %s\n";
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 6s %s\n";
printf fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Length", "Host";
my @post_get;
my @reply;
my $lastid = -1;
my $pg_count = 0;
sub print_data
{
    # Final validity checking
    if ($lastid != -1)
    {
        printf fmt_data, $post_get[ID], $post_get[Source],
            $post_get[Dest], $post_get[Bytes], $post_get[Type], $reply[Length], $post_get[Host];
        # Reset arrays;
        @post_get = ();
        @reply    = ();
        $pg_count = 0;
    }
}
while (<>)
{
    chomp;
    my @record = split;
    # Validate record here (number of fields, etc)
    # Skip the header row and any other line without a numeric ID
    next unless @record >= 7 && $record[ID] =~ /^\d+$/;
    # Detect change in ID
    print_data if ($record[ID] != $lastid);
    $lastid = $record[ID];
    if ($record[Type] eq "REPLY")
    {
        # Discard REPLY if there wasn't already a POST/GET
        next unless defined $post_get[ID];
        # Discard REPLY if there was a second POST/GET
        next if $pg_count > 1;
        @reply = @record if !defined $reply[ID];
        $reply[Length] = $record[Length]
            if $reply[Length] eq "NA" && $record[Length] ne "NA";
    }
    else
    {
        $pg_count++;
        @post_get = @record if !defined $post_get[ID];
        $post_get[Length] = $record[Length]
            if $post_get[Length] eq "NA" && $record[Length] ne "NA";
    }
}
print_data;
It produces:
ID Source Dest Bytes Type Content-Length host
1 A B 10 GET 10 yahoo.com
2 C D 40 GET 20 google.com
3 A B 250 POST 15 mail.yahoo.com
4 G H 415 POST NA facebook.com
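The main deviation from the question is the substitution of 'Length' for 'Content-Length'. The fix is easy if needed: change the width of the sixth field in fmt_data and fmt_head from 6 to 14, and change "Length" to "Content-Length".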
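As a sketch of that change, assuming nothing else in the program is altered, the two format constants and the header call might look like this. The sample data row is taken from the question and is only there to show the alignment.

```perl
use strict;
use warnings;

# Sixth field widened from 6 to 14 so that "Content-Length" fits
use constant fmt_head => "%-6s %-6s %-6s %-6s %-6s %-14s %s\n";
use constant fmt_data => "%-6d %-6s %-6s % 6d %-6s % 14s %s\n";

# Header row with the full column name
printf(fmt_head, "ID", "Source", "Dest", "Bytes", "Type", "Content-Length", "Host");

# One sample row (ID 1 from the question)
printf(fmt_data, 1, "A", "B", 10, "GET", 10, "yahoo.com");
```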