Perl question: extracting, manipulating, and merging related data from different lines

Time: 2013-01-22 22:58:12

Tags: perl parsing merge

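I have a very specific problem that I cannot solve. It involves parsing and merging related data from different lines.

My file contains text in the following format: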

======================================================
8:27:24 PM  http://10.11.12.13:80
======================================================
GET /dog-pictures HTTP/1.1
Host: 10.11.12.13
Language: english
Agent: Unknown
Connection: closed

======================================================



======================================================
8:28:56 PM  http://192.114.126.245:80
======================================================
GET /flowers HTTP/1.1
Host: 10.11.12.13
Language: english

======================================================



======================================================
8:29:07 PM  http://10.11.12.13:80
======================================================
GET /africas-animals HTTP/1.1
Host: 10.11.12.13
Language: english
Agent: Unknown
Connection: open

======================================================

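As shown above, each record in the text file is delimited by three lines of equals signs (=======), but the number of data lines inside a record varies.

The output I need looks like this: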

    http://10.11.12.13/dog-pictures
    http://192.114.126.245/flowers
    http://10.11.12.13/africas-animals

An illustration of the pieces I need to merge:

======================================================
8:27:24 PM  http://10.11.12.13:80                     <--- Gets the first part from here
======================================================
GET /dog-pictures HTTP/1.1                            <--- Gets the second part from here
Host: 10.11.12.13
Language: english
Agent: Unknown
Connection: closed

======================================================

Any help with this problem would be greatly appreciated. Thanks in advance.

3 answers:

Answer 0 (score: 1)

Try this Perl one-liner from the shell:

perl -lane '
    if (/^\d+:\d+:\d+\s+\w+\s+([^:]+):/) {
        $scheme = $1;
    }
    if (/^(GET|HEAD|POST|PUT|DELETE|OPTION|TRACE)/) {
        $path = $F[1];
    }
    if (/^Host/) {
        print "$scheme://$F[1]$path";
    }
' file.txt

SCRIPT VERSION

Generated with perl -MO=Deparse and lightly adjusted:

#!/usr/bin/env perl
# mimic `-l` switch to print like "say"
BEGIN { $/ = "\n"; $\ = "\n"; }

use strict; use warnings;

my ($scheme, $path);

# magic diamond operator
while (<ARGV>) {
    chomp $_;
    # splitting current line in @F array
    my (@F) = split(' ', $_, 0);

    # regex to catch the scheme (http)
    if (/^\d+:\d+:\d+\s+\w+\s+([^:]+):/) {
        $scheme = $1;
    }
    # if the current line starts with an HTTP verb, store the second
    # column (the request path) in $path
    if (/^(GET|HEAD|POST|PUT|DELETE|OPTION|TRACE)/) {
        $path = $F[1];
    }
    # if the current line starts with Host, print the assembled URL
    if (/^Host/) {
        print "${scheme}://$F[1]$path";
    }
}

USAGE

chmod +x script.pl
./script.pl file.txt

OUTPUT

http://10.11.12.13/dog-pictures
http://10.11.12.13/flowers
http://10.11.12.13/africas-animals
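
Note that the second line of this output is built from the Host header (10.11.12.13), while the question asks for the address on the timestamp line (192.114.126.245). Below is a minimal sketch of a variant that takes the whole base URL from the timestamp line instead. It assumes the trailing port should always be dropped and that each timestamp line is followed by exactly one request line; the sub name urls_from_log is hypothetical and not part of the answer above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: takes the log text, returns one URL per request.
sub urls_from_log {
    my ($text) = @_;
    my ($base, @urls);
    for my $line (split /\n/, $text) {
        # Timestamp line: remember scheme://host and drop the optional :port
        if ($line =~ m{^\d+:\d+:\d+\s+\w+\s+(\w+://[^:/\s]+)(?::\d+)?\s*\z}) {
            $base = $1;
        }
        # Request line: the field after the HTTP verb is the path
        elsif (defined $base
            and $line =~ m{^(?:GET|HEAD|POST|PUT|DELETE|OPTIONS|TRACE)\s+(\S+)}) {
            push @urls, $base . $1;
            undef $base;    # wait for the next timestamp line
        }
    }
    return @urls;
}

# Sample record copied from the question
my $sample = <<'END';
======================================================
8:28:56 PM  http://192.114.126.245:80
======================================================
GET /flowers HTTP/1.1
Host: 10.11.12.13
Language: english

======================================================
END

print "$_\n" for urls_from_log($sample);
```

For the sample record this prints http://192.114.126.245/flowers, which matches the format requested in the question.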

Answer 1 (score: 1)

The following may help you:

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die $!;

# Read the file line by line
while (<$fh>) {

    # If a URL is captured on a line beginning with a time, read past the separator line that follows
    if ( my ($url) = /^\d+:\d+:\d+.+?(\S+):\d+$/ and <$fh> ) {

        # Capture path
        my ($path) = <$fh> =~ /\s+(\/\S+)\s+/;

        print "$url$path\n" if $url and $path;
    }
}

Output:

http://10.11.12.13/dog-pictures
http://192.114.126.245/flowers
http://10.11.12.13/africas-animals

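Only two lines in each record contain the information you want, and they are separated by a line of equals signs. The first regex tries to match the time string and captures the URL on that line. The and <$fh> reads past the separator line. The second regex captures the path from the next line. Finally, the URL and path are printed.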
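
The three positional reads described above can be spelled out one step at a time. This is only a sketch of the same idea, not the answer's own code: the sub name urls_by_position is hypothetical, the regexes are the ones from the answer, and the sample is read through an in-memory filehandle so that no data.txt is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: reads records from a filehandle, returns the URLs.
sub urls_by_position {
    my ($fh) = @_;
    my @urls;
    while (defined(my $line = <$fh>)) {
        # Step 1: timestamp line; capture the URL without its :port suffix
        my ($url) = $line =~ /^\d+:\d+:\d+.+?(\S+):\d+$/ or next;
        # Step 2: discard the separator line that follows
        defined(my $separator = <$fh>) or last;
        # Step 3: the next line is the request line; capture its path
        defined(my $request = <$fh>) or last;
        my ($path) = $request =~ m{\s+(/\S+)\s+};
        push @urls, "$url$path" if defined $path;
    }
    return @urls;
}

# Sample record copied from the question
my $sample = <<'END';
======================================================
8:27:24 PM  http://10.11.12.13:80
======================================================
GET /dog-pictures HTTP/1.1
Host: 10.11.12.13

======================================================
END

open my $fh, '<', \$sample or die $!;
print "$_\n" for urls_by_position($fh);
```

For the sample record this prints http://10.11.12.13/dog-pictures. The approach depends on the request line always being exactly two lines below the timestamp line, which holds for the format shown in the question.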

Answer 2 (score: 0)

Perl:

perl -lane 'if(/http/){($x=$F[2])=~s/:\d+$//}if(/GET/){print $x.$F[1]}' your_file

If you would rather use awk:

awk '/http/{x=$3; sub(/:[0-9]+$/,"",x)} /GET/{print x $2}' your_file