Perl script to search a file for a pattern and concatenate lines

Time: 2009-06-12 21:29:21

Tags: regex perl string text

I have a text file (basically an error log containing a date, a timestamp, and some data) in the following format:

mm/dd/yy 12:00:00:0001  
This is line 1
This is line 2

mm/dd/yy 12:00:00:0004  
This is line 3
This is line 4
This is line 5


mm/dd/yy 12:00:00:0004
This is line 6
This is line 7

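I am new to Perl and need to write a script that searches the file for timestamps and merges the data that falls under the same timestamp.

I expect the following output for the sample above.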

mm/dd/yy 12:00:00:0001  
This is line 1
This is line 2

mm/dd/yy 12:00:00:0004  
This is line 3
This is line 4
This is line 5
This is line 6
This is line 7

What would be the best way to get this done?

4 answers:

Answer 0 (score: 4)

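I had to do this task before on some very large files where the timestamps were not in order. I didn't want to store everything in memory, so I used a three-pass solution:

  • Tag each input line with its timestamp and save it to a temporary file
  • Sort the temporary file with a fast sorter, such as sort(1)
  • Turn the sorted file back into the original format

This was fast enough for my task, where I could let it run while I went for a cup of coffee, but you might have to do something fancier if you need the results really quickly.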

use strict;
use warnings;
use File::Temp qw(tempfile);

my( $temp_fh, $temp_filename )  = tempfile( UNLINK => 1 );

# read each line, tag with timestamp, and write to temp file
# will sort and undo later.
my $current_timestamp = '';
LINE: while( <DATA> )
    {
    chomp;

    if( m|^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d\d\d\d)\s*$| ) # timestamp line, trailing whitespace allowed
        {
        $current_timestamp = $1;
        next LINE;
        }
    elsif( m|\S| ) # line with non-whitespace (not a "blank line")
        {
        print $temp_fh "[$current_timestamp] $_\n";
        }
    else # blank lines
        {
        next LINE;
        }
    }

close $temp_fh;

# sort the file by lines using some very fast sorter
system( "sort", qw(-o sorted.txt), $temp_filename ) == 0
    or die "sort failed: $?";

# read the sorted file and turn back into starting format
# parenthesize the open() arguments so that "or die" applies to open itself
open( my $in, "<", 'sorted.txt' ) or die "Could not read sorted.txt: $!";

$current_timestamp = '';
while( <$in> )
    {
    my( $timestamp, $line ) = m/\[(.*?)] (.*)/;
    if( $timestamp ne $current_timestamp )
        {
        $current_timestamp = $timestamp;
        print $/, $timestamp, $/;
        }

    print $line, $/;
    }

close $in;
unlink $temp_filename, 'sorted.txt';

__END__
01/01/70 12:00:00:0004
This is line 3
This is line 4
This is line 5

01/01/70 12:00:00:0001
This is line 1
This is line 2


01/01/70 12:00:00:0004
This is line 6
This is line 7

Answer 1 (score: 2)

If the log file is not too big to keep in memory, you could keep a hash of date string => text. Something like this:

my %h;
my $cur = "*** No date ***";
while(<>) {
  if (m"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d{4})") {
    $cur = $1;
  } else {
    $h{$cur} .= $_ unless /^\s*$/;
  }
}

print "$_\n$h{$_}\n" foreach (sort keys %h);

Save it as t.pl and run it as perl t.pl < yourlog.txt. Adjust the regex if needed.
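One thing to watch with the hash approach: sort keys %h orders the date strings as plain text, so with mm/dd/yy dates a log that crosses a year boundary would not come out in chronological order. Below is a minimal sketch of a variation that keeps each timestamp in the order it first appears in the file. The sub name merge_by_timestamp and the @order array are made up for illustration and are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the lines that share a timestamp, keeping
# timestamps in first-seen order instead of sorted order.
sub merge_by_timestamp {
    my ($fh) = @_;
    my ( %lines_for, @order );
    my $cur;

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ m{^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d{4})\s*$} ) {
            $cur = $1;
            # Remember a timestamp only the first time it is seen.
            push @order, $cur unless exists $lines_for{$cur};
            $lines_for{$cur} //= [];
        }
        elsif ( defined $cur && $line =~ /\S/ ) {
            push @{ $lines_for{$cur} }, $line;    # non-blank data line
        }
    }

    # Rebuild the log: timestamp, its lines, then a blank separator line.
    my $out = '';
    for my $ts (@order) {
        $out .= join( "\n", $ts, @{ $lines_for{$ts} } ) . "\n\n";
    }
    return $out;
}

# Usage: read from an in-memory string so the example is self-contained.
my $log = <<'END_LOG';
01/01/70 12:00:00:0004
This is line 3

01/01/70 12:00:00:0001
This is line 1

01/01/70 12:00:00:0004
This is line 6
END_LOG

open( my $log_fh, '<', \$log ) or die "Cannot open in-memory log: $!";
print merge_by_timestamp($log_fh);
```

The trade-off is one extra array holding the distinct timestamps. Like the hash version, it still keeps the whole log in memory.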

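Answer 2 (score: 1)

If the input is large, it is better to do this in two stages: first create a SQLite database with a single table that has columns for the timestamp and the line (and possibly the line number and file name). You can then output the data any way you like.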
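The two stages could be sketched roughly as follows. This is only an illustration under assumptions: it needs the DBI and DBD::SQLite modules from CPAN, and the table name log, the column names, and the sub name merge_with_sqlite are invented here. Note that ORDER BY ts compares the timestamps as text.

```perl
use strict;
use warnings;
use DBI;    # assumes DBI and DBD::SQLite are installed from CPAN

# Hypothetical helper: stage 1 loads (timestamp, line number, text) rows
# into SQLite; stage 2 reads them back grouped by timestamp.
sub merge_with_sqlite {
    my ( $fh, $db_file ) = @_;

    my $dbh = DBI->connect( "dbi:SQLite:dbname=$db_file", "", "",
        { RaiseError => 1, AutoCommit => 1 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS log (ts TEXT, line_no INTEGER, line TEXT)');

    # Stage 1: one row per non-blank data line, inside a single transaction.
    my $insert = $dbh->prepare('INSERT INTO log (ts, line_no, line) VALUES (?, ?, ?)');
    my $cur;
    $dbh->begin_work;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ m{^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d:\d{4})\s*$} ) {
            $cur = $1;
        }
        elsif ( defined $cur && $line =~ /\S/ ) {
            $insert->execute( $cur, $., $line );    # $. is the input line number
        }
    }
    $dbh->commit;

    # Stage 2: read rows back by timestamp, then by original line order.
    my $rows = $dbh->selectall_arrayref('SELECT ts, line FROM log ORDER BY ts, line_no');
    my ( $out, $prev ) = ( '', '' );
    for my $row (@$rows) {
        my ( $ts, $line ) = @$row;
        if ( $ts ne $prev ) {
            $out .= "\n" if $prev ne '';    # blank line between groups
            $out .= "$ts\n";
            $prev = $ts;
        }
        $out .= "$line\n";
    }
    $dbh->disconnect;
    return $out;
}

# Usage: an in-memory log and an in-memory database keep the demo self-contained.
my $log = "01/01/70 12:00:00:0004\nThis is line 3\n\n"
        . "01/01/70 12:00:00:0001\nThis is line 1\n\n"
        . "01/01/70 12:00:00:0004\nThis is line 6\n";
open( my $log_fh, '<', \$log ) or die "Cannot open in-memory log: $!";
print merge_with_sqlite( $log_fh, ':memory:' );
```

For a real log you would pass a file name instead of ':memory:' so that the table lives on disk and can be queried again later.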

Answer 3 (score: 0)

Consider this solution...

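#!/usr/bin/perl

use strict;
use warnings;

# Map each timestamp (the part after the literal "mm/dd/yy ") to its lines.
my (%time, $id);
while (<DATA>) {
    chomp;
    # A header line sets the current timestamp and is not stored itself.
    if ( /^mm\/dd\/yy\s+(\S+)/ ) {
        $id = $1;
        next;
    }
    # Skip blank lines and anything that comes before the first header.
    next if !defined $id || /^\s*$/;
    push @{ $time{$id} }, $_;
}

# Sort the keys so that the output order is deterministic.
for my $i ( sort keys %time ) {
    print "mm/dd/yy $i\n";
    for my $j ( @{ $time{$i} } ) {
        print "$j\n";
    }
    print "\n";
}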

__DATA__
mm/dd/yy 12:00:00:0001
This is line 1
This is line 2

mm/dd/yy 12:00:00:0004
This is line 3
This is line 4
This is line 5


mm/dd/yy 12:00:00:0004
This is line 6
This is line 7