Processing log files to identify stalled data creation

Time: 2019-05-04 17:07:28

Tags: perl

I have a number of running processes that create data at different rates. I want to use Perl to identify experiments that have not created any new data for more than an hour, so that they can be terminated early. The log file looks roughly like this (entries are appended about every 15 minutes; shortened here for readability):

# Dataset,Timestamp,dataset size
exp-201905040115a,1556932502,0
exp-201905040115b,1556932502,0
exp-201905040115a,1556934301,213906
exp-201905040115b,1556934301,25487
exp-201905040115a,1556936102,399950
exp-201905040115b,1556936102,210548
exp-201905040115a,1556937002,399950
exp-201905040115b,1556937002,487250
exp-201905040115a,1556937902,399950
exp-201905040115b,1556937902,487250
exp-201905040115a,1556938802,399950
exp-201905040115b,1556938802,502145
exp-201905040115a,1556939701,399950
exp-201905040115b,1556939701,502145
exp-201905040115a,1556940601,399950
exp-201905040115b,1556940601,502145
exp-201905040115a,1556941502,399950
exp-201905040115b,1556941502,502145
exp-201905040115a,1556942401,399950
exp-201905040115b,1556942401,502145

The first size value for a dataset is usually 0, but it is sometimes a small value (<100).

I have already worked out how to read the log file and check it line by line (or turn each line into an array to pull out the column entries).

#!/usr/bin/perl

use warnings;
use strict;

my @datasets = ( 'exp-201905040115a', 'exp-201905040115b' );

# One full pass over the log file per dataset.
foreach my $dataset (@datasets) {
    open my $logfile, '<', 'data.log' or die "Cannot open: $!";
    while (my $line = <$logfile>) {
        chomp $line;
        my ($log_dataset, $log_timestamp, $log_datasize) = split /,/, $line;

        # Report every entry belonging to the dataset we are looking for.
        if ($dataset eq $log_dataset) {
            print "Matched: ", $dataset, "\t";
            printf '%10d', $log_datasize;
            print " at ", $log_timestamp, "\n";
        }
    }
    close $logfile;
}
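
This gives me output like the following (first dataset shown; reconstructed by hand from the print statements above and the sample log):

Matched: exp-201905040115a	         0 at 1556932502
Matched: exp-201905040115a	    213906 at 1556934301
Matched: exp-201905040115a	    399950 at 1556936102
Matched: exp-201905040115a	    399950 at 1556937002
...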

  1. I am somewhat stuck on how to tell whether the third column has changed at all within the last 3600 seconds. I suppose I have to compare values across rows, but do I really have to compare that many pairs of lines?

  2. Also, is there a more efficient approach than looping over the entire log file once per dataset?

Can someone give me a suggestion? Thanks!

1 Answer:

Answer 0 (score: 1)

CSV input and multiple datasets that need grouping make me think of a database. Indeed...

#!/bin/sh
logfile="$1"
sqlite3 -batch -noheader -csv <<EOF
CREATE TABLE logs(dataset TEXT, ts INTEGER, size INTEGER
                , PRIMARY KEY(dataset, size, ts)) WITHOUT ROWID;
.import "$logfile" logs
SELECT dataset
FROM logs AS l
GROUP BY dataset, size
HAVING max(ts) - min(ts) >= 3600
   -- ...and that unchanged stretch includes the dataset's most recent entry
   AND max(ts) = (SELECT max(ts) FROM logs AS l2 WHERE l.dataset = l2.dataset)
ORDER BY dataset;
EOF

When run on the sample data, this prints exp-201905040115a.
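
Save it as, say, stalled.sh (the name is arbitrary) and pass the log file as its argument:

sh stalled.sh data.log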

But you want perl. For DBI there is a handy driver that handles CSV files (DBD::CSV), but the SQL dialect it supports has no HAVING, and it is very slow. So, plan b.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

my %datasets;

# Read the log file into a hash table of lists of (time,size) pairs.
while (<>) {
  chomp;
  next if /^#/;    # skip the comment header
  my ($ds, $time, $size) = split /,/;
  push @{$datasets{$ds}}, [ $time => $size ];
}

# For each dataset listed in the file:
DATASET:
while (my ($ds, $data) = each %datasets) {
  # Sort list in reverse order of time
  @$data = sort { $b->[0] <=> $a->[0] } @$data;
  # Get the most recent entry
  my ($time, $size) = @{shift @$data};
  # And compare it against the rest until...
  for my $rec (@$data) {
    # ... different size
    next DATASET if $size != $rec->[1];
    # ... Same size, entry more than an hour old
    if ($time - $rec->[0] >= 3600) {
      say $ds;
      next DATASET;
    }
  }
}
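
Since the script reads from <>, pass the log file on the command line. Assuming it is saved as stalled.pl (again, an arbitrary name):

$ perl stalled.pl data.log
exp-201905040115a

Like the SQL version, it reports only datasets whose size has not changed for at least an hour up to their most recent log entry.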