I have a number of running processes that create data at different rates. I would like to use Perl to identify the experiments that have not produced any new data for more than an hour, so they can be terminated early. The log file looks roughly like this (an entry every 15 minutes; shortened for readability):
# Dataset,Timestamp,dataset size
exp-201905040115a,1556932502,0
exp-201905040115b,1556932502,0
exp-201905040115a,1556934301,213906
exp-201905040115b,1556934301,25487
exp-201905040115a,1556936102,399950
exp-201905040115b,1556936102,210548
exp-201905040115a,1556937002,399950
exp-201905040115b,1556937002,487250
exp-201905040115a,1556937902,399950
exp-201905040115b,1556937902,487250
exp-201905040115a,1556938802,399950
exp-201905040115b,1556938802,502145
exp-201905040115a,1556939701,399950
exp-201905040115b,1556939701,502145
exp-201905040115a,1556940601,399950
exp-201905040115b,1556940601,502145
exp-201905040115a,1556941502,399950
exp-201905040115b,1556941502,502145
exp-201905040115a,1556942401,399950
exp-201905040115b,1556942401,502145
The first size entry for a dataset is usually 0, but occasionally a small value (<100).
I have already figured out how to read the log file and check it line by line (or split a line into an array to pull out the column entries).
#!/usr/bin/perl
use warnings;
use strict;

my @datasets = ( 'exp-201905040115a', 'exp-201905040115b' );

foreach my $dataset (@datasets) {
    open my $logfile, '<', 'data.log' or die "Cannot open: $!";
    while (my $line = <$logfile>) {
        chomp $line;
        my ( $log_dataset, $log_timestamp, $log_datasize ) = split /,/, $line;
        if ( $dataset eq $log_dataset ) {
            print "Matched: ", $dataset, "\t";
            printf( '%10d', $log_datasize );
            print " at ", $log_timestamp, "\n";
        }
    }
    close $logfile;
}
I am somewhat stuck, though: how do I tell whether the third column has changed at all during the last 3600 seconds? I suppose I have to compare values across rows, but doesn't that mean an awful lot of comparisons?
Also, is there a more efficient approach than looping over the whole log file once per dataset?
Can someone give me a pointer? Thanks!
Answer 0 (score: 1)
CSV input and multiple datasets to be grouped make me think of a database. And indeed...
#!/bin/sh
logfile="$1"
sqlite3 -batch -noheader -csv <<EOF
CREATE TABLE logs(dataset TEXT, ts INTEGER, size INTEGER,
                  PRIMARY KEY(dataset, size, ts)) WITHOUT ROWID;
.import "$logfile" logs
SELECT dataset
FROM logs AS l
GROUP BY dataset, size
HAVING max(ts) - min(ts) >= 3600     -- unchanged for at least an hour...
   AND max(ts) = (SELECT max(ts)     -- ...and still current: the group holds
                  FROM logs AS l2    -- the dataset's most recent entry
                  WHERE l.dataset = l2.dataset)
ORDER BY dataset;
EOF
Run against the sample data, this prints exp-201905040115a.
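To try that claim end to end, here is a self-contained smoke test. The file name sample.log is mine, it assumes the sqlite3 CLI is installed, and the log's "# Dataset,..." header line is left out, since .import would otherwise load it as an ordinary row:

```shell
# Recreate the sample log from the question (header line omitted).
cat > sample.log <<'DATA'
exp-201905040115a,1556932502,0
exp-201905040115b,1556932502,0
exp-201905040115a,1556934301,213906
exp-201905040115b,1556934301,25487
exp-201905040115a,1556936102,399950
exp-201905040115b,1556936102,210548
exp-201905040115a,1556937002,399950
exp-201905040115b,1556937002,487250
exp-201905040115a,1556937902,399950
exp-201905040115b,1556937902,487250
exp-201905040115a,1556938802,399950
exp-201905040115b,1556938802,502145
exp-201905040115a,1556939701,399950
exp-201905040115b,1556939701,502145
exp-201905040115a,1556940601,399950
exp-201905040115b,1556940601,502145
exp-201905040115a,1556941502,399950
exp-201905040115b,1556941502,502145
exp-201905040115a,1556942401,399950
exp-201905040115b,1556942401,502145
DATA

# With no database file argument, sqlite3 uses a throwaway in-memory
# database, which is all this query needs.
sqlite3 -batch -noheader -csv <<'EOF'
CREATE TABLE logs(dataset TEXT, ts INTEGER, size INTEGER,
                  PRIMARY KEY(dataset, size, ts)) WITHOUT ROWID;
.import sample.log logs
SELECT dataset
FROM logs AS l
GROUP BY dataset, size
HAVING max(ts) - min(ts) >= 3600
   AND max(ts) = (SELECT max(ts) FROM logs AS l2
                  WHERE l.dataset = l2.dataset)
ORDER BY dataset;
EOF
```

On this data it prints just exp-201905040115a: that dataset has sat at size 399950 for 6299 seconds, while exp-201905040115b's plateau at 502145 spans only 3599 seconds and so escapes for now.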
But you wanted Perl. DBI has a handy driver that handles CSV files (DBD::CSV), but the SQL dialect it supports does not include HAVING, and it is very slow. So, plan B.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;

my %datasets;

# Read the log file into a hash table of lists of (time, size) pairs.
while (<>) {
    chomp;
    next if /^\s*#/;    # skip the "# Dataset,Timestamp,dataset size" header
    my ($ds, $time, $size) = split /,/;
    push @{$datasets{$ds}}, [ $time => $size ];
}

# For each dataset listed in the file:
DATASET:
while (my ($ds, $data) = each %datasets) {
    # Sort the list in reverse order of time.
    @$data = sort { $b->[0] <=> $a->[0] } @$data;

    # Get the most recent entry...
    my ($time, $size) = @{ shift @$data };

    # ... and compare it against the rest until...
    for my $rec (@$data) {
        # ... a different size: the dataset is still growing.
        next DATASET if $size != $rec->[1];

        # ... the same size in an entry more than an hour older:
        # the dataset is stale, so report it.
        if ($time - $rec->[0] >= 3600) {
            say $ds;
            next DATASET;
        }
    }
}