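Merging log files with different formats

Time: 2016-03-04 21:06:40

Tags: perl

I have two log files with different date/time formats that I want to merge.

The first file is a standard Apache access_log file, which looks like this: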

  

127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
173.37.239.93 - abcdefg [29/Feb/2016:16:57:52 -0600] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781

... where 'abcdefg' is the user ID.

The second log file has the following format:

  

2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]

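My goals are:

  1. Convert the date/time format in the first log file to the date/time format of the second log file.
  2. Remove the IP address from the first log file, but keep the user ID.
  3. Merge the two log files together.
  4. Sort by date/time.

Here is what I have so far: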

    $LOGFILE1 = "catalina.out";
    $LOGFILE2 = "access_log";
    
    open(LOGFILE1) or die("Could not open log file.");
    foreach $line (<LOGFILE1>) {
        chomp($line);
        if ($line =~ /^2016.+$/) {
             print $line . "\n";
        }
    }
    
    open(LOGFILE2) or die("Could not open log file.");
    foreach $line (<LOGFILE2>) {
        chomp($line);
        if ($line =~ /\d{2}\/\S{3}\/\d{4}:\d{2}:\d{2}:\d{2} -\d{3}/) {
             print $line . "\n";
        }
    
        # format of file 1
        # DD/MMM/YYYY:HH:MM:SS -NNNN
        # 29/Feb/2016:20:03:07 -600
        # format of file 2
        # YYYY-MM-DD HH:MM:SS,NNN
        # 2016-02-12 08:16:03,631
    }
    

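So basically I am only interested in lines that carry date/time information, which is why the code above discards all other lines.

Where I am stuck:
1) How do I convert the date/time format in file 1 to the date/time format of file 2?
2) I am not interested in the IP address, but I do want to keep the user ID. Since file 1 does not begin with the date/time information the way file 2 does, how do I sort by date after converting and merging the two?
Any help would be appreciated!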

2 answers:

Answer 0: (score: 0)

While I won't write the script for you, the general shape of the script should be something like this:

use strict;
use warnings;
use DateTime::Format::Strptime;

sub firstFileLine {
    # parse line as needed, and return a hash reference with 2 keys:
    #   1. `line`: the contents of the line, possibly edited 
    #   2. `ts`: the UTC unix timestamp, via the DateTime::Format::Strptime module
}

sub secondFileLine {
    # similar to `firstFileLine`, return a hash reference
}

my @firstLines = map { firstFileLine($_) } <FILE1>;
my @secondLines = map { secondFileLine($_) } <FILE2>;

my @sorted = map { $_->{line} } sort {$a->{ts} <=> $b->{ts}} (@firstLines, @secondLines);

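Read the documentation on DateTime::Format::Strptime, map, and sort. You are lucky that Perl is one of the best-documented languages out there, so make the most of that fact!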
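As an illustration only, here is one way the two stub subs in the skeleton might be filled in. This is a minimal sketch, not the answerer's code: it swaps in the core module Time::Piece for DateTime::Format::Strptime, ignores the `-0600` offset (assuming both logs record the same local time), and rewrites each access_log line so the converted timestamp comes first. The names `first_file_line` and `second_file_line`, the output layout, and the shortened catalina message are all assumptions.

```perl
use strict;
use warnings;
use Time::Piece;

# Parse one Apache access_log line. Returns a hash reference, or an empty
# list when the line has no recognizable timestamp.
sub first_file_line {
    my ($line) = @_;
    chomp $line;
    my ($user, $stamp, $rest) =
        $line =~ /^\S+ \S+ (\S+) \[(\S+) [^\]]*\]\s*(.*)$/ or return;
    my $t = Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S');
    # Assumed layout: converted timestamp first, IP dropped, user ID kept
    my $iso = $t->strftime('%Y-%m-%d %H:%M:%S');
    return { ts => $t->epoch, line => "$iso $user $rest" };
}

# Parse one catalina.out line; the milliseconds are added to the sort key.
sub second_file_line {
    my ($line) = @_;
    chomp $line;
    my ($stamp, $ms) =
        $line =~ /^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d),(\d{3})/ or return;
    my $t = Time::Piece->strptime($stamp, '%Y-%m-%d %H:%M:%S');
    return { ts => $t->epoch + $ms / 1000, line => $line };
}

# Sample lines taken from the question (the catalina message is shortened)
my @access = (
    '173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781',
    '127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45',
);
my @catalina = (
    '2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive',
);

my @first_lines  = map { first_file_line($_) } @access;
my @second_lines = map { second_file_line($_) } @catalina;

# Numeric sort on the timestamp, then keep only the text of each line
my @sorted = map { $_->{line} }
             sort { $a->{ts} <=> $b->{ts} } (@first_lines, @second_lines);

print "$_\n" for @sorted;
```

Because both subs return an empty list for lines without a timestamp, `map` silently drops those lines, which matches the question's intent of keeping only timestamped entries.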

Answer 1: (score: 0)

Here is a solution using Time::Piece. I used Inline::Files to simulate the 2 files. With real files, you would need to open the log files like this:

my $logfile1 = "catalina.out";
my $logfile2 = "access_log";


open my $log1_fh, '<', $logfile1 or die "Cannot open $logfile1: $!";
open my $log2_fh, '<', $logfile2 or die "Cannot open $logfile2: $!";

The program looks like this, and it gives me what I think is the result you want.

#!/usr/bin/perl
use strict;
use warnings;
use Inline::Files;
use Time::Piece;

my %data;

while (<FILE2>) {
    # get date_time
    my ($dt) = /^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d),/ or next;
    push @{ $data{$dt} }, $_;
}

my $format = '%d/%b/%Y:%H:%M:%S';

while (<FILE1>) {
    # capture the timestamp inside [...]; skip lines that have none
    my ($stamp) = /\[(\S+)/ or next;
    my $t = Time::Piece->strptime($stamp, $format);   # dies if $stamp cannot be parsed

    my $dt = $t->strftime('%Y-%m-%d %H:%M:%S');

    s/^\S+ (?:- )+//;
    s/(?<=\[)[^\]]+/$dt/;
    push @{ $data{$dt} }, $_;
}

for my $dt (sort keys %data) {
    my $aref = $data{$dt};
    print for @$aref;   
}


__FILE1__
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
127.0.0.1 - - [29/Feb/2016:16:57:52 -0600] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
173.37.239.93 - abcdefg [29/Feb/2016:16:57:52 -0600] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
173.37.239.93 - abcdefg [29/Feb/2016:16:57:53 -0600] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781
__FILE2__
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]

I used a hash, %data, to store the lines. The key is the converted date, so that later in the program you can print the lines in sorted order.
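The reason the converted date works as a sort key is that a 'YYYY-MM-DD HH:MM:SS' string sorts chronologically under a plain string sort, while the Apache form does not. The following small sketch shows both points. The two timestamps for 1 March and the "access line"/"catalina line" strings are made-up placeholders, not data from the logs.

```perl
use strict;
use warnings;

# Apache-style stamps start with the day of month, so string order is not time order
my @apache_stamps = ('29/Feb/2016:16:57:52', '01/Mar/2016:00:00:01');
my @iso_stamps    = ('2016-02-29 16:57:52',  '2016-03-01 00:00:01');

my @apache_sorted = sort @apache_stamps;   # 01/Mar/... comes first (wrong order)
my @iso_sorted    = sort @iso_stamps;      # 2016-02-29 ... comes first (right order)

# Hash of arrays keyed by the converted timestamp: lines that share a key
# stay in the order in which they were pushed.
my %data;
push @{ $data{'2016-02-29 16:57:52'} }, 'access line 1', 'access line 2';
push @{ $data{'2016-02-12 08:16:03'} }, 'catalina line 1';
push @{ $data{'2016-02-29 16:57:52'} }, 'access line 3';

my @merged = map { @{ $data{$_} } } sort keys %data;
print "$_\n" for @merged;
```

One consequence of keying on whole seconds is that entries from the two files that fall within the same second are grouped by the order they were read, not by their milliseconds.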

The output of the program above (the one using Inline::Files) is:

2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but get(key) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache], key=AclEntity[ ID=1893033, version=55, aclId=16cf5bc3-27d0-4d50-a93d-3bee1ddd112e, isLatest=true, aclVersion=1, inherits=true, inheritsFrom=1889292, type=1, inheritedAcl=1893034, isVersioned=false, requiresVersion=false, aclChangeSet=1451473]
2016-02-12 08:16:03,630  WARN  [cluster.cache.HazelcastSimpleCache] [http-bio-8443-exec-212] Cluster is inactive but put(k,v) was called for cache HazelcastSimpleCache[cacheName=cache.readersSharedCache]
[2016-02-29 16:57:52] "GET /application/wcs/api/version?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 567
[2016-02-29 16:57:52] "GET /application/wcs/api/node/workspace/SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0/workflow-instances HTTP/1.1" 200 40
[2016-02-29 16:57:52] "GET /application/wcs/cisco/appId?userId=abcdefg&requestType=get HTTP/1.1" 200 45
abcdefg [2016-02-29 16:57:52] "GET /share/page/site/nextgen-edcs/document-details?nodeRef=workspace://SpacesStore/ecd62cfa-fd19-4d6b-b45d-14f0e5b92cf0 HTTP/1.1" 200 124492
abcdefg [2016-02-29 16:57:53] "GET /share/service/messages_69bcdfdb058bb873ff49cc2a10c958b7.js?locale=en_US HTTP/1.1" 200 81698
abcdefg [2016-02-29 16:57:53] "GET /share/res/yui/history/history_543b42a00a378f4d4b6e70c81d915b0a.js HTTP/1.1" 200 5781
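
Note that the `-0600` offset from the access log no longer appears in this output. That is harmless as long as both logs record the same local time, which is an assumption, since the question does not say. If the offset did have to be honored, one possible approach is to parse it by hand and shift the parsed time to UTC. The sketch below shows that idea. `apache_to_utc` is a hypothetical helper, not part of the answer.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: turn an Apache timestamp with its offset into a
# Time::Piece object expressed in UTC.
sub apache_to_utc {
    my ($stamp) = @_;                      # e.g. '29/Feb/2016:16:57:52 -0600'
    my ($local, $sign, $hh, $mm) =
        $stamp =~ /^(\S+) ([+-])(\d\d)(\d\d)$/ or return;
    my $t = Time::Piece->strptime($local, '%d/%b/%Y:%H:%M:%S');
    # Offset in seconds, negative for zones west of UTC
    my $offset = ($hh * 3600 + $mm * 60) * ($sign eq '-' ? -1 : 1);
    return $t - $offset;                   # local = UTC + offset, so UTC = local - offset
}

my $utc = apache_to_utc('29/Feb/2016:16:57:52 -0600');
print $utc->strftime('%Y-%m-%d %H:%M:%S'), "\n";   # prints 2016-02-29 22:57:52
```

The catalina.out timestamps carry no offset at all, so they would have to be shifted by whatever zone that server logs in before the two sets of keys become comparable.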