Question

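I have a log file that runs to gigabytes in size, which I parse into a CSV file for processing and data analysis. When creating the CSV file, I want the date in a specific format.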

Input file:

Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s

Output file:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s

I'm currently doing this with the following awk statement, but it takes hours on files larger than 1 GB.

awk 'BEGIN { OFS = "," } {getDate="date -f \"%b %d %H:%M:%S\" \""$1" "$2" "$3"\" \"+%Y%m%d\",\"%H:%M:%S\""
while ( ( getDate | getline date ) > 0 ) { }
close(getDate);
print date,$4}' inputFile

Can this be optimized further? Is awk the right tool to use here?

Answer 1

You could try this (assuming the year is always the current one):

sed -e 's/\(:[0-9]\{2\}\) /\1,/
# Reset the substitution flag so the first month test does not exit early
t months
:months
s/^Jan \([0-9]*\) /201401\1,/;t
s/^Feb \([0-9]*\) /201402\1,/;t
s/^Mar \([0-9]*\) /201403\1,/;t
s/^Apr \([0-9]*\) /201404\1,/;t
s/^May \([0-9]*\) /201405\1,/;t
s/^Jun \([0-9]*\) /201406\1,/;t
s/^Jul \([0-9]*\) /201407\1,/;t
s/^Aug \([0-9]*\) /201408\1,/;t
s/^Sep \([0-9]*\) /201409\1,/;t
s/^Oct \([0-9]*\) /201410\1,/;t
s/^Nov \([0-9]*\) /201411\1,/;t
s/^Dec \([0-9]*\) /201412\1,/' YourFile

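The t is an optimization: once a month substitution has succeeded, there is no need to test the remaining substitutions on the same line. For pure performance, you can also delete the lines you don't need (if your log only covers one or two months, there is no point testing the others). Note that the patterns expect two-digit days, as in the sample input; a space-padded day such as "Feb  2" would need extra handling.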

Answer 2

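With millions of lines, running the date command once per line is going to be very slow. Anything that avoids that will be faster. One answer suggests sed, which has many merits; another suggests Perl, ditto.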

Using awk, you could look at:

awk 'BEGIN { m["Jan"] = "01"; m["Feb"] = "02"; m["Mar"] = "03";
             m["Apr"] = "04"; m["May"] = "05"; m["Jun"] = "06";
             m["Jul"] = "07"; m["Aug"] = "08"; m["Sep"] = "09";
             m["Oct"] = "10"; m["Nov"] = "11"; m["Dec"] = "12";
           }
           {
             printf "2014%s%02d,%s,", m[$1], $2, $3;
             pad=""
             for (i = 4; i <= NF; i++) { printf("%s%s", pad, $i); pad = " " }
             printf "\n"
           }
    ' log-file

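If you have GNU awk, it has built-in time manipulation functions, but frankly, treating the date information as strings and numbers as shown is very effective.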

Given an input log file like this:

Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s
Jan 31 00:19:50 c,info with spaces,some more info
Feb  2 00:20:41 c,info with spaces,some more info
Mar 13 00:31:32 c,info with spaces,some more info
May  5 00:42:23 c,info with spaces,some more info
Jun 16 00:53:14 c,info with spaces,some more info
Jul 27 00:04:05 c,info with spaces,some more info
Aug  8 00:15:56 c,info with spaces,some more info
Sep 29 00:26:47 c,info with spaces,some more info
Oct 30 00:37:38 c,info with spaces,some more info
Nov 12 00:49:29 c,info with spaces,some more info
Dec 22 00:50:10 c,info with spaces,some more info

It produces output like this:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
20140131,00:19:50,c,info with spaces,some more info
20140202,00:20:41,c,info with spaces,some more info
20140313,00:31:32,c,info with spaces,some more info
20140505,00:42:23,c,info with spaces,some more info
20140616,00:53:14,c,info with spaces,some more info
20140727,00:04:05,c,info with spaces,some more info
20140808,00:15:56,c,info with spaces,some more info
20140929,00:26:47,c,info with spaces,some more info
20141030,00:37:38,c,info with spaces,some more info
20141112,00:49:29,c,info with spaces,some more info
20141222,00:50:10,c,info with spaces,some more info
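
As a variation on the script above, the hard-coded 2014 can be passed in from the shell. The following is only a sketch under assumptions: the function name convert_log and the awk variable year are made up for illustration, and the log is assumed to cover a single year, since the log lines themselves carry no year.

```shell
# convert_log YEAR [FILE...]
# Hypothetical helper (not from the original answers): same string-based
# approach as above, but the year is supplied by the caller.
convert_log() {
    year=$1
    shift
    awk -v year="$year" '
        BEGIN {
            # Build the month-name to two-digit-number lookup table
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (n = 1; n <= 12; n++) m[names[n]] = sprintf("%02d", n)
        }
        {
            # Date and time: month looked up, day zero-padded, time copied as-is
            printf "%s%s%02d,%s,", year, m[$1], $2, $3
            # Remaining fields, re-joined with single spaces
            pad = ""
            for (i = 4; i <= NF; i++) { printf "%s%s", pad, $i; pad = " " }
            printf "\n"
        }
    ' "$@"
}
```

It could be called as convert_log 2014 log-file, or as convert_log "$(date +%Y)" log-file if the log is known to be from the current year; in the second form date runs once in total rather than once per line.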

Answer 3

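I know you didn't tag this with perl, and maybe it isn't an option, but personally I would consider using it. You could do something like this: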

#!/usr/bin/env perl

use strict;
use warnings;

use Time::Piece;

{
    open my $in, "<", "logfile" or die "couldn't open logfile: $!";
    open my $out, ">", "new_logfile" or die "couldn't open new_logfile: $!";

    while (<$in>) {
        chomp;
        # Split into at most 4 fields so spaces inside the data are preserved
        my @cols = split ' ', $_, 4;
        # The log lines carry no year, so 2014 is hard-coded
        my $t = Time::Piece->strptime("$cols[0] $cols[1] 2014", "%b %e %Y");
        print $out join(",", $t->strftime("%Y%m%d"), @cols[2,3]), "\n";
    }
}

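This uses the core Time::Piece module to parse the date from the log file and convert it to the format you need. Using perl without calling any external programs will probably be much faster than what you have now. I hard-coded 2014 because I wasn't sure where the year would come from.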

Answer 4

Here's one way using awk. Run it like this:

awk -f script.awk input.txt

Contents of script.awk:

BEGIN {

    OFS=","
}

{
    i = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1)

    m = sprintf ("%02d", ((i - 1) / 3) + 1)

    print "2014" m $2, $3, $4
}

Results:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
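
The index() arithmetic in the script above can be checked in isolation. This is only an illustrative sketch; the function name month_number is made up here and is not part of the answer. Each month abbreviation occupies three characters of the lookup string, so "Apr" starts at position 10 and (10 - 1) / 3 + 1 gives 4.

```shell
# month_number NAME
# Hypothetical helper: prints the two-digit month number for a
# three-letter English month abbreviation, using the index() trick.
month_number() {
    awk -v name="$1" 'BEGIN {
        # Position of the abbreviation in the packed string: 1, 4, 7, ... 34
        i = index("JanFebMarAprMayJunJulAugSepOctNovDec", name)
        # Every 3 characters is one month, so convert position to month number
        printf "%02d\n", (i - 1) / 3 + 1
    }'
}
```

For example, month_number Apr prints 04 and month_number Dec prints 12. Since index() returns 0 when the name is not found, real input with unexpected first fields may deserve an explicit check.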

Efficiently changing the date format of an existing log file
