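I have a log file that runs to gigabytes in size, which I parse into a CSV file for processing and data analysis. When creating the CSV file, I want the date in a specific format.
Input file: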
Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s
Output file:
20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
Currently I do this with the following awk statement, but it takes hours on files larger than 1 GB.
awk 'BEGIN { OFS = "," } {getDate="date -f \"%b %d %H:%M:%S\" \""$1" "$2" "$3"\" \"+%Y%m%d\",\"%H:%M:%S\""
while ( ( getDate | getline date ) > 0 ) { }
close(getDate);
print date,$4}' inputFile
Can this be optimized further? Is awk the right tool here?
Answer 0 (score: 3)
You can try this (assuming the year is always the current one):
sed -e 's/^Jan \([0-9]*\) \([0-9:]*\) /201401\1,\2,/;t
s/^Feb \([0-9]*\) \([0-9:]*\) /201402\1,\2,/;t
s/^Mar \([0-9]*\) \([0-9:]*\) /201403\1,\2,/;t
s/^Apr \([0-9]*\) \([0-9:]*\) /201404\1,\2,/;t
s/^May \([0-9]*\) \([0-9:]*\) /201405\1,\2,/;t
s/^Jun \([0-9]*\) \([0-9:]*\) /201406\1,\2,/;t
s/^Jul \([0-9]*\) \([0-9:]*\) /201407\1,\2,/;t
s/^Aug \([0-9]*\) \([0-9:]*\) /201408\1,\2,/;t
s/^Sep \([0-9]*\) \([0-9:]*\) /201409\1,\2,/;t
s/^Oct \([0-9]*\) \([0-9:]*\) /201410\1,\2,/;t
s/^Nov \([0-9]*\) \([0-9:]*\) /201411\1,\2,/;t
s/^Dec \([0-9]*\) \([0-9:]*\) /201412\1,\2,/' YourFile
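The t command is an optimization: once a substitution has been made, there is no need to test the other substitutions on the same line. For pure performance, you can remove the lines you don't need (if your log only covers one or two months, there is no point testing for the others). Note that the day is copied as written, so this assumes two-digit days (for example 02, not 2).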
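As a sketch of that pruning, assume the log only covers April and May of 2014 (an assumption made purely for illustration; convert_apr_may is a hypothetical name). The script then shrinks to two substitutions. Each one rewrites the date and the separator after the time in a single step, so t only ever reacts to the month match:

```shell
# Hypothetical helper: converts lines that start with "Apr DD" or "May DD".
# Assumes two-digit days and the year 2014, as in the answer above.
convert_apr_may() {
    sed -e 's/^Apr \([0-9]*\) \([0-9:]*\) /201404\1,\2,/;t
s/^May \([0-9]*\) \([0-9:]*\) /201405\1,\2,/' "$@"
}
```

Lines from any other month pass through unchanged, which makes it easy to spot a month that was pruned by mistake.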
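Answer 1 (score: 2)
With millions of lines, running the date command once per line is going to be very slow. Anything that avoids that will be faster. One answer suggests sed, which has a lot going for it; another suggests Perl, ditto.
With awk, you could consider: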
awk 'BEGIN { m["Jan"] = "01"; m["Feb"] = "02"; m["Mar"] = "03";
             m["Apr"] = "04"; m["May"] = "05"; m["Jun"] = "06";
             m["Jul"] = "07"; m["Aug"] = "08"; m["Sep"] = "09";
             m["Oct"] = "10"; m["Nov"] = "11"; m["Dec"] = "12";
           }
     {
         # Date and time: month number looked up in m, day zero-padded
         printf "2014%s%02d,%s,", m[$1], $2, $3;
         # Re-join the remaining fields with single spaces
         pad = ""
         for (i = 4; i <= NF; i++) { printf("%s%s", pad, $i); pad = " " }
         printf "\n"
     }
    ' log-file
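If you have GNU awk, it has built-in time manipulation functions, but frankly, treating the date information as strings and numbers as shown is very efficient.
Given an input log file like this: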
Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s
Jan 31 00:19:50 c,info with spaces,some more info
Feb 2 00:20:41 c,info with spaces,some more info
Mar 13 00:31:32 c,info with spaces,some more info
May 5 00:42:23 c,info with spaces,some more info
Jun 16 00:53:14 c,info with spaces,some more info
Jul 27 00:04:05 c,info with spaces,some more info
Aug 8 00:15:56 c,info with spaces,some more info
Sep 29 00:26:47 c,info with spaces,some more info
Oct 30 00:37:38 c,info with spaces,some more info
Nov 12 00:49:29 c,info with spaces,some more info
Dec 22 00:50:10 c,info with spaces,some more info
it produces output like this:
20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
20140131,00:19:50,c,info with spaces,some more info
20140202,00:20:41,c,info with spaces,some more info
20140313,00:31:32,c,info with spaces,some more info
20140505,00:42:23,c,info with spaces,some more info
20140616,00:53:14,c,info with spaces,some more info
20140727,00:04:05,c,info with spaces,some more info
20140808,00:15:56,c,info with spaces,some more info
20140929,00:26:47,c,info with spaces,some more info
20141030,00:37:38,c,info with spaces,some more info
20141112,00:49:29,c,info with spaces,some more info
20141222,00:50:10,c,info with spaces,some more info
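For comparison, here is a minimal sketch of the GNU awk route mentioned above, using the gawk extensions mktime and strftime. The function name convert_with_gawk is hypothetical, the year 2014 is still hard-coded, and it only runs where gawk is installed. It is mainly of interest if you need real date arithmetic; for a pure reformat, the string approach does less work per line.

```shell
# Hypothetical helper; requires GNU awk (mktime and strftime are gawk extensions).
# The year 2014 is hard-coded, as in the answer above.
convert_with_gawk() {
    gawk 'BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (n = 1; n <= 12; n++) month[names[n]] = n
    }
    {
        split($3, hms, ":")
        # mktime expects "YYYY MM DD HH MM SS" and returns seconds since the epoch
        ts = mktime("2014 " month[$1] " " $2 " " hms[1] " " hms[2] " " hms[3])
        stamp = strftime("%Y%m%d,%H:%M:%S", ts)
        # Drop the three leading date fields, keeping the payload as it is
        sub(/^[^ ]+ +[^ ]+ +[^ ]+ +/, "")
        print stamp "," $0
    }' "$@"
}
```

Usage would be along the lines of convert_with_gawk log-file > output.csv.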
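Answer 2 (score: 1)
I know you didn't tag this with perl, and maybe it isn't an option, but personally I would consider using it. You could do something like this: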
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Piece;
{
    open my $in, "<", "logfile" or die "couldn't open logfile: $!";
    open my $out, ">", "new_logfile" or die "couldn't open new_logfile: $!";
    while (my $line = <$in>) {
        chomp $line;
        # Month, day, time, then the rest of the line (which may contain spaces)
        my ($mon, $day, $time, $rest) = split ' ', $line, 4;
        my $t = Time::Piece->strptime("$mon $day 2014", "%b %e %Y");
        print $out join(",", $t->strftime("%Y%m%d"), $time, $rest), "\n";
    }
}
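This uses the core Time::Piece module to parse the date from the log file and convert it to the format you need. Using perl without calling any external commands will probably be much faster than what you have now. I hard-coded 2014 because I'm not sure where the year would come from.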
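If the year is known when the conversion runs, one option is to pass it in from the shell instead of hard-coding it. The following is only a sketch in awk, under that assumption; convert_with_year and its argument order are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: first argument is the year, remaining arguments are files.
convert_with_year() {
    year=$1
    shift
    awk -v year="$year" 'BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (n = 1; n <= 12; n++) month[names[n]] = n
    }
    {
        stamp = sprintf("%04d%02d%02d,%s", year, month[$1], $2, $3)
        # Remove the three date fields so spaces inside the payload survive
        sub(/^[^ ]+ +[^ ]+ +[^ ]+ +/, "")
        print stamp "," $0
    }' "$@"
}
```

For example, convert_with_year "$(date +%Y)" logfile > new_logfile uses the current year. A log that spans a New Year boundary would still need extra handling, since the lines themselves carry no year.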
Answer 3 (score: 1)
Here's one way using awk. Run it like:
awk -f script.awk input.txt
Contents of script.awk:
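BEGIN {
    OFS = ","
}
{
    # Each month name starts at position 1, 4, 7, ... in the string
    i = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1)
    m = sprintf("%02d", ((i - 1) / 3) + 1)
    # Zero-pad the day; $4 assumes the payload contains no spaces
    print "2014" m sprintf("%02d", $2), $3, $4
}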
Results:
20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
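The month lookup works because every abbreviation is three characters long, so the names start at positions 1, 4, 7, ... 34 of the 36-character string, and (i - 1) / 3 + 1 maps those positions to 1 through 12. A small sketch to check that arithmetic in isolation (month_number is a hypothetical helper, and the validity checks are an addition not present in the answer):

```shell
# Hypothetical helper: prints the two-digit month number for an English
# three-letter month abbreviation, or fails for anything else.
month_number() {
    awk -v name="$1" 'BEGIN {
        i = index("JanFebMarAprMayJunJulAugSepOctNovDec", name)
        # i is 1 for Jan, 4 for Feb, ..., 34 for Dec; 0 means not found
        if (length(name) != 3 || i == 0 || (i - 1) % 3 != 0) exit 1
        printf "%02d\n", (i - 1) / 3 + 1
    }'
}
```

The extra checks reject strings such as anF, which occur inside the lookup string but are not month names.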