How to read data between all recurring time intervals using awk

Asked: 2015-02-03 09:07:16

Tags: bash unix awk unix-timestamp

My log file is in the following format

[30/Jan/2015:10:10:30 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 425
[30/Jan/2015:10:11:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 261
[30/Jan/2015:10:11:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 232
[30/Jan/2015:10:12:00 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 315
[30/Jan/2015:10:12:29 +0000] 12.30.30.204 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 221
[30/Jan/2015:10:12:57 +0000] 12.30.30.182 xff=- reqId=[-] status_check len=- GET /api/getstatus HTTP/1.1 mi=- ec=- 200 218

Each line in this log file has a timestamp in the first field and a response time in the last field. Is there a way in awk to read the average response time over specific time intervals? For example, the average response time for every five minutes, based on the timestamps in the log file.

Apart from awk, is there any other good alternative? Please suggest.

Update

I tried the following, but it is static and gives the average only for a single interval.

$ grep "30/Jan/2015:10:1[0-4]" mylog.log | awk '{resp+=$NF;cnt++;}END{print "Avg:"int(resp/cnt)}'

But I need this done in five-minute steps for the whole file. Even if I run the command in a loop, how do I pass the date into it dynamically? The log file changes every time, and the dates in it change too.
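One thing I can do is put the pattern into a shell variable instead of hard-coding it, as in the sketch below (the prefix variable is just an illustration), but this still covers only one interval at a time:

prefix="30/Jan/2015:10:1[0-4]"
grep "$prefix" mylog.log | awk '{resp+=$NF; cnt++} END{print "Avg:" int(resp/cnt)}'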

1 Answer:

Answer 0 (score: 3)

Hmm. GNU date doesn't like your date format, so I suppose we'll have to parse it ourselves. I'm thinking of something along these lines (this requires gawk for mktime):
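
(As an aside, GNU date can be coaxed into parsing the stamp with a bit of preprocessing, e.g. a rough sketch like

date -d "$(echo '30/Jan/2015:10:10:30' | sed 's|/| |g; s|:| |')" +%s

but that spawns extra processes per timestamp, so the pure-awk parser below seems preferable.)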

# returns the seconds since epoch that stamp represents. This will be
# the first field in the line, with [] and everything. It's rather
# rudimentary:
function parse_timestamp(stamp) {
  # Split stamp into tokens delimited by [, ], /, : or space
  split(stamp, c, "[][/: ]")

  # reassemble (using the lookup table for the months from below) in a
  # format that mktime understands (then call mktime).
  return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
}
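# For example, for $1 = "[30/Jan/2015:10:10:30" the split yields
# c[2]="30", c[3]="Jan", c[4]="2015", c[5]="10", c[6]="10", c[7]="30",
# and the call becomes mktime("2015 1 30 10 10 30").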

BEGIN {
  # parse_timestamp needs this lookup table.
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
  for(i = 1; i <= length(mnames); ++i) {
    mnums[mnames[i]] = i
  }

  # time is a parameter supplied by you.
  start = parse_timestamp(time)
  end   = start + 300

  if(start == -1) {
    print "Warning: Could not parse timestamp \"" time "\""
  }
}

{ 
  # in each line: parse the timestamp
  curtime = parse_timestamp($1)
}

# if it lies in the interval you want, sum up the last field and increase
# the counter
curtime >= start && curtime < end {
  sum += $NF
  ++count
}

END {
  # and in the end, print the average.
  print "Avg: " (count == 0 ? "undef" : sum / count)
}

Put this into a file, say average.awk, and call it with

awk -v time='[30/Jan/2015:10:11:20 +0000]' -f average.awk foo.log
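
Run against the sample log above, the window [10:11:20, 10:16:20) covers the four entries from 10:11:29 through 10:12:57, so it should print (232 + 315 + 221 + 218) / 4:

Avg: 246.5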

If you are sure that the log file will be sorted in ascending order (which is probably the case), you can make this more efficient by replacing

curtime >= start && curtime < end {
  sum += $NF
  ++count
}

with

curtime >= end {
  exit
}

curtime >= start {
  sum += $NF
  ++count
}

This stops the search for matching log entries once the first entry past the interval is found. (exit still runs the END block in awk, so the average is printed as before.)

Addendum: Since the OP clarified that he wants summaries for all five-minute intervals in a sorted log file, the amended script for that is:

#!/usr/bin/awk -f

function parse_timestamp(stamp) {
  split(stamp, c, "[][/: ]")
  return mktime(c[4] " " mnums[c[3]] " " c[2] " " c[5] " " c[6] " " c[7])
}

BEGIN {
  split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", mnames)
  for(i = 1; i <= length(mnames); ++i) {
    mnums[mnames[i]] = i
  }
}

{ 
  curtime = parse_timestamp($1)
}

NR == 1 {
  # pull the start time from the first line
  start = curtime
  end   = start + 300
}

curtime >= end {
  # the current interval is over: print its result and reset the counters
  print "Avg: " (count == 0 ? "undef" : sum / count)
  sum   = 0
  count = 0
  end  += 300
}

{
  sum += $NF
  ++count
}

END {
  # print once more at the very end for the last, unfinished interval.
  print "Avg: " (count == 0 ? "undef" : sum / count)
}
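
Invoked as, say,

awk -f average.awk mylog.log

all six sample lines fall into the first five-minute window starting at 10:10:30, so it should print a single Avg: 278.667, i.e. (425 + 261 + 232 + 315 + 221 + 218) / 6.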