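awk: command to parse a file, compare each line with the next, and print the result in CSV format

Time: 2015-04-09 11:03:55

Tags: awk

I have the following space-separated input. The first column is the timestamp, followed by the thread ID.

I want to convert the output to a CSV file.

Sample input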

04/09/15,08:49:05.001210  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:05.118592  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:05.225846  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:05.361914  [Dispatch#1 (0x1b3b718)] NOTI  
04/09/15,08:49:05.469372  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:05.569784  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:05.738324  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:05.851328  [Dispatch#1 (0x1b3b718)] NOTI  
04/09/15,08:49:05.965042  [Dispatch#3 (0x1b3b738)] NOTI  
04/09/15,08:49:06.041505  [Dispatch#0 (0x1b3b708)] NOTI  
04/09/15,08:49:06.151353  [Dispatch#2 (0x1b3b728)] NOTI  
04/09/15,08:49:07.814024  [Dispatch#1 (0xb29718)] NOTI   
04/09/15,08:49:07.588469  [Dispatch#1 (0xb29718)] NOTI   
04/09/15,08:49:07.371815  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:49:07.160045  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:49:07.979571  [Dispatch#0 (0xb29708)] NOTI   
04/09/15,08:50:08.385921  [Dispatch#0 (0x120e708)] NOTI  
04/09/15,08:50:08.450522  [Dispatch#3 (0x120e738)] NOTI  
04/09/15,08:50:08.550118  [Dispatch#1 (0x120e718)] NOTI  
04/09/15,08:50:08.600923  [Dispatch#0 (0x120e708)] NOTI  

Output in CSV format

TimeStamp,Thread1,Thread2,Thread3,Thread4    
04/09/15 08:49:05,2,2,2,3    
04/09/15 08:49:06,1,0,1,0    
04/09/15 08:49:07,3,2,0,0    
04/09/15 08:49:08,2,1,0,1

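So I want to print the number of records processed by each thread at a given time.

So in the example above, at 04/09/15 08:49:07, Thread1 (0x1b3b718) has 3 records, Thread2 (0xb29718) has 2 records, and the 3rd and 4th threads have no records.

Please suggest whether this information can be obtained with an awk command.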

1 answer:

Answer 0: (score: 0)

If I understand correctly what you are trying to do, then

awk -F '[,.# ]+' -v OFS=, 'function ts() { return $1 " " $2 } function dump() { print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0 } BEGIN { print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4" } ts() != saved { if(NR != 1) dump(); delete a; saved = ts() } { ++a[$5] } END { dump() }' filename

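is a rough way to do it.

The trick is that with the field separator regex [,.# ]+, each line is split so that the timestamp ends up in fields 1 and 2 and the thread number in field 5. The -v OFS=, option sets the output field separator to a comma so that the output is CSV. Then: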

function ts() {       # function to build a full timestamp as it is printed
  return $1 " " $2    # later
}

function dump() {     # function to print a result line. The +0 is to force
                      # the fields to be numbers, in case one remained empty.
  print saved, a[0]+0, a[1]+0, a[2]+0, a[3]+0
}

BEGIN {               # in the beginning, print the header line.
  print "TimeStamp", "Thread1", "Thread2", "Thread3", "Thread4"
} 

ts() != saved {       # if the timestamp changed:
  if(NR != 1) dump()  # if we're not just starting, print the result for
                      # the last block
  delete a            # discard counters
  saved = ts()        # save new timestamp
}
{ ++a[$5] }           # increase the counter for the thread this line mentions
END { dump() }        # and in the end, print the result for the last block.
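
To check how the separator regex splits a line, something along these lines can help. It is only a sketch: split_fields is a hypothetical helper, not part of the script above, and it simply prints every field of one line with its index.

```shell
# Hypothetical helper (not from the answer): print every field of one line,
# split with the same separator regex the answer uses.
split_fields() {
  printf '%s\n' "$1" |
    awk -F '[,.# ]+' '{ for (i = 1; i <= NF; ++i) printf "$%d = %s\n", i, $i }'
}

# One line taken from the sample input (trailing spaces dropped).
split_fields '04/09/15,08:49:05.001210  [Dispatch#3 (0x1b3b738)] NOTI'
```

For this line, fields 1 and 2 hold the date and the time to the second, and field 5 is the bare thread number 3, which is what the counters in a are indexed by.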

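Addendum re: comment: for a dynamic number of threads, we need two passes over the file. In the first pass we find out how many threads there are, and in the second we print the output. This is because the entries for the first second in the file may not mention every thread. Since this becomes unwieldy as a one-liner, put the following code into a file: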

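#!/usr/bin/awk -f

BEGIN {
  FS  = "[,.# ]+"     # same field separator as in the one-liner
  OFS = ","           # comma-separated output
}

# Build the full timestamp (date and time to the second) as it is printed.
function ts() {
  return $1 " " $2
}

# Print one result line: the saved timestamp followed by one counter per
# thread. %d prints 0 for threads that have no entry in this block.
function dump(    i) {
  printf("%s", saved)
  for(i = 0; i <= threads; ++i) {
    printf("%s%d", OFS, a[i])
  }
  print ""
}

# First pass: NR == FNR is true only while reading the first file argument.
# Remember the largest thread number seen and skip the remaining rules.
NR == FNR {
  threads = ($5 > threads) ? $5 : threads
  next
}

# Second pass, first line: print the header.
FNR == 1 {
  printf("TimeStamp")
  for(i = 0; i <= threads; ++i) {
    printf("%sThread%d", OFS, i + 1)
  }
  print ""
}

# If the timestamp changed, print the previous block (unless we are just
# starting the second pass), then reset the counters.
ts() != saved {
  if(FNR != 1) {
    dump()
  }

  delete a
  saved = ts()
}
{ ++a[$5] }           # count this line for its thread
END { dump() }        # print the result for the last block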

Call it foo.awk, then run

awk -f foo.awk filename filename

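Note that the filename has to be given to awk twice. It works much the same way as before, except that there is a pass before printing that finds the largest thread number, and the printing is done in a loop.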
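
The two-pass idiom, where the same file is named twice and NR == FNR identifies the first pass, can be tried in isolation on a tiny input. The following is a minimal sketch under that assumption; two_pass_max is a hypothetical helper and the three-line input is made up for illustration.

```shell
# Hypothetical helper: read the file named in $1 twice.
# Pass 1 finds the largest value in column 1; pass 2 prints each value
# next to that maximum.
two_pass_max() {
  awk '
    NR == FNR { if ($1 + 0 > max) max = $1 + 0; next }  # first pass only
    { print $1, max }                                    # second pass
  ' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' 1 3 2 > "$tmp"   # made-up input: one number per line
two_pass_max "$tmp"
rm -f "$tmp"
```

Because the file is opened twice, this approach needs a real file; data arriving on a pipe cannot be re-read this way and would have to be saved to a temporary file first.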