Improving an awk command: count and sum

Time: 2014-06-04 07:25:04

Tags: awk

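I would like suggestions for improving this command, and I want to remove any unnecessary execution to save time. What I actually want is the count of lines and the sum of $6, grouped by $2, substr($3,4,6), substr($4,4,6), $10, $8, $6.

The gzipped input file contains about 300 million rows.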

Input.gz

2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21

I am using the following command, and it works correctly.

   zcat Input.gz | awk -F"," '{OFS=","; print $2,substr($3,4,6),substr($4,4,6),$10,$8,$6}'  | \
awk -F"," 'BEGIN {count=0; sum=0; OFS=","} {key=$0; a[key]++;b[key]=b[key]+$6} \
END {for (i in a) print i,a[i],b[i]}' >Output.txt

Output.txt

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150

Any suggestions for improving the command above are welcome.

2 answers:

Answer 0 (score: 1)

This seems to be more efficient:

zcat Input.gz | awk -F, '{key=$2","substr($3,4,6)","substr($4,4,6)","$10","$8","$6;++a[key];b[key]=b[key]+$6}END{for(i in a)print i","a[i]","b[i]}'

Output:

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150

Expanded form:

zcat Input.gz | awk -F, '{
    key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
    ++a[key]
    b[key] = b[key] + $6
}
END {
    for (i in a)
        print i "," a[i] "," b[i]
}'
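The single-pass command can be checked on the five sample rows from the question before running it on the full file. This is only a minimal sketch: the `sample_rows` function and the `result` variable are hypothetical names that stand in for `zcat Input.gz` and `Output.txt`, and `sort` is added because awk does not specify the order in which `for (i in a)` visits the keys.

```shell
# Sample rows copied from the question; this hypothetical helper
# stands in for `zcat Input.gz`.
sample_rows() {
cat <<'EOF'
2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
EOF
}

# Same single-pass aggregation as above: a[] counts lines per key,
# b[] sums $6 per key. The sort only makes the output order repeatable.
result=$(sample_rows | awk -F, '{
    key = $2 "," substr($3, 4, 6) "," substr($4, 4, 6) "," $10 "," $8 "," $6
    ++a[key]
    b[key] += $6
}
END {
    for (i in a)
        print i "," a[i] "," b[i]
}' | LC_ALL=C sort)

printf '%s\n' "$result"
```

Apart from the line order, the three lines printed should match the output shown above.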

Answer 1 (score: 0)

You can do this with a single awk invocation by redefining the fields to match what the first awk script prints, i.e.:

$1 = $2
$2 = substr($3, 4, 6)
$3 = substr($4, 4, 6)
$4 = $10
$5 = $8

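There is no need to change $6, since it is the same field. If you now base the key on the new fields, the second script barely changes. Here is how I would write it, moving the code into a script file for better readability and maintainability: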

zcat Input.gz | awk -f parse.awk

Where parse.awk contains:

BEGIN {
  FS = OFS = ","
}

{ 
  $1 = $2
  $2 = substr($3, 4, 6)
  $3 = substr($4, 4, 6)
  $4 = $10
  $5 = $8

  key = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
  a[key]++
  b[key] += $6
}

END {
  for (i in a) 
    print i, a[i], b[i]
}
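To see what the field reassignment does to a record, the same five assignments can be applied to the first sample row from the question and the rebuilt record printed. This is a minimal sketch; the `row` and `rebuilt` variable names are only for illustration. The order of the assignments matters: each right-hand side reads a field that has not been overwritten yet.

```shell
# First sample row from the question.
row='2067,0,09-MAY-12.04:05:14,09-MAY-12.04:05:14,21-MAR-16,600,INR,RO312,20120321_1C,K1,,32'

# Apply the same reassignments as parse.awk, then print the rebuilt record.
# Assigning to a field makes awk rebuild $0 using OFS.
rebuilt=$(printf '%s\n' "$row" | awk 'BEGIN { FS = OFS = "," }
{
  $1 = $2
  $2 = substr($3, 4, 6)
  $3 = substr($4, 4, 6)
  $4 = $10
  $5 = $8
  print
}')

printf '%s\n' "$rebuilt"
# prints: 0,MAY-12,MAY-12,K1,RO312,600,INR,RO312,20120321_1C,K1,,32
```

Only the first six fields of the rebuilt record belong to the grouping key, which is why the key is assembled from $1 through $6 rather than taken from $0.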

You can of course run it as a one-liner, but it looks more cryptic:

zcat Input.gz | awk '{ key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6; a[key]++; b[key]+=$6 } END { for (i in a) print i,a[i],b[i] }' FS=, OFS=,

Output in both cases:

0,MAY-14,MAY-14,K1,RO414,600,3,1800
0,MAY-12,MAY-12,K1,RO312,600,1,600
5,JAN-13,,K2,RO113,4150,1,4150
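The one-liner passes `FS=,` and `OFS=,` as trailing operands. A possible variant, not part of the original answer, sets them with `-v` instead, so the assignments take effect before any `BEGIN` block runs. This is a sketch on three of the sample rows from the question, with the hypothetical names `three_rows` and `summary`, and with `sort` added only to make the line order repeatable.

```shell
# Three sample rows from the question (two identical ones plus the K2 row).
three_rows() {
cat <<'EOF'
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2160,0,26-MAY-14.02:05:27,26-MAY-14.02:05:27,18-APR-18,600,INR,RO414,20140418_7,K1,,30
2104,5,13-JAN-13.01:01:38,,13-JAN-17,4150,INR,RO113,CD1301_RC50_B1_20130113,K2,,21
EOF
}

# Same program as the one-liner, with the separators set through -v.
summary=$(three_rows | awk -v FS=, -v OFS=, '
  { key = $2 FS substr($3,4,6) FS substr($4,4,6) FS $10 FS $8 FS $6
    a[key]++; b[key] += $6 }
  END { for (i in a) print i, a[i], b[i] }' | LC_ALL=C sort)

printf '%s\n' "$summary"
```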