awk如何提取其他字段“如果在字段中重复”,只有先前的字段相同且大于零

时间:2011-08-04 16:49:39

标签: linux shell scripting awk duplicates

我正在尝试在标签限定文件中查看来自特定字段(第1列和第4列)的重复行,并从重复字段块的第一行和最后一行中提取特定列;只有前面的字段相同且值都高于0.例如:

如果两列($ 1和$ 4)在其他位置散布的不同位置相同,则需要将它们视为单独的块

示例输入:

  1  tmp1   153446387   153446446   -0.2    1.0888042
  2  tmp1   153446925   153446973   0   0.87891006
  3  tmp1   153451902   153451951   1.43854 1.2709045
  4  tmp1   153454056   153454105   1.43854 1.4132746
  5  tmp1   153456192   153456250   1.43854 0.87553155
  6  tmp1   153458717   153458776   1.335858    1.1829022
  7  tmp1   153460782   153460841   1.335858    0.006651476
  8  tmp1   153462035   153462094   0   0.13484457
  9  tmp1   153463690   153463749   1.43854 0.45511296
 10  tmp1   153467589   153467673   1.43854 1.4431274
 11  tmp1   153467873   153468632   0.31841 1.70443
 12  tmp1   154451904   154451951   1.43854 1.3709045
 13  tmp1   154454054   154454109   1.43854 1.132746
 14  tmp1   154456194   154456259   1.43854 0.8553
 15  tmp2   153472147   153472194   1.43854 0.99288875
 16  tmp2   153476511   153476559   0   0.99288875

输出:

tmp1    153451902   153456250   1.43854
tmp1    153458717   153460841   1.335858
tmp1    153463690   153467673   1.43854
tmp1    154451904   154456259   1.43854
tmp2    153472147   153472194   1.43854

关于如何解决这个问题的任何想法

1 个答案:

答案 0 :(得分:2)

awk '
    BEGIN {OFS = FS = "\t"}
    function output(key, ary) {
        split(key, ary, FS)
        print ary[1], start, end, ary[2]
    }
    $4 <= 0 {next}
    key != $1 FS $4 {
        if (end) {output(key)}
        key = $1 FS $4
        start = $2
    }
    {end = $3}
    END {output(key)}
' filename