Question

I have a tab-delimited file:

2L      31651   31752   60      -       18
2L      31660   31761   60      -       18
2L      31685   31786   60      -       18
2L      55854   55955   60      +       33
2L      67008   67109   60      -       37
2L      68606   68707   60      -       41
2L      83548   83649   60      +       56
2L      155486  155587  60      +       118
2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131
2L      327889  327990  60      -       283
2L      327908  328009  60      -       283
2L      329343  329444  60      -       284

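Column 6 gives the cluster each row belongs to. I want to keep only the rows whose cluster has more than 3 members. For example, the first 3 rows all belong to one cluster (cluster 18), which has only 3 members and so should be dropped.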

I am trying awk -F "\t" '++a[$6] > 3', but it doesn't work the way I expected. For the example above, the expected output is the one cluster with eight rows:

2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131

Any help would be appreciated.

Answer 1

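Your attempt prints a line only once the running count for its cluster has already passed 3, so the first three lines of every cluster are lost. One way around that is to make two passes over the file, counting on the first and printing on the second: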

awk 'NR==FNR{a[$6]++;next}a[$6]>3' file file

With some comments added, it is easy to see what happens:

awk ' NR == FNR { # For the lines of the first file
         a[$6]++  # increment the number of times we found word $6
         next     # skip to the next record, so the following is
      }           # executed only on the second file:
      a[$6]>3     # print the current line if the counter for word $6 is 
                  # above 3
     ' file file  # input the file twice
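
If the input arrives on a pipe and cannot be read twice, a single pass that buffers the lines is one alternative. The following is a minimal sketch rather than part of the answer above: the function name filter_big_clusters and the reordered sample rows are assumptions made for illustration, and the whole input is held in memory.

```shell
# Hypothetical helper: keep rows whose column-6 value occurs more than 3 times.
# Reads stdin once, so it can sit at the end of a pipeline.
filter_big_clusters() {
    awk '
    {
        count[$6]++        # tally the members of each cluster
        line[NR] = $0      # remember every line in input order
        key[NR]  = $6      # and the cluster it belongs to
    }
    END {
        # replay the buffered lines, keeping only the large clusters
        for (n = 1; n <= NR; n++)
            if (count[key[n]] > 3)
                print line[n]
    }'
}

# Illustrative sample: rows taken from the question, deliberately interleaved
# (cluster 18 has 3 rows, cluster 131 has 4).
printf '2L\t%s\t%s\t60\t-\t%s\n' \
    31651  31752  18 \
    169998 170099 131 \
    31660  31761  18 \
    170000 170101 131 \
    170015 170116 131 \
    31685  31786  18 \
    170025 170126 131 | filter_big_clusters
```

Because the count is complete before anything is printed, the rows of a cluster do not have to be adjacent; here the four cluster-131 rows come out in their original order.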

Answer 2

Another one in awk:

$ awk '
$6==p || NR==1 {           # check if $6 hasn't changed (compare to p)
    b=b (b==""?"":ORS) $0  # gather buffer 
    p=$6                   # set p
    i++                    # counter
    next }                 # next record
{                          # $6 has changed:
    p=$6                   # set p
    if(i>3)                # if counter > 3
        print b            # output buffer
    b=$0                   # and initialize
    i=1 }                  # counter too
END {                      # in the end
    if(i>3)                # if needed
        print b }          # flush buffer
' file
2L      169998  170099  60      -       131
2L      170000  170101  60      -       131
2L      170015  170116  60      -       131
2L      170025  170126  60      -       131
2L      170055  170156  60      -       131
2L      170062  170163  60      -       131
2L      170067  170168  60      -       131
2L      170116  170217  60      -       131

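It reads the input only once, so it can also read from a pipe. Note that it assumes the rows of each cluster are adjacent, as they are in the sample.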
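
As a rough illustration of the pipe case, the program can be fed by sort so that each cluster forms one contiguous run. This is only a sketch: the wrapper name print_long_runs, the temporary file, and the shuffled sample rows are assumptions for the example, and the awk body is the same logic as above in condensed form.

```shell
# Hypothetical wrapper around the run-based program; reads stdin.
print_long_runs() {
    awk '
    $6 == p || NR == 1 { b = b (b == "" ? "" : ORS) $0; p = $6; i++; next }
    { if (i > 3) print b; b = $0; p = $6; i = 1 }   # run ended: flush or drop
    END { if (i > 3) print b }                      # flush the final run
    '
}

# Illustrative input: rows from the question in shuffled order.
sample=$(mktemp)
printf '2L\t%s\t%s\t60\t-\t%s\n' \
    170000 170101 131 \
    31651  31752  18 \
    170015 170116 131 \
    31660  31761  18 \
    169998 170099 131 \
    170025 170126 131 > "$sample"

# Sort numerically on column 6 so each cluster is contiguous, then filter.
# Ties fall back to comparing whole lines; GNU sort offers -s to keep input order.
sort -k6,6n "$sample" | print_long_runs
rm -f "$sample"
```

Sorting changes the overall row order, so this fits best when the output may be grouped by cluster anyway.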
