I have a tab-delimited file:
2L 31651 31752 60 - 18
2L 31660 31761 60 - 18
2L 31685 31786 60 - 18
2L 55854 55955 60 + 33
2L 67008 67109 60 - 37
2L 68606 68707 60 - 41
2L 83548 83649 60 + 56
2L 155486 155587 60 + 118
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
2L 327889 327990 60 - 283
2L 327908 328009 60 - 283
2L 329343 329444 60 - 284
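Column 6 shows the cluster each row belongs to. I only want to keep the rows of clusters that have more than 3 members. For example, the first 3 rows all belong to one cluster (cluster 18), which has exactly 3 members, so they should be dropped.
I am trying awk -F "\t" '++a[$6] > 3'
but it doesn't work the way I expected. The expected output for the example above is the one cluster with eight rows: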
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
Any help would be greatly appreciated.
Answer 0 (score: 1)
One way is to make two passes over the file:
awk 'NR==FNR{a[$6]++;next}a[$6]>3' file file
With some comments added, it is easy to see what happens:
awk '
NR == FNR {      # while reading the first copy of the file
    a[$6]++      # count how many times cluster $6 occurs
    next         # skip to the next record, so the rule below
}                # runs only on the second copy:
a[$6] > 3        # print the line if the count for cluster $6
                 # is above 3
' file file      # pass the same file twice
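To make the difference from the attempt in the question concrete, here is a small sketch on a reduced sample. The file name `sample.tsv` and the choice of rows are only for illustration. The expression `++a[$6] > 3` tests the running count at the moment a row is read, so it can only print the fourth and later rows of a cluster, whereas the two-pass version already knows the final count when it prints.

```shell
# Build a small tab-separated sample (hypothetical file name):
# cluster 18 has 3 rows, cluster 131 has 5 rows.
printf '2L\t%s\t%s\t60\t-\t%s\n' \
  31651 31752 18 \
  31660 31761 18 \
  31685 31786 18 \
  169998 170099 131 \
  170000 170101 131 \
  170015 170116 131 \
  170025 170126 131 \
  170055 170156 131 > sample.tsv

# Attempt from the question: prints a row only once its cluster's
# running count has already passed 3.
running=$(awk -F '\t' '++a[$6] > 3' sample.tsv)

# Two-pass version: the first pass counts, the second pass filters.
two_pass=$(awk -F '\t' 'NR==FNR{a[$6]++;next} a[$6]>3' sample.tsv sample.tsv)

printf 'running count:\n%s\n\n' "$running"
printf 'two passes:\n%s\n' "$two_pass"
```

On this sample the first command keeps only the last two rows of cluster 131, while the second keeps all five.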
Answer 1 (score: 1)
Another one in awk:
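$ awk '
$6==p || NR==1 {             # $6 is unchanged (same as p), or first record
    b=b (b==""?"":ORS) $0    # append the record to the buffer
    p=$6                     # remember the current cluster
    i++                      # count the records in the buffer
    next                     # go to the next record
}
{                            # $6 has changed:
    p=$6                     # remember the new cluster
    if(i>3)                  # if the finished cluster had more than 3 rows
        print b              # output the buffer
    b=$0                     # start a new buffer with this record
    i=1                      # and reset the counter
}
END {                        # at the end of input
    if(i>3)                  # if the last cluster is large enough
        print b              # flush the buffer
}' file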
2L 169998 170099 60 - 131
2L 170000 170101 60 - 131
2L 170015 170116 60 - 131
2L 170025 170126 60 - 131
2L 170055 170156 60 - 131
2L 170062 170163 60 - 131
2L 170067 170168 60 - 131
2L 170116 170217 60 - 131
Since it makes a single pass over the input, it can also read from a pipe.
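The buffering approach compares each row only with the previous one, so it relies on the rows of a cluster being adjacent, as they are in the sample. If that cannot be guaranteed and the input still has to come from a pipe, one possible variant is to hold all rows in memory and filter them at the end. This is only a sketch: the function name `keep_big_clusters` is made up for illustration, and it assumes the input fits in memory.

```shell
# Hypothetical helper: read tab-separated rows from stdin and keep those
# whose column-6 value occurs more than 3 times anywhere in the input.
keep_big_clusters() {
  awk -F '\t' '
    {
        line[NR] = $0        # remember every row in input order
        key[NR]  = $6        # and the cluster it belongs to
        count[$6]++          # count the members of each cluster
    }
    END {
        for (n = 1; n <= NR; n++)        # replay the rows in input order
            if (count[key[n]] > 3)
                print line[n]
    }'
}
```

It would be used at the end of a pipeline, for example `cat file | keep_big_clusters`.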