I have a large file (my_file.txt) with ~8,000,000 lines, like this:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs374183434 0 NA -2.22383195384362
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
I want to find duplicates based on the first three columns and then remove the line with the lower value in column 7. The first part I can accomplish with:
awk -F"\t" '!seen[$2, $3]++' my_file.txt
But I don't know how to handle the part about removing the duplicate with the lower value. The desired output would be this:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
Speed is a concern, so I can use awk, sed, or other bash commands. Thanks!
Answer (score: 3):
$ awk '(i=$1 FS $2 FS $3) && !(i in seventh) || seventh[i] < $7 {seventh[i]=$7; all[i]=$0} END {for(i in all) print all[i]}' my_file.txt
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13110 13110 rs540538026 0 NA -1.33177622457982
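Note that the output above is not in input order: awk's for (i in all) visits array indices in an unspecified order. If the first-seen order of the keys should be preserved, here is a minimal sketch of a variant (the order/n names are mine, not part of the original answer):

$ awk '{ i = $1 FS $2 FS $3 }
       !(i in seventh) { order[++n] = i }                      # remember the first-seen order of each key
       !(i in seventh) || seventh[i] < $7 { seventh[i] = $7    # new biggest value for this key
                                            all[i] = $0 }      # keep that whole record
       END { for (k = 1; k <= n; k++)                          # replay the keys in first-seen order
                 print all[order[k]] }
      ' my_file.txt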
Thanks to @fedorqui for the advanced indexing. :D
Explanation:
(i=$1 FS $2 FS $3) &&      # set the index to the first 3 fields
!(i in seventh) ||         # AND the index is not yet stored in the array
seventh[i] < $7 {          # OR the seventh field is greater than the value previously stored for the same index:
    seventh[i]=$7          # new biggest value
    all[i]=$0              # store that record
}
END {
    for(i in all)          # for every stored record with the biggest seventh value
        print all[i]       # print it
}