How to remove duplicate rows and create an index in awk

Asked: 2016-04-27 17:05:56

Tags: linux awk

I have a tab-delimited file that looks like this:

CNV_chr1_12623251_12632176  8925    3   RR123   XX
CNV_chr1_13398757_13402091  3334    4   RR123   YY
CNV_chr1_13398757_13402091  3334    4   RR224   YY
CNV_chr1_14001365_14004064  2699    1   RR123   YX
CNV_chr1_14001365_14004064  2699    1   RR224   YX

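Columns $1 and $2 are the same in the duplicated rows. In that case I need to remove the duplicate rows by collapsing the values of column 4 into a comma-separated list in $4, and add an extra column $5 holding the number of strings in that list. Sample output looks like this: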

CNV_chr1_12623251_12632176  8925    3   RR123    1    XX
CNV_chr1_13398757_13402091  3334    4   RR123,RR224    2    YY
CNV_chr1_14001365_14004064  2699    1   RR123,RR224    2    YX

Any efficient solution would be helpful.

1 answer:

Answer 0: (score: 1)

Try this:

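awk '($1 in ar){br[$1]=br[$1]","$4; next}        # repeated key: append $4 to its list
     {br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}     # first occurrence: store the row with a placeholder in $4
     END{for(key in ar){c=split(br[key],s,",")   # c = number of values collected for this key
                        gsub("REPLACE_ME", br[key] FS c, ar[key])
                        print ar[key]}}' test.txt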

Output:

CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
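
The rows above come out in a different order from the input because awk does not guarantee any particular traversal order for `for (key in ar)`. If input order matters, one possible variant is sketched below. It is not the answerer's method: the function name `collapse_in_order` and the array names are made up for illustration, and it assumes that columns 2, 3 and 5 are identical across rows sharing the same $1, as in the sample data.

```shell
# Hypothetical helper: collapse rows that share column 1, keeping first-seen order.
# Assumes columns 2, 3 and 5 do not differ between duplicates of the same key.
collapse_in_order() {
    awk '
    {
        if ($1 in ids) {
            ids[$1] = ids[$1] "," $4      # repeated key: extend the comma-separated list
        } else {
            order[++n] = $1               # new key: remember when it first appeared
            ids[$1] = $4
        }
        cnt[$1]++                         # number of column-4 values for this key
        c2[$1] = $2; c3[$1] = $3; c5[$1] = $5
    }
    END {
        for (i = 1; i <= n; i++) {        # walk the keys in first-seen order
            k = order[i]
            print k, c2[k], c3[k], ids[k], cnt[k], c5[k]
        }
    }' "$@"
}
```

It can be called as `collapse_in_order test.txt`, or fed from a pipe. The count is tracked directly in `cnt` rather than recovered with `split`, which avoids the placeholder substitution.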

For tab-delimited input, just add -F"\t" to awk:

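awk -F"\t" '($1 in ar){br[$1]=br[$1]","$4; next}        # repeated key: append $4 to its list
            {br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}     # first occurrence: store the row with a placeholder in $4
            END{for(key in ar){c=split(br[key],s,",")   # c = number of values collected for this key
                        gsub("REPLACE_ME", br[key] FS c, ar[key])
                        print ar[key]}}' test.txt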

which gives:

CNV_chr1_14001365_14004064 2699 1 RR123,RR224   2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224   2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
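
In this output only the separator in front of the count is a tab. `-F` sets the input separator `FS`, but assigning to `$4` rebuilds the record using the output separator `OFS`, which is still a single space by default. If a fully tab-separated result is wanted, setting both separators should work. The sketch below wraps that in a function whose name, `collapse_tsv`, is invented here. It also uses `sub` instead of `gsub`, since the placeholder occurs once per stored row.

```shell
# Hypothetical helper: same placeholder approach, but with FS and OFS both set to a tab
# so that the rebuilt records and the inserted count column are all tab-separated.
collapse_tsv() {
    awk 'BEGIN { FS = OFS = "\t" }
    ($1 in ar) { br[$1] = br[$1] "," $4; next }          # repeated key: append $4
    { br[$1] = $4; $4 = "REPLACE_ME"; ar[$1] = $0 }      # first occurrence: store row
    END {
        for (key in ar) {                                # traversal order is unspecified
            c = split(br[key], s, ",")                   # count the collected values
            sub("REPLACE_ME", br[key] OFS c, ar[key])
            print ar[key]
        }
    }' "$@"
}
```

Usage would be `collapse_tsv test.txt`. Piping the result through `sort` gives a stable order if that is needed.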