Compare two files on two columns, but keep both duplicate lines with a pattern

Time: 2014-07-09 10:32:22

Tags: unix awk

file1:

scaffold2232_size19577   gene       8878    9258
scaffold2232_size19577   CDS        8878    9258
scaffold2232_size19577   gene       10631   14562
scaffold2232_size19577   intron     10693   11242
scaffold2232_size19577   intron     11343   14252
scaffold2232_size19577   intron     14346   14499
scaffold2232_size19577   CDS        10631   10692
scaffold2232_size19577   CDS        11243   11342
scaffold2232_size19577   CDS        14253   14345
scaffold2232_size19577   CDS        14500   14562
scaffold2232_size19577   gene       18807   19055
scaffold2232_size19577   CDS        18807   19055

file2:

scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   10631  14562   Os12t0508300-01
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00

Desired output:

scaffold2232_size19577   8878   9258    Os12t0508300-01 gene
scaffold2232_size19577   8878   9258    Os12t0508300-01 CDS 
scaffold2232_size19577   10631  14562   Os12t0508300-01 gene
scaffold2232_size19577   10693  11242   Os12t0508300-01 intron
scaffold2232_size19577   11343  14252   Os12t0508300-01 intron
scaffold2232_size19577   14346  14499   Os12t0508400-00 intron
scaffold2232_size19577   10631  10692   Os12t0508300-01 CDS
scaffold2232_size19577   11243  11342   Os12t0508300-01 CDS
scaffold2232_size19577   14253  14345   Os12t0508400-00 CDS
scaffold2232_size19577   14500  14562   Os12t0508400-00 CDS
scaffold2232_size19577   18807  19055   Os12t0508400-00 gene
scaffold2232_size19577   18807  19055   Os12t0508400-00 CDS

I tried: awk '{a[$1,$2,$3]=$0}END{for(i in a) print a[i]}' file2

But with this I lose one line of each gene/CDS pair, because both lines have the same coordinates in columns 2 and 3 of file2. The output comes out as:

scaffold2232_size19577    8878  9258    Os12t0508300-01 
scaffold2232_size19577   10631  14562   Os12t0508300-01 
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00

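I thought I could add column 2 of file1 to file2 afterwards, but this awk command reduces the number of lines, so the two files no longer line up and I cannot append the column. I would like the result to look like the desired output above.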
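A minimal sketch of why the attempt drops lines, using three rows copied from file2: an awk array holds one value per key, so every line with the same (column 1, column 2, column 3) triple overwrites the one stored before it. The variable name `kept` is only for illustration.

```shell
# Three rows from file2; the first two share the same scaffold/start/end key.
kept=$(printf '%s\n' \
  'scaffold2232_size19577 8878 9258 Os12t0508300-01' \
  'scaffold2232_size19577 8878 9258 Os12t0508300-01' \
  'scaffold2232_size19577 10631 14562 Os12t0508300-01' |
  awk '{a[$1,$2,$3]=$0}                      # same key: the later line replaces the earlier one
       END{n=0; for (i in a) n++; print n}') # count the distinct keys that survived

echo "lines in: 3, lines kept: $kept"
```

Three lines go in and two come out, which is the same loss seen above. Note also that `for (i in a)` visits keys in an unspecified order, so the original line order is not guaranteed to be preserved.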

1 answer:

Answer 0 (score: 1)

Something like this?

awk 'FNR==NR {a[$2FS$3]=$4;next} {print $1,$3,$4,a[$3FS$4],$2}' OFS="\t" f2 f1
scaffold2232_size19577  8878    9258    Os12t0508300-01 gene
scaffold2232_size19577  8878    9258    Os12t0508300-01 CDS
scaffold2232_size19577  10631   14562   Os12t0508300-01 gene
scaffold2232_size19577  10693   11242   Os12t0508300-01 intron
scaffold2232_size19577  11343   14252   Os12t0508300-01 intron
scaffold2232_size19577  14346   14499   Os12t0508400-00 intron
scaffold2232_size19577  10631   10692   Os12t0508300-01 CDS
scaffold2232_size19577  11243   11342   Os12t0508300-01 CDS
scaffold2232_size19577  14253   14345   Os12t0508400-00 CDS
scaffold2232_size19577  14500   14562   Os12t0508400-00 CDS
scaffold2232_size19577  18807   19055   Os12t0508400-00 gene
scaffold2232_size19577  18807   19055   Os12t0508400-00 CDS
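How this works, for readers new to the idiom: `FNR==NR` is true only while awk reads the first file on the command line (f2), so the first block stores the ID from column 4 under a start/end key and skips to the next line. The second block then runs once per line of f1 and looks the ID up. Because the output is driven by f1, which has one line per feature, the gene and CDS lines with identical coordinates both survive.

The key above leaves out the scaffold name, which is fine for this single-scaffold sample. If the real files contain several scaffolds (an assumption, the question does not say), identical coordinates on different scaffolds would collide. A hedged variant that also keys on column 1 might look like the sketch below; the scratch directory, the shortened sample files and the array name `id` are made up for illustration.

```shell
# Work in a scratch directory so the sample files do not clobber anything.
dir=$(mktemp -d)

# A few rows of file1: scaffold, feature type, start, end.
cat > "$dir/f1" <<'EOF'
scaffold2232_size19577   gene       8878    9258
scaffold2232_size19577   CDS        8878    9258
scaffold2232_size19577   intron     14346   14499
EOF

# Matching rows of file2: scaffold, start, end, ID (duplicates included).
cat > "$dir/f2" <<'EOF'
scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
EOF

# First pass (f2): remember the ID for each scaffold/start/end triple.
# Second pass (f1): print one line per feature, with the ID and the type appended.
result=$(awk 'FNR==NR {id[$1,$2,$3]=$4; next}
              {print $1, $3, $4, id[$1,$3,$4], $2}' OFS='\t' "$dir/f2" "$dir/f1")

printf '%s\n' "$result"
rm -r "$dir"
```

A feature in f1 whose coordinates never appear in f2 is still printed, with an empty ID field, so a missing match shows up as a gap in the output.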