Fill in missing values from a second or third file (bash)

Time: 2017-07-18 10:50:29

Tags: bash awk

I have the following three files:

list1.txt

AB0001  COG0593
AB0002  COG0592
AB0003  COG1195
AB0005  COG1005
AB0006  COG5621
AB0007  COG4591
AB0008  COG1136
AB0009  COG0071
AB0010  COG3212

list2.txt

AB0001  COG0593
AB0002  COG0592
AB0003  COG1195
AB0004  
AB0005  
AB0006  COG5621
AB0007  COG3127
AB0008  COG1136
AB0009  COG0071
AB0010  COG3212

list3.txt

AB0001  COG0593
AB0002  COG0592
AB0003  COG1195
AB0004  COG5146
AB0005  NOG84439
AB0006  COG5621
AB0007  COG0577
AB0008  COG1136
AB0009  COG0071
AB0010  NOG218375

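I want to fill in the missing values using column 2 of the other lists, matched on the first column (AB00[01-10]), where list1 has the highest priority, list2 the second highest, and list3 the lowest. The desired output is: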

AB0001  COG0593
AB0002  COG0592
AB0003  COG1195
AB0004  COG5146
AB0005  COG1005
AB0006  COG5621
AB0007  COG4591
AB0008  COG1136
AB0009  COG0071
AB0010  COG3212

In other words, list1 should serve as the base; if a value is missing there, take it from list2, and if it is missing in list2 as well, take it from list3.

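2 answers:

Answer 0: (score: 2)

Process the files in reverse order of priority, so that values from higher-priority files overwrite those from lower-priority ones. The NF > 1 condition ensures that lines with a missing value are ignored.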

$ awk 'BEGIN {OFS="\t"} NF > 1 {a[$1] = $2} END {for (i in a) print i, a[i]}' list3.txt list2.txt list1.txt | sort
AB0001 COG0593
AB0002 COG0592
AB0003 COG1195
AB0004 COG5146
AB0005 COG1005
AB0006 COG5621
AB0007 COG4591
AB0008 COG1136
AB0009 COG0071
AB0010 COG3212
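
A variant of the same idea, as a minimal sketch: read the files highest priority first and keep the first non-empty value seen for each key, so the argument order matches the stated priority. It relies on awk's default whitespace splitting, under which a line holding only a key has an empty `$2`. The `fill_missing` function name and the shortened demo files are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: first non-empty value wins; pass files highest priority first.
fill_missing() {
    awk '
        BEGIN { OFS = "\t" }
        # Store a value only if it is non-empty and the key is not yet set.
        $2 != "" && !($1 in seen) { seen[$1] = $2 }
        # for-in order is unspecified, hence the sort afterwards.
        END { for (k in seen) print k, seen[k] }
    ' "$@" | sort
}

# Shortened, hypothetical demo data in a temporary directory.
tmp=$(mktemp -d)
printf 'AB0001\tCOG0593\nAB0005\tCOG1005\nAB0007\tCOG4591\n' > "$tmp/list1.txt"
printf 'AB0001\tCOG0593\nAB0004\t\nAB0005\t\nAB0007\tCOG3127\n' > "$tmp/list2.txt"
printf 'AB0001\tCOG0593\nAB0004\tCOG5146\nAB0005\tNOG84439\nAB0007\tCOG0577\n' > "$tmp/list3.txt"

result=$(fill_missing "$tmp/list1.txt" "$tmp/list2.txt" "$tmp/list3.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Here AB0004 is taken from list3 because list1 lacks the key and list2 has it without a value, while AB0005 and AB0007 keep their list1 values.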

Answer 1: (score: 0)

A join + awk combination:

join -a2 list1.txt list2.txt | join -a2 - list3.txt | awk '{print $1,$2}' OFS='\t'

Output:

AB0001  COG0593
AB0002  COG0592
AB0003  COG1195
AB0004  COG5146
AB0005  COG1005
AB0006  COG5621
AB0007  COG4591
AB0008  COG1136
AB0009  COG0071
AB0010  COG3212
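
Printing `$2` works here because `join` appends the values in the order list1, list2, list3, and with its default blank-separated fields an empty value contributes no field, so the first field after the key is the highest-priority value present. One caveat: `-a2` keeps unpaired lines only from the second input, so a key that appears only in list1 would be dropped at the first join. The sketch below adds `-a1` to keep such keys too. The shortened demo files and the extra key AB0011 are made up for illustration, and `join` expects its inputs to be sorted on the key.

```shell
#!/usr/bin/env bash
# Sketch: the same join pipeline, keeping unpaired keys from both inputs.
tmp=$(mktemp -d)
# Hypothetical demo data; AB0011 exists only in list1.
printf 'AB0001\tCOG0593\nAB0005\tCOG1005\nAB0011\tCOG9999\n' > "$tmp/list1.txt"
printf 'AB0001\tCOG0593\nAB0004\t\nAB0005\t\n' > "$tmp/list2.txt"
printf 'AB0001\tCOG0593\nAB0004\tCOG5146\nAB0005\tNOG84439\n' > "$tmp/list3.txt"

# After both joins each line is: key, then every non-empty value in priority order.
result=$(join -a1 -a2 "$tmp/list1.txt" "$tmp/list2.txt" |
    join -a1 -a2 - "$tmp/list3.txt" |
    awk '{print $1, $2}' OFS='\t')
printf '%s\n' "$result"
rm -rf "$tmp"
```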