VLOOKUP-like one-liner using awk

Time: 2015-02-22 03:41:22

Tags: text-processing

There are plenty of threads about using awk as a VLOOKUP, but none of them seemed to work when I tried them.

I have 2 files:

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125_VS_Danio.blastp_results
Sequence name   Hit desc.   E-Value Similarity
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio]  0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240   gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901   gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio]   0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio]    0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005   gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio]    0.0 98
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio]    0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio]  0.0 97
Locus_11_Transcript_7/7_Confidence_0.647_Length_1989    gnl|BL_ORD_ID|6732gi|528475412|ref|XP_005164328.1| PREDICTED: cerebellar degeneration-related protein 2-like isoform X2 [Danio rerio]   0.0 96

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125.LocusList
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912

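Notice how the second file counts all loci starting from 1, while the first file skips a few, such as 3 and 7.

For every locus in file 2, I need the output to show the locus plus a column from file 1 (say, column 2) whenever that locus is present in file 1. If the locus is not in file 1, I want to see NA.

This is the closest I have gotten so far, but it does not show the column from file 1: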

@BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ awk 'FNR == NR {keys[$1]; next} {if ($1 in keys) {print $1, $2} else {print $1, "NA"} }' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList | head
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 

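Notice that 3 and 7 have the desired NA, but how do I get the other lines to show the content from file 1? Thanks, Adrian

1 answer:

Answer 0: (score: 1)

You are nearly there. What is the problem? You do this: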

FNR == NR {keys[$1]; next}

That creates the key in the associative array but stores no value under it. Replace it with:

FNR == NR {keys[$1] = $1; next}
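
The difference can be checked in isolation. The following is only a minimal sketch with made-up input (the keys a and b and the array names ref and val are placeholders, not taken from the files above). Merely referencing an element creates it with an empty value, so the "in" test succeeds while nothing printable is stored:

```shell
# Referencing ref[$1] creates the entry with an empty value;
# assigning to val[$1] stores something that can be printed later.
out=$(printf 'a x\nb y\n' | awk '
    { ref[$1]; val[$1] = $2 }
    END {
        printf "ref[a]=[%s] val[a]=[%s] in_ref=%d\n", ref["a"], val["a"], ("a" in ref)
    }
')
echo "$out"
# prints: ref[a]=[] val[a]=[x] in_ref=1
```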

When printing, $2 does not exist, because the second file has only one column:

if ($1 in keys) {print $1, $2}

Instead, print what was saved in the associative array:

if ($1 in keys) {print $1, keys[$1]}

So the whole command ends up like this:

awk '
    FNR == NR { keys[$1] = $1; next }
    {
        if ($1 in keys) { print $1, keys[$1] }
        else            { print $1, "NA" }
    }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList

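Update based on the comments: it is similar to the previous one. Just blank out the first field and save the rest of the line in the array. Assigning to $1 makes awk rebuild $0 with single spaces between fields, so the saved text starts with a space.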

awk '
    FNR == NR { f1 = $1; $1 = ""; keys[f1] = $0; next }
    {
        if ($1 in keys) { print $1, keys[$1] }
        else            { print $1, "NA" }
    }
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList

It yields:

Locus_1_Transcript_1/1_Confidence_1.000_Length_2223  gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240  gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901  gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023  gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005  gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179  gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266  gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912  gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
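
If only the second column (the hit description) is wanted, as the question asked, a tab-aware variant may be worth trying. This is a sketch under an assumption the listing above does not confirm: that the results file is tab-separated, so the description with its embedded spaces is a single field. The file names hits.tsv and loci.txt and their contents are hypothetical miniatures made up for the demonstration:

```shell
# Hypothetical miniature inputs; the hits file is ASSUMED to be tab-separated.
tmp=$(mktemp -d)
printf 'Locus_1\tdesc one [Danio rerio]\t0.0\t89\nLocus_4\tgelsolin [Danio rerio]\t0.0\t92\n' > "$tmp/hits.tsv"
printf 'Locus_1\nLocus_3\nLocus_4\n' > "$tmp/loci.txt"

result=$(awk -F'\t' -v OFS='\t' '
    # first file: remember column 2 under the locus name
    FNR == NR { desc[$1] = $2; next }
    # second file: print the locus and its saved description, or NA
    { print $1, (($1 in desc) ? desc[$1] : "NA") }
' "$tmp/hits.tsv" "$tmp/loci.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With tab as the field separator, $2 holds the full description and nothing has to be rebuilt, so no leading space appears. If the real file turns out to be space-separated, the whole-line approach above remains the safer choice.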