I have a tab-delimited file that looks like this:
locus_tag="PSE_0001" codon_start=1 transl_table=11 product="Peptidase M23 M37 family protein" protein_id="AEV34513.1" db_xref="GI:359341139" translation="MVDSLASSSDQPARLNGRWLIGTILTGMTSMVLMGGALMAALDGQYTYKTAKAPASNAADLTPQRNTSGKGDRLTSATDGFSNRQIIEVNTVTRSEGRDHVKAKPYALVSASLESFKKQETAADIPPFDPITMYQGEQVAPLQVASDAIYGADIEGEVSISQRDFPLEAMSMVALPDHKEEAVQQQVKKAAMFMLDNSTDIAAIPSVEDINAGFAPLSEQSFENIEVRITEENVSFQPKSRKTTQANQIEERIVPILTQTDFIDILLDGEASETEAEGYIKAFTDNFGIDTIKAGQIFRLSLNTDQIEEDDGILVRVSIYEDQRHVGTIARNDEGEFVVAPEPTTQMAADAFNSQQQNSVGPRATYYDSIYQTGLDNEVPSSLIKELIRIYSYSVDFNASVKSGDEMSVFYGLDADQTTGASEILYTSITVNGRSHRFYRFRTPDDGVVDYYDENGQSAKQFLLRKPIAAGRFTSGFGMRRHPVLKTRRLHTGTDWAAPRGTAIFAAGDGVIQKAAWSGGYGKRVEIKHANGYVTTYNHMTRFATGIQKGQRIRQGTVIGYVGTTGLSTGNHLHYEVKVNGRFVNSLKIKVPQGRVLEAQVLENFKRERDRINALMETGRPSQRVASLRN" GenBank_acc="CP003147"; Source="Pseudovibrio sp. FO-BEG1"; feature_type="CDS"; strand="+";
locus_tag="PSE_0002" codon_start=1 transl_table=11 product="hypothetical protein" protein_id="AEV34514.1" db_xref="GI:359341140" translation="MENVLIYLVGFAGTGKLTIARALAEATSAKVVDNQWINNPIFGLLDHDRLTPYPEGVWRQIDKVREAVLETVATLGAPHASYIFTHEGFEDDASDRQIYEAIRETAQRRKARFLPVRLLCNEDEIAKRVVSPERALRLKSMDPERSRNAVRNSTVLKPNHENELTLDISDKQPADVVVLILEQVAHCKT" GenBank_acc="CP003147"; Source="Pseudovibrio sp. FO-BEG1"; feature_type="CDS"; strand="-";
I want to extract only the fields that contain specific information,
e.g.
locus_tag
product
to get the following tab-delimited result:
locus_tag="PSE_0001" product="Peptidase M23 M37 family protein"
locus_tag="PSE_0002" product="hypothetical protein"
I tried this awk code:
awk '{for(i=1;i<=NF;i++)if ($i~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt| head
But I got:
locus_tag="PSE_0001"
codon_start=1
transl_table=11
product="Peptidase
M23
M37
family
protein"
db_xref="GI:359341139"
Any suggestions on how I can fix my code?
Answer 0 (score: 2)
Your code does not really do what you asked for:
awk '{for(i=1;i<=NF;i++)if $1~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt
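For example, you did not ask for db_xref, and an opening parenthesis is missing after the if. Also, if your file is tab-delimited, you should add -F"\t"; otherwise awk splits on any whitespace and the product value is broken at its spaces. Finally, the output ends up on separate lines because print appends a newline on every call, so you want printf, which does not add "\n" automatically.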
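To see the effect of the field separator, you can compare the field counts awk reports with and without -F. This is only a minimal sketch: the one-line sample and the variable names default_nf and tab_nf are made up for illustration.

```shell
# Hypothetical sample line: two tab-separated fields, one of them containing spaces.
line=$(printf 'locus_tag="PSE_0001"\tproduct="Peptidase M23 M37 family protein"')

# Default splitting: any run of blanks (spaces or tabs) separates fields.
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# Tab-only splitting: spaces inside a value no longer start a new field.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')

echo "default: $default_nf fields, tab: $tab_nf fields"
```

With the default separator the product value is cut into five pieces, which is exactly the output shown in the question.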
Here is how I would do it:
awk -F"\t" '{for (i=1;i<=NF;i++) {if($i~/^locus_tag=/) printf "%s\t", $i; if($i~/^product=/) printf "%s\n", $i}}' file
Since locus_tag comes first, I print that field followed by a tab, and when I find product I print the field and end the line.
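If you do not want to depend on which field comes first, one option is to build each output line per record and print it once. The following is a sketch under some assumptions: every field has the form key=value, the helper name extract_fields is hypothetical, and the two sample records are shortened versions of the lines in the question.

```shell
# Hypothetical sample input: tab-separated key=value fields, one record per line.
sample=$(mktemp)
printf 'locus_tag="PSE_0001"\tcodon_start=1\tproduct="Peptidase M23 M37 family protein"\tdb_xref="GI:359341139"\n' > "$sample"
printf 'locus_tag="PSE_0002"\tcodon_start=1\tproduct="hypothetical protein"\tdb_xref="GI:359341140"\n' >> "$sample"

# extract_fields FILE KEY...: print the fields whose key is listed,
# joined by tabs, one output line per input record.
extract_fields() {
    file=$1; shift
    awk -F'\t' -v keys="$*" '
        BEGIN { OFS = "\t"; n = split(keys, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                key = $i
                sub(/=.*/, "", key)          # keep only the text before the first "="
                if (key in want)
                    out = (out == "" ? $i : out OFS $i)
            }
            if (out != "") print out
        }
    ' "$file"
}

extract_fields "$sample" locus_tag product
```

Matching on the exact key rather than on a regex anywhere in the field avoids accidental hits inside a value, and adding a third key is just one more argument.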
Edit:
If you want to extract more than two fields (three here), you can store them in an array:
awk -F"\t" 'BEGIN{j=1}
{for (i=1;i<=NF;i++) if($i~/locus_tag|product|db_xref/) {a[j]=$i;j=j+1}}
END{for (i=1;i<=length(a);i=i+3) print a[i],a[i+1],a[i+2]}' file
locus_tag="PSE_0001" product="Peptidase M23 M37 family protein" db_xref="GI:359341139"
locus_tag="PSE_0002" product="hypothetical protein" db_xref="GI:359341140"
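Note that the array version groups the matches in threes at the end, so it relies on every record containing all three fields. Below is a sketch of a per-record variant that keeps the columns aligned when a field is missing. The NA placeholder and the sample records (the second one has no db_xref) are assumptions made for illustration.

```shell
# Hypothetical sample: the second record has no db_xref field.
sample=$(mktemp)
printf 'locus_tag="PSE_0001"\tproduct="Peptidase M23 M37 family protein"\tdb_xref="GI:359341139"\n' > "$sample"
printf 'locus_tag="PSE_0002"\tproduct="hypothetical protein"\n' >> "$sample"

result=$(awk -F'\t' '
    BEGIN { OFS = "\t"; n = split("locus_tag product db_xref", cols, " ") }
    {
        split("", seen)                       # clear the per-record lookup table
        for (i = 1; i <= NF; i++) {
            key = $i
            sub(/=.*/, "", key)               # text before the first "=" is the key
            seen[key] = $i
        }
        line = ""
        for (c = 1; c <= n; c++) {
            val = (cols[c] in seen) ? seen[cols[c]] : "NA"   # NA marks a missing field
            line = (c == 1 ? val : line OFS val)
        }
        print line
    }
' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because each line is printed as soon as its record has been read, nothing has to be held in memory until END, and the column order is fixed by the list in BEGIN rather than by the order of fields in the file.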