Extract multiple fields containing specific words

Date: 2014-08-05 09:19:41

Tags: linux awk

I have a tab-delimited file that looks like this:

locus_tag="PSE_0001"    codon_start=1   transl_table=11     product="Peptidase M23  M37 family protein"   protein_id="AEV34513.1"   db_xref="GI:359341139"  translation="MVDSLASSSDQPARLNGRWLIGTILTGMTSMVLMGGALMAALDGQYTYKTAKAPASNAADLTPQRNTSGKGDRLTSATDGFSNRQIIEVNTVTRSEGRDHVKAKPYALVSASLESFKKQETAADIPPFDPITMYQGEQVAPLQVASDAIYGADIEGEVSISQRDFPLEAMSMVALPDHKEEAVQQQVKKAAMFMLDNSTDIAAIPSVEDINAGFAPLSEQSFENIEVRITEENVSFQPKSRKTTQANQIEERIVPILTQTDFIDILLDGEASETEAEGYIKAFTDNFGIDTIKAGQIFRLSLNTDQIEEDDGILVRVSIYEDQRHVGTIARNDEGEFVVAPEPTTQMAADAFNSQQQNSVGPRATYYDSIYQTGLDNEVPSSLIKELIRIYSYSVDFNASVKSGDEMSVFYGLDADQTTGASEILYTSITVNGRSHRFYRFRTPDDGVVDYYDENGQSAKQFLLRKPIAAGRFTSGFGMRRHPVLKTRRLHTGTDWAAPRGTAIFAAGDGVIQKAAWSGGYGKRVEIKHANGYVTTYNHMTRFATGIQKGQRIRQGTVIGYVGTTGLSTGNHLHYEVKVNGRFVNSLKIKVPQGRVLEAQVLENFKRERDRINALMETGRPSQRVASLRN"    GenBank_acc="CP003147";     Source="Pseudovibrio sp. FO-BEG1";  feature_type="CDS";     strand="+";
locus_tag="PSE_0002"    codon_start=1   transl_table=11 product="hypothetical protein"  protein_id="AEV34514.1" db_xref="GI:359341140"  translation="MENVLIYLVGFAGTGKLTIARALAEATSAKVVDNQWINNPIFGLLDHDRLTPYPEGVWRQIDKVREAVLETVATLGAPHASYIFTHEGFEDDASDRQIYEAIRETAQRRKARFLPVRLLCNEDEIAKRVVSPERALRLKSMDPERSRNAVRNSTVLKPNHENELTLDISDKQPADVVVLILEQVAHCKT"     GenBank_acc="CP003147";     Source="Pseudovibrio sp. FO-BEG1";  feature_type="CDS";     strand="-";

I want to extract only the fields that contain specific information:

e.g.

locus_tag
product

to get the following tab-delimited result:

locus_tag="PSE_0001"    product="Peptidase M23  M37 family protein"
locus_tag="PSE_0002"    product="hypothetical protein"

I tried this awk code:

awk '{for(i=1;i<=NF;i++)if ($i~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt| head

But I got:

locus_tag="PSE_0001"
codon_start=1
transl_table=11
product="Peptidase
M23
M37
family
protein"
db_xref="GI:359341139"

Any suggestions on how I can fix the code?

1 answer:

Answer 0 (score: 2)

Your code does not really do what you are asking for:

awk '{for(i=1;i<=NF;i++)if $1~/^locus_tag|^product|db_xref/) print $i}' Chrom.txt

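For example, you did not ask for db_xref, yet the pattern matches it, and a parenthesis is missing after the if. Also, if your file is tab-delimited, you should add -F"\t"; otherwise awk splits on any whitespace and breaks up values that contain spaces. Finally, the output is split across lines because print ends every call with a newline. You want printf instead, which does not append "\n" automatically.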
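The effect of the field separator can be checked on a single line. The snippet below is only an illustrative sketch: the sample record is a shortened version of the data in the question, and the variable names (`line`, `default_nf`, `tab_nf`) are arbitrary.

```shell
# Shortened sample record: three tab-separated fields, one value contains spaces.
line=$(printf '%s\t%s\t%s' \
  'locus_tag="PSE_0001"' 'codon_start=1' 'product="Peptidase M23 family protein"')

# Default splitting: any run of spaces or tabs separates fields.
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# Tab-only splitting: spaces inside a value no longer split it.
tab_nf=$(printf '%s\n' "$line" | awk -F'\t' '{print NF}')

echo "default: $default_nf fields, tab-separated: $tab_nf fields"
```

With the default separator the product value is counted as four separate fields, which is why the original attempt printed `product="Peptidase`, `M23`, and so on, each on its own line.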

Here is how I would do it:

awk -F"\t" '{for (i=1;i<=NF;i++) {if($i~/locus_tag/) printf "%s\t", $i; if($i~/product/) printf "%s\n", $i}}' file

Since locus_tag comes first, I print that field followed by a tab; when I find product, I print the field and end the line.
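That approach depends on locus_tag appearing before product in every line. If the order is not guaranteed, one possible variant is to remember both fields while scanning the record and print them once at the end. This is a sketch, not part of the original answer: the two sample records are shortened from the question (the second one is deliberately reordered), and `tag`, `prod`, `sample`, and `result` are names chosen for illustration.

```shell
# Two shortened sample records; in the second one, product precedes locus_tag.
sample=$(printf '%s\t%s\t%s\n%s\t%s\t%s\n' \
  'locus_tag="PSE_0001"' 'codon_start=1' 'product="Peptidase M23  M37 family protein"' \
  'product="hypothetical protein"' 'codon_start=1' 'locus_tag="PSE_0002"')

result=$(printf '%s\n' "$sample" | awk -F'\t' '
{
    tag = ""; prod = ""                    # reset for every record
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^locus_tag=/) tag  = $i  # anchored: the field must start with the key
        if ($i ~ /^product=/)   prod = $i
    }
    printf "%s\t%s\n", tag, prod           # fixed output order, one line per record
}')

printf '%s\n' "$result"
```

Anchoring the patterns with `^` and `=` also avoids accidental matches when a key name happens to occur inside another field's value.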

Edit:

If you want to extract more than two fields (three here), you can store them in an array:

awk -F"\t" 'BEGIN{j=1}
{for (i=1;i<=NF;i++) if($i~/locus_tag|product|db_xref/) {a[j]=$i;j=j+1}}
END{for (i=1;i<j;i=i+3) print a[i],a[i+1],a[i+2]}' file

locus_tag="PSE_0001" product="Peptidase M23  M37 family protein" db_xref="GI:359341139"
locus_tag="PSE_0002" product="hypothetical protein" db_xref="GI:359341140"
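The array version holds every matching field in memory until END and assumes that each record contains all three keys; a record missing one of them would shift the following rows. A per-record alternative is sketched below, under a few assumptions: every field has the form key=value, the wanted keys are passed in through a hypothetical `keys` variable, and the sample data is shortened from the question (the second record deliberately lacks db_xref).

```shell
# Shortened sample: the second record has no db_xref field.
sample=$(printf '%s\t%s\t%s\t%s\n%s\t%s\n' \
  'locus_tag="PSE_0001"' 'codon_start=1' \
  'product="Peptidase M23  M37 family protein"' 'db_xref="GI:359341139"' \
  'locus_tag="PSE_0002"' 'product="hypothetical protein"')

result=$(printf '%s\n' "$sample" | awk -F'\t' -v keys='locus_tag,product,db_xref' '
BEGIN { n = split(keys, want, ",") }        # wanted keys, in output order
{
    split("", val)                          # clear the per-record lookup table
    for (i = 1; i <= NF; i++) {
        eq = index($i, "=")                 # position of the first "=" in the field
        if (eq > 0) val[substr($i, 1, eq - 1)] = $i
    }
    line = ""
    for (k = 1; k <= n; k++)                # a missing key yields an empty column
        line = line (k > 1 ? "\t" : "") val[want[k]]
    print line
}')

printf '%s\n' "$result"
```

Each input line produces exactly one tab-separated output line, so rows stay aligned even when a field is absent, and the list of keys can be changed without touching the awk program.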