This is my input.file (several thousand lines):
FN545816.1 EMBL CDS 9450 9857 . + 0 ID=cds-CBE01461.1;Parent=gene-CDR20291_3551;Dbxref=EnsemblGenomes-Gn:CDR20291_3551,EnsemblGenomes-Tr:CBE01461,GOA:C9YHF8,InterPro:IPR003594,UniProtKB/TrEMBL:C9YHF8,NCBI_GP:CBE01461.1;Name=CBE01461.1;gbkey=CDS;gene=rsbW;product=anti-sigma-B factor (serine-protein kinase);protein_id=CBE01461.1;transl_table=11
I only want to extract the text that follows product= up to the next ;
So in this case, I want to get "anti-sigma-B factor (serine-protein kinase)".
I tried:
awk '{for(i=1; i<=NF; i++) if($i~/*product=/) print $(i+1)}' input.file > output.file
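but it only prints "factor" (presumably because there is no space between "product=" and "anti-sigma-B"). It does not print the rest either.
I have tried many earlier solutions, but none of them does what I need.
Thanks.
Answer 0 (score: 1)
Could you please try the following.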
awk 'match($0,/product=[^;]*/){print substr($0,RSTART+8,RLENGTH-8)}' Input_file
Explanation: the same code with a comment on each line.
awk '                                 ##Start of the awk program.
match($0,/product=[^;]*/){            ##match() searches the line for product= followed by every character up to, but not including, the next ;
  print substr($0,RSTART+8,RLENGTH-8) ##Print the matched text without the 8-character prefix product=. RSTART and RLENGTH are built-in variables set by match(): RSTART is the position where the match starts and RLENGTH is the length of the match.
}' Input_file                         ##Name of the input file.
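As a possible alternative, the same value can also be picked out by splitting each line into fields instead of using match(). The sketch below is only an illustration: it assumes the columns are tab-separated (as in standard GFF3) and that there is at most one product= attribute per line. The sample line is a shortened, hypothetical version of the one in the question, and the temporary file stands in for the real input.

```shell
# Hypothetical sample: a shortened copy of the line from the question,
# written with tab-separated columns (an assumption about the real file).
sample=$(mktemp)
printf 'FN545816.1\tEMBL\tCDS\t9450\t9857\t.\t+\t0\tID=cds-CBE01461.1;gene=rsbW;product=anti-sigma-B factor (serine-protein kinase);protein_id=CBE01461.1\n' > "$sample"

# Use both ";" and tab as field separators, so every attribute becomes its own field.
# Then print the value of the first field that starts with product=.
result=$(awk -F'[;\t]' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /^product=/) {
      print substr($i, 9)   # skip the 8 characters of product=
      next                  # assumes one product per line
    }
}' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

On the real data the awk part would be run directly on input.file and redirected to output.file. Because the product value keeps its internal spaces as long as it contains no ; or tab, this avoids the problem with whitespace-separated fields described in the question.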