Question

I have a file like this:

/db_xref="SEED:fig|1240086.14.peg.1"
             /translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
             LITPRAAVPALKRPALKASLPASSSHGNWETF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
CDS             complement(471..590)
/db_xref="SEED:fig|1240086.14.peg.2"
/translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
/product="hypothetical protein"
CDS             717..2354
/db_xref="SEED:fig|1240086.14.peg.3"

The result should be:

solanii.1    Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
solanii.2    hypothetical protein

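I need to get every line that starts with /product, and if such a line does not end with a closing ", I need to take the next line as well and join the two. Also, from each part such as fig|1240086.14.peg.1 I need to keep the last number and replace the rest with solanii.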

I am using this code to get everything written after product:

awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}'

But I don't know how to do the rest.

Answer 1

Based on what you already have, I think awk may suit you:

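gawk -v RS='/|CDS' -F'"' '
{
   # Join wrapped lines with a single space.
   gsub(/\n[[:space:]]*/, " ")
}
/^db_xref/ {
   # Capture the whole trailing number (peg.14 -> 14); gensub() is gawk-only.
   num = gensub(/^.*[^0-9]([0-9]+)"\s*$/, "\\1", "1")
}
/^product/ {
   print "solanii." num " " $2
}' input_file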

Edit: a better solution using awk (thanks @EdMorton). Note that it relies on gawk-specific features:

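gawk -v RS='/(product|db_xref)="[^"]+"' -F'"' '
RT{
   $0=RT
   # Join wrapped lines with a single space.
   gsub(/\n[[:space:]]*/, " ")
   # Capture the whole trailing number (peg.14 -> 14), not just its last digit.
   if (/^\/db_xref/) num = gensub(/^.*[^0-9]([0-9]+)"$/, "\\1", "1")
   else print "solanii." num " " $2
}' file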
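
If GNU awk is not available, the same idea can probably be sketched in plain POSIX awk by reading the file line by line instead of redefining RS. This is only a sketch under assumptions: the function name peg_products is made up for illustration, each /db_xref line is assumed to appear before the /product it belongs to, and a wrapped product value is assumed to end on the line that carries the closing quote.

```shell
# Hypothetical helper (name chosen for illustration): reads the file(s) given
# as arguments, or stdin, and prints "solanii.<n> <product>" per /product.
peg_products() {
  awk '
    # Remember the trailing number of the latest db_xref (peg.14 -> 14).
    /\/db_xref="/ {
      num = $0
      sub(/"[ \t]*$/, "", num)     # drop the closing quote
      sub(/^.*[^0-9]/, "", num)    # keep only the final run of digits
      next
    }
    # A product value starts here; it may continue on following lines.
    /\/product="/ {
      val = $0
      sub(/^.*\/product="/, "", val)
      inprod = 1
    }
    # Continuation line: append it with a single space, minus the indentation.
    inprod && !/\/product="/ && NF {
      line = $0
      sub(/^[ \t]+/, "", line)
      val = val " " line
    }
    # The closing quote ends the value: print it with the remembered number.
    inprod && val ~ /"[ \t]*$/ {
      sub(/"[ \t]*$/, "", val)
      print "solanii." num " " val
      inprod = 0
    }
  ' "$@"
}
```

It would be called as peg_products input_file. Because it never splits on "/" or "CDS", a product name that happens to contain either of those should pass through unchanged.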

Answer 2

Here is the answer I gave to your last question:

gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file

Here is how to modify it for the current question:

gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print "solanii." NR, $2}' file

Again, this uses GNU awk for the multi-character RS and for RT.
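
One thing to keep in mind with the NR-based version: NR counts the /product records in order of appearance, so the printed number matches the peg number only if the pegs are numbered consecutively from 1 and every entry has exactly one product. A rough way to check that assumption before relying on it is sketched below in POSIX awk; the function name check_peg_order is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) when the n-th /product follows a
# /db_xref whose trailing number is n; otherwise reports the first mismatch.
check_peg_order() {
  awk '
    # Remember the trailing number of the latest db_xref (peg.14 -> 14).
    /\/db_xref="/ {
      num = $0
      sub(/"[ \t]*$/, "", num)
      sub(/^.*[^0-9]/, "", num)
    }
    # Count products and compare the count with the remembered peg number.
    /\/product="/ {
      n++
      if (num + 0 != n) {
        printf "product #%d follows peg number %s\n", n, num
        bad = 1
        exit
      }
    }
    END { exit bad + 0 }
  ' "$@"
}
```

If check_peg_order file reports a mismatch, parsing the number out of the db_xref value is the safer route.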

How to join different lines containing different patterns in UNIX?
