Question

OK, I have found similar answers on SO, but my sed/grep/awk fu is so poor that I could not quite adapt them to my task. Given this file, "test.gff":

accn|CP014704   RefSeq  CDS 403 915 .   +   0   ID=AZ909_00020;locus_tag=AZ909_00020;product=transcriptional regulator
accn|CP014704   RefSeq  CDS 928 2334    .   +   0   ID=AZ909_00025;locus_tag=AZ909_00025;product=FAD/NAD(P)-binding oxidoreductase
accn|CP014704   RefSeq  CDS 31437   32681   .   +   0   ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA
accn|CP014704   RefSeq  CDS 2355    2585    .   +   0   ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein

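I want to extract two values: 1) the text to the right of "ID=" up to the semicolon, and 2) the text to the right of "product=" up to the end of the line or a semicolon (as you can see, one of the lines also has a "gene=" value).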

So I would like something like this:

ID    product
AZ909_00020    transcriptional regulator
AZ909_00025    FAD/NAD(P)-binding oxidoreductase
AZ909_00145    gamma-glutamyl-phosphate reductase

This is as far as I got:

printf "ID\tproduct\n"

sed -nr 's/^.*ID=(.*);.*product=(.*);/\1\t\2\p/' test.gff

Thanks!

Answer 1

Try the following:

sed 's/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1\t\2/' test.gff

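Compared with your attempt, I changed how the product is matched. Since we do not know whether the field ends with a ; or at the end of the line, we simply match as many non-; characters as possible. I also added a .* at the end to match whatever characters remain after the product. That way the whole line is matched by the substitution, and we can rewrite it completely.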
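To make the two cases concrete, here is a small sketch run on a made-up two-line sample (the records and the variable names `sample`, `tab`, and `result` are assumptions for illustration, not part of test.gff). One product ends at the end of the line, the other is followed by `;gene=`. A literal tab is spliced into the replacement instead of `\t`, since `\t` in the replacement is a GNU sed extension.

```shell
# Hypothetical sample: first product ends at end of line, second is followed by ;gene=
sample='x ID=A1;locus_tag=A1;product=first thing
x ID=A2;locus_tag=A2;product=second thing;gene=abc'

# A literal tab, so the command does not depend on GNU sed's \t escape
tab=$(printf '\t')

# [^;]* stops at the first semicolon or at end of line, whichever comes first;
# the trailing .* consumes the rest so the whole line is replaced
result=$(printf '%s\n' "$sample" |
  sed "s/.*ID=\([^;]*\);.*product=\([^;]*\).*/\1${tab}\2/")

printf '%s\n' "$result"
```

Both records come out as the ID, a tab, and the product, with the `;gene=abc` suffix dropped from the second one.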

If you want something more robust, here is a perl one-liner:

perl -nle '($id)=/ID=([^;]*)/; ($prod)=/product=([^;]*)/; print "$id\t$prod"' test.gff

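This uses a separate regular expression to extract each of the two fields. It works correctly even if the fields appear in the opposite order.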

Answer 2

If you have GNU awk, a.k.a. gawk, you can try the following:

Using awk

gawk 'BEGIN{printf "ID\tProduct%s",RS}
     {printf "%s\t%s%s",gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0),
      gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0),RS}
    ' test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

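As you will have noticed, the two gensub calls do the heavy lifting.

In gensub(/^.*[[:blank:]]+ID=([^;]*);.*$/,"\\1","1",$0), everything except the text between ID= and the first semicolon after it is stripped from the record (that is, $0). Note that gensub does not modify the record itself; it only returns the modified string, which is then printed.
In gensub(/^.*;product=([^;]*)[;]*.*$/,"\\1","1",$0), everything except the text between product= and the next semicolon (or the end of the line) is stripped.
Finally, expand -t widens the tab stops to give neatly aligned output.
Since hard-coding \n is bad practice, I used the built-in record separator variable RS (a newline by default) to print a newline after each record.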
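gensub is a gawk extension. Where gawk is not available, the same idea can be approximated in POSIX awk with match(), substr(), and sub(). The sketch below is a swapped-in technique rather than the answer's method; the helper name `attr` and the shell variables are hypothetical, and the sample record is rebuilt with explicit tabs on the assumption that the real file is tab-separated.

```shell
tab=$(printf '\t')
# One record from the question, rebuilt with explicit tabs (assumed separator)
line="accn|CP014704${tab}RefSeq${tab}CDS${tab}31437${tab}32681${tab}.${tab}+${tab}0${tab}ID=AZ909_00145;locus_tag=AZ909_00145;product=gamma-glutamyl-phosphate reductase;gene=proA"

result=$(printf '%s\n' "$line" | awk '
# attr(s, key): text after "key=" up to the next ";" or the end of the string
function attr(s, key,    rest) {
    # key must start the string or follow ";", a space, or a tab
    if (!match(s, "(^|[; \t])" key "=")) return ""
    rest = substr(s, RSTART + RLENGTH)   # everything after "key="
    sub(/;.*/, "", rest)                 # cut at the first semicolon, if any
    return rest
}
{ printf "%s\t%s\n", attr($0, "ID"), attr($0, "product") }')

printf '%s\n' "$result"
```

Like gensub, the helper leaves $0 untouched and only returns the extracted string.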

A sed solution using similar logic is as follows:

Using sed

printf "%-20s%s\n" "ID" "Product"
sed -E "s/^.*[[:blank:]]+ID=([^;]*);.*;product=([^;]*)[;]*.*$/\\1\t\\2/" test.gff | expand -t20

Output

ID                  Product
AZ909_00020         transcriptional regulator
AZ909_00025         FAD/NAD(P)-binding oxidoreductase
AZ909_00145         gamma-glutamyl-phosphate reductase
AZ909_00030         hypothetical protein

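Given that you have already been offered a short and elegant perl solution, you may want to use that instead if perl is available to you.

Note: using \n in printf makes the script less portable.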

Answer 3

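The main problem with your regular expression is that it uses .* instead of [^;]*: .* matches any character, whereas you only want to match non-semicolons. Try this: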

$ sed -E 's/.*ID=([^;]+).*product=([^;]+).*/\1\t\2/' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein
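The difference between the two patterns is easiest to see side by side. The sketch below runs a greedy variant and the [^;]+ variant on a made-up attribute string (the input and the variable names are assumptions for illustration) and wraps each capture in brackets so its extent is visible.

```shell
# Hypothetical attribute string with a trailing ;gene= field
line='ID=A;locus_tag=A;product=P;gene=g'

# Greedy: (.*) after ID= runs on past semicolons as far as the rest of the pattern allows
greedy=$(printf '%s\n' "$line" |
  sed -E 's/.*ID=(.*);.*product=(.*);.*/[\1][\2]/')

# Bounded: [^;]+ cannot cross a semicolon, so each capture is a single value
fixed=$(printf '%s\n' "$line" |
  sed -E 's/.*ID=([^;]+).*product=([^;]+).*/[\1][\2]/')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

The greedy version yields `[A;locus_tag=A][P]`, while the bounded version yields `[A][P]`.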

Or:

$ awk -F'[=;]' -v OFS='\t' '{print $2, $6}' file
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein

With awk you can also easily extract the header values:

$ awk -F'[=;]' -v OFS='\t' 'NR==1{sub(/.* /,"",$1); print $1, $5} {print $2, $6}' file
ID      product
AZ909_00020     transcriptional regulator
AZ909_00025     FAD/NAD(P)-binding oxidoreductase
AZ909_00145     gamma-glutamyl-phosphate reductase
AZ909_00030     hypothetical protein
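To see why $2 and $6 hold the values (and $1 and $5 the header names), it helps to print every field that the [=;] separator produces. This is a sketch on a shortened, made-up record; the record and the variable name `fields` are assumptions for illustration.

```shell
# Print each field with its index when splitting on "=" or ";"
fields=$(printf 'x ID=A1;locus_tag=A1;product=some protein\n' |
  awk -F'[=;]' '{ for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }')

printf '%s\n' "$fields"
```

Field 1 ends in the key `ID`, field 2 is its value, field 5 is the key `product`, and field 6 is its value. The approach is positional, so it relies on the attributes always appearing in the order ID, locus_tag, product.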

Answer 4

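Another one in awk. We add ";" to the list of field separators (FS), strip the strings "ID=" and "product=", and print fields 9 and 10. Note that with the sample file from the question, field 10 is the locus_tag attribute rather than the product, as the output shows: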

$ awk -F'([ \t\n]+|;)' 'BEGIN{print "ID" OFS "Product"}{gsub(/product=|ID=/,""); print $9,$10}' test.gff
ID Product
AZ909_00020 locus_tag=AZ909_00020
AZ909_00025 locus_tag=AZ909_00025
AZ909_00145 locus_tag=AZ909_00145
AZ909_00030 locus_tag=AZ909_00030
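A variant that does not depend on field positions is to split the attribute column into key=value pairs and pick the wanted keys by name. The following is only a sketch under the assumption that the file is tab-separated with the attributes in column 9; the variable names are hypothetical.

```shell
tab=$(printf '\t')
# One record from the question, rebuilt with explicit tabs (assumed separator)
line="accn|CP014704${tab}RefSeq${tab}CDS${tab}2355${tab}2585${tab}.${tab}+${tab}0${tab}ID=AZ909_00030;locus_tag=AZ909_00030;product=hypothetical protein"

result=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' '
BEGIN { print "ID", "Product" }
{
    id = prod = ""
    n = split($9, kv, ";")              # column 9 holds the key=value attributes
    for (i = 1; i <= n; i++) {
        eq  = index(kv[i], "=")         # position of the first "=" in this pair
        key = substr(kv[i], 1, eq - 1)
        val = substr(kv[i], eq + 1)
        if (key == "ID") id = val
        else if (key == "product") prod = val
    }
    print id, prod
}')

printf '%s\n' "$result"
```

Because the keys are looked up by name, extra attributes such as locus_tag or gene do not shift the result.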

Extracting two pieces of text from a line at once with sed
