Question

I have a file that looks like this:

/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
                 LITPRAAVPALKRPALKASLPASSSHGNWETF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
                 QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
                 RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"

I need to extract the text between the quotes after "/product=", so I need this:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

I have to use awk, so I wrote this:

awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'

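But this only gets the text that is on the same line as "/product"; sometimes the text spans two or three lines. I have no idea how to get the whole text between the quotes. Can anyone help?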

Answer 1

Awk solution:

awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
               /\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file

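-v RS='"' - treat the double quote " as the record separator
!(NR%2) - on every even-numbered record (the text between a pair of quotes)
gsub(/[[:space:]]+/," ") - collapse each run of whitespace, including newlines, into a single space
f=(/product=/? 1:0) - set the flag f to 1 (active) on a record ending in /product=, and to 0 for any other /key=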

Output:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
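
To see why the quoted values land on even-numbered records, here is a small sketch that prints every even record together with its record number. The function name show_quoted_records and the three-line sample are made up for illustration; the splitting and the whitespace collapsing are the same as in the command above.

```shell
# Print each even-numbered record with its record number, using the same
# RS='"' splitting and whitespace collapsing as the answer above.
show_quoted_records() {
  awk -v RS='"' '!(NR % 2) { gsub(/[[:space:]]+/, " "); print NR ": " $0 }'
}

# Hypothetical three-line sample in the same layout as the file in the question.
printf '%s\n' \
  ' /db_xref="SEED:x"' \
  ' /product="hypothetical' \
  '   protein"' | show_quoted_records
```

On this sample the db_xref value is record 2 and the product value, joined from two lines, is record 4. The odd records in between hold the /key= text, which is what the flag f inspects.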

Answer 2

This can be done with GNU grep; its output is separated by \0 (NUL) bytes:

grep -Pzo '/product="\K[^"]*'  | tr -s '\0\t\n' '\n '

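Or with perl, replacing each run of whitespace (spaces, newlines, or tabs) with a single space and separating the results with newlines: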

perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'

Answer 3

awk to the rescue! This requires multi-character RS support (gawk):

$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file


Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

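Explanation: set the record structure (records start at "/" or " CDS"), find the relevant records (those starting with product), trim the extra whitespace, and print the field between the two quotes (the second field, since the field separator is set to the double quote).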

Answer 4

Using GNU awk for multi-character RS and RT:

$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
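
Where gawk is not available, a line-based approach is one possible alternative, since it needs neither a multi-character RS nor RT. The following is only a sketch: the function name extract_products is made up for illustration, and it assumes that a product value never contains an embedded double quote and that continuation lines are indented with whitespace, as in the sample file.

```shell
# Line-based extraction of /product="..." values; uses only POSIX awk features.
extract_products() {
  awk '
    /\/product="/ {                    # a new value starts on this line
      sub(/^.*\/product="/, "")        # keep only what follows the opening quote
      buf = ""
      collecting = 1
    }
    collecting {
      line = $0
      sub(/^[[:space:]]+/, "", line)   # drop continuation-line indentation
      sep = (buf == "" ? "" : " ")     # join continuation lines with one space
      if (match(line, /"/)) {          # closing quote found: value is complete
        print buf sep substr(line, 1, RSTART - 1)
        collecting = 0
      } else {
        buf = buf sep line
      }
    }
  ' "$@"
}
```

It can be called as extract_products file, or used at the end of a pipeline. Each value is buffered until its closing quote appears, so it works whether the value sits on one line or spans several.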

Answer 5

Assuming the file is named file.txt:

echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'

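Explanation:

Merge all the lines into one line:

echo $(cat file.txt)

Replace each "/" with a newline:

echo $(cat file.txt) | sed 's/\//\n/g'

Keep only the lines that contain "product":

echo $(cat file.txt) | sed 's/\//\n/g' | grep product

Remove the leading product=" and everything from the closing double quote onward:

echo $(cat file.txt) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'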
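
The same four steps can also be sketched without the unquoted $(cat file.txt), which is subject to pathname expansion, and without \n in the sed replacement, which is a GNU extension. The function name products_via_pipeline is made up for illustration. Like the pipeline above, this assumes that no product value contains a "/" or an embedded double quote.

```shell
# Same steps as the pipeline above: flatten, split at "/", filter, strip.
products_via_pipeline() {
  tr -s '[:space:]' ' ' |            # join all lines, squeezing whitespace runs
    tr '/' '\n' |                    # start a new line at every "/"
    grep '^product=' |               # keep only the product entries
    sed 's/^product="//; s/".*//'    # drop the key and everything from the closing quote on
}
```

Usage: products_via_pipeline < file.txt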

awk: extracting multi-line data