awk: extracting data that spans multiple lines

Date: 2017-12-05 12:37:56

Tags: bash awk

I have a file that looks like this:

/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
                 LITPRAAVPALKRPALKASLPASSSHGNWETF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
                 QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
                 RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"

I need to extract the text between the quotes that follows /product=, so the result should be:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

I have to use awk, so I wrote this:

awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'

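But this only picks up the text that sits on the same line as /product, and sometimes the value runs over two or three lines. I have no idea how to capture everything between the quotes. Can anyone help?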
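
The truncation is easy to reproduce on a single wrapped entry. A minimal sketch, assuming a two-line sample fed on standard input instead of the real file; the function name first_try is made up for illustration:

```shell
# Run the one-liner from the question on one wrapped /product entry.
# awk reads the input line by line, so only the first physical line matches.
first_try() {
    awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'
}

printf '%s\n' \
    '                 /product="Methyl-accepting chemotaxis protein I (serine' \
    '                 chemoreceptor protein)"' | first_try
```

The second input line contains no /product, so it never reaches the action and the rest of the value is lost.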

5 Answers:

Answer 0: (score: 1)

Awk solution:

awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
               /\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file
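  • -v RS='"' - treat the double quote " as the record separator
  • !(NR%2) - true for every even-numbered record, i.e. the text between a pair of quotes
  • gsub(/[[:space:]]+/," ") - collapse each run of whitespace (including the line breaks) into a single space
  • f=(/product=/? 1:0) - set the flag f to 1 when the record ends with /product=, and back to 0 for any other qualifier

Output: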

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
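
To see why the even-numbered records are the quoted values, it can help to print every record awk produces with RS='"'. A small sketch; the function name show_records and the three-line sample are made up for illustration:

```shell
# Print each record awk sees when the double quote is the record separator.
# Odd records end with a qualifier key; even records are the quoted values.
show_records() {
    awk -v RS='"' '{
        gsub(/[[:space:]]+/, " ")        # collapse newlines and indentation
        printf "%d [%s]\n", NR, $0
    }'
}

printf '%s\n' \
    ' CDS             717..2354' \
    '                 /product="hypothetical' \
    '                 protein"' | show_records
```

Record 1 ends with /product=, which is what sets the flag f in the answer, and record 2 is the value that then gets printed.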

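Answer 1: (score: 1)

This can be done with GNU grep, whose -z output separates the matches with \0 (NUL) bytes; tr then turns those into newlines and squeezes the remaining whitespace:
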
grep -Pzo '/product="\K[^"]*'  | tr -s '\0\t\n' '\n '

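Or with perl, replacing each run of whitespace (spaces, newlines or tabs) with a single space and separating the results with newlines: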

perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'

Answer 2: (score: 1)

awk to the rescue! This needs multi-character RS support (e.g. gawk):

$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file


Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

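Explanation: set the record structure (records are delimited by "/" or " CDS"), find the relevant records (those that start with product), trim the extra whitespace, and print the field between the two quotes (the second field, because the field separator is set to the double quote).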

Answer 3: (score: 0)

Using GNU awk for multi-character RS and RT:

$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
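
The gawk answers rely on a regular-expression RS (and on RT), which POSIX awk does not guarantee. Where only a plain awk is available, a line-by-line accumulator is one possible alternative. This is a sketch and not taken from any answer here; the names extract_products, inq and buf are hypothetical, and it assumes a value never contains an embedded double quote:

```shell
# Portable sketch: join the physical lines of each /product="..." value.
# extract_products, inq and buf are illustrative names.
extract_products() {
    awk '
    !inq && /\/product="/ {
        sub(/^.*\/product="/, "")    # keep only the text after the opening quote
        buf = ""
        inq = 1                      # now inside an unterminated value
    }
    inq {
        sub(/^[ \t]+/, "")           # drop continuation-line indentation
        buf = (buf == "" ? $0 : buf " " $0)
        if (index(buf, "\"")) {      # closing quote reached
            sub(/".*$/, "", buf)     # cut the quote and anything after it
            print buf
            inq = 0
        }
    }' "$@"
}

extract_products <<'EOF'
 CDS             complement(471..590)
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"
EOF
```

Called as extract_products file, it reads the file instead of standard input.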

Answer 4: (score: -1)

Assuming the file is named file.txt:

echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'

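Explanation:

  1. Join all lines into a single line

    echo $(cat file.txt)

  2. Replace each "/" with a newline

    echo $(cat file.txt) | sed 's/\//\n/g'

  3. grep for the lines that contain product

    echo $(cat file.txt) | sed 's/\//\n/g' | grep product

  4. Remove the product=" prefix and everything from the closing double quote onward

    echo $(cat file.txt) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'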
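
Two details of this pipeline are easy to trip over: the unquoted $(cat file.txt) is subject to filename expansion as well as word splitting, and \n in a sed replacement is a GNU extension. Below is a sketch of the same four steps that uses tr for the joining and splitting. Like the original, it assumes that no product value contains a "/"; the function name products_via_tr is hypothetical:

```shell
# Same four steps, with tr doing the joining and the splitting.
# Usage: products_via_tr file.txt
products_via_tr() {
    tr '\t\n' '  ' < "$1" |             # 1. join all lines into one
        tr -s ' ' |                     #    and squeeze runs of spaces
        tr '/' '\n' |                   # 2. break the text at every "/"
        grep '^product=' |              # 3. keep the product qualifiers
        sed 's/^product="//; s/".*//'   # 4. strip the key and the tail
}
```

Anchoring the grep pattern with ^ keeps lines that merely mention the word product somewhere inside another qualifier out of the result.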