So I have a file that looks like this:
/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
LITPRAAVPALKRPALKASLPASSSHGNWETF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
CDS complement(471..590)
/db_xref="SEED:fig|1240086.14.peg.2"
/translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
/product="hypothetical protein"
CDS 717..2354
/db_xref="SEED:fig|1240086.14.peg.3"
/translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
/product="macromolecule metabolism; macromolecule
degradation; degradation of proteins, peptides,
glycopeptides"
I need to extract the text between the quotes after /product=, so I need this:
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
I have to use awk, so I wrote this:
awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'
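But this only gets the text that is on the same line as /product, and sometimes the value runs over two or three lines. I have no idea how to get the whole text between the quotes. Can anyone help?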
Answer 0: (score: 1)
Awk solution:
awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
/\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file
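-v RS='"' - treat the double quote (") as the record separator
!(NR%2) - act on every even-numbered record, i.e. the text between a pair of quotes
gsub(/[[:space:]]+/," ") - collapse each run of whitespace (including embedded newlines) into a single space
f=(/product=/? 1:0) - on a record that ends with a qualifier name such as /product=, set the flag f to 1 (active) if that qualifier is /product=, otherwise to 0
Output: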
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
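For comparison, the same result can be reached line by line, without changing RS, by remembering whether a /product value is still open. This is only a minimal sketch: the function name extract_products and the variable names are made up for illustration, and it assumes a value never contains an embedded double quote.

```shell
#!/bin/sh
# extract_products FILE...
# Hypothetical helper (not from the answer above): prints each
# /product="..." value on a single line, joining continuation lines.
extract_products() {
    awk '
    {
        line = $0
        if (!in_val) {
            # Ignore lines until a /product=" qualifier starts.
            if (line !~ /\/product="/) next
            sub(/^.*\/product="/, "", line)   # keep only the text after the opening quote
            val = line
            in_val = 1
        } else {
            sub(/^[ \t]+/, "", line)          # drop the indentation of a continuation line
            val = val " " line
        }
        if (val ~ /"/) {                      # closing quote seen: the value is complete
            sub(/".*$/, "", val)
            print val
            in_val = 0
        }
    }' "$@"
}
```

It would be called as extract_products file. It uses only single-character features of awk, so it should not depend on gawk extensions.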
Answer 1: (score: 1)
This can be done with GNU grep, whose output is separated by \0 (NUL bytes):
grep -Pzo '/product="\K[^"]*' | tr -s '\0\t\n' '\n '
Or with perl, replacing each run of spaces, newlines, or tabs with a single space and separating the results with newlines:
perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'
Answer 2: (score: 1)
awk to the rescue! This requires multi-character RS support (gawk):
$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
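macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
Explanation: set the record structure (a record starts at "/" or " CDS"), find the relevant records (those starting with product), trim the extra whitespace, and print the text between the two quotes (the second field, because the field separator is set to the double quote).
Answer 3: (score: 0)
Using GNU awk for multi-character RS and RT: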
$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
Answer 4: (score: -1)
Assuming the file is named file.txt:
echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'
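Explanation:
Join all lines into a single line:
echo $(cat file.txt)
Replace each "/" with a newline:
echo $(cat file.txt) | sed 's/\//\n/g'
Keep only the lines that contain product:
echo $(cat file.txt) | sed 's/\//\n/g' | grep product
Remove the leading product=" and everything from the closing double quote onward:
echo $(cat file.txt) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'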
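The same join, split, and filter steps can also be sketched with tr and a single sed, which avoids the unquoted $(cat ...) (subject to word splitting and glob expansion) and the \n in a sed replacement (not supported by every sed). The function name products_via_pipeline is hypothetical, and like the pipeline above this assumes that no value contains a "/" or an embedded double quote.

```shell
#!/bin/sh
# products_via_pipeline FILE
# Hypothetical helper: join all lines, split at slashes, keep product values.
products_via_pipeline() {
    tr -s '\t\n' '  ' < "$1" |   # join all lines; squeeze whitespace runs to one space
        tr '/' '\n' |            # start a new line at every slash
        sed -n 's/^product="\([^"]*\)".*/\1/p'   # print only the quoted product value
}
```

It would be called as products_via_pipeline file.txt. Because every slash starts a new line, a product description that itself contains a slash would be cut short, so this is best treated as a quick check rather than a general parser.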