I have a large data file that looks like this:
//
ID 1.1.1.258
DE 6-hydroxyhexanoate dehydrogenase.
CA 6-hydroxyhexanoate + NAD(+) = 6-oxohexanoate + NADH.
CC -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC NCIB 9871.
//
ID 1.1.1.259
DE 3-hydroxypimeloyl-CoA dehydrogenase.
CA 3-hydroxypimeloyl-CoA + NAD(+) = 3-oxopimeloyl-CoA + NADH.
CC -!- Involved in the anaerobic pathway of benzoate degradation in
CC bacteria.
//
ID 1.1.1.260
DE Sulcatone reductase.
CA Sulcatol + NAD(+) = sulcatone + NADH.
CC -!- Studies on the effects of growth-stage and nutrient supply on the
CC stereochemistry of sulcatone reduction in Clostridia pasteurianum,
CC C.tyrobutyricum and Lactobacillus brevis suggest that there may be at
CC least two sulcatone reductases with different stereospecificities.
//
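I want to extract the sections of this file that contain the word anaerobic. I particularly want the ID line.
Is there a way to search the file between ID and // for anaerobic and print the output to a new file? If the whole section is printed, that is fine too; I think I can pull the ID out from there.
The expected output would be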
ID 1.1.1.259
or
ID 1.1.1.259
DE 3-hydroxypimeloyl-CoA dehydrogenase.
CA 3-hydroxypimeloyl-CoA + NAD(+) = 3-oxopimeloyl-CoA + NADH.
CC -!- Involved in the anaerobic pathway of benzoate degradation in
CC bacteria.
//
Answer 0 (score: 3)
awk '/anaerobic/' RS='//\n' ORS='//\n' ./file.txt
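The one-liner above prints whole sections and relies on a multi-character RS, which gawk and mawk support but some other awk implementations may not. Since the question asks mainly for the ID line, here is a minimal sketch of a line-by-line alternative that writes only the matching IDs to a new file. The file names sample.txt and anaerobic_ids.txt are hypothetical, and the sketch assumes every section is terminated by a line containing only //.

```shell
# Build a small sample in the format from the question (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
//
ID 1.1.1.258
DE 6-hydroxyhexanoate dehydrogenase.
CC -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC NCIB 9871.
//
ID 1.1.1.259
DE 3-hydroxypimeloyl-CoA dehydrogenase.
CC -!- Involved in the anaerobic pathway of benzoate degradation in
CC bacteria.
//
EOF

# Track the current section line by line and report its ID when the section ends.
awk '
  /^ID /      { id = $0 }                          # remember the ID line of this section
  /anaerobic/ { found = 1 }                        # the section mentions the word
  /^\/\/$/    { if (found && id != "") print id
                id = ""; found = 0 }               # separator: report, then reset
  END         { if (found && id != "") print id }  # last section without a closing //
' sample.txt > anaerobic_ids.txt

cat anaerobic_ids.txt
```

Because it never changes RS, this version should run under any POSIX awk, and it prints one ID per matching section rather than only the first or last.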
Answer 1 (score: 2)
tac file | sed -n '/anaerobic/,$p' | sed -n '/^ID/ {p;q}'
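tac file : prints the file in reverse, from the last line to the first
sed -n '/anaerobic/,$p' : prints from the first occurrence of anaerobic to the end of the reversed input
sed -n '/^ID/ {p;q}' : searches for a line starting with ID, prints only the first occurrence, then quits
Note that this prints a single ID: because the input is reversed, it is the ID of the last section in the file that contains anaerobic.
Answer 2 (score: 2)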
For variety, a possible GNU sed solution:
sed -nr ':a; \@(^|\n)//$@! { N; ba }; /anaerobic/p' data
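-n => suppress automatic printing of the pattern space
-r => use extended regular expressions
:a => defines the label a
ba => branches to the label a
N => appends the next line to the pattern space
\@(^|\n)//$@! => matches when the "section" collected so far does not yet end with a // line
\@(^|\n)//$@! { N; ba } appends the next line to the pattern space until the // section separator is found.
/anaerobic/p then checks whether the current section contains anaerobic and, if so, the p command prints it.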
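If only the ID line is wanted, the same loop can be followed by a substitution that drops everything after the first line of the collected section. This is a sketch rather than part of the original answer: it assumes GNU sed, assumes the ID line is always the first line of a section, and uses the hypothetical file names sample.txt and anaerobic_ids_sed.txt. The script is spread over several lines so that the label and the branch each end at a newline.

```shell
# Build a small sample in the format from the question (sample.txt is a hypothetical name).
cat > sample.txt <<'EOF'
//
ID 1.1.1.258
DE 6-hydroxyhexanoate dehydrogenase.
CC -!- Involved in the cyclohexanol degradation pathway in Acinetobacter
CC NCIB 9871.
//
ID 1.1.1.259
DE 3-hydroxypimeloyl-CoA dehydrogenase.
CC -!- Involved in the anaerobic pathway of benzoate degradation in
CC bacteria.
//
EOF

sed -nr '
  # Collect lines until the pattern space ends with a // separator line.
  :a
  \@(^|\n)//$@! {
    N
    ba
  }
  # For a matching section, keep only its first line (assumed to be the ID) and print it.
  /anaerobic/ {
    s/\n.*//
    p
  }
' sample.txt > anaerobic_ids_sed.txt

cat anaerobic_ids_sed.txt
```

For the sample above this should leave the single line ID 1.1.1.259 in the output file.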