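I have a huge txt data file that looks like the sample below, and I want to convert it into a format that is easy to look through. I tried searching for the ID.
I tried using the command | sed -n'ID', but that only tries to find the ID lines, so I really don't know how to produce the format shown further down.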
//
ID 1.1.1.1
DE Alcohol dehydrogenase.
AN Aldehyde reductase.
CA (1) A primary alcohol + NAD(+) = an aldehyde + NADH.
CA (2) A secondary alcohol + NAD(+) = a ketone + NADH.
CF Zn(2+) or Fe cation.
CC -!- Acts on primary or secondary alcohols or hemi-acetals with very broad
CC specificity; however the enzyme oxidizes methanol much more poorly
CC than ethanol.
CC -!- The animal, but not the yeast, enzyme acts also on cyclic secondary
CC alcohols.
PR PROSITE; PDOC00058;
PR PROSITE; PDOC00059;
PR PROSITE; PDOC00060;
DR P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU; Q5RBP7, ADH1A_PONAB;
DR P25405, ADH1A_SAAHA; P00325, ADH1B_HUMAN; Q5R1W2, ADH1B_PANTR;
DR P14139, ADH1B_PAPHA; P25406, ADH1B_SAAHA; P00327, ADH1E_HORSE;
DR P00326, ADH1G_HUMAN; O97959, ADH1G_PAPHA; P00328, ADH1S_HORSE;
//
ID 1.1.1.2
DE Alcohol dehydrogenase (NADP(+)).
AN Aldehyde reductase (NADPH).
CA An alcohol + NADP(+) = an aldehyde + NADPH.
CF Zn(2+).
CC -!- Some members of this group oxidize only primary alcohols; others act
CC also on secondary alcohols.
CC -!- May be identical with EC 1.1.1.19, EC 1.1.1.33 and EC 1.1.1.55.
CC -!- Re-specific with respect to NADPH.
PR PROSITE; PDOC00061;
DR Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;
DR Q04894, ADH6_YEAST ; P25377, ADH7_YEAST ; O57380, ADH8_PELPE ;
DR Q9F282, ADHA_THEET ; P0CH36, ADHC1_MYCS2; P0CH37, ADHC2_MYCS2;
DR P0A4X1, ADHC_MYCBO ; P9WQC4, ADHC_MYCTO ; P9WQC5, ADHC_MYCTU ;
DR P27250, AHR_ECOLI ; Q3ZCJ2, AK1A1_BOVIN; Q5ZK84, AK1A1_CHICK;
DR O70473, AK1A1_CRIGR; P14550, AK1A1_HUMAN; Q9JII6, AK1A1_MOUSE;
DR P50578, AK1A1_PIG ; Q5R5D5, AK1A1_PONAB; P51635, AK1A1_RAT ;
DR Q6GMC7, AK1A1_XENLA; Q28FD1, AK1A1_XENTR; Q9UUN9, ALD2_SPOSA ;
DR P27800, ALDX_SPOSA ; P75691, YAHK_ECOLI ;
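I want to extract the ID at the top of each section and attach it to every protein name in that section. The names are separated from one another by semicolons (;).
So the output would look like this: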
PR PROSITE; 1.1.1.1
PR PDOC00058; 1.1.1.1
PR PROSITE; 1.1.1.1
PR PDOC00059; 1.1.1.1
DR P07327, ADH1A_HUMAN; 1.1.1.1
DR P28469, ADH1A_MACMU; 1.1.1.1
DR Q5RBP7, ADH1A_PONAB; 1.1.1.1
DR P25405, ADH1A_SAAHA; 1.1.1.1
DR P00325, ADH1B_HUMAN; 1.1.1.1
DR Q5R1W2, ADH1B_PANTR; 1.1.1.1
DR P14139, ADH1B_PAPHA; 1.1.1.1
DR P25406, ADH1B_SAAHA; 1.1.1.1
DR P00327, ADH1E_HORSE; 1.1.1.1
DR P00326, ADH1G_HUMAN; 1.1.1.1
DR O97959, ADH1G_PAPHA; 1.1.1.1
DR P00328, ADH1S_HORSE; 1.1.1.1
PR PROSITE; 1.1.1.2
PR PDOC00061; 1.1.1.2
DR Q6AZW2, A1A1A_DANRE; 1.1.1.2
DR Q568L5, A1A1B_DANRE; 1.1.1.2
DR Q24857, ADH3_ENTHI ; 1.1.1.2
DR Q04894, ADH6_YEAST ; 1.1.1.2
DR P25377, ADH7_YEAST ; 1.1.1.2
DR O57380, ADH8_PELPE ; 1.1.1.2
DR Q9F282, ADHA_THEET ; 1.1.1.2
DR P0CH36, ADHC1_MYCS2; 1.1.1.2
DR P0CH37, ADHC2_MYCS2; 1.1.1.2
DR P0A4X1, ADHC_MYCBO ; 1.1.1.2
DR P9WQC4, ADHC_MYCTO ; 1.1.1.2
DR P9WQC5, ADHC_MYCTU ; 1.1.1.2
DR P27250, AHR_ECOLI ; 1.1.1.2
DR Q3ZCJ2, AK1A1_BOVIN; 1.1.1.2
DR Q5ZK84, AK1A1_CHICK; 1.1.1.2
DR O70473, AK1A1_CRIGR; 1.1.1.2
DR P14550, AK1A1_HUMAN; 1.1.1.2
DR Q9JII6, AK1A1_MOUSE; 1.1.1.2
DR P50578, AK1A1_PIG ; 1.1.1.2
DR Q5R5D5, AK1A1_PONAB; 1.1.1.2
DR P51635, AK1A1_RAT ; 1.1.1.2
DR Q6GMC7, AK1A1_XENLA; 1.1.1.2
DR Q28FD1, AK1A1_XENTR; 1.1.1.2
DR Q9UUN9, ALD2_SPOSA ; 1.1.1.2
DR P27800, ALDX_SPOSA ; 1.1.1.2
DR P75691, YAHK_ECOLI ; 1.1.1.2
Answer 0 (score: 1)
Although I'm sure awk could be used, I would use perl, since I know it better:
#!/usr/bin/perl -n
# -n wraps the script in a loop over the input lines, so $id must be a
# package variable to survive from one line to the next
our $id;

# save the id to use later
if (/^\s*id\s+(.*?)$/i) {
    $id = $1;
}

# when we see a PR or DR line, save the interesting bits
if (/^\s*([pd]r)\s+((?:[^;]+;\s*)+)/i) {
    my ($type, $labels) = ($1, $2);
    # split on semicolons
    for my $label (split(/;\s*/, $labels)) {
        # and output the desired format
        printf("%s\t%s;\t%s\n", $type, $label, $id);
    }
}
Call it like this:
./tx in.txt > gen.txt
This gives the expected output, plus the two lines for PDOC00060, which the expected output in the question leaves out:
$ diff -bwiu gen.txt expected.txt
--- gen.txt 2019-01-09 10:06:59.000000000 -0700
+++ expected.txt 2019-01-09 09:58:32.000000000 -0700
@@ -2,8 +2,6 @@
PR PDOC00058; 1.1.1.1
PR PROSITE; 1.1.1.1
PR PDOC00059; 1.1.1.1
-PR PROSITE; 1.1.1.1
-PR PDOC00060; 1.1.1.1
DR P07327, ADH1A_HUMAN; 1.1.1.1
DR P28469, ADH1A_MACMU; 1.1.1.1
DR Q5RBP7, ADH1A_PONAB; 1.1.1.1
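For readers who prefer Python to Perl, the same idea (remember the most recent ID, then emit one row per semicolon-terminated label on each PR or DR line) might be sketched as follows. This is not part of the original answer: the function name tag_records and the SAMPLE text are hypothetical, and unlike the Perl script this version strips the whitespace around each label, so "ADH3_ENTHI ;" comes out as "ADH3_ENTHI;".

```python
import re

# Matches "ID 1.1.1.1" and captures the identifier.
ID_RE = re.compile(r"^\s*ID\s+(.*?)\s*$", re.IGNORECASE)
# Matches PR/DR lines and captures the line type and the rest of the line.
REF_RE = re.compile(r"^\s*(PR|DR)\s+(.*)$", re.IGNORECASE)

# Hypothetical sample, shortened from the data in the question.
SAMPLE = """\
//
ID 1.1.1.1
DE Alcohol dehydrogenase.
PR PROSITE; PDOC00058;
DR P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU;
//
ID 1.1.1.2
DR Q24857, ADH3_ENTHI ;
"""


def tag_records(lines):
    """Yield 'TYPE<TAB>label;<TAB>id' for each ';'-terminated label."""
    current_id = None
    for line in lines:
        m = ID_RE.match(line)
        if m:
            current_id = m.group(1)  # remember the ID of this section
            continue
        m = REF_RE.match(line)
        if m and current_id is not None:
            kind, rest = m.group(1), m.group(2)
            # Only text followed by a semicolon counts as a label.
            for label in re.findall(r"([^;]+);", rest):
                label = label.strip()
                if label:
                    yield f"{kind}\t{label};\t{current_id}"


for row in tag_records(SAMPLE.splitlines()):
    print(row)
```

Because tag_records is a generator that consumes its input line by line, an open file handle can be passed in place of SAMPLE.splitlines() (for example the in.txt file from the invocation above), so a huge file never has to be loaded into memory at once.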