我有两个含有数千种蛋白质的文件:(1)文件:蛋白质ID +氨基酸序列; (2)文件:蛋白质ID +核苷酸序列。我的第三个文件是在(这些蛋白质)中具有结构域位置的文件,其与我的氨基酸序列和核苷酸的序列文件相关。我用这段代码将这三个文件联系起来:
acids.txt文件包含:
ENST00000274849 | Q9ULW3 MEAEESEKAATEQEPLEGTEQTLDAEEEQEESEEAACGSKKRVVPGIVYLGHIPPRFRPL HVRNLLSAYGEVGRVFFQAEDRFVRRKKKAAAAAGGKKRSYTKDYTEGWVEFRDKRIAKR VAASLHNTPMGARRRSPFRYDLWNLKYLHRFTWSHLSEHLAFERQVRRQRLRAEVAQAKR ETDFYLQSVERGQRFLAADGDPARPDGSWTFAQRPTEQELRARKAARPGGRERARLATAQ DKARSNKGLLARIFGAPPPSESMEGPSLVRDS *
nucleotides.txt文件包含:
ENST00000274849 | Q9ULW3 ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACCTCAGCGAGCACCTC GCCTTTGAGCGCCAGGTGCGCAGGCAGCGCTTGAGAGCGGAGGTTGCTCAAGCCAAGCGT GAGACCGACTTCTATCTTCAAAGTGTGGAACGGGGACAACGCTTTCTTGCGGCCGATGGG GACCCTGCTCGCCCAGATGGCTCCTGGACATTTGCCCAGCGTCCTACTGAGCAGGAACTG AGGGCCCGTAAAGCAGCACGGCCAGGGGGACGTGAACGGGCTCGCCTGGCAACTGCCCAG GACAAGGCCCGCTCCAACAAAGGGCTCCTGGCCAGGATCTTTGGAGCCCCGCCACCCTCA GAGAGCATGGAGGGACCTTCCCTTGTCAGGGACTCCTGA
domain.txt文件包含:
Q9ULW3; 46 142
注意:这个数字表示我的序列中的位置域
脚本:
use strict;
use Bio::SeqIO;
####################################################
#MODULE 1: read protein file, and save it in a hash#
####################################################
my %hash1;
my $sequence = "acid.txt";
my $multifasta = Bio::SeqIO ->new (-file => "<$sequence",-format=> "fasta");
while (my $seq= $multifasta->next_seq()) {
my $na= $seq->display_id; #Saves the ID in $na
my $ss = $seq->seq;
$hash1{$na} = $ss;
}
#############################################################
#MODULE 2: read nucleotide file, and save it in another hash#
#############################################################
my %hash2;
my $genes = "nucleotides.txt";
my $multifasta = Bio::SeqIO ->new (-file => "<$genes",-format=> "fasta");
while (my $seq= $multifasta->next_seq()) {
my $na= $seq->display_id; #Saves the ID in $na
my $des=$seq->description;
my $ss = $seq->seq;
$hash2{$na} = $ss;
}
#####################
#MODULE 3: my $name;#
#####################
my $name; # Read from standard input
chomp $name;
##############################################################################
#MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
##############################################################################
foreach my $name (keys %hash1) {
my $ac = (split(/\s*\|/, $name))[1];
#print "$ac\n" ;
####################################################
#MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS#
####################################################
open(FILE, "<" ,"domain.txt");
my @array = (<FILE>);
my @lines = grep (/$ac/, @array);
print for @lines;
close (FILE);
############################################################
#MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
############################################################
my @array1 = split(//, $hash1{$name}, $hash2{$name}); #CUT SEQUENCE
my @array2 = unpack("a3" x (length($hash1{$name})),$hash2{$name}); #CUT
NUCLEOTIDE SEQUENCE
my $number = "$#array1+1";
foreach (my $count = 0; $count <= $number; $count++) {
print "$count\t@array1[$count]\t@array2[$count]\n";
}
}
这是我运行脚本后得到的文件:
Q9ULW3; 46 142
0 M ATG
1 E GAG
2 A GCA
3 E GAG
4 E GAA
5 S TCG
6 E GAG
7 K AAG
8 A GCC
9 A GCA
10 T ACG
11 E GAG
12 Q CAA
13 E GAG
14 P CCG
15 L CTG
16 E GAA
17 G GGG
18 T ACA
19 E GAA
20 Q CAG
21 T ACA
22 L CTA
23 D GAT
24 A GCG
25 E GAG
26 E GAG
27 E GAG
28 Q CAG
29 E GAG
30 E GAA
31 S TCC
32 E GAA
33 E GAA
34 A GCG
35 A GCC
36 C TGT
37 G GGC
38 S AGC
39 K AAG
40 K AAA
41 R CGG
42 V GTA
43 V GTG
44 P CCA
45 G GGT
46 I ATT
47 V GTG
48 Y TAC
49 L CTG
50 G GGC
51 H CAT
52 I ATC
53 P CCG
54 P CCG
55 R CGC
56 F TTC
57 R CGG
58 P CCC
59 L CTG
60 H CAC
61 V GTC
62 R CGC
63 N AAC
64 L CTT
65 L CTC
66 S AGC
67 A GCC
68 Y TAT
69 G GGC
70 E GAG
71 V GTC
72 G GGA
73 R CGC
74 V GTC
75 F TTC
76 F TTT
77 Q CAG
78 A GCT
79 E GAG
80 D GAC
81 R CGG
82 F TTC
83 V GTG
84 R AGA
85 R CGC
86 K AAG
87 K AAG
88 K AAG
89 A GCA
90 A GCA
91 A GCA
92 A GCT
93 A GCC
94 G GGA
95 G GGA
96 K AAA
97 K AAG
98 R CGG
99 S TCC
100 Y TAC
101 T ACC
102 K AAG
103 D GAC
104 Y TAC
105 T ACC
106 E GAG
107 G GGA
108 W TGG
109 V GTG
110 E GAG
111 F TTC
112 R CGT
113 D GAC
114 K AAG
115 R CGC
116 I ATA
117 A GCC
118 K AAG
119 R CGC
120 V GTG
121 A GCG
122 A GCC
123 S AGT
124 L CTA
125 H CAC
126 N AAC
127 T ACG
128 P CCT
129 M ATG
130 G GGT
131 A GCC
132 R CGC
133 R AGG
134 R CGC
135 S AGC
136 P CCC
137 F TTC
138 R CGT
139 Y TAT
140 D GAT
141 L CTT
142 W TGG
143 N AAC
144 L CTC
145 K AAG
146 Y TAC
147 L TTG
148 H CAC
149 R CGT
150 F TTC
151 T ACC
152 W TGG
153 S TCC
154 H CAC
155 * TGA
现在我应该添加一个新的第四列,其中包含&#39; YES&#39;或者&#39; NOT&#39; - 这取决于域中的密码子 - 是,不在域中 - 不是。所以,这里是46到142位置的域名。我想得到这个输出文件:
Q9ULW3; 46 142
0 M ATG NOT
1 E GAG NOT
2 A GCA NOT
3 E GAG NOT
4 E GAA NOT
5 S TCG NOT
6 E GAG NOT
7 K AAG NOT
8 A GCC NOT
9 A GCA NOT
10 T ACG NOT
11 E GAG NOT
12 Q CAA NOT
13 E GAG NOT
14 P CCG NOT
15 L CTG NOT
16 E GAA NOT
17 G GGG NOT
18 T ACA NOT
19 E GAA NOT
20 Q CAG NOT
21 T ACA NOT
22 L CTA NOT
23 D GAT NOT
24 A GCG NOT
25 E GAG NOT
26 E GAG NOT
27 E GAG NOT
28 Q CAG NOT
29 E GAG NOT
30 E GAA NOT
31 S TCC NOT
32 E GAA NOT
33 E GAA NOT
34 A GCG NOT
35 A GCC NOT
36 C TGT NOT
37 G GGC NOT
38 S AGC NOT
39 K AAG NOT
40 K AAA NOT
41 R CGG NOT
42 V GTA NOT
43 V GTG NOT
44 P CCA NOT
45 G GGT NOT
46 I ATT YES
47 V GTG YES
48 Y TAC YES
49 L CTG YES
50 G GGC YES
51 H CAT YES
52 I ATC YES
53 P CCG YES
54 P CCG YES
55 R CGC YES
56 F TTC YES
57 R CGG YES
58 P CCC YES
59 L CTG YES
60 H CAC YES
61 V GTC YES
62 R CGC YES
63 N AAC YES
64 L CTT YES
65 L CTC YES
66 S AGC YES
67 A GCC YES
68 Y TAT YES
69 G GGC YES
70 E GAG YES
71 V GTC YES
72 G GGA YES
73 R CGC YES
74 V GTC YES
75 F TTC YES
76 F TTT YES
77 Q CAG YES
78 A GCT YES
79 E GAG YES
80 D GAC YES
81 R CGG YES
82 F TTC YES
83 V GTG YES
84 R AGA YES
85 R CGC YES
86 K AAG YES
87 K AAG YES
88 K AAG YES
89 A GCA YES
90 A GCA YES
91 A GCA YES
92 A GCT YES
93 A GCC YES
94 G GGA YES
95 G GGA YES
96 K AAA YES
97 K AAG YES
98 R CGG YES
99 S TCC YES
100 Y TAC YES
101 T ACC YES
102 K AAG YES
103 D GAC YES
104 Y TAC YES
105 T ACC YES
106 E GAG YES
107 G GGA YES
108 W TGG YES
109 V GTG YES
110 E GAG YES
111 F TTC YES
112 R CGT YES
113 D GAC YES
114 K AAG YES
115 R CGC YES
116 I ATA YES
117 A GCC YES
118 K AAG YES
119 R CGC YES
120 V GTG YES
121 A GCG YES
122 A GCC YES
123 S AGT YES
124 L CTA YES
125 H CAC YES
126 N AAC YES
127 T ACG YES
128 P CCT YES
129 M ATG YES
130 G GGT YES
131 A GCC YES
132 R CGC YES
133 R AGG YES
134 R CGC YES
135 S AGC YES
136 P CCC YES
137 F TTC YES
138 R CGT YES
139 Y TAT YES
140 D GAT YES
141 L CTT YES
142 W TGG YES
143 N AAC NOT
144 L CTC NOT
145 K AAG NOT
146 Y TAC NOT
147 L TTG NOT
148 H CAC NOT
149 R CGT NOT
150 F TTC NOT
151 T ACC NOT
152 W TGG NOT
153 S TCC NOT
154 H CAC NOT
155 * TGA NOT
这是一种蛋白质的例子,我必须为数千种蛋白质做。拜托,你有什么建议吗?
谢谢!