Adding a new column that marks which codons fall within a protein domain's positions

Time: 2017-08-18 08:06:39

Tags: perl sequence

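I have two files covering several thousand proteins: (1) protein ID + amino acid sequence; (2) protein ID + nucleotide sequence. A third file gives the domain positions within these proteins and refers to the same proteins as the amino acid and nucleotide sequence files. I link the three files with the script further down.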

The acids.txt file contains:

  

ENST00000274849 | Q9ULW3   MEAEESEKAATEQEPLEGTEQTLDAEEEQEESEEAACGSKKRVVPGIVYLGHIPPRFRPL HVRNLLSAYGEVGRVFFQAEDRFVRRKKKAAAAAGGKKRSYTKDYTEGWVEFRDKRIAKR VAASLHNTPMGARRRSPFRYDLWNLKYLHRFTWSHLSEHLAFERQVRRQRLRAEVAQAKR ETDFYLQSVERGQRFLAADGDPARPDGSWTFAQRPTEQELRARKAARPGGRERARLATAQ DKARSNKGLLARIFGAPPPSESMEGPSLVRDS *

The nucleotides.txt file contains:

  

ENST00000274849 | Q9ULW3   ATGGAGGCAGAGGAATCGGAGAAGGCCGCAACGGAGCAAGAGCCGCTGGAAGGGACAGAA CAGACACTAGATGCGGAGGAGGAGCAGGAGGAATCCGAAGAAGCGGCCTGTGGCAGCAAG AAACGGGTAGTGCCAGGTATTGTGTACCTGGGCCATATCCCGCCGCGCTTCCGGCCCCTG CACGTCCGCAACCTTCTCAGCGCCTATGGCGAGGTCGGACGCGTCTTCTTTCAGGCTGAG GACCGGTTCGTGAGACGCAAGAAGAAGGCAGCAGCAGCTGCCGGAGGAAAAAAGCGGTCC TACACCAAGGACTACACCGAGGGATGGGTGGAGTTCCGTGACAAGCGCATAGCCAAGCGC GTGGCGGCCAGTCTACACAACACGCCTATGGGTGCCCGCAGGCGCAGCCCCTTCCGTTAT GATCTTTGGAACCTCAAGTACTTGCACCGTTTCACCTGGTCCCACCTCAGCGAGCACCTC GCCTTTGAGCGCCAGGTGCGCAGGCAGCGCTTGAGAGCGGAGGTTGCTCAAGCCAAGCGT GAGACCGACTTCTATCTTCAAAGTGTGGAACGGGGACAACGCTTTCTTGCGGCCGATGGG GACCCTGCTCGCCCAGATGGCTCCTGGACATTTGCCCAGCGTCCTACTGAGCAGGAACTG AGGGCCCGTAAAGCAGCACGGCCAGGGGGACGTGAACGGGCTCGCCTGGCAACTGCCCAG GACAAGGCCCGCTCCAACAAAGGGCTCCTGGCCAGGATCTTTGGAGCCCCGCCACCCTCA GAGAGCATGGAGGGACCTTCCCTTGTCAGGGACTCCTGA

The domain.txt file contains:

 Q9ULW3;    46 142

Note: the two numbers are the start and end positions of the domain in my sequence.
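
A line of domain.txt like the one above can be split into accession, start, and end with a single regular expression. This is only a minimal sketch: it assumes every line has the form "ACCESSION; START END" with arbitrary whitespace, and the helper name parse_domain_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split one domain.txt line into (accession, start, end).
# Assumes the layout "ACCESSION;  START  END"; returns an empty list otherwise.
sub parse_domain_line {
    my ($line) = @_;
    return $line =~ /^\s*([^;\s]+)\s*;\s*(\d+)\s+(\d+)\s*$/ ? ($1, $2, $3) : ();
}

my ($acc, $start, $end) = parse_domain_line(" Q9ULW3;    46 142\n");
print "$acc\t$start\t$end\n";
```

Lines that do not match the assumed layout give an empty list, so a caller can skip them rather than print a row with missing positions.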

Script:

use strict;
use warnings;
use Bio::SeqIO;
####################################################
#MODULE 1: read protein file, and save it in a hash#
####################################################
my %hash1;
my $sequence = "acid.txt";
my $multifasta = Bio::SeqIO->new(-file => "<$sequence", -format => "fasta");
while (my $seq = $multifasta->next_seq()) {
    my $na = $seq->display_id;   # Saves the ID in $na
    my $ss = $seq->seq;          # Saves the amino acid sequence in $ss
    $hash1{$na} = $ss;
}
#############################################################
#MODULE 2: read nucleotide file, and save it in another hash#
#############################################################
my %hash2;
my $genes = "nucleotides.txt";
$multifasta = Bio::SeqIO->new(-file => "<$genes", -format => "fasta");
while (my $seq = $multifasta->next_seq()) {
    my $na  = $seq->display_id;  # Saves the ID in $na
    my $des = $seq->description;
    my $ss  = $seq->seq;         # Saves the nucleotide sequence in $ss
    $hash2{$na} = $ss;
}
#####################
#MODULE 3: my $name;#
#####################
my $name;               # Not used below: the MODULE 4 loop declares its own $name
##############################################################################
#MODULE 4: DOMAIN ANNOTATION + RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
##############################################################################
foreach my $name (keys %hash1) {
    my $ac = (split(/\s*\|/, $name))[1];   # Accession after the "|" in the ID
    #print "$ac\n" ;
    ####################################################
    #MODULE 4.1: DOMAIN ANNOTATION: POSITION OF DOMAINS#
    ####################################################
    open(my $fh, "<", "domain.txt") or die "Cannot open domain.txt: $!";
    my @array = <$fh>;
    close($fh);
    my @lines = grep { /\Q$ac\E/ } @array;   # Domain lines for this accession
    print for @lines;
    ############################################################
    #MODULE 4.2: RELATED AMINO ACIDS AND NUCLEOTIDES IN COLUMNS#
    ############################################################
    my @array1 = split(//, $hash1{$name});                              # CUT AMINO ACID SEQUENCE
    my @array2 = unpack("a3" x length($hash1{$name}), $hash2{$name});   # CUT NUCLEOTIDE SEQUENCE
    foreach my $count (0 .. $#array1) {
        print "$count\t$array1[$count]\t$array2[$count]\n";
    }

}
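
Pairing each residue with one codon, as MODULE 4.2 does, only lines up when the nucleotide sequence is exactly three times as long as the amino acid sequence. A small check along these lines could flag mismatched entries before printing. It is a sketch, not part of the original script: the name lengths_match is hypothetical, and it assumes the "*" stop symbol is counted as a residue.

```perl
use strict;
use warnings;

# Hypothetical check: does the coding sequence hold exactly one codon
# per residue (the "*" stop symbol counts as a residue here)?
sub lengths_match {
    my ($protein, $cds) = @_;
    return length($cds) == 3 * length($protein) ? 1 : 0;
}

# Toy sequences, not real data.
print lengths_match('MEA*', 'ATGGAGGCATGA') ? "ok\n" : "mismatch\n";
print lengths_match('MEA*', 'ATGGAGGCATG')  ? "ok\n" : "mismatch\n";
```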

This is the file I get after running the script:

 Q9ULW3; 46    142   
  0 M   ATG
  1 E   GAG
  2 A   GCA
  3 E   GAG
  4 E   GAA
  5 S   TCG
  6 E   GAG
  7 K   AAG
  8 A   GCC
  9 A   GCA
 10 T   ACG
 11 E   GAG
 12 Q   CAA
 13 E   GAG
 14 P   CCG
 15 L   CTG
 16 E   GAA
 17 G   GGG
 18 T   ACA
 19 E   GAA
 20 Q   CAG
 21 T   ACA
 22 L   CTA
 23 D   GAT
 24 A   GCG
 25 E   GAG
 26 E   GAG
 27 E   GAG
 28 Q   CAG
 29 E   GAG
 30 E   GAA
 31 S   TCC
 32 E   GAA
 33 E   GAA
 34 A   GCG
 35 A   GCC
 36 C   TGT
 37 G   GGC
 38 S   AGC
 39 K   AAG
 40 K   AAA
 41 R   CGG
 42 V   GTA
 43 V   GTG
 44 P   CCA
 45 G   GGT
 46 I   ATT
 47 V   GTG
 48 Y   TAC
 49 L   CTG
 50 G   GGC
 51 H   CAT
 52 I   ATC
 53 P   CCG
 54 P   CCG
 55 R   CGC
 56 F   TTC
 57 R   CGG
 58 P   CCC
 59 L   CTG
 60 H   CAC
 61 V   GTC
 62 R   CGC
 63 N   AAC
 64 L   CTT
 65 L   CTC
 66 S   AGC
 67 A   GCC
 68 Y   TAT
 69 G   GGC
 70 E   GAG
 71 V   GTC
 72 G   GGA
 73 R   CGC
 74 V   GTC
 75 F   TTC
 76 F   TTT
 77 Q   CAG
 78 A   GCT
 79 E   GAG
 80 D   GAC
 81 R   CGG
 82 F   TTC
 83 V   GTG
 84 R   AGA
 85 R   CGC
 86 K   AAG
 87 K   AAG
 88 K   AAG
 89 A   GCA
 90 A   GCA
 91 A   GCA
 92 A   GCT
 93 A   GCC
 94 G   GGA
 95 G   GGA
 96 K   AAA
 97 K   AAG 
 98 R   CGG
 99 S   TCC
100 Y   TAC
101 T   ACC
102 K   AAG 
103 D   GAC
104 Y   TAC
105 T   ACC
106 E   GAG
107 G   GGA
108 W   TGG
109 V   GTG
110 E   GAG
111 F   TTC
112 R   CGT
113 D   GAC
114 K   AAG
115 R   CGC
116 I   ATA
117 A   GCC
118 K   AAG
119 R   CGC
120 V   GTG
121 A   GCG
122 A   GCC
123 S   AGT
124 L   CTA
125 H   CAC
126 N   AAC
127 T   ACG
128 P   CCT
129 M   ATG
130 G   GGT
131 A   GCC
132 R   CGC
133 R   AGG
134 R   CGC
135 S   AGC
136 P   CCC
137 F   TTC
138 R   CGT
139 Y   TAT
140 D   GAT
141 L   CTT
142 W   TGG
143 N   AAC
144 L   CTC
145 K   AAG
146 Y   TAC
147 L   TTG
148 H   CAC
149 R   CGT
150 F   TTC
151 T   ACC
152 W   TGG
153 S   TCC
154 H   CAC
155 *   TGA

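Now I need to add a new fourth column containing 'YES' or 'NOT', depending on whether the codon lies inside the domain: YES if it does, NOT if it does not. In this example the domain spans positions 46 to 142. I would like to get this output file: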

 Q9ULW3;    46    142         
  0    M    ATG    NOT
  1    E    GAG    NOT
  2    A    GCA    NOT
  3    E    GAG    NOT
  4    E    GAA    NOT
  5    S    TCG    NOT
  6    E    GAG    NOT
  7    K    AAG    NOT
  8    A    GCC    NOT
  9    A    GCA    NOT
 10    T    ACG    NOT
 11    E    GAG    NOT
 12    Q    CAA    NOT
 13    E    GAG    NOT
 14    P    CCG    NOT
 15    L    CTG    NOT
 16    E    GAA    NOT
 17    G    GGG    NOT
 18    T    ACA    NOT
 19    E    GAA    NOT
 20    Q    CAG    NOT
 21    T    ACA    NOT
 22    L    CTA    NOT
 23    D    GAT    NOT
 24    A    GCG    NOT
 25    E    GAG    NOT
 26    E    GAG    NOT
 27    E    GAG    NOT
 28    Q    CAG    NOT
 29    E    GAG    NOT
 30    E    GAA    NOT
 31    S    TCC    NOT
 32    E    GAA    NOT
 33    E    GAA    NOT
 34    A    GCG    NOT
 35    A    GCC    NOT
 36    C    TGT    NOT
 37    G    GGC    NOT
 38    S    AGC    NOT
 39    K    AAG    NOT
 40    K    AAA    NOT
 41    R    CGG    NOT
 42    V    GTA    NOT
 43    V    GTG    NOT
 44    P    CCA    NOT
 45    G    GGT    NOT
 46    I    ATT    YES
 47    V    GTG    YES
 48    Y    TAC    YES
 49    L    CTG    YES
 50    G    GGC    YES
 51    H    CAT    YES
 52    I    ATC    YES
 53    P    CCG    YES
 54    P    CCG    YES
 55    R    CGC    YES
 56    F    TTC    YES
 57    R    CGG    YES
 58    P    CCC    YES
 59    L    CTG    YES
 60    H    CAC    YES
 61    V    GTC    YES
 62    R    CGC    YES
 63    N    AAC    YES
 64    L    CTT    YES
 65    L    CTC    YES
 66    S    AGC    YES
 67    A    GCC    YES
 68    Y    TAT    YES
 69    G    GGC    YES
 70    E    GAG    YES
 71    V    GTC    YES
 72    G    GGA    YES
 73    R    CGC    YES
 74    V    GTC    YES
 75    F    TTC    YES
 76    F    TTT    YES
 77    Q    CAG    YES
 78    A    GCT    YES
 79    E    GAG    YES
 80    D    GAC    YES
 81    R    CGG    YES
 82    F    TTC    YES
 83    V    GTG    YES
 84    R    AGA    YES
 85    R    CGC    YES
 86    K    AAG    YES
 87    K    AAG    YES
 88    K    AAG    YES
 89    A    GCA    YES
 90    A    GCA    YES
 91    A    GCA    YES
 92    A    GCT    YES
 93    A    GCC    YES
 94    G    GGA    YES
 95    G    GGA    YES
 96    K    AAA    YES
 97    K    AAG    YES
 98    R    CGG    YES
 99    S    TCC    YES
100    Y    TAC    YES
101    T    ACC    YES
102    K    AAG    YES
103    D    GAC    YES
104    Y    TAC    YES
105    T    ACC    YES
106    E    GAG    YES
107    G    GGA    YES
108    W    TGG    YES
109    V    GTG    YES
110    E    GAG    YES
111    F    TTC    YES
112    R    CGT    YES
113    D    GAC    YES
114    K    AAG    YES
115    R    CGC    YES
116    I    ATA    YES
117    A    GCC    YES
118    K    AAG    YES
119    R    CGC    YES
120    V    GTG    YES
121    A    GCG    YES
122    A    GCC    YES
123    S    AGT    YES
124    L    CTA    YES
125    H    CAC    YES
126    N    AAC    YES
127    T    ACG    YES
128    P    CCT    YES
129    M    ATG    YES
130    G    GGT    YES
131    A    GCC    YES
132    R    CGC    YES
133    R    AGG    YES
134    R    CGC    YES
135    S    AGC    YES
136    P    CCC    YES
137    F    TTC    YES
138    R    CGT    YES  
139    Y    TAT    YES
140    D    GAT    YES
141    L    CTT    YES
142    W    TGG    YES
143    N    AAC    NOT
144    L    CTC    NOT
145    K    AAG    NOT
146    Y    TAC    NOT
147    L    TTG    NOT
148    H    CAC    NOT
149    R    CGT    NOT
150    F    TTC    NOT
151    T    ACC    NOT
152    W    TGG    NOT
153    S    TCC    NOT
154    H    CAC    NOT 
155    *    TGA    NOT
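
The fourth column in the desired output above comes down to an inclusive range test on each row index. The following is a minimal sketch under stated assumptions: the helper name annotate_rows is hypothetical, sequences are passed as plain strings, and the domain bounds are compared directly against the 0-based row index because that is how the output above is labelled. If the positions in domain.txt are actually 1-based, the comparison would need to use `$i + 1` instead.

```perl
use strict;
use warnings;

# Hypothetical helper: build the four-column rows for one protein.
# $protein and $cds are plain sequence strings; $start and $end are the
# domain bounds, compared inclusively against the 0-based row index.
sub annotate_rows {
    my ($protein, $cds, $start, $end) = @_;
    my @residues = split //, $protein;
    my @codons   = unpack "(a3)*", $cds;      # consecutive 3-letter chunks
    my @rows;
    for my $i (0 .. $#residues) {
        my $flag  = ($i >= $start && $i <= $end) ? 'YES' : 'NOT';
        my $codon = defined $codons[$i] ? $codons[$i] : '';
        push @rows, join "\t", $i, $residues[$i], $codon, $flag;
    }
    return @rows;
}

# Toy input (not real data): 5 residues, domain at rows 1..3.
print "$_\n" for annotate_rows('MEAE*', 'ATGGAGGCAGAGTGA', 1, 3);
```

In the original script, the same test could go inside the MODULE 4.2 loop, with the start and end taken from the matching domain.txt line.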

This is an example for a single protein; I have to do the same for thousands of proteins. Do you have any suggestions, please?
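
For thousands of proteins, one option is to read domain.txt once into a hash keyed by accession instead of reopening it for every protein. The sketch below stores a list of ranges per accession so that a protein with several domains is handled too. The names load_domains and in_domain are hypothetical, the line layout "ACCESSION; START END" is an assumption, and the P00000 entries in the demo are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: read domain lines from a filehandle into
# accession => [ [start, end], ... ], so a protein may carry several domains.
sub load_domains {
    my ($fh) = @_;
    my %domains;
    while (my $line = <$fh>) {
        next unless $line =~ /^\s*([^;\s]+)\s*;\s*(\d+)\s+(\d+)/;
        push @{ $domains{$1} }, [ $2, $3 ];
    }
    return \%domains;
}

# True if $pos falls inside any [start, end] range stored for $acc (inclusive).
sub in_domain {
    my ($domains, $acc, $pos) = @_;
    for my $range (@{ $domains->{$acc} || [] }) {
        return 1 if $pos >= $range->[0] && $pos <= $range->[1];
    }
    return 0;
}

# Demo on an in-memory file; the second accession and its ranges are invented.
my $text = " Q9ULW3;    46 142\n P00000;    5 10\n P00000;    20 30\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $domains = load_domains($fh);
close $fh;

print in_domain($domains, 'Q9ULW3', 46) ? "YES\n" : "NOT\n";
print in_domain($domains, 'P00000', 15) ? "YES\n" : "NOT\n";
```

Accessions with no entry in the hash simply yield NOT for every position.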

Thanks!

0 answers:

No answers yet