For a set of sequences, build a hash with each sequence as the key and its open reading frames as the values

Asked: 2013-09-03 16:39:55

Tags: regex arrays perl hash bioinformatics

Follow-up to "Find multiple matches of this and that nucleotide sequence".

I now want to add every ORF (such as ATG ... TAG or ATG ... TAA) to a hash keyed by sequence, so that each sequence has its ORFs attached as values. This is what I have so far:

#!/usr/bin/perl
use warnings;
use strict;

my @file = qw(ATGCCCCCCCCCCCCCTAGATGAAAAAAAAAATAAATGAAAAATAGATGCCCCCCCCCCCCCCC ATGCGCGCTATATATGCGCGGGCTAATATAT ATATGAGGTCGTAGCTAGCAAACACAAATAAA );

my %hash;
foreach (@file) {
    my @match = ($_ =~ /(ATG\w+?TA[AG])/g);

    # then make %hash with sequence as key and ORFs as values...

}

Can anyone help me?

1 answer:

Answer 0 (score: 0)

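Building on your code (I changed the nucleotide sequences so that the start and stop codons are easier to see, but it works exactly the same way with your sequences), I store the matches in arrays inside a hash keyed by sequence, as follows: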

#!/usr/bin/perl
use warnings;
use strict;

my @file = qw(ATGcgcgcgcgcgcgTAAATGatatatatataTAG ATGcccccccccTAAgggggggggATGtttttttttttTAG atATGaggggaTAGaaaatttttttctttct);

my %hash;    # hash of arrays: sequence => [ ORF, ORF, ... ]
foreach my $sequence (@file) {
    # In list context, /g returns every captured ORF found in this sequence.
    my @match = $sequence =~ /(ATG\w+?TA[AG])/g;

    # Store the ORFs under the sequence they were found in.
    push @{ $hash{$sequence} }, @match;
}

for my $key (sort keys %hash) {
    for my $orf (@{ $hash{$key} }) {
        print "Sequence:$key contains ORFs: $orf\n";
    }
}

Output:

Sequence:ATGcccccccccTAAgggggggggATGtttttttttttTAG contains ORFs: ATGcccccccccTAA
Sequence:ATGcccccccccTAAgggggggggATGtttttttttttTAG contains ORFs: ATGtttttttttttTAG
Sequence:ATGcgcgcgcgcgcgTAAATGatatatatataTAG contains ORFs: ATGcgcgcgcgcgcgTAA
Sequence:ATGcgcgcgcgcgcgTAAATGatatatatataTAG contains ORFs: ATGatatatatataTAG
Sequence:atATGaggggaTAGaaaatttttttctttct contains ORFs: ATGaggggaTAG
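
One caveat about the pattern ATG\w+?TA[AG]: it accepts any number of characters between the start and stop codon, so a match is not guaranteed to be a whole number of codons. If in-frame ORFs are what you want (an assumption on my part, the question does not say so), a possible sketch restricts the middle part to triplets. The helper name orfs_in_frame and the two short test sequences are made up for illustration, and the stop codons are the same two as in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original question or answer):
# returns a hash reference mapping each sequence to an array reference
# of the ORFs found in it.
sub orfs_in_frame {
    my @sequences = @_;
    my %orfs_for;
    for my $sequence (@sequences) {
        # ATG, then as few whole codons (3 characters each) as possible,
        # then a TAA or TAG stop codon. Assumption: ORFs must be in frame.
        my @orfs = $sequence =~ /(ATG(?:\w{3})*?TA[AG])/g;
        $orfs_for{$sequence} = \@orfs;    # empty array if nothing matched
    }
    return \%orfs_for;
}

# Made-up sequences: the first has one codon (AAA) between ATG and TAG,
# the second has only two bases (AA), so its stop codon is out of frame.
my $result = orfs_in_frame(qw(ATGAAATAG ATGAATAG));
for my $sequence (sort keys %$result) {
    printf "%s has %d in-frame ORF(s)\n",
        $sequence, scalar @{ $result->{$sequence} };
}
```

Every sequence gets an entry here, even one without any ORF, which makes it easy to tell "no ORF found" apart from "sequence never processed".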