Question

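I have written a program that reads a DNA sequence, produces its complementary strand, and then converts that to mRNA. However, I also have to find the longest open reading frame for this DNA. I wrote something, but when it prints I get no answer. Help!

This is what I have: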

# Search for the longest open reading frame for this DNA.
print "\nHere is the largest ORF, from 5' to 3':\n" ;
local $_ = $RNA_seq ;
while ( / AUG /g ) {
    my $start = pos () - 2 ;
    if ( / UGA|UAA|UAG /g ) {
        my $stop = pos ;
        $gene = substr ( $_ , $start - 1 , $stop - $start + 1 ), $/ ;
        print "$gene" ;
    }
}

# The next set of commands translates the ORF found above for an amino acid seq.
print "\nThe largest reading Frame is:\t\t\t" . $protein { "gene" } . "\n" ;
sub translate {
    my ( $gene , $reading_frame ) = @_ ;
    my %protein = ();
    for ( $i = $reading_frame ; $i < length ( $gene ); $i += 3 ) {
        $codon = substr ( $gene , $i , 3 );
        $amino_acid = translate_codon( $codon );
        $protein { $amino_acid }++;
        $protein { "gene" } .= $amino_acid ;
    }
    return %protein ;
}

sub translate_codon {
if ( $_ [ 0 ] =~ / GC[AGCU] /i )             { return A;} # Alanine;
if ( $_ [ 0 ] =~ / UGC|UGU /i )              { return C;} # Cysteine
if ( $_ [ 0 ] =~ / GAC|GAU /i )              { return D;} # Aspartic Acid;
if ( $_ [ 0 ] =~ / GAA|GAG /i )              { return E;} # Glutamic Acid;
if ( $_ [ 0 ] =~ / UUC|UUU /i )              { return F;} # Phenylalanine;
if ( $_ [ 0 ] =~ / GG[AGCU] /i )             { return G;} # Glycine;
if ( $_ [ 0 ] =~ / CAC|CAU /i )              { return H;} # Histidine;
if ( $_ [ 0 ] =~ / AU[AUC] /i )              { return I;} # Isoleucine;
if ( $_ [ 0 ] =~ / AAA|AAG /i )              { return K;} # Lysine;
if ( $_ [ 0 ] =~ / UUA|UUG|CU[AGCU] /i )     { return L;} # Leucine;
if ( $_ [ 0 ] =~ / AUG /i )                  { return M;} # Methionine;
if ( $_ [ 0 ] =~ / AAC|AAU /i )              { return N;} # Asparagine;
if ( $_ [ 0 ] =~ / CC[AGCU] /i )             { return P;} # Proline;
if ( $_ [ 0 ] =~ / CAA|CAG /i )              { return Q;} # Glutamine;
if ( $_ [ 0 ] =~ / AGA|AGG|CG[AGCU] /i )     { return R;} # Arginine;
if ( $_ [ 0 ] =~ / AGC|AGU|UC[AGCU] /i )     { return S;} # Serine;
if ( $_ [ 0 ] =~ / AC[AGCU] /i )             { return T;} # Threonine;
if ( $_ [ 0 ] =~ / GU[AGCU] /i )             { return V;} # Valine;
if ( $_ [ 0 ] =~ / UGG /i )                  { return W;} # Tryptophan;
if ( $_ [ 0 ] =~ / UAC|UAU /i )              { return Y;} # Tyrosine;
if ( $_ [ 0 ] =~ / UAA|UGA|UAG /i )          { return "***" ;} # Stop Codons;
}

Am I missing something?

Answer 1

Put

use strict;
use warnings;

at the beginning of your code: it will help you spot problems.

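In general, with

return A;

you are returning a bareword, not a quoted string (Perl can take a bareword for a filehandle or a subroutine call). If you want to return a string, quote it:

return 'A';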
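A minimal sketch of that advice, assuming a cut-down, hypothetical `codon_to_aa` sub that covers only alanine and the stop codons. With `use strict` in force an unquoted `return A;` is rejected at compile time, so every return value here is a quoted string.

```perl
use strict;
use warnings;

# Hypothetical, cut-down translator: only alanine and the stop codons are handled.
sub codon_to_aa {
    my ($codon) = @_;
    return 'A' if $codon =~ /^GC[AGCU]$/i;          # quoted string, not a bareword
    return '*' if $codon =~ /^(?:UAA|UGA|UAG)$/i;   # stop codons
    return '?';                                     # anything this sketch does not cover
}

print codon_to_aa('GCU'), "\n";
```

The `'?'` fallback is an assumption of the sketch; a full table would list all 64 codons.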

Answer 2

If you want to add whitespace to a regular expression to make it more readable, you need to use the x modifier:

if ( $_ [ 0 ] =~ / GC[AGCU] /xi )

Then, if you do need to match whitespace (you shouldn't here), just use \s.

By the way, instead of using $_[0], consider:

sub translate_codon {
    my $codon = shift;
    return q(A) if $codon =~ m/ GC[AGCU] /xi;
    return q(F) if $codon =~ m/ UU[CU]   /xi;
}
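The effect of the x modifier can be checked with a small sketch. The variable names are illustrative only; the codon `GCA` is matched once against the pattern as written in the question and once with `/x` added.

```perl
use strict;
use warnings;

my $codon = 'GCA';

# Without /x the spaces are literal, so the string itself would have to contain ' GCA '.
my $without_x = $codon =~ / GC[AGCU] /i  ? 1 : 0;
# With /x the whitespace inside the pattern is ignored.
my $with_x    = $codon =~ / GC[AGCU] /xi ? 1 : 0;

print "without /x: $without_x\n";
print "with /x:    $with_x\n";
```

Only the second match succeeds, which is consistent with the question's code printing nothing.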

Answer 3

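Have you looked at BioPerl? I haven't checked, but it may already contain the functionality you need. Or are you working on an assignment where you have to write the program yourself?

Edit

I'm not quite sure about the first part of the code you posted. For example, your regular expressions contain spaces. Does the string you are matching against actually contain those spaces, or are the codons written together, as in AUGCCGGAUGA? In the latter case there will be no match at all, even if the codons you are looking for are present (I may be telling you things you already know; I just want to point it out, just in case :) ).

Also, what is the code of the pos function?

One more thing: you don't have to set $_; you can simply match $RNA_seq against the pattern, like this: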

if ($RNA_seq =~ m/UGA/) { # ...
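For reference, `pos` is a Perl built-in that reports the offset at which the last `m//g` match on a string ended. A small sketch, using an illustrative sequence that is not from the question, shows it combined with matching `$RNA_seq` directly instead of going through `$_`:

```perl
use strict;
use warnings;

my $RNA_seq = 'CCAUGGGCAUGUGA';   # illustrative sequence, not from the question
my @starts;
while ($RNA_seq =~ /AUG/g) {
    # pos() is the offset just past the match, so subtract the codon length
    push @starts, pos($RNA_seq) - 3;
}
print "@starts\n";
```

Each entry in `@starts` is the zero-based offset of a start codon, ready to be passed to `substr`.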

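Edit 2

I have thought some more about the first part, and I think using the index function would be advisable here, since it immediately gives you the position in the sequence: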

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw( min );

my $string = 'UGAAUGGGCCAUAUUUGACUGAGUUUCAGAUGCCAUUGGCGAUUAG';
# the genes:     *-------------*           *---------------*

my $prev = -1;
my @genes;
while (1) {
    my $start = index($string, 'AUG', $prev + 1);
    # exit the loop if there is no further start codon
    last if $start < 0;

    # position of the nearest stop codon after the start codon
    my $stop = min grep { $_ > -1 } map { index($string, $_, $start + 1) } qw( UGA UAA UAG );
    # exit the loop if there is no further stop codon
    last if !defined $stop;

    # using +1 so that 'AUGA' will not count as a gene
    if ($stop > $start + 1) {
        # substr takes a length, not an end position; +3 includes the stop codon
        push @genes, substr($string, $start, $stop - $start + 3);
    }
    $prev = $stop; # I'm assuming that no second AUG will come between AUG and a stop codon
}
print "@genes\n";

This will give you

AUGGGCCAUAUUUGA AUGCCAUUGGCGAUUAG

I'd say it probably needs some improvement, but I hope it helps.
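One possible improvement is to respect the reading frame, since the index-based version takes the nearest stop codon whether or not it is in frame with the AUG. The following is only a sketch under stated assumptions: an ORF is taken to be an AUG followed by whole codons up to the first in-frame stop codon, only the given strand is scanned, and `longest_orf` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the longest ORF (start codon through stop codon)
# on the given strand, or an empty string if there is none.
sub longest_orf {
    my ($rna) = @_;
    my $longest = '';
    # The lookahead allows overlapping candidates to be found; the non-greedy
    # (?:[ACGU]{3})*? advances codon by codon to the first in-frame stop codon.
    while ($rna =~ /(?=(AUG(?:[ACGU]{3})*?(?:UAA|UAG|UGA)))/gi) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

my $string = 'UGAAUGGGCCAUAUUUGACUGAGUUUCAGAUGCCAUUGGCGAUUAG';
print longest_orf($string), "\n";   # prints AUGGGCCAUAUUUGA
```

On the sample string only the first candidate is reported: the second one listed above is 17 bases long, so its UAG is not in frame with its AUG.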

Answer 4

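There are several applications for ab initio gene prediction. If you still want to code it from scratch, I suggest looking at dynamic programming and the Smith–Waterman algorithm for local sequence alignment. Note, however, that for eukaryotes you should also take RNA splicing into account.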

Using Perl to find the largest open reading frame