Question

我正在开发一个程序，它能够读取基因序列并给我开放阅读框（ORF），然后是每个ORF的蛋白质序列。我已经获得了用于寻找ORFs的代码 - 但是没有氨基酸会被打印出来。我在Mac上使用Perl。

我想让代码告诉我从开放阅读框产生的氨基酸串。

这是我的代码：

#!/usr/bin/perl
#ORF_Find.txt -> finds long orfs in a DNA sequence

open(CHROM, "chr03.txt");   #Open file chr03.txt containing yeastchrom. 3
$DNA = "";                          #start with empty DNA sequence
$header = <CHROM>;                  #get header of sequence

#Read line from file, join to end of $DNA, repeat until end of file

while ($current_line = <CHROM>)
{
    chomp($current_line);           #remove newline from end of current_line
    $DNA= $DNA . $current_line;
}

#length of DNA sequence
$DNA_length = length($DNA);

#flag for ORF finder
$inORF=0;

#number of ORFs found
$numORFs = 0;

#minimum length
$minimum_codons =100;

#search each reading frame
for ($frame =0; $frame<3; $frame++)
{
    print "\nFinding ORFs in frame: +" . ($frame + 1) . "\n";

    #search for sequence match and print position of match if found
    for ($i =frame; $i<=($DNA_length-3);$i += 3)
    {
        #get current codon from sequence
        $codon= substr ($DNA, $i, 3);

        #if not in orf search for ATG, else search for stop codon
        if ($inORF == 0)
        {
            #if current codon is ATG, start ORF
            if ($codon eq "ATG")
            {
                $inORF = 1;
                $ORF_length = 1;
                $ORF_start = $i;
            }
        }
        elsif($inORF ==1)
        {
            #if current codon is a stop codon, end ORF
            if ($codon eq "TGA" || $codon eq "TAG" || $codon eq "TAA")
            {
                #if ORF has at least min number of codons,print location
                if ($ORF_length >= $minimum_codons)
                {
                    print "FOUND ORF AT POSITION $ORF_start,";
                    print "length = $ORF_length\n";
                    $numORFs++;
                }

                #reset ORF variables
                $inORF = 0;
                $ORF_length = 0;
            }
            else
            {
                #increase length of ORF by one codon
                $ORF_length++;
            }
        }
    }
}

#change T to U
$DNA =~ s/T/U/g;

#search each ORF
for ($i=$ORF_start; $i<=($ORF_length-3); $i+=3)
{
    #get codon from each ORF
    $aa_codon= substr($DNA, $i, 3);

    #find amino acids
    foreach ($aa_codon eq "ATG")
    {
        print ("M")     #METHIONINE
    }
    foreach ($aa_codon =~/UU[UC]/)
    {
        print ("F")     #PHENYLALANINE
    }
    foreach ($aa_codon =~/UU[AG]/ || $aa_codon=~/CU[UCAG]/)
    {
        print ("L");    #LEUCINE
    }
    foreach ($aa_codon =~/AU[UAC]/)
    {
        print ("I");    #ISOLEUCINE
    }
    foreach ($aa_codon =~/GU[UACG]/)
    {
        print ("V");    #VALINE
    }
    foreach ($aa_codon =~/UC[UCAG]/ || $aa_codon=~/AG[UC]/)
    {
        print ("S");    #SERINE
    }
    foreach ($aa_codon =~/CC[UCAG]/)
    {
        print ("P");    #PROLINE
    }
    foreach ($aa_codon =~/AC[UCAG]/)
    {
        print ("T");    #THREONINE
    }
    foreach ($aa_codon =~/GC[UCAG]/)
    {
        print ("A");    #ALANINE
    }
    foreach ($aa_codon =~/UA[UC]/)
    {
        print ("Y");    #TYROSINE
    }
    foreach ($aa_codon =~/CA[UC]/)
    {
        print ("H");    #HISTIDINE
    }
    foreach ($aa_codon =~/CA[AG]/)
    {
        print ("G");    #GLUTAMINE
    }
    foreach ($aa_codon =~/AA[UC]/)
    {
        print ("N");    #ASPARAGINE
    }
    foreach ($aa_codon =~/AA[AG]/)
    {
        print ("K");    #LYSINE
    }
    foreach ($aa_codon =~/GA[UC]/)
    {
        print ("D");    #ASPARTIC ACID
    }
    foreach ($aa_codon =~/GA[AG]/)
    {
        print ("E");    #GLUTAMIC ACID
    }
    foreach ($aa_codon =~/UG[UC]/)
    {
        print ("C");    #CYSTINE
    }
    foreach ($aa_codon eq "UGG")
    {
        print ("W");    #TRYPTOPHAN
    }
    foreach ($aa_codon =~/AG[AG]/ || $aa_codon =~/CG[UCAG]/)
    {
        print ("R");    #ARGININE
    }
    foreach ($aa_codon =~/GG[UCAG]/)
    {
        print ("G");    #GLYCINE
    }
    foreach ($aa_codon =~/UA[AG]/|| $aa_codon eq "UGA")
    {
        print ("*")     #STOP
    }

}
#if no ORFS found, print message
if ($numORFs ==0)
{
    print ("NO ORFS FOUND\n");
}
else
{
    print ("\n$num_ORFs ORFS WERE FOUND\n");
}

Answer 1

首先，这个问题可能更适合seqAnswers或BioStars这样的论坛。除此之外，编写自己的6帧翻译脚本是一项复杂的任务，特别是如果你想要考虑IUPAC模糊的核苷酸。已经有很多脚本和工具可以做到这一点。我可以做的最简单的建议可能是使用现有工具之一。试试我的，例如：

https://github.com/hepcat72/sixFrameTranslation/archive/master.zip

我的剧本直到现在才公开。我打开它以便你可以使用它。只需运行它即可获得使用效果。

除此之外，如果你想让你的版本正常运行，你可以做的第一件事就是改变你的版本：

#!/usr/bin/perl -w

请注意-w。然后，将此行添加到脚本的顶部：

use strict;

它可以帮助您调试问题，例如您的一个for循环中缺少美元符号：

for ($i =frame; $i<=($DNA_length-3);$i += 3)

应该是：

for ($i =$frame; $i<=($DNA_length-3);$i += 3)

顺便说一下，你在Mac上运行perl并不重要。这只是perl。 “Mac perl”是一个在OS-X之前的日子里创建perl环境的项目。

打开阅读框架程序不打印氨基酸序列

1 个答案: