Open file.txt and find the possible start and stop positions of its genes

Time: 2014-03-08 17:12:37

Tags: perl bioinformatics

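Hi, I have a file that I would like to open in order to find the start and stop positions of its genes. I also have some extra information. The start of each gene follows this pattern: an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed 4-10 bases downstream by the initiation codon (ATG). There are variants of the Shine-Dalgarno sequence, however, the most common being [TA][AC]AGGA[GA][GA]. The end of a gene is marked by one of the stop codons TAA, TAG or TGA. Care must be taken that the stop codon lies in the correct open reading frame (ORF).

I have put the genome in a text file and I open it with the code below. The errors start when I go on to read the genome and record the start and stop positions. Any help? Thanks a lot: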

#!/usr/bin/perl -w
    use strict;
    use warnings;
    # Searching for motifs
    # Ask the user for the filename of the file containing
    my $proteinfilename = "yersinia_genome.fasta";
    print "\nYou open the filename of the protein sequence data: yersinia_genome.fasta \n";
    # Remove the newline from the protein filename
    chomp $proteinfilename;
    # open the file, or exit
    unless (open(PROTEINFILE, $proteinfilename) ) 
    {
      print "Cannot open file \"$proteinfilename\"\n\n";
      exit;
    }
    # Read the protein sequence data from the file, and store it
    # into the array variable @protein
    my @protein = <PROTEINFILE>;
    # Close the file - we've read all the data into @protein now.
    close PROTEINFILE;
    # Put the protein sequence data into a single string, as it's easier
    # to search for a motif in a string than in an array of
    # lines (what if the motif occurs over a line break?)
    my $protein = join( '', @protein);
    # Remove whitespace.
    $protein =~ s/\s//g;
    # In a loop, ask the user for a motif, search for the motif,
    # and report if it was found.
    my $motif='TAAGGAGG';
    do 
    {
      print "\n Your motif is:$motif\n";
      # Remove the newline at the end of $motif
      chomp $motif;
      # Look for the motif
        if ( $protein =~ /$motif/ ) 
        {
          print "I found it!This is the motif: $motif in line $.. \n\n";
        } 
        else 
        {
          print "I couldn't find it.\n\n";
        }
    }
    until ($motif =~ /TAAGGAGG/g); 
    my $reverse=reverse $motif;
    print "Here is the reverse Motif: $reverse. \n\n";
    #HERE STARTS THE PROBLEMS,I DONT KNOW WHERE I MAKE THE MISTAKES
    #$genome=$motif;
    #$genome = $_[0];
    my $ORF = 0;
    while (my $genome = $proteinfilename) {
        chomp $genome;
        print "processing $genome\n";
        my $mrna = split(/\s+/, $genome);
        while ($mrna =~ /ATG/g) {
          # $start and $stop are 0-based indexes
          my $start = pos($mrna) - 3; # back up to include the start sequence
          # discard remnant if no stop sequence can be found
          last unless $mrna=~ /TAA|TAG|TGA/g;
    #m/^ATG(?:[ATGC]{3}){8,}?(?:TAA|TAG|TGA)/gm;
      my $stop    = pos($mrna);
      my $genlength = $stop - $start;
      my $genome    = substr($mrna, $start, $genlength);
      print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
      #      $ORF ++;
            #print "$ORF\n";
       }
    }
    exit;

2 answers:

Answer 0 (score: 0)

  

while (my $genome = $proteinfilename) {

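This creates an infinite loop: you are copying the file name (not the $protein data) over and over again.

The purpose of the while loop is unclear; as written, it never terminates.

Perhaps you meant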

my ($genome) = $protein;
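
A second pitfall sits a few lines further down in the question's loop: `split` is called in scalar context there, so `$mrna` receives the number of fields rather than the sequence. Here is a minimal sketch of the difference, using a made-up short sequence (the value of `$genome` below is illustrative, not the real data):

```perl
use strict;
use warnings;

# Hypothetical whitespace-free sequence standing in for the real genome string.
my $genome = 'CCATGAAATAGCC';

# Scalar context: split returns the number of fields it produced (1 here),
# so a later /ATG/ match would be run against the string "1" and never succeed.
my $count = split(/\s+/, $genome);

# Plain assignment keeps the sequence itself.
my $mrna = $genome;

print "scalar split gives: $count\n";
print "plain copy gives:   $mrna\n";
```

Since the whitespace has already been stripped by that point, a plain copy appears to be all that is needed.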

Here is a simple attempt at fixing the obvious problems in the code.

#!/usr/bin/perl -w
use strict;
use warnings;
my $proteinfilename = "yersinia_genome.fasta";
chomp $proteinfilename;
unless (open(PROTEINFILE, $proteinfilename) ) 
{
  # die, don't print & exit
  die "Cannot open file \"$proteinfilename\"\n";
}
# Avoid creating a potentially large temporary array
# Read directly into $protein instead
my $protein = join ('', <PROTEINFILE>);
close PROTEINFILE;
$protein =~ s/\s//g;
# As this is a static variable, no point in looping
my $motif='TAAGGAGG';
chomp $motif;
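if ( $protein =~ /$motif/ ) 
{
  # The lines were joined into one string, so a line number ($.) is not
  # meaningful here; report the 0-based offset of the match instead
  print "I found it! This is the motif: $motif at offset ", $-[0], ".\n\n";
}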
else 
{
  print "I couldn't find it.\n\n";
}
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
# $ORF isn't used; removed
# Again, no point in writing a loop
# Also, $genome is a copy of the data, not the filename
my $genome = $protein;
# It was already chomped, so no need to do that again
# split in scalar context would return a field count, so copy the string instead
my $mrna = $genome;
while ($mrna =~ /ATG/g) {
  my $start = pos($mrna) - 3; # back up to include the start sequence
  last unless $mrna=~ /TAA|TAG|TGA/g;
  my $stop    = pos($mrna);
  my $genlength = $stop - $start;
  my $genome    = substr($mrna, $start, $genlength);
  print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
}
exit;
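
One caveat with joining every line of the file: a FASTA file normally begins with a `>` header line, and `s/\s//g` would fold that header into the sequence. Below is a minimal sketch of skipping header lines while reading, assuming a standard single-record FASTA layout. The in-memory sample and its header text are made up so that the snippet runs without the real file:

```perl
use strict;
use warnings;

# Made-up FASTA text standing in for yersinia_genome.fasta.
my $fasta = ">example_record hypothetical header\nTAAGGAGGCC\nCCATGAAATAG\n";

# Open a read handle on the string; a real script would open the file instead.
open(my $fh, '<', \$fasta) or die "Cannot open in-memory FASTA: $!\n";

my $sequence = '';
while (my $line = <$fh>) {
    next if $line =~ /^>/;    # skip FASTA header lines
    $line =~ s/\s+//g;        # drop the newline and any stray whitespace
    $sequence .= uc $line;    # normalise case so the motifs match
}
close $fh;

print length($sequence), " bases: $sequence\n";
```

Reading line by line also avoids holding both the array of lines and the joined string in memory at once, which may matter for a whole genome.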

Answer 1 (score: 0)

Thanks, I have solved the problem with this:

local $_ = $protein;
while (/ATG/g) {
    my $start = pos() - 3;
    if (/T(?:TAA|TAG|TGA)/g) {
        my $stop = pos;
        print $start, " ", $stop, " ", $stop - $start, " ",
            substr($_, $start, $stop - $start), $/;
    }
}
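
The question also asks for the Shine-Dalgarno site and for a stop codon in the correct reading frame, which a plain search for the next stop codon does not by itself guarantee. Here is a minimal sketch of one way to express both requirements as a single pattern, assuming an uppercase, whitespace-free sequence and one strand only. The helper `find_orfs` and the sample string are hypothetical:

```perl
use strict;
use warnings;

# Return [start, stop, orf] triples. start is the 0-based offset of the ATG,
# stop is the offset just past the stop codon.
sub find_orfs {
    my ($seq) = @_;
    my @orfs;
    while ($seq =~ /
            [TA][AC]AGGA[GA][GA]       # Shine-Dalgarno variant from the question
            [ACGT]{4,10}?              # spacer of 4 to 10 bases
            (                          # capture the ORF
                ATG
                (?:[ACGT]{3})*?        # whole codons, as few as possible
                (?:TAA|TAG|TGA)        # first in-frame stop codon
            )
        /gx) {
        push @orfs, [ $-[1], $+[1], $1 ];
    }
    return @orfs;
}

# Made-up sequence: the TGA inside ATGCTGACCTAG is out of frame.
my $sample = 'GGTAAGGAGGCCCCATGCTGACCTAGGG';
for my $orf (find_orfs($sample)) {
    my ($start, $stop, $dna) = @$orf;
    print join(' ', $start, $stop, $stop - $start, $dna), "\n";
}
```

For the sample string this prints `14 26 12 ATGCTGACCTAG`: the out-of-frame `TGA` is skipped because the pattern only advances in whole codons. The scan resumes after each reported stop codon, so overlapping candidates and the reverse strand are not covered by this sketch.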