Perl: Searching for a pattern across array elements

Time: 2012-06-26 07:37:40

Tags: arrays perl bioinformatics

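I'm a Perl newbie, stuck on another bioinformatics problem that needs some help and input.

Problem in brief:

  1. I have a file with over 40,000 unique DNA sequences. By unique, I mean that the sequence IDs are unique. I have attached part of the file at the end of the post to show what it looks like.

  2. I need to split each of the 40,000 sequences into 3 parts. So, if a particular sequence is 999 characters long, each of the 3 parts will have 333 characters.

  3. I need to search each of the 3 parts for the following pattern: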

    $gpat = '[G]{3,5}'; $npat = '[A-Z]{1,25}';
    $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;

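  4. If $pattern occurs in the first of the 3 parts, increment the 'beginning' counter; if it occurs in the 2nd part, increment the 'middle' counter; and if it occurs in the 3rd part, increment the 'end' counter.

  5. Print the 'beginning', 'middle' and 'end' counters, i.e. the sums of 'beginning', 'middle' and 'end' over all sequences.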

    Say the values are '2', '5', '3' for the first sequence and '4', '1', '6' for the second sequence; the final count should then be '6, 6, 9'.

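  6. The problems I'm facing:

    1. If a particular sequence is split into 3 parts, a potential $pattern may be lost. For example, take the following sequence:

      gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta

      Split into 3 parts, it yields the following 3 sub-parts, each 35 characters long: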

      gggatgtcgatgcatggggatgcatcgatgcgggg
      actagctagcgggatgctacgatggggatgatgat
      aatatcgcggcgcatatatgctagtctatatatta

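      Thus, $pattern is split across the first two parts. Is there any way to say "if $pattern begins in part 1 and ends in part 2, increment the 'beginning' count"?

      ##UPDATE## The following problem has been solved thanks to the code suggested by Cupidvogel.

    2. If the sequence length is not divisible by 3, how do I split the sequence into 3 parts? I tried using int, but the last part comes out 1-2 characters short.

      Following is the code I have written so far.

      It reads in the file, then displays the header name and the sequence, the length of the parts each sequence will be split into, and finally the sequence split into 3 parts. This works fine if the sequence length is divisible by 3. If it isn't, the final 3rd part is 1-2 characters short.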

      #Take Filename from user
      print "Please enter file name : ";
      $in =<>;
      chomp $in;
      
      
      open (FASTA,"$in") or die ;
      while (<FASTA>)
      {
      $/=">";
      @array = split '\n', $_;
      $header=shift @array; # Header of the fasta sequence
      print "\n\nNext sequence: \n";
      print $header,"\n";
      
      
      $seq= join '', @array; # sequence
      $seq=~s/\s//g;
      $seq=~s/\*//g;
      $seq=~s/>//g;
      print $seq,"\n\n";
      
      $num = int(length($seq)/3);
      @arr = unpack("A$num A$num A*",$seq);
      print " New method gives this :", @arr;
      print "\nThe first element is :", $arr[0]; 
      print "\nThe second element is :",$arr[1]; 
      print "\nThe third element is :",$arr[2] ;
      
      
      
      #The following lines of code were originally written to split...
      #...the sequence into 3 parts, albeit unsuccessfully                    
      #my $split = (length $seq)/3;
      #print $split,"\n\n";
      
      #my $int = int $split;
      #print $int,"\n\n";
      
      
      #my @array2 = $seq =~ /(.{$int})/g;
      #print join (" ", @array2),"\n\n";
      
      #print $array2[0],"\n",$array2[1],"\n",$array2[2];
      
      
      }
      
      
      exit;
      

      So far I have been testing the code with the following sample file: sample.fa

      >ABC_123 2
      atgtcgatcgatcggcgggcatgcgcgcgcggatg
      atatatagcgcgcgctatatagcgcgactctacgc
      atgctgctgactagctatagtcgctgactgcgcgt
      gggaaaaagggcccgggccccgttttggggatcta
      ggggatagctgatgctagcatgcatgctgactgca
      >DEF_456 4
      gggatgtcgatgcatggggatgcatcgatgcgggg
      actagctagcgggatgctacgatggggatgatgat
      aatatcgcggcgcatatatgctagtctatatatta
      >GHI_789 1
      atagctgctagtcgatcggcgcgggtatcgatcgg
      ggatcgatcgatcggggatcgatcgggggatcgat
      

      The actual input file looks like this:

      >NR_037701 1
      aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
      tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
      aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
      ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
      gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
      agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
      gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
      agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
      gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
      gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
      cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
      aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
      actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
      ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
      tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
      cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
      ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
      gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
      cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
      caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
      gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
      ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
      tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
      tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
      acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
      aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
      catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
      ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
      tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
      gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
      ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
      atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
      gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
      gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
      gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
      gctccctcttttaaagattttccttccctctttccaactccctgggtcct
      ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
      tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
      ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
      agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
      gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
      cacagactcaaaccctctctcacacacatacacatatacattgttattcc
      acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
      ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
      caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
      tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
      agggttgggacttcaacacagctttttgggggatcataattcaacccatg
      acagccactgagattattatatctccagagaataaatgtgtggagttaaa
      aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
      ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
      gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
      tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
      tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
      aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
      ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
      actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
      gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
      gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
      tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
      gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
      ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
      gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
      aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
      ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
      caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
      actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
      tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
      aaa
      >NM_198399 1
      aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
      agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
      caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
      tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
      tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
      attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
      caatagcagaggaaggagggactgagcaggagacggccactccagagaac
      ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
      gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
      caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
      gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
      aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
      gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
      ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
      cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
      tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
      atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
      gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
      actgcatatttttacccttatttttgctccttacagcaagattagtaggt
      tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
      tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
      taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
      tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
      attgattaagatatcattatttttgtttggtttggttttgcttttttcct
      cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
      tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
      tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
      atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
      gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
      cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
      aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
      tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
      aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
      gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
      ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
      gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
      tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
      agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
      gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
      gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
      ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
      agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
      tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
      aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
      cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
      tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
      atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
      ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
      aaaaaaaaaaaaa
      >NR_026816 1
      caacccactctctgtgctatgacttcattactctttcccagcccagccct
      gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
      atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
      aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
      agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
      ccagagccaaggctacagctagagagttgactcctctatttgagattgac
      aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
      tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
      cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
      gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
      tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
      tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
      >NR_027917 1
      atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
      cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
      ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
      ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
      gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
      ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
      tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
      ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
      ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
      cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
      gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
      ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
      gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
      ggtgtag
      >NR_002777 3
      cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
      ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
      gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
      aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
      tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
      aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
      cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
      taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
      ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
      agcctaatagagaacataaaattctaaaagataaagataataataatgat
      aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
      tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
      aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
      gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
      cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
      tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
      tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
      tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
      gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
      ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
      cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
      cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
      aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
      cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
      ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
      actcataggacataaaaaaaaaaaaaa
      >NR_033769 1
      ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
      agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
      agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
      aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
      ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
      atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
      cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
      acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
      ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
      tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
      gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
      tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
      gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
      cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
      tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
      ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
      cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
      ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
      tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
      gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
      tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
      tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
      aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
      aagttaggtttatagagtttgactagttttttcgattagatttgtattag
      ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
      ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
      ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
      taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
      atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
      gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
      tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
      atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
      aaagaaaagaaaaacagagacggtc
      >NM_016326 3
      atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
      ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
      cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
      gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
      tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
      cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
      aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
      tttgtaattttatattactttttagtttgatactaagtattaaacatatt
      tctgtattcttccacatattttctgcagttattttaactcagtataggag
      ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
      acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
      gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
      atctctttactgcctggctggccggcagctccg
      >NM_181641 2
      atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
      ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
      cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
      gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
      tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
      tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
      ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
      acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
      ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
      cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
      aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
      aaacatatttctgtattcttccacatattttctgcagttattttaactca
      gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
      actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
      ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
      gatctttgaatctctttactgcctggctggccggcagctccg
      >NM_001144931 1
      gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
      ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
      ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
      ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
      gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
      acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
      gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
      agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
      atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
      ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
      cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
      ttgggaggccgag
      >NR_029429 1
      ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
      ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
      caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
      atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
      actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
      tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
      tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
      caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
      cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
      gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
      cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
      gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
      cttcctgc
      >NR_026551 1
      tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
      taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
      ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
      gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
      gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
      acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
      gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
      cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
      gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
      gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
      agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
      aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
      cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
      gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
      gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
      tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
      cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
      cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
      gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
      gagcagagccctcactcccaggcagagttgtctgaatccttcct
      >NM_181640 2
      atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
      ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
      cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
      gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
      tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
      taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
      accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
      atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
      ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
      taattttatattactttttagtttgatactaagtattaaacatatttctg
      tattcttccacatattttctgcagttattttaactcagtataggagctag
      aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
      ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
      gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
      ctttactgcctggctggccggcagctccg
      >NM_016951 3
      atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
      ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
      cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
      gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
      tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
      tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
      ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
      acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
      actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
      ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
      gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
      tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
      gaagttttgtaattttatattactttttagtttgatactaagtattaaac
      atatttctgtattcttccacatattttctgcagttattttaactcagtat
      aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
      ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
      cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
      tttgaatctctttactgcctggctggccggcagctccg
      >NR_002773 1
      cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
      ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
      ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
      ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
      gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
      ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
      gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
      tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
      gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
      ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
      gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
      gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
      tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
      caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
      ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
      acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
      gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
      gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
      cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
      ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
      gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
      ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
      agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
      cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
      agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
      ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
      cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
      tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
      acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
      gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
      aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
      gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
      caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
      atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
      gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
      gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
      ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
      ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
      ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
      gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
      cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
      cactccatcctgcgtgactgaac
      >NR_037806 1
      attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
      ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
      ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
      cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
      gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
      catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
      agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
      ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
      gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
      gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
      gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
      gattaaaatcacactaataacccctggatggtcaatctgataataggatc
      agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
      agattattcccagtctcccagtaacacgtttctacccagatcctttttca
      tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
      aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
      ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
      gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
      tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
      tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
      aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
      ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
      ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
      ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
      tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
      aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
      agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
      agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
      gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
      attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
      atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
      gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
      gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
      ggattagctctg
      

      Any help and input would be deeply appreciated.

      Thank you for taking the time to look at my problem!

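2 Answers:

Answer 0: (score: 2)

Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the whole sequence and establish which third of the sequence each match starts in.

The built-in variable $-[0] holds the offset of the start of the most recent successful match.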
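As a minimal sketch of how $-[0] behaves in a global-match loop (the helper name match_starts and the toy string are made up for illustration and are not part of the answer's program):

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every
# non-overlapping match of $re in $text.
sub match_starts {
    my ($text, $re) = @_;
    my @starts;
    while ($text =~ /$re/g) {
        push @starts, $-[0];    # offset where the whole match begins
    }
    return @starts;
}

my @starts = match_starts('atgggcatggggta', qr/g{3,5}/i);
print "@starts\n";    # prints "2 8"
```

Offsets are zero-based, so the run of g's that begins at the third character is reported as 2.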

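The following code does what I think you want. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.

The subroutine takes the length of the sequence and calculates the offsets at which each third of the string ends. The idiom sprintf '%.0f', $value is used to round the fractional values to the nearest character position.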
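To illustrate the rounding step, here is a small sketch. The helper name third_offsets is hypothetical; the expression inside it is the one used in process_seq.

```perl
use strict;
use warnings;

# Hypothetical helper: end offsets of the three thirds of a
# sequence of the given length, rounded to the nearest position.
sub third_offsets {
    my ($length) = @_;
    return map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
}

print join(' ', third_offsets(105)), "\n";    # prints "35 70 105"
print join(' ', third_offsets(100)), "\n";    # prints "33 67 100"
```

For a length that is not divisible by 3, the fractional part of each boundary is one third or two thirds, never exactly one half, so the rounding direction is unambiguous.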

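The @counts array is adjusted for each occurrence of $regex in the sequence. The element of @counts to be incremented is established by comparing the start position of the match, held in $-[0], with the end offset of each of the three segments of the sequence.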
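As a sketch of that comparison applied to the 105-character example from the question, the snippet below wraps the same loop in a function that returns the counts. The name count_by_third is an assumption made for this illustration. The match starts at offset 0 and runs past offset 35, so it is counted once, under the first third.

```perl
use strict;
use warnings;

my $gpat    = '[G]{3,5}';
my $npat    = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex   = qr/$pattern/i;

# Hypothetical helper: count matches by the third in which each one starts.
sub count_by_third {
    my ($sequence) = @_;
    my $length  = length $sequence;
    my @offsets = map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;
    my @counts  = (0, 0, 0);

    while ($sequence =~ /$regex/g) {
        my $place = $-[0];                     # start of this match
        for my $i (0 .. 2) {
            next if $place >= $offsets[$i];    # match starts after this third ends
            $counts[$i]++;
            last;
        }
    }
    return @counts;
}

# The example sequence from the question, written as its three 35-character parts
my $example = 'gggatgtcgatgcatggggatgcatcgatgcgggg'
            . 'actagctagcgggatgctacgatggggatgatgat'
            . 'aatatcgcggcgcatatatgctagtctatatatta';

my @counts = count_by_third($example);
print "@counts\n";    # prints "1 0 0"
```

Because only the start of a match is tested, a match that straddles a boundary is never lost and never counted twice.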

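After each sequence has been processed, the values in @counts are accumulated into @totals to give overall figures for all sequences.

The output of the program when run on your sample data is shown below. The totals are (9, 1, 6).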

use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat; 
my $regex = qr/$pattern/i;

open my $fh, '<', 'sequences.txt' or die $!;

my ($id, $seq);
my @totals = (0, 0, 0);

while (<$fh>) {

  chomp;

  if (/^>(\w+)/) {
    process_seq($seq) if $id;
    $id = $1;
    $seq = '';
    print "$id\n";
  }
  elsif ($id) {
    $seq .= $_;
    process_seq($seq) if eof;
  }
}

print "Total: @totals\n";



sub process_seq {

  my $sequence = shift;
  my $length = length $sequence;

  my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;

  my @counts = (0, 0, 0);

  while ($sequence =~ /$regex/g) {
    my $place = $-[0];
    for my $i (0..2) {
      next if $place >= $offsets[$i];
      $counts[$i]++;
      last;
    }
  }

  print "@counts\n\n";
  $totals[$_] += $counts[$_] for 0..2;
}

Output

NR_037701
0 0 1

NM_198399
1 0 0

NR_026816
1 0 1

NR_027917
0 0 0

NR_002777
0 0 0

NR_033769
1 0 0

NM_016326
1 0 1

NM_181641
1 0 1

NM_001144931
0 0 0

NR_029429
0 1 0

NR_026551
1 0 0

NM_181640
1 0 1

NM_016951
1 0 1

NR_002773
1 0 0

NR_037806
0 0 0

Total: 9 1 6

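Answer 1: (score: 2)

I lifted Borodin's process_seq function, but used Bio::SeqIO to read the file sequence by sequence, which has advantages over reading it manually line by line and working out the processing logic yourself. I believe the advantages are: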

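  • The code has been developed and tested by many other people
  • If the output can be written through the Bio::SeqIO module, the resulting file can be read back with Bio::SeqIO's read (next_seq) method.
  • Other reasons I can't think of right now :-)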

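I imagine the BioPerl package of bioinformatics modules must be overwhelming for a biologist who is just starting to program, and he may be reluctant to dig out the information needed to start building a program. The BioPerl wiki is a good place to start, in particular the HOWTO section, which includes a HOWTO for beginners among others. You will find most (?) of the useful code examples there. Bio::Seq has some good code examples at the beginning of its documentation and is where most of the general sequence functions live. For input/output, the Bio::SeqIO module is used, and it also has examples at the beginning of its manual.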

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $gpat = '[G]{3,5}'; 
my $npat = '[A-Z]{1,25}'; 
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;  
my $regex = qr/$pattern/i; 

my $in = Bio::SeqIO->new ( -file   => "fasta_dat.txt",
                           -format => 'fasta');
my @totals;
while ( my $seq = $in->next_seq() ) {
    process($seq);
}

print "Totals:   ";
print "@totals\n";

sub process {
    my $seq = shift;
    my @offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
    my $sequence = $seq->seq;

    my @count = (0,0,0);
    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0 .. 2) {
            next if $place >= $offset[$i];
            $count[$i]++;
            last;
        }
    }
    print $seq->id, "\n@count\n";
    $totals[$_] += $count[$_] for 0 .. $#count; 
}