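How can I extract FASTA sequences from a file whose header lines match a list in another file?

Asked: 2013-04-06 08:34:26

Tags: regex perl extract fasta

I am new to Perl. I am trying to extract FASTA sequences from one file whose header lines match lines in another file. The two sample files are as follows: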

File1.fasta:

  

> gene_44 | 105_nt | + | 47540 | 47644
GTGCGCCGGCGCGTCGCGATCGCGAACCGGCCCGTGCGAATCCTGCCGCATGCGCGCCGCATCTCGCCACGCCGCGCATTTCATTTCGACATCCATAACGTCTGA
> gene_69 | 111_nt | + | 75846 | 75956
ATGCCGTTGCCGTCGCGCATCGCGGCGGCCGTGCGCGGCGCGCATGCATACGCCGGCACGGCCGATGCGCGCGCGACGCGCAAACTGCACGCGGCGCGGGATTTGTGTTGA
> gene_88 | 177_nt | - | 97993 | 98169
ATGCGCCAGCCGACGCACGCCCATTCCGGGCGAAACGTTCCCCTTATCCATTCGATCATCCGTGCCGCACTGCGCGAAGCGGCCACCGCCGACACGTACCAAACCGCGCTCGATGCGACCGGCGCGGCACTCGTCGCCATCGCGGCGCTCGTGCGCGCGGAGGTGCGGCATGGCTGA
> gene_90 | 141_nt | - | 99016 | 99156
TTGGAAGGGCGCTTTCCGCGTGCGAGTCGTCTGACGCAGCGTTGCACGGTCTGGTCGAATCGCGAGCTTCATCGCTGGATGGCCGATCCGTTGAACTATCGCGCTGTCGACGCGGCGAACCAGACGACGGAGGGCGCGTAA

File2.list:

  

somewordsinfront,> gene_44 | somewordsattheback
blablabla,> gene_88 | blablablablabla

The output I expect is as follows:

  

> gene_44 | 105_nt | + | 47540 | 47644
GTGCGCCGGCGCGTCGCGATCGCGAACCGGCCCGTGCGAATCCTGCCGCATGCGCGCCGCATCTCGCCACGCCGCGCATTTCATTTCGACATCCATAACGTCTGA
> gene_88 | 177_nt | - | 97993 | 98169
ATGCGCCAGCCGACGCACGCCCATTCCGGGCGAAACGTTCCCCTTATCCATTCGATCATCCGTGCCGCACTGCGCGAAGCGGCCACCGCCGACACGTACCAAACCGCGCTCGATGCGACCGGCGCGGCACTCGTCGCCATCGCGGCGCTCGTGCGCGCGGAGGTGCGGCATGGCTGA

How can I achieve this? Thanks in advance! :)

1 Answer:

Answer 0: (score: 0)

The next time you ask a question, please show your code. For example:

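use strict;
use warnings;

# Collect the gene names listed in File2.list (the token between '>' and '|').
my @genes;
open my $list, '<', 'File2.list' or die "Cannot open File2.list: $!";
while (my $line = <$list>) {
    push @genes, $1 if $line =~ />\s*([^|\s]+)\s*\|/;
}
close $list;

# Read the whole FASTA file into a single string.
my $input;
{
    local $/ = undef;
    open my $fasta, '<', 'File1.fasta' or die "Cannot open File1.fasta: $!";
    $input = <$fasta>;
    close $fasta;
}

# Split into records at '>' and print each record whose header names a listed gene.
my @records = split />/, $input;
foreach my $record (@records) {
    next unless $record =~ /\S/;    # skip the empty chunk before the first '>'
    foreach my $gene (@genes) {
        # Headers look like " gene_44 | 105_nt | ..."; \Q...\E matches the name literally.
        if ($record =~ /^\s*\Q$gene\E\s*\|/) {
            print ">$record";
            last;                   # print each record only once
        }
    }
}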
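
The nested loop above compares every record against every listed gene. For a long list, a hash lookup on the first header field is a common alternative. The following is only a minimal sketch, assuming the header layout shown in the question (`> name | ...`). The helper names `ids_from_list` and `select_records` are hypothetical and not part of the answer above, and the sequences in the usage lines are shortened placeholders rather than the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the gene name from each list line,
# i.e. the first token after '>' (up to whitespace or '|').
sub ids_from_list {
    my @lines = @_;
    my %wanted;
    for my $line (@lines) {
        $wanted{$1} = 1 if $line =~ />\s*([^|\s]+)/;
    }
    return \%wanted;
}

# Hypothetical helper: return the FASTA records (header plus sequence)
# whose first header field is a key of %$wanted.
sub select_records {
    my ($fasta_text, $wanted) = @_;
    my @selected;
    for my $record (split />/, $fasta_text) {
        # No match means an empty or malformed chunk, e.g. the text before the first '>'.
        my ($id) = $record =~ /^\s*([^|\s]+)/ or next;
        push @selected, ">$record" if $wanted->{$id};
    }
    return @selected;
}

# Usage with placeholder data in the same layout as the question's files.
my $wanted = ids_from_list(
    "somewordsinfront,> gene_44 | somewordsattheback\n",
    "blablabla,> gene_88 | blablablablabla\n",
);
my $fasta = "> gene_44 | 105_nt\nGTGC\n> gene_69 | 111_nt\nATGC\n> gene_88 | 177_nt\nATGG\n";
print for select_records($fasta, $wanted);
```

Because the comparison is an exact match on the extracted name, `gene_4` in the list would not select a `gene_44` record. Each record is also emitted at most once, however many list lines mention it.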