How do I search a multi-FASTA file for matching FASTA sequences and append the output to another file?

Time: 2019-03-13 09:30:00

Tags: perl hash fasta

I have three FASTA files. File 1: org_seqs.fasta

>OAJ152.7_org_name
GDPQGLTGSHNFIEADSSNTRDDLYVTGNNYALNMEKFMSWYNMSVDGTFDMNLMAKRAKLRFEETIQTNPNFYYGPVTGLIARNAGYIFPGRLFRNHSLENPEGILTKSHIRHFYGIYGEEHDKRFLFDSMASPIDVTGEHEFQPPNLAGGDQRGPCPGLNALANHGYIPRNGVVSFRKVIAAINEVYGMGIDLATVLAIMGTVWTGDPLSLDPGFSIGDRDTGVNNILGNLGGLLLCFVFEVLKFASPNALASLYSTLAEPLDFVGNALSVPLLNVTCPALQDLQMGGKPFWEAIQNDFPGA
>FGT15428_org_name
SSLTRNDLYVTGNAWTMNTTLFWDFHDRADENGVLSMDLLADQAARRWNDSVSTNPAFWYGPVTGMVARNAAMFFLGRLLSNHTAEHPEGILTQDIFRKFFAVYTIPANWYRKPVEYGLVPLNLDIVSWIMKHPVLGSIGGNTGTVNSFSGLDMHNITGGVTKPIEITGRHAFKAPGRFDQRGPCPGLNALANHGYISRDGITSFAEVVTAVNQVLGMGIETALILGAMGTVWTGNPLSLNPGFSIGGAANGPDNILGNVLGLLGDLRGLQGSHNWIEAD
>EGS1524_Org_name
NAGYIFPGRFFRNYSAENPEGVLTKEIVKNFFAVYGEDGNLTYKEGWERIPENWYRMPVDYTLVQLNLDLLDFGLKYPELLSIGGNTGTVNSFTGVDIANLTEKRSLLEPTSSPIDISGEHSFQPPDFSNGDQRGPCPGLNALANHGYIPRNGVVTMADVIPAINQVYGDLMAKRAKIRFEESIATNPNFYYSTILAVMGTVFVGDVLSLAPGFSIGGFSPAVQNILGNLEGLLGEPRGLNGSHNIIEADSSNTRDDLYVTGDNTRLNLTQFIEWYQMADQDgnnGTFSMGPFTGAIAR

File 2: single_seqs.fasta

>OAJ152.7_org_name
GDPQGLTGSHNFIEADSSNTRDDLYVTGNNYALNMEKFMSWYNMSVDGTFDMNLMAKRAKLRFEETIQTNPNFYYGPVTGLIARNAGYIFPGRLFRNHSLENPEGILTKSHIRHFYGIYGEEHDKRFLFDSMASPIDVTGEHEFQPPNLAGGDQRGPCPGLNALANHGYIPRNGVVSFRKVIAAINEVYGMGIDLATVLAIMGTVWTGDPLSLDPGFSIGDRDTGVNNILGNLGGLLLCFVFEVLKFASPNALASLYSTLAEPLDFVGNALSVPLLNVTCPALQDLQMGGKPFWEAIQNDFPGA
>FGT15428_org_name
LNALANHTQDIFRKFFAVYTIPANWYRKPVEYGLVPLNLDIVSWIMKHPVLGSIGGNTGTVNSFSGLDMHNITGGVTKPIEITGRHAFKAPGRFDQRGPCPGLNALANHGYISLTRNDLYVTGNAWTMNTTLFWDFHDRADENGVLSMDLLADQAARRWNDSVSTNPAFWYGPVTGMVARNAAMFFLGRLLSNHTAEHPEGILTQDIFRKFFAVYTI
>TGH4853.21_org_nam
PNFYYGPFTGMIARNAGYFFACRLLSNHTVGSTEDIMDRETLKSFFAVHEKDGKLVYKRGWERIPENWYRRSIDYGLISLNLDLLNLITKYPELGSIGGNMGRSHDKRLSLGLASKPIKVTGEHEFIPPNFEKGDQRGPCPGLNALANHGYISRKGVTSLVEV

File 3: var_seqs.fasta

>OAJ152.7_org_name
GDPQGLTGSHNFIEADSSNTRDDLYVTGNNYALNMEKFMSWYNMSVDGTFDMNLMAKRAKLRFEETIQTNPNFYYGPVTGLIARNAGYIFPGRLFRNHSLENPEGILTKSHIRHFYGIYGEEHDKRFLFDSMASPIDVTGEHEFQPPNLAGGDQRGPCPGLNALANHGYIPRNGVVSFRKVIAA
>OAJ152.7_org_name
INEVYGMGIDLATVLAIMGTVWTGDPLSLDPGFSIGDRDTGVNNILGNLGGLLLCFVFEVLKFASPNALASLYSTLAEPLDFVGNALSVPLLNVTCPALQDLQMGGKPFWEAIQNDFPGA
>FGT15428_org_name
LNALANHGYISLTRNDLYVTGNAWTMNTTLFWDFHDRADENGVLSMDLLADQAARRWNDSVSTNPAFWYGPVTGMVARNAAMFFLGRLLSNHTAEHPEGILTLNALANHTQDIFRKFFAVYTIPANWYRKPVEYGLVPLNLDIVSWIMKHPVLGSIGGNTGTVNSFSGLDMHNITGGVTKPIEITGRHAFKAPGRFDQRGPCPG

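I want to write a program that compares every sequence in file 1 against file 2 and file 3 under the following conditions. If a FASTA header from file 1 also occurs in file 2, compare the two sequences by their first four letters, their last four letters, and their length; if all three agree, open a fourth file, "copy.fasta", and append the header and sequence to it. If only the header matches and the sequence does not, look in file 3 under the same conditions; if they hold there, append to file 4 again, otherwise open another file, "var.fasta", and append the record to that one. If a FASTA header from file 1 does not occur in file 2 at all, append the record to "single.fasta".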
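The conditions above can be sketched as two small subroutines. This is only one reading of the rules, under stated assumptions: the names `ends_and_length_match` and `classify` are hypothetical, each header is assumed to map to an array reference of sequences (file 3 repeats a header), and the lookup in file 3 is assumed to use the same header as in file 1.

```perl
use strict;
use warnings;

# True when both sequences have the same length and share
# the same first four and last four residues.
sub ends_and_length_match {
    my ($ref_seq, $other_seq) = @_;
    return 0 unless length($ref_seq) == length($other_seq);
    return substr($ref_seq, 0, 4) eq substr($other_seq, 0, 4)
        && substr($ref_seq, -4)   eq substr($other_seq, -4);
}

# Decide where a record from file 1 goes.
# $in_file2 and $in_file3 are array refs holding the sequences stored
# under the same header in file 2 and file 3 (undef if the header is absent).
sub classify {
    my ($seq, $in_file2, $in_file3) = @_;

    # Header not present in file 2 at all.
    return 'single' unless $in_file2;

    # Try file 2 first, then fall back to file 3 (assumed: same header).
    for my $candidate (@$in_file2, @{ $in_file3 || [] }) {
        return 'copy' if ends_and_length_match($seq, $candidate);
    }

    # Header known, but no sequence satisfied the conditions.
    return 'var';
}
```

With the sample files, this reading would send `OAJ152.7_org_name` to copy.fasta (identical in file 2), `FGT15428_org_name` to var.fasta (different first four letters and length in both files), and `EGS1524_Org_name` to single.fasta (header absent from file 2).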

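I tried the script below, but it does not work, and I cannot figure out how to store the matching sequences from the different files in hashes and then use them.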

#! /usr/bin/perl
use warnings;
use strict 'vars';

my $file1 = "org_seqs.fasta";
my $file2 = "single_seqs.fasta";
my $file3 = "var_seqs.fasta";
my $file4 = open( COPYF, ">>", "copy.fasta" ) or die $!;
my $file5 = open( VARF, ">>", "vars.fasta") or die $!;
my $file6 = open( SINGLEF, ">>", "single.fasta") or die $!;


my %seq1 = %{ read_fasta_as_hash( $file1 ) };
my $id1 = shift;

my %seq2 = %{ read_fasta_as_hash( $file2 ) };
my $id2 = shift;

my %seq3 = %{ read_fasta_as_hash( $file3 ) };
my $id3 = shift;

my ($match, $seen);

foreach my $seq1(%seq1){

    foreach my $seq2(%seq2){

        foreach my $seq3(%seq3){

            if( $id1 eq $id2 ){

                $match = $1;
                my $len1 = length($seq1{$id1});
                my $len2 = length($seq2{$id2});
                my $first = substr $seq1, 4;    #extract first 4 characters
                my $last = substr $seq1, -4;    #extract last 4 characters

                if(( $seq2{$id2} =~ m/^($first)(.*)($last)$/ ) && ( $len2 == $len1 )){

                    $seen = $1;
                    print COPYF $id1, "\n", $seq1{$id1}, "\n";
                }
                else{

                    open( F3, "<", $file3 ) or die $!;
                    if ($match){

                        my $len3 = length($seq3{$id3});
                        print COPYF $id1, "\n", $seq1{$id1}, "\n" if(( $seq3{$id3} =~ m/^($first)(.*)($last)$/ ) && ( $len3 == $len1 ));
                    }
                    else{

                        print VARF $id1, "\n", $seq1{$id1}, "\n";
                    }
                }   
            }
            else{

                print SINGLEF $id1, "\n", $seq1{$id1}, "\n";
            }
        }
    }
}


close(COPYF);
close(VARF);
close(SINGLEF);


sub read_fasta_as_hash {

    my $file = shift;
    my $id = '';
    my $seq = ();
    my %seq;
    open FH, "$file" or die $!;
    while(my $line = <FH>){

        chomp $line;
        if ($line =~ /^>(.*)/){

            $id = $1;
        }
        else{

            $seq{$id} .= $line;
        }
    }
    close(FH);
    return \%seq;
}

exit;
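One detail of `read_fasta_as_hash` matters for this data: it appends every sequence line to `$seq{$id}`, so the two `OAJ152.7_org_name` records in var_seqs.fasta would be merged into a single string. A minimal sketch of a reader that keeps repeated headers apart, assuming a hash of array references is an acceptable structure (the names `read_fasta` and `read_fasta_file` are hypothetical):

```perl
use strict;
use warnings;

# Read FASTA records from an open filehandle into
# { header => [ sequence, sequence, ... ] }.
sub read_fasta {
    my ($fh) = @_;
    my (%records, $id);
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r\z//;                  # tolerate Windows line endings
        next if $line eq '';                # skip blank lines
        if ($line =~ /^>(.*)/) {
            $id = $1;
            push @{ $records{$id} }, '';    # new record, even for a repeated header
        }
        elsif (defined $id) {
            $records{$id}[-1] .= $line;     # join multi-line sequences
        }
    }
    return \%records;
}

# Convenience wrapper: lexical filehandle and three-argument open.
sub read_fasta_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $records = read_fasta($fh);
    close $fh;
    return $records;
}
```

Calling `read_fasta_file("var_seqs.fasta")` would then give two entries under `OAJ152.7_org_name`, which can be checked one by one.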

The errors I get:

Use of uninitialized value $id1 in print at fasta_match.pl line 53.
Use of uninitialized value in print at fasta_match.pl line 53.
Use of uninitialized value $id2 in regexp compilation at fasta_match.pl line 30.
Use of uninitialized value $id1 in pattern match (m//) at fasta_match.pl line 30.
Use of uninitialized value $id1 in hash element at fasta_match.pl line 33.
Use of uninitialized value $id2 in hash element at fasta_match.pl line 34.
Use of uninitialized value $id2 in hash element at fasta_match.pl line 34.
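These warnings all trace back to `$id1`, `$id2` and `$id3`: outside a subroutine, `shift` takes its value from `@ARGV`, so without command-line arguments all three stay undefined. Two related points: `foreach my $seq1 (%seq1)` walks over keys and values alike, and `substr $seq1, 4` returns everything from offset 4 onward, not the first four characters. A minimal sketch of a loop that iterates over headers only and writes valid FASTA records follows. The name `distribute`, the `$pick` callback, and the hash of output handles are hypothetical, and each header is assumed to map to an array reference of sequences.

```perl
use strict;
use warnings;

# Write every record in %$org to one of several already-open handles.
#   $org  : { header => [ sequence, ... ] }
#   $pick : code ref called as $pick->($id, $seq); returns a key of %$out
#   $out  : { copy => $fh, var => $fh, single => $fh }
sub distribute {
    my ($org, $pick, $out) = @_;
    for my $id (sort keys %$org) {              # keys only, in a stable order
        for my $seq (@{ $org->{$id} }) {
            my $kind = $pick->($id, $seq);
            # Restore the leading ">" so the output is valid FASTA.
            print { $out->{$kind} } ">$id\n$seq\n";
        }
    }
    return;
}
```

The handles could be opened once before the loop with lexical variables, for example `open my $copy_fh, '>>', 'copy.fasta' or die "copy.fasta: $!";`, instead of the bareword handles and unused `$file4` to `$file6` variables in the script.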

Please help me correct this script. Thank you very much!

0 answers:

No answers