Extracting protein sequences that contain a motif from a large multifasta file

Time: 2018-07-18 09:52:33

Tags: regex perl

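I have a multifasta file containing more than a hundred protein sequences. I am trying to extract the sequences that contain a user-supplied motif. I have tried the answer to the question "perl Script to search for a motif in a multifasta file and print the complete sequence along with the header line". It produces output, but the output is not correct.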

Input

>KGHL009_Homo_sapiens
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLI
>XIM5213_Mus_musculus
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKD

and so on

The Perl program I tried

#!/usr/bin/perl -w

use strict;
use warnings;

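print "Enter motif:";
my $motif = <STDIN>;    # must be declared with "my" under "use strict"
chomp $motif;           # drop the trailing newline so it is not part of the pattern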

my $seqfile = 'sequences.fasta';
my %seqs    = %{ read_fasta_as_hash($seqfile) };

open( my $motiffile, "+>", "motifseqs.fasta" ) or die $!;

foreach my $id ( keys %seqs ) {

    if ( $seqs{$id} =~ /$motif/ ) {
        print $motiffile $id, "\n";
        print $motiffile $seqs{$id}, "\n";
    }
}

sub read_fasta_as_hash {
    my $fn = shift;

    my $current_id = '';
    my %seqs;

    # three-argument open with a lexical filehandle
    open( my $fh, '<', $fn ) or die "Cannot open $fn: $!";

    while ( my $line = <$fh> ) {
        chomp $line;

        if ( $line =~ /^(>.*)$/ ) {
            $current_id = $1;
        }
        elsif ( $line !~ /^\s*$/ ) {    # skip blank lines
            $seqs{$current_id} .= $line;
        }
    }

    close $fh or die $!;

    return \%seqs;
}

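I don't know what is going wrong. When I run it on its own in a terminal, it works correctly and gives exactly 89 sequences containing the input motif, but after I incorporated it into another CGI script, it gives 189 sequences, most of which do not contain the motif.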
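
The CGI script is not shown, so the cause cannot be confirmed from the post. One thing that may be worth checking is what `$motif` actually contains by the time it reaches the match: if it arrives from the web form empty, or with a stray carriage return or space, `/$motif/` is no longer testing the motif that was typed. The sketch below illustrates that check under this assumption only. The helper names `clean_motif` and `matching_ids` and the toy records are hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: normalise a motif string before it is used as a regex.
# Returns undef when nothing usable is left (undefined, empty or whitespace only).
sub clean_motif {
    my ($raw) = @_;
    return undef unless defined $raw;
    $raw =~ s/^\s+|\s+$//g;    # strip newline, carriage return and spaces at both ends
    return length($raw) ? $raw : undef;
}

# Hypothetical helper: return the sorted headers of all sequences that contain
# the motif. The motif is treated as a regular expression, as in the script above.
sub matching_ids {
    my ( $seqs, $motif ) = @_;
    my $re  = qr/$motif/;
    my @ids = sort grep { $seqs->{$_} =~ $re } keys %$seqs;
    return @ids;
}

# Made-up toy records, used only to exercise the helpers.
my %demo = (
    '>seq1_toy' => 'MFKSLIQFFKSKS',
    '>seq2_toy' => 'FKISSKGPGDGWL',
    '>seq3_toy' => 'AAAAKSKSAAAA',
);

# Three kinds of input a form might deliver: padded, blank, and missing.
for my $raw ( "KSKS\r\n", "   ", undef ) {
    my $motif = clean_motif($raw);
    if ( !defined $motif ) {
        print "rejected: no usable motif\n";
        next;
    }
    my @hits = matching_ids( \%demo, $motif );
    print "motif '$motif': ", scalar(@hits), " of ", scalar( keys %demo ),
        " sequences\n";
}
```

If the count for the cleaned motif inside the CGI still differs from the terminal run, logging the raw parameter value would be a reasonable next step. If the motif is meant as literal residues rather than a regular expression, `qr/\Q$motif\E/` would stop metacharacters from being interpreted.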

0 answers:

No answers yet