perl - Extract duplicate sequences from a multi-fasta file

Date: 2018-11-06 06:36:02

Tags: perl

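I have a large fasta file, input.fasta, that contains many duplicated sequences. I want to enter a header name and extract every sequence whose header matches it. I know this is easy to do with awk/sed/grep, but I need Perl code.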

input.fasta

>OGH38127_some_organism
PAAALGFSHLARQEDSALTPKHYTWTAPGEGDVRAPCPVLNTLANHEFLPHNGKNITVDK
AITALGDAMNISPALATTFFTGGLKTNPTPNATWFDLDMLHKHNVLEHDGSLSRRDMHFD
TSNKFDAATFANFLSYFDANATVLGVNETADARARHAYDMSKMNPEFTITSSMLPIMVGE
SVMMMLVWGSVEEPGAQRDYFEYFFRNERLPVELGWTPGETEIGVPVVTAMITAMVAASP
TDVP
>ABC14110_some_different_org_name
WWVAPGPGDSRGPCPGLNTLANHGYLPHDGKGITLSILADAMLDGFNIARSDALLLFTQ
AIRTSPQYPATNSFNLHDLGRDQLNRHNVLEHDASLSRADDFFGSNHIFNETVFDESRAY
AMLANSKIARQINSKAFNPQYKFTSKTEQFSLGEIAAPIIAFGNSTSGEVNRTLVEYFFM
NERLPIELGWKKSEDGIALDDILRVTQMISKAASLITPSALSWTAETLTP
>OGH38127_some_organism
LPWSRPGPGAVRAPCPMLNTLANHGFLPHDGKNISEARTVQALGRALNIEKELSQFLFEK
ALTTNPHTNATTFSLNDLSRHNLLEHDASLSRQDAYFGDNHDFNQTIFDETRSYWPHPVI
DIQAAALSRQARVNTSIAKNPTYNMSELGLDFSYGETAAYILILGDKDFGKVNRSWVEYL
FENERLPVELGWTRHNETITSDDLNTMLEKVVN
.
.
.

I tried the following script, but it does not produce any output.

script.pl

#!/perl/bin/perl -w
use strict;
use warnings;

print "Enter a fasta header to search for:\n";
my $head = <>;

my $file = "input.fasta";
open (READ, "$file") || die "Cannot open $file: $!.\n";
my %seqs;
my $header;

while (my $line = <READ>){
    chomp $line;
    $line =~ s/^>(.*)\n//;
    if ($line =~ m/$head/){
        $header = $1;
    }
}
close (READ);

open( my $out , ">", "out.fasta" ) or die $!;

my @count_seq = keys %seqs;
foreach (@count_seq){
    print $out $header, "\n";
    print $out $seqs{$header}, "\n";
}

exit;

Please help me correct this script. Thanks!

2 answers:

Answer 0: (score: 4)

This becomes very simple if you use the BioPerl module Bio::SeqIO to handle the parsing of the fasta file:


#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;

my ($file, $name) = @ARGV;

# Read records from the input file and write matching ones to STDOUT.
my $in  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => "fasta");

while (my $s = $in->next_seq) {
    $out->write_seq($s) if $s->display_id eq $name;
}

Run it with the fasta file and the header name as command-line arguments.

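Answer 1: (score: 2)

There is no need to store the sequences in memory; you can print them directly while reading the file. Use a flag variable ($inside in the example) that tells you whether you are currently reading a wanted sequence.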

#! /usr/bin/perl
use warnings;
use strict;

my ($file, $header) = @ARGV;

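my $inside;    # true while reading a record whose header matches
open my $in, '<', $file or die $!;
while (<$in>) {
    # On every header line, reset the flag by comparing the header text.
    $inside = $1 eq $header if /^>(.*)/;
    print if $inside;
}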

Run it as:

perl script.pl file.fasta OGH38127_some_organism > output.fasta
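
For comparison, here is a minimal sketch of how the script from the question could be repaired while keeping its interactive prompt and its out.fasta output. The original prints nothing for three reasons: $head still carries the trailing newline from the prompt, so m/$head/ can never match a chomped line; the substitution s/^>(.*)\n// looks for a newline that chomp has already removed, so $1 is never set; and %seqs is never filled, so the final loop runs zero times. A hash keyed by header also could not hold several records that share one header. The helper name extract_records and the choice to pass the file name on the command line are assumptions made for this sketch, not part of the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the question): return every record whose
# header, i.e. the text after '>', is exactly $wanted, as [header, sequence]
# pairs in file order. An array is used because a hash keyed by header
# could hold only one of several records sharing that header.
sub extract_records {
    my ($fh, $wanted) = @_;
    my (@records, $keep);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {
            $keep = $1 eq $wanted;               # plain string comparison, no regex
            push @records, [$1, ''] if $keep;    # start a new matching record
        }
        elsif ($keep) {
            $records[-1][1] .= "$line\n";        # keep the original line wrapping
        }
    }
    return @records;
}

# Assumption: the fasta file is named on the command line and the header is
# typed at the prompt, e.g.  perl script.pl input.fasta
if (@ARGV) {
    my $file = shift @ARGV;
    print "Enter a fasta header to search for:\n";
    my $head = <STDIN>;                          # <STDIN>, since <> would read @ARGV files
    defined $head or die "No header given.\n";
    chomp $head;                                 # drop the trailing newline
    $head =~ s/^>//;                             # tolerate a pasted leading '>'

    open my $in, '<', $file or die "Cannot open $file: $!\n";
    my @records = extract_records($in, $head);
    close $in;

    open my $out, '>', 'out.fasta' or die "Cannot write out.fasta: $!\n";
    print {$out} ">$_->[0]\n$_->[1]" for @records;
    close $out;
    print scalar(@records), " matching record(s) written to out.fasta\n";
}
else {
    warn "Usage: perl $0 input.fasta\n";
}
```

Unlike the flag-variable answer above, this version holds the matching records in memory before writing them, which is only worthwhile if you want to count or post-process them first.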