Extracting specific FASTA sequences from a file using Perl

Time: 2015-12-28 14:37:46

Tags: perl

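I have written a Perl script to retrieve a list of hypothetical proteins from a FASTA file. I only get the header lines of the hypothetical proteins, but I want each complete sequence along with its protein ID.

The script is below.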

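#!/usr/bin/perl

use strict;
use warnings;

my $file = '/home/Desktop/hypo_proteins/testprotein.fasta';

# Lexical file handles, declared with "my" so the script compiles under "use strict"
open my $fh,  '<', $file        or die "Cannot open file $file: $!";
open my $out, '>', 'output.txt' or die "Cannot open output.txt: $!";

while ( my $line = <$fh> ) {

    chomp $line;

    # Only the matching header line is printed; the sequence lines after it are skipped
    if ( $line =~ /hypothetical protein/ ) {
        print $out "$line\n";
    }
}

close $fh;
close $out;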

The output I get from the script above is as follows:

>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
>gi|113460158|ref|YP_718214.1| hypothetical protein HS_0009 [Haemophilus somnus 129PT]
>gi|113460165|ref|YP_718221.1| hypothetical protein HS_0016 [Haemophilus somnus 129PT]

But I need output like this:

>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLILALRYLISQK
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE

1 answer:

Answer 0: (score: 1)

This will do as you ask:

#!/usr/bin/perl

use strict;
use warnings;

use constant INPUT  => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';

open my $in_fh,  '<', INPUT  or die "Cannot open input file: $!";   
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;

my $print;

while ( <$in_fh> ) {  

    if ( /^>/ ) {
        $print = /hypothetical protein/;
    }

    print if $print;
}

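Regarding the (now deleted) question about this solution: it uses the implicit variable $_ in several places. It is equivalent to this program: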

#!/usr/bin/perl

use strict;
use warnings;

use constant INPUT  => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';

open my $in_fh,  '<', INPUT  or die "Cannot open input file: $!";   
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;

my $print;

while ( defined( $_ = <$in_fh>) ) {  

    if ( $_ =~ /^>/ ) {
        $print = ( $_ =~ /hypothetical protein/ );
    }

    print $_ if $print;
}

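So I hope you can see that $print = $_ =~ /hypothetical protein/ checks whether the current line ($_) contains the string hypothetical protein, and sets $print to a true value if it does and to a false value if it does not.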
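To make the value of that match concrete, here is a minimal sketch. The helper name is_hypothetical and the two header strings are made up for illustration and are not part of the original script. In scalar context a pattern match yields 1 on success and an empty string on failure, which is what ends up in $print.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the scalar result of the pattern match.
sub is_hypothetical {
    my ($header) = @_;
    # scalar() forces scalar context: 1 on a match, '' (false) otherwise
    return scalar( $header =~ /hypothetical protein/ );
}

# Made-up header lines, used only to exercise the helper.
my @headers = (
    '>id1 hypothetical protein HS_0001',
    '>id2 DNA gyrase subunit B',
);

for my $header (@headers) {
    my $print = is_hypothetical($header);
    printf "%-36s => %s\n", $header, $print ? 'true' : 'false';
}
```

The first header should be reported as true and the second as false.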

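Because $print is declared outside the loop, it keeps its value from one iteration to the next. As you can see, it changes only on header lines, that is, when the current line begins with >, and it keeps that value until the next header line. So print if $print outputs each header that contains hypothetical protein together with all the lines that follow it, up to the next header.

I hope that helps.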
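The same flag technique can be tried without touching any files. The sketch below is only an illustration: the sub name filter_fasta, the variable $keep, and the in-memory records are assumptions and do not come from the original answer. It shows the flag being re-evaluated on header lines only and carried over the sequence lines in between.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up FASTA records, for illustration only.
my @lines = (
    ">id1 hypothetical protein A\n",
    "MFKSLIQF\n",
    "LYQKPIER\n",
    ">id2 DNA gyrase subunit B\n",
    "MSNSYDSS\n",
    ">id3 hypothetical protein B\n",
    "MKTAYIAK\n",
);

# Return the lines of every record whose header matches $pattern.
sub filter_fasta {
    my ( $pattern, @input ) = @_;

    my $keep = 0;    # declared outside the loop, so it persists across lines
    my @kept;

    for my $line (@input) {
        # Re-evaluate the flag on header lines only
        $keep = ( $line =~ $pattern ) if $line =~ /^>/;
        push @kept, $line if $keep;
    }

    return @kept;
}

print filter_fasta( qr/hypothetical protein/, @lines );
```

With this sample data the id1 and id3 records are printed in full and the id2 record is skipped. Passing the pattern as a qr// object keeps the search term out of the loop body, so the same sub could be reused for other header keywords.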