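I wrote a Perl script to retrieve the list of hypothetical proteins from a FASTA file. It only gives me the header lines of the hypothetical proteins, but I want the full sequences along with the protein IDs.
The script is below.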
#!/usr/bin/perl
use strict;
use warnings;

my $line;
open my $fh, '<', '/home/Desktop/hypo_proteins/testprotein.fasta'
    or die "Cannot open input file: $!";
open OUT, '>', 'output.txt' or die "Cannot open output file: $!";

while ( $line = <$fh> ) {
    chomp $line;
    # Only header lines contain this text, so only headers are printed
    if ( $line =~ /hypothetical protein/ ) {
        print OUT "$line\n";
    }
}
close $fh;
close OUT;
The output I get from the script above is:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
>gi|113460158|ref|YP_718214.1| hypothetical protein HS_0009 [Haemophilus somnus 129PT]
>gi|113460165|ref|YP_718221.1| hypothetical protein HS_0016 [Haemophilus somnus 129PT]
But I need the output to look like this:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLILALRYLISQK
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE
Answer 0 (score: 1)
This will do as you ask.
#!/usr/bin/perl
use strict;
use warnings;

use constant INPUT  => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';

open my $in_fh,  '<', INPUT  or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;    # make plain "print" write to the output file

my $print;         # true while inside a hypothetical protein record
while ( <$in_fh> ) {
    # Update the flag on every header line
    if ( /^>/ ) {
        $print = /hypothetical protein/;
    }
    print if $print;
}
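Regarding your (since deleted) question about this solution: it uses the implicit variable $_ in several places. It is equivalent to this program: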
#!/usr/bin/perl
use strict;
use warnings;

use constant INPUT  => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';

open my $in_fh,  '<', INPUT  or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;

my $print;
while ( defined( $_ = <$in_fh> ) ) {
    if ( $_ =~ /^>/ ) {
        $print = ( $_ =~ /hypothetical protein/ );
    }
    print $_ if $print;
}
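So I hope you can see that $print = $_ =~ /hypothetical protein/ checks whether the current line ($_) contains the string hypothetical protein and sets $print to a true value if it does, or to a false value if it does not.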
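To make the true and false values concrete, here is a minimal sketch of what the match returns when it is assigned to a scalar. The sub name match_flag and the sample header lines are invented for this illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the result of the match in scalar context,
# which is 1 on success and an empty string (false) on failure.
sub match_flag {
    my ($line) = @_;
    my $flag = ( $line =~ /hypothetical protein/ );
    return $flag;
}

# Sample header lines, made up for this illustration
my $hit  = match_flag(">gi|1|ref|X_1| hypothetical protein HS_0001\n");
my $miss = match_flag(">gi|2|ref|X_2| DNA polymerase III\n");

print "hit is true\n"   if $hit;
print "miss is false\n" if !$miss;
```

The flag is therefore an ordinary scalar that can be tested later with if, which is what print if $print relies on.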
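Because $print is declared outside the loop, it keeps its value from one iteration to the next. As you can see, it is changed only on header lines, where the current line starts with >, and it stays true until the next header line. As a result, print if $print outputs each header that contains hypothetical protein together with all the lines that follow it, up to the next header.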
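The way the flag carries over between iterations can be tried on a small sample without touching any files. The following is only a sketch under assumptions: the sub name filter_hypothetical and the sample records are invented, and it reads from an in-memory filehandle and collects the result in a string instead of using select on an output file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes FASTA text and returns only the records whose
# header line mentions "hypothetical protein".
sub filter_hypothetical {
    my ($fasta) = @_;

    # Read the string line by line through an in-memory filehandle
    open my $in_fh, '<', \$fasta or die "Cannot open in-memory input: $!";

    my $print;          # declared outside the loop, so it survives iterations
    my $result = '';
    while ( my $line = <$in_fh> ) {
        # Re-evaluate the flag only on header lines
        if ( $line =~ /^>/ ) {
            $print = ( $line =~ /hypothetical protein/ );
        }
        # Sequence lines inherit the flag set by the most recent header
        $result .= $line if $print;
    }
    close $in_fh;
    return $result;
}

# Invented sample data: two hypothetical proteins around one named protein
my $sample = <<'END_FASTA';
>id1 hypothetical protein A
MFKSL
IQFFK
>id2 DNA polymerase
MKKKK
>id3 hypothetical protein B
MAAAA
END_FASTA

print filter_hypothetical($sample);
```

With this sample, the id2 record is skipped entirely because its header resets the flag to false, while the sequence lines under id1 and id3 are kept.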
I hope that helps.