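Greetings to Perl masters all around the world.
I've run into another problem while programming. I'm writing a program that selects random sequences from a proteome FASTA file, given a specific input number.
A typical FASTA file looks like this: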
>seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
.... and so on
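The letters stand for the amino acids of the peptide.
So I have a FASTA file containing 1000 sequences and want to retrieve 63.21% of them, which would be 632.1 sequences. But the number of sequences cannot be fractional, so I want to round up if the fractional part is above 0.5 and round down if it is below 0.5.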
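A minimal sketch of that rounding rule, assuming the count is never negative. The helper name `round_half_up` is made up for this illustration; it is not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: round a non-negative number to the nearest integer,
# with halves going up. int() truncates, so adding 0.5 first does the rounding.
sub round_half_up {
    my ($x) = @_;
    return int($x + 0.5);
}

my $total    = 1000;      # number of sequences in the file
my $fraction = 0.6321;    # share of sequences to keep
my $count    = round_half_up($total * $fraction);

print "$count\n";         # prints 632
```

With 1000 sequences the product is about 632.1, so the result is 632.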
Here is my code for generating a random subset of the sequences, but it does not work quite right.
#!/usr/bin/perl
#Selecting 63.21% of random sequences from a proteom file.
use strict;
use warnings;
use List::Util qw(shuffle);
#Give the first argument as a proteom file.
if (@ARGV != 1)
{
    print "Invalid arguments\n";
    print "Usage: perl randseq.pl [proteom_file]";
    exit(0);
}
my $FILE = $ARGV[0];
my $i = 0;
my %protseq = {};
my $nIdx = 0;
#Extraction and counting of the all headers from a proteom file.
open(LIST,$FILE);
open(TEMP1, ">temp1");
while (my $line = <LIST>){
    chomp $line;
    if ($line =~ />(\S+) (.+)/){
        $i++;
        print TEMP1 $1,"\n";
    }
}
close(LIST);
close(TEMP1);
#Selection of random headers for generating a random subset of the proteom file.
my $GET_LINES = RoundToInt ($i*0.6321);
my @line_starts;
open(my $FH,'<','temp1');
open(TEMP2, ">temp2");
do {
    push @line_starts, tell $FH
} while ( <$FH> );
my $count = @line_starts;
my @shuffled_starts = (shuffle @line_starts)[1..$GET_LINES+1];
for my $start ( @shuffled_starts ) {
    seek $FH, $start, 0
        or die "Unable to seek to line - $!\n";
    print TEMP2 scalar <$FH>;
}
close(TEMP2);
#Assigning the sequence data to randomly generated header file.
open(DATA,'<','temp2');
while(my $line = <DATA>)
{
    chomp($line);
    $line =~ s/[\t\s]//g;
    if($line =~ /^([^\s]+)/)
    {
        $protseq{$1}++;
    }
}
close(DATA);
open(DATA, "$FILE");
open(OUT, ">random_seqs.fasta");
while(my $line = <DATA>)
{
    chomp($line);
    if($line =~ /^>([^\s]+)/)
    {
        if($protseq{$1} ne "")
        {
            $nIdx = 1;
            print OUT "$line\n";
        }
        else
        {
            $nIdx = 0;
        }
    }
    else
    {
        if($nIdx == 1)
        {
            print OUT "$line\n";
        }
    }
}
close(DATA);
close(OUT);
#subroutine for rounding
sub RoundToInt {
    int($_[0] + .5 * ($_[0] <=> 0));
}
system("erase temp1");
system("erase temp2");
exit;
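However, it sometimes gives the proper number of sequences and sometimes one sequence more. How can I get rid of that?
Or is there any better, shorter code?
Here you can get a file of 75 yeast proteins: http://www.peroxisomedb.org/Download/Saccharomyces_cerevisiae.fas
Hope I can resolve this problem soon... :(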
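A small sketch that may help pin down the "sometimes one more" behaviour. It repeats the offset-collecting loop and the slice from the code above on a made-up four-line header list. The in-memory file and the variable names other than `@line_starts` are assumptions for this illustration.

```perl
use strict;
use warnings;

# Hypothetical four-line header list standing in for the temp1 file.
my $text = "id1\nid2\nid3\nid4\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";

# Same loop shape as in the question: an offset is pushed before every read,
# including the final read that fails at end of file.
my @line_starts;
do {
    push @line_starts, tell $fh;
} while (<$fh>);
close($fh);

my $lines   = 4;
my $offsets = @line_starts;                  # one more than the number of lines
my $want    = 2;
my @slice   = @line_starts[1 .. $want + 1];  # inclusive range: $want + 1 items
my $picked  = @slice;

print "lines=$lines offsets=$offsets wanted=$want picked=$picked\n";
```

The loop records one offset more than there are lines (the extra one points at end of file), and the slice `1 .. $want + 1` picks one item more than wanted. Whenever the end-of-file offset lands in the slice, nothing is printed for it and the count comes out right; otherwise one extra header is selected. Skipping the end-of-file offset and slicing `0 .. $want - 1` would be one way to avoid both effects.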
Answer 0 (score: 4)
Your approach looks fine, it's just unnecessarily complicated. I would do it like this:
use strict;
use warnings;
# usage: randseq.pl [fraction] < input.fasta > output.fasta
my $fraction = (@ARGV ? shift : 0.6321);
# Collect input lines into an array of sequences:
my @sequences;
while (<>) {
    # A leading > starts a new sequence.  (The "\" is only there to
    # avoid confusing the Stack Overflow syntax highlighting.)
    push @sequences, [] if /^\>/;
    push @{ $sequences[-1] }, $_;
}
# Calculate how many sequences we want:
my $n = @sequences;
my $k = int( $n * $fraction + 0.5 );
warn "Selecting $k out of $n sequences (", 100 * $k / $n, "%).\n";
# Do a partial Fisher-Yates shuffle to select $k random sequences out of $n:
foreach my $i (0 .. $k-1) {
    my $j = $i + int rand($n-$i);
    @sequences[$i,$j] = @sequences[$j,$i];
}
# Print the output:
print @$_ for @sequences[0 .. $k-1];
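Note that this code reads the entire contents of the input file into memory. If the input file is too large for that, and you only need a small fraction of it, you can use reservoir sampling to select k random sequences from an arbitrarily large collection without having to hold more than k of them in memory at once: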
use strict;
use warnings;
my $k = (@ARGV ? shift : 632); # sample size: need to know this in advance
# Use reservoir sampling to select $k random sequences:
my @samples;
my $n = 0; # total number of sequences read
my $i; # index of current sequence
while (<>) {
    if (/^\>/) {
        # Select a random sequence from 0 to $n-1 to replace:
        $i = int rand ++$n;
        # Save all samples until we've accumulated $k of them:
        $samples[$n-1] = $samples[$i] if $n <= $k;
        # Only actually store the new sequence if it's one of the $k first ones:
        $samples[$i] = [] if $i < $k;
    }
    push @{ $samples[$i] }, $_ if $i < $k;
}
warn "Only read $n < $k sequences, selected all.\n" if $n < $k;
warn "Selected $k out of $n sequences (", 100 * $k / $n, "%).\n" if $n >= $k;
# Print sampled sequences:
print @$_ for @samples;
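However, if you really want a certain fraction of the input sequences, you will need to count them first, in a separate pass over the file.
As a side effect, both of the programs above also shuffle the sampled sequences uniformly. (In fact, I deliberately tweaked the reservoir sampling algorithm so that the shuffle is uniform for all values of n and k.) If you don't want that, you can always sort the sequences by whatever criteria you prefer before printing them.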
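The separate counting pass mentioned above could be sketched roughly as follows. This is an illustration under assumptions rather than part of the programs above: it takes a file path as its first argument instead of reading standard input (the file has to be read twice), and the subroutine names `count_sequences` and `sample_sequences` are made up for the example.

```perl
use strict;
use warnings;

# Pass 1: count the header lines, i.e. the number of sequences.
sub count_sequences {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!\n";
    my $n = 0;
    while (my $line = <$in>) {
        $n++ if $line =~ /^>/;
    }
    close($in);
    return $n;
}

# Pass 2: reservoir-sample $k sequences, holding at most $k of them in memory.
# The selection logic is the same as in the reservoir sampling program above.
sub sample_sequences {
    my ($path, $k) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!\n";
    my @samples;
    my $n = 0;    # sequences seen so far
    my $i;        # slot of the current sequence
    while (my $line = <$in>) {
        if ($line =~ /^>/) {
            $i = int rand ++$n;
            $samples[$n-1] = $samples[$i] if $n <= $k;
            $samples[$i] = [] if $i < $k;
        }
        # Lines before the first header are skipped.
        push @{ $samples[$i] }, $line if defined $i && $i < $k;
    }
    close($in);
    return @samples;
}

# usage (assumed): perl randfrac.pl input.fasta [fraction] > output.fasta
if (@ARGV) {
    my $path     = shift;
    my $fraction = @ARGV ? shift : 0.6321;
    my $k = int(count_sequences($path) * $fraction + 0.5);
    print @$_ for sample_sequences($path, $k);
}
```

Reading the file twice costs extra I/O, but memory use stays proportional to the number of selected sequences rather than to the file size.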
Answer 1 (score: 1)
I used the sprintf function to compute the rounded number, and arrays instead of temporary files.
#!/usr/bin/perl
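use strict;
use warnings;
if (@ARGV != 1)
{
    print "Invalid arguments\n";
    print "Usage: perl randseq.pl [proteom_file]\n";
    exit(0);
}
my $FILE = $ARGV[0];
open(my $list, '<', $FILE) or die "Cannot open $FILE: $!\n";
my @peptides;
my $element;
while (my $line = <$list>) {
    if ($line =~ /^>/) {
        # A header starts a new record: store the previous one, if any.
        push(@peptides, $element) if defined $element;
        $element = $line;
    }
    else {
        $element .= $line;
    }
}
# Store the final record, which no later header would flush.
push(@peptides, $element) if defined $element;
close($list);
# Round the target count; note that "%.0f" may round exact .5 ties to even.
my $GET_LINES = sprintf("%.0f", scalar(@peptides) * 0.6321);
my @out;
for (1 .. $GET_LINES) {
    # Pick a random remaining record and remove it so it is not chosen twice.
    my $index = int(rand(@peptides));
    push(@out, splice(@peptides, $index, 1));
}
open(my $out, '>', 'out.fasta') or die "Cannot open out.fasta: $!\n";
# Each record already ends with a newline, so none is added here.
print {$out} $_ for @out;
close($out);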
exit;
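For comparison, a shorter variant of the same array-based idea, sketched with `shuffle` from List::Util (which the code in the question already imports). The subroutine names `read_records` and `pick_fraction` are made up for this illustration, and the half-up rounding follows the rule stated in the question.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: split FASTA lines into records, one string per sequence.
sub read_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        # Start a new record at each header (and for any leading stray lines).
        push @records, '' if $line =~ /^>/ || !@records;
        $records[-1] .= $line;
    }
    return @records;
}

# Hypothetical helper: return a random subset holding $fraction of the records.
sub pick_fraction {
    my ($fraction, @records) = @_;
    my $k = int(@records * $fraction + 0.5);   # round halves up
    $k = @records if $k > @records;            # never ask for more than exist
    return (shuffle @records)[0 .. $k - 1];    # empty when $k is 0
}

# usage (assumed): perl randseq.pl input.fasta > output.fasta
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
    print pick_fraction(0.6321, read_records($in));
    close($in);
}
```

Shuffling the whole list and taking the first `$k` entries selects without replacement, so no record can appear twice.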