I am working with a FASTA file in the following format:
>chr1
AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC
CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT
TTTATCTTTAGGCGGTATGCACTTTTAACAAAAAANNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC
CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN
>chrM
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT
GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAANAATTTCCACC
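I want to use a sliding-window approach (non-overlapping windows of size 50). I want to find the coordinates of each 50 bp window, counting every character except N. For the first sequence, chr1, the output should be: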
chr1 0 50
chr1 50 100
chr1 100 215
chr1 215 265
The code is:
use warnings;
*ARGV or die "No input file specified";
open *first, '<',$ARGV[0] or die "Unable to open input file: $!";
$start=1;
while(<first>) {
    chomp;
    if ( /(>)(\w)/ ) { #taking lines which have name of chromosome
        @arr=split(">"); #splitting at ">" character and in $arr[1], there is chr name now
        if (defined @array){
            foreach (@array){
                $length++;
                if($_ ne N){
                    $non++;
                    if ($non == 50){
                        print $chr,"\t",$start,"\t",$length,"\n";
                        $start=$length;
                        $non=0;
                    }
                }
            }
        }
        undef @array;
        $length=0;
        $non=0;
        $start=0;
    }
    else {
        @count=split(//, $_); #splitting each character in line
        push(@array,@count); #storing each character in array till we find next chromosome
        $chr=$arr[1];
    }
}
foreach (@array){
    $length++;
    if($_ ne N){
        $non++;
        if ($non == 50){
            print $chr,"\t",$start,"\t",$length,"\n";
            $start=$length;
            $non=0;
        }
    }
}
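The problem is that my FASTA file is very large, and this code uses a lot of memory and time. Could you suggest how to do this faster and with less memory?
Thanks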
Answer 0 (score: 4)
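Always `use strict` and `use warnings` at the start of your program, especially when you are asking for help. They will save you a lot of time by finding many simple mistakes for you.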
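As a small illustration of that point, the sketch below compiles a fragment modelled on the question's `$_ ne N` test. The fragment and its variable names are made up for the demonstration. Under `use strict`, Perl should reject the unquoted `N` at compile time instead of silently treating it as the string 'N'.

```perl
use strict;
use warnings;

# Hypothetical fragment modelled on the question's comparison `$_ ne N`.
my $fragment = q{
    use strict;
    my $base    = 'A';
    my $is_base = $base ne N;   # N is an unquoted bareword
    1;
};

# A string eval compiles the fragment at run time, so the compile error
# ends up in $@ instead of aborting this script.
my $ok    = eval $fragment;
my $error = $@;

print "compiled cleanly\n"     if $ok;
print "strict says: $error"    unless $ok;
```

Writing the comparison as `$_ ne 'N'` removes the error.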
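Where did you learn to use typeglobs this way? `*ARGV` is always true, so it is useless as a test of whether `@ARGV` is empty, and using `*first` as a file handle works but is very unusual. A lexical file handle is better, like this: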
open my $first, '<', $ARGV[0] or die $!;
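However, there is no need to explicitly open files named as command-line arguments: Perl does this implicitly if you read from the empty file handle `<>`.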
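A minimal sketch of reading through `<>` follows. To stay self-contained it writes a tiny stand-in file with `File::Temp` and assigns its name to `@ARGV`, which is an assumption made only for the demonstration; in normal use the name would come from the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny stand-in FASTA file so the example is self-contained.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} ">chr1\nACGT\nNNAC\n";
close $tmp_fh or die "Cannot close $tmp_name: $!";

# Pretend the file name was given on the command line.
local @ARGV = ($tmp_name);

my @headers;
my $sequence_lines = 0;
while (my $line = <>) {          # <> opens each file named in @ARGV in turn
    chomp $line;
    if ($line =~ /^>(\S+)/) {    # header line: remember the sequence name
        push @headers, $1;
    }
    else {
        $sequence_lines++;
    }
}
print "headers: @headers, sequence lines: $sequence_lines\n";
```

Because only one line is held at a time, memory use stays small no matter how large the file is.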
This program seems to do what you need.
use strict;
use warnings;
use constant WINDOW => 50;
@ARGV or die "No input file specified";
my ($key, $pos, $start, $size);
while (<>) {
    if ( /^>(.+?)\s/ ) {
        $key = $1;
        $pos = $size = 0;
        undef $start;
        next;
    }
    chomp;
    for (split //) {
        next unless /[ATGC]/;
        $start //= $pos;
        $size++;
        if ($key and $size == WINDOW) {
            printf "%-6s %4d %4d\n", $key, $start, $pos + 1;
            undef $start;
            $size = 0;
        }
    }
    continue {
        $pos++;
    }
}
Output
chr1 0 50
chr1 50 100
chr1 100 215
chr1 215 265
chrM 0 50
chrM 50 100
chrM 100 150
chrM 150 200
chrM 200 250
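Since the question is about speed on a very large file, here is one possible refinement, offered as a sketch and not as part of the answer above. It assumes that the per-character loop is the bottleneck: it counts the bases on each line with `tr/ACGT//` and lets the regex engine jump over the bases needed to close a window. The helper name `scan_windows` and its callback interface are hypothetical, and whether it is actually faster should be measured on real data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): report half-open windows that
# each contain $window A/C/G/T characters, reading one line at a time.
sub scan_windows {
    my ($fh, $window, $report) = @_;
    my ($key, $start);
    my ($offset, $need) = (0, $window);

    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                  # drop the newline (and any \r)
        if ($line =~ /^>(\S+)/) {            # header: start a new sequence
            ($key, $offset, $need) = ($1, 0, $window);
            undef $start;
            next;
        }
        next unless defined $key;            # ignore data before a header

        my $count = ($line =~ tr/ACGT//);    # bases not yet consumed on this line
        pos($line) = 0;
        while ($count > 0) {
            if (!defined $start) {
                $line =~ /\G[^ACGT]+/gc;     # skip to the next base, if needed
                $start = $offset + pos($line);
            }
            if ($count < $need) {            # window continues on a later line
                $need -= $count;
                last;
            }
            # The window closes on this line: jump over exactly $need bases.
            $line =~ /\G(?:[^ACGT]*[ACGT]){$need}/gc
                or die "internal error: base count mismatch";
            $report->($key, $start, $offset + pos($line));
            $count -= $need;
            $need   = $window;
            undef $start;
        }
        $offset += length $line;
    }
    return;
}

# Demo on the chr1 record from the question, read from an in-memory handle.
my $data = join "\n",
    '>chr1',
    'AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC',
    'CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT',
    'TTTATCTTTAGGCGGTATGCACTTTTAACAAAAAANNNNNNNNNNNNNNN',
    'N' x 50,
    'GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC',
    'CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN',
    '';
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @rows;
scan_windows($fh, 50, sub { push @rows, [@_] });
printf "%-6s %4d %4d\n", @$_ for @rows;
```

The callback keeps the helper from accumulating results itself, so a caller can print each window as soon as it is found.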
Answer 1 (score: 1)
Since you need the code that outputs the data in two places, I moved it into a subroutine.
#!/usr/bin/perl
use strict ;
use warnings ;
if( ! @ARGV ) {
    die "No input file specified";
}
open my $file , '<', $ARGV[0] or die "Unable to open input file: $!";
my ( $chromosome , $start ) = ( undef , 1 ) ;
my @array = () ;
while(<$file>) {
    chomp;
    if ( m/^>(\w+)/ ) { # New chromosome
        my $new_chromosome = $1 ; # Save the new chromosome name temporarily
        if( @array ) {
            split_sequence( $chromosome , \@array ) ;
        }
        @array = () ;
        $chromosome = $new_chromosome ;
    } else {
        push @array , split( // ) ;
    }
}
split_sequence( $chromosome , \@array ) if @array ;
sub split_sequence {
    my ( $chromosome , $arrayref ) = @_ ;
    printf "%-10.10s %d (total length)\n" , $chromosome , $#{ $arrayref } ;
    my ( $start , $nonN ) = ( 0 , 0 ) ;
    for( my $i = 0 ; $i <= $#{ $arrayref } ; $i++ ) {
        if( $arrayref->[$i] ne 'N' ) {
            $nonN++ ;
            if( $nonN == 50 ) {
                printf "%-10.10s %8d %8d\n" , $chromosome , $start , $i ;
                $start = $i + 1 ;
                $nonN = 0 ;
            }
        }
    }
    if( $#{ $arrayref } > $start ) { # Incomplete window leftover ...
        # less than 50 bases long
        printf "%-10.10s %8d %8d **\n" , $chromosome , $start , $#{ $arrayref } ;
    }
}
Output:
perl SO002.pl SO002.fasta
chr1 299 (total length)
chr1 0 49
chr1 50 99
chr1 100 214
chr1 215 264
chr1 265 299 **
chrM 300 (total length)
chrM 0 49
chrM 50 99
chrM 100 149
chrM 150 199
chrM 200 249
chrM 250 300
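When reading these numbers, note that `$#{ $arrayref }` is the index of the last element, which is one less than the number of elements, and that the window ends printed here are inclusive (`0 49`) where the question asked for half-open ranges (`0 50`). A minimal sketch of the index/length difference, using a made-up five-character sequence:

```perl
use strict;
use warnings;

my @bases    = split //, 'ACGTN';          # made-up five-character sequence
my $arrayref = \@bases;

my $last_index = $#{ $arrayref };          # index of the final element
my $length     = scalar @{ $arrayref };    # number of elements

print "last index: $last_index\n";
print "length:     $length\n";
```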
Answer 2 (score: 0)
Here is a solution that uses the Bio::SeqIO module to parse the FASTA file.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
use constant WINDOW => 50;
my $in = Bio::SeqIO->new(-file => "fasta.txt" ,
                         -format => 'Fasta');
while ( my $seq = $in->next_seq() ) {
    my $count = 0;
    my $beg_pos = 0;
    local $_ = $seq->seq;
    while (/(.)/g) {
        ++$count if $1 =~ /[TAGC]/;
        if ($count == WINDOW) {
            $count = 0;
            printf "%s %d %d\n", $seq->id, $beg_pos, pos() - 1;
            $beg_pos = pos();
        }
        elsif (pos == length) { # have read last char in string
            printf "%s %d %d\n", $seq->id, $beg_pos, pos() - 1;
        }
    }
}
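The scanning loop in this answer can be tried without BioPerl by running the same idea on a plain string. The sketch below does that; the helper name `string_windows` and the short test sequence are made up for illustration. As in the answer, window ends are inclusive and a trailing partial window is reported as well.

```perl
use strict;
use warnings;

# Hypothetical helper: the same scanning idea applied to a plain string.
# Returns [begin, end] pairs with inclusive ends.
sub string_windows {
    my ($seq, $window) = @_;
    my ($count, $beg) = (0, 0);
    my @found;
    while ($seq =~ /(.)/g) {                  # pos($seq) advances one char per pass
        ++$count if $1 =~ /[TAGC]/;
        if ($count == $window) {
            push @found, [$beg, pos($seq) - 1];
            $count = 0;
            $beg   = pos($seq);
        }
        elsif (pos($seq) == length $seq) {    # last character reached
            push @found, [$beg, pos($seq) - 1];
        }
    }
    return @found;
}

# Made-up sequence with a window size of 3 to keep the output short.
my @found = string_windows('ACNNGTACG', 3);
print "@$_\n" for @found;
```

With this input the first window spans the two N characters, so it covers indices 0 to 4, followed by 5 to 7 and a one-base leftover at 8.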