Question

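I have a large FASTA file (a genetic sequence, an entire chromosome) in which each line contains 50 characters (the bases a, g, t and c). The file has roughly 4 million lines.

I want to reorganize the file so that every character of every line ends up on its own line in a new file. In other words, each 50-character line of the original file becomes 50 single-character lines, so the whole sequence is rewritten as a single column. The goal is to have the sequence as one column so that I can place an adjacent column next to it containing the genomic coordinate of each base.

This is how I am doing it, using Perl and a pair of nested for loops.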

unless(@ARGV) {
    # $0 name of the program being executed;
    print "\n usage: $0 filename\n\n"; 
    exit;
}

# use shift to pull the first value off @ARGV and store it in $fastafile;
my $fastafile = shift; 
open(FASTA, "<$fastafile");
my @count =(<FASTA>);
close FASTA;

# print scalar @count;

for ( my $i = 0; $i < scalar @count ; $i ++ ) {

#print "$count[$i]\n\n\n\n"; 
my @seq  = split( "", $count[ $i ] ); 
print " line = $i ";
for ( my $j = 0; $j < scalar @seq; $j++ ){
    #my $count =
    print "$seq[$j]  for count = $j \n"; 

    }

}

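It seems to work, but it is very, very slow. I am wondering whether it is slow because the FASTA file has 4 million lines, because of my code, or both. I am looking for suggestions to speed this up. Thanks!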

Answer 1

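The problem is that you are slurping the file, that is, reading all of its lines into an array at once. When a huge file is slurped, the process has to wait for all of the I/O to finish before it can start processing. One option is to process the file line by line: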

open my $fh, '<', $fastafile or die "Error opening file: $!";

while ( my $line = <$fh> ) {
    chomp $line;    # Remove the newline from the end of each line

    my @seq = split //, $line;

    # Loop from 0 to the last index of @seq
    for my $i ( 0 .. $#seq ) {
        print "$seq[$i] for count = $i\n";
    }
}
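
Since the question's stated goal is a single column of bases with a coordinate column next to it, the same line-by-line loop can carry a running position across lines instead of a per-line count. The following is only a sketch: the helper name `print_base_positions`, the tab-separated layout, and the idea that the first base sits at a known coordinate `$first_pos` are assumptions for illustration, not part of the answer above. It reads from an in-memory string so that it runs without an input file.

```perl
use strict;
use warnings;

# Hypothetical helper: stream FASTA lines from $in and print one
# "base<TAB>position" row per base to $out. $first_pos is the assumed
# coordinate of the first base in the file.
sub print_base_positions {
    my ( $in, $out, $first_pos ) = @_;
    my $pos = $first_pos;

    while ( my $line = <$in> ) {
        next if $line =~ /^>/;    # skip FASTA header lines
        chomp $line;              # remove the trailing newline
        for my $base ( split //, $line ) {
            print {$out} "$base\t$pos\n";
            $pos++;               # the position keeps counting across lines
        }
    }
    return $pos - $first_pos;     # number of bases written
}

# Demo on an in-memory "file"; a real run would open $fastafile instead.
my $fasta = ">demo\nacgt\nga\n";
open my $in, '<', \$fasta or die "Cannot open in-memory input: $!";
print_base_positions( $in, \*STDOUT, 1 );
close $in;
```

Only one line is held in memory at a time, so memory use stays flat regardless of the size of the chromosome.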

Answer 2

Perhaps the following will help:

use strict;
use warnings;

@ARGV or die "\n usage: $0 filename\n\n";

my $line = 0;
while (<>) {
    next if /^>/;
    chomp;

    print 'Line = ', $line++, "\n";
    my $count = 0;
    print "$_ for count = ", $count++, "\n" for split '';
    print "\n";
}

Usage: perl script.pl fastaIn

The above also skips the FASTA header lines.

Sample output:

Line = 0
T for count = 0
A for count = 1
C for count = 2
G for count = 3
A for count = 4
G for count = 5
...

Answer 3

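Use the Bio::SeqIO class to handle this, setting width and block for the FASTA format (the format-specific work is done by Bio::SeqIO::fasta). If I remember correctly, it has some tricks for handling very large sequences, although I think those are limited to the writing side (shameless self-promotion: I implemented one of them last year). Something like this should work: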

use Bio::SeqIO;

## omit the -format option and it will try to guess the format
my $in  = Bio::SeqIO->new(-file => $fastafile, -format => 'Fasta');

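## open the output once, outside the loop; reopening ">outputfilename"
## for every sequence would overwrite the previous one
my $out = Bio::SeqIO->new(-file => ">outputfilename", -format => 'Fasta');
$out->width(1); # 1 base pair per line

while (my $seq = $in->next_seq()) {
  $out->write_seq($seq);
}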

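Note that this allows multiple FASTA sequences in the same file (try it on a FASTA file containing a 6-line sequence to get a feel for it).

Also, this really does write a genuine FASTA file, so you cannot change the code to write a two-column file. But the second column with the base index that you mention does not make much sense to me. If you know the offset of the first base, the second column is just $column_number + $offset + 1 (to account for the FASTA header). BioPerl already has ways to do this, so please don't reinvent the wheel. Load the sequence as a Bio::Seq object and use its methods to get subsequences.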

my $in  = Bio::SeqIO->new(-file => $fastafile);

while (my $seq = $in->next_seq()) {
  ## $subseq will be a string with the sequence from bp 500 to 1000
  my $subseq = $seq->subseq(500, 1000);
}

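I am not sure how much of a performance gain you will get from this, but if you find anything that could be improved, please share it back with the BioPerl project.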

Answer 4

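It looks like your main limitation is that you are printing far more data than you read.

If each line is 50 characters plus a newline, then you "should" be writing 100/51 (about twice) as much data as you read.

But printing that long string "X for count = 29\n" means you are writing 16-17 characters for every input character...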
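
The arithmetic can be checked with a few lines of Perl. This is a small sketch, assuming 50-base lines and the output format used in Answer 1 ("X for count = N\n"); the helper name `bytes_for_line` is made up for this illustration.

```perl
use strict;
use warnings;

# Bytes written for one input line of $width bases, given a callback that
# formats a single base together with its index within the line.
sub bytes_for_line {
    my ( $width, $format ) = @_;
    my $total = 0;
    $total += length( $format->( 'a', $_ ) ) for 0 .. $width - 1;
    return $total;
}

my $width   = 50;
my $input   = $width + 1;    # 50 bases plus the newline
my $bare    = bytes_for_line( $width, sub { "$_[0]\n" } );
my $verbose = bytes_for_line( $width, sub { "$_[0] for count = $_[1]\n" } );

printf "input: %d bytes, bare column: %d bytes, verbose: %d bytes\n",
    $input, $bare, $verbose;
printf "verbose output is %.1fx the input size\n", $verbose / $input;
```

For 50-base lines this comes to 100 bytes for the bare single column and 840 bytes for the verbose format, against 51 bytes read.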

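On top of that, you will eat a lot of memory, although 4M lines x 50 characters is not really "too much" these days. Still, that is 20M+ of housekeeping overhead that you do not need to "spend".

Perhaps this is a case where writing your own loop is not as good as relying on what is built into Perl's operators, such as qq, also known as ""...

I have also moved the variable declarations outside the loop, to save a little more time on construction and garbage collection.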

 {                            # Inner scope for local $" and my vars            #"
     local $" = "\n";         # Separator character for stringifying lists      #"
     my ($line, @line);       # Avoid cons/gc during the loop
     while ($line = <$fh>)
     {
           chomp $line;       # Strip any newline
           @line = split ('', $line);
           print "@line\n";   # Stringification using $"
     }
 }

(Sorry, Stack Exchange's syntax highlighter does not know that $" is a variable name, so the highlighting looks a bit odd.)
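
The same "one print per input line" idea can still produce the coordinate column the question asks for. The sketch below is an assumption-laden variant, not the code from this answer: the helper name `write_rows_per_line`, the tab separator, and the starting coordinate `$first_pos` are all hypothetical, and it uses `join` rather than `$"` to assemble the rows. Whether it is actually faster than printing base by base has not been measured here and will depend on the Perl version and on I/O buffering.

```perl
use strict;
use warnings;

# Hypothetical helper: like the loop above, but each output row also carries
# a running position. $first_pos is the assumed coordinate of the first base.
sub write_rows_per_line {
    my ( $in, $out, $first_pos ) = @_;
    my $pos = $first_pos;
    my ( $line, @rows );            # declared once, reused for every line

    while ( $line = <$in> ) {
        next if $line =~ /^>/;      # skip FASTA header lines
        chomp $line;
        next unless length $line;   # nothing to print for an empty line

        @rows = ();
        push @rows, "$_\t" . $pos++ for split //, $line;
        print {$out} join( "\n", @rows ), "\n";    # one print per input line
    }
    return $pos;                    # position the next base would get
}

# Demo on an in-memory "file"; a real run would pass the FASTA filehandle.
my $fasta = ">demo\nacgt\nga\n";
open my $in, '<', \$fasta or die "Cannot open in-memory input: $!";
write_rows_per_line( $in, \*STDOUT, 1 );
close $in;
```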

Slow Perl script using nested for loops
