Question

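I am trying to write a Perl program that reads a FASTA file and prints a text file containing all possible (overlapping) k-mers of length 15 found in the sequence (FASTA) file. The program works fine when I search for non-overlapping k-mers, but when I code it to find overlapping k-mers it takes forever to run, and Cygwin terminated the program after 12 hours. (I left match_count in there to count the total; feel free to ignore that line.)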

#!/usr/bin/perl
use strict;
use warnings;

my $k = 15;
my $input = 'fasta.fasta';
my $output = 'text.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/=">";
#Separate header and sequence
#Remove spaces
unless (open(OUTPUT, ">", $output)){
    die "Unable to open file", $!;
}

while (my $line = <FASTA>){
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    while (length($sequence) >= $k){
        $sequence =~ m/(.{$k})/;
        print OUTPUT "$1\n";
        $sequence = substr($sequence, 1, length($sequence)-1);
    }
}

The result I am looking for is:

A total of 20938309 k-mers printed in the text file when I use the wc -l command.

Thanks in advance!

Answer 1

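I'm not sure why you aren't getting the desired result.

I thought I'd post the 2 programs I used after reading your problem description.

The first one only counts the kmers in the file I used for testing (fasta_dat.txt). It doesn't print them out; it just checks how many kmers there are.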

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "fasta_dat.txt" ,
                           -format => 'fasta');

my $count_kmers;
my $k = 15;
while ( my $seq = $in->next_seq) {
    $count_kmers += $seq->length - $k + 1;
}

print $count_kmers;

__END__
C:\Old_Data\perlp>perl t9.pl
18657
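
The per-record arithmetic in the loop above is that a sequence of length L contains L - k + 1 overlapping k-mers. A minimal sketch that checks this on a toy string follows. The helper name `count_kmers` and the toy sequence are made up for illustration and are not part of the program above. The helper also returns 0 for sequences shorter than k, a case the program above does not handle specially.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of overlapping k-mers in a sequence of
# the given length. Sequences shorter than k contribute nothing.
sub count_kmers {
    my ($length, $k) = @_;
    my $n = $length - $k + 1;
    return $n > 0 ? $n : 0;
}

# Toy sequence (made up for illustration), k = 3.
my $sequence = 'ACGTAC';
my $k        = 3;

# Enumerate the windows explicitly and compare with the formula.
my @kmers = map { substr($sequence, $_, $k) } 0 .. length($sequence) - $k;

print "@kmers\n";
print scalar(@kmers), " enumerated, ",
      count_kmers(length($sequence), $k), " from the formula\n";
```

For the toy input this prints the four windows ACG CGT GTA TAC, and both counts are 4.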

You can see the count, 18657, after the __END__ token. That count matches the number of kmers printed when I use your code.

#!/usr/bin/perl
use strict;
use warnings;
use 5.014;
use Devel::Size 'total_size';

my $k = 15;
my $input = 'fasta_dat.txt';
my $output = 'kmers.txt';
my $match_count = 0;

#Open File
unless (open(FASTA, "<", $input)){
    die "Unable to open fasta file", $!;
}

#Unwraps the FASTA format file
$/=">";
#Separate header and sequence
#Remove spaces
unless (open(OUTPUT, ">", $output)){
    die "Unable to open file", $!;
}

<FASTA>; # discard 'first' 'empty' record

my %seen;
while (my $line = <FASTA>){
    chomp $line;
    my($header, @seq) = split(/\n/, $line);
    my $sequence = join '', @seq;

    for my $i (0 .. length($sequence) - $k) {
        my $kmer = substr($sequence, $i, $k);
        print OUTPUT $kmer, "\n" unless $seen{$kmer}++;
    }
}
print total_size(\%seen);
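
Note that the %seen hash above makes the program print each distinct k-mer only once. If every overlapping k-mer should be printed (so that wc -l reports the total), a sketch without the hash could look like the following. It walks the sequence by offset with substr and never shortens the string, so it avoids copying the remaining sequence on every step as the loop in the question does. That repeated copying is a likely cause of the long run time, though I have not measured it. The sub name `print_kmers`, the in-memory FASTA text, and k = 3 are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write every overlapping k-mer of $sequence to the
# filehandle $out, one per line, and return how many were written.
sub print_kmers {
    my ($out, $sequence, $k) = @_;
    my $count = 0;
    # The range is empty when the sequence is shorter than $k.
    for my $i (0 .. length($sequence) - $k) {
        print {$out} substr($sequence, $i, $k), "\n";
        $count++;
    }
    return $count;
}

# Demo on a small in-memory FASTA text instead of a real file.
my $fasta = ">seq1\nACGTA\nCGT\n>seq2\nTTTT\n";
open(my $in, '<', \$fasta) or die "Unable to open in-memory FASTA: $!";

my $total = 0;
{
    local $/ = ">";          # read one FASTA record at a time
    <$in>;                   # discard the empty chunk before the first '>'
    while (my $record = <$in>) {
        chomp $record;       # strip the trailing '>' separator
        my ($header, @seq) = split /\n/, $record;
        $total += print_kmers(\*STDOUT, join('', @seq), 3);
    }
}
print "total: $total\n";
```

With the toy input, seq1 unwraps to ACGTACGT (6 k-mers) and seq2 is TTTT (2 k-mers), so the total is 8. Because nothing is stored between k-mers, memory use does not grow with the number of k-mers printed.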

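Update: a test I ran shows that the hash size in bytes is about 100 times the number of kmers. My test had about 18,500 kmers, which resulted in a hash size of 1.8 MB.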

For your data, with about 22M kmers, that would result in a hash size of ~2.2 GB. I don't know whether that would exceed your memory capacity.
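
The extrapolation above is a simple proportion. As a rough sketch using only the figures quoted in this answer (about 18,500 kmers giving a 1.8 MB hash, and about 22 million kmers in the real data), and assuming that the cost per hash entry stays roughly constant as the hash grows and that all kmers are distinct (neither of which was measured here):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures quoted in the answer above.
my $test_kmers      = 18_500;   # kmers in the test file (approximate)
my $test_hash_bytes = 1.8e6;    # reported total_size of %seen
my $real_kmers      = 22e6;     # kmers expected in the real data

# Assumption: memory per hash entry is roughly constant.
my $bytes_per_kmer = $test_hash_bytes / $test_kmers;
my $estimate_gb    = $bytes_per_kmer * $real_kmers / 1e9;

printf "about %.0f bytes per kmer, about %.1f GB for the full hash\n",
    $bytes_per_kmer, $estimate_gb;
```

This gives about 97 bytes per kmer and about 2.1 GB, in line with the ~2.2 GB obtained by rounding to 100 bytes per kmer. If many kmers repeat, the hash holds fewer keys and the real figure would be lower.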

Find and print all overlapping k-mers
