Hash tracks counts incorrectly and the run time is long

Time: 2016-11-16 04:09:56

Tags: perl hash bioinformatics

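I am working on a program in Perl; my output is wrong and it takes forever to process. The code is meant to take a large DNA sequence file and read it in 15-letter windows (k-mers), advancing one position at a time. I am supposed to load each k-mer sequence into a hash whose value is the number of occurrences of that k-mer, meaning each key should be unique, and when a duplicate is found the count for that particular k-mer should be incremented. I know from my professor's expected-output file that I have too many lines, so my code is allowing duplicates and not counting correctly. It also runs for more than 5 minutes, so I have to press Ctrl+C to escape. When I look at kmers.txt, the file is at least written and formatted correctly.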

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurrences of each k-mer. Each
# unique k-mer and its count will be written to file
# kmers.txt

#Create an empty hash
my %kMersHash = ();

#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
    die $!;
}

#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");

my $kMer      = 15;    #Set the size of the sliding window

my $stepSize  = 1;     #Set the step size

for (

    #The sliding window's start position is 0
    my $windowStart = 0;

    #Prevent going past end of the file
    $windowStart <= ( length($$sequenceRef) - $kMer );

    #Advance the window by the step size
    $windowStart += $stepSize

    )

{

    #Get the substring from $windowStart for length $kMer
    my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );

    #Call the subroutine to iterate through the kMers
    processKMers($kMerSeq);

}

sub processKMers {

    my ($kMerSeq) = @_;

    #Initialize $kCount with at least 1 occurrence  
    my $kCount = 1;

    #If the key already exists, the count is  
    #increased and changed in the hash
    if ( not exists $kMersHash{$kMerSeq} ) {

            #The hash key=>value is loaded: kMer=>count
            $kMersHash{$kMerSeq} = $kCount;
    }

    else {

            #Increment the count 
            $kCount ++;

            #The hash is updated 
            $kMersHash{$kMerSeq} = $kCount;
    }

    #Print out the hash to filehandle KMERS
    for (keys %kMersHash) {
            print KMERS $_, "\t", $kMersHash{$_}, "\n";
    }
}

sub loadSequence {

    #Get my sequence file name from the parameter array
    my ($sequenceFile) = @_;

    #Initialize my sequence to the empty string
    my $sequence = "";

    #Open the sequence file
    unless ( open( FASTA, "<", $sequenceFile ) ) {
            die $!;
    }

    #Loop through the file line-by-line
    while (<FASTA>) {

            #Assign the line, which is in the default 
            #variable to a named variable for readability.
            my $line = $_;

            #Chomp to get rid of end-of-line characters
            chomp($line);

            #Check to see if this is a FASTA header line
            if ( $line !~ /^>/ ) {

                    #If it's not a header line append it 
                    #to my sequence
                    $sequence .= $line;
            }

    }

    #Return a reference to the sequence
    return \$sequence;
}

2 answers:

Answer 0 (score: 0)

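Here is how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I have removed it. I have also changed the identifiers to snake_case, which is more common in Perl code. I saw no point in load_sequence returning a reference to the sequence, so I changed it to return the string itself.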

use strict;
use warnings 'all';

use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE  => 15;
use constant STEP_SIZE  => 1;

my $sequence = load_sequence( FASTA_FILE );

my %kmers;

for (my $offset = 0;
        $offset + KMER_SIZE <= length $sequence;
        $offset += STEP_SIZE ) {

    my $kmer_seq = substr $sequence, $offset, KMER_SIZE;

    ++$kmers{$kmer_seq};
}

open my $out_fh, '>', 'kmers.txt' or die $!;

for ( keys %kmers ) {
    printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}

sub load_sequence {

    my ( $sequence_file ) = @_;

    my $sequence = "";

    open my $fh, '<', $sequence_file or die $!;

    while ( <$fh> ) {
        next if /^>/;
        chomp;
        $sequence .= $_;
    }

    return $sequence;
}

Here is a more concise way to increment a hash element without using ++ directly on the hash

my $n;

if ( exists $kMersHash{$kMerSeq} ) {
    $n = $kMersHash{$kMerSeq};
}
else {
    $n = 0;
}

++$n;
$kMersHash{$kMerSeq} = $n;
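
To check that the bare increment and the explicit form agree, here is a minimal sketch run on a short made-up sequence. The toy string, the window size of 3, and the names count_with_increment and count_explicit are hypothetical and chosen only for illustration. Applying ++ to a hash element that does not exist yet treats the missing value as 0 and does not warn, which is why no exists check is needed.

```perl
use strict;
use warnings;

# Hypothetical toy input; k = 3 instead of 15 to keep the output small
my $sequence = 'ACGTACGTAC';
my $k        = 3;

# Count k-mers with a bare ++ on the hash element
sub count_with_increment {
    my ( $seq, $size ) = @_;
    my %count;
    for my $offset ( 0 .. length($seq) - $size ) {
        my $kmer = substr( $seq, $offset, $size );
        ++$count{$kmer};
    }
    return \%count;
}

# Count k-mers with an explicit exists check and a temporary variable
sub count_explicit {
    my ( $seq, $size ) = @_;
    my %count;
    for my $offset ( 0 .. length($seq) - $size ) {
        my $kmer = substr( $seq, $offset, $size );
        my $n    = exists $count{$kmer} ? $count{$kmer} : 0;
        $count{$kmer} = $n + 1;
    }
    return \%count;
}

my $by_increment = count_with_increment( $sequence, $k );
my $by_explicit  = count_explicit( $sequence, $k );

# Print k-mer, count from ++, count from the explicit form
for my $kmer ( sort keys %$by_increment ) {
    printf "%s\t%d\t%d\n", $kmer, $by_increment->{$kmer}, $by_explicit->{$kmer};
}
```

Every 3-mer in this toy sequence (ACG, CGT, GTA, TAC) occurs twice, and the two columns agree for each of them.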

Answer 1 (score: -1)

Everything in your code is fine apart from processKMers. The main issues:

  • $kCount does not persist between calls to processKMers, so in your else branch $kCount will always be 2

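  • You print on every call to processKMers, which is slowing you down. Each call writes out the entire hash again, which is also why the output has too many lines. Frequent printing slows a process significantly; wait until the end of the program and print once.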
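
The first point can be reproduced on a toy input. The sketch below is a stripped-down, hypothetical version of the counting logic in the original subroutine (the names broken_count and %seen are mine, not from the question). Because the counter is a fresh lexical on every call, the stored value never goes above 2.

```perl
use strict;
use warnings;

my %seen;

# Mirrors the counting logic of the original processKMers:
# $count is re-created on every call, so it only ever holds 1 or 2
sub broken_count {
    my ($kmer) = @_;
    my $count = 1;
    if ( not exists $seen{$kmer} ) {
        $seen{$kmer} = $count;
    }
    else {
        $count++;
        $seen{$kmer} = $count;
    }
    return;
}

# Feed in the same k-mer five times in a row
broken_count('AAA') for 1 .. 5;

# The stored count is 2, although AAA was seen 5 times
print "AAA\t$seen{AAA}\n";
```

As for the second point: in the worst case, where every k-mer is distinct, call i prints i lines, so n calls print n(n+1)/2 lines in total.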

Keeping your code roughly the same:

sub processKMers {

    my ($kMerSeq) = @_;

    if ( not exists $kMersHash{$kMerSeq} ) {
            $kMersHash{$kMerSeq} = 1;
    }
    else {
            $kMersHash{$kMerSeq}++;
    }
}

Then move your printing logic so that it runs immediately after your for loop.
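
A minimal sketch of that last step, assuming the %kMersHash name and the kmers.txt output file from the question. The lexical filehandle $kmers_fh, the made-up sample counts, and the sorted output order are my additions and not part of the original code.

```perl
use strict;
use warnings;

# Stand-in for the hash filled by the sliding-window loop (made-up counts)
my %kMersHash = (
    ACGTACGTACGTACG => 3,
    TTTTTTTTTTTTTTT => 1,
);

# Open the output file once, after all counting is finished
open( my $kmers_fh, '>', 'kmers.txt' ) or die "Cannot open kmers.txt: $!";

# Write each unique k-mer and its count exactly once
for my $kmer ( sort keys %kMersHash ) {
    print {$kmers_fh} $kmer, "\t", $kMersHash{$kmer}, "\n";
}

close($kmers_fh) or die "Cannot close kmers.txt: $!";
```

Written this way, the file holds one line per unique k-mer, regardless of how many times processKMers was called.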