Question

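My script contains a small piece of code that computes a few annotations for peak regions. The code below is the speed bottleneck: it takes several hours, because I have about 100,000 regions and need to run it on each one to get the CpG count. Is there a way to speed it up?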

for (i in 1:nrow(dataMtx)){
peakCord<-gsub("chr", "", peakCord)
peakSeq<-system(sprintf("samtools faidx genome.fa %s", peakCord[i]), intern=T)
peakSeq<-gsub(">.*$", "", peakSeq)
peakSeq<-paste(peakSeq, collapse='')
dataMtx$CpGCount[i] <-  sum(str_count(peakSeq, "CG"))
print(i)
}

Answer 1

Something like this might work. I don't know how fast it will be.

library(dplyr)
library(stringi)

result = 
  tibble(peakCord = peakCord) %>%
  rowwise() %>%
  mutate(peakCord.replace = 
           peakCord %>% 
           stri_replace_all_fixed("chr", ""),
         peakSeq = 
           peakCord.replace %>%
           sprintf("samtools faidx genome.fa %s", .) %>%
           system(intern = TRUE) %>%
           stri_replace_all_regex(">.*$", "") %>%
           paste(collapse=''),
         CpGCount = peakSeq %>% stri_count_fixed("CG") )
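
Much of the runtime in the original loop is probably spent starting a separate samtools process for every region, and the pipeline above still does that. A minimal sketch of an alternative, assuming your samtools version supports the `-r` (region file) option of `faidx`: write all regions to one file, call samtools once, then split the multi-FASTA output in R. The helper names `parse_fasta` and `count_cpg` and the file name `regions.txt` are made up for illustration, and the samtools call is left commented out so the example runs on a small hard-coded input.

```r
# Split multi-FASTA output (a character vector of lines) into one
# concatenated sequence per record, named by the header line.
parse_fasta <- function(lines) {
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)  # record index for every line
  groups <- factor(record[!is_header], levels = seq_len(sum(is_header)))
  seqs <- vapply(split(lines[!is_header], groups), paste, character(1),
                 collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Count non-overlapping "CG" occurrences in each sequence (base R only).
count_cpg <- function(seqs) {
  hits <- gregexpr("CG", toupper(seqs), fixed = TRUE)
  counts <- vapply(hits, function(h) sum(h > 0), integer(1))
  setNames(counts, names(seqs))
}

# Assumed usage with samtools (not run here):
# writeLines(gsub("chr", "", peakCord), "regions.txt")
# fasta_lines <- system2("samtools",
#                        c("faidx", "genome.fa", "-r", "regions.txt"),
#                        stdout = TRUE)

# Small hard-coded stand-in for the samtools output.
fasta_lines <- c(">1:100-111", "ACGTCG", "CGAATT", ">2:5-10", "TTTAAA")
seqs <- parse_fasta(fasta_lines)
cpg <- count_cpg(seqs)
print(cpg)
```

Joining the lines of each record before counting matters, because a CpG can straddle a line break in the FASTA output.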

Answer 2

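This is what I ended up with, written in Perl, in case anyone needs it. The code not only counts the CpGs but also reports their positions, and it adds the GC percentage as well.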

use strict;
use warnings;
BEGIN { our $start_run = time(); }

open(POSITIONS, "<", "mergedPeaks.bed") or die "Could not open 'mergedPeaks.bed': $!";
my $filename='outfile.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
my $string1="CG";
my @positions=();
while(<POSITIONS>){
chomp;
unless($_=~ m/^chrom/){
my ($seqName,$begin,$end) = split(/\t/);
$seqName=~s/chr//;
open(SAMTOOLS,"samtools faidx /n/meissnerfs2/Everyone/sthakurela/annotationFiles/genomeFASTA/hs/hg19/genome.fa $seqName:$begin-$end |");
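my @data = <SAMTOOLS>;
chomp(@data);
shift(@data) if @data && $data[0] =~ /^>/; # drop the FASTA header line, so positions are 0-based offsets into the sequence
my $seq=join("", @data);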
while ($seq =~ /$string1/gi ){
push(@positions, pos($seq)- length($string1));
}
my $length=scalar @positions;
my $seqLen=length($seq);
my $GC_count=($seq=~tr/GC/GC/);
my $GCper=sprintf("%.2f", ($GC_count/$seqLen)*100);
print $fh $_, "\t", $length,"\t", (join(",",@positions)), "\t", $GCper, "\n";
@positions=();
@data=();
$GC_count=0;
$GCper=0;
$seqLen=0;
close(SAMTOOLS);
}}
close(POSITIONS);
close $fh;

my $end_run = time();
my $run_time = $end_run - our $start_run;
print "Job took $run_time seconds\n";
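
For anyone who would rather stay in R, the per-sequence statistics this script reports (CpG count, CpG positions, GC percentage) can be sketched in base R. This is only an illustration on a hard-coded sequence, not a port of the script: `cpg_stats` is a made-up helper name, and the positions are 1-based, as returned by `gregexpr`.

```r
# Compute CpG count, CpG start positions (1-based) and GC percentage
# for a single sequence string.
cpg_stats <- function(seq) {
  seq <- toupper(seq)
  hits <- gregexpr("CG", seq, fixed = TRUE)[[1]]
  positions <- as.integer(hits[hits > 0])      # gregexpr gives -1 when there is no match
  n <- nchar(seq)
  gc <- nchar(gsub("[^GC]", "", seq))          # number of G and C bases
  gc_percent <- if (n > 0) round(100 * gc / n, 2) else NA_real_
  list(count = length(positions), positions = positions,
       gc_percent = gc_percent)
}

stats <- cpg_stats("ACGTCGCGAATT")
print(stats)
```

Applied with `lapply` to a vector of sequences that were fetched in a single samtools call, this avoids both the per-region process start and the per-region file handle.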

Code takes too long - how to speed it up
