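My script contains a small piece of code that computes several annotations for peak regions. The code below is the speed bottleneck and takes hours, because I have about 100,000 regions and need to run it on each one to compute the CpG count. Is there a way to speed it up?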
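library(stringr)  # provides str_count()

for (i in seq_len(nrow(dataMtx))) {
  # Strip the "chr" prefix so region names match the FASTA (repeated on every pass, as posted)
  peakCord <- gsub("chr", "", peakCord)
  # One samtools process is launched per region; this is likely the slow step
  peakSeq <- system(sprintf("samtools faidx genome.fa %s", peakCord[i]), intern = TRUE)
  # Blank out the ">region" header line, then join the sequence lines
  peakSeq <- gsub(">.*$", "", peakSeq)
  peakSeq <- paste(peakSeq, collapse = "")
  dataMtx$CpGCount[i] <- sum(str_count(peakSeq, "CG"))
  print(i)
}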
Answer 0 (score: 0)
Something like this might work. I don't know how much faster it will be.
library(dplyr)
library(stringi)

result =
  tibble(peakCord = peakCord) %>%
  rowwise() %>%
  mutate(
    # Strip the "chr" prefix
    peakCord.replace =
      peakCord %>%
      stri_replace_all_fixed("chr", ""),
    # Fetch the region with samtools, blank the header line, join the sequence lines
    peakSeq =
      peakCord.replace %>%
      sprintf("samtools faidx genome.fa %s", .) %>%
      system(intern = TRUE) %>%
      stri_replace_all_regex(">.*$", "") %>%
      paste(collapse = ""),
    CpGCount = peakSeq %>% stri_count_fixed("CG")) %>%
  ungroup()
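Note that the pipeline above still starts one samtools process per row, so the gain over the original loop may be small. One possible alternative, sketched here in base R, is to hand all regions to a single samtools call and parse the multi-record FASTA output. This assumes a samtools release whose faidx supports the -r/--region-file option and returns records in the order the regions were listed; neither was checked against the asker's setup. The helper names fetch_regions, parse_fasta and count_cpg are made up for this illustration, and upper-casing before counting is an assumption about how soft-masked (lowercase) bases should be treated.

```r
# Write all regions to a temporary file and call samtools once.
# Assumes `samtools faidx -r FILE` is available (recent samtools releases).
fetch_regions <- function(regions, fasta = "genome.fa") {
  region_file <- tempfile(fileext = ".txt")
  on.exit(unlink(region_file))
  writeLines(regions, region_file)
  system2("samtools",
          c("faidx", "-r", shQuote(region_file), shQuote(fasta)),
          stdout = TRUE)
}

# Turn FASTA lines (headers starting with ">") into a named character
# vector with one concatenated sequence per record.
parse_fasta <- function(lines) {
  is_header <- startsWith(lines, ">")
  n_records <- sum(is_header)
  record <- cumsum(is_header)  # record index of every line
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(n_records)))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  names(seqs) <- sub("^>", "", lines[is_header])
  seqs
}

# Count occurrences of a fixed pattern in each sequence, ignoring case.
count_cpg <- function(seqs, pattern = "CG") {
  hits <- gregexpr(pattern, toupper(seqs), fixed = TRUE)
  counts <- vapply(hits, function(h) sum(h > 0), integer(1))
  names(counts) <- names(seqs)
  counts
}

# Hypothetical usage with the objects from the question:
# regions <- gsub("chr", "", peakCord)
# dataMtx$CpGCount <- count_cpg(parse_fasta(fetch_regions(regions)))
```

The parsing and counting steps are vectorized, so the only external call is the single samtools invocation.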
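Answer 1 (score: 0)
Here is the Perl code I ended up using, in case anyone needs it. It not only counts the CpGs but also reports their positions, and it adds the GC percentage as well.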
use strict;
use warnings;

my $start_run = time();

# Input: tab-separated file of merged peaks (chrom, start, end).
my $bedfile = 'mergedPeaks.bed';
open(my $positions_fh, '<', $bedfile) or die "Could not open file '$bedfile' $!";
my $filename = 'outfile.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";

my $genome  = '/n/meissnerfs2/Everyone/sthakurela/annotationFiles/genomeFASTA/hs/hg19/genome.fa';
my $string1 = "CG";

while (<$positions_fh>) {
    chomp;
    next if /^chrom/;    # skip the header line
    my ($seqName, $begin, $end) = split(/\t/);
    $seqName =~ s/^chr//;    # the FASTA uses names without the "chr" prefix

    # One samtools call per region; read the FASTA record from its stdout.
    open(my $samtools, '-|', 'samtools', 'faidx', $genome, "$seqName:$begin-$end")
        or die "Could not run samtools: $!";
    my @data = grep { !/^>/ } <$samtools>;    # drop the ">name" header line
    close($samtools);
    chomp(@data);
    my $seq = join("", @data);

    # 0-based start offset of every CpG (case-insensitive match).
    my @positions;
    while ($seq =~ /$string1/gi) {
        push(@positions, pos($seq) - length($string1));
    }

    my $seqLen   = length($seq);
    my $GC_count = ($seq =~ tr/GCgc//);    # count G and C in either case
    my $GCper    = $seqLen ? sprintf("%.2f", ($GC_count / $seqLen) * 100) : "NA";

    # Output: original line, CpG count, comma-separated positions, GC percentage.
    print $fh join("\t", $_, scalar(@positions), join(",", @positions), $GCper), "\n";
}
close($positions_fh);
close $fh;

my $run_time = time() - $start_run;
print "Job took $run_time seconds\n";
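For readers who would rather stay in R, the per-sequence part of the Perl script (CpG count, CpG start offsets, GC percentage) can be sketched in base R as follows. This is only an illustration: cpg_stats is a hypothetical helper name, offsets are 0-based to mirror pos($seq) - length($string1) in the Perl code, and upper-casing the sequence first is an assumption about how lowercase (soft-masked) bases should be counted.

```r
# Summarise one sequence: number of CpGs, their 0-based start offsets,
# and the GC percentage rounded to two decimals.
cpg_stats <- function(seq) {
  seq <- toupper(seq)

  # gregexpr() returns 1-based match starts, or -1 when nothing matches.
  hits <- gregexpr("CG", seq, fixed = TRUE)[[1]]
  positions <- if (hits[1] == -1) integer(0) else as.integer(hits) - 1L

  # Split into single bases to compute the share of G and C.
  bases <- strsplit(seq, "", fixed = TRUE)[[1]]
  gc_percent <- if (length(bases) == 0) {
    NA_real_  # empty sequence: percentage is undefined
  } else {
    round(100 * mean(bases %in% c("G", "C")), 2)
  }

  list(count = length(positions),
       positions = positions,
       gc_percent = gc_percent)
}

res <- cpg_stats("ACGCGTCG")
print(res$count)                            # number of CpGs
print(paste(res$positions, collapse = ","))  # offsets, as in the Perl output
print(res$gc_percent)                       # GC percentage
```

Applied over sequences fetched in one batch, this avoids any per-region loop on the R side apart from the call to the helper itself.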