Question

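I have a vector of 300 sentences and I am trying to find the element-wise JW (Jaro-Winkler) distance using the stringdist package. The naive implementation takes too long to run, so I am looking for ways to reduce the runtime. I am trying to use the doParallel and foreach packages, but I am not getting any significant speedup. This is how I am going about it.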

library(foreach)
library(doParallel)
cl = makeCluster(detectCores())
registerDoParallel(cl)

sentence = # vector containing sentences 
jw_dist = foreach(i = 1:length(sentence)) %dopar% {
  # compare sentence i against every sentence that differs from it
  temp = sentence[sentence != sentence[i]]
  return(mean(1 - stringdist::stringdist(sentence[i], temp, method = "jw", nthread = 3)))
}
stopCluster(cl)

If anyone can point out ways to speed up this code, I would really appreciate it.

Answer 1

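So it looks like you are fighting a very high overhead.

Instead of parallelizing over single sentences, split the task into a few fairly large chunks and let apply do the rest. I have chosen 10 chunks of 100 sentence pairs each. There may be a faster combination, but this one is faster than what you asked about (at least for me):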

library(doParallel)
library(foreach)
library(stringdist)  # provides stringdist(), which the workers call below

# generate fake sentences

txt <- readLines(url('https://baconipsum.com/api/?type=all-meat&sentences=300&start-with-lorem=1&format=text'))

sentences <- strsplit(txt,'\\.\\s')[[1]]

sentences <- rep(sentences[sample(1:100,100)],10)

# pairwise combinations of sentences
cbn <- combn(1:length(sentences),2)

# simple timing
st <- Sys.time()

# Since you work on LINUX, you can use FORK
cl <-  makeCluster(detectCores(),type = 'FORK')
registerDoParallel(cl)


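# 10 chunks of 100 columns each; this covers columns 1..1000 of cbn,
# i.e. the first 1000 of the choose(1000, 2) = 499500 sentence pairs
res <- foreach(ii = seq(1,1000,100),.combine = 'c') %dopar% {
  apply(cbn[,ii:(ii+99)],2,function(x) stringdist(sentences[x[1]],sentences[x[2]],method = "jw"))
}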

stopCluster(cl)
Sys.time() - st

On my Ubuntu VM this code runs in about 1.8 seconds.

Specs:

Ubuntu 64 bit
R version 3.4
8 CPU cores
32 GB RAM
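
If all pairs are needed rather than a fixed number of chunks, the chunking idea above could be sketched as follows. This is a minimal sketch and not the code that was timed above. It swaps foreach for base R's parallel package (splitIndices and parLapply), uses a portable PSOCK cluster instead of FORK, and relies on stringdist() being vectorized over its two arguments. The toy sentences and the choice of 2 workers are assumptions for illustration only.

```r
library(parallel)
library(stringdist)

# toy input (assumption); replace with your own character vector
sentences <- c("the quick brown fox", "the quick brown dog",
               "a lazy dog sleeps", "lorem ipsum dolor", "bacon ipsum dolor")

# all index pairs, one pair per column
cbn <- combn(seq_along(sentences), 2)

# split the column indices into one contiguous chunk per worker
n_workers <- 2  # assumption; tune to your machine
chunks <- splitIndices(ncol(cbn), n_workers)

cl <- makeCluster(n_workers)
res <- parLapply(cl, chunks, function(idx, cbn, sentences) {
  # one vectorized call per chunk instead of one call per pair
  stringdist::stringdist(sentences[cbn[1, idx]], sentences[cbn[2, idx]],
                         method = "jw")
}, cbn = cbn, sentences = sentences)
stopCluster(cl)

# distances in the same order as the columns of cbn
pair_dist <- unlist(res)
```

Each worker receives one large block of pairs, so the cost of dispatching work is paid once per chunk rather than once per sentence.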

HTH

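Edit:

In this case, avoiding parallel processing altogether may be a good option.

With this lapply version, I can compute the mean for each sentence in about 17 seconds: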

res <- do.call(rbind,lapply(1:1000,function(ii) c(ii,1-mean(stringdist(sentences[ii],sentences[-ii],method = "jw")))))

This gives you a 2-column matrix containing the index of each sentence and 1 minus the mean of all its distances to the other sentences.
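
Another way to avoid the per-sentence loop, offered as a sketch rather than as the method timed above, is to let stringdistmatrix() from the same package build the full distance matrix in one call and take row means afterwards. The toy sentences are an assumption for illustration. Memory use grows with the square of the number of sentences, so this is only reasonable for inputs of moderate size.

```r
library(stringdist)

# toy input (assumption); replace with your own character vector
sentences <- c("the quick brown fox", "the quick brown dog",
               "a lazy dog sleeps", "lorem ipsum dolor", "bacon ipsum dolor")
n <- length(sentences)

# n x n matrix of JW distances; the diagonal is 0 (each sentence vs itself)
d <- stringdistmatrix(sentences, sentences, method = "jw")

# mean similarity to the other n - 1 sentences:
# subtract the self-similarity on the diagonal before averaging
sim <- 1 - d
mean_sim <- (rowSums(sim) - diag(sim)) / (n - 1)

# same shape as the lapply result: sentence index and 1 - mean distance
res <- cbind(index = seq_len(n), mean_sim = mean_sim)
```

Like the lapply version, this excludes only the sentence itself by index. The code in the question instead drops every sentence equal to sentence[i], so the two treat duplicated sentences differently.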
