Sentence generation with parallel processing produces garbled results

Time: 2018-03-13 04:50:03

Tags: r machine-learning foreach nlp doparallel

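I am trying to create a dataset for some neural network learning purposes. Previously I used for loops to concatenate and build the sentences, but since that took a very long time, I reimplemented the sentence generation with foreach. This is fast and finishes in 50 seconds. I am only doing slot filling on a template and then pasting the pieces together to form a sentence, but the output comes out garbled (misspellings inside words, unexpected spaces between words, words missing altogether, and so on).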

library(foreach)
library(doParallel)
library(tictoc)

tic("Data preparation - parallel mode")
cl <- makeCluster(3)
registerDoParallel(cl)

f_sentences<-c();sentences<-c()
hr=38:180;fl=1:5;month=1:5
strt<-Sys.time()
a<-foreach(hr=38:180,.packages = c('foreach','doParallel')) %dopar% {
  foreach(fl=1:5,.packages = c('foreach','doParallel')) %dopar%{
    foreach(month=1:5,.packages = c('foreach','doParallel')) %dopar% {
      if(hr>=35 & hr<=44){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=45 & hr<=59){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being low).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=60 & hr<=100){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being medium).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=101 & hr<=150){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      if(hr>=151 & hr<=180){
        sentences<-paste("About",toString(hr),"soldiers died in the battle (count being severly_high).","Around",toString(fl),
                         "soldiers and civilians went missing. We only have about",(sample(38:180,1)),"crates which lasts for",toString(month),"months as food supply")
        f_sentences<-c(f_sentences,sentences);outfile<-unname(f_sentences)}
      return(outfile)
    }
    write.table(outfile,file="/home/outfile.txt",append = T,row.names = F,col.names = F)
    gc()
  }
}
stopCluster(cl)
toc()

Statistics for the file created this way:

  • Lines: 427,975
  • Split used: word split (" ")
  • Vocabulary: 567

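    library(data.table)  # fread
    library(magrittr)    # %>%

    path <- "/home/outfile.txt"
    splitting <- " "     # split on single spaces, as listed above
    File <- fread(path, sep = "\n", header = FALSE)[[1]]
    corpus <- tolower(File) %>%
        # removePunctuation() %>%
        strsplit(splitting) %>%
        unlist()
    vocab <- unique(corpus)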

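Simple sentences like these should have a very small vocabulary, since the numbers are the only parameter that varies here. When checking the vocabulary output and using the grep command, I found a lot of garbled words (and some missing words), such as got and crpply, in the sentences. This should not normally happen, because I have a fixed template.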

      

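Expected sentence:

    "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply"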

         

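    grep -rnw 'outfile.txt' -e 'got'
    24105:"About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians went got 117 crates which lasts for 1 months as food supply"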

         

    grep -rnw 'outfile.txt' -e 'claply'
    76450:"About 73 soldiers died in the battle (count being medium). Around 1 soldiers and civilians went missing. We only have about 133 claply"
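The manual vocabulary inspection and grep checks described here could also be automated. The following is only a minimal sketch in base R: `lines` holds two hypothetical sample sentences (the second deliberately corrupted), and `template_words` is an assumed whitelist derived from the fixed template, not something from the original code.

```r
# Hypothetical sample lines; the second one is deliberately corrupted.
lines <- c(
  "About 40 soldiers died in the battle (count being severly_low). Around 1 soldiers and civilians went missing. We only have about 146 crates which lasts for 1 months as food supply",
  "About 62 soldiers died in the battle (count being medium). Around 2 soldiers and civilians went got 117 crates which lasts for 1 months as food supply"
)

# Assumed whitelist: every lower-cased, space-separated token the template can emit.
template_words <- c(
  "about", "soldiers", "died", "in", "the", "battle", "(count", "being",
  "severly_low).", "low).", "medium).", "high).", "severly_high).",
  "around", "and", "civilians", "went", "missing.", "we", "only", "have",
  "crates", "which", "lasts", "for", "months", "as", "food", "supply"
)

# Tokenize each line the same way as the vocabulary check (lower-case, split on " ").
tokens_per_line <- strsplit(tolower(lines), " ", fixed = TRUE)
tokens <- unlist(tokens_per_line)

# Pure numbers legitimately vary, so drop them before comparing.
words <- tokens[!grepl("^[0-9]+$", tokens)]

# Tokens that the template could never have produced.
unexpected <- setdiff(unique(words), template_words)

# Indices of lines containing at least one unexpected token.
bad_lines <- which(vapply(tokens_per_line,
                          function(tok) any(tok %in% unexpected),
                          logical(1)))

print(unexpected)
print(bad_lines)
```

On a real file, `lines` would come from `readLines()` on the output file, and `bad_lines` would give the line numbers worth inspecting.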

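The first few sentences are generated correctly, and the problem appears only after that. What could be causing this? I am just doing a plain paste with slot filling. Any help would be greatly appreciated!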

1 Answer:

Answer 0: (score: 3)

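The code runs correctly now, with no more errors. I assume the earlier errors were caused by a glitch. I also tested it on other machines with different R versions and still saw no problems.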
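If the garbling ever reappears, one possible explanation (an assumption, not something confirmed in this thread) is that several workers append to the same file at the same time, so their writes interleave. A minimal sketch that sidesteps this has each task return its sentence and lets the master process write the file once. The names `make_sentence`, `grid`, and `out_path` are illustrative and not part of the original code.

```r
library(foreach)
library(doParallel)

# Build one sentence from the template. Self-contained, so it can be
# shipped to the workers without further exports.
make_sentence <- function(hr, fl, month) {
  # Map the casualty count to the label used in the original templates:
  # 35-44, 45-59, 60-100, 101-150, 151-180.
  label <- as.character(cut(hr,
                            breaks = c(34, 44, 59, 100, 150, 180),
                            labels = c("severly_low", "low", "medium",
                                       "high", "severly_high")))
  paste("About", hr, "soldiers died in the battle (count being",
        paste0(label, ")."),
        "Around", fl,
        "soldiers and civilians went missing. We only have about",
        sample(38:180, 1), "crates which lasts for", month,
        "months as food supply")
}

# One row per (hr, fl, month) combination instead of three nested loops.
grid <- expand.grid(month = 1:5, fl = 1:5, hr = 38:180)

cl <- makeCluster(3)
registerDoParallel(cl)

# Workers only compute and return strings; nothing is written inside the loop.
sentences <- foreach(i = seq_len(nrow(grid)), .combine = c) %dopar% {
  make_sentence(grid$hr[i], grid$fl[i], grid$month[i])
}

stopCluster(cl)

# Single writer: the master process writes the whole file in one call.
out_path <- file.path(tempdir(), "outfile.txt")  # placeholder path
writeLines(sentences, out_path)
length(sentences)  # 143 * 5 * 5 = 3575 sentences
```

Flattening the three loops into a single `foreach` over `grid` also avoids calling `%dopar%` inside workers that have no parallel backend registered of their own.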