tm_map transformations return NULL only on the large dataset

Asked: 2017-08-21 14:34:20

Tags: r tm doparallel

I am working with text data, and I have a function that runs a set of standard transformations. When I test this function on samples of 10k, 100k, or even 1M documents, it returns the expected object: a corpus containing the processed text. However, when I run it on the full dataset (several million documents), the returned object is empty.

I can display and describe the data, but given the nature of the problem I don't know how to create a reproducible example.

The function takes a corpus and returns a corpus. I can share some sample data if that would be useful.

I realize this is vague, but I have been trying to get it to run for days. It's frustrating, because when I debug by stepping through the function manually, line by line, everything works as expected. It also works as expected on samples of the full data; I have tried running up to 1M records.

Some meta-information, in case it's of any value: I use a hosted RStudio, and when I run and debug there, everything seems to work fine. To run the script on the full data, I ssh into the server, invoke the script inside a screen session, and let it run for several hours.

I tried saving the function's output to an RDS file, but the corpus returned by the function is just NULL.

Here is the relevant code block and the culprit function:

library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)

# custom function for updating misspelled words using a lookup table
# (this one works; I have tested it and everything is fine here)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) stri_replace_all_regex(str = x, pattern = paste0("\\b", lut[,1], "\\b"), replacement = lut[,2], vectorize_all = FALSE))

# Now the suspect function:
# corpus parameter is a corpus of over 10m documents
# n parameter is for breaking corpus up into pieces to do transformations on using parallel processing
clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece for parallel processing

  # split the corpus into pieces for looping to get around memory issues with transformation
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr/n), each=n, length.out=nr))

  # save memory
  rm(corpus)

  # save pieces to rds files since not enough RAM
  tmpfile <- tempfile()
  for (i in seq_len(length(pieces))) {
    saveRDS(pieces[[i]],
            paste0(tmpfile, i, ".rds"))
  }

  # parallel processing using the doParallel package
  registerDoParallel(cores = 14)
  pieces <- foreach(i = seq_len(length(pieces))) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # spelling update based on lut
    piece <- tm_map(piece, function(i) stringi_spelling_update(i, spellingdoc))
    # regular transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...) 
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"), separate = F, strip = T, char.keep = c("-", ":", "/"))))
  }

  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  rm(pieces)

  return(corpus)
} # end clean_corpus function
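One detail worth noting about the combine step above: `c()` over NULL elements is itself NULL, so if every worker's last expression happened to evaluate to NULL, the function would return NULL with no error or warning. A minimal base-R sketch of that failure mode (the `pieces` list here is illustrative, not my real data):

```r
# If every element of `pieces` is NULL (e.g. each worker's last
# expression evaluated to NULL), the combine step returns NULL
# silently -- no error, no warning:
pieces <- list(NULL, NULL, NULL)
combined <- do.call(function(...) c(..., recursive = TRUE), pieces)
is.null(combined)  # TRUE
```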

I don't know if I'm even looking in the right place. If the function works fine on smaller pieces, could something else be going on?

How can this code work on "small" data, yet return NULL when I try to run it on my full data?
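One way to at least localize the problem would be to fail loudly if any worker comes back empty, instead of letting a NULL propagate through the combine step. A hypothetical guard (the `check_pieces` helper is mine, not part of the original function), which could be called on `pieces` right before the `do.call`:

```r
# Hypothetical guard: stop with an informative error, naming the
# offending pieces, rather than silently combining NULLs.
check_pieces <- function(pieces) {
  bad <- which(vapply(pieces, is.null, logical(1)))
  if (length(bad) > 0)
    stop("NULL pieces returned by workers: ", paste(bad, collapse = ", "))
  pieces
}

check_pieces(list("a", "b"))     # passes, returns the list unchanged
# check_pieces(list("a", NULL))  # would stop, reporting piece 2
```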

Also, here is the sessionInfo():

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doParallel_1.0.10      iterators_1.0.8        foreach_1.4.3          stringi_1.1.5         
 [5] textstem_0.0.1         tm_0.7-1               NLP_0.1-10             stringr_1.2.0         
 [9] qdap_2.2.5             RColorBrewer_1.1-2     qdapTools_1.3.3        qdapRegex_0.7.2       
[13] qdapDictionaries_1.0.6 dplyr_0.7.1            purrr_0.2.2.2          readr_1.1.1           
[17] tidyr_0.6.3            tibble_1.3.1           ggplot2_2.2.1          tidyverse_1.1.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11        lubridate_1.6.0     lattice_0.20-35     xlsxjars_0.6.1     
 [5] gtools_3.5.0        assertthat_0.2.0    psych_1.7.5         slam_0.1-40        
 [9] R6_2.2.1            cellranger_1.1.0    plyr_1.8.4          chron_2.3-50       
[13] httr_1.2.1          rlang_0.1.1         lazyeval_0.2.0      readxl_1.0.0       
[17] data.table_1.10.4   gdata_2.18.0        gender_0.5.1        foreign_0.8-67     
[21] igraph_1.0.1        RCurl_1.95-4.8      munsell_0.4.3       broom_0.4.2        
[25] compiler_3.4.0      modelr_0.1.0        pkgconfig_2.0.1     mnormt_1.5-5       
[29] reports_0.1.4       gridExtra_2.2.1     codetools_0.2-15    XML_3.98-1.9       
[33] bitops_1.0-6        openNLP_0.2-6       grid_3.4.0          nlme_3.1-131       
[37] jsonlite_1.4        gtable_0.2.0        magrittr_1.5        scales_0.4.1       
[41] xlsx_0.5.7          reshape2_1.4.2      bindrcpp_0.2        openNLPdata_1.5.3-2
[45] xml2_1.1.1          venneuler_1.1-0     wordcloud_2.5       tools_3.4.0        
[49] forcats_0.2.0       glue_1.1.1          hms_0.3             plotrix_3.6-5      
[53] colorspace_1.3-2    rvest_0.3.2         rJava_0.9-8         bindr_0.1          
[57] haven_1.1.0        

0 Answers