I am working with text data and have a function that runs a set of standard transformations. When I test this function on samples of 10k, 100k, or even 1M documents, it returns the expected object: a corpus containing the processed text. However, when I run it on the full data (several million documents), the returned object is empty.
I can display and describe the data, but given the nature of the problem I don't know how to create a reproducible example.
The function takes a corpus and returns a corpus. I can share some sample data if that would be useful.
I realize this is vague, but I have been trying to get it to run for days. It is frustrating because when I debug by manually stepping through the function line by line, everything works as expected. It also works as expected when I run full samples of up to 1M records.
Some meta-information, in case it is of any value: I have RStudio hosted on a server, and when I run and debug there everything seems to work fine. To run the script on the full data, I ssh into the server, invoke the script inside a screen session, and let it run for several hours.
I tried saving the function's output to an RDS file, but the corpus returned by the function is just NULL.
Here is the relevant code block and the suspect function:
library(tidyverse)
library(qdap)
library(stringr)
library(tm)
library(textstem)
library(stringi)
library(foreach)
library(doParallel)
# custom function for updating misspelt words using a lookup table
# (it works -- I have tested it, everything is fine with this one)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc)
  stri_replace_all_regex(str = x,
                         pattern = paste0("\\b", lut[, 1], "\\b"),
                         replacement = lut[, 2],
                         vectorize_all = FALSE))

# Now the suspect function:
# corpus parameter is a corpus of over 10m documents
# n parameter is for breaking the corpus into pieces to do transformations on
# using parallel processing
clean_corpus <- function(corpus, n = 500000) { # n is the length of each piece
  # split the corpus into pieces for looping, to get around memory issues
  # with the transformations
  nr <- length(corpus)
  pieces <- split(corpus, rep(1:ceiling(nr / n), each = n, length.out = nr))
  # save memory
  rm(corpus)
  # save pieces to rds files since there is not enough RAM
  tmpfile <- tempfile()
  for (i in seq_along(pieces)) {
    saveRDS(pieces[[i]], paste0(tmpfile, i, ".rds"))
  }
  # parallel processing using the doParallel package
  registerDoParallel(cores = 14)
  pieces <- foreach(i = seq_along(pieces)) %dopar% {
    piece <- readRDS(paste0(tmpfile, i, ".rds"))
    # spelling update based on the lookup table
    piece <- tm_map(piece, function(i) stringi_spelling_update(i, spellingdoc))
    # regular transformations
    piece <- tm_map(piece, content_transformer(replace_abbreviation))
    piece <- tm_map(piece, content_transformer(removeNumbers))
    piece <- tm_map(piece, content_transformer(function(x, ...)
      qdap::rm_stopwords(x, stopwords = tm::stopwords("en"),
                         separate = FALSE, strip = TRUE,
                         char.keep = c("-", ":", "/"))))
  }
  # combine the pieces back into one corpus
  corpus <- do.call(function(...) c(..., recursive = TRUE), pieces)
  rm(pieces)
  return(corpus)
} # end clean_corpus function
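Not part of my script, but a diagnostic sketch I could bolt onto the foreach loop (the piece count and simulated failure are illustrative). Setting `.errorhandling = "pass"` makes foreach return each worker's error object instead of aborting, and on Linux a forked worker that gets killed (for example by the out-of-memory killer) can come back as NULL without any error being raised at all, so inspecting the results before combining would show which pieces failed:

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 2)

# Illustrative stand-in for the per-piece work; .errorhandling = "pass"
# keeps each error object in the result list instead of stopping the loop.
results <- foreach(i = 1:4, .errorhandling = "pass") %dopar% {
  if (i == 3) stop("simulated failure in piece 3") else i * 10
}

failed      <- vapply(results, inherits, logical(1), what = "simpleError")
null_pieces <- vapply(results, is.null, logical(1))
which(failed)       # pieces that raised an error
which(null_pieces)  # pieces whose forked worker died silently
```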
I don't know whether I'm even looking in the right place. If the function works fine on smaller pieces, could something else be going on? How can this code work on "small" data, yet return NULL when I run it on my full data?
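One consistency check (my own sketch, not from the script): if every element of `pieces` came back from the foreach loop as NULL, the combine step itself would collapse to NULL, which matches the symptom exactly:

```r
# what `pieces` would look like if every forked worker died
pieces <- list(NULL, NULL, NULL)
do.call(function(...) c(..., recursive = TRUE), pieces)
# NULL
```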
Also, here is the sessionInfo():
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8
[10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8 foreach_1.4.3 stringi_1.1.5
[5] textstem_0.0.1 tm_0.7-1 NLP_0.1-10 stringr_1.2.0
[9] qdap_2.2.5 RColorBrewer_1.1-2 qdapTools_1.3.3 qdapRegex_0.7.2
[13] qdapDictionaries_1.0.6 dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1
[17] tidyr_0.6.3 tibble_1.3.1 ggplot2_2.2.1 tidyverse_1.1.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 lubridate_1.6.0 lattice_0.20-35 xlsxjars_0.6.1
[5] gtools_3.5.0 assertthat_0.2.0 psych_1.7.5 slam_0.1-40
[9] R6_2.2.1 cellranger_1.1.0 plyr_1.8.4 chron_2.3-50
[13] httr_1.2.1 rlang_0.1.1 lazyeval_0.2.0 readxl_1.0.0
[17] data.table_1.10.4 gdata_2.18.0 gender_0.5.1 foreign_0.8-67
[21] igraph_1.0.1 RCurl_1.95-4.8 munsell_0.4.3 broom_0.4.2
[25] compiler_3.4.0 modelr_0.1.0 pkgconfig_2.0.1 mnormt_1.5-5
[29] reports_0.1.4 gridExtra_2.2.1 codetools_0.2-15 XML_3.98-1.9
[33] bitops_1.0-6 openNLP_0.2-6 grid_3.4.0 nlme_3.1-131
[37] jsonlite_1.4 gtable_0.2.0 magrittr_1.5 scales_0.4.1
[41] xlsx_0.5.7 reshape2_1.4.2 bindrcpp_0.2 openNLPdata_1.5.3-2
[45] xml2_1.1.1 venneuler_1.1-0 wordcloud_2.5 tools_3.4.0
[49] forcats_0.2.0 glue_1.1.1 hms_0.3 plotrix_3.6-5
[53] colorspace_1.3-2 rvest_0.3.2 rJava_0.9-8 bindr_0.1
[57] haven_1.1.0