dfm() fails on a very large tokens object with "NAs introduced by coercion to integer range"

Time: 2019-06-05 14:08:00

Tags: quanteda

I am using quanteda (R 3.6.0, Linux, 64-bit) on a very large text database (> 20 GB). This is the code I am running:

options(echo=TRUE)
library(parallel)
library(doParallel)
registerDoParallel(8)
library(data.table)
library(stringr)
library(tidyr)
library(quanteda)
quanteda_options(threads=8)
data <- fread("county_monthly_baskets.csv")
degreedays <- fread("us_degreedays_long.csv")
data <- merge(data,degreedays,by=c('stcountyfp','month','year'),all=TRUE,allow.cartesian=TRUE)
z_cdd90 <- data[["z_cdd90"]]
heatTokens <- tokens(data$text, remove_punct=TRUE, remove_symbols=TRUE)
heatTokens <- tokens_compound(heatTokens, pattern=phrase(c('not stated','no usda organic seal on pkg','not applicable','n a','non refillable')))
heatTokens <- tokens_compound(heatTokens, pattern=phrase(c('low *','no *', 'rd *', 'reduced *', 'less *', 'not *', 'non *', 'vitamin a d')))
heatTokens <- tokens_remove(heatTokens, pattern=c('na'))
heatTokens <- tokens_remove(heatTokens,pattern=c('NOT_STATED','NO_USDA_ORGANIC_SEAL_ON_PKG','NOT_APPLICABLE','N_A','REGULAR','NON_REFILLABLE','NA'))
heatTokens <- tokens_select(heatTokens, min_nchar=3L)
heatTokens <- tokens_ngrams(heatTokens, n=1:2)
saveRDS(z_cdd90,file="z_cdd90.rds")
saveRDS(heatTokens,file="heatTokens.rds")
heatTokens <- readRDS("heatTokens.rds")
heatDfm <- dfm(heatTokens)
heatDfm <- dfm_trim(heatDfm, min_termfreq=100)
saveRDS(heatDfm,file="heatDfm.rds")
heatDfm_tfidf <- dfm_tfidf(heatDfm, scheme_tf="count", scheme_df="inverse")
saveRDS(heatDfm_tfidf,file="heatDfm_tfidf.rds")

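Tokenization works fine, so everything in the code above up to the "heatDfm" line runs successfully. Creating the dfm object, however, fails with the following error: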

Error in if ((lp <- length(p)) < 1 || p[1] != 0 || any((dp <- p[-1] -  :   missing value where TRUE/FALSE needed
Calls: dfm ... initialize -> callNextMethod -> .nextMethod -> sparseMatrix
In addition: Warning message:
In sparseMatrix(j = index, p = cumsum(c(1, lengths(x))) - 1, x = 1L,  :  NAs introduced by coercion to integer range
Execution halted
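
The warning can be reproduced in base R without quanteda. The following is a minimal sketch, assuming two hypothetical documents whose token counts together exceed .Machine$integer.max (the values are made up for illustration). It suggests that the column-pointer vector p from the warning is computed as a double and only turns into NA when it is coerced to integer:

```r
# Hypothetical per-document token counts (illustrative values only);
# with real data these would correspond to lengths(x) in the warning.
doc_lengths <- c(1500000000L, 1000000000L)

# Same expression as in the warning. c(1, <integer>) is a double vector,
# so cumsum() itself does not overflow here.
p <- cumsum(c(1, doc_lengths)) - 1

# Coercing to integer is the step where values above
# .Machine$integer.max (2147483647) become NA.
p_int <- suppressWarnings(as.integer(p))

exceeds_limit <- p[length(p)] > .Machine$integer.max
has_na <- anyNA(p_int)
print(p_int)
```

An NA in p would then make the `if (...)` condition shown in the error evaluate to NA, which matches "missing value where TRUE/FALSE needed".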

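I assume this is because the frequency of certain tokens exceeds .Machine$integer.max. Is there any way to combine the integer64 class from the bit64 package with dfm() to work around this? I assume that just calling library(bit64) would not be enough. Otherwise, I suppose I could simply discard the most frequent tokens before creating the dfm, but I would like to avoid that if possible. As of June 3, I have reproduced this problem with the latest quanteda from both CRAN (1.4.3) and the GitHub development version (1.4.5).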
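
If the limit is hit by the total number of tokens rather than by any single token's frequency, one possible alternative to discarding frequent tokens is to build the dfm in pieces. The sketch below only shows how documents could be grouped into consecutive chunks whose token totals stay under a limit. `chunk_ids` is a hypothetical helper written for illustration, not a quanteda function, and the toy lengths and the small limit are made up:

```r
# Assign documents to consecutive chunks so that the total token count
# of each chunk stays at or below `limit`.
# `chunk_ids` is a hypothetical helper, not part of quanteda.
chunk_ids <- function(doc_lengths, limit = .Machine$integer.max) {
  ids <- integer(length(doc_lengths))
  current <- 1L
  running <- 0  # double, so the running total cannot overflow
  for (i in seq_along(doc_lengths)) {
    n <- as.numeric(doc_lengths[i])
    # Start a new chunk when adding this document would exceed the limit
    if (running > 0 && running + n > limit) {
      current <- current + 1L
      running <- 0
    }
    running <- running + n
    ids[i] <- current
  }
  ids
}

# Toy example with a deliberately small limit
toy_lengths <- c(4L, 3L, 5L, 2L, 6L)
ids <- chunk_ids(toy_lengths, limit = 8)
print(ids)
```

With the real data, doc_lengths would be lengths(heatTokens), and each chunk (for example heatTokens[ids == k]) could be passed to dfm() separately. Whether the resulting matrices can then be combined, for instance with rbind(), without running into the same limit depends on the number of non-zero cells and is untested here.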

0 answers:

No answers yet