Question

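Hello: I am using the tm package for some text analysis, and I need to gsub terms using paired replacements from a replacement vector. The pattern/replacement dictionary looks like this.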

# pattern/replacement dictionary (kept as character, not factor)
df <- data.frame(replace = c('crude', 'oil', 'price'),
                 with = c('xcrude', 'xoil', 'xprice'),
                 stringsAsFactors = FALSE)
# load tm
library(tm)
# load the crude corpus that ships with tm
data('crude')

I tried this and got an error:

tm_map(crude, mapply, gsub, df$replace, df$with)

Warning message:
In mclapply(content(x), FUN, ...) :
all scheduled cores encountered errors in user code

Answer 1

Following this answer, you can use stri_replace_all_fixed() from stringi and wrap it in content_transformer() to preserve the corpus structure:

library(stringi)

# vectorize_all = FALSE applies every pattern/replacement pair to each document
corp <- tm_map(crude, content_transformer(
  function(x) {
    stri_replace_all_fixed(x, df$replace, df$with, vectorize_all = FALSE)
  })
)
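To make the effect of vectorize_all = FALSE concrete, here is a minimal base R sketch of the same idea: each pattern/replacement pair is applied in turn to the whole text. The helper name replace_all_pairs and the sample strings are hypothetical, introduced only for illustration; this is not how stringi implements it internally.

```r
# Hypothetical helper (not part of tm or stringi): apply each
# pattern/replacement pair, one after another, to every element of x.
replace_all_pairs <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  Reduce(
    function(txt, i) gsub(patterns[i], replacements[i], txt, fixed = TRUE),
    seq_along(patterns),
    x  # initial value: the untouched text
  )
}

# Same dictionary as in the question, under a different (assumed) name
dict <- data.frame(
  replace = c("crude", "oil", "price"),
  with    = c("xcrude", "xoil", "xprice"),
  stringsAsFactors = FALSE
)

# Made-up sample text, not taken from the crude corpus
txt <- c("crude oil price fell", "the price of oil")
result <- replace_all_pairs(txt, dict$replace, dict$with)
print(result)
```

Because the pairs are applied sequentially, a later pattern can also match text produced by an earlier replacement. That is harmless for this dictionary, but worth checking for others. A helper like this should be usable inside content_transformer() in the same way as the anonymous function above.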

Or multigsub() from qdap:

library(qdap)

corp <- tm_map(crude, content_transformer(
  function(x) {
    multigsub(df$replace, df$with, text.var = x, fixed = FALSE)
  })
)

Which gives:

> corp[[1]][1]

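"Diamond Shamrock Corp said that\neffective today it had cut its contract xprices for xcrude xoil by\n1.50 dlrs a barrel.\n    The reduction brings its posted xprice for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The xprice reduction today was made in the light of falling\nxoil product xprices and a weak xcrude xoil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. xoil companies that\nhave cut its contract, or posted, xprices over the last two days\nciting weak xoil markets.\n Reuter"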

You can then apply other tm functions to the resulting corpus:

> DocumentTermMatrix(corp)
#<<DocumentTermMatrix (documents: 20, terms: 1269)>>
#Non-/sparse entries: 2262/23118
#Sparsity           : 91%
#Maximal term length: 17
#Weighting          : term frequency (tf)
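As for why the original call failed: the warning shows tm_map() calling FUN on each document with the extra arguments appended, so mapply() receives the document as its first positional argument, which is FUN. Even with the arguments in the right order, mapply() pairs patterns and replacements element-wise and returns one partially replaced copy per pair rather than a single fully replaced string. The sketch below illustrates both points on a plain string; the variable names and sample text are assumptions made for illustration.

```r
patterns     <- c("crude", "oil", "price")
replacements <- c("xcrude", "xoil", "xprice")
txt <- "crude oil price"  # made-up sample text

# 1) Passing the text where mapply() expects FUN fails, which mirrors
#    what happens inside tm_map(crude, mapply, gsub, ...).
doc <- list(content = txt)  # stand-in for a document object
attempt <- tryCatch(
  mapply(doc, gsub, patterns, replacements),
  error = function(e) e
)
failed <- inherits(attempt, "error")

# 2) With FUN in the right place, each pair is applied to a fresh copy
#    of txt, giving three strings with one replacement each.
separate <- mapply(gsub, patterns, replacements,
                   MoreArgs = list(x = txt), USE.NAMES = FALSE)
print(failed)
print(separate)
```

This is why the answers above use functions that apply all pairs to the same string instead of mapply().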

Using mapply to replace patterns in a vector with replacements from a vector
