Question

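I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).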

This for-loop works, but it is very slow. Performance seems to degrade as a function f ~ 1/n_docs.

for (i in seq(from= 1, to=length(corpus), by=1)){
    if(opts$options$verbose == TRUE || i %% 50 == 0){
        print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
    }
    DublinCore(corpus[[i]], "title") = csv[[i,10]]  
    DublinCore(corpus[[i]], "Publisher" ) = csv[[i,16]]   #institutions
}

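The loop may be doing something to the corpus variable, but I don't know what. When I put the same logic inside a tm_map() call (similar to the lapply() function), it runs much faster, but the changes are not persistent: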

i = 0
corpus = tm_map(corpus, function(x){
            i <<- i + 1


    if(opts$options$verbose == TRUE){
        print(paste(i, " ", substr(x, 1, 140), sep = " "))
    }

    meta(x, tag = "Heading") = csv[[i,10]]  
    meta(x, tag = "publisher" ) = csv[[i,16]] 
})

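After the tm_map call returns, the variable corpus has empty metadata fields. They should be filled in. I also have a few other things to do with the collection.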

The R documentation for the meta() function says this:

     Examples:
      data("crude")
      meta(crude[[1]])
      DublinCore(crude[[1]])
      meta(crude[[1]], tag = "Topics")
      meta(crude[[1]], tag = "Comment") <- "A short comment."
      meta(crude[[1]], tag = "Topics") <- NULL
      DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
      DublinCore(crude[[1]], tag = "Format") <- "XML"
      DublinCore(crude[[1]])
      meta(crude[[1]])
      meta(crude)
      meta(crude, type = "corpus")
      meta(crude, "labels") <- 21:40
      meta(crude)

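I have tried many of these calls (with the variable "corpus" instead of "crude"), but they do not seem to work. Someone else once seemed to have the same problem with a similar data set (forum post from 2009, no response).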

Answer 1

Here is a bit of benchmarking...

With the for loop:

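library(tm)              # DublinCore(), meta(), tm_map(), and the crude data set
library(microbenchmark)
data("crude")
corpus <- crude
expr.for <- function() {
  for (i in seq_along(corpus)) {
    # dummy placeholder values stand in for the csv columns
    DublinCore(corpus[[i]], "title") <- LETTERS[round(runif(26))]
    DublinCore(corpus[[i]], "Publisher") <- LETTERS[round(runif(26))]
  }
}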

microbenchmark(expr.for())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398

With tm_map:

corpus <- crude

expr.map <- function() {
  tm_map(corpus, function(x) {
    meta(x, "title") = LETTERS[round(runif(26))]
    meta(x, "Publisher" ) = LETTERS[round(runif(26))]
    x
  })
}

microbenchmark(expr.map())
# Unit: milliseconds
#         expr      min       lq   median       uq      max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482

So the tm_map version, as you noticed, seems to be about 4 times faster (median of roughly 5.8 ms versus 23.6 ms).

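In your question you say that the changes made in the tm_map version are not persistent. That is because you do not return x at the end of your anonymous function. The end of the function should read: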

meta(x, tag = "Heading") = csv[[i,10]]  
meta(x, tag = "publisher" ) = csv[[i,16]] 
x
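
The reason the trailing x matters can be illustrated without tm at all. In R, the value of an assignment expression is its right-hand side, so a function whose last statement is an assignment hands that value back, not the modified object. The following is only a base-R analogy under stated assumptions: attr<- stands in for meta<-, lapply() and Map() stand in for tm_map(), and docs and titles are hypothetical stand-ins for the corpus and the csv column.

```r
# Base-R analogy only: attr<- plays the role of meta<-,
# lapply()/Map() play the role of tm_map().
docs   <- list("first document", "second document")  # hypothetical corpus
titles <- c("Title A", "Title B")                     # hypothetical csv column

# Last statement is an assignment: the function returns its right-hand
# side, so each document is replaced by the title string.
no_return <- lapply(docs, function(x) {
  attr(x, "title") <- "Title A"
})

# Ending with x returns the modified copy, so the attribute is kept.
with_return <- lapply(docs, function(x) {
  attr(x, "title") <- "Title A"
  x
})

# Map() pairs each document with its own title, which avoids the
# global counter updated with <<- inside the function.
paired <- Map(function(x, title) {
  attr(x, "title") <- title
  x
}, docs, titles)

print(no_return[[1]])
print(attr(with_return[[1]], "title"))
print(attr(paired[[2]], "title"))
```

Whether tm_map() accepts extra per-document arguments the way Map() does is not something this sketch assumes; it only shows the return-value behaviour that explains the empty metadata.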

R: tm Textmining package: Doc-Level metadata generation is slow
