Question

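I'm currently working on a text mining project where I want to extract relevant keywords from text (note that I have a lot of text documents).

I'm using the udpipe package. There is a great vignette online (http://bnosac.be/index.php/blog/77-an-overview-of-keyword-extraction-techniques). Everything works, but when I run the code, the part

x <- udpipe_annotate(ud_model, x = comments$feedback)

is really, really slow (especially when you have a lot of text). Does anyone have an idea how I can make this part faster? A workaround is of course fine.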

library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback) # This part is really, really slow 
x <- as.data.frame(x)

Thanks a lot!

Answer 1

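I'm adding an answer based on the future API. It works independently of the OS you are using (Windows, macOS, or a Linux flavour).

The future.apply package has parallel alternatives for the whole base *apply family. The rest of the code is based on the answer from @jwijffels. The only difference is that I use data.table in the annotate_splits function.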

library(udpipe)
library(data.table)

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish", overwrite = FALSE)
ud_es <- udpipe_load_model(ud_model)


# returns a data.table
annotate_splits <- function(x, file) {
  ud_model <- udpipe_load_model(file)
  x <- as.data.table(udpipe_annotate(ud_model, 
                                     x = x$feedback,
                                     doc_id = x$id))
  return(x)
}


# load parallel library future.apply
library(future.apply)

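# Define cores to be used
ncores <- 3L
# multisession works on every OS; plan(multiprocess) is deprecated in recent versions of future
plan(multisession, workers = ncores)

# split comments into groups of roughly 100 documents each
corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))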

annotation <- future_lapply(corpus_splitted, annotate_splits, file = ud_model$file_model)
annotation <- rbindlist(annotation)

Answer 2

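The R package udpipe uses the UDPipe version 1.2 C++ library. Annotation speeds are detailed in the paper (see Table 8 in https://doi.org/10.18653/v1/K17-3009). If you want to speed it up, run it in parallel, since annotation parallelizes easily.

The example below parallelizes across 16 cores using parallel::mclapply, which gives up to a 16x speedup on large corpora, provided of course that you have 16 cores. You can use any parallelization framework; below I used the parallel package. If you are on Windows you would need, for example, parallel::parLapply, but nothing stops you from using other parallel options (snow / multicore / future / foreach / ...) to annotate in parallel.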

library(udpipe)
library(data.table)
library(parallel)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")
ud_model <- udpipe_download_model(language = "french-partut")

annotate_splits <- function(x, file) {
  model <- udpipe_load_model(file)
  x <- udpipe_annotate(model, x = x$feedback, doc_id = x$id, tagger = "default", parser = "default")
  as.data.frame(x, detailed = TRUE)
}

corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))
annotation <- mclapply(corpus_splitted, FUN = function(x, file){
  annotate_splits(x, file) 
}, file = ud_model$file_model, mc.cores = 16)
annotation <- rbindlist(annotation)
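
On Windows, where forking through mclapply is not available, the same split could be processed with a socket cluster and parallel::parLapply. The following is a minimal sketch under a few assumptions: annotate_with_cluster is a hypothetical wrapper name (not part of udpipe), ncores = 2L is an arbitrary default, and every call carries an explicit package prefix so the workers do not need any package attached.

```r
# Worker: reload the model from its file path inside each R process,
# as in the code above, then annotate one chunk of comments.
annotate_splits <- function(x, file) {
  model <- udpipe::udpipe_load_model(file)
  anno <- udpipe::udpipe_annotate(model, x = x$feedback, doc_id = x$id)
  as.data.frame(anno)
}

# Hypothetical wrapper (an assumption for illustration, not a udpipe function):
# start a socket cluster, annotate every chunk, and bind the results.
annotate_with_cluster <- function(corpus_splitted, file, ncores = 2L) {
  cl <- parallel::makeCluster(ncores)
  # Always shut the workers down, even if annotation fails.
  on.exit(parallel::stopCluster(cl), add = TRUE)
  annotation <- parallel::parLapply(cl, corpus_splitted, annotate_splits, file = file)
  data.table::rbindlist(annotation)
}
```

With the objects defined above, it would be called as annotate_with_cluster(corpus_splitted, file = ud_model$file_model, ncores = 4L).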

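Note that udpipe_load_model takes some time as well, so a better strategy is probably to parallelize over the number of cores on your machine instead of over chunks of 100 as shown above.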
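
One way to get one chunk per core, so that each worker loads the model only once, is sketched below. chunk_ids is a hypothetical helper introduced here for illustration, and the toy data frame merely stands in for comments (it only mimics the id and feedback columns).

```r
# Hypothetical helper: map each of n_rows rows to one of n_chunks
# contiguous, roughly equal-sized groups.
chunk_ids <- function(n_rows, n_chunks) {
  # Never ask for more chunks than there are rows, and always at least one.
  n_chunks <- max(1L, min(n_chunks, n_rows))
  ceiling(seq_len(n_rows) * n_chunks / n_rows)
}

# Toy stand-in for `comments` (assumption: only id and feedback are needed).
toy <- data.frame(id = 1:10,
                  feedback = paste("review", 1:10),
                  stringsAsFactors = FALSE)

ncores <- 4L  # in practice this could come from parallel::detectCores()
chunks <- split(toy, chunk_ids(nrow(toy), ncores))

# One list element per core; together they cover every row exactly once.
length(chunks)
sum(sapply(chunks, nrow))
```

In the answer's code, this would amount to replacing the split call with split(comments, chunk_ids(nrow(comments), ncores)) and leaving the rest unchanged.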

Make udpipe_annotate() faster
