Question

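The following code correctly removes stop words from myCharVector. But when myCharVector contains a large number of sentences, it takes too long to finish. How can I speed up the loop (for example, using apply)?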

Thanks.

library(tm)

myCharVector <- c("This is the first sentence", "hello this is second", "and now is the third one")
for (i in 1:length(myCharVector))
{
  for (j in 1:length(stopwords("en")))
  {
    tmp1 <- paste(stopwords("en")[j], " ", sep = "")
    tmp1 <- paste(" ", tmp1, sep = "")
    myCharVector[i] <- gsub(tmp1, " ", myCharVector[i])
  }
}

Answer 1

You can try mgsub:

library(qdap)
mgsub(sprintf(' %s ', stopwords('en')), ' ', myCharVector)
#[1] "This first sentence" "hello second"        "and now third one"

Answer 2

There seems to be a domain-specific solution for this case.

But in general, try to make more use of R's vectorized operations. For example, instead of pasting each word individually, do this:

stopwords = paste0(' ', stopwords('en'), ' ')

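This surrounds every stop word with spaces in a single call. Likewise, you don't need to loop over myCharVector; you can call gsub on the whole vector directly.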
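To illustrate that gsub is vectorized over its input, here is a minimal base R sketch. The single pattern " is " is chosen only for illustration; the sample vector is the one from the question.

```r
myCharVector <- c("This is the first sentence",
                  "hello this is second",
                  "and now is the third one")

# gsub processes every element of the character vector in one call,
# so no explicit loop over myCharVector is needed
result <- gsub(" is ", " ", myCharVector, fixed = TRUE)

print(result)
# [1] "This the first sentence" "hello this second"       "and now the third one"
```

Only the loop over the stop words remains; the loop over sentences disappears entirely.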

Most importantly, don't loop over indices. It is indirect, slow, and (almost?) always unnecessary. Loop over the items directly:

for (word in paste0(' ', stopwords('en'), ' '))
    myCharVector = gsub(word, ' ', myCharVector)

Compared with your solution, this is at once shorter, clearer, and more efficient.

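(That said, it produces wrong results either way: for example, stop words at the very start or end of a string are never matched. You should use a predefined function.)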
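One possible way around the incorrect results, sketched here in base R, is to build a single pattern with word boundaries instead of surrounding each word with spaces. This is not the predefined function the answer refers to. The short stop_words vector is a hypothetical stand-in for stopwords('en'), used so that the example runs without tm.

```r
# Hypothetical, illustrative stop-word list (stands in for tm::stopwords("en"))
stop_words <- c("this", "is", "the", "and")

myCharVector <- c("This is the first sentence",
                  "hello this is second",
                  "and now is the third one")

# One alternation pattern with word boundaries, so stop words at the
# start or end of a string are matched as well
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")

# Single vectorized pass over all sentences; ignore.case also catches "This"
cleaned <- gsub(pattern, "", myCharVector, ignore.case = TRUE, perl = TRUE)

# Collapse the leftover runs of spaces and trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
# [1] "first sentence" "hello second"   "now third one"
```

Unlike the space-delimited version, this also removes the leading "This" and "and". With a real stop-word list, entries containing regex metacharacters would need escaping first.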
