比r中的gsub更快的方法

时间:2015-03-26 08:29:40

标签: regex r

我试图找出,如果在R中有比 gsub 矢量化函数更快的方法,我有一些"句子跟随数据框" (发送$ words)然后我有从这些句子中删除的单词(存储在wordsForRemoving变量中)。

sent <- data.frame(words = 
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good", 
                       "notebook is not good but i love batterytop"), 
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)

wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")

然后,我将创建大数据&#34;模拟时间消耗计算...

df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL

使用以下 gsub 方法从已发送的$ words中删除字词(wordsForRemoving)需要72.87秒。我知道,这不是很好的模拟,但实际上我使用的单词字典超过3.000个单词,300,000个句子,整体处理时间超过1.5小时。

pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)

#  user  system elapsed 
# 72.87    0.05   73.79

拜托,任何人都可以帮助我为我的任务编写更快的方法。非常感谢任何帮助或建议。非常感谢前进。

2 个答案:

答案 0 :(得分:17)

正如Jason所说,stringi对你来说是个不错的选择..

以下是stringi

的表现
system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552 

更新(感谢Arun)

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281 

答案 1 :(得分:1)

这不是一个真正的答案,因为我没有找到任何总是更快的方法。显然,这取决于文本/向量的长度。短文本gsub的执行速度最快。对于较长的文本或矢量,有时gsubperl=TRUE有时stri_replace_all_regex的执行速度最快。

这里有一些测试代码可以尝试:

library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)

a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl=T)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")

identical(a,b); identical(a,c); identical(a,d)

library(microbenchmark)
mc <- microbenchmark(
  gsub = gsub(pattern = "[()]", replacement = "", x = text),
  gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl=T),
  stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
  stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval  cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100 a   
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100    d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100  b  
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100   c