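I'm trying to find out whether there is a faster approach in R than the vectorized gsub function. I have the following data frame with some "sentences" (sent$words), and I have a set of words to remove from these sentences (stored in the wordsForRemoving variable).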
sent <- data.frame(words =
c("just right size and i love this notebook", "benefits great laptop",
"wouldnt bad notebook", "very good quality", "bad orgtop but great",
"great improvement for that bad product but overall is not good",
"notebook is not good but i love batterytop"),
user = c(1,2,3,4,5,6,7),
stringsAsFactors=F)
wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
"right", "very","benefits", "extra","benefit","top","extraordinarily",
"extraordinary", "super","benefits super","good","benefits great",
"wouldnt bad")
Then I create a "big data" simulation to measure the computation time:
df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL
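Removing the words (wordsForRemoving) from sent$words with the following gsub approach takes 72.87 seconds. I know this isn't a great simulation, but in reality I use a dictionary of more than 3,000 words on 300,000 sentences, and the overall processing takes more than 1.5 hours.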
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)
# user system elapsed
# 72.87 0.05 73.79
Could anyone please help me write a faster approach for this task? Any help or advice is much appreciated. Thanks a lot in advance.
Answer 0 (score: 17)
As Jason said, stringi is a good option for you.
Here is how stringi performs compared with plain gsub:
system.time(res <- gsub(pattern, "", sent$words))
user system elapsed
66.229 0.000 66.199
library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
user system elapsed
21.246 0.320 21.552
Update (thanks to Arun)
system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
user system elapsed
12.290 0.000 12.281
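Before comparing timings, it can help to confirm on a small subset that the default engine and perl = TRUE return the same result for this kind of pattern. The following is a minimal sketch, not part of the original benchmark; the names small_words and small_remove are made up for illustration and contain only a few entries from the question's data.

```r
# Illustrative subset of the question's data (names are hypothetical).
small_words  <- c("benefits great laptop", "very good quality", "bad orgtop but great")
small_remove <- c("great", "benefits", "very good", "good")

# Same construction as in the question: whole-word alternation plus an optional trailing space.
pattern <- paste0("\\b(?:", paste(small_remove, collapse = "|"), ")\\b ?")

res_default <- gsub(pattern, "", small_words)               # default regex engine
res_perl    <- gsub(pattern, "", small_words, perl = TRUE)  # PCRE engine

# Both engines should agree on this subset.
identical(res_default, res_perl)
res_perl
```

Note that "top" style entries only match as whole words because of \\b, so "orgtop" is left alone, and a removed word at the very end of a sentence leaves a trailing space that trimws() can strip if needed.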
Answer 1 (score: 1)
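This isn't really an answer, since I haven't found any method that is always faster. Apparently it depends on the length of your text/vector. For short texts, gsub is fastest. For longer texts or vectors, sometimes gsub with perl=TRUE and sometimes stri_replace_all_regex is fastest.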
Here is some test code to try:
library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)
a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl=T)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
identical(a,b); identical(a,c); identical(a,d)
library(microbenchmark)
mc <- microbenchmark(
gsub = gsub(pattern = "[()]", replacement = "", x = text),
gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl=T),
stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100 a
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100 d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100 b
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100 c
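Since the ranking seems to depend on input size, one way to explore it is to repeat the comparison over vectors of different lengths. The sketch below is only an assumption about how that could be done in base R; time_gsub is a hypothetical helper, the sizes are arbitrary, and it covers only the two gsub variants (a stringi call could be added in the same way). Timings will vary by machine, so no numbers are shown.

```r
text <- "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\")"

# Hypothetical helper: time both gsub variants on n copies of `text`.
time_gsub <- function(n) {
  x <- rep(text, n)
  t_default <- system.time(a <- gsub("[()]", "", x))[["elapsed"]]
  t_perl    <- system.time(b <- gsub("[()]", "", x, perl = TRUE))[["elapsed"]]
  stopifnot(identical(a, b))  # both engines must produce the same output
  c(n = n, default = t_default, perl = t_perl)
}

# One row per input size; columns are n, default and perl (elapsed seconds).
timings <- do.call(rbind, lapply(c(1, 100, 10000), time_gsub))
timings
```

For very short inputs system.time is too coarse, which is why the answer's microbenchmark call is the better tool there; this loop is more useful for the larger vector sizes.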