Building on two questions I asked earlier:
R: How to prevent memory overflow when using mgsub in vector mode?
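I like @Tyler's suggestion of using fixed = TRUE, since it speeds up the computation significantly. However, it is not always applicable. I need to replace caps as a standalone word, with or without punctuation around it. It is not known in advance what will precede or follow the word, but it must be ordinary punctuation (, . ! - + and so on); it cannot be a digit or a letter. See the example below: capsule must stay as it is.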
i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
orig = "caps"
change = "cap"
gsub_FixedTrue <- function(i) {
i = paste0(" ", i, " ")
orig = paste0(" ", orig, " ")
change = paste0(" ", change, " ")
i = gsub(orig,change,i,fixed=TRUE)
i = gsub("^\\s|\\s$", "", i, perl=TRUE)
return(i)
}
#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {
i = gsub(paste0("\\b",orig,"\\b"),change,i)
return(i)
}
print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct
Results. The second output is the one I need:
[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
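The requirement stated above ("any punctuation, but never a letter or a digit") can also be written directly as a pair of lookarounds instead of \\b. This is only a sketch: the test vector x and the variable names are made up for illustration, and it needs perl = TRUE, so it does not have the speed advantage of fixed = TRUE. One difference from \\b worth knowing: \\b treats the underscore as a word character, while this pattern treats it as punctuation.

```r
# Illustrative inputs (not from the original question).
x <- c("caps_lock", "caps2", "capsule", "two caps.", "(caps)")

# Match "caps" only when it is neither preceded nor followed by a letter or digit.
pattern <- "(?<![[:alnum:]])caps(?![[:alnum:]])"

out <- gsub(pattern, "cap", x, perl = TRUE)
print(out)
```

Here "caps2" and "capsule" are left alone because a digit or a letter follows the match, while the other three elements are changed.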
Answer 0 (score: 1)
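Testing with the pieces from the previous question, I think we can wrap each punctuation character in a placeholder, as shown below, without slowing things down too much: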
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
"Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")
line <- rep(line, 1700000/length(line))
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
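To check the placeholder approach against the sentence from the question without the 1.7 million element vector, the same steps can be wrapped in a small function. This is a minimal sketch: the function name replace_words_fixed and its arguments are hypothetical, and it assumes the input text never contains the literal marker <DEL>.

```r
# Hypothetical helper; the name and signature are not from the original answer.
replace_words_fixed <- function(x, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  # Surround every punctuation character with spaces, tagged so this can be undone.
  x <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", x, perl = TRUE)
  # Pad both ends so words at the start or end are space-delimited too.
  x <- paste0(" ", x, " ")
  # Whole-word replacement via fixed (non-regex) matching on " word ".
  for (k in seq_along(patterns)) {
    x <- gsub(paste0(" ", patterns[k], " "),
              paste0(" ", replacements[k], " "),
              x, fixed = TRUE)
  }
  # Strip the padding and the placeholder markers again.
  gsub("^\\s|\\s$| <DEL>|<DEL> ", "", x, perl = TRUE)
}

s <- "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
result <- replace_words_fixed(s, "caps", "cap")
print(result)
```

One caveat with this style of matching: because each match consumes the spaces on both sides, two identical target words separated by a single space (for example "caps caps") would only have the first one replaced in a single pass.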