Replace a set of pattern matches with the corresponding replacement strings in R

Time: 2014-10-31 13:38:01

Tags: r string replace

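The str_replace (and preg_replace) functions in PHP replace all occurrences of a search string with a replacement string. What interests me most is that if the search and replace args are arrays (what we call vectors in R), then str_replace takes a value from each array (vector) and uses them to search and replace on the subject.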

In other words, does R (or some R package) have a function that does the following:

string <- "The quick brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
xxx_replace_xxx(string, patterns, replacements)          ## ???
## [1] "The slow black bear jumped over the lazy dog."

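So I am looking for something like chartr, but for search patterns and replacement strings of arbitrary length. This cannot be done with a single call to gsub(), because the replacement argument can only be a single string; see ?gsub. So my current implementation looks like: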

xxx_replace_xxx <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
}

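However, I am looking for something faster when length(patterns) is large: I need to process a lot of data, and I am not satisfied with the current performance.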
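One property of the loop worth keeping in mind when comparing alternatives: the replacements are applied one after another, so a later pattern can match text that an earlier replacement inserted. A minimal sketch with made-up inputs (the "cat"/"dog"/"bird" values are illustrative only):

```r
# Sequential replacement, same loop as the current implementation
xxx_replace_xxx <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, fixed=TRUE)
   string
}

# Pass 1 turns "cat" into "dog"; pass 2 then turns that new "dog" into "bird"
result <- xxx_replace_xxx("a cat", c("cat", "dog"), c("dog", "bird"))
print(result)
## [1] "a bird"
```

Any faster replacement that works in a single pass would return "a dog" here, so the two approaches agree only when no replacement produces text matched by a later pattern.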

Example toy data for benchmarking:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
   "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
   "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))

3 Answers:

Answer 0 (score: 10)

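Using PCRE instead of fixed matching takes about 1/3 of the time on my machine. Note that with perl=TRUE the patterns are interpreted as regular expressions rather than literal strings.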

xxx_replace_xxx_pcre <- function(string, patterns, replacements) {
   for (i in seq_along(patterns))
      string <- gsub(patterns[i], replacements[i], string, perl=TRUE)
   string
}
system.time(x <- xxx_replace_xxx(string, patterns, replacements))
#    user  system elapsed 
#   0.491   0.000   0.491 
system.time(p <- xxx_replace_xxx_pcre(string, patterns, replacements))
#    user  system elapsed 
#   0.162   0.000   0.162 
identical(x,p)
# [1] TRUE
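The two versions agree here only because none of the patterns contains a regex metacharacter. If literal matching is still wanted with perl=TRUE, one option is to wrap each pattern in PCRE's \Q...\E quoting. A minimal sketch, where quote_literal is a hypothetical helper name and the patterns are assumed not to contain the sequence \E themselves:

```r
# Hypothetical helper: wrap each pattern in \Q...\E so PCRE reads it literally.
# Assumption: no pattern contains the two-character sequence "\E".
quote_literal <- function(patterns) paste0("\\Q", patterns, "\\E")

# Unquoted, "." is a wildcard and matches the "x" in "axb" as well
as_regex <- gsub("a.b", "X", "a.b axb", perl = TRUE)
print(as_regex)
## [1] "X X"

# Quoted, only the literal text "a.b" is replaced
as_literal <- gsub(quote_literal("a.b"), "X", "a.b axb", perl = TRUE)
print(as_literal)
## [1] "X axb"
```

The quoted patterns can be passed to xxx_replace_xxx_pcre in place of the raw ones; whether the speed advantage survives the quoting has not been measured here.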

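Answer 1 (score: 8)

This approach works if the patterns are fixed strings made up of word characters, as in the example. gsubfn is like gsub, except that the replacement argument can be a string, list, function, or proto object. If it is a list, as here, each match of the regular expression is compared with the list's names, and the matches that are found are replaced with the corresponding values: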

library(gsubfn)

gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string)
## [1] "The slow black bear jumped over the lazy dog."
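The same lookup idea can be sketched in base R without gsubfn, using gregexpr and the regmatches replacement function. This is an illustration of the technique, not what gsubfn does internally; replace_words is a hypothetical name, and the sketch assumes the patterns are unique whole words:

```r
# Hypothetical base-R sketch: match every word once, then look it up by name
replace_words <- function(string, patterns, replacements) {
  lookup <- setNames(replacements, patterns)
  m <- gregexpr("\\b\\w+\\b", string, perl = TRUE)
  regmatches(string, m) <- lapply(regmatches(string, m), function(words) {
    hit <- words %in% patterns                 # words that have a replacement
    words[hit] <- unname(lookup[words[hit]])   # swap them, leave the rest as is
    words
  })
  string
}

string <- "The quick brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
out <- replace_words(string, patterns, replacements)
print(out)
## [1] "The slow black bear jumped over the lazy dog."
```

Because every word is visited exactly once, replaced text is never rescanned, unlike in the gsub loop from the question.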

Answer 2 (score: 4)

This can be done with stringi >= 0.3-1 by using one of the stri_replace_all_* functions with the vectorize_all argument set to FALSE:

library("stringi")
string <- "The quicker brown fox jumped over the lazy dog."
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")
stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
## [1] "The slower black bear jumped over the lazy dog."
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
## [1] "The quicker black bear jumped over the lazy dog."
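To see why vectorize_all=FALSE matters, compare it with the default. A small sketch, assuming the stringi package is installed; the shortened input string is made up for illustration:

```r
library(stringi)

s            <- "The quick brown fox"
patterns     <- c("quick", "brown", "fox")
replacements <- c("slow",  "black", "bear")

# Default (vectorize_all=TRUE): the string is recycled against each
# pattern/replacement pair, giving one result per pair
per_pair <- stri_replace_all_fixed(s, patterns, replacements)
print(per_pair)
## [1] "The slow brown fox"   "The quick black fox"  "The quick brown bear"

# vectorize_all=FALSE: every pair is applied in turn to the same string
all_pairs <- stri_replace_all_fixed(s, patterns, replacements, vectorize_all=FALSE)
print(all_pairs)
## [1] "The slow black bear"
```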

Some benchmarks:

string <- readLines("http://www.gutenberg.org/files/31536/31536-0.txt", encoding="UTF-8")
patterns <- c("jak", "to", "do", "z", "na", "i", "w", "za", "tu", "gdy",
   "po", "jest", "Tadeusz", "lub", "razem", "nas", "przy", "oczy", "czy",
   "sam", "u", "tylko", "bez", "ich", "Telimena", "Wojski", "jeszcze")
replacements <- paste0(patterns, rev(patterns))
microbenchmark::microbenchmark(
   stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE),
   stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE),
   xxx_replace_xxx_pcre(string, "\\b" %s+% patterns %s+% "\\b", replacements),
   gsubfn("\\b\\w+\\b", as.list(setNames(replacements, patterns)), string),
   unit="relative",
   times=25
)
## Unit: relative
##                   expr       min        lq      mean    median        uq       max neval
## stri_replace_all_fixed  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000    25 
## stri_replace_all_regex  2.169701  2.248115  2.198638  2.267935  2.267635  1.753289    25  
## xxx_replace_xxx_pcre    1.983135  1.967303  1.937021  1.961449  1.974422  1.469894    25  
## gsubfn                 63.067835 69.870657 69.815031 71.178841 72.503020 57.019072    25  

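So, as far as word-boundary matching is concerned, the PCRE-based version is the fastest. stri_replace_all_fixed is faster still, but it matches fixed substrings and ignores word boundaries.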