如何替换数据框中的字母?

时间:2014-03-02 15:30:45

标签: regex r

我有一个数据框。我想搜索每个字母或符号。

例如,我想要字母“g”和“h”。如果字母在任何地方都是字符串,则分配1 else 0。

code              gh     HI    &*
efhikq&*          0      0     1
efhiku&           0      0     0
DHIfux            0      1     0
DHIJUVXYefhjoq!*  0      1     0
HIfghv(           1      1     0
hiIvg             1      0     0

3 个答案:

答案 0 :(得分:3)

更新,对于需要所有字母的更复杂的版本,但是按任何顺序:

haystack <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- c("gh", "HI", "&*")

needle.split <- setNames(strsplit(needle, ""), needle)

`rownames<-`(
  sapply(
    needle.split, 
    function(x) {
      apply(sapply(x, grepl, haystack, fixed=TRUE), 1, all) + 0
} ), haystack)

这里我们首先将针分解为组件字符,然后对于一组字符中的每个字符,我们对haystack运行grepl(这是最里面的sapply),然后我们确保通过检查矩阵的每一行的所有值,使用sapplyapply生成all生成的每个字符都存在。我们还使用+ 0位(也保留矩阵结构)转换为1/0。最后,最外面的sapply将为每个字符组重复此逻辑,然后缝合结果,我们使用rownames<-技巧添加rownames。

这符合所需的输出:

                 gh HI &*
efhikq&*          0  0  1
efhiku&           0  0  0
DHIfux            0  1  0
DHIJUVXYefhjoq!*  0  1  0
HIfghv(           1  1  0
hiIvg             1  0  0

OLD VERSION匹配所有字符:

这是一个解决方案:

`rownames<-`(sapply(needle, grepl, haystack, fixed=TRUE) + 0, haystack)

产生

                 gh HI &*
efhikq&*          0  0  1
efhiku&           0  0  0
DHIfux            0  1  0
DHIJUVXYefhjoq!*  0  1  0
HIfghv(           1  1  0
hiIvg             0  0  0

这基本上是Ananda建议的实现(我在完成此操作之前没有看到他的评论),添加了以非替换形式使用替换函数rownames<-的“技巧”附加行名称。这是数据:

haystack <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- c("gh", "HI", "&*")

答案 1 :(得分:2)

这是一个非常有效的替代方案。它看起来很混乱,但我想不出这里更好的替代paste0。基本步骤是创建看起来像"^(?=.*g)(?=.*h)"的正则表达式模式,并在perl = TRUE中设置grepl

Specials <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
Patterns <- strsplit(patterns, "", fixed=TRUE)
out <- vapply(vapply(Patterns, function(x) {
  x <- ifelse(x %in% Specials, paste0("\\", x), x)
  paste0("^", paste0("(?=.*", x, ")", collapse=""))
}, character(1L)), grepl, logical(length(code)),
code, perl = TRUE) * 1
dimnames(out) <- list(code, patterns)
out
#                  gh HI &*
# efhikq&*          0  0  1
# efhiku&           0  0  0
# DHIfux            0  1  0
# DHIJUVXYefhjoq!*  0  1  0
# HIfghv(           1  1  0
# hiIvg             1  0  0

基准

与@ BrodieG的答案相比,这个表现相当不错。以下是一些基准测试。

基准测试的功能

funBG <- function() {
   needle.split <- setNames(strsplit(needle, ""), needle)
  `rownames<-`(sapply(needle.split, function(x) {
        apply(sapply(x, grepl, haystack, fixed=TRUE), 1, all) + 0
      }), haystack)
}

funAM <- function() {
    Specials <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
    Patterns <- strsplit(patterns, "", fixed=TRUE)
    out <- vapply(vapply(Patterns, function(x) {
      x <- ifelse(x %in% Specials, paste0("\\", x), x)
      paste0("^", paste0("(?=.*", x, ")", collapse=""))
    }, character(1L)), grepl, logical(length(code)),
    code, perl = TRUE) * 1
    dimnames(out) <- list(code, patterns)
    out
}

微小数据

haystack <- code <- c("efhikq&*", "efhiku&", "DHIfux", 
                      "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- patterns <- c("gh", "HI", "&*")

library(microbenchmark)
microbenchmark(funBG(), funAM())
# Unit: microseconds
#     expr     min      lq  median       uq     max neval
#  funBG() 686.509 717.405 741.209 754.3070 977.042   100
#  funAM() 248.062 257.851 265.587 277.5425 651.062   100

中等数据

set.seed(1)
haystack <- code <- sample(code, 1000, replace = TRUE)
microbenchmark(funBG(), funAM())
# Unit: milliseconds
#     expr       min        lq    median        uq      max neval
#  funBG() 19.859273 20.662812 20.894996 21.856938 36.80115   100
#  funAM()  1.359937  1.403614  1.477143  1.498897  2.78009   100

identical(funBG(), funAM())
# [1] TRUE

更大的数据

haystack <- code <- sample(code, 1e6, replace = TRUE)
system.time(funBG())
#   user  system elapsed 
# 50.372   0.003  53.057 
system.time(funAM())
#  user  system elapsed 
# 1.135   0.000   1.141  

答案 2 :(得分:0)

略有不同的解决方案(基于Ananda Mahto的评论):

code <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
patterns <- c("gh", "HI", "&*")

cols <- sapply(patterns, function(x) as.numeric(grepl(x, code, fixed = TRUE)))

df <- as.data.frame(cbind(code,cols))