我有一个数据框。我想搜索每个字母或符号。
例如,我想要字母“g”和“h”。如果字母在任何地方都是字符串,则分配1 else 0。
code gh HI &*
efhikq&* 0 0 1
efhiku& 0 0 0
DHIfux 0 1 0
DHIJUVXYefhjoq!* 0 1 0
HIfghv( 1 1 0
hiIvg 1 0 0
答案 0 :(得分:3)
更新,对于需要所有字母的更复杂的版本,但是按任何顺序:
haystack <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- c("gh", "HI", "&*")
needle.split <- setNames(strsplit(needle, ""), needle)
`rownames<-`(
sapply(
needle.split,
function(x) {
apply(sapply(x, grepl, haystack, fixed=TRUE), 1, all) + 0
} ), haystack)
这里我们首先将针分解为组件字符,然后对于一组字符中的每个字符,我们对haystack运行grepl
(这是最里面的sapply
),然后我们确保通过检查矩阵的每一行的所有值,使用sapply
和apply
生成all
生成的每个字符都存在。我们还使用+ 0
位(也保留矩阵结构)转换为1/0。最后,最外面的sapply
将为每个字符组重复此逻辑,然后缝合结果,我们使用rownames<-
技巧添加rownames。
这符合所需的输出:
gh HI &*
efhikq&* 0 0 1
efhiku& 0 0 0
DHIfux 0 1 0
DHIJUVXYefhjoq!* 0 1 0
HIfghv( 1 1 0
hiIvg 1 0 0
OLD VERSION匹配所有字符:
这是一个解决方案:
`rownames<-`(sapply(needle, grepl, haystack, fixed=TRUE) + 0, haystack)
产生
gh HI &*
efhikq&* 0 0 1
efhiku& 0 0 0
DHIfux 0 1 0
DHIJUVXYefhjoq!* 0 1 0
HIfghv( 1 1 0
hiIvg 0 0 0
这基本上是Ananda建议的实现(我在完成此操作之前没有看到他的评论),添加了以非替换形式使用替换函数rownames<-
的“技巧”附加行名称。这是数据:
haystack <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- c("gh", "HI", "&*")
答案 1 :(得分:2)
这是一个非常有效的替代方案。它看起来很混乱,但我想不出这里更好的替代paste0
。基本步骤是创建看起来像"^(?=.*g)(?=.*h)"
的正则表达式模式,并在perl = TRUE
中设置grepl
。
Specials <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
Patterns <- strsplit(patterns, "", fixed=TRUE)
out <- vapply(vapply(Patterns, function(x) {
x <- ifelse(x %in% Specials, paste0("\\", x), x)
paste0("^", paste0("(?=.*", x, ")", collapse=""))
}, character(1L)), grepl, logical(length(code)),
code, perl = TRUE) * 1
dimnames(out) <- list(code, patterns)
out
# gh HI &*
# efhikq&* 0 0 1
# efhiku& 0 0 0
# DHIfux 0 1 0
# DHIJUVXYefhjoq!* 0 1 0
# HIfghv( 1 1 0
# hiIvg 1 0 0
与@ BrodieG的答案相比,这个表现相当不错。以下是一些基准测试。
基准测试的功能
funBG <- function() {
needle.split <- setNames(strsplit(needle, ""), needle)
`rownames<-`(sapply(needle.split, function(x) {
apply(sapply(x, grepl, haystack, fixed=TRUE), 1, all) + 0
}), haystack)
}
funAM <- function() {
Specials <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
Patterns <- strsplit(patterns, "", fixed=TRUE)
out <- vapply(vapply(Patterns, function(x) {
x <- ifelse(x %in% Specials, paste0("\\", x), x)
paste0("^", paste0("(?=.*", x, ")", collapse=""))
}, character(1L)), grepl, logical(length(code)),
code, perl = TRUE) * 1
dimnames(out) <- list(code, patterns)
out
}
微小数据
haystack <- code <- c("efhikq&*", "efhiku&", "DHIfux",
"DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
needle <- patterns <- c("gh", "HI", "&*")
library(microbenchmark)
microbenchmark(funBG(), funAM())
# Unit: microseconds
# expr min lq median uq max neval
# funBG() 686.509 717.405 741.209 754.3070 977.042 100
# funAM() 248.062 257.851 265.587 277.5425 651.062 100
中等数据
set.seed(1)
haystack <- code <- sample(code, 1000, replace = TRUE)
microbenchmark(funBG(), funAM())
# Unit: milliseconds
# expr min lq median uq max neval
# funBG() 19.859273 20.662812 20.894996 21.856938 36.80115 100
# funAM() 1.359937 1.403614 1.477143 1.498897 2.78009 100
identical(funBG(), funAM())
# [1] TRUE
更大的数据
haystack <- code <- sample(code, 1e6, replace = TRUE)
system.time(funBG())
# user system elapsed
# 50.372 0.003 53.057
system.time(funAM())
# user system elapsed
# 1.135 0.000 1.141
答案 2 :(得分:0)
略有不同的解决方案(基于Ananda Mahto的评论):
code <- c("efhikq&*", "efhiku&", "DHIfux", "DHIJUVXYefhjoq!*", "HIfghv(", "hiIvg")
patterns <- c("gh", "HI", "&*")
cols <- sapply(patterns, function(x) as.numeric(grepl(x, code, fixed = TRUE)))
df <- as.data.frame(cbind(code,cols))