假设我们在几列中有一个包含字符串的表'数据'。我们希望找到包含某个值的所有行的索引,或者更好的是,找到几个值中的一个。但是,该专栏未知。
目前,我所做的是:
apply(df, 2, function(x) which(x == "M017"))
其中df =
1 04.10.2009 01:24:51 M017 <NA> <NA> NA
2 04.10.2009 01:24:53 M018 <NA> <NA> NA
3 04.10.2009 01:24:54 M051 <NA> <NA> NA
4 04.10.2009 01:25:06 <NA> M016 <NA> NA
5 04.10.2009 01:25:07 <NA> M015 <NA> NA
6 04.10.2009 01:26:07 <NA> M017 <NA> NA
7 04.10.2009 01:26:27 <NA> M017 <NA> NA
8 04.10.2009 01:27:23 <NA> M017 <NA> NA
9 04.10.2009 01:27:30 <NA> M017 <NA> NA
10 04.10.2009 01:27:32 M017 <NA> <NA> NA
11 04.10.2009 01:27:34 M051 <NA> <NA> NA
如果我们试图找到多个值,这也有效:
apply(df, 2, function(x) which(x %in% c("M017", "M018")))
结果是:
$`1`
integer(0)
$`2`
[1] 1 2 20
$`3`
[1] 16 17 18 19
$`4`
integer(0)
$`5`
integer(0)
但是,处理结果列表列表相当繁琐。
是否有更有效的方法可以在任何列中查找包含值(或更多)的行?
答案 0 :(得分:23)
怎么样
apply(df, 1, function(r) any(r %in% c("M017", "M018")))
如果第i行包含其中一个值,则第i个元素将为TRUE
,否则为FALSE
。或者,如果您只想要行号,请将上述语句括在which(...)
。
答案 1 :(得分:4)
这是一个dplyr
选项:
library(dplyr)
# across all columns:
df %>% filter_all(any_vars(. %in% c('M017', 'M018')))
# or in only select columns:
df %>% filter_at(vars(col1, col2), any_vars(. %in% c('M017', 'M018')))
答案 2 :(得分:3)
如果要查找包含向量中任何值的rows
,可以选择循环向量(lapply(v1,..)
),创建逻辑索引(TRUE / FALSE), (==
)。使用Reduce
和OR(|
)通过检查相应的元素将列表缩减为单个逻辑矩阵。对行(rowSums
)求和,加倍否定(!!
)以得到任何匹配的行。
indx1 <- !!rowSums(Reduce(`|`, lapply(v1, `==`, df)), na.rm=TRUE)
或使用which
arr.ind=TRUE
的行索引
indx2 <- unique(which(Vectorize(function(x) x %in% v1)(df),
arr.ind=TRUE)[,1])
我没有使用@ kristang的解决方案,因为它给了我错误。基于1000x500
矩阵,@ konvas的解决方案是最有效的(到目前为止)。但是,如果行数增加,这可能会有所不同
val <- paste0('M0', 1:1000)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(val, NA), 1000*500,
replace=TRUE), ncol=500), stringsAsFactors=FALSE)
set.seed(356)
v1 <- sample(val, 200, replace=FALSE)
konvas <- function() {apply(df1, 1, function(r) any(r %in% v1))}
akrun1 <- function() {!!rowSums(Reduce(`|`, lapply(v1, `==`, df1)),
na.rm=TRUE)}
akrun2 <- function() {unique(which(Vectorize(function(x) x %in%
v1)(df1),arr.ind=TRUE)[,1])}
library(microbenchmark)
microbenchmark(konvas(), akrun1(), akrun2(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval
# konvas() 1.00000 1.000000 1.000000 1.000000 1.000000 1.00000 20
# akrun1() 160.08749 147.642721 125.085200 134.491722 151.454441 52.22737 20
# akrun2() 5.85611 5.641451 4.676836 5.330067 5.269937 2.22255 20
# cld
# a
# b
# a
对于ncol = 10
,结果略有不同:
expr min lq mean median uq max neval
konvas() 3.116722 3.081584 2.90660 2.983618 2.998343 2.394908 20
akrun1() 27.587827 26.554422 22.91664 23.628950 21.892466 18.305376 20
akrun2() 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000 20
v1 <- c('M017', 'M018')
df <- structure(list(datetime = c("04.10.2009 01:24:51",
"04.10.2009 01:24:53",
"04.10.2009 01:24:54", "04.10.2009 01:25:06", "04.10.2009 01:25:07",
"04.10.2009 01:26:07", "04.10.2009 01:26:27", "04.10.2009 01:27:23",
"04.10.2009 01:27:30", "04.10.2009 01:27:32", "04.10.2009 01:27:34"
), col1 = c("M017", "M018", "M051", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "M017", "M051"), col2 = c("<NA>", "<NA>", "<NA>",
"M016", "M015", "M017", "M017", "M017", "M017", "<NA>", "<NA>"
), col3 = c("<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>", "<NA>",
"<NA>", "<NA>", "<NA>", "<NA>"), col4 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), .Names = c("datetime", "col1", "col2",
"col3", "col4"), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11"))