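The columns of my data frame are ordered by relevance, with the leftmost column being the most relevant. I am trying to extract the most relevant item that starts with "D".
Here is an example: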
df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L,
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269",
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L,
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431",
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L,
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333",
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L,
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703",
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L,
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"),
TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1",
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA,
10L), class = "data.frame")
> df
   TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1    K704    R18   F102   E834
2    D127   D123   Z720
3    K701   K703   K658   K766   F102   E877
4    K922   D649   K296   K703   F101   Z720
5   E1133   H431   H333
6    D123   D123
7    H269   H570
8    H269   H570
9   R0989
10   E611
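The result vector should report NA when there is no match and the first (leftmost) item when there is one. I can find the items I am interested in, but I am stuck on extracting the first (leftmost) one for each row.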
> sapply(df, function (x) grepl("D", x))
TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE TRUE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] TRUE TRUE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE
The result should be:
c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)
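Edit: How would this extend if I wanted the pattern to be c("D", "K")? I get an error saying that only the first element is used. (Answer: change the pattern to "D|K".)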
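For the multi-pattern case, here is a minimal sketch of building the alternation from a vector instead of typing it by hand. The names prefixes, pattern and codes are illustrative and not from the original post, and "ZD00" is a made-up code. Anchoring with "^" is an assumption based on the stated goal of matching codes that start with the letter; drop it to match the letter anywhere.

```r
# Illustrative values only; "ZD00" is a made-up code used to show the anchor.
prefixes <- c("D", "K")

# Collapse the prefixes into one regular expression, anchored at the start.
pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")

codes <- c("K704", "D127", "E611", "ZD00", "")

# TRUE only for codes that begin with D or K.
matched <- grepl(pattern, codes)
```

The resulting pattern can be passed wherever a single pattern string is expected, for example as the pattern argument of grepl().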
Edit 2: Also, what if I want to find the leftmost "D" code in each row but exclude codes from a pre-specified list (e.g., exclude c("D123", "D090", "D111"))?
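One possible way to handle the exclusion, sketched on a small made-up data frame: filter each row by the pattern and by membership in the exclusion list, then take the first survivor. The helper first_allowed and the objects toy and excluded are hypothetical names for illustration, and the row-wise apply() is chosen for readability rather than speed.

```r
# Made-up example data; all names here are illustrative.
toy <- data.frame(
  A = c("K704", "D123", "D123"),
  B = c("R18",  "D649", ""),
  stringsAsFactors = FALSE
)
excluded <- c("D123", "D090", "D111")

# Return the leftmost value in 'row' that matches 'pattern' and is not excluded.
first_allowed <- function(row, pattern, excluded) {
  candidates <- row[grepl(pattern, row) & !(row %in% excluded)]
  if (length(candidates) > 0) unname(candidates[1]) else NA_character_
}

# Apply the helper to every row; unname() drops any row names apply() may add.
result <- unname(apply(as.matrix(toy), 1, first_allowed,
                       pattern = "^D", excluded = excluded))
```

In this toy data only the second row has an allowed "D" code (D649); the other rows yield NA.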
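Edit 3: I wrote a small function that incorporates all the answers. It works for what I am doing. Perhaps it will benefit someone else at some stage.
The function: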
FLAG <- function(data, tomatch, Exact.tomatch=T, Exclude=NA, Exact.excl=T, Return=c("01", "FirstValue", "Count")){
if(Exact.tomatch == T){tomatch <-paste("^",tomatch,"$", sep="")}
if(length(tomatch) > 1){tomatch <- paste(tomatch, collapse="|")}
if(Exact.excl==F){Exclude <- paste(Exclude, collapse="|")}
out <- NA
if(is.na(Exclude[1])==T){hits <- vapply(data, grepl, logical(nrow(data)), pattern = tomatch)}
if(is.na(Exclude[1])!=T & Exact.excl==T){hits <- vapply(lapply(data, function(x) replace(x,x %in% Exclude, NA)), grepl, logical(nrow(data)), pattern = tomatch)}
if(is.na(Exclude[1])!=T & Exact.excl==F){hits <- vapply(replace(data, vapply(data, grepl, logical(nrow(data)), pattern = Exclude)==T, NA), grepl, logical(nrow(data)), pattern = tomatch)}
if(Return[1] == "01"){out <- replace(rowSums(hits), rowSums(hits) >1, 1)}
if(Return[1] == "Count"){out <- rowSums(hits)}
if(Return[1] == "FirstValue"){out <- data[cbind(seq_len(nrow(data)),replace(max.col(hits,"first"), rowSums(hits)==0, NA))]}
out
}
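The function takes a data frame or list as input, followed by vectors of what to look for and what to exclude, and whether each of these should be matched exactly. Finally, it can return the first (leftmost) match, a count of all matches, or just a flag indicating whether any match was found.
Example 1. Find any code in df that starts with D or K (not restricted to exact matches), but exclude K701, K703 and D127 (matching these exactly), and return the first (leftmost) value: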
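The flag and count return modes both fall out of a single logical hit matrix. A minimal sketch on a made-up matrix follows; the names hits, count and flag are illustrative, and as.integer(count > 0) is just one equivalent way to cap the count at 1.

```r
# Made-up hit matrix: one row per record, one column per code column.
hits <- matrix(c(TRUE,  TRUE,  FALSE,
                 FALSE, FALSE, FALSE),
               nrow = 2, byrow = TRUE)

# "Count": number of matching cells in each row.
count <- rowSums(hits)

# "01": 1 if the row has at least one match, otherwise 0.
flag <- as.integer(count > 0)
```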
FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"), Exact.excl=T, Return="FirstValue")
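Example 2. Find any code in df that starts with D or H (not restricted to exact matches), but exclude any code containing H3 (no exact match required), and return the first (leftmost) value: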
FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"), Exact.excl=F, Return="FirstValue")
Answer 0 (score: 4)
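There is no need to loop over every row. Running grepl ncol(df) times with vapply will probably be much faster than running it nrow(df) times with sapply (or an even slower cousin). E.g.: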
hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]
#[1] NA "D127" NA "D649" NA "D123" NA NA NA NA
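To see why the indexing step works, here is a small sketch on made-up matrices (h stands in for hits and vals for the data; both names are illustrative). max.col(..., "first") gives the column of the first TRUE in each row, but it also returns 1 for a row with no TRUE at all, which is why such rows are reset to NA. Indexing with a two-column (row, column) matrix then picks one cell per row, and an NA in the index yields NA in the result.

```r
# Made-up logical matrix and a matching matrix of values.
h <- matrix(c(FALSE, TRUE,  TRUE,
              FALSE, FALSE, FALSE,
              TRUE,  FALSE, TRUE),
            nrow = 3, byrow = TRUE)
vals <- matrix(c("a1", "a2", "a3",
                 "b1", "b2", "b3",
                 "c1", "c2", "c3"),
               nrow = 3, byrow = TRUE)

# Column index of the first TRUE per row (1 for the all-FALSE row).
first_true <- max.col(h, ties.method = "first")

# Rows without any TRUE must not pick column 1, so mark them as NA.
first_true[rowSums(h) == 0] <- NA

# One (row, column) pair per row selects a single cell from each row.
picked <- vals[cbind(seq_len(nrow(vals)), first_true)]
```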
Benchmarking on a million-row data.frame:
df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
seq_len(nrow(df)),
replace(max.col(hits,"first"), rowSums(hits)==0, NA)
)]})
# user system elapsed
# 1.904 0.120 2.024
system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
# user system elapsed
# 23.141 0.172 23.317
Answer 1 (score: 3)
Since you are operating by row, you need apply(), not sapply(). Also, using the value = TRUE argument inside grep() will return the actual string you are looking for:
> apply(df, 1, function(x) grep("D", x, value=T)[1])
1 2 3 4 5 6 7 8 9 10
NA "D127" NA "D649" NA "D123" NA NA NA NA
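Note that the pattern "D" matches a D anywhere in the string. If only codes that start with D should count, the pattern can be anchored with "^". A small sketch on a made-up matrix, where "XD01" is a hypothetical code included only to show the difference:

```r
# Made-up codes; "XD01" is hypothetical and contains a D that is not leading.
m <- matrix(c("K704", "XD01", "D649",
              "E611", "",     ""),
            nrow = 2, byrow = TRUE)

# Unanchored: the first value containing a D anywhere.
loose <- apply(m, 1, function(x) grep("D", x, value = TRUE)[1])

# Anchored: the first value that begins with D.
strict <- apply(m, 1, function(x) grep("^D", x, value = TRUE)[1])
```

For the example data in the question both patterns give the same result, since every code containing a D there also starts with it.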