R: extract, row by row, the first item that meets a criterion

Time: 2015-08-18 23:54:55

Tags: r apply data-manipulation sapply

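The columns in my data frame are ordered by relevance, with the leftmost column being the most relevant. I am trying to extract, for each row, the most relevant item that starts with "D".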

Here is an example:

df <- structure(list(TDIAG1 = structure(c(7L, 2L, 6L, 8L, 4L, 1L, 5L, 
5L, 9L, 3L), .Label = c("D123", "D127", "E611", "E1133", "H269", 
"K701", "K704", "K922", "R0989"), class = "factor"), TDIAG2 = structure(c(7L, 
6L, 5L, 2L, 3L, 6L, 4L, 4L, 1L, 1L), .Label = c("", "D649", "H431", 
"H570", "K703", "D123", "R18"), class = "factor"), TDIAG3 = structure(c(2L, 
6L, 5L, 4L, 3L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F102", "H333", 
"K296", "K658", "Z720"), class = "factor"), TDIAG4 = structure(c(2L, 
1L, 4L, 3L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "E834", "K703", 
"K766"), class = "factor"), TDIAG5 = structure(c(1L, 1L, 3L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "F101", "F102"), class = "factor"), 
    TDIAG6 = structure(c(1L, 1L, 2L, 3L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("", "E877", "Z720"), class = "factor")), .Names = c("TDIAG1", 
"TDIAG2", "TDIAG3", "TDIAG4", "TDIAG5", "TDIAG6"), row.names = c(NA, 
10L), class = "data.frame")


    > df
   TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
1    K704    R18   F102   E834              
2    D127   D123   Z720                     
3    K701   K703   K658   K766   F102   E877
4    K922   D649   K296   K703   F101   Z720
5   E1133   H431   H333                     
6    D123   D123                            
7    H269   H570                            
8    H269   H570                            
9   R0989                                   
10   E611 

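The result vector should report NA where there is no match, and the first (leftmost) item where there is one. I can find the items I am interested in, but I am stuck on extracting the first (leftmost) one for each row.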

> sapply(df,  function (x) grepl("D", x))
      TDIAG1 TDIAG2 TDIAG3 TDIAG4 TDIAG5 TDIAG6
 [1,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [2,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [3,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [4,]  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
 [5,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [6,]   TRUE   TRUE  FALSE  FALSE  FALSE  FALSE
 [7,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [8,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
 [9,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE
[10,]  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE

The result should be:

c(NA,"D127", NA, "D649", NA, "D123", NA, NA, NA, NA)

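Edit: How would this extend if I wanted the pattern to be c("D", "K")? I get an error saying that only the first element is used. (Answer: change the pattern to "D|K".)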
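As a minimal sketch of that answer: grepl takes a single regular expression, so several prefixes can be joined with "|". The vector codes and the names prefixes and pattern below are made up for illustration ("XD99" is not a real code from df); the anchored variant is an optional refinement for the case where the letter must appear at the start of the code.

```r
# Toy vector of codes; "XD99" is a made-up value to show the effect of anchoring
codes <- c("K704", "D127", "E611", "R18", "XD99")

# One regular expression with alternation replaces the vector c("D", "K")
unanchored <- grepl("D|K", codes)
unanchored
# [1]  TRUE  TRUE FALSE FALSE  TRUE

# Build the pattern from a vector and anchor it to the start of the string
prefixes <- c("D", "K")
pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")")
pattern
# [1] "^(D|K)"

anchored <- grepl(pattern, codes)
anchored
# [1]  TRUE  TRUE FALSE FALSE FALSE
```

Without the "^" anchor, "D|K" matches a D or K anywhere in the code, which may or may not be what is wanted.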

Edit 2: Also, what if I want to find the leftmost "D" code in each row but exclude codes from a pre-specified list (e.g. exclude c("D123", "D090", "D111"))?
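One possible way to handle the exclusion, sketched on a small stand-in data frame (toy, exclude, hits, first_col and result are illustrative names, and the values are copied from rows 2, 4 and 6 of df): a cell counts as a hit only if it matches the pattern and is not on the exclusion list.

```r
# Small stand-in for df, using character columns instead of factors
toy <- data.frame(
  TDIAG1 = c("D127", "K922", "D123"),
  TDIAG2 = c("D123", "D649", "D123"),
  stringsAsFactors = FALSE
)
exclude <- c("D123", "D090", "D111")

# TRUE where a code contains "D" and is not on the exclusion list
hits <- vapply(toy, function(x) grepl("D", x) & !(x %in% exclude),
               logical(nrow(toy)))

# Column of the leftmost TRUE per row; rows without any hit get NA
first_col <- replace(max.col(hits, "first"), rowSums(hits) == 0, NA)

# Matrix indexing with (row, column) pairs picks one cell per row
result <- as.matrix(toy)[cbind(seq_len(nrow(toy)), first_col)]
result
# [1] "D127" "D649" NA
```

The third row contains only the excluded code "D123", so it yields NA.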

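Edit 3: I wrote a small function that incorporates all of the answers. It works for what I am doing. Perhaps it may benefit someone else at some stage.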

Function:

FLAG <- function(data, tomatch, Exact.tomatch=T, Exclude=NA,  Exact.excl=T, Return=c("01", "FirstValue", "Count")){ 
  if(Exact.tomatch == T){tomatch <-paste("^",tomatch,"$", sep="")}
  if(length(tomatch) > 1){tomatch <- paste(tomatch, collapse="|")}
  if(Exact.excl==F){Exclude <- paste(Exclude, collapse="|")}

  out <- NA
  if(is.na(Exclude[1])==T){hits <- vapply(data, grepl, logical(nrow(data)), pattern = tomatch)}
  if(is.na(Exclude[1])!=T & Exact.excl==T){hits <- vapply(lapply(data, function(x) replace(x,x %in% Exclude, NA)), grepl, logical(nrow(data)), pattern = tomatch)}
  if(is.na(Exclude[1])!=T & Exact.excl==F){hits <- vapply(replace(data, vapply(data, grepl, logical(nrow(data)), pattern = Exclude)==T, NA), grepl, logical(nrow(data)), pattern = tomatch)}
  if(Return[1] == "01"){out <- replace(rowSums(hits), rowSums(hits) >1, 1)}
  if(Return[1] == "Count"){out <- rowSums(hits)}
  if(Return[1] == "FirstValue"){out <- data[cbind(seq_len(nrow(data)),replace(max.col(hits,"first"), rowSums(hits)==0, NA))]}
  out
}

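The function takes a data frame or list as input, followed by vectors of what to look for and what to exclude, and flags for whether each of these should be matched exactly. Finally, it can return the first (leftmost) match, a count of all matches, or just a 0/1 flag indicating whether any match was found.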

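Example 1. Find any code in df that starts with D or K (not restricted to exact matches), but exclude K701, K703 and D127 (matching these exactly), and return the first (leftmost) value: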

FLAG(data=df, tomatch=c("D", "K"), Exact.tomatch=F, Exclude=c("K701", "K703","D127"),  Exact.excl=T, Return="FirstValue")

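Example 2. Find any code in df that starts with D or H (not restricted to exact matches), but exclude any code containing H3 (not an exact match), and return the first (leftmost) value: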

FLAG(data=df, tomatch=c("D", "H"), Exact.tomatch=F, Exclude=c("H3"),  Exact.excl=F, Return="FirstValue")

2 Answers:

Answer 0: (score: 4)

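There is no need to loop over every row. Running grepl nrow(df) times will probably be much slower than running it ncol(df) times through vapply (or its even slower cousin, sapply). E.g.: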

hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]
#[1] NA     "D127" NA     "D649" NA     "D123" NA     NA     NA     NA 
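The replace() step in the code above is easy to overlook, so here is a small sketch of why it is needed, using a hypothetical 3 x 3 logical matrix in place of hits. max.col always returns some column, even for a row with no TRUE at all, so such rows have to be set to NA explicitly.

```r
# Hypothetical logical matrix standing in for `hits`
hits <- rbind(c(FALSE, TRUE,  TRUE),
              c(FALSE, FALSE, FALSE),
              c(TRUE,  FALSE, TRUE))

# Column of the leftmost TRUE in each row; ties are broken towards the first column
first_col <- max.col(hits, ties.method = "first")
first_col
# [1] 2 1 1

# Row 2 has no TRUE, yet max.col still reports column 1,
# so rows without any hit are overwritten with NA
guarded <- replace(first_col, rowSums(hits) == 0, NA)
guarded
# [1]  2 NA  1
```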

Benchmarking on a million-row data.frame:

df <- df[sample(rownames(df),1e6,replace=TRUE),]
system.time({hits <- vapply(df, grepl, logical(nrow(df)), pattern = "D")
df[cbind(
  seq_len(nrow(df)),
  replace(max.col(hits,"first"), rowSums(hits)==0, NA)
  )]})
#   user  system elapsed 
#  1.904   0.120   2.024 

system.time(apply(df, 1, function(x) grep("D", x, value=T)[1]))
#   user  system elapsed 
# 23.141   0.172  23.317

Answer 1: (score: 3)

Since you are operating by row, you need apply, not sapply.

Also, using the value = TRUE argument in grep returns the actual strings you are looking for:

> apply(df, 1, function(x) grep("D", x, value=T)[1])
     1      2      3      4      5      6      7      8      9     10 
    NA "D127"     NA "D649"     NA "D123"     NA     NA     NA     NA
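To make the role of value = TRUE and the trailing [1] more concrete, here is a short sketch on two hand-written rows (row_a and row_b are illustrative names; the values are taken from rows 4 and 3 of df). Subsetting an empty character vector with [1] gives NA, which is what produces the NA entries for rows without a match.

```r
# Two rows written out as character vectors
row_a <- c("K922", "D649", "K296")
row_b <- c("K701", "K703", "K658")

# Default: positions of the matching elements
idx <- grep("D", row_a)
idx
# [1] 2

# value = TRUE: the matching strings themselves
val <- grep("D", row_a, value = TRUE)
val
# [1] "D649"

# No match gives character(0); taking [1] of that yields NA
none <- grep("D", row_b, value = TRUE)[1]
none
# [1] NA
```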