Get data frame rows that match a condition when row lengths are uneven

Date: 2018-04-18 13:42:49

Tags: r dataframe subset

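Suppose a data frame has rows of uneven length and an unknown number of columns; that is, each row may have a different length, but all NA values are always at the end. There are also three values: start, penultimate, and last.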

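Question: how can I (elegantly, without nested loops) find all rows in the data frame that match this condition, i.e. rows whose first value is start and whose last two non-NA values are penultimate and last?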

Example: for the following data frame and values:

df <- structure(list(V1 = c("a", "a", "a", "a", "b"), V2 = c("b", "n", "t", "o", "l"), V3 = c("c", "m", "h", "j", "p"), V4 = c("d", "c", "j", "", "e"), V5 = c("", "d", "", "", "")), 
.Names = c("V1", "V2", "V3", "V4", "V5"), 
row.names = c(NA, 5L), class = "data.frame")
df[df == ""] <- NA

start <- "a"
penultimate <- "c"
last <- "d"

The desired output would be the following subset:

  V1 V2 V3 V4   V5
1  a  b  c  d <NA>
2  a  n  m  c    d

3 answers:

Answer 0 (score: 2)

I managed to solve this using apply with MARGIN = 1, although I have doubts about its efficiency.

df[apply(df, 1, function(x) {
    temp = x[!is.na(x)]
    temp[1] == start & tail(temp, 1) == last & tail(temp, 2)[1] == penultimate
}), ]

#  V1 V2 V3 V4   V5
#1  a  b  c  d <NA>
#2  a  n  m  c    d

For each row, we first drop all NA elements, then check the three conditions (start, last, penultimate) and use the resulting logical vector to subset the rows.
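If efficiency is the concern, the same check can be done without a row-wise function at all. The following is only a sketch, not part of the original answer: it relies on the question's guarantee that NAs are always trailing, so the count of non-NA values in a row is also the column position of its last value. The names m, n, rows, last_val, pen_val and keep are made up for illustration.

```r
# Example data from the question (empty strings already replaced by NA)
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

m <- as.matrix(df)
# Assumption: NAs only appear at the end of a row, so the number of
# non-NA values equals the column index of the last value.
n <- rowSums(!is.na(m))
rows <- seq_len(nrow(m))

# Matrix indexing with (row, column) pairs; column positions are clamped
# to 1 so that rows with fewer than two values do not index column 0.
last_val <- m[cbind(rows, pmax(n, 1))]
pen_val  <- m[cbind(rows, pmax(n - 1, 1))]

keep <- n >= 2 & m[, 1] == start & pen_val == penultimate & last_val == last
result <- df[which(keep), ]  # which() also drops any NA comparisons
result                       # rows 1 and 2 of df
```

Rows with fewer than two non-NA values are excluded here by the n >= 2 guard; whether that is the desired behavior for such rows is an assumption.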

Answer 1 (score: 1)

Here is one way using base R:

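output <- apply(df, 1, function(row) {
    # Position of the last non-NA value in this row
    index_last <- max(which(!is.na(row)))
    # && is the scalar operator intended for use inside if()
    if (row[1] == start && row[index_last - 1] == penultimate && row[index_last] == last) {
        return(row)
    }
    return(NULL)
})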

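This yields a list containing the matching rows (and NULL for the others), which we can rbind back together. Note that the result is a character matrix; wrap it in as.data.frame() if a data frame is needed: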

> do.call(rbind, output)
  V1  V2  V3  V4  V5 
1 "a" "b" "c" "d" NA 
2 "a" "n" "m" "c" "d"
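A variation on this idea, offered only as a sketch: return a single TRUE/FALSE per row instead of the row itself, and use that vector to index the original data frame. The result then stays a data frame, and nothing depends on how apply simplifies its output. The helper name row_matches is hypothetical, and the length guard for rows with fewer than two values is an addition that the answer above does not have.

```r
# Example data from the question (empty strings already replaced by NA)
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

# Hypothetical helper: TRUE if the non-NA part of a row starts with
# `start` and ends with `penultimate`, `last`.
row_matches <- function(row) {
  vals <- row[!is.na(row)]
  n <- length(vals)
  # The n >= 2 guard short-circuits before any out-of-range indexing
  n >= 2 && vals[1] == start && vals[n - 1] == penultimate && vals[n] == last
}

# vapply checks that every call returns exactly one logical value
keep <- vapply(
  seq_len(nrow(df)),
  function(i) row_matches(unlist(df[i, ], use.names = FALSE)),
  logical(1)
)

result <- df[keep, ]
result  # rows 1 and 2, still a data.frame
```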

Answer 2 (score: 1)

You can use a regular expression here:

pattern <- paste0("^", start, ".*", penultimate, last, "$")
# "^a.*cd$"
index <- grepl(pattern, apply(df, 1, function(i) paste(i[!is.na(i)], collapse="")))
# [1]  TRUE  TRUE FALSE FALSE FALSE
df[index,]
#   V1 V2 V3 V4   V5
# 1  a  b  c  d <NA>
# 2  a  n  m  c    d
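One caveat: because the values are pasted together without a delimiter, the pattern above is only reliable when every value is a single character and contains no regex metacharacters. If that cannot be guaranteed, a possible workaround (a sketch that swaps the regex for fixed-string matching with startsWith and endsWith) is to join the values with a separator. The separator choice and the names sep and keys are assumptions for illustration; the separator must not occur in the data.

```r
# Example data from the question (empty strings already replaced by NA)
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

# Assumption: this control character never appears in the data
sep <- "\x1f"

# Build one key per row, e.g. <sep>a<sep>b<sep>c<sep>d<sep>, so that
# every value is delimited on both sides
keys <- apply(as.matrix(df), 1, function(x) {
  paste0(sep, paste(x[!is.na(x)], collapse = sep), sep)
})

# Fixed-string prefix and suffix checks, no regex escaping needed
index <- startsWith(keys, paste0(sep, start, sep)) &
  endsWith(keys, paste0(sep, penultimate, sep, last, sep))

df[index, ]  # rows 1 and 2
```

With delimiters on both sides of each value, a multi-character value such as "ab" no longer matches a start of "a".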