Question

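Given a data frame with uneven row lengths and an unknown number of columns, that is, each row may have a different length, but all NA values are always at the end. There are also three values: start, penultimate and last.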

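Question: how can I (elegantly, without nested loops) find all rows of the data frame that match this condition, meaning rows whose first value is start and whose last two non-NA values are penultimate and last?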

Example: for the following data frame and values:

df <- structure(list(V1 = c("a", "a", "a", "a", "b"), V2 = c("b", "n", "t", "o", "l"), V3 = c("c", "m", "h", "j", "p"), V4 = c("d", "c", "j", "", "e"), V5 = c("", "d", "", "", "")), 
.Names = c("V1", "V2", "V3", "V4", "V5"), 
row.names = c(NA, 5L), class = "data.frame")
df[df == ""] <- NA

start <- "a"
penultimate <- "c"
last <- "d"

The desired output would be the following subset:

  V1 V2 V3 V4   V5
1  a  b  c  d <NA>
2  a  n  m  c    d

Answer 1

I managed to solve this with apply and MARGIN = 1, but I have doubts about its efficiency.

df[apply(df, 1, function(x) {
    temp <- x[!is.na(x)]  # drop the trailing NA values
    # compare the first, last and second-to-last remaining values
    temp[1] == start & tail(temp, 1) == last & tail(temp, 2)[1] == penultimate
}), ]

#  V1 V2 V3 V4   V5
#1  a  b  c  d <NA>
#2  a  n  m  c    d

For each row, we first drop all NA elements, then check the conditions (start, last and penultimate), and use the resulting logical vector to subset the rows.
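If the row-wise apply() call turns out to be a bottleneck, the same check can probably be expressed with matrix indexing instead. The following is only a minimal sketch under the question's assumption that all NA values sit at the end of each row; the names m, n, rows, ok, keep and result are introduced here for illustration, and the data frame is rebuilt so the snippet runs on its own.

```r
# Same example data as in the question, with NA already in place
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

m <- as.matrix(df)
n <- rowSums(!is.na(m))   # number of non-NA values per row
rows <- seq_len(nrow(m))
ok <- n >= 2              # a row needs at least two values to be testable

# Assumption: NAs are trailing, so column n[i] holds the last value of row i
keep <- rep(FALSE, nrow(m))
keep[ok] <- m[rows[ok], 1] == start &
  m[cbind(rows[ok], n[ok])] == last &
  m[cbind(rows[ok], n[ok] - 1)] == penultimate

result <- df[keep, ]
print(result)
```

Indexing a matrix with a two-column matrix of (row, column) pairs picks one element per row, which is what replaces the per-row function call here.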

Answer 2

Here is one way using base R:

output <- apply(df, 1, function(row) {
    index_last <- max(which(!is.na(row)))
    if (row[1] == start & row[index_last - 1] == penultimate & row[index_last] == last) {
        return(row)
    }
    return(NULL)
})

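This returns a list in which the matching rows are kept and the others are NULL. We can rbind the list back together; note that the result is a character matrix, so wrap it in as.data.frame() if a data.frame is needed: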

> do.call(rbind, output)
  V1  V2  V3  V4  V5 
1 "a" "b" "c" "d" NA 
2 "a" "n" "m" "c" "d"
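One thing to keep in mind with this approach is that apply() simplifies its result to a matrix whenever every call returns a vector of the same length, so the list shape is not guaranteed (for instance when every row matches). A possible way around this, sketched here with a hypothetical helper named row_matches, is to compute one logical value per row with vapply() and subset the data frame directly:

```r
# Same example data as in the question, with NA already in place
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

# Hypothetical helper: TRUE if the row starts with `start` and its last two
# non-NA values are `penultimate` and `last`
row_matches <- function(row, start, penultimate, last) {
  vals <- row[!is.na(row)]
  n <- length(vals)
  n >= 2 && vals[1] == start && vals[n - 1] == penultimate && vals[n] == last
}

# vapply() always returns a logical vector with one entry per row
keep <- vapply(
  seq_len(nrow(df)),
  function(i) row_matches(unlist(df[i, ]), start, penultimate, last),
  logical(1)
)

filtered <- df[keep, ]
print(filtered)
```

Because the subset is taken from df itself, the result stays a data.frame and keeps the original column types.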

Answer 3

You can use a regular expression here:

pattern <- paste0("^", start, ".*", penultimate, last, "$")
# "^a.*cd$"
index <- grepl(pattern, apply(df, 1, function(i) paste(i[!is.na(i)], collapse="")))
# [1]  TRUE  TRUE FALSE FALSE FALSE
df[index,]
#   V1 V2 V3 V4   V5
# 1  a  b  c  d <NA>
# 2  a  n  m  c    d
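Note that collapsing with an empty separator works for the single-character values in the example, but it can no longer tell "c", "d" apart from "cd" once values have more than one character. A possible variant, sketched here under the assumptions that the values never contain a tab and contain no regex metacharacters, joins the values with a tab and includes it in the pattern. The data frame df2 and the names sep, collapsed and naive are made up for this illustration.

```r
# Hypothetical data with multi-character values
df2 <- data.frame(
  V1 = c("a", "a", "ab"),
  V2 = c("b", "x", "c"),
  V3 = c("c", "cd", "d"),
  V4 = c("d", NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

# Pattern without a separator: every row collapses to a string like "abcd"
naive <- grepl(
  paste0("^", start, ".*", penultimate, last, "$"),
  apply(df2, 1, function(x) paste(x[!is.na(x)], collapse = ""))
)

# Assumption: a tab never occurs inside a value, so it can mark value boundaries
sep <- "\t"
collapsed <- apply(df2, 1, function(x) paste(x[!is.na(x)], collapse = sep))

# start, optionally some middle values, then penultimate and last as whole values
pattern <- paste0("^", start, "(", sep, ".*)?", sep, penultimate, sep, last, "$")
index <- grepl(pattern, collapsed)

print(naive)
print(index)
print(df2[index, ])
```

With the separator in place, only the first row of df2 is selected, whereas the separator-free pattern accepts all three.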

Get rows of a data frame that match a condition with uneven row lengths
