I have the following data frame:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(2L, 2L, 1L, 2L,
2L), .Label = c("", "x"), class = "factor"), E = structure(c(2L,
1L, 2L, 2L, 1L), .Label = c("", "x"), class = "factor"), F = structure(c(2L,
1L, 2L, 2L, 2L), .Label = c("", "x"), class = "factor"), G = structure(c(2L,
1L, 1L, 1L, 1L), .Label = c("", "x"), class = "factor"), Y = structure(c(2L,
1L, 2L, 1L, 1L), .Label = c("", "x"), class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), class = "data.frame", row.names = c(NA,
-5L))
I want to filter this data frame and drop rows with missing values in the columns (D, E, F, G, Y). I am doing this with complete.cases in the following code:
completeFun <- function(data, desiredCols) {
completeVec <- complete.cases(data[, desiredCols])
return(data[completeVec, ])
}
However, I noticed that when I call the function, for example completeFun(test, c('E','F')), I get the following output:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
This removes the rows where E OR F is NA, and keeps only the rows where E AND F are both NOT NA.
However, I want to keep the rows where any of the columns (E, F) is NOT NA, i.e. drop a row only when E and F are both NA. In this case the output would be:
A B C D E F G Y
1 1 1 1 x x x x x
3 1 2 2 <NA> x x <NA> x
4 1 2 2 x x x <NA> <NA>
5 2 1 1 x <NA> x <NA> <NA>
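Of course, I would like to keep the function as flexible as possible, so that more columns can be included in the check.
What is the best R way to do this?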
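To make the distinction concrete, here is a minimal sketch on a made-up data frame (toy and its id column are hypothetical and not part of the data above). It only contrasts the "all columns present" test that complete.cases performs with the "at least one column present" test described in the question.

```r
# Toy data (hypothetical), with NA marking a missing value
toy <- data.frame(
  id = 1:4,
  E  = c("x", NA, "x", NA),
  F  = c("x", NA, NA, "x"),
  stringsAsFactors = FALSE
)

# complete.cases(): TRUE only when ALL selected columns are non-NA
all_present <- complete.cases(toy[, c("E", "F")])
toy$id[all_present]   # 1

# Desired filter: TRUE when AT LEAST ONE selected column is non-NA
any_present <- !is.na(toy$E) | !is.na(toy$F)
toy$id[any_present]   # 1 3 4
```

Only row 2, where both E and F are NA, is dropped by the second test.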
Update
Following Sotos's answer, here is a case where his solution does not work:
structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L,
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(1L, 1L, NA, 1L,
1L), .Label = "x", class = "factor"), E = structure(c(1L, NA,
1L, 1L, NA), .Label = "x", class = "factor"), F = structure(c(1L,
NA, 1L, 1L, 1L), .Label = "x", class = "factor"), G = structure(c(1L,
NA, NA, NA, NA), .Label = "x", class = "factor"), Y = structure(c(1L,
NA, 1L, NA, 1L), .Label = "x", class = "factor")), .Names = c("A",
"B", "C", "D", "E", "F", "G", "Y"), row.names = c(NA, -5L), class = "data.frame")
For this new data frame, if I call the function as follows: completeFun(test, cols = c('E','F', 'Y')), I get the following output:
A B C D E F G Y
1 1 1 1 x x x x x
NA NA NA NA <NA> <NA> <NA> <NA> <NA>
3 1 2 2 <NA> x x <NA> x
NA.1 NA NA NA <NA> <NA> <NA> <NA> <NA>
NA.2 NA NA NA <NA> <NA> <NA> <NA> <NA>
The last row of the data frame, in which F AND Y have non-empty values, is lost.
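The rows of NAs in that output are most likely a consequence of how NA propagates through the comparison, rather than of the filter logic itself. The sketch below reproduces the effect on a small made-up frame (toy is hypothetical): comparing NA with '' yields NA, rowSums passes that NA on, and an NA in a logical row index produces a row of NAs.

```r
# Toy data (hypothetical) that uses NA, not "", for missing values
toy <- data.frame(
  id = 1:3,
  E  = c("x", NA, NA),
  F  = c("x", NA, "x"),
  stringsAsFactors = FALSE
)
cols <- c("E", "F")

# Comparing NA with "" gives NA, not FALSE
is_blank <- toy[cols] == ""

# rowSums() propagates NA, so the row test is NA for rows 2 and 3
keep <- rowSums(is_blank) != length(cols)
keep

# An NA in a logical row index yields a row of NAs instead of dropping the row
toy[keep, ]
```

Row 3 has a value in F, yet it comes back as a row of NAs, which matches the symptom described above.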
Answer 0 (score: 2)
You can do this with rowSums, i.e.
completeFun <- function(df, cols) {
return(df[rowSums(df[cols] == '') != length(cols),])
}
completeFun(dd, cols = c('E', 'F'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
#4 1 2 2 x x x
#5 2 1 1 x x
completeFun(dd, cols = 'Y')
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 x x x
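Edit
In the previous example the OP had blank strings ('') rather than NA, so that is what we were checking for. If we want to check for NA instead, we can modify the function to use is.na.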
completeFun <- function(df, cols) {
df[rowSums(is.na(df[cols])) != length(cols), ]
}
completeFun(df, cols = c('E','F', 'Y'))
# A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2 <NA> x x <NA> x
#4 1 2 2 x x x <NA> <NA>
#5 2 1 1 x <NA> x <NA> x
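If a data frame might contain both blank strings and NA, the two checks can be combined. The following is only a sketch under that assumption; keepAnyPresent and toy are hypothetical names, not part of the answer above.

```r
# Hypothetical helper: treat both NA and "" as missing, and keep a row
# when at least one of the selected columns holds a real value
keepAnyPresent <- function(df, cols) {
  vals <- df[cols]                        # df[cols] stays a data frame, even for one column
  is_missing <- is.na(vals) | vals == ""  # TRUE | NA is TRUE, so no NA is left in the result
  df[rowSums(!is_missing) > 0, , drop = FALSE]
}

toy <- data.frame(
  id = 1:4,
  E  = c("x", "", NA, ""),
  F  = c("", NA, "x", ""),
  stringsAsFactors = FALSE
)

res <- keepAnyPresent(toy, c("E", "F"))
res$id   # 1 3
```

Rows 2 and 4 are dropped because neither E nor F holds a real value there.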
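Answer 1 (score: 0)
Similar to Sotos's answer, but more flexible.
A row is considered complete if its number of non-NA values is equal to or greater than the threshold thrsh.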
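completeFun <- function(dtf, cols, na.val="", thrsh=1) {
  # Recode the placeholder value (na.val) as NA across the whole data frame
  dtf[dtf == na.val] <- NA
  # Count non-NA values per row; drop=FALSE keeps a single column two-dimensional
  ix <- rowSums(!is.na(dtf[, cols, drop=FALSE])) >= thrsh
  dtf[ix, ]
}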
completeFun(test, cols=c("E", "F"))
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
# 5 2 1 1 x <NA> x <NA> <NA>
completeFun(test, cols=c("D", "E", "F", "Y"), thrsh=3)
# A B C D E F G Y
# 1 1 1 1 x x x x x
# 3 1 2 2 <NA> x x <NA> x
# 4 1 2 2 x x x <NA> <NA>
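One detail worth knowing when writing functions like this: selecting a single column with matrix-style indexing simplifies the result to a vector, and rowSums then fails. The sketch below illustrates this on a made-up frame (toy is hypothetical) and shows that drop = FALSE avoids the problem.

```r
toy <- data.frame(
  id = 1:3,
  E  = c("x", NA, NA),
  F  = c("x", NA, "x"),
  stringsAsFactors = FALSE
)

# Two columns: the result is still a data frame, so rowSums() works
class(toy[, c("E", "F")])

# One column: the result is simplified to a plain vector without dimensions ...
is.null(dim(toy[, "E"]))

# ... and rowSums() needs at least two dimensions, so this call errors
single_col_fails <- inherits(
  try(rowSums(!is.na(toy[, "E"])), silent = TRUE),
  "try-error"
)

# drop = FALSE keeps the data frame shape, so the threshold test works again
counts <- rowSums(!is.na(toy[, "E", drop = FALSE]))
toy[counts >= 1, ]
```

List-style indexing such as toy["E"] also keeps the data frame shape and could be used instead.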