Filter a data.frame based on reactive input

Time: 2018-07-10 08:49:21

Tags: r

I have the following data frame:

structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L, 
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(2L, 2L, 1L, 2L, 
2L), .Label = c("", "x"), class = "factor"), E = structure(c(2L, 
1L, 2L, 2L, 1L), .Label = c("", "x"), class = "factor"), F = structure(c(2L, 
1L, 2L, 2L, 2L), .Label = c("", "x"), class = "factor"), G = structure(c(2L, 
1L, 1L, 1L, 1L), .Label = c("", "x"), class = "factor"), Y = structure(c(2L, 
1L, 2L, 1L, 1L), .Label = c("", "x"), class = "factor")), .Names = c("A", 
"B", "C", "D", "E", "F", "G", "Y"), class = "data.frame", row.names = c(NA, 
-5L))

I want to filter this data frame and remove rows with missing values in the columns (D, E, F, G, Y). I am doing this with complete.cases in the following code:

completeFun <- function(data, desiredCols) {

   completeVec <- complete.cases(data[, desiredCols])
   return(data[completeVec, ])
 }

However, I noticed that when I call the function, for example completeFun(test, c('E','F')), it returns the following output:

  A B C    D E F    G    Y
1 1 1 1    x x x    x    x
3 1 2 2 <NA> x x <NA>    x
4 1 2 2    x x x <NA> <NA>

This removes the rows where E OR F is NA and keeps only the rows where E AND F are NOT NA.
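To make the difference between the two conditions concrete, here is a minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration and are not the data above). It contrasts the "all columns present" test that complete.cases performs with an "at least one column present" test.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  id = 1:4,
  E  = c("x", NA, "x", NA),
  F  = c("x", NA, NA, "x"),
  stringsAsFactors = FALSE
)

# TRUE only where every selected column is non-NA (logical AND)
all_present <- complete.cases(toy[, c("E", "F")])

# TRUE where at least one selected column is non-NA (logical OR)
any_present <- rowSums(!is.na(toy[, c("E", "F")])) > 0

toy$id[all_present]  # ids of rows kept by complete.cases
toy$id[any_present]  # ids of rows with at least one value
```

Only the second condition keeps rows in which one of the two columns is NA and the other is not.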

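However, the rows I want to keep are those where any of the columns (E, F) is NOT NA, i.e. a row should be dropped only when both E and F are NA, which in this case means the following output: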

  A B C    D    E    F    G    Y
1 1 1 1    x    x    x    x    x
3 1 2 2 <NA>    x    x <NA>    x
4 1 2 2    x    x    x <NA> <NA>
5 2 1 1    x <NA>    x <NA> <NA>

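Of course, I would like to keep the function as flexible as possible so that more columns can be included in the check.

What is the best R way to do this?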

Update

Following Sotos's answer, the following case does not work with his approach:

structure(list(A = c(1L, 1L, 1L, 1L, 2L), B = c(1L, 2L, 2L, 2L, 
1L), C = c(1L, 1L, 2L, 2L, 1L), D = structure(c(1L, 1L, NA, 1L, 
1L), .Label = "x", class = "factor"), E = structure(c(1L, NA, 
1L, 1L, NA), .Label = "x", class = "factor"), F = structure(c(1L, 
NA, 1L, 1L, 1L), .Label = "x", class = "factor"), G = structure(c(1L, 
NA, NA, NA, NA), .Label = "x", class = "factor"), Y = structure(c(1L, 
NA, 1L, NA, 1L), .Label = "x", class = "factor")), .Names = c("A", 
"B", "C", "D", "E", "F", "G", "Y"), row.names = c(NA, -5L), class = "data.frame")

For this new data frame, if I call the function as follows: completeFun(test, cols = c('E','F', 'Y')), I get the following output:

      A  B  C    D    E    F    G    Y
1     1  1  1    x    x    x    x    x
NA   NA NA NA <NA> <NA> <NA> <NA> <NA>
3     1  2  2 <NA>    x    x <NA>    x
NA.1 NA NA NA <NA> <NA> <NA> <NA> <NA>
NA.2 NA NA NA <NA> <NA> <NA> <NA> <NA>

The last row of the data frame, where F AND Y have non-NA values, is lost.
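A likely explanation for the all-NA rows, sketched on a hypothetical toy data frame (`toy` is made up for illustration): comparing NA with '' yields NA, so the row sum becomes NA for any row that contains an NA, and indexing rows with an NA index produces a row of NAs.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  id = 1:3,
  E  = c("x", NA, NA),
  F  = c("x", "x", NA),
  stringsAsFactors = FALSE
)
cols <- c("E", "F")

# NA == "" is NA, so rowSums() is NA for every row that contains an NA
empty_count <- rowSums(toy[cols] == "")
keep <- empty_count != length(cols)

# An NA in a logical row index yields an all-NA row in the result
res <- toy[keep, ]
res
```

Here the second row has a real value in F but still comes back as a row of NAs, which mirrors the lost last row in the output above.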

2 answers:

Answer 0: (score: 2)

You can do this with rowSums, i.e.

completeFun <- function(df, cols) {
    return(df[rowSums(df[cols] == '') != length(cols),])
}

completeFun(dd, cols = c('E', 'F'))
#  A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2   x x   x
#4 1 2 2 x x x    
#5 2 1 1 x   x  

completeFun(dd, cols = 'Y')
#  A B C D E F G Y
#1 1 1 1 x x x x x
#3 1 2 2   x x   x

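Edit

In the previous example the OP had empty strings ('') rather than NA, so that is what we were checking for. If we want to check for NA instead, we can modify the function to use is.na.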

completeFun <- function(df, cols) {
    df[rowSums(is.na(df[cols])) != length(cols), ]
 }


completeFun(df, cols = c('E','F', 'Y'))
#  A B C    D    E F    G    Y
#1 1 1 1    x    x x    x    x
#3 1 2 2 <NA>    x x <NA>    x
#4 1 2 2    x    x x <NA> <NA>
#5 2 1 1    x <NA> x <NA>    x
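
If a data frame might mix both representations, the two checks can be combined. The following is only a sketch under that assumption: `keepAnyPresent` and `toy` are hypothetical names that do not appear in the question, and the example uses character columns rather than factors.

```r
# Hypothetical helper, not part of the answer above: treats both NA and ""
# as missing and keeps rows with at least one real value in `cols`.
keepAnyPresent <- function(df, cols) {
  missing <- is.na(df[cols]) | df[cols] == ""
  df[rowSums(missing) < length(cols), , drop = FALSE]
}

# Hypothetical toy data, for illustration only
toy <- data.frame(
  id = 1:4,
  E  = c("x", NA, "", "x"),
  F  = c("", NA, "", NA),
  stringsAsFactors = FALSE
)

kept <- keepAnyPresent(toy, c("E", "F"))
kept$id
```

Because is.na is evaluated first, the NA produced by comparing a missing value with "" is absorbed by the `|`, so the row index never contains NA.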

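Answer 1: (score: 0)

Similar to Sotos's answer, but more flexible.
A row is considered complete if its number of non-NA values is equal to or greater than the threshold thrsh.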

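completeFun <- function(dtf, cols, na.val="", thrsh=1) {
    # Treat na.val (the empty string by default) as missing
    dtf[dtf == na.val] <- NA
    # Keep rows with at least thrsh non-NA values; drop = FALSE keeps a
    # single selected column as a data frame so rowSums still works
    ix <- rowSums(!is.na(dtf[, cols, drop = FALSE])) >= thrsh
    dtf[ix, ]
}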

completeFun(test, cols=c("E", "F"))
#   A B C    D    E F    G    Y
# 1 1 1 1    x    x x    x    x
# 3 1 2 2 <NA>    x x <NA>    x
# 4 1 2 2    x    x x <NA> <NA>
# 5 2 1 1    x <NA> x <NA> <NA>

completeFun(test, cols=c("D", "E", "F", "Y"), thrsh=3)
#   A B C    D E F    G    Y
# 1 1 1 1    x x x    x    x
# 3 1 2 2 <NA> x x <NA>    x
# 4 1 2 2    x x x <NA> <NA>