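When you subset a data.frame by some condition and the data frame contains NA, the condition itself can evaluate to NA. That in turn causes problems when the data.frame is subset:
# data generation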
set.seed(123)
df <- data.frame(a = 1:100, b = sample(c("moon", "venus"), 100, replace = TRUE), c = sample(c('a', 'b', NA), 100, replace = TRUE))
# indexing
with(df, df[a < 30 & b == "moon" & c == "a",])
You get:
a b c
NA NA <NA> <NA>
10 10 moon a
12 12 moon a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29 29 moon a
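This happens because the condition evaluates to a vector containing NA, and those NAs produce the all-NA rows shown above when the data frame is indexed.
Either of the following fixes solves it: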
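To make the mechanism concrete, here is a minimal sketch on a toy data frame. The name `toy` and its contents are invented for illustration and are not the `df` from the question. It shows a comparison against NA returning NA, and that NA index then producing a row filled with NA.

```r
# Toy data frame (hypothetical, not the df from the question)
toy <- data.frame(a = 1:4, c = c("a", NA, "b", "a"))

# Comparing NA with anything gives NA, so the condition has three states
cond <- toy$c == "a"
print(cond)

# A logical index of NA selects a "row" in which every column is NA
res <- toy[cond, ]
print(res)
```

The second row of `res` is the all-NA row: R cannot tell whether that row matches, so it returns a placeholder instead of dropping it.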
with(df, df[a < 30 & b == "moon" & (c == "a" & !is.na(c)),]) # exclude NAs
with(df, df[a < 30 & b == "moon" & (c == "a" | is.na(c)),]) # include NAs
But both of these are clumsy. Imagine you have a long condition such as
df[A == x1 & B == x2 & C == x3 & D == x4,]
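and you have to wrap every element like this:
df[(A == x1 | is.na(A)) & (B == x2 | is.na(B)) ...,]
Is there an elegant solution to this problem, one that does not require typing so much code at the console when you are just trying to inspect a data frame?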
Answer 0 (score: 4)
Well, if you want to omit the NA rows, a quick and dirty solution is to wrap the condition in which:
> with(df, df[a < 30 & b == "moon" & c == "a",])
a b c
NA NA <NA> <NA>
10 10 moon a
12 12 moon a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29 29 moon a
> with(df, df[which(a < 30 & b == "moon" & c == "a"),])
a b c
10 10 moon a
12 12 moon a
29 29 moon a
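As a small sketch of why this works (the vector `cond` below is made up for illustration): `which` returns the integer positions where a logical vector is TRUE, so both FALSE and NA entries simply disappear from the index.

```r
# Hypothetical condition vector with one NA
cond <- c(TRUE, NA, FALSE, TRUE)

# which() keeps only the positions that are TRUE
idx <- which(cond)
print(idx)
```

One caveat worth keeping in mind: this is safe for positive selection, but negating the result, as in `df[-which(cond), ]`, returns zero rows when nothing matches, because an empty negative index selects nothing.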
Edit: another option in this case, which some may frown upon but which I personally find very useful, is to define a local variable inside the brackets:
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
a b c
6 6 moon <NA>
10 10 moon a
12 12 moon a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon a
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
a b c
10 10 moon a
12 12 moon a
29 29 moon a
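This is more concise than writing a special function or defining the index on a separate line, and it works in many situations where no R function does exactly what you want.
Answer 1 (score: 1)
You can use the data.table package. It simplifies the code because you do not have to wrap everything in with(df, ...), and it treats NAs in the condition as FALSE.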
library(data.table) # fails loudly if the package is not installed
dt <- data.table(df)
dt[a < 30 & b == "moon" & c == "a",] # exclude NAs
dt[a < 30 & b == "moon" & (c == "a"|is.na(c)),] # include NAs
Answer 2 (score: 1)
clean <- function(x, include = FALSE){
x[is.na(x)] <- include
x
}
# Original output
with(df, df[a < 30 & b == "moon" & c == "a",])
# Clean it up and remove NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
# Clean it up but include NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
which gives:
> with(df, df[a < 30 & b == "moon" & c == "a",])
a b c
NA NA <NA> <NA>
10 10 moon a
12 12 moon a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29 29 moon a
>
> with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
a b c
10 10 moon a
12 12 moon a
29 29 moon a
> with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
a b c
6 6 moon <NA>
10 10 moon a
12 12 moon a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon a
Wrapping the condition in which also works, but it can only exclude the NA rows; it has no option to include them.
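The comparison between the two approaches can be checked on a small example. This is only a sketch: `toy` is an invented data frame, and `clean` is repeated from this answer so that the block runs on its own.

```r
# Helper from this answer: replace NA in a logical vector with a chosen value
clean <- function(x, include = FALSE) {
  x[is.na(x)] <- include
  x
}

# Hypothetical data, not the df from the question
toy <- data.frame(a = 1:5, c = c("a", NA, "b", "a", NA))
cond <- toy$a < 5 & toy$c == "a"

by_which <- toy[which(cond), ]                  # NA rows excluded
by_clean <- toy[clean(cond), ]                  # NA rows excluded
with_na  <- toy[clean(cond, include = TRUE), ]  # NA rows included

# Both exclusion approaches pick the same rows
print(identical(rownames(by_which), rownames(by_clean)))
print(rownames(with_na))
```

Note that row 5 is not returned even with `include = TRUE`: `FALSE & NA` is FALSE, so a row is only treated as "unknown" when every non-NA part of the condition is TRUE.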