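I have a large R data.table with a multi-column key, in which some of the value columns contain NAs. I want to remove the groups that are entirely NA in one or more value columns, but otherwise keep each group whole. This should be repeated for each column of the key.
To give a simplified example: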
library(data.table)
DT = data.table(
Series = rep(letters[1:12], each = 3),
Id = 1:12,
Value1 = c(1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3)),
Value2 = c(rep(NA,3), 1:4, NA, 6:9, rep(NA,3), 1:9, 1:9, rep(NA,3)))
DT
Series Id Value1 Value2
1: a 1 1 NA
2: a 2 2 NA
3: a 3 3 NA
4: b 4 NA 1
5: b 5 5 2
6: b 6 6 3
7: c 7 7 4
8: c 8 8 NA
9: c 9 9 6
10: d 10 NA 7
11: d 11 NA 8
12: d 12 NA 9
13: e 1 1 NA
14: e 2 2 NA
15: e 3 3 NA
16: f 4 NA 1
17: f 5 5 2
18: f 6 6 3
19: g 7 7 4
20: g 8 8 5
21: g 9 9 6
22: h 10 NA 7
23: h 11 NA 8
24: h 12 NA 9
25: i 1 1 1
26: i 2 2 2
27: i 3 3 3
28: j 4 NA 4
29: j 5 5 5
30: j 6 6 6
31: k 7 7 7
32: k 8 8 8
33: k 9 9 9
34: l 10 NA NA
35: l 11 NA NA
36: l 12 NA NA
Series Id Value1 Value2
So I want to drop every Series group and every Id group in which Value1 or Value2 is entirely NA.
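As a rough sketch of that rule, the groups in question can be listed directly from the example data. The helper name allNA and the variables valueCols, badSeries and badIds are made up here for illustration; they are not part of the original question.

```r
library(data.table)

DT <- data.table(
  Series = rep(letters[1:12], each = 3),
  Id     = 1:12,
  Value1 = c(1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3)),
  Value2 = c(rep(NA, 3), 1:4, NA, 6:9, rep(NA, 3), 1:9, 1:9, rep(NA, 3)))

# Hypothetical helper: TRUE when a vector holds nothing but NA
allNA <- function(x) all(is.na(x))
valueCols <- c("Value1", "Value2")

# A group is "bad" when at least one value column is NA throughout the group
badSeries <- DT[, .(bad = any(sapply(.SD, allNA))), by = Series, .SDcols = valueCols][bad == TRUE, Series]
badIds    <- DT[, .(bad = any(sapply(.SD, allNA))), by = Id,     .SDcols = valueCols][bad == TRUE, Id]

print(badSeries)  # "a" "d" "e" "h" "l"
print(badIds)     # 4 10 11 12
```

Removing all rows that belong to any of these Series or Ids leaves the 18 rows of the desired result.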
The correct result should look like this:
Series Id Value1 Value2
1: b 5 5 2
2: b 6 6 3
3: c 7 7 4
4: c 8 8 NA
5: c 9 9 6
6: f 5 5 2
7: f 6 6 3
8: g 7 7 4
9: g 8 8 5
10: g 9 9 6
11: i 1 1 1
12: i 2 2 2
13: i 3 3 3
14: j 5 5 5
15: j 6 6 6
16: k 7 7 7
17: k 8 8 8
18: k 9 9 9
Series Id Value1 Value2
What I have managed so far:
I can find the Series that are entirely NA in Value1 like this:
DT[, sum(1-is.na(Value1)) == 0, by = Series][V1 == TRUE]
I can even do
setkey(DT, Series)
DT = DT[DT[, sum(1-is.na(Value1)) == 0, by = Series][V1 != TRUE]]
but now I end up with V1 in the final table.
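One possible way around the extra column, sketched here under the assumption that a data.table version with the on= argument is available: reduce the lookup table to just the Series column before joining, so the join has no other column to carry over. The names flags, keep and res are illustrative only.

```r
library(data.table)

DT <- data.table(
  Series = rep(letters[1:12], each = 3),
  Id     = 1:12,
  Value1 = c(1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3)),
  Value2 = c(rep(NA, 3), 1:4, NA, 6:9, rep(NA, 3), 1:9, 1:9, rep(NA, 3)))

# One row per Series, flagging whether Value1 is NA throughout the group
flags <- DT[, .(allNA = all(is.na(Value1))), by = Series]

# Keep only the Series column of the groups to retain
keep <- flags[allNA == FALSE, .(Series)]

# Join back: the result has DT's columns only, with no flag column
res <- DT[keep, on = "Series"]
names(res)
```

This handles Value1 by Series only; the same pattern would have to be repeated for the other value column and for Id.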
Answer 0 (score: 10)
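You can do this to get the entries whose Value1 is not all NA within the group:
setkey(DT, "Series")
DT[, .SD[(!all(is.na(Value1)))], by=Series]
The parens around !all are needed to avoid the not-join syntax, which Matthew is going to look into (see comments). It is the same as this:
DT[, .SD[as.logical(!all(is.na(Value1)))], by=Series]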
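A variation on the same idea, offered only as a sketch: collect the row numbers to keep with .I and subset once, which is usually cheaper than materialising .SD for every group. The variable names rows and res are illustrative.

```r
library(data.table)

DT <- data.table(
  Series = rep(letters[1:12], each = 3),
  Id     = 1:12,
  Value1 = c(1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3)),
  Value2 = c(rep(NA, 3), 1:4, NA, 6:9, rep(NA, 3), 1:9, 1:9, rep(NA, 3)))

# .I holds the row numbers of the current group; indexing it with a single
# TRUE keeps them all, and with FALSE keeps none
rows <- DT[, .I[!all(is.na(Value1))], by = Series]$V1

# Subset the original table once, using the collected row numbers
res <- DT[rows]
nrow(res)
```

Series d, h and l, whose Value1 is NA throughout, are absent from res.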
Building on this to answer the newly clarified question:
allNA = function(x) all(is.na(x)) # define helper function
for (i in c("Id","Series"))
DT = DT[, if (!any(sapply(.SD,allNA))) .SD else NULL, by=i]
DT
Series Id Value1 Value2
1: i 1 1 1
2: i 2 2 2
3: i 3 3 3
4: b 5 5 2
5: b 6 6 3
6: f 5 5 2
7: f 6 6 3
8: j 5 5 5
9: j 6 6 6
10: c 7 7 4
11: c 8 8 NA
12: c 9 9 6
13: g 7 7 4
14: g 8 8 5
15: g 9 9 6
16: k 7 7 7
17: k 8 8 8
18: k 9 9 9
But that changes the row order, so it is not quite the requested result. The following keeps the order and should also be faster.
# starting fresh from original DT in question again
DT[,drop:=FALSE]
for (i in c("Series","Id"))
DT[,drop:=drop|any(sapply(.SD,allNA)),by=i]
DT[(!drop)][,drop:=NULL][]
Series Id Value1 Value2
1: b 5 5 2
2: b 6 6 3
3: c 7 7 4
4: c 8 8 NA
5: c 9 9 6
6: f 5 5 2
7: f 6 6 3
8: g 7 7 4
9: g 8 8 5
10: g 9 9 6
11: i 1 1 1
12: i 2 2 2
13: i 3 3 3
14: j 5 5 5
15: j 6 6 6
16: k 7 7 7
17: k 8 8 8
18: k 9 9 9
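The flag approach can be wrapped into a small function. This is a minimal sketch, not part of the original answer: the function name dropAllNAGroups and the temporary column names drop_flag and group_bad are hypothetical and are assumed not to clash with existing columns. It works on a copy and restricts the NA check to the chosen value columns through .SDcols.

```r
library(data.table)

# Hypothetical helper: for each key column in `keys`, flag the groups in which
# any column in `vals` is NA throughout, then drop all flagged rows.
dropAllNAGroups <- function(dt, keys, vals) {
  out <- copy(dt)                      # leave the caller's table untouched
  out[, drop_flag := FALSE]
  for (k in keys) {
    # group_bad is TRUE on every row of a group with an entirely-NA value column
    out[, group_bad := any(vapply(.SD, function(x) all(is.na(x)), logical(1))),
        by = k, .SDcols = vals]
    out[, drop_flag := drop_flag | group_bad]
  }
  out <- out[drop_flag == FALSE]       # row order of the input is preserved
  out[, c("drop_flag", "group_bad") := NULL]
  out[]
}

DT <- data.table(
  Series = rep(letters[1:12], each = 3),
  Id     = 1:12,
  Value1 = c(1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3)),
  Value2 = c(rep(NA, 3), 1:4, NA, 6:9, rep(NA, 3), 1:9, 1:9, rep(NA, 3)))

res <- dropAllNAGroups(DT, keys = c("Series", "Id"), vals = c("Value1", "Value2"))
nrow(res)  # 18
```

Because every flag is computed on the full table before any row is removed, the order in which the key columns are listed does not affect the result.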
Answer 1 (score: 10)
How about using the complete.cases function?
DT[complete.cases(DT),]
It removes every row that has an NA in any column:
> DT[complete.cases(DT),]
Series Id Value1 Value2
1: b 4 4 1
2: b 5 5 2
3: b 6 6 3
4: c 7 7 4
5: c 8 8 5
6: c 9 9 6
7: f 4 4 1
8: f 5 5 2
9: f 6 6 3
10: g 7 7 4
11: g 8 8 5
12: g 9 9 6
13: j 4 4 1
14: j 5 5 2
15: j 6 6 3
16: k 7 7 4
17: k 8 8 5
18: k 9 9 6
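To see how this row-wise filter behaves on the DT exactly as defined in the question, a short sketch follows. The variable names cc and cc2 are illustrative; na.omit with a cols argument is data.table's own method for the same kind of filter on selected columns.

```r
library(data.table)

DT <- data.table(
  Series = rep(letters[1:12], each = 3),
  Id     = 1:12,
  Value1 = c(1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3), 1:3, NA, 5:9, rep(NA, 3)),
  Value2 = c(rep(NA, 3), 1:4, NA, 6:9, rep(NA, 3), 1:9, 1:9, rep(NA, 3)))

# Row-wise: a row survives only if none of its columns is NA
cc <- DT[complete.cases(DT)]
nrow(cc)  # 17

# The same row-wise filter, restricted to the value columns
cc2 <- na.omit(DT, cols = c("Value1", "Value2"))

# Series c, Id 8 has Value1 = 8 but Value2 = NA, so it is removed here
cc[Series == "c" & Id == 8]
```

The filter looks at single rows rather than groups, so on this data it returns 17 rows: the row for Series c, Id 8 is removed even though the result requested in the question keeps it.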