Is there a better way than
DT <- DT[,!apply(DT,2,function(x) all(is.na(x))), with = FALSE]
to subset a data.table to only the columns that are not entirely filled with NA?
Thanks
Answer 0 (score: 4)
The basic idea is to find the all-NA columns, with something like:
na_idx = sapply(DT, function(x) all(is.na(x)))
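As a small illustration of what this index looks like, here is a sketch on a toy table. The table `toy` and its columns are made up for this example and are not part of the original question.

```r
library(data.table)

# Hypothetical toy table: column b is entirely NA, column c only partly
toy = data.table(a = 1:3, b = NA_real_, c = c(1, NA, 3))

# A data.table is a list of columns, so sapply() visits each column in turn
na_idx = sapply(toy, function(x) all(is.na(x)))

# Named logical vector: TRUE only for the columns that are all NA
print(na_idx)
```

Only `b` is flagged; a column with some non-missing values, like `c`, is kept.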
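To apply this to subsetting the table, the answer depends on whether you want to remove these columns from the table itself or to create a separate, derived table.
In the former case, you should set these columns to NULL: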
DT[ , which(sapply(DT, function(x) all(is.na(x)))) := NULL]
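To make the "by reference" aspect concrete, here is a minimal sketch on a made-up table (`toy` and `alias` are hypothetical names). It assumes the usual data.table behaviour that plain assignment does not copy the table, so both names see the deletion.

```r
library(data.table)

# Hypothetical toy table: column b is entirely NA
toy = data.table(a = 1:3, b = NA_real_, c = c(1, NA, 3))
alias = toy  # plain assignment: a second name for the same table, not a copy

# which() turns the logical index into column positions for :=
toy[ , which(sapply(toy, function(x) all(is.na(x)))) := NULL]

# The column is gone under both names, because the table was modified in place
print(names(toy))
print(names(alias))
```

If the original table must stay intact, take a `copy()` first or use one of the derived-table approaches.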
In the latter case, there are several options:
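# Option 1: build a logical index of the columns to keep, then select them
idx = sapply(DT, function(x) !all(is.na(x)))
DT = DT[ , idx, with = FALSE] # or DT = DT[ , ..idx]
# Option 2: return NULL from lapply(.SD) for all-NA columns, which drops them
DT = DT[ , lapply(.SD, function(x) if (all(is.na(x))) NULL else x)]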
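The derived-table options can be sketched on a toy table as follows. The names `toy`, `keep`, `derived1` and `derived2` are made up for this illustration; the results are assigned to new names so that the original table is visibly left untouched.

```r
library(data.table)

# Hypothetical toy table: column b is entirely NA
toy = data.table(a = 1:3, b = NA_real_, c = c(1, NA, 3))

# Logical index of the columns to keep
keep = sapply(toy, function(x) !all(is.na(x)))

# Select by index; with = FALSE makes j refer to columns rather than an expression
derived1 = toy[ , keep, with = FALSE]

# Alternative: NULL list elements returned in j are dropped from the result
derived2 = toy[ , lapply(.SD, function(x) if (all(is.na(x))) NULL else x)]

print(names(toy))       # still has all three columns
print(names(derived1))
print(names(derived2))
```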
Approaches based on apply and colSums involve an inefficient conversion to a matrix.
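A short sketch of what that conversion does, using a made-up table with mixed column types (`mixed` is a hypothetical name). It only shows the coercion itself, not its cost on large data.

```r
library(data.table)

# Hypothetical table with integer, character and all-NA logical columns
mixed = data.table(id = 1:3, label = c("x", "y", "z"), empty = NA)

# A matrix holds a single type, so converting a mixed table
# coerces every column to character
m = as.matrix(mixed)
print(class(m[ , "id"]))

# sapply() visits each column as stored, with no conversion
col_classes = sapply(mixed, class)
print(col_classes)
```

Besides the type coercion, the conversion materialises a full copy of the data, which is what makes these approaches costly on large tables.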
Here is a benchmark of the cases laid out here and in @DavidArenburg's comment (times are elapsed seconds from system.time):
          method   time
1: which := NULL  1.434
2:  for set NULL  3.432
3:   lapply(.SD) 16.041
4:         ..idx 10.343
5:    with FALSE  4.896
Code:
library(data.table)
# Benchmark data: kk numeric columns of NN rows, n_na of them set entirely to NA
NN = 1e7
kk = 50
n_na = 5
set.seed(021349)
DT = setDT(replicate(kk, rnorm(NN), simplify = FALSE))
DT[ , (sample(kk, n_na)) := NA_real_]
# 1. Delete by reference, all all-NA columns at once
DT2 = copy(DT)
t1 = system.time(
  DT2[ , which(sapply(DT2, function(x) all(is.na(x)))) := NULL]
)
rm(DT2)
# 2. Delete by reference, one column at a time with set()
DT2 = copy(DT)
t2 = system.time({
  for (col in copy(names(DT2)))
    if (all(is.na(DT2[[col]]))) set(DT2, , col, NULL)
})
rm(DT2)
# 3. Derived table via lapply(.SD)
DT2 = copy(DT)
t3 = system.time({
  DT3 = DT2[ , lapply(.SD, function(x) if (all(is.na(x))) NULL else x)]
})
rm(DT3)
# 4. Derived table via the .. prefix
t4 = system.time({
  idx = sapply(DT2, function(x) !all(is.na(x)))
  DT3 = DT2[ , ..idx]
})
rm(DT3)
# 5. Derived table via with = FALSE
t5 = system.time({
  idx = sapply(DT2, function(x) !all(is.na(x)))
  DT3 = DT2[ , idx, with = FALSE]
})
# Collect the elapsed time of each method
data.table(
  method = c('which := NULL', 'for set NULL',
             'lapply(.SD)', '..idx', 'with FALSE'),
  time = sapply(list(t1, t2, t3, t4, t5), `[`, 'elapsed')
)