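I have a large dataset that contains many NAs. I want to find the rows where the first and the last NA occur. For example, for column A, I would like the output to be the second row (the last NA before the numbers) and the fifth row (the first NA after the numbers). My code is shown below; it does not work well.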
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
Answer 0 (score: 1)
I believe this function may be what you are looking for:
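first_and_last_na_row <- function(DT, col) {
  library(data.table)
  # number the consecutive runs of NA / non-NA values in the chosen column
  data.table(DT)[, grp := rleid(is.na(get(col)))][
    # last NA row of the first run and first NA row of the last run
    , rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
            first(.SD[is.na(get(col)) & grp == max(grp)]))][
    # keep only rows with a valid ID, then drop the helper column
    !is.na(ID)][, grp := NULL][]
}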
which returns
first_and_last_na_row(DT, "A")
   ID  A B  C
1:  2 NA 2  2
2:  5 NA 6 NA
first_and_last_na_row(DT, "B")
   ID  A  B C
1:  1 NA NA 3
first_and_last_na_row(DT, "C")
   ID A B  C
1:  4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in the case of
DT
   ID  A  B  C
1:  1 NA NA  3
2:  2 NA  2  2
3:  3  3  3  1
4:  4  4  5 NA
5:  5 NA  6 NA
or
first_and_last_na_row(DT2, "D")
   ID  A  B C  D
1:  1 NA NA 3 NA
in the case of Akrun's (simplified) example
DT2
   ID  A  B  C  D
1:  1 NA NA  3 NA
2:  2 NA  2  2  2
3:  3  3  3  1 NA
4:  4  4  5 NA NA
5:  5 NA  6 NA  4
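For reproducibility, the two example tables could be rebuilt roughly as follows. This is only a sketch: the values are copied from the printed output above, and storing them as doubles (rather than integers) is an assumption.

```r
library(data.table)

# question's data, values taken from the printed DT above
DT <- data.table(
  ID = 1:5,
  A  = c(NA, NA, 3, 4, NA),
  B  = c(NA, 2, 3, 5, 6),
  C  = c(3, 2, 1, NA, NA)
)

# simplified example with an additional column D (values from the printed DT2)
DT2 <- copy(DT)[, D := c(NA, 2, NA, NA, 4)]
```

copy() is used so that adding D does not modify DT by reference.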
melt()
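The OP has commented that his production dataset consists of 4000 columns and 192 rows and that he needs the indices to clean another dataset. He tried a for loop over all columns, which was very slow.
Therefore, I suggest reshaping the dataset from wide to long format and using data.table's efficient grouping mechanism: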
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
  # add grouping variable to distinguish continuous streaks of NA/non-NA values
  # for each variable
  , grp := rleid(variable, is.na(value))][
  # set sort order just for convenience, not essential
  , setorder(.SD, variable, ID)]
long
    ID variable value grp
 1:  1        A    NA   1
 2:  2        A    NA   1
 3:  3        A     3   2
 4:  4        A     4   2
 5:  5        A    NA   3
 6:  1        B    NA   4
 7:  2        B     2   5
 8:  3        B     3   5
 9:  4        B     5   5
10:  5        B     6   5
11:  1        C     3   6
12:  2        C     2   6
13:  3        C     1   6
14:  4        C    NA   7
15:  5        C    NA   7
16:  1        D    NA   8
17:  2        D     2   9
18:  3        D    NA  10
19:  4        D    NA  10
20:  5        D     4  11
Now we get, for each variable, the indices of its starting or ending NA sequence, respectively (if any), by
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
   variable ID
1:        A  1
2:        A  2
3:        B  1
4:        D  1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
   variable ID
1:        A  5
2:        C  4
3:        C  5
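Note that this returns all indices of the starting or ending NA sequences, which may be more convenient for the subsequent cleaning of the other dataset. If only the last and the first index are required, this can be achieved by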
long[long[, is.na(value) & grp == min(grp), by = variable]$V1, .(ID = max(ID)), by = variable]
   variable ID
1:        A  2
2:        B  1
3:        D  1
long[long[, is.na(value) & grp == max(grp), by = variable]$V1, .(ID = min(ID)), by = variable]
   variable ID
1:        A  5
2:        C  4
I have tested this approach with a dummy dataset of 192 rows by 4000 columns. The whole operation took less than a second.
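A dummy dataset of that shape could be generated and timed along the following lines. This is only a sketch and not the data used for the test mentioned above: the seed, the value range, the roughly 30 % share of NA and the names dummy, start_na and end_na are all assumptions made for illustration.

```r
library(data.table)

set.seed(1)                       # arbitrary seed (assumption)
n_row <- 192L
n_col <- 4000L

# integer matrix with roughly 30 % NA (assumed share), one column per variable
m <- matrix(sample(c(NA, 1:9), n_row * n_col, replace = TRUE,
                   prob = c(0.3, rep(0.7 / 9, 9))),
            nrow = n_row)
dummy <- as.data.table(m)         # columns are named V1, V2, ...
dummy[, ID := .I]                 # row index used as ID

timing <- system.time({
  # reshape from wide to long format and number the NA / non-NA runs
  long <- melt(dummy, id.vars = "ID")
  long[, grp := rleid(variable, is.na(value))]
  # all row indices of the starting / ending NA run per column, if any
  start_na <- long[, .(ID = ID[is.na(value) & grp == min(grp)]), by = variable]
  end_na   <- long[, .(ID = ID[is.na(value) & grp == max(grp)]), by = variable]
})
print(timing)
```

Here ID is subset directly instead of using which(); with IDs running from 1 to the number of rows, both give the same indices.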