I have a data table in the following format:
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
In this data table I want to fill in all the NAs in c2 that lie between two identical values, as shown below:
id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 10
1 1 10
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 11
1 1 11
1 1 11
2 1 NA
2 1 12
2 1 12
2 1 12
2 1 12
Answer 0 (score: 2)
You can use a for loop together with which():
df=data.frame(id = c(rep(1,12)),c2 = c(NA,NA,10,NA,NA,10, NA,NA,11,NA,11,NA))
Find the unique non-NA values of c2:
vals=unique(df[which(!is.na(df$c2)),'c2'])
Loop over the unique values and, for each one, overwrite every observation between its first and last occurrence:
for(i in vals){
df[min(which(df$c2==i)):max(which(df$c2==i)),'c2']=i
}
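As a quick check of the loop above, here is a minimal sketch that wraps the same logic in a function and applies it to the c2 vector from this answer. The name fill_between_loop is hypothetical and chosen only for this illustration.

```r
# Fill NAs lying between the first and last occurrence of each distinct value.
# fill_between_loop is a hypothetical name used only for this sketch.
fill_between_loop <- function(x) {
  for (v in unique(x[!is.na(x)])) {
    pos <- which(x == v)          # positions where the value occurs (NAs are dropped)
    x[min(pos):max(pos)] <- v     # overwrite everything from first to last occurrence
  }
  x
}

c2 <- c(NA, NA, 10, NA, NA, 10, NA, NA, 11, NA, 11, NA)
filled <- fill_between_loop(c2)
print(filled)
# [1] NA NA 10 10 10 10 NA NA 11 11 11 NA
```

Leading and trailing NAs stay untouched because they lie outside every first-to-last range.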
Answer 1 (score: 2)
Besides David's approach, which works directly with row indices, there is another data.table approach that uses a non-equi join:
# coerce to data.table
setDT(DT)[
# append unique row id
, rn := .I][
# non-equi join on row ids
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][
# remove row id column
, rn := NULL][]
    id c1 c2
 1:  1  1 NA
 2:  1  1 NA
 3:  1  1 10
 4:  1  1 10
 5:  1  1 10
 6:  1  1 10
 7:  1  1 NA
 8:  1  1 NA
 9:  1  1 11
10:  1  1 11
11:  1  1 11
12:  1  1 11
13:  2  1 NA
14:  2  1 12
15:  2  1 12
16:  2  1 12
17:  2  1 12
The expression
DT[!is.na(c2), .(rmin = min(rn), rmax = max(rn)), by = c2]
returns, for each value of c2, the range of row ids it spans:
   c2 rmin rmax
1: 10    3    6
2: 11    9   12
3: 12   14   17
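The implicit assumption is that the row id ranges do not overlap, which requires every "gap" to be associated with a unique c2 value. The same limitation affects the other solutions 1, 2.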
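To make the assumption concrete, the following base R sketch computes the first and last row of each distinct value for the c2 column of the extended dataset DT2 given in this answer, where the value 10 appears in two separate stretches. It only illustrates the problem and is not part of the original answer.

```r
# c2 column of the extended dataset DT2: the value 10 occurs in two separate stretches
c2 <- c(NA, NA, 10, NA, NA, 10, NA, NA, 11, NA, NA, 11,
        NA, 12, NA, NA, 12, NA, NA, 10, NA, NA, 10, NA, NA)

# First and last row of each distinct value (NAs are dropped by split())
ranges <- t(sapply(split(seq_along(c2), c2), range))
colnames(ranges) <- c("rmin", "rmax")
print(ranges)

# The range for 10 runs from row 3 to row 23 and therefore
# swallows the ranges of 11 (rows 9 to 12) and 12 (rows 14 to 17).
```

Grouping by value alone would thus overwrite the 11s and 12s with 10 or the other way round, depending on the order in which the updates are applied.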
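The code can be improved with rleid() to handle cases that violate the assumption above.
Using rleid(), we can tell separate gaps apart even when they share the same c2 value. For example, for the second sample dataset DT2 (once the row id column rn has been added):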
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)]
   rleid c2 rmin rmax
1:     1 10    3    6
2:     2 11    9   12
3:     3 12   14   17
4:     4 10   20   23
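For readers unfamiliar with run-length ids, this small base R sketch reproduces the grouping on the non-NA c2 values of DT2. It builds the ids with rle() instead of data.table's rleid(), so it runs without any packages; for a single vector the two should give the same ids.

```r
# Non-NA c2 values of DT2 in row order
x <- c(10, 10, 11, 11, 12, 12, 10, 10)

# A run-length id increases by one every time the value changes
r <- rle(x)
run_id <- rep(seq_along(r$lengths), times = r$lengths)
print(run_id)
# [1] 1 1 2 2 3 3 4 4
```

The second stretch of 10s gets its own id (4), so it is no longer merged with the first stretch (id 1).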
The complete code:
setDT(DT2)[, rn := .I][
DT2[!is.na(c2), .(c2 = first(c2), rmin = min(rn), rmax = max(rn)), by = rleid(c2)],
on = .(rn >= rmin, rn <= rmax), c2 := i.c2][, rn := NULL][]
    id c1 c2
 1:  1  1 NA
 2:  1  1 NA
 3:  1  1 10
 4:  1  1 10
 5:  1  1 10
 6:  1  1 10
 7:  1  1 NA
 8:  1  1 NA
 9:  1  1 11
10:  1  1 11
11:  1  1 11
12:  1  1 11
13:  2  1 NA
14:  2  1 12
15:  2  1 12
16:  2  1 12
17:  2  1 12
18:  2  1 NA
19:  2  1 NA
20:  2  1 10
21:  2  1 10
22:  2  1 10
23:  2  1 10
24:  2  1 NA
25:  2  1 NA
    id c1 c2
Data:
library(data.table)
DT <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12")
Extended dataset (note the repeated occurrence of c2 == 10):
DT2 <- fread("id c1 c2
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 10
1 1 NA
1 1 NA
1 1 11
1 1 NA
1 1 NA
1 1 11
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 12
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA
2 1 10
2 1 NA
2 1 NA")
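Answer 2 (score: 1)
OK (new/edited answer): we can exploit the fact that the desired property of the solution is that filling down (forward) should produce the same result as filling up (backward).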
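The code that belonged to this answer is not preserved here, so what follows is only a minimal base R sketch of the idea, not the author's implementation. The helper names locf and fill_where_agree are hypothetical. An NA is filled only when carrying the last value forward and carrying the next value backward give the same result.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-NA values seen so far (0 before the first)
  c(NA, x[!is.na(x)])[idx + 1]      # position 1 holds NA for the leading stretch
}

# Hypothetical helper: keep a filled value only where both directions agree.
fill_where_agree <- function(x) {
  down <- locf(x)                   # fill down (forward)
  up <- rev(locf(rev(x)))           # fill up (backward)
  agree <- !is.na(down) & !is.na(up) & down == up
  ifelse(agree, down, NA)
}

c2 <- c(NA, NA, 10, NA, NA, 10, NA, NA, 11, NA, NA, 11, NA, 12, NA, NA, 12)
result <- fill_where_agree(c2)
print(result)
# [1] NA NA 10 10 10 10 NA NA 11 11 11 11 NA 12 12 12 12
```

Because the comparison is local, this sketch also copes with a value that reappears later in the column, such as the second stretch of 10s in the extended dataset.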