Iteratively replacing NAs using data.table in R

Time: 2014-02-09 18:13:29

Tags: r data.table

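I am trying to replace NAs with a random sample from the appropriate group. For example, in row 2 the NA comes from 'France', with Age and Time of '20-30' and '30-40'. I therefore want to draw a random sample from the Response column of all other 'France', '20-30', '30-40' observations.

The code below works, but every value is replaced with the same random sample. For example, if there is more than one 'France', '20-30', '30-40' NA, their R2 values are all the same.

I want each NA to be sampled independently, but data.table seems to do the sampling "in one go", so I cannot achieve this. Any ideas?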

DT <- data.table(mydf, key = "Country,Age,Time")
DT[, R2 := ifelse(is.na(Response), sample(na.omit(Response), 1), 
                  Response), by = key(DT)]
DT
#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  1
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  0

where mydf is

mydf <- structure(list(Index = 1:7, Country = c("Germany", "Germany", 
"Germany", "Germany", "France", "France", "France"), Age = c("20-30", 
"20-30", "20-30", "20-30", "20-30", "20-30", "20-30"), Time = c("15-20", 
"15-20", "15-20", "15-20", "30-40", "30-40", "30-40"), Response = c(1L, 
NA, 1L, 0L, 1L, NA, 2L)), .Names = c("Index", "Country", "Age", 
"Time", "Response"), class = "data.frame", row.names = c(NA, -7L))

2 Answers:

Answer 0 (score: 2)

library(data.table)
set.seed(1234)   # fix the RNG so the sampled values are reproducible
DT <- data.table(mydf, key = "Country,Age,Time")

Step 1

# Draw one candidate per row from the non-NA responses of its group
DT[, R2 := sample(na.omit(Response), length(Response), replace = TRUE), 
   by = key(DT)]

DT

#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  0
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  1

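Edit

Step 2

In the first step you sample within each group (by = ...) to get a value of R2 for every row. In the second step you overwrite R2 with the original Response wherever Response is not NA, so only the NA rows keep their sampled value.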

DT[!is.na(Response), R2 := Response]

DT

#    Index Country   Age  Time Response R2
# 1:     5  France 20-30 30-40        1  1
# 2:     6  France 20-30 30-40       NA  2
# 3:     7  France 20-30 30-40        2  2
# 4:     1 Germany 20-30 15-20        1  1
# 5:     2 Germany 20-30 15-20       NA  0
# 6:     3 Germany 20-30 15-20        1  1
# 7:     4 Germany 20-30 15-20        0  0
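
One caveat with `sample(na.omit(Response), ...)`: as documented in `?sample`, when the first argument is a single number of at least 1, `sample()` draws from `1:x` rather than from that one value. A group with exactly one observed response (say 5) could therefore be filled with values that never occur in it. A possible guard, sketched on made-up data, is to sample indices with `sample.int()`. The helper name `resample` and the table `toy` are illustrative assumptions, and the sketch assumes every group has at least one non-NA value:

```r
library(data.table)

# Hypothetical helper: always samples from the elements of x,
# even when x has length 1
resample <- function(x, size, replace = TRUE) {
  x[sample.int(length(x), size, replace = replace)]
}

set.seed(1)
# Made-up data: group "a" has a single observed value, 5
toy <- data.table(g = c("a", "a", "a", "b", "b", "b"),
                  Response = c(5L, NA, NA, 1L, NA, 2L))

# Step 1: draw a candidate for every row within each group
toy[, R2 := resample(Response[!is.na(Response)], .N), by = g]
# Step 2: restore the observed values
toy[!is.na(Response), R2 := Response]
```

With plain `sample()`, the NA rows of group "a" could receive any value from 1 to 5; with the index-based helper they can only receive 5.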

Answer 1 (score: 2)

I would do it like this:

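# Flag the rows whose Response is missing
DT[, is_na := is.na(Response)]
# For each group, draw as many non-NA responses as the group has NAs
nas <- DT[, sample(Response[!is_na], sum(is_na), TRUE) ,
             by=list(Country, Age, Time)]$V1
# Copy Response, then fill the NA rows in order
# (assumes DT is sorted by the grouping columns, as the keyed DT above is)
DT[, R2 := Response][(is_na), R2 := nas]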
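
The snippet above assigns `nas` by position, which lines up because DT is keyed (and therefore sorted) by the grouping columns. If the table were not sorted that way, one possible variant is to do the fill inside a single grouped `:=`, so each group's draws can only land in that group's rows. This is a sketch rather than part of the original answer; it rebuilds `mydf` in a compact form and assumes every group has at least one non-NA response:

```r
library(data.table)

# Same data as mydf in the question, written compactly
mydf <- data.frame(
  Index    = 1:7,
  Country  = rep(c("Germany", "France"), c(4, 3)),
  Age      = "20-30",
  Time     = rep(c("15-20", "30-40"), c(4, 3)),
  Response = c(1L, NA, 1L, 0L, 1L, NA, 2L),
  stringsAsFactors = FALSE
)

set.seed(1234)
DT <- as.data.table(mydf)   # deliberately left unkeyed

DT[, R2 := {
  miss <- is.na(Response)   # NA positions within this group
  pool <- Response[!miss]   # observed values of this group
  out  <- Response
  # One independent draw per NA; sampling indices avoids the
  # sample(x, ...) surprise when pool has length 1
  out[miss] <- pool[sample.int(length(pool), sum(miss), replace = TRUE)]
  out
}, by = .(Country, Age, Time)]
```

Observed responses are carried over unchanged, and each NA gets its own draw from the observed values of its Country/Age/Time group.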