Question

I'm working in R with data.table. I have a dataset that looks like this:

ID   country_id    weight
1    BGD           56
1    NA            57
1    NA            63
2    SA            12
2    NA            53
2    SA            54

Where country_id is NA, I need to replace it with the non-NA country_id value that belongs to the same ID. I'd like the dataset to look like this:

ID   country_id    weight
1    BGD           56
1    BGD           57
1    BGD           63
2    SA            12
2    SA            53
2    SA            54

This dataset contains millions of IDs, so fixing each ID by hand is not an option.

Thanks for your help!

Edit: Solved!

I used the following code: dt[, country_id := country_id[!is.na(country_id)][1], by = ID]
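For reference, here is a minimal sketch of that one-liner applied to the sample data from the question, assuming the table is called dt. Note that it overwrites every row of a group with the first non-NA value, and that an ID with no non-NA value at all simply stays NA.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  ID         = c(1, 1, 1, 2, 2, 2),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56, 57, 63, 12, 53, 54)
)

# Within each ID, set country_id to the first non-NA value of that group.
# The length-1 result is recycled over all rows of the group; if the group
# has no non-NA value, indexing the empty vector with [1L] gives NA.
dt[, country_id := country_id[!is.na(country_id)][1L], by = ID]

print(dt)
```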

Answer 1

Another option is to use a join:

DT[is.na(country_id), country_id := 
    DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]]

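Explanation:

DT[is.na(country_id) subsets the dataset to the rows that have NA in the country_id column.
.SD is the subset of data from the previous step (also a data.table).
DT[!is.na(country_id)][.SD, on=.(ID) joins .SD to DT[!is.na(country_id)], using ID as the key.
j=country_id returns the country_id column of DT[!is.na(country_id)]; if an ID has multiple matches, mult="first" picks the first one.
country_id := updates the country_id column, in the rows of DT where is.na(country_id) is TRUE, with the result of the join.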
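The steps above can be unrolled into separate statements. This is only an illustrative sketch on the question's sample data; the intermediate names known, missing_rows and filled are made up here and do not appear in the original one-liner (missing_rows plays the role of .SD).

```r
library(data.table)

DT <- data.table(
  ID         = c(1, 1, 1, 2, 2, 2),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56, 57, 63, 12, 53, 54)
)

# Lookup table: rows that already have a country_id
known <- DT[!is.na(country_id)]

# Rows that need filling (the role played by .SD in the one-liner)
missing_rows <- DT[is.na(country_id)]

# For each row of missing_rows, take country_id from the first row of
# known with the same ID; IDs without any match would yield NA
filled <- known[missing_rows, on = .(ID), mult = "first", country_id]

# Write the looked-up values back into the NA rows only
DT[is.na(country_id), country_id := filled]

print(DT)
```

Unlike the grouped one-liner, this only touches the rows where country_id is NA, which is likely part of why it does less work on large data.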

Timing code, using similar but larger data than in Andrew's answer:

library(data.table)
set.seed(42)

nr <- 1e7
dt <- data.table(ID = rep(1:(nr/4), each = 4),
    country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
    weight = sample(10:100, nr, TRUE))
dt[sample(1:nr, nr/2), country_id := NA]
DT <- copy(dt)

microbenchmark::microbenchmark(
    first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1L], by = ID],
    use_join=DT[is.na(country_id), country_id := DT[!is.na(country_id)][.SD, on=.(ID), mult="first", country_id]],
    times = 1L
)

Timings:

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373 3282.1373     1
         use_join  554.5314  554.5314  554.5314  554.5314  554.5314  554.5314     1

Answer 2

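Going by the answers and suggestions in the comments, you have a few options. I simulated a dataset with 1,000,000 rows and 30% of country_id missing to see which approach scales best for your case.

In this benchmark, the approach that scaled best was replacing each NA with the first non-missing value that has the same ID: dt[, country_id := country_id[!is.na(country_id)][1], by = ID].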

Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
 first_nonmissing  253.0039  267.0272  284.3988  271.4015  274.5101  405.2004    10
            tidyr  943.6658  951.9638  970.7185  960.6233  971.0660 1069.3023    10
          na.locf 7173.9556 7218.2757 7267.6968 7271.0279 7325.6820 7344.9142    10

Benchmark code:

microbenchmark::microbenchmark(
  first_nonmissing = dt[, country_id := country_id[!is.na(country_id)][1], by = ID],
  tidyr = tidyr::fill(dplyr::group_by(dt, ID), country_id),
  na.locf = dt[, country_id := zoo::na.locf(country_id, na.rm = FALSE), by = ID],
  times = 10
)
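One thing to keep in mind when reading these numbers: := updates dt by reference, so once one of the data.table expressions has run, most NAs are already filled for the repetitions that follow. The sketch below is not the benchmark used above; it is a rough illustration, on a smaller hypothetical dataset, of timing two of the approaches from this thread on separate copies with base system.time. The names base_dt, dt_group and dt_join are assumptions made for this example.

```r
library(data.table)
set.seed(42)

# Smaller hypothetical dataset (40,000 rows) so the sketch runs quickly
n <- 4e4
base_dt <- data.table(
  ID         = rep(seq_len(n / 4), each = 4),
  country_id = rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4, length.out = n),
  weight     = sample(10:100, n, replace = TRUE)
)
base_dt[sample(n, round(0.3 * n)), country_id := NA]

# Each method gets its own copy, so neither sees data the other has filled
dt_group <- copy(base_dt)
dt_join  <- copy(base_dt)

t_group <- system.time(
  dt_group[, country_id := country_id[!is.na(country_id)][1L], by = ID]
)
t_join <- system.time(
  dt_join[is.na(country_id),
          country_id := dt_join[!is.na(country_id)][.SD, on = .(ID), mult = "first", country_id]]
)

# Elapsed seconds for each approach on this machine
print(c(group = t_group[["elapsed"]], join = t_join[["elapsed"]]))
```

Because every ID in the simulated data has a single country, both approaches should end up with the same filled column; only the time taken differs.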

Data:

library(data.table)
set.seed(42)

dt <- data.table(ID = rep(1:250000, each = 4),
                 country_id = rep(rep(c("BGD", "SA", "USA", "DEN", "THI"), each = 4)),
                 weight = sample(10:100, 1e6, replace = TRUE))

dt$country_id[sample(1:1e6, 3e5)] <- NA

Answer 3

Hopefully the following code helps you fill in the NAs:

res <- Reduce(rbind,
       lapply(split(df,df$ID), function(v) 
         {v$country_id <- head(v$country_id[!is.na(v$country_id)],1);v}))

which yields

  ID country_id weight
1  1        BGD     56
2  1        BGD     57
3  1        BGD     63
4  2         SA     12
5  2         SA     53
6  2         SA     54
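The snippet above assumes a data frame named df already exists, and splitting and re-binding one piece per ID may get slow with millions of IDs. As a hedged alternative (a swapped-in technique, not the answerer's code), the same group-wise fill can be sketched with base R's ave(), which applies a function within each ID and returns a vector of the original length:

```r
# Sample data from the question as a plain data frame
df <- data.frame(
  ID         = c(1, 1, 1, 2, 2, 2),
  country_id = c("BGD", NA, NA, "SA", NA, "SA"),
  weight     = c(56, 57, 63, 12, 53, 54),
  stringsAsFactors = FALSE
)

# Within each ID, replace every value with the first non-NA country_id.
# x[!is.na(x)][1] is NA when a group has no non-NA value, so such
# groups stay NA instead of raising an error.
df$country_id <- ave(df$country_id, df$ID,
                     FUN = function(x) rep(x[!is.na(x)][1], length(x)))

print(df)
```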

How to replace NA values with the previous non-NA value for the same ID