I want to fill in the missing years for each ID. For the small example below, this is easy.
library(data.table)

# Create example data table.
dt <- data.table(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                 value = rnorm(10),
                 time = c(1, 2, 3, 3, 5, 6, 7, 2, 3, 6))
# Sort by time variable.
setkey(dt, time)
# Fill in the gaps.
system.time(
  dt <- dt[, .SD[J(min(time):max(time))], by=id]
)
# Sort by ID and time, then print.
setkey(dt, id, time)[]
which gives
> dt
    id      value time
 1:  1 -0.9062227    1
 2:  1  2.0822289    2
 3:  1  0.5073055    3
 4:  2  0.3673813    3
 5:  2         NA    4
 6:  2  0.3726807    5
 7:  2 -0.7381199    6
 8:  2  0.7048979    7
 9:  3 -0.7852230    2
10:  3  0.2327946    3
11:  3         NA    4
12:  3         NA    5
13:  3 -0.3430340    6
The years are now consecutive within each ID, and NA has been added for the missing values, which is exactly what I want.
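As a quick sanity check, the property described above (no gaps within an ID, NA only on the filled-in rows) can be tested directly. The sketch below re-enters the printed output by hand; the names filled and check are illustrative only and not part of the original post.

```r
library(data.table)

# The filled table, copied from the printed output above.
filled <- data.table(
  id    = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  value = c(-0.9062227, 2.0822289, 0.5073055, 0.3673813, NA, 0.3726807,
            -0.7381199, 0.7048979, -0.7852230, 0.2327946, NA, NA, -0.3430340),
  time  = c(1, 2, 3, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6)
)

# Per id: are successive times exactly one apart, and how many rows were filled in?
check <- filled[, .(consecutive = all(diff(time) == 1),
                    n_missing   = sum(is.na(value))),
                by = id]
print(check)
```

Every ID should report consecutive = TRUE, with n_missing equal to the number of years that were absent from the input.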
However, this solution takes forever on larger data.tables.
# Create big example data table
n <- 1e5
dt <- data.table(id = rep(1:(n/4), each=4),
                 value = rnorm(n),
                 year = sample(1997:2001, n, replace=TRUE))
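# Remove duplicate years within each ID.
# (Since data.table 1.9.8, unique() compares all columns unless by= is given.)
setkey(dt, id, year)
dt <- unique(dt, by = c("id", "year"))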
# Fill in the gaps.
setkey(dt, year)
system.time(
  dt2 <- dt[, .SD[J(min(year):max(year))], by=id]
)
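For those ~100,000 rows, it takes about 20 seconds.
I want to do this for a data.table with 100 million rows. Surely there must be a faster way?
Answer 0 (score: 4)
This might help: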
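dtN <- copy(dt)
setkey(dtN, id, year)
system.time({
  # Build every (id, year) pair between each id's first and last year.
  dtN2 <- dtN[, list(year=min(year):max(year)), by=id]
  setkey(dtN2, id, year)
  # One keyed join; pairs that are absent from dtN get NA in value.
  res <- dtN[dtN2]
})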
# user system elapsed
# 0.047 0.000 0.048
dim(res)
#[1] 122958 3
# The approach from the question, for comparison.
setkey(dt, year)
system.time(
  dt2 <- dt[, .SD[J(min(year):max(year))], by=id]
)
# user system elapsed
# 20.078 0.035 20.109
dim(dt2)
#[1] 122958 3
# Data used for the timings above.
n <- 1e5
set.seed(24)
dt <- data.table(id = rep(1:(n/4), each=4),
                 value = rnorm(n),
                 year = sample(1997:2001, n, replace=TRUE))
dt <- unique(dt)
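The two-step idea in this answer (build the full id/year grid once, then do a single join) can also be written without setting keys, using data.table's on= argument. The following is a minimal sketch on a small made-up table, not a benchmark; dt_small, grid and filled are illustrative names that do not appear in the original post, and the values are arbitrary.

```r
library(data.table)

# Small made-up table with gaps in `year` for ids 2 and 3.
dt_small <- data.table(
  id    = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  value = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
  year  = c(1L, 2L, 3L, 3L, 5L, 6L, 7L, 2L, 3L, 6L)
)

# Step 1: every (id, year) pair between each id's first and last year.
grid <- dt_small[, .(year = min(year):max(year)), by = id]

# Step 2: a single join against the grid; pairs with no match in
# dt_small come back with NA in `value`.
filled <- dt_small[grid, on = .(id, year)]
print(filled)
```

The speed-up presumably comes from doing one join over the whole table instead of subsetting .SD once per id, so the per-group overhead of the original approach is paid only in the cheap grid-building step. This sketch assumes each (id, year) pair occurs at most once in the input; duplicated pairs would each produce a row in the result.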