I have an unbalanced panel, as in the example below:
test <- read.table(
text = "
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
",header=F)
library(data.table)
setDT(test)
names(test) <- c("ID", "date", "nr", "namecol")
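I would like to balance it on dates, i.e. every individual (A, B, etc.) should have a row of NAs for each date on which it has no data. I do not know the minimum date within each group or across groups in advance. The same applies to the maximum, although it may be faster to pick a specific date as the maximum than to compute it across groups. The desired output is: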
out <- read.table(
text = "
A 2009-01-01 NA NA
A 2010-01-01 1 rdm
A 2010-01-10 2 dfg
A 2010-01-14 3 fdgfd
A 2010-02-15 4 fdgfd
A 2010-08-17 5 dg
A 2010-12-19 6 dfg
B 2009-01-01 1 dfg
B 2010-01-01 2 ydg
B 2010-01-10 3 fdgfd
B 2010-01-14 4 dfg
B 2010-02-15 5 dfg
B 2010-08-17 NA NA
B 2010-12-19 NA NA
",header=F)
setDT(out)
names(out) <- c("ID", "date", "nr", "namecol")
My dataset is quite large, so I think it would be best to do this in data.table (or plyr, reshape2) or something similar.
Answer 0 (score: 5)
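We do a cross join with CJ on the unique values of 'ID' and 'date', and then join the result with the original dataset after setting the key columns to 'ID' and 'date'. Every ID/date combination that is missing from the original data then appears as a row with NA in the remaining columns.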
setDT(test, key = c("ID", "date"))[CJ(ID, date, unique=TRUE)]
# ID date nr namecol
# 1: A 2009-01-01 NA NA
# 2: A 2010-01-01 1 rdm
# 3: A 2010-01-10 2 dfg
# 4: A 2010-01-14 3 fdgfd
# 5: A 2010-02-15 4 fdgfd
# 6: A 2010-08-17 5 dg
# 7: A 2010-12-19 6 dfg
# 8: B 2009-01-01 1 dfg
# 9: B 2010-01-01 2 ydg
#10: B 2010-01-10 3 fdgfd
#11: B 2010-01-14 4 dfg
#12: B 2010-02-15 5 dfg
#13: B 2010-08-17 NA NA
#14: B 2010-12-19 NA NA
Data used in the answer (date stored as class Date):
test <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B"), date = structure(c(14610, 14619, 14623, 14655,
14838, 14962, 14245, 14610, 14619, 14623, 14655), class = "Date"),
nr = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L), namecol = c("rdm",
"dfg", "fdgfd", "fdgfd", "dg", "dfg", "dfg", "ydg", "fdgfd",
"dfg", "dfg")), .Names = c("ID", "date", "nr", "namecol"),
row.names = c(NA, -11L), class = "data.frame")
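As a minimal sketch of the same cross-join idea without setting a key, the join columns can be passed through the on argument instead, which leaves the original table's row order untouched. The object names grid and balanced are hypothetical, and the sample data is rebuilt here only so that the snippet runs on its own; this is one possible variant, not necessarily what the answer's author would recommend.

```r
library(data.table)

# Same sample data as above, rebuilt so the snippet is self-contained
test <- data.table(
  ID = c(rep("A", 6), rep("B", 5)),
  date = as.Date(c("2010-01-01", "2010-01-10", "2010-01-14", "2010-02-15",
                   "2010-08-17", "2010-12-19", "2009-01-01", "2010-01-01",
                   "2010-01-10", "2010-01-14", "2010-02-15")),
  nr = c(1:6, 1:5),
  namecol = c("rdm", "dfg", "fdgfd", "fdgfd", "dg", "dfg",
              "dfg", "ydg", "fdgfd", "dfg", "dfg")
)

# All ID x date combinations built from the unique values of each column.
# The columns are named explicitly so the join below can match on them.
grid <- test[, CJ(ID = ID, date = date, unique = TRUE)]

# Join on ID and date: every row of grid is kept, and combinations that
# are absent from test get NA in nr and namecol
balanced <- test[grid, on = .(ID, date)]

print(balanced)
```

If only dates up to some cutoff are wanted, the date vector handed to CJ could be filtered first (for example date[date <= cutoff], with cutoff being a Date chosen by the user), which keeps the grid smaller on large data.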