Converting to a balanced panel data set

Time: 2016-09-11 11:32:30

Tags: r data.table panel reshape2

I have an unbalanced panel, as in the following example:

test <- read.table(
text = "
A   2010-01-01  1   rdm
A   2010-01-10  2   dfg
A   2010-01-14  3   fdgfd
A   2010-02-15  4   fdgfd
A   2010-08-17  5   dg
A   2010-12-19  6   dfg
B   2009-01-01  1   dfg
B   2010-01-01  2   ydg
B   2010-01-10  3   fdgfd
B   2010-01-14  4   dfg
B   2010-02-15  5   dfg
", header = FALSE, stringsAsFactors = FALSE)
library(data.table)
setDT(test)
setnames(test, c("ID", "date", "nr", "namecol"))

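I want to balance it in terms of dates, i.e. each individual (A, B, etc.) should have an NA row for every date on which it has no data. I do not know the minimum date per group or the minimum date across groups. The same goes for the maximum, although it may be faster to set the maximum to a specific date than to compute it across groups. The desired output is: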

out <- read.table(
text = "
A   2009-01-01  NA  NA
A   2010-01-01  1   rdm
A   2010-01-10  2   dfg
A   2010-01-14  3   fdgfd
A   2010-02-15  4   fdgfd
A   2010-08-17  5   dg
A   2010-12-19  6   dfg
B   2009-01-01  1   dfg
B   2010-01-01  2   ydg
B   2010-01-10  3   fdgfd
B   2010-01-14  4   dfg
B   2010-02-15  5   dfg
B   2010-08-17  NA  NA
B   2010-12-19  NA  NA
", header = FALSE, stringsAsFactors = FALSE)
setDT(out)
setnames(out, c("ID", "date", "nr", "namecol"))

My data set is very large, so I think it is best to do this in data.table (or plyr, reshape2) or something similar.

1 answer:

Answer 0: (score: 5)

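We do a cross join (CJ) on the unique values of 'ID' and 'date', then join the result with the original dataset after setting its key columns to 'ID' and 'date'. Combinations that do not occur in the original data get NA in the remaining columns: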

setDT(test, key = c("ID", "date"))[CJ(ID, date, unique=TRUE)]
#    ID       date nr namecol
# 1:  A 2009-01-01 NA      NA
# 2:  A 2010-01-01  1     rdm
# 3:  A 2010-01-10  2     dfg
# 4:  A 2010-01-14  3   fdgfd
# 5:  A 2010-02-15  4   fdgfd
# 6:  A 2010-08-17  5      dg
# 7:  A 2010-12-19  6     dfg
# 8:  B 2009-01-01  1     dfg
# 9:  B 2010-01-01  2     ydg
#10:  B 2010-01-10  3   fdgfd
#11:  B 2010-01-14  4     dfg
#12:  B 2010-02-15  5     dfg
#13:  B 2010-08-17 NA      NA
#14:  B 2010-12-19 NA      NA
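
As a rough sketch of the same idea without setting a key on the original table, the cross join can be built first and then joined with on =. The small data set and the names grid and balanced below are made up for illustration and are not part of the original answer; dates are kept as character strings for brevity.

```r
library(data.table)

# Hypothetical stand-in for the question's data (a shortened version).
test <- data.table(
  ID      = c("A", "A", "B", "B", "B"),
  date    = c("2010-01-01", "2010-01-10", "2009-01-01", "2010-01-01", "2010-02-15"),
  nr      = c(1L, 2L, 1L, 2L, 3L),
  namecol = c("rdm", "dfg", "dfg", "ydg", "dfg")
)

# Every ID x date combination that the balanced panel should contain.
grid <- CJ(ID = test$ID, date = test$date, unique = TRUE)

# Join on the two columns; combinations missing from `test`
# come back with NA in nr and namecol.
balanced <- test[grid, on = c("ID", "date")]

# Check: each ID should now have the same number of rows.
balanced[, .N, by = ID]
```

Using on = leaves the row order and key of test untouched, which may matter if the original table is reused elsewhere. With a very large number of IDs and dates the cross join itself can become large, so it is worth checking its expected size first.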

Data

test <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "B", "B", 
"B", "B", "B"), date = structure(c(14610, 14619, 14623, 14655, 
14838, 14962, 14245, 14610, 14619, 14623, 14655), class = "Date"), 
nr = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L), namecol = c("rdm", 
"dfg", "fdgfd", "fdgfd", "dg", "dfg", "dfg", "ydg", "fdgfd", 
"dfg", "dfg")), .Names = c("ID", "date", "nr", "namecol"),
 row.names = c(NA, -11L), class = "data.frame")