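I am working with panel data in long format, with a unique case identifier and a column for the observation time point. There are time-constant variables and time-varying observations: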
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
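For my model, I now need data with a complete record for every id at every time point. In other words, if an observation is missing, I still need a row containing the id, the time, the time-constant variables, and NA for the observed variables (such as the row (102, 2, "female", NA) in the example above). So my question is: how can I add these missing rows?
If anyone could shed some light on this, that would be great.
Many thanks in advance!
Thanks to everyone for the replies. Here is what I ended up doing, which is a mix of several of the suggested approaches. The problem was that I have several time-varying variables per row (obs1-obsn), and I could not get dcast to accommodate that because value.var does not take more than one argument.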
# create all possible combinations of id and year
iddat <- expand.grid(id = unique(dataset$id), time = c(1996, 1999, 2002, 2005, 2008, 2011))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add combinations to existing data; combinations missing so far get NA
dataset_new <- merge(dataset, iddat, all = TRUE, by = c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data, one row per id
# (the id column is needed as the merge key; without unique() every id is
# repeated once per observed row, which duplicates rows after the merge)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
rm(temp)
rm(iddat)
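The same steps can be tried on the toy data from the question. This is only a minimal sketch in base R, not the exact code used on the real data: the second time-varying column obs2 and the object names grid, constants, varying and full are made up for illustration, and it assumes tc1 really is constant within each id.

```r
# Toy data from the question; obs2 is a made-up second time-varying variable
dataset <- data.frame(
  id   = c(101, 101, 101, 102, 102, 103),
  time = c(1, 2, 3, 1, 3, 1),
  tc1  = c("male", "male", "male", "female", "female", "male"),
  obs1 = c(4, 5, 3, 6, 2, 2),
  obs2 = c(10, 11, 12, 13, 14, 15),
  stringsAsFactors = FALSE
)

# every id paired with every observed time point
grid <- expand.grid(id = unique(dataset$id), time = sort(unique(dataset$time)))

# time-constant variables: exactly one row per id
constants <- unique(dataset[c("id", "tc1")])

# time-varying variables only
varying <- dataset[c("id", "time", "obs1", "obs2")]

# left-join the observations onto the grid, then attach the constants by id
full <- merge(grid, varying, by = c("id", "time"), all.x = TRUE)
full <- merge(full, constants, by = "id")

# sort and restore the original column order
full <- full[order(full$id, full$time), c("id", "time", "tc1", "obs1", "obs2")]
rownames(full) <- NULL
print(full)
```

Because the time-constant columns are joined separately by id, any number of time-varying columns can be carried along without reshaping.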
All the best, and thanks again, Matt
Answer 0 (score: 2)
There may be a more elegant way, but here is one option. I assume you need all combinations of id and time, but not of tc1 (i.e., tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast the data to wide format to introduce the NAs, then melt it back to long format.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA
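One detail worth checking after the round trip: melt returns the variable column (here time) as a factor built from the wide column names. With time points 1, 2, 3 that is easy to miss, but with year-like values the factor codes and the actual values differ. The sketch below uses a hand-built factor as a stand-in for that column, with year values borrowed from the question's update; the names time_factor, codes and values are illustrative only.

```r
# Stand-in for the 'time' column after melt(): a factor whose levels are
# the former wide column names (hypothetical values, for illustration)
time_factor <- factor(c("1996", "1999", "2002", "1996"))

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(time_factor)

# going through as.character() recovers the numeric time values
values <- as.integer(as.character(time_factor))

print(codes)
print(values)
```

On the reshaped data the same conversion would be applied to df_long$time before using it as a numeric variable in a model.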
Answer 1 (score: 2)
You can create an empty dataset with every id and time combination and then merge your observed records into it.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
Result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
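In this result the time-constant column tc1 is also NA on the added rows, whereas the question asks for it to be carried over. One possible way to fill it, sketched here in base R under the assumption that tc1 never changes within an id, is a per-id lookup; the names merged and lookup are made up for this example.

```r
# Rebuild the example from this answer
id <- rep(1:3, each = 3)
time <- rep(1:3, 3)
df <- data.frame(id, time)
test <- df[c(1, 3, 5, 7, 9), ]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <- c(4, 5, 3, 6, 2)

merged <- merge(df, test, by = c("id", "time"), all.x = TRUE)

# One row per id with its time-constant value
# (assumes tc1 is identical across all observed rows of an id)
lookup <- unique(test[c("id", "tc1")])

# Overwrite tc1 for every row, including the newly added ones
merged$tc1 <- lookup$tc1[match(merged$id, lookup$id)]
print(merged)
```

Only obs1 remains NA on the rows that were missing from the observed data.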