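Creating missing observations in panel data

Asked: 2014-06-26 18:18:49

Tags: r panel

I am working with panel data in long format, with a unique case identifier and a column for the time point of each observation. There are time-constant variables and time-varying observations: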

    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2

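For my model, I now need data with a complete record for every id at every time point. In other words, if an observation is missing, I still need a row containing the id, the time, the time-constant variables, and NA for the observed variables (such as the row (102, 2, "female", NA) in the example above). So my questions are: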

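  1. How can I determine whether a row with a given combination of id and time already exists in my dataset?
  2. If it does not, how can I add that row, carrying over the time-constant variables and filling the observed variables with NA?

It would be great if someone could shed some light on this.

Many thanks in advance!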


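Edit

Thanks everyone for your replies. Here is what I ended up doing, which is a mix of several of the suggested approaches. The catch is that I have several time-varying variables (obs1 to obsn) per row, and I could not get dcast to accommodate that: value.var does not take more than one argument.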

    # create all possible permutations of id and year
    iddat = expand.grid(id = unique(dataset$id), time = (c(1996,1999,2002,2005,2008,2011)))
    iddat <- iddat[order(iddat$id, iddat$time), ]
    
    # add permutations to existing data, combinations so far missing are NA
    dataset_new <- merge(dataset, iddat, all.x=TRUE, all.y=TRUE, by=c("id", "time"))
    
    # drop time-constant variables from data
    dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
    
    # merge back time-constant variables, one row per id
    # (merging the per-observation table instead would duplicate rows
    # for every id that has more than one observation)
    temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
    dataset_new <- merge(dataset_new, temp, by = "id")
    
    # sort
    dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
    
    rm(temp)
    rm(iddat)
    

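All the best, and thanks again, Matt

2 Answers:

Answer 0 (score: 2)

There may be a more elegant way, but here is one option. I am assuming you need all combinations of `id` and `time`, but not of `tc1` (that is, `tc1` is tied to `id`).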

# your data
df <- read.table(text = "    id  time    tc1     obs1
1   101 1       male    4
2   101 2       male    5
3   101 3       male    3
4   102 1       female  6
5   102 3       female  2
6   103 1       male    2", header = TRUE)

First cast the data to wide format to introduce the NAs, then melt it back to long.

library('reshape2')

df_wide <- dcast(
  df, 
  id + tc1 ~ time,
  value.var = "obs1", 
  fill = NA
)

df_long <- melt(
  df_wide, 
  id.vars = c("id","tc1"), 
  variable.name = "time",
  value.name = "obs1"
)

# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
   id    tc1 time obs1
1 101   male    1    4
4 101   male    2    5
7 101   male    3    3
2 102 female    1    6
5 102 female    2   NA
8 102 female    3    2
3 103   male    1    2
6 103   male    2   NA
9 103   male    3   NA

Answer 1 (score: 2)

You can create an empty dataset and then merge your existing records into it.

 # Create the full id/time grid. For your actual data, you would replace c(1:3)
 # with unique(yourdata$id) and adjust the time periods to match your data.
 id <- rep(c(1:3), each = 3)
 time <- rep(c(1:3), 3)
 df <- data.frame(id,time)


 test <- df[c(1,3,5,7,9),]
 test$tc1 <- c("male", "male", "female", "male", "male")
 test$obs1 <-c(4,5,3,6,2)

 merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)

Result:

   id time    tc1 obs1
 1  1    1   male    4
 2  1    2   <NA>   NA
 3  1    3   male    5
 4  2    1   <NA>   NA
 5  2    2 female    3
 6  2    3   <NA>   NA
 7  3    1   male    6
 8  3    2   <NA>   NA
 9  3    3   male    2
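
Note that the merged result leaves `tc1` as NA in the rows that were added, whereas the question asks for time-constant variables to be carried over. One possible way to fill them in, sketched here with the same toy objects (`grid`, `merged` and `lookup` are names chosen for this example, and the approach assumes each id has exactly one `tc1` value among its observed rows):

```r
# rebuild the example objects from this answer
grid <- data.frame(id = rep(1:3, each = 3), time = rep(1:3, times = 3))
test <- grid[c(1, 3, 5, 7, 9), ]
test$tc1  <- c("male", "male", "female", "male", "male")
test$obs1 <- c(4, 5, 3, 6, 2)

merged <- merge(grid, test, by = c("id", "time"), all.x = TRUE)

# lookup table: one tc1 value per id, taken from the observed rows
lookup <- unique(test[c("id", "tc1")])

# fill tc1 for every row by matching on id; obs1 keeps its NAs
merged$tc1 <- lookup$tc1[match(merged$id, lookup$id)]
```

An id with no observed rows at all would still get NA for `tc1`, since there is nothing to copy from.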