Fill in missing date rows by group

Time: 2018-10-18 20:00:46

Tags: r date group-by data.table row

I have a data table like this one, only much larger:

library(data.table)

customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
# ISO-formatted strings parse with as.Date() without an explicit format
time <- as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                  "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                  "2017-01-01", "2017-04-01", "2017-05-01"))
tenor <- c(1,2,3,1,2,3,4,1,2,3)
variable_x <- c(87,90,100,120,130,150,12,13,15,14)

my_data <- data.table(customer_id,account_id,time,tenor,variable_x)

customer_id account_id       time tenor variable_x
          1         11 2017-01-01     1         87
          1         11 2017-05-01     2         90
          1         11 2017-06-01     3        100
          2         55 2017-02-01     1        120
          2         55 2017-04-01     2        130
          2         55 2017-05-01     3        150
          2         55 2017-06-01     4         12
          3         38 2017-01-01     1         13
          3         38 2017-04-01     2         15
          3         38 2017-05-01     3         14

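I should have monthly observations from 2017-01-01 to 2017-06-01 for every customer_id, account_id pair, but some pairs are missing some of the dates in this 6-month sequence. I want to fill in those missing dates so that every customer_id, account_id pair has 6 monthly rows, with tenor and variable_x set to NA on the added rows. That is, it should look like this: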

    customer_id account_id       time tenor variable_x
           1         11    2017-01-01     1         87
           1         11    2017-02-01    NA         NA
           1         11    2017-03-01    NA         NA
           1         11    2017-04-01    NA         NA
           1         11    2017-05-01     2         90
           1         11    2017-06-01     3        100
           2         55    2017-01-01    NA         NA
           2         55    2017-02-01     1        120
           2         55    2017-03-01    NA         NA
           2         55    2017-04-01     2        130
           2         55    2017-05-01     3        150
           2         55    2017-06-01     4         12
           3         38    2017-01-01     1         13
           3         38    2017-02-01    NA         NA
           3         38    2017-03-01    NA         NA
           3         38    2017-04-01     2         15
           3         38    2017-05-01     3         14
           3         38    2017-06-01    NA         NA

I tried creating a sequence of dates from 2017-01-01 to 2017-06-01 using

ts = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), by = "month")

and then merging it with the original data using

ts = data.table(ts)
colnames(ts) = "time"
merged <- merge(ts, my_data, by="time", all.x=TRUE)

but it does not work. Do you know how to add such date rows for each customer_id, account_id pair?

2 answers:

Answer 0: (score: 3)

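We can do a join. Create a sequence from the min to the max of 'time' by '1 month', expand the dataset grouped by 'customer_id' and 'account_id', and join on those columns and 'time':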

ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")
my_data[my_data[, .(time =ts1 ), .(customer_id, account_id)], 
             on = .(customer_id, account_id, time)]
#    customer_id account_id       time tenor variable_x
# 1:           1         11 2017-01-01     1         87
# 2:           1         11 2017-02-01    NA         NA
# 3:           1         11 2017-03-01    NA         NA
# 4:           1         11 2017-04-01    NA         NA
# 5:           1         11 2017-05-01     2         90
# 6:           1         11 2017-06-01     3        100
# 7:           2         55 2017-01-01    NA         NA
# 8:           2         55 2017-02-01     1        120
# 9:           2         55 2017-03-01    NA         NA
#10:           2         55 2017-04-01     2        130
#11:           2         55 2017-05-01     3        150
#12:           2         55 2017-06-01     4         12
#13:           3         38 2017-01-01     1         13
#14:           3         38 2017-02-01    NA         NA
#15:           3         38 2017-03-01    NA         NA
#16:           3         38 2017-04-01     2         15
#17:           3         38 2017-05-01     3         14
#18:           3         38 2017-06-01    NA         NA
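
The one-liner above can be split into its steps, which may make the join easier to follow. This is only a sketch on the question's sample data; the names pairs, grid and filled are introduced here for illustration and do not appear in the original answer.

```r
library(data.table)

# Sample data from the question
my_data <- data.table(
  customer_id = c("1","1","1","2","2","2","2","3","3","3"),
  account_id  = as.character(c(11,11,11,55,55,55,55,38,38,38)),
  time        = as.Date(c("2017-01-01", "2017-05-01", "2017-06-01",
                          "2017-02-01", "2017-04-01", "2017-05-01", "2017-06-01",
                          "2017-01-01", "2017-04-01", "2017-05-01")),
  tenor       = c(1,2,3,1,2,3,4,1,2,3),
  variable_x  = c(87,90,100,120,130,150,12,13,15,14)
)

# Step 1: the full monthly sequence covered by the data
ts1 <- seq(min(my_data$time), max(my_data$time), by = "1 month")

# Step 2: one row per (customer_id, account_id) pair
pairs <- unique(my_data[, .(customer_id, account_id)])

# Step 3: repeat the full sequence for every pair (hypothetical name: grid)
grid <- pairs[, .(time = ts1), by = .(customer_id, account_id)]

# Step 4: look up each grid row in my_data; rows without a match get NA
filled <- my_data[grid, on = .(customer_id, account_id, time)]

# Each pair should now have one row per month in ts1
filled[, .N, by = .(customer_id, account_id)]
```

Note that ts1 is built from the overall min and max of 'time', so this assumes every pair should span the same date range, as in the question.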

Or using tidyverse

library(tidyverse)
distinct(my_data, customer_id, account_id) %>%
      mutate(time = list(ts1)) %>% 
      unnest(time) %>%   # name the list column explicitly
      left_join(my_data, by = c("customer_id", "account_id", "time"))

Or using complete from tidyr

my_data %>% 
     complete(nesting(customer_id, account_id), time = ts1)

Answer 1: (score: 1)

Another data.table approach:

my_data2 <- my_data[, .(time = seq(as.Date("2017/01/01"), as.Date("2017/06/01"), 
                              by = "month")), by = list(customer_id, account_id)]

merge(my_data2, my_data, all.x = TRUE)

     customer_id account_id       time tenor variable_x
 1:           1         11 2017-01-01     1         87
 2:           1         11 2017-02-01    NA         NA
 3:           1         11 2017-03-01    NA         NA
 4:           1         11 2017-04-01    NA         NA
 5:           1         11 2017-05-01     2         90
 6:           1         11 2017-06-01     3        100
 7:           2         55 2017-01-01    NA         NA
 8:           2         55 2017-02-01     1        120
 9:           2         55 2017-03-01    NA         NA
10:           2         55 2017-04-01     2        130
11:           2         55 2017-05-01     3        150
12:           2         55 2017-06-01     4         12
13:           3         38 2017-01-01     1         13
14:           3         38 2017-02-01    NA         NA
15:           3         38 2017-03-01    NA         NA
16:           3         38 2017-04-01     2         15
17:           3         38 2017-05-01     3         14
18:           3         38 2017-06-01    NA         NA