I am working with a data frame structured as follows:
structure(list(Date = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 4L,
5L, 1L, 2L), .Label = c("2010-02-01", "2010-03-01", "2010-04-01",
"2010-05-01", "2010-06-01"), class = "factor"), y = c(1, 1, 1,
2, 2, 2, 2, 2, 3, 3), binary = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 1
)), class = "data.frame", row.names = c(NA, -10L))
Date y binary
1 2010-02-01 1 0
2 2010-03-01 1 0
3 2010-04-01 1 0
4 2010-02-01 2 0
5 2010-03-01 2 0
6 2010-04-01 2 0
7 2010-05-01 2 1
8 2010-06-01 2 1
9 2010-02-01 3 0
10 2010-03-01 3 1
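I am trying to give each group at least four consecutive monthly observations, with the condition that once binary takes the value 1 for a group, it keeps that value. The result should look like this: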
Date y binary
>1 2010-02-01 1 0
>2 2010-03-01 1 0
>3 2010-04-01 1 0
>4 2010-05-01 1 0
>5 2010-02-01 2 0
>6 2010-03-01 2 0
>7 2010-04-01 2 0
>8 2010-05-01 2 1
>9 2010-06-01 2 1
>10 2010-02-01 3 0
>11 2010-03-01 3 1
>12 2010-04-01 3 1
>13 2010-05-01 3 1
I have created a subset of the data for the first group (y = 1), and the loop below works for that subset.
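library(lubridate)  # provides %m+% and months()
dt1 <- dt[1:3, ]
# Append one month at a time, copying y and binary from the last row, until there are four rows
while (nrow(dt1) < 4) {
  maxdate <- as.Date(dt1[nrow(dt1), 1]) %m+% months(1)
  dt1 <- rbind(dt1, c(as.character(maxdate), dt1[nrow(dt1), 2], dt1[nrow(dt1), 3]))
}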
However, I don't know how to incorporate this logic into a dplyr pipeline such as dt %>% group_by(y).
How can I get this result, preferably with dplyr and, if possible, without repeated for loops? (The actual dataset is very large.)
Answer 0 (score: 4)
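Here is one option. First convert 'Date' to the Date class, group by 'y', and get the number of rows per group with n(). Then use that count inside complete to expand 'Date' so that every group has at least 4 rows, use fill to replace the resulting NA elements with the previous non-NA value, and remove the temporary 'n' column.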
library(dplyr)
library(tidyr)
df1 %>%
mutate(Date = as.Date(Date)) %>%
group_by(y) %>%
mutate(n = n()) %>%
complete(Date = seq(first(Date), length.out = max(first(n), 4),
by = '1 month')) %>%
fill(binary) %>%
select(-n)
# A tibble: 13 x 3
# Groups: y [3]
# y Date binary
# <dbl> <date> <dbl>
# 1 1 2010-02-01 0
# 2 1 2010-03-01 0
# 3 1 2010-04-01 0
# 4 1 2010-05-01 0
# 5 2 2010-02-01 0
# 6 2 2010-03-01 0
# 7 2 2010-04-01 0
# 8 2 2010-05-01 1
# 9 2 2010-06-01 1
#10 3 2010-02-01 0
#11 3 2010-03-01 1
#12 3 2010-04-01 1
#13 3 2010-05-01 1
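To see what complete and fill each contribute, here is a minimal sketch on a single, ungrouped toy table. The names toy, expanded and filled are made up for this illustration and are not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-row table: one group that is two months short of four rows
toy <- tibble(
  Date   = as.Date(c("2010-02-01", "2010-03-01")),
  binary = c(0, 1)
)

# complete() adds a row for every date in the sequence that is not yet present;
# the columns that are not being completed (here binary) are NA in the new rows
expanded <- toy %>%
  complete(Date = seq(min(Date), by = "1 month", length.out = 4))

# fill() carries the last non-NA value downward, so binary becomes 0, 1, 1, 1
filled <- expanded %>%
  fill(binary)

filled
```

Because fill carries the last observed value down, a group whose binary has already switched to 1 keeps the value 1 in the padded rows, which is the condition stated in the question.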
Answer 1 (score: 2)
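One option is to create a new table containing all the desired dates, perform a rolling join of that table with the original table df, and then nafill the other columns as needed.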
library(lubridate)
library(data.table)
setDT(df)
df[, Date := as.Date(Date)]
alldts <-
df[, if(.N < 4) .(Date = first(Date) + months(0:3)) else Date, by = y]
df[alldts, on = .(y, Date), roll = -Inf
][, binary := nafill(binary, 'locf')][]
# Date y binary
# 1: 2010-02-01 1 0
# 2: 2010-03-01 1 0
# 3: 2010-04-01 1 0
# 4: 2010-05-01 1 0
# 5: 2010-02-01 2 0
# 6: 2010-03-01 2 0
# 7: 2010-04-01 2 0
# 8: 2010-05-01 2 1
# 9: 2010-06-01 2 1
# 10: 2010-02-01 3 0
# 11: 2010-03-01 3 1
# 12: 2010-04-01 3 1
# 13: 2010-05-01 3 1
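For comparison, the padding logic from the question's loop can also be sketched in base R without any packages. This is only an illustration, not a method from either answer: pad_group and min_rows are hypothetical names, and the sketch assumes that the dates within each group are already consecutive months with no gaps, so only the tail of each group needs extending.

```r
# Example data from the question, with Date stored as a Date column
dt <- data.frame(
  Date = as.Date(c("2010-02-01", "2010-03-01", "2010-04-01",
                   "2010-02-01", "2010-03-01", "2010-04-01",
                   "2010-05-01", "2010-06-01",
                   "2010-02-01", "2010-03-01")),
  y      = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  binary = c(0, 0, 0, 0, 0, 0, 1, 1, 0, 1)
)

# Hypothetical helper: extend one group to at least min_rows monthly rows,
# copying y and binary from the group's last observed row
pad_group <- function(g, min_rows = 4) {
  g <- g[order(g$Date), ]
  n_missing <- min_rows - nrow(g)
  if (n_missing <= 0) return(g)  # group is already long enough
  last <- g[nrow(g), ]
  # Monthly sequence starting at the last observed date; drop that first date
  new_dates <- seq(last$Date, by = "month", length.out = n_missing + 1)[-1]
  extra <- data.frame(Date = new_dates, y = last$y, binary = last$binary)
  rbind(g, extra)
}

# Apply the helper to each group and stack the pieces back together
result <- do.call(rbind, lapply(split(dt, dt$y), pad_group))
rownames(result) <- NULL
result
```

Each group is built in one step rather than grown row by row, which avoids the repeated rbind calls of the original loop. On a very large dataset the dplyr or data.table versions above are likely to be the more practical choice.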