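I have longitudinal data, and I want to insert new rows based on the values of multiple columns in existing rows.
For any person, whenever there is a gap between the previous release date and the next admission date, I want to add a new row that uses the previous release date as its admission date and the next admission date as its release date, so that there are no "gaps". If a person's final observation has a release date, I also want to add a new row with that release date as the admission date and NA as the release date.
I think this might call for data.table or dplyr's add_row, but I'm not sure how. The other SO questions I've seen insert rows based on the number of rows in a group, or add a new row before/after every existing row. If I can figure out how to insert rows in the right places, I think I can use dplyr's lag and lead functions to fill in the correct dates.
Here is some sample data: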
myData <- data.frame(ID = c(2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
                     TERM_TYPE = c("Parole", "Prison", "Parole",
                                   "Parole", "Prison", "Parole",
                                   "Parole", "Prison", "Parole", "Prison"),
                     ADMISSION_DATE = c("2006-10-15", "2008-09-15", "2009-01-15",
                                        "2006-01-15", "2006-12-15", "2006-12-15",
                                        "2006-04-15", "2013-01-15", "2013-12-15", "2015-01-15"),
                     RELEASE_DATE = c("2008-09-15", "2009-01-15", "2010-12-15",
                                      "2006-10-15", NA, "2008-06-15",
                                      "2010-01-15", "2013-12-15", "2015-01-15", NA),
                     stringsAsFactors = FALSE)
I would like it to look like this:
   ID      TERM_TYPE ADMISSION_DATE RELEASE_DATE
1   2         Parole     2006-10-15   2008-09-15
2   2         Prison     2008-09-15   2009-01-15
3   2         Parole     2009-01-15   2010-12-15
4   2 Not supervised     2010-12-15         <NA>
5   3         Parole     2006-01-15   2006-10-15
6   3         Prison     2006-10-15         <NA>
7   4         Parole     2006-12-15   2008-06-15
8   4 Not supervised     2008-06-15         <NA>
9   5         Parole     2006-04-15   2010-01-15
10  5 Not supervised     2010-01-15   2013-01-15
11  5         Prison     2013-01-15   2013-12-15
12  5         Parole     2013-12-15   2015-01-15
13  5         Prison     2015-01-15         <NA>
Answer 0 (score: 0)
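There may be a more concise way to do this, but I think this shows the underlying idea. Essentially, I combine three tables:
1) the original data
2) the missing gap periods
3) the periods following a known final release date
#2 and #3 are created by taking the relevant rows from the original data and modifying them to show what we want. For example, #2 finds rows where there is a gap between the previous row's release date and the current row's admission date, then modifies each such row so that it represents the missing period.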
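To make the gap-detection idea behind table #2 concrete, here is a minimal sketch on a hypothetical one-person data frame (`toy`, `prev_release`, `gap_start` and `gap_end` are illustrative names, not part of the original data). It assumes the dates are already of class Date and relies only on dplyr's `lag()`:

```r
library(dplyr)

# Hypothetical toy data: one person, two terms with a gap between them
toy <- data.frame(
  ID = c(1, 1),
  ADMISSION_DATE = as.Date(c("2006-01-15", "2006-12-15")),
  RELEASE_DATE   = as.Date(c("2006-10-15", NA))
)

gaps <- toy %>%
  group_by(ID) %>%
  mutate(prev_release = lag(RELEASE_DATE)) %>%  # release date of the previous row
  ungroup() %>%
  # filter() drops rows where the comparison is NA (the first row of each ID)
  filter(ADMISSION_DATE > prev_release) %>%
  transmute(ID,
            gap_start = prev_release,           # gap begins at the previous release
            gap_end   = ADMISSION_DATE)         # gap ends at this row's admission

gaps
```

Separate names are used for the gap boundaries because `mutate()` and `transmute()` evaluate their arguments in order, so overwriting `ADMISSION_DATE` first would change what a later expression sees.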
# First, convert the date columns from character to Date
library(tidyverse)
library(lubridate)
myData <- myData %>%
  mutate(across(contains("DATE"), ymd))
# Create table #2
myData_fill_gaps <- myData %>%
  group_by(ID) %>%
  mutate(gap_days = (ADMISSION_DATE - lag(RELEASE_DATE)) / ddays(1),
         ADM_temp = lag(RELEASE_DATE),
         REL_temp = ADMISSION_DATE) %>%
  ungroup() %>%
  filter(gap_days > 0) %>%   # Only keep rows relating to gaps
  mutate(TERM_TYPE = "Not supervised") %>%
  select(ID, TERM_TYPE, ADMISSION_DATE = ADM_temp, RELEASE_DATE = REL_temp)
# Create table #3
myData_add_release_NA <- myData %>%
  group_by(ID) %>%
  slice(n()) %>%                     # Only keep last row for each ID
  filter(!is.na(RELEASE_DATE)) %>%   # Only keep if RELEASE_DATE is not NA
  mutate(TERM_TYPE = "Not supervised",
         ADMISSION_DATE = RELEASE_DATE,
         RELEASE_DATE = as.Date(NA)) %>%   # Date-typed NA so bind_rows() can combine the column
  ungroup()
myData_combined <- bind_rows(
  myData,
  myData_fill_gaps,
  myData_add_release_NA
) %>%
  arrange(ID, ADMISSION_DATE)
Output
> myData_combined
   ID      TERM_TYPE ADMISSION_DATE RELEASE_DATE
1   2         Parole     2006-10-15   2008-09-15
2   2         Prison     2008-09-15   2009-01-15
3   2         Parole     2009-01-15   2010-12-15
4   2 Not supervised     2010-12-15         <NA>
5   3         Parole     2006-01-15   2006-10-15
6   3 Not supervised     2006-10-15   2006-12-15
7   3         Prison     2006-12-15         <NA>
8   4         Parole     2006-12-15   2008-06-15
9   4 Not supervised     2008-06-15         <NA>
10  5         Parole     2006-04-15   2010-01-15
11  5 Not supervised     2010-01-15   2013-01-15
12  5         Prison     2013-01-15   2013-12-15
13  5         Parole     2013-12-15   2015-01-15
14  5         Prison     2015-01-15         <NA>
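As one possible shortening of the approach above, tables #2 and #3 can be built in a single pass by looking forward with `lead()` instead of backward with `lag()`. This is only a sketch under a few assumptions: it uses a subset of the question's sample data (IDs 3 and 4), parses dates with base R's `as.Date()`, and introduces the illustrative names `terms`, `fillers`, `filled`, `next_admission` and `contiguous`, none of which appear in the original code.

```r
library(dplyr)

# Subset of the question's sample data (IDs 3 and 4), with dates as Date
terms <- data.frame(
  ID = c(3, 3, 4),
  TERM_TYPE = c("Parole", "Prison", "Parole"),
  ADMISSION_DATE = as.Date(c("2006-01-15", "2006-12-15", "2006-12-15")),
  RELEASE_DATE   = as.Date(c("2006-10-15", NA, "2008-06-15")),
  stringsAsFactors = FALSE
)

fillers <- terms %>%
  group_by(ID) %>%
  arrange(ADMISSION_DATE, .by_group = TRUE) %>%
  mutate(next_admission = lead(ADMISSION_DATE)) %>%  # NA on each person's last row
  ungroup() %>%
  # A filler row is needed after any released term that is either the
  # person's last term or is followed by a later admission
  filter(!is.na(RELEASE_DATE),
         is.na(next_admission) | next_admission > RELEASE_DATE) %>%
  transmute(ID,
            TERM_TYPE = "Not supervised",
            ADMISSION_DATE = RELEASE_DATE,   # filler starts at the release
            RELEASE_DATE = next_admission)   # and ends at the next admission (or NA)

filled <- bind_rows(terms, fillers) %>%
  arrange(ID, ADMISSION_DATE)

# Sanity check: within each ID, every admission should equal the previous release
check <- filled %>%
  group_by(ID) %>%
  mutate(contiguous = ADMISSION_DATE == lag(RELEASE_DATE)) %>%
  ungroup()

all(check$contiguous, na.rm = TRUE)
```

The final check ignores the first row of each ID, which has no previous release to compare against.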