I am using dplyr and ifelse to create a new column based on two conditions, using the data below.
dat <- structure(list(GenIndID = c("BHS_034", "BHS_034", "BHS_068",
"BHS_068", "BHS_068", "BHS_068", "BHS_068", "BHS_068", "BHS_068",
"BHS_068", "BHS_068"), IndID = c("BHS_034_A", "BHS_034_A", "BHS_068_A",
"BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_068_A", "BHS_068_A",
"BHS_068_A", "BHS_068_A", "BHS_068_A"), Fate = c("Mort", "Mort",
"Alive", "Alive", "Alive", "Alive", "Alive", "Alive", "Alive",
"Alive", "Alive"), Status = c("Alive", "Mort", "Alive", "Alive",
"MIA", "Alive", "MIA", "Alive", "MIA", "Alive", "Alive"), Type = c("Linked",
"Linked", "SOB", "SOB", "SOB", "SOB", "SOB", "SOB", "SOB", "SOB",
"SOB"), SurveyID = c("GYA13-1", "GYA14-1", "GYA13-1", "GYA14-1",
"GYA14-2", "GYA15-1", "GYA16-1", "GYA16-2", "GYA17-1", "GYA17-3",
"GYA15-2"), SurveyDt = structure(c(1379570400, 1407477600, 1379570400,
1407477600, 1409896800, NA, 1462946400, 1474351200, 1495519200,
1507010400, 1441951200), tzone = "", class = c("POSIXct", "POSIXt"
))), row.names = c(NA, 11L), .Names = c("GenIndID", "IndID",
"Fate", "Status", "Type", "SurveyID", "SurveyDt"), class = "data.frame")
> dat
   GenIndID     IndID  Fate Status   Type SurveyID   SurveyDt
1   BHS_034 BHS_034_A  Mort  Alive Linked  GYA13-1 2013-09-19
2   BHS_034 BHS_034_A  Mort   Mort Linked  GYA14-1 2014-08-08
3   BHS_068 BHS_068_A Alive  Alive    SOB  GYA13-1 2013-09-19
4   BHS_068 BHS_068_A Alive  Alive    SOB  GYA14-1 2014-08-08
5   BHS_068 BHS_068_A Alive    MIA    SOB  GYA14-2 2014-09-05
6   BHS_068 BHS_068_A Alive  Alive    SOB  GYA15-1       <NA>
7   BHS_068 BHS_068_A Alive    MIA    SOB  GYA16-1 2016-05-11
8   BHS_068 BHS_068_A Alive  Alive    SOB  GYA16-2 2016-09-20
9   BHS_068 BHS_068_A Alive    MIA    SOB  GYA17-1 2017-05-23
10  BHS_068 BHS_068_A Alive  Alive    SOB  GYA17-3 2017-10-03
11  BHS_068 BHS_068_A Alive  Alive    SOB  GYA15-2 2015-09-11
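More specifically, grouped by GenIndID, I want to create a new date field holding the maximum SurveyDt, based on two conditions on Type and Fate. In addition, I want the maximum date to be evaluated only over rows where Status == "Alive". My code produces all NA values rather than a date for BHS_068, which meets all of the specified conditions as described.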
I recently saw case_when mentioned here, which may be appropriate, but I could not implement it correctly.
dat %>%
  group_by(GenIndID) %>%
  mutate(NewDat = as.POSIXct(ifelse(Type == "SOB" & Fate == "Alive",
                                    max(SurveyDt[Status == "Alive"], na.rm = F),
                                    NA),
                             origin = '1970-01-01', na.rm = T)) %>%
  as.data.frame()
Any suggestions would be greatly appreciated.
Answer 0 (score: 2)
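If you want to stick with dplyr and use case_when, you have to make sure that the value returned by every case is of the same type.
Here the value for the matching case is a datetime, so the default value must be a datetime as well, which you get by wrapping NA in as.POSIXct.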
dat %>%
  group_by(GenIndID) %>%
  mutate(NewDat = case_when(
    Type == "SOB" & Fate == "Alive" ~ max(SurveyDt[Status == "Alive"], na.rm = TRUE),
    TRUE ~ as.POSIXct(NA, origin = "1970-01-01")
  ))
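Two details carry the approach above: na.rm = TRUE inside max, and a missing value that is itself a datetime. The base R sketch below illustrates both on a made-up vector (survey_dt and the other names are hypothetical, not part of the original data).

```r
# Toy survey dates, made up for illustration; one date is missing.
survey_dt <- as.POSIXct(c("2013-09-19", NA, "2017-10-03"), tz = "UTC")

# With na.rm = FALSE a single NA makes the whole maximum NA.
max_keep_na <- max(survey_dt, na.rm = FALSE)
is.na(max_keep_na)

# With na.rm = TRUE the missing date is dropped before taking the maximum.
max_drop_na <- max(survey_dt, na.rm = TRUE)
format(max_drop_na, "%Y-%m-%d")

# A typed missing value: still NA, but of class POSIXct, so it can sit
# next to real datetimes in case_when().
na_dt <- as.POSIXct(NA)
class(na_dt)
```

Since row 6 of the question's data has Status "Alive" and a missing SurveyDt, na.rm = F in the original attempt is enough to turn the maximum for BHS_068 into NA.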
Using ifelse:
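dat %>%
  group_by(GenIndID) %>%
  # ifelse() drops the POSIXct class and returns seconds since the epoch,
  # so convert the result back to a datetime
  mutate(NewDat = as.POSIXct(ifelse(Type == "SOB" & Fate == "Alive",
                                    max(SurveyDt[Status == "Alive"], na.rm = TRUE),
                                    NA),
                             origin = "1970-01-01"))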
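One caveat with base ifelse is that it strips attributes, so datetimes passed through it come back as bare numbers. The sketch below shows this on made-up values and one way to convert back (dt, keep, raw and restored are hypothetical names).

```r
# Toy datetimes, made up for illustration.
dt <- as.POSIXct(c("2014-08-08", "2017-10-03"), tz = "UTC")
keep <- c(TRUE, FALSE)

# ifelse() builds its result from the logical test vector, so the POSIXct
# class is lost and plain seconds since 1970-01-01 come back.
raw <- ifelse(keep, dt, NA)
class(raw)

# Convert the seconds back to a datetime explicitly.
restored <- as.POSIXct(raw, origin = "1970-01-01", tz = "UTC")
restored
```

dplyr's if_else is stricter about types and keeps the datetime class, which may be a simpler choice if dplyr is already loaded.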
Answer 1 (score: 1)
We can use data.table. After converting the data.frame to a data.table (setDT(dat)), specify the logical comparison in i and, grouped by 'GenIndID', assign (:=) the max of 'SurveyDt' where 'Status' is "Alive" to 'NewDat'.
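The code for this answer is not included here. What follows is a minimal sketch of what the description suggests, not the answerer's exact code. It assumes the data.table package is installed and that the condition in i is the same Type/Fate test used in the question; toy is a reduced, made-up stand-in for dat.

```r
library(data.table)

# Reduced toy version of the question's data (values made up to keep it short).
toy <- data.frame(
  GenIndID = c("BHS_034", "BHS_034", "BHS_068", "BHS_068", "BHS_068"),
  Fate     = c("Mort", "Mort", "Alive", "Alive", "Alive"),
  Status   = c("Alive", "Mort", "Alive", "MIA", "Alive"),
  Type     = c("Linked", "Linked", "SOB", "SOB", "SOB"),
  SurveyDt = as.POSIXct(c("2013-09-19", "2014-08-08", "2013-09-19",
                          "2017-05-23", NA), tz = "UTC"),
  stringsAsFactors = FALSE
)

setDT(toy)  # convert to a data.table by reference

# i: rows that are SOB and Alive; j: assign the latest "Alive" survey date;
# by: evaluate per GenIndID. Rows outside i are left as NA.
toy[Type == "SOB" & Fate == "Alive",
    NewDat := max(SurveyDt[Status == "Alive"], na.rm = TRUE),
    by = GenIndID]

toy
```

In this toy data the later 2017-05-23 survey has Status "MIA", so it is excluded and the BHS_068 rows receive 2013-09-19, while the BHS_034 rows stay NA.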