I have a data frame similar to this:
df<-read.csv(text="id;census;startDate;endDate
ZF001;died;16.10.2012;16.05.2015
ZF002;alive;20.10.2013;
ZF003;alive;04.11.2013;
ZF004;died;11.11.2013;20.12.2014
ZF005;died;25.11.2013;16.06.2015
ZF006;alive;25.11.2014;
ZF007;survived;02.12.2014;19.01.2015
ZF008;alive;11.12.2014;
ZF009;survived;28.01.2015;12.03.2015", sep=";")
df$startDate<-as.Date(df$startDate, "%d.%m.%Y")
df$endDate<-as.Date(df$endDate, "%d.%m.%Y")
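What I need is the following: a new data frame containing the number of days per year that each proband was in the study. It should look something like this: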
id year days
ZF001 2012 77
ZF001 2013 365
ZF001 2014 365
ZF001 2015 135
etc.
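Answer 0 (score: 10)
I'm assuming you only want the probands who died (since the ones still alive have no end date). Here is a possible data.table solution, which is fairly self-explanatory: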
library(data.table)
setDT(df)[census == "died",
as.data.table(table(year(seq.Date(startDate, endDate, by = "day")))),
by = id]
# id V1 N
# 1: ZF001 2012 77
# 2: ZF001 2013 365
# 3: ZF001 2014 365
# 4: ZF001 2015 136
# 5: ZF004 2013 51
# 6: ZF004 2014 354
# 7: ZF005 2013 37
# 8: ZF005 2014 365
# 9: ZF005 2015 167
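Basically, for each id we generate the sequence of all days from the start date to the end date, then use the year function to extract the year of each day, and finally just count the frequencies. Both the start day and the end day are counted, which is why ZF001 gets 136 days for 2015 here rather than the 135 shown in the question.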
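To make the three steps concrete, here is a minimal base R walk-through for a single proband (ZF004 from the example data). It is only a sketch: format(days, "%Y") is used as a stand-in for the year function so that no package is needed, and the variable names are chosen for illustration.

```r
# Base R walk-through for one proband (ZF004 from the example data).
start <- as.Date("11.11.2013", "%d.%m.%Y")
end   <- as.Date("20.12.2014", "%d.%m.%Y")

# Step 1: one element per calendar day, both endpoints included
days <- seq.Date(start, end, by = "day")

# Step 2: extract the year of each day
# (format() is the base R stand-in for data.table's year())
yrs <- as.integer(format(days, "%Y"))

# Step 3: count how many days fall into each year
counts <- table(yrs)
print(counts)
```

The counts for 2013 and 2014 should match rows 5 and 6 of the output above.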
Or an equivalent dplyr solution:
library(dplyr)
library(data.table) # provides year(); lubridate::year() would work as well
df %>%
group_by(id) %>%
filter(census=='died') %>%
do(as.data.frame(table(year(seq.Date(.$startDate, .$endDate, by ='day')))))
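If neither package is available, the same per-id counting can be sketched in base R. This is an illustrative alternative rather than part of the original answer: the small data frame died below is retyped by hand from the three "died" rows of the example data, and the column names id, year and days follow the layout requested in the question.

```r
# The three probands with census == "died", retyped from the example data
died <- data.frame(
  id        = c("ZF001", "ZF004", "ZF005"),
  startDate = as.Date(c("2012-10-16", "2013-11-11", "2013-11-25")),
  endDate   = as.Date(c("2015-05-16", "2014-12-20", "2015-06-16"))
)

# Build one small data frame per row, then stack them
pieces <- lapply(seq_len(nrow(died)), function(i) {
  days   <- seq.Date(died$startDate[i], died$endDate[i], by = "day")
  counts <- table(format(days, "%Y"))   # days per calendar year
  data.frame(id   = died$id[i],
             year = as.integer(names(counts)),
             days = as.vector(counts))
})
result <- do.call(rbind, pieces)
print(result)
```

The result is a plain data frame with one row per id and year, which can be compared directly with the data.table output above.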
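Edit, per comments:
If you want all patients (dead or alive), and want to use Sys.Date as the end date for those still alive, we can define a simple helper function for this case: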
dateFunc <- function(x, y){
if(is.na(y)) {
as.data.table(table(year(seq.Date(x, Sys.Date(), by = "day"))))
} else as.data.table(table(year(seq.Date(x, y, by = "day"))))
}
setDT(df)[, setNames(dateFunc(startDate, endDate), c("Year", "Days")), by = id]
# id Year Days
# 1: ZF001 2012 77
# 2: ZF001 2013 365
# 3: ZF001 2014 365
# 4: ZF001 2015 136
# 5: ZF002 2013 73
# 6: ZF002 2014 365
# 7: ZF002 2015 222
# 8: ZF003 2013 58
# 9: ZF003 2014 365
# 10: ZF003 2015 222
# 11: ZF004 2013 51
# 12: ZF004 2014 354
# 13: ZF005 2013 37
# 14: ZF005 2014 365
# 15: ZF005 2015 167
# 16: ZF006 2014 37
# 17: ZF006 2015 222
# 18: ZF007 2014 30
# 19: ZF007 2015 19
# 20: ZF008 2014 21
# 21: ZF008 2015 222
# 22: ZF009 2015 44
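Because the helper calls Sys.Date, the 2015 counts for patients still alive change with the day the code is run; the 222 days shown above are consistent with a run on 10 August 2015. A minimal sketch of a reproducible variant follows, assuming it is acceptable to pass the cut-off date in explicitly. The function name days_per_year and its today argument are hypothetical, not part of the original answer, and the sketch uses base R only.

```r
# Hypothetical helper: days per calendar year between start and end.
# An open-ended stay (end is NA) is closed at `today`, which is passed in
# explicitly so that the result does not change from one run to the next.
days_per_year <- function(start, end, today = Sys.Date()) {
  if (is.na(end)) end <- today
  yrs    <- format(seq.Date(start, end, by = "day"), "%Y")
  counts <- table(yrs)
  data.frame(Year = as.integer(names(counts)),
             Days = as.vector(counts))
}

# ZF006 from the example data: started 25.11.2014, no end date
res <- days_per_year(as.Date("2014-11-25"), as.Date(NA),
                     today = as.Date("2015-08-10"))
print(res)
```

With the cut-off fixed at 10 August 2015, the result for ZF006 should agree with rows 16 and 17 of the output above.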
Data
df <- structure(list(id = structure(1:9, .Label = c("ZF001", "ZF002",
"ZF003", "ZF004", "ZF005", "ZF006", "ZF007", "ZF008", "ZF009"
), class = "factor"), census = structure(c(2L, 1L, 1L, 2L, 2L,
1L, 3L, 1L, 3L), .Label = c("alive", "died", "survived"), class = "factor"),
startDate = structure(c(15629, 15998, 16013, 16020, 16034,
16399, 16406, 16415, 16463), class = "Date"), endDate = structure(c(16571,
NA, NA, 16424, 16602, NA, 16454, NA, 16506), class = "Date")), .Names = c("id",
"census", "startDate", "endDate"), row.names = c(NA, -9L), class = "data.frame")