Calculating the number of days per year from a data frame

Asked: 2015-08-10 09:16:09

Tags: r

I have a data frame similar to this:

df<-read.csv(text="id;census;startDate;endDate
ZF001;died;16.10.2012;16.05.2015
ZF002;alive;20.10.2013;
ZF003;alive;04.11.2013;
ZF004;died;11.11.2013;20.12.2014
ZF005;died;25.11.2013;16.06.2015
ZF006;alive;25.11.2014;
ZF007;survived;02.12.2014;19.01.2015
ZF008;alive;11.12.2014;
ZF009;survived;28.01.2015;12.03.2015", sep=";")

df$startDate<-as.Date(df$startDate, "%d.%m.%Y")
df$endDate<-as.Date(df$endDate, "%d.%m.%Y")

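What I need is the following: a new data frame containing, for each proband, the number of days per year that the proband participated in the study. It should look something like this: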

id     year days
ZF001  2012   77
ZF001  2013  365
ZF001  2014  365
ZF001  2015  135
etc.

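1 answer:

Answer 0: (score: 10)

I'm assuming you only want the probands who died (since the ones still alive have no end date). Here is a possible data.table solution, which is fairly self-explanatory: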

library(data.table)
setDT(df)[census == "died", 
          as.data.table(table(year(seq.Date(startDate, endDate, by = "day")))), 
          by = id]
#       id   V1   N
# 1: ZF001 2012  77
# 2: ZF001 2013 365
# 3: ZF001 2014 365
# 4: ZF001 2015 136
# 5: ZF004 2013  51
# 6: ZF004 2014 354
# 7: ZF005 2013  37
# 8: ZF005 2014 365
# 9: ZF005 2015 167

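Basically, for each id we generate every day from the start date to the end date, then use the year function to extract the year of each day, and finally count the frequency of each year.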
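To make those three steps concrete, here is a minimal base R sketch for a single proband (ZF001 from the question). It uses format(days, "%Y") as a stand-in for data.table's year(), and the variable names are only illustrative:

```r
# Walk through the steps for one proband (ZF001 from the question)
start <- as.Date("16.10.2012", "%d.%m.%Y")
end   <- as.Date("16.05.2015", "%d.%m.%Y")

# 1. One element per calendar day, both endpoints included
days <- seq(start, end, by = "day")

# 2. Calendar year of each day (base R stand-in for data.table::year)
yrs <- format(days, "%Y")

# 3. The frequency of each year is the number of days spent in that year
counts <- table(yrs)
counts
# yrs
# 2012 2013 2014 2015
#   77  365  365  136
```

Because both the start day and the end day are counted, 2015 comes out as 136 rather than the 135 shown in the question.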

Or an equivalent dplyr solution:

library(dplyr)
# year() is not part of dplyr; it is taken from data.table here
df %>% 
  filter(census == "died") %>% 
  group_by(id) %>% 
  do(as.data.frame(table(data.table::year(seq.Date(.$startDate, .$endDate, by = "day")))))
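
Edit: Per the comments, if you want all patients (died or alive) and want to use Sys.Date as the end date for those still alive, we can define a simple helper function for this case: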

dateFunc <- function(x, y){
  if(is.na(y)) {
    as.data.table(table(year(seq.Date(x, Sys.Date(), by = "day"))))                              
  } else as.data.table(table(year(seq.Date(x, y, by = "day"))))
}

# The counts for open-ended rows depend on Sys.Date(); the output below is from 2015-08-10
setDT(df)[, setNames(dateFunc(startDate, endDate), c("Year", "Days")), by = id]
#        id Year Days
#  1: ZF001 2012   77
#  2: ZF001 2013  365
#  3: ZF001 2014  365
#  4: ZF001 2015  136
#  5: ZF002 2013   73
#  6: ZF002 2014  365
#  7: ZF002 2015  222
#  8: ZF003 2013   58
#  9: ZF003 2014  365
# 10: ZF003 2015  222
# 11: ZF004 2013   51
# 12: ZF004 2014  354
# 13: ZF005 2013   37
# 14: ZF005 2014  365
# 15: ZF005 2015  167
# 16: ZF006 2014   37
# 17: ZF006 2015  222
# 18: ZF007 2014   30
# 19: ZF007 2015   19
# 20: ZF008 2014   21
# 21: ZF008 2015  222
# 22: ZF009 2015   44
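For long follow-up periods, expanding every interval into a daily sequence can become wasteful. As an alternative sketch (not part of the answer above), the same counts can be obtained by clipping the interval to each calendar year and subtracting dates. The helper name days_per_year and its today argument are hypothetical, and the function assumes start is not later than end:

```r
# Hypothetical helper: days per calendar year for one start/end pair,
# computed by clipping the interval to each year instead of listing days.
# Assumes start <= end; a missing end date is replaced by `today`.
days_per_year <- function(start, end, today = Sys.Date()) {
  if (is.na(end)) end <- today
  years <- as.integer(format(start, "%Y")):as.integer(format(end, "%Y"))
  year_start <- as.Date(paste0(years, "-01-01"))
  year_end   <- as.Date(paste0(years, "-12-31"))
  # Overlap of [start, end] with each year, both endpoints included
  days <- as.integer(pmin(end, year_end) - pmax(start, year_start)) + 1L
  data.frame(Year = years, Days = days)
}

# ZF001: closed interval
zf001 <- days_per_year(as.Date("2012-10-16"), as.Date("2015-05-16"))

# ZF002: no end date, counted up to a fixed reference date
zf002 <- days_per_year(as.Date("2013-10-20"), as.Date(NA),
                       today = as.Date("2015-08-10"))
```

With the reference date fixed at 2015-08-10, both calls should reproduce the ZF001 and ZF002 rows of the table above. Passing today explicitly also keeps the result reproducible, whereas relying on Sys.Date() changes the counts from day to day.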

Data

df <- structure(list(id = structure(1:9, .Label = c("ZF001", "ZF002", 
"ZF003", "ZF004", "ZF005", "ZF006", "ZF007", "ZF008", "ZF009"
), class = "factor"), census = structure(c(2L, 1L, 1L, 2L, 2L, 
1L, 3L, 1L, 3L), .Label = c("alive", "died", "survived"), class = "factor"), 
    startDate = structure(c(15629, 15998, 16013, 16020, 16034, 
    16399, 16406, 16415, 16463), class = "Date"), endDate = structure(c(16571, 
    NA, NA, 16424, 16602, NA, 16454, NA, 16506), class = "Date")), .Names = c("id", 
"census", "startDate", "endDate"), row.names = c(NA, -9L), class = "data.frame")