优化大型R数据帧中的检查循环

时间:2016-10-19 21:06:52

标签: r performance loops dataframe dplyr

我有一个数据框,每行记录一次孩子对健康机构的访问:

head(cred2015)
DNI_PN   COD_ESTAB   FEC_ATEN SEXO    FEC_NAC CONTROL
60099999      3697 2015-08-04    2 2015-07-28       2
60099999      3697 2015-08-14    2 2015-07-28       3
60099999      3697 2015-08-28    2 2015-07-28       1
60099999      3697 2015-09-28    2 2015-07-28       2
60099999      3697 2015-10-28    2 2015-07-28       3
60999999      3697 2015-11-28    2 2015-07-28       4

DNI_PN识别个人,FEC_NAC是他们的出生日期,FEC_ATEN是访问日期。

有一个儿童必须遵循的特定访问时间表,我希望检查这些访问中的每一个,如果孩子是否参加。这就是我目前的尝试:

#Create a vector with unique identifiers to loop over
dni<-unique(cred2015$DNI_PN)

#The first scheduled visit must happen 7 days after birth.
#My custom function takes a lower and upper date limit for wiggle room.
conRN1<-sapply(dni,checkCRED,5,9)

#The function in question:
checkCRED<-function(idnum,llim,ulim){
  require('magrittr')

  #Examine visits from one individual at a time
  subcred<-cred2015 %>% subset(DNI_PN == idnum)

  #Check if the lower limit would fall inside the year 2015
  if(as.Date("2015-12-31","%Y-%m-%d")-unique(subcred$FEC_NAC)>llim){

  #Note if a visit happened within the time window
    if((subcred %>% subset(FEC_ATEN-FEC_NAC>=llim & FEC_ATEN-FEC_NAC<=ulim))$CONTROL %>% length > 0){
      1 #On time
    } else {
      0 #Wasn't on time
    }
  }else{
    NA #Exclude if CRED date not in database range
  }
}

目前,为1000名不同的人做这件事需要花费近一分钟的时间,这是不可接受的,因为我需要为数十万人评估超过12次不同的预定访问。

请注意,CONTROL变量会跟踪观察对应的访问次数,但会根据子计划进行重置(第一个月最多2次访问,第一年最多11次,最多第二年6,等等)所以我找不到它的任何用处,但我离开它以防万一比我更聪明地找出涉及它的解决方案。

答案不必限于基础R。

数据

structure(list(DNI_PN = c(60099999, 60099999, 60099999, 60099999, 
60099999, 60999999), COD_ESTAB = c(3697, 3697, 3697, 3697, 3697, 
3697), FEC_ATEN = structure(c(16651, 16661, 16675, 16706, 16736, 
16767), class = "Date"), SEXO = c(2L, 2L, 2L, 2L, 2L, 2L), FEC_NAC = structure(c(16644, 
16644, 16644, 16644, 16644, 16644), class = "Date"), CONTROL = c(2L, 
3L, 1L, 2L, 3L, 4L)), .Names = c("DNI_PN", "COD_ESTAB", "FEC_ATEN", 
"SEXO", "FEC_NAC", "CONTROL"), class = "data.frame", row.names = c(NA, 
-6L))

0 个答案:

没有答案