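How to in R

Time: 2017-04-17 16:33:42

Tags: r performance loops data.table data-cleaning

I have a dataset containing the dates on which a person left a network. A person can leave the network more than once, because they can rejoin after leaving. The following code reproduces the scenario.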

library(data.table)
Leaving_Date<- data.table(Id= c(1,2,3,4,3,5),Date = as.Date(
c("2017-01-01","2017-02-03","2017-01-01","2017-03-10","2017-02-09","2017-02-05")))

(Ids can appear more than once in this table, because a person can leave the network several times after rejoining it.)

 > Leaving_Date
   Id       Date
1:  1 2017-01-01
2:  2 2017-02-03
3:  3 2017-01-01
4:  4 2017-03-10
5:  3 2017-02-09
6:  5 2017-02-05

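I have another dataset that records the date of every follow-up with a given person, which may fall before or after the date they left the network. The following code reproduces the scenario.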

FOLLOWUPs <- data.table(Id = c(1,2,3,2,2,3,3,4,1,5),
                        Date =as.Date(c("2016-10-01","2017-02-04",
                        "2017-01-17","2017-02-23", "2017-03-03",
                        "2017-02-10","2017-02-11","2017-01-01",
                        "2017-01-15","2017-01-01")))


> FOLLOWUPs
    Id       Date
 1:  1 2016-10-01
 2:  2 2017-02-04
 3:  3 2017-01-17
 4:  2 2017-02-23
 5:  2 2017-03-03
 6:  3 2017-02-10
 7:  3 2017-02-11
 8:  4 2017-01-01
 9:  1 2017-01-15
10:  5 2017-01-01

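Now I want to look up each case in Leaving_Date, find the dates on which that person was followed up, and create three columns (SevenDay, FourteenDay, ThirtyDay) holding 0 or 1 to indicate whether a follow-up happened within each period. I am using the following code: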

SEVENDAY_FOLLOWUP <- vector()
FOURTEEN_FOLLOWUP <- vector()
THIRTYDAY_FOLLOWUP <- vector()
for(i in 1:nrow(Leaving_Date)){
  sub_data <- FOLLOWUPs[Id== Leaving_Date[i,1]]
  if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                   Date < (Leaving_Date[i,Date]+7)])== 0){
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,0)
   }
   else{
     SEVENDAY_FOLLOWUP <- rbind(SEVENDAY_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+14)])== 0){
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,0)
   }
   else{
     FOURTEEN_FOLLOWUP <- rbind(FOURTEEN_FOLLOWUP,1)
   }

   if(nrow(sub_data[Date > Leaving_Date[i,Date] &
                    Date < (Leaving_Date[i,Date]+30)])== 0){
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,0)
   }
   else{
     THIRTYDAY_FOLLOWUP <- rbind(THIRTYDAY_FOLLOWUP,1)
   }
 }               


 Leaving_Date$SEVENDAY <- as.vector(SEVENDAY_FOLLOWUP)
 Leaving_Date$FOURTEENDAY <- as.vector(FOURTEEN_FOLLOWUP)
 Leaving_Date$THIRTYDAY <- as.vector(THIRTYDAY_FOLLOWUP)

Final data

 > Leaving_Date
    Id       Date SEVENDAY FOURTEENDAY THIRTYDAY
 1:  1 2017-01-01        0           0         1
 2:  2 2017-02-03        1           1         1
 3:  3 2017-01-01        0           0         1
 4:  4 2017-03-10        0           0         0
 5:  3 2017-02-09        1           1         1
 6:  5 2017-02-05        0           0         0

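This code is very inefficient: I have to run it on 100k observations, and it takes a long time. Is there a more efficient way to do this?

3 answers:

Answer 0: (score: 4)

Use a non-equi join: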

setorder(FOLLOWUPs, Id, Date)
Leaving_Date[, n :=
  FOLLOWUPs[.SD, on=.(Id, Date > Date), mult = "first", x.Date - i.Date]
]

   Id       Date       n
1:  1 2017-01-01 14 days
2:  2 2017-02-03  1 days
3:  3 2017-01-01 16 days
4:  4 2017-03-10 NA days
5:  3 2017-02-09  1 days
6:  5 2017-02-05 NA days

Switching from Date to IDate will probably make this about twice as fast. See ?IDate.

I think it is best to stop here, but n can be compared against 7, 14, and 30 if necessary.
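As a rough sketch of that comparison (not the answerer's own code), the block below rebuilds the question's sample data under the illustrative names `leaving` and `followups`, computes `n` with the same non-equi join, and then derives the three 0/1 columns. It assumes the question's strict window (follow-up after the leaving date and fewer than 7, 14, or 30 days later) and treats a missing `n` as "no follow-up".

```r
library(data.table)

# Sample data copied from the question; the lowercase table names are
# illustrative assumptions, not names used in the original post.
leaving <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")))
followups <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01")))

# Sort so that mult = "first" picks the earliest follow-up after each leaving date.
setorder(followups, Id, Date)

# n = days from the leaving date to the first later follow-up (NA if there is none).
leaving[, n := followups[.SD, on = .(Id, Date > Date), mult = "first",
                         as.integer(x.Date - i.Date)]]

# Turn n into 0/1 flags; rows with no later follow-up (NA) get 0.
leaving[, `:=`(
  SEVENDAY    = as.integer(!is.na(n) & n < 7),
  FOURTEENDAY = as.integer(!is.na(n) & n < 14),
  THIRTYDAY   = as.integer(!is.na(n) & n < 30))]

print(leaving)
```

Because only the earliest later follow-up matters for a "was there any follow-up within k days" question, a single join is enough for all three thresholds.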

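Side note: please do not give your tables names like these.

Answer 1: (score: 0)

I think this can help you, using dplyr.

It does an inner join by Id, which generates every combination of dates from the two data frames for a given Id. It then computes the date difference, groups by Id and leaving date, and checks whether any value falls within the range of each of the three categories.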

library(dplyr)

Leaving_Date2 <- Leaving_Date %>% inner_join(FOLLOWUPs %>% rename(FU_Date=Date)) %>% 
  mutate(datediff=as.numeric(FU_Date-Date)) %>% group_by(Id,Date) %>% 
  summarise(SEVENDAY=as.numeric(any(datediff %in% 0:6)),
            FOURTEENDAY=as.numeric(any(datediff %in% 0:13)),
            THIRTYDAY=as.numeric(any(datediff %in% 0:29)))

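Answer 2: (score: 0)

We can do this as a query instead of a loop. First, I tidied up the data.tables, because I was getting confused by the variable names.

To make the comparison step easier, we first precompute the follow-up date limits for the 7-, 14-, and 30-day thresholds.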

library(dplyr)

dt_leaving_neat = Leaving_Date %>%
  mutate(.id = 1:n()) %>%
  mutate(limit_07 = Date + 7) %>%
  mutate(limit_14 = Date + 14) %>%
  mutate(limit_30 = Date + 30) %>%
  rename(id = .id, id_person = Id, leaving_date = Date)

dt_follow_neat = FOLLOWUPs %>% 
  select(id_person = Id, followed_up_date = Date)

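The actual operation is just one query. It is written in dplyr for readability, but if speed is your main concern you can translate it to data.table. I recommend running each step of the pipeline on its own to make sure you understand what is happening.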

dt_followed_up = dt_leaving_neat %>%
  tidyr::gather(follow_up, limit_date, limit_07:limit_30) %>%
  left_join(dt_follow_neat, by = "id_person") %>%
  mutate(followed_up = (followed_up_date > leaving_date) & (followed_up_date < limit_date)) %>%
  select(id, id_person, leaving_date, follow_up, followed_up) %>%
  filter(followed_up == TRUE) %>%
  unique() %>%
  tidyr::spread(follow_up, followed_up, fill = 0) %>%
  select(id, id_person, leaving_date, limit_07, limit_14, limit_30)

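The idea is to join the leaving dates to the follow-up dates and check whether each follow-up date falls within the threshold (and after the leaving date, since a follow-up presumably cannot happen before leaving).

Then do some final cleaning to return the desired format. You can also use select or rename to change the column names.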
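One possible data.table translation of this idea is sketched below. It is an assumption about how such a translation might look, not code from the answer: the helper `count_in_window` and the table names `leaving` and `followups` are hypothetical, and the sample data is copied from the question so the block runs on its own. For each threshold it counts follow-ups strictly after the leaving date and strictly before the limit, using a non-equi join with `by = .EACHI`.

```r
library(data.table)

# Sample data copied from the question (table names are illustrative).
leaving <- data.table(
  Id   = c(1, 2, 3, 4, 3, 5),
  Date = as.Date(c("2017-01-01", "2017-02-03", "2017-01-01",
                   "2017-03-10", "2017-02-09", "2017-02-05")))
followups <- data.table(
  Id   = c(1, 2, 3, 2, 2, 3, 3, 4, 1, 5),
  Date = as.Date(c("2016-10-01", "2017-02-04", "2017-01-17", "2017-02-23",
                   "2017-03-03", "2017-02-10", "2017-02-11", "2017-01-01",
                   "2017-01-15", "2017-01-01")))

# Hypothetical helper: for every leaving row, count the follow-ups that fall
# strictly between the leaving date and the leaving date plus `days`.
count_in_window <- function(leave_dt, follow_dt, days) {
  bounds <- leave_dt[, .(Id, lower = Date, upper = Date + days)]
  # by = .EACHI evaluates .N once per row of `bounds`; rows without a match get 0.
  follow_dt[bounds, on = .(Id, Date > lower, Date < upper),
            .N, by = .EACHI]$N
}

# One 0/1 column per threshold, in the row order of `leaving`.
flags <- lapply(c(7, 14, 30), function(d) {
  as.integer(count_in_window(leaving, followups, d) > 0)
})
leaving[, c("limit_07", "limit_14", "limit_30") := flags]

print(leaving)
```

This runs one join per threshold, which keeps the logic close to the dplyr pipeline while avoiding the full cross product of leaving and follow-up dates per person.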

dt_result = dt_leaving_neat %>%
  select(id, id_person, leaving_date) %>%
  left_join(dt_followed_up, by = c("id", "id_person", "leaving_date"))

dt_result[is.na(dt_result)] = 0

Result

> dt_result
  id id_person leaving_date limit_07 limit_14 limit_30
1  1         1   2017-01-01        0        0        1
2  2         2   2017-02-03        1        1        1
3  3         3   2017-01-01        0        0        1
4  4         4   2017-03-10        0        0        0
5  5         3   2017-02-09        1        1        1
6  6         5   2017-02-05        0        0        0

Based on Andrew's answer, an equivalent one-line data.table solution:

FOLLOWUPs[Leaving_Date, on = "Id", .(Id, follow_date = Date, leaving_date = i.Date)][, diff := follow_date - leaving_date][, .(seven = any(diff %in% 0:6), fourteen = any(diff %in% 0:13), thirty = any(diff %in% 0:29)), .(Id, leaving_date)]