Conditionally count the number of unique dates per ID between 2 dates

Asked: 2017-09-04 12:37:30

Tags: r date dataframe

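I have a main table with the dates of the main events for each personid:

dfMain <- data.frame(last    = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
                     previous    = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
                     personid    = c(12341, 122345, 12341, 12341, 122345),
                     diff        = c(NA, NA, 4, 28, 32))

(The NAs in the "previous" and "diff" variables indicate that this is the personid's first "main event", i.e. there is no previous date and no time difference.)

I also have a secondary table containing the "secondary events" for each person:

dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
                          personid = c(122345, 122345, 12341, 122345, 12341))

My question is: given the size of my data, what is the best way to enrich my "dfMain" data frame with the number of unique secondary events that fall between the main event dates for each personid?

In the dummy example, my goal is to get this table:

Occurances  <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)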
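
To make the target concrete, here is a brute-force base R sketch of the requested count. It is only a reference for checking results on small data, not a solution suited to large tables. It assumes the window is inclusive on both ends (the question does not say; with the sample data, strict bounds give the same counts) and that rows without a previous main event should stay NA.

```r
# Sample data from the question (dates kept as character, converted below)
dfMain <- data.frame(
  last     = c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02"),
  previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
  personid = c(12341, 122345, 12341, 12341, 122345),
  diff     = c(NA, NA, 4, 28, 32),
  stringsAsFactors = FALSE
)
dfSecondary <- data.frame(
  date     = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
  personid = c(122345, 122345, 12341, 122345, 12341),
  stringsAsFactors = FALSE
)

# Convert the date columns to class Date so they can be compared
dfMain$last      <- as.Date(dfMain$last)
dfMain$previous  <- as.Date(dfMain$previous)
dfSecondary$date <- as.Date(dfSecondary$date)

# For each row of dfMain, count the unique secondary dates of the same person
# that fall inside [previous, last] (inclusive bounds are an assumption)
dfMain$Occurances <- vapply(seq_len(nrow(dfMain)), function(i) {
  # No previous main event: leave the count as NA
  if (is.na(dfMain$previous[i])) return(NA_integer_)
  d <- dfSecondary$date[dfSecondary$personid == dfMain$personid[i]]
  length(unique(d[d >= dfMain$previous[i] & d <= dfMain$last[i]]))
}, integer(1))

dfMain
```

The loop touches every secondary row for every main row, so it scales poorly; its value is as a check against the join-based answers.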

4 Answers:

Answer 0 (score: 5)

Using the data.table package:

# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]

# create a reference
dfSecondary <- dfSecondary[dfMain
                           , on = .(personid, date > previous, date < last)
                           , .(dates = x.date)
                           , by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))

# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
                                        , by = .(personid, previous, last)]
       , on = .(personid, previous, last)
       , Occ := V1][]

which gives:

         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3

Answer 1 (score: 3)

Using dplyr and tidyr:

library(dplyr)
library(tidyr)

dfMain %>%
  left_join(dfSecondary,by="personid") %>%                  # put everything together
  mutate_at(c("last","previous","date"),as.Date) %>%        # reformat as date
  mutate(is_between = date <= last & date >= previous) %>%  # tests if it's in between
  group_by(last,previous,personid,diff) %>%                 # group by columns from initial df
  summarize(Occ = sum(is_between)) %>%                      # count how many we have in between
  `[<-`(is.na(.$previous),"Occ",NA) %>%                     # add NAs where previous was NA
  ungroup                                                   # ungroup to have regular table

# # A tibble: 5 x 5
#         last   previous personid  diff   Occ
#       <date>     <date>    <dbl> <dbl> <int>
# 1 2017-08-01         NA    12341    NA    NA
# 2 2017-08-01         NA   122345    NA    NA
# 3 2017-08-05 2017-08-01    12341     4     2
# 4 2017-09-02 2017-08-01   122345    32     3
# 5 2017-09-02 2017-08-05    12341    28     0

Note: the row order has changed. Let me know if that is a problem and I will fix it.

Answer 2 (score: 3)

Using a non-equi join

Jaap's data.table approach can be condensed into a "one-liner":

dfMain[, Occurrences := dfSecondary[dfMain, 
                                    on = .(personid, date <= last, date >= previous), 
                                    .N, by = .EACHI]$N][]
         last   previous personid diff Occurrences
1: 2017-08-01       <NA>    12341   NA           0
2: 2017-08-01       <NA>   122345   NA           0
3: 2017-08-05 2017-08-01    12341    4           2
4: 2017-09-02 2017-08-05    12341   28           0
5: 2017-09-02 2017-08-01   122345   32           3

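dfSecondary[dfMain, ...] is a non-equi right join that takes all rows of dfMain and aggregates within the join (by = .EACHI). The result has the same number of rows, in the same order, as dfMain. We can therefore pick the count column N and use it to create the new Occurrences column.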
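
As a sketch of what happens inside the one-liner, the join can be run on its own and inspected before the count column is assigned. The intermediate name `counts` is introduced here only for illustration, and the last step (turning the count into NA where there is no previous main event) is an optional addition to match the table requested in the question. Note that .N counts matching rows, so duplicated secondary dates for the same person would be counted more than once.

```r
library(data.table)

# Sample data from the question, built directly as data.tables with IDate columns
dfMain <- data.table(
  last     = as.IDate(c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02")),
  previous = as.IDate(c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01")),
  personid = c(12341, 122345, 12341, 12341, 122345)
)
dfSecondary <- data.table(
  date     = as.IDate(c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02")),
  personid = c(122345, 122345, 12341, 122345, 12341)
)

# The join on its own: by = .EACHI yields one result row per row of dfMain,
# and N is the number of dfSecondary rows matched by that dfMain row
counts <- dfSecondary[dfMain,
                      on = .(personid, date <= last, date >= previous),
                      .N, by = .EACHI]
counts

# Same row order as dfMain, so the counts can be assigned directly
dfMain[, Occurrences := counts$N]

# Optional: report NA instead of 0 where there is no previous main event
dfMain[is.na(previous), Occurrences := NA_integer_]
dfMain[]
```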

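Non-equi joins are a new feature of data.table, introduced in version 1.9.8 (on CRAN 25 Nov 2016).

Data

The sample datasets need to be coerced to class data.table, and the date columns need to be converted to a date class.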

library(data.table)
cols <- c("last", "previous")
setDT(dfMain)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][]
setDT(dfSecondary)[, date := as.IDate(date)][]

Answer 3 (score: 1)

Here is a solution using the tidyverse.

library(tidyverse)

# Convert columns of factor to date class
# Add an ID column
dfMain2 <- dfMain %>% 
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.Date) %>%
  mutate(ID = 1:n())

# Convert columns of factor to date class
# Add a Count column
dfSecondary2 <- dfSecondary %>% 
  mutate_if(is.factor, as.character) %>%
  mutate_if(is.character, as.Date) %>%
  mutate(Count = 1)

# Create sequence of dates between previous and last
# Unnest the data frame
# Perform join based on "Period" = "date", "personid"
# Group the data frame by ID and calculate the total count
dfMain3 <- dfMain2 %>%
  drop_na(previous) %>%
  mutate(Period = map2(previous, last, seq, by = 1)) %>%
  unnest(Period) %>%
  left_join(dfSecondary2, by = c("Period" = "date", "personid")) %>%
  group_by(ID) %>%
  summarise(Occurances = sum(Count, na.rm = TRUE))

# Join the data frame by ID to create dfObjective
dfObjective <- dfMain2 %>%
  left_join(dfMain3, by = "ID") %>%
  select(-ID)

dfObjective
        last   previous personid diff Occurances
1 2017-08-01       <NA>    12341   NA         NA
2 2017-08-01       <NA>   122345   NA         NA
3 2017-08-05 2017-08-01    12341    4          2
4 2017-09-02 2017-08-05    12341   28          0
5 2017-09-02 2017-08-01   122345   32          3

Data

dfMain <- data.frame(last    = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
                     previous    = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
                     personid    = c(12341, 122345, 12341, 12341, 122345),
                     diff        = c(NA, NA, 4, 28, 32))


dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
                          personid = c(122345, 122345, 12341, 122345, 12341))