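I have a main table containing the dates of the main events for each personid:
dfMain <- data.frame(last = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
                     previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
                     personid = c(12341, 122345, 12341, 12341, 122345),
                     diff = c(NA, NA, 4, 28, 32))
(NAs in the "previous" and "diff" variables mean that this is the person's first "main event", i.e. there is no previous date and no time difference.)
I also have a secondary table containing the "secondary events" for each person:
dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
                          personid = c(122345, 122345, 12341, 122345, 12341))
My question is: what is the best way (given my data volume) to enhance my "dfMain" data frame with the number of unique secondary events between the main event dates for each personid?
In the dummy example, my goal is to obtain this table:
Occurances <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)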
Answer 0 (score: 5)
Using the data.table package:
# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]
# create a reference
dfSecondary <- dfSecondary[dfMain
, on = .(personid, date > previous, date < last)
, .(dates = x.date)
, by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))
# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
, by = .(personid, previous, last)]
, on = .(personid, previous, last)
, Occ := V1][]
which gives:
         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3
Answer 1 (score: 3)
Using dplyr and tidyr:
library(dplyr)
library(tidyr)
dfMain %>%
left_join(dfSecondary,by="personid") %>% # put everything together
mutate_at(c("last","previous","date"),as.Date) %>% # reformat as date
mutate(is_between = date <= last & date >= previous) %>% # tests if it's in between
group_by(last,previous,personid,diff) %>% # group by columns from initial df
summarize(Occ = sum(is_between)) %>% # count how many we have in between
`[<-`(is.na(.$previous),"Occ",NA) %>% # add NAs where previous was NA
ungroup # ungroup to have regular table
# # A tibble: 5 x 5
# last previous personid diff Occ
# <date> <date> <dbl> <dbl> <int>
# 1 2017-08-01 NA 12341 NA NA
# 2 2017-08-01 NA 122345 NA NA
# 3 2017-08-05 2017-08-01 12341 4 2
# 4 2017-09-02 2017-08-01 122345 32 3
# 5 2017-09-02 2017-08-05 12341 28 0
Note: the row order has changed. Let me know if that is a problem and I will fix it.
Answer 2 (score: 3)
Jaap's data.table approach can be condensed into a "one-liner":
dfMain[, Occurrences := dfSecondary[dfMain,
on = .(personid, date <= last, date >= previous),
.N, by = .EACHI]$N][]
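         last   previous personid diff Occurrences
1: 2017-08-01       <NA>    12341   NA           0
2: 2017-08-01       <NA>   122345   NA           0
3: 2017-08-05 2017-08-01    12341    4           2
4: 2017-09-02 2017-08-05    12341   28           0
5: 2017-09-02 2017-08-01   122345   32           3
dfSecondary[dfMain, ...] is a non-equi right join: it takes all rows of dfMain and aggregates within the join (by = .EACHI). The result has the same number of rows, in the same order, as dfMain. We can therefore pick the count column N and assign it as the new Occurrences column. Note that rows without a previous date get a count of 0 here rather than NA.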
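If NA is wanted for a person's first main event, as in the question's dfObjective, the counts can be overwritten after the join. The following is a minimal sketch under that assumption, not part of the original answer; the intermediate name counts is introduced here for illustration only, and the toy data is copied from the question.

```r
library(data.table)

# Toy data copied from the question, with dates already converted to IDate
dfMain <- data.table(
  last     = as.IDate(c("2017-08-01", "2017-08-01", "2017-08-05", "2017-09-02", "2017-09-02")),
  previous = as.IDate(c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01")),
  personid = c(12341, 122345, 12341, 12341, 122345)
)
dfSecondary <- data.table(
  date     = as.IDate(c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02")),
  personid = c(122345, 122345, 12341, 122345, 12341)
)

# Non-equi join: one count per row of dfMain, in dfMain's row order
counts <- dfSecondary[dfMain,
                      on = .(personid, date <= last, date >= previous),
                      .N, by = .EACHI]$N
dfMain[, Occurrences := counts]

# Assumption: a missing 'previous' means there is no interval, so report NA
dfMain[is.na(previous), Occurrences := NA_integer_]
dfMain[]
```

The overwrite is by reference, so dfMain keeps its row order and no extra copy of the table is made.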
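Non-equi joins are a new feature of data.table, introduced in version 1.9.8 (on CRAN since 25 November 2016).
The sample datasets need to be coerced to class data.table, and the various date columns need to be converted to a date class: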
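Before relying on the join above, it may be worth confirming that the installed version is recent enough. A small sketch using base R's packageVersion(); the names min_version and supports_non_equi are made up for this illustration.

```r
# Minimum version taken from the statement above
min_version <- "1.9.8"

# TRUE only if data.table is installed and at least min_version
supports_non_equi <- requireNamespace("data.table", quietly = TRUE) &&
  packageVersion("data.table") >= min_version

if (!supports_non_equi) {
  message("data.table >= ", min_version, " is needed for non-equi joins")
}
```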
library(data.table)
cols <- c("last", "previous")
setDT(dfMain)[, (cols) := lapply(.SD, as.IDate), .SDcols = cols][]
setDT(dfSecondary)[, date := as.IDate(date)][]
Answer 3 (score: 1)
Here is a solution using the tidyverse.
library(tidyverse)
# Convert columns of factor to date class
# Add an ID column
dfMain2 <- dfMain %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, as.Date) %>%
mutate(ID = 1:n())
# Convert columns of factor to date class
# Add a Count column
dfSecondary2 <- dfSecondary %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, as.Date) %>%
mutate(Count = 1)
# Create sequence of dates between previous and last
# Unnest the data frame
# Perform join based on "Period" = "date", "personid"
# Group the data frame by ID and calculate the total count
dfMain3 <- dfMain2 %>%
drop_na(previous) %>%
mutate(Period = map2(previous, last, seq, by = 1)) %>%
unnest(Period) %>%
left_join(dfSecondary2, by = c("Period" = "date", "personid")) %>%
group_by(ID) %>%
summarise(Occurances = sum(Count, na.rm = TRUE))
# Join the data frame by ID to create dfObjective
dfObjective <- dfMain2 %>%
left_join(dfMain3, by = "ID") %>%
select(-ID)
dfObjective
last previous personid diff Occurances
1 2017-08-01 <NA> 12341 NA NA
2 2017-08-01 <NA> 122345 NA NA
3 2017-08-05 2017-08-01 12341 4 2
4 2017-09-02 2017-08-05 12341 28 0
5 2017-09-02 2017-08-01 122345 32 3
Data
dfMain <- data.frame(last = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
personid = c(12341, 122345, 12341, 12341, 122345),
diff = c(NA, NA, 4, 28, 32))
dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
personid = c(122345, 122345, 12341, 122345, 12341))