I have a dataset like the following:
id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09
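I am using R. For each id, I want to flag occurrences of event A that happen more than 2 times within any 12-month period. The 12 months are a rolling window, not a calendar year. Any suggestions?
Edit: this is the algorithm I came up with, but I don't know how to implement it in R.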
Answer 0 (score: 0)
You can try:
df <- read.table(text="id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header=T)
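library(dplyr)  # provides %>%, group_by, arrange, mutate and lag
df$date <- as.Date(df$date)
# Note: flag is computed once per (id, event) group. It is TRUE when any two
# consecutive events are less than 365 days apart, so it marks whole groups
# with at least 2 events in a year rather than individual rows.
df %>%
  group_by(id, event) %>%
  arrange(date) %>%
  mutate(flag=sum(abs(date-lag(date))<365, na.rm=TRUE)>0)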
      id  event       date  flag
   <int> <fctr>     <date> <lgl>
1      3      A 2001-06-07  TRUE
2      3      A 2005-05-10  TRUE
3      3      A 2005-11-01  TRUE
4      1      A 2010-01-04  TRUE
5      1      A 2010-01-05  TRUE
6      1      A 2010-08-09  TRUE
7      2      B 2011-02-11 FALSE
8      2      A 2011-05-09  TRUE
9      1      B 2011-05-09 FALSE
10     2      A 2011-06-09  TRUE
11     2      A 2011-08-25  TRUE
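Since the question asks for more than 2 events within 12 months, a per-row variant may be closer to what is wanted. The following is only a sketch, under the assumption that "12 months" can be approximated by 365 days: after sorting each id's A events by date, a row closes a window holding at least 3 events exactly when the event two positions earlier is no more than 365 days before it. The column names `days_since_2_back` and `flag` are made up for illustration. Note that this marks the event that completes such a window, not the earlier events inside it.

```r
library(dplyr)

df <- read.table(text = "id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header = TRUE, stringsAsFactors = FALSE)
df$date <- as.Date(df$date)

res <- df %>%
  filter(event == "A") %>%          # only event A is of interest
  arrange(id, date) %>%             # oldest first within each id
  group_by(id) %>%
  mutate(
    # days between this event and the event two positions earlier (NA if none)
    days_since_2_back = as.numeric(date - lag(date, 2)),
    # TRUE when this event is at least the 3rd one inside a 365-day span
    flag = !is.na(days_since_2_back) & days_since_2_back <= 365
  ) %>%
  ungroup()

print(res)
```

With the sample data this should flag only the 2010-08-09 row of id 1 and the 2011-08-25 row of id 2; id 3 never has three A events within 365 days.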
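Answer 1 (score: 0)
Here is code that works, although it is not efficient on the large dataset I am dealing with. Feel free to suggest more efficient code.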
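# Keep event A only and, within each id, compute the gap in days
# between each event and the previous (earlier) one
df2<-df %>%
  filter(event=="A") %>%
  group_by(id) %>%
  arrange(id, desc(date)) %>%
  mutate(timediff=difftime(date,lead(date),units="days"))
# The earliest event of each id has no predecessor: use a gap of 0
df2$timediff=ifelse(is.na(df2$timediff),0, df2$timediff)
# Count the events of this id in the 365 days up to and including `date`.
# Reads df2 from the enclosing environment; the timediff argument is unused.
f<-function(id,date,timediff){
  gaps <- cumsum(df2$timediff[df2$id==id&df2$date<=date])
  if (max(gaps)<=365) length(gaps) else min(which(gaps>365))}
# Apply f to every row
df3<-df2 %>%
  rowwise() %>%
  mutate(eventcount=f(id,date,timediff))
df3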
Source: local data frame [9 x 5]
Groups: <by row>
# A tibble: 9 x 5
     id  event       date timediff eventcount
  <chr> <fctr>     <date>    <dbl>      <int>
1     1      A 2010-08-09      216          3
2     1      A 2010-01-05        1          2
3     1      A 2010-01-04        0          1
4     2      A 2011-08-25       77          3
5     2      A 2011-06-09       31          2
6     2      A 2011-05-09        0          1
7     3      A 2005-11-01      175          2
8     3      A 2005-05-10     1433          1
9     3      A 2001-06-07        0          1
Any row with an event count greater than 2 is then flagged.
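The same trailing 365-day count can be sketched in base R without `rowwise()` or repeated lookups into a global data frame. This is an untested-for-speed alternative, not a benchmarked improvement: it still compares every event with every other event of the same id. The names `a`, `eventcount` and `flag` are chosen here for illustration.

```r
df <- read.table(text = "id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header = TRUE, stringsAsFactors = FALSE)
df$date <- as.Date(df$date)

# Keep event A only; rows stay in their original order
a <- df[df$event == "A", ]

# ave() applies FUN to the dates of each id and returns the results
# aligned with the original rows. Dates are handled as day numbers.
a$eventcount <- ave(as.numeric(a$date), a$id, FUN = function(d) {
  # for each date x, count the dates of this id in [x - 365, x]
  vapply(d, function(x) sum(d <= x & d >= x - 365), numeric(1))
})

# Flag rows that are the 3rd or later event A within the trailing 365 days
a$flag <- a$eventcount > 2

print(a[order(a$id, -as.numeric(a$date)), ])
```

On the sample data the counts should agree with the `eventcount` column above, so only id 1 on 2010-08-09 and id 2 on 2011-08-25 end up flagged.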
Answer 2 (score: 0)
For performance reasons, a solution based on data.table:
library(data.table)
library(lubridate)
# Create the data
df <- read.table(text="id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header=T, stringsAsFactors = F)
setDT(df) # convert to a data.table
df[, `:=`(rowno = 1:.N, date.typed = ymd(date))] # add a unique row ID + convert date strings into date type
df[, date.window := (date.typed - years(1))]   # add a column with the start date of the 1-year observation window
# Use data.table chaining to:
# 1. do a non-equi join (1-year event time window) against the rows with event type "A",
# 2. count the events per group, then
# 3. show the ordered output
df[df[event == "A"], c(.SD, irowno = i.rowno, i.date = i.date),
on = .(date.typed >= date.typed, date.window <= date.typed, event == event, id == id),
by = .EACHI] [, .(count = .N), by = .(id, event, date, rowno)] [order(id, -date)]
Output:
   id event       date rowno count
1:  1     A 2010-08-09     6     3
2:  1     A 2010-01-05     5     2
3:  1     A 2010-01-04     1     1
4:  2     A 2011-08-25     8     3
5:  2     A 2011-06-09     7     2
6:  2     A 2011-05-09     3     1
7:  3     A 2005-11-01     4     2
8:  3     A 2005-05-10     9     1
9:  3     A 2001-06-07    10     1
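PS: The unique row number is not strictly required, but it makes the result easier to understand and makes it easier to enrich the original data later...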
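As one possible way to do that enrichment, the counts can be written back onto the original rows with a data.table update join on `rowno`. This is a sketch only: the `counts` table below is typed in by hand from the output shown above rather than recomputed, and the `flag` column is an illustrative addition.

```r
library(data.table)

df <- read.table(text = "id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header = TRUE, stringsAsFactors = FALSE)
setDT(df)
df[, rowno := seq_len(.N)]          # same unique row ID as in the answer

# Counts per row, copied from the output above (rowno, count)
counts <- data.table(
  rowno = c(6L, 5L, 1L, 8L, 7L, 3L, 4L, 9L, 10L),
  count = c(3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 1L)
)

# Update join: add the count to matching rows of df by reference.
# Rows without a match (the event B rows) keep count = NA.
df[counts, on = "rowno", count := i.count]

# Flag rows with more than 2 event A occurrences in the 1-year window
df[, flag := !is.na(count) & count > 2]

print(df)
```

The join leaves the row order of `df` untouched, so the flags line up with the original data.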