Using R

Time: 2017-07-11 02:45:32

Tags: r time data-cleaning

I have a dataset like this:

id  event  date
1   A      2010-01-04
2   B      2011-02-11
2   A      2011-05-09
3   A      2005-11-01
1   A      2010-01-05
1   A      2010-08-09
2   A      2011-06-09
2   A      2011-08-25
3   A      2005-05-10
3   A      2001-06-07
1   B      2011-05-09

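I am using R. For each id, I want to flag event A when it occurs more than 2 times within any 12-month period. The 12 months are a rolling window, not a calendar year. Any good suggestions?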

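Edit: This is the algorithm I came up with, but I don't know how to implement it in R.

  1. Filter the rows for event A
  2. Sort the data frame by id and by date in descending order
  3. Group by id
  4. Compute the date difference between consecutive rows (e.g. row 1 gets the difference between the dates of row 1 and row 2, since both belong to the same id)
  5. For each row of an id, count the events in the rows below it whose summed time differences are less than or equal to 12 months. If the count is greater than 2, flag the row.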
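The steps above can be sketched in base R roughly as follows. This is only a minimal sketch, not a tuned solution: it assumes that "12 months" can be approximated as 365 days, that the window looks backwards from each row's date and includes that row, and the names `a`, `count` and `flag` are made up for illustration. Instead of summing consecutive differences, it compares dates directly, which gives the same count.

```r
# Sample data from the question
df <- read.table(text = "id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header = TRUE, stringsAsFactors = FALSE)
df$date <- as.Date(df$date)

# Step 1: keep only event A
a <- df[df$event == "A", ]

# Steps 3-5: for each row, count the event-A rows of the same id whose date
# falls in the 365 days up to and including that row's date
# (assumption: 12 months is treated as 365 days)
a$count <- vapply(seq_len(nrow(a)), function(i) {
  same_id   <- a$id == a$id[i]
  in_window <- a$date <= a$date[i] & a$date >= a$date[i] - 365
  sum(same_id & in_window)
}, integer(1))

# Flag rows where the window holds more than 2 events
a$flag <- a$count > 2

# Step 2: show the result sorted by id, newest date first
a[order(a$id, -as.numeric(a$date)), ]
```

The loop compares every row with every other row, so it is quadratic in the number of event-A rows; that is fine for small data but is likely to be slow on a large dataset.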

3 answers:

Answer 0: (score: 0)

You can try:

library(dplyr)

df <- read.table(text="id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header=TRUE)
df$date <- as.Date(df$date)

# flag is TRUE for a whole (id, event) group as soon as any two
# consecutive dates in that group are less than 365 days apart
df %>% 
  group_by(id, event) %>% 
  arrange(date) %>% 
  mutate(flag=sum(abs(date-lag(date))<365, na.rm=TRUE)>0)

      id  event       date  flag
   <int> <fctr>     <date> <lgl>
1      3      A 2001-06-07  TRUE
2      3      A 2005-05-10  TRUE
3      3      A 2005-11-01  TRUE
4      1      A 2010-01-04  TRUE
5      1      A 2010-01-05  TRUE
6      1      A 2010-08-09  TRUE
7      2      B 2011-02-11 FALSE
8      2      A 2011-05-09  TRUE
9      1      B 2011-05-09 FALSE
10     2      A 2011-06-09  TRUE
11     2      A 2011-08-25  TRUE

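Answer 1: (score: 0)

Here is code that works, although it is not efficient on the large dataset I am working with. Feel free to suggest something more efficient.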

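# Keep event A, sort newest-first within each id, and compute the gap
# in days to the next-older event of the same id
df2 <- df %>% 
   filter(event == "A") %>% 
   group_by(id) %>%
   arrange(id, desc(date)) %>% 
   mutate(timediff = difftime(date, lead(date), units = "days"))

# The oldest event of each id has no older neighbour; treat its gap as 0
df2$timediff <- ifelse(is.na(df2$timediff), 0, df2$timediff)

# Count the events of this id, on or before this date, whose cumulative gap
# stays within 365 days (the timediff argument is unused; df2 is read globally)
f <- function(id, date, timediff) {
  gaps <- cumsum(df2$timediff[df2$id == id & df2$date <= date])
  if (max(gaps) <= 365) length(gaps) else min(which(gaps > 365))
}

df3 <- df2 %>%
  rowwise() %>%
  mutate(eventcount = f(id, date, timediff))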

df3

Source: local data frame [9 x 5]
Groups: <by row>

# A tibble: 9 x 5
    id  event       date timediff eventcount
   <chr> <fctr>     <date>    <dbl>      <int>
1     1      A 2010-08-09      216          3
2     1      A 2010-01-05        1          2
3     1      A 2010-01-04        0          1
4     2      A 2011-08-25       77          3
5     2      A 2011-06-09       31          2
6     2      A 2011-05-09        0          1
7     3      A 2005-11-01      175          2
8     3      A 2005-05-10     1433          1
9     3      A 2001-06-07        0          1

Any row with an event count greater than 2 gets flagged.
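That last flagging step is not shown in the code above. A minimal sketch of it, rebuilding the printed result by hand so the snippet runs on its own (the data frame `counts` and the column `flag` are hypothetical names):

```r
# Event counts copied from the output above
counts <- data.frame(
  id         = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  date       = as.Date(c("2010-08-09", "2010-01-05", "2010-01-04",
                         "2011-08-25", "2011-06-09", "2011-05-09",
                         "2005-11-01", "2005-05-10", "2001-06-07")),
  eventcount = c(3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 1L)
)

# Flag rows whose trailing window holds more than 2 events
counts$flag <- counts$eventcount > 2

# Show only the flagged rows
counts[counts$flag, ]
```

With the sample data this leaves one flagged row for id 1 and one for id 2; id 3 never reaches three events within a year.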

Answer 2: (score: 0)

For performance reasons, a data.table-based solution:

library(data.table)
library(lubridate)

# Create the data
df <- read.table(text="id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header=T, stringsAsFactors = F)

setDT(df)   # convert to a data.table

df[, `:=`(rowno = 1:.N, date.typed = ymd(date))]  # add a unique row ID and convert the date strings to a date type
df[, date.window := (date.typed - years(1))]      # add a column with the start date of each row's 1-year window

# Use data.table chaining to:
# 1. do a non-equi join (1-year event time window) against the event "A" rows,
# 2. count the events per group, then
# 3. show the ordered output
df[df[event == "A"], c(.SD, irowno = i.rowno, i.date = i.date),
   on = .(date.typed >= date.typed, date.window <= date.typed, event == event, id == id),
   by = .EACHI]    [, .(count = .N), by = .(id, event, date, rowno)]   [order(id, -date)]

Output:

   id event       date rowno count
1:  1     A 2010-08-09     6     3
2:  1     A 2010-01-05     5     2
3:  1     A 2010-01-04     1     1
4:  2     A 2011-08-25     8     3
5:  2     A 2011-06-09     7     2
6:  2     A 2011-05-09     3     1
7:  3     A 2005-11-01     4     2
8:  3     A 2005-05-10     9     1
9:  3     A 2001-06-07    10     1

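PS: The unique row number is not strictly required, but it makes the result easier to understand and makes it easier to enrich the original data later...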
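As a rough illustration of that last point, the counts can be joined back onto the original rows through the row number. This is only a sketch in base R with hypothetical names (`orig`, `counts`, `enriched`, `flag`); the counts are copied by hand from the output above rather than recomputed.

```r
# Original data with the same row numbering as in the answer
orig <- read.table(text = "id event date
1 A 2010-01-04
2 B 2011-02-11
2 A 2011-05-09
3 A 2005-11-01
1 A 2010-01-05
1 A 2010-08-09
2 A 2011-06-09
2 A 2011-08-25
3 A 2005-05-10
3 A 2001-06-07
1 B 2011-05-09", header = TRUE, stringsAsFactors = FALSE)
orig$rowno <- seq_len(nrow(orig))

# Counts per row number, copied from the output above
counts <- data.frame(
  rowno = c(6L, 5L, 1L, 8L, 7L, 3L, 4L, 9L, 10L),
  count = c(3L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 1L)
)

# Left join: rows without a count (the event B rows) get NA
enriched <- merge(orig, counts, by = "rowno", all.x = TRUE)

# Flag rows with more than 2 events in the window; NA counts are not flagged
enriched$flag <- !is.na(enriched$count) & enriched$count > 2
enriched
```

Keeping `all.x = TRUE` preserves the event B rows, so the enriched table has the same number of rows as the original data.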