Manipulating dates and consecutive results

Time: 2018-07-29 13:13:23

Tags: r dplyr data.table

I need some help working with consecutive results.

Here is my sample data:

df <- structure(list(idno = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2), result = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 2L, 2L, 2L), .Label = c("Negative", "Positive"
), class = c("ordered", "factor")), samp_date = structure(c(15909, 
15938, 15979, 16007, 16041, 16080, 16182, 16504, 16576, 16645, 
16721, 16745, 17105, 17281, 17416, 17429), class = "Date")), class = "data.frame", row.names = c(NA, 
-16L))

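"idno" identifies an individual who was tested on a given date ("samp_date"), with the outcome stored in "result".

For each individual, I need to find the earliest run of consecutive "Negative" results and return the date of the first "Negative" in that run. To qualify, the run of negatives must span more than 30 days with no "Positive" result in between.

The expected answer for idno == 1 is 2013-10-29, and for idno == 2 it is 2015-11-06.

I tried using rle(as.character(df$result)), but I am struggling to see how to apply it to grouped data.

I would prefer an approach that uses dplyr or data.table.

Thanks for your help.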

3 answers:

Answer 0: (score: 2)

Similar to @MKR's answer, you can create a grouping variable for the runs and summarize by it in data.table:

library(data.table)
setDT(df)[, samp_date := as.IDate(samp_date)]

# summarize by grouping var g = rleid(idno, result)    
runDT = df[, .(
  start = first(samp_date),
  end  = last(samp_date),
  dur  = difftime(last(samp_date), first(samp_date), units="days")
), by=.(idno, result, g = rleid(idno, result))]

#    idno   result g      start        end      dur
# 1:    1 Negative 1 2013-07-23 2013-07-23   0 days
# 2:    1 Positive 2 2013-08-21 2013-10-01  41 days
# 3:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 4:    2 Positive 4 2015-10-13 2015-10-13   0 days
# 5:    2 Negative 5 2015-11-06 2016-10-31 360 days
# 6:    2 Positive 6 2017-04-25 2017-09-20 148 days

# find rows meeting the criterion
w = runDT[.(idno = unique(idno), result = "Negative", min_dur = 30), 
  on=.(idno, result, dur >= min_dur), mult="first", which=TRUE]

# filter
runDT[w]

#    idno   result g      start        end      dur
# 1:    1 Negative 3 2013-10-29 2015-07-29 638 days
# 2:    2 Negative 5 2015-11-06 2016-10-31 360 days
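
As a side note on why rle() alone is awkward here: a minimal sketch, using made-up toy data rather than the question's df, of how rleid(idno, result) differs from a plain rle() on the result column. rle() knows nothing about idno, so a run can spill from one individual into the next, while rleid() starts a new run whenever any of its arguments changes.

```r
library(data.table)

# Toy input (hypothetical, not the question's data): two individuals
toy <- data.table(
  idno   = c(1, 1, 1, 1, 2, 2),
  result = c("Negative", "Negative", "Positive",
             "Negative", "Negative", "Negative")
)

# rleid() increments whenever idno OR result differs from the previous row,
# so a run never crosses an idno boundary
toy[, g := rleid(idno, result)]
print(toy$g)
# 1 1 2 3 4 4

# Plain rle() on result merges the last negative of idno 1
# with the negatives of idno 2 into a single run of length 3
print(rle(toy$result)$lengths)
# 2 1 3
```

Grouping by g (together with idno and result, as in the answer) then gives one row per run, which is what makes the start/end/dur summary possible.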

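Answer 1: (score: 1)

A dplyr-based solution is to assign a group to each run of consecutive identical values in the result column, and then take the first row of each run that meets the criteria: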

library(dplyr)
df %>% mutate(samp_date = as.Date(samp_date)) %>% 
  group_by(idno) %>%
  arrange(samp_date) %>%
  mutate(result_grp = cumsum(as.character(result)!=lag(as.character(result),default=""))) %>%
  group_by(idno, result_grp) %>%
  filter( result == "Negative" & (max(samp_date) - min(samp_date) )>=30) %>%
  slice(1) %>%
  ungroup() %>%
  select(-result_grp) 

# # A tibble: 2 x 3
# idno result   samp_date 
# <dbl> <ord>    <date>    
# 1  1.00 Negative 2013-10-29
# 2  2.00 Negative 2015-11-06
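
One thing to watch with the run-id approach: slice(1) returns one row per qualifying run, so an individual with several long negative runs would get several rows. The sketch below is a variant under my own assumptions (toy data, a strict "more than 30 days" cutoff as worded in the question, and a hypothetical output column name first_neg_date) that keeps only the earliest date per individual.

```r
library(dplyr)

# Hypothetical toy data: one individual with TWO qualifying negative runs.
# result is plain character here; the question's df uses an ordered factor,
# which is why the answer wraps it in as.character()
toy <- data.frame(
  idno      = c(1, 1, 1, 1, 1),
  result    = c("Negative", "Negative", "Positive", "Negative", "Negative"),
  samp_date = as.Date(c("2020-01-01", "2020-03-01", "2020-04-01",
                        "2020-05-01", "2020-07-01"))
)

first_neg <- toy %>%
  arrange(idno, samp_date) %>%
  group_by(idno) %>%
  # run id: goes up by one each time result differs from the previous row
  mutate(run = cumsum(result != lag(result, default = ""))) %>%
  group_by(idno, run) %>%
  # keep negative runs that span strictly more than 30 days
  filter(result == "Negative",
         as.numeric(max(samp_date) - min(samp_date)) > 30) %>%
  group_by(idno) %>%
  # earliest date among all qualifying runs = start of the earliest run
  summarise(first_neg_date = min(samp_date))

print(first_neg)
```

Individuals with no qualifying run simply drop out of the result, which may or may not be what you want.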

Answer 2: (score: 0)

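library(dplyr)
# time_diff: for a negative sample followed by another negative, the (negative)
# number of days to the next sample; 0 otherwise (NA on a trailing negative)
df %>% group_by(idno) %>% 
       mutate(time_diff = ifelse(result=="Negative" & lead(result)=='Negative', samp_date - lead(samp_date),0), 
              # earliest sample date whose gap to the next negative exceeds 30 days
              ConsNegDate = min(samp_date[which(abs(time_diff)>30)]))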


# # A tibble: 16 x 5
# # Groups:   idno [2]
#     idno result   samp_date  time_diff ConsNegDate
#    <dbl> <ord>    <date>         <dbl> <date>     
#  1     1 Negative 2013-07-23         0 2013-10-29 
#  2     1 Positive 2013-08-21         0 2013-10-29 
#  3     1 Positive 2013-10-01         0 2013-10-29 
#  4     1 Negative 2013-10-29       -34 2013-10-29 
#  5     1 Negative 2013-12-02       -39 2013-10-29 
#  6     1 Negative 2014-01-10      -102 2013-10-29 
#  7     1 Negative 2014-04-22      -322 2013-10-29 
#  8     1 Negative 2015-03-10       -72 2013-10-29 
#  9     1 Negative 2015-05-21       -69 2013-10-29 
# 10     1 Negative 2015-07-29        NA 2013-10-29 
# 11     2 Positive 2015-10-13         0 2015-11-06 
# 12     2 Negative 2015-11-06      -360 2015-11-06 
# 13     2 Negative 2016-10-31         0 2015-11-06 
# 14     2 Positive 2017-04-25         0 2015-11-06 
# 15     2 Positive 2017-09-07         0 2015-11-06 
# 16     2 Positive 2017-09-20         0 2015-11-06
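
Answer 2 comes without an explanation. Reading the code, it seems to test the gap between each negative sample and the next one, rather than the span of the whole negative run. On the sample data both tests pick the same dates, but they can disagree. A small sketch with made-up data (the column names max_gap and run_span are just for illustration):

```r
library(dplyr)

# Hypothetical run: three negatives 20 days apart
# (the run spans 40 days, but no single gap exceeds 30)
toy <- data.frame(
  idno      = c(1, 1, 1),
  result    = c("Negative", "Negative", "Negative"),
  samp_date = as.Date("2020-01-01") + c(0, 20, 40)
)

check <- toy %>%
  group_by(idno) %>%
  summarise(
    # largest gap between adjacent samples (what a lead()-based test looks at)
    max_gap  = max(as.numeric(diff(samp_date))),
    # span of the whole run (what the question's wording describes)
    run_span = as.numeric(max(samp_date) - min(samp_date))
  )

print(check)
```

Here max_gap is 20 and run_span is 40, so a per-gap test would miss a run that the span criterion accepts.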