Group by ID and compute the number of days between the (first and second) and the (first and third) occurrences in R

Asked: 2018-10-04 10:48:56

Tags: r dataframe

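How can I group by ID in R and compute the number of days between the first and second occurrences, and between the first and third occurrences, for each ID? For example, I have the following data frame: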

CRASH_DATE  geoid           CRASH_TIME  type
2015-12-10  123             1650        Fatal_i
2015-12-06  156             1722        Fatal_i
2015-12-10  123             1956        Fatal_i
2015-11-29  156             705         Fatal_i
2015-11-21  156             1756        Fatal_i
2015-12-10  123             1936        Fatal_i
2015-11-19  156             712         Fatal_i
2015-11-21  112             1706        Fatal_i
...

I would like output such as:

geoid   days_between(1,2)    days_between(1,3)
123     0                    0                 
156     2                    10                
112     Nan                  Nan                       
...

Here is my code:

 dt2  <- data.table(table)
 dt22 <- dt2[,list(diff1 = CRASH_DATE - shift(CRASH_TIME, fill = 
 first(CRASH_TIME)),diff2 = CRASH_DATE - shift(CRASH_TIME, fill = 
 first(CRASH_TIME))),by = c("geoid")]

But this gives the wrong result.

4 answers:

Answer 0: (score: 1)

df = read.table(text = "
CRASH_DATE  geoid           CRASH_TIME  type
2015-12-10  123             1650        Fatal_i
2015-12-06  156             1722        Fatal_i
2015-12-10  123             1956        Fatal_i
2015-11-29  156             705         Fatal_i
2015-11-21  156             1756        Fatal_i
2015-12-10  123             1936        Fatal_i
2015-11-19  156             712         Fatal_i
2015-11-21  112             1706        Fatal_i
", header=T)

library(dplyr)
library(lubridate)

df %>%
  mutate(CRASH_DATE = ymd(CRASH_DATE)) %>%  # update to date variable (if needed)
  arrange(CRASH_DATE) %>%
  group_by(geoid) %>%
  summarise(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
            days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1]))

# # A tibble: 3 x 3
#   geoid days_between_1_2 days_between_1_3
#   <int>            <dbl>            <dbl>
# 1   112               NA               NA
# 2   123                0                0
# 3   156                2               10

Answer 1: (score: 1)

Using base R, with aggregate():

df = read.table(text = 
  'CRASH_DATE  geoid           CRASH_TIME  type
  2015-12-10  123             1650        Fatal_i
  2015-12-06  156             1722        Fatal_i
  2015-12-10  123             1956        Fatal_i
  2015-11-29  156             705         Fatal_i
  2015-11-21  156             1756        Fatal_i
  2015-12-10  123             1936        Fatal_i
  2015-11-19  156             712         Fatal_i
  2015-11-21  112             1706        Fatal_i', 
  header=TRUE, 
  stringsAsFactors=FALSE)

df$CRASH_DATE <- as.Date(df$CRASH_DATE)  # convert to date

df <- df[order(df$geoid, df$CRASH_DATE), ]  #sort by geoid, CRASH_DATE

# group by geoid, calculate cumsum(diff(CRASH_DATE)) within each group:
aggregate( df$CRASH_DATE, 
           by=df["geoid"], 
           FUN=function(x) cumsum(as.integer(diff(x))))

  geoid         x
1   112          
2   123      0, 0
3   156 2, 10, 17

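The anonymous function uses:

  • diff() to get the difference in days between each pair of consecutive dates
  • cumsum() to accumulate those differences, giving the days elapsed since the first date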
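
As a small illustration of how these two functions combine, the sketch below applies them to the four sorted dates of geoid 156 from the sample data. The variable names `dates`, `gaps`, and `since_first` are introduced here for illustration only and are not part of the answer above.

```r
# Sorted crash dates for geoid 156, taken from the sample data above
dates <- as.Date(c("2015-11-19", "2015-11-21", "2015-11-29", "2015-12-06"))

# diff() gives the gap in days between each pair of consecutive dates
gaps <- as.integer(diff(dates))
gaps
# [1] 2 8 7

# cumsum() accumulates those gaps, giving days elapsed since the first date
since_first <- cumsum(gaps)
since_first
# [1]  2 10 17

# The first two elements are the (1,2) and (1,3) differences asked for
since_first[1:2]
# [1]  2 10
```

For a group with fewer than three dates, indexing with `[1:2]` returns NA for the missing positions, which matches the NA entries in the desired output.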

Answer 2: (score: 0)

To round out the set of answers, here is a data.table solution, since that is what you were using originally:

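library(data.table)

# assumes dt2 is a data.table and CRASH_DATE is of class Date
setorderv(dt2, c('geoid','CRASH_DATE'), c(1, 1))   # sort by geoid, then by date
dt2[, date_order := seq_len(.N), by = c('geoid')]  # rank the crashes within each geoid

# one row per geoid, one column per rank (`1`, `2`, `3`, ...)
dt2_wide = dcast(dt2, geoid ~ date_order, value.var = "CRASH_DATE")

dt2_wide[,days_between_1_2 := abs(`1` - `2`)]
dt2_wide[,days_between_1_3 := abs(`1` - `3`)]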

Answer 3: (score: 0)

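I suggest the following in data.table style, assuming that the CRASH_DATE column is of class Date and that dt is a data.table object. I understand that you want to keep the row order unchanged, that is, "as is" in the file: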

    dt[,.(days_between_1_2=.SD[2,CRASH_DATE]-.SD[1,CRASH_DATE],
          days_between_1_3=.SD[3,CRASH_DATE]-.SD[1,CRASH_DATE]),geoid]
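
Because this approach relies on the row order of the file, its result can differ from the answers that sort by date first. The following is a minimal sketch of that difference, assuming the eight sample rows from the question. The names `dt`, `as_is`, and `sorted` are illustrative, and plain vector indexing (`CRASH_DATE[2]`) is used in place of `.SD[2, CRASH_DATE]`, which is intended to be equivalent here.

```r
library(data.table)

# Sample rows from the question, in file order (only the columns needed)
dt <- data.table(
  CRASH_DATE = as.Date(c("2015-12-10", "2015-12-06", "2015-12-10", "2015-11-29",
                         "2015-11-21", "2015-12-10", "2015-11-19", "2015-11-21")),
  geoid = c(123L, 156L, 123L, 156L, 156L, 123L, 156L, 112L)
)

# Differences taken in file order: 2nd and 3rd rows of each group vs. the 1st
as_is <- dt[, .(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
                days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1])),
            by = geoid]

# Same computation after sorting the rows by date
sorted <- dt[order(CRASH_DATE),
             .(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
               days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1])),
             by = geoid]

as_is[geoid == 156]   # file order: differences are negative (-7 and -15)
sorted[geoid == 156]  # date order: 2 and 10, as in the desired output
```

In the sample data the rows for geoid 156 appear from latest to earliest, so the file-order version yields negative differences. If the chronological gaps are wanted, sorting first (or wrapping the differences in `abs()`) is one option.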