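How can I group by ID in R and calculate the number of days between the (first and second) and the (first and third) occurrences within each ID? For example, I have the following data frame: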
CRASH_DATE geoid CRASH_TIME type
2015-12-10 123 1650 Fatal_i
2015-12-06 156 1722 Fatal_i
2015-12-10 123 1956 Fatal_i
2015-11-29 156 705 Fatal_i
2015-11-21 156 1756 Fatal_i
2015-12-10 123 1936 Fatal_i
2015-11-19 156 712 Fatal_i
2015-11-21 112 1706 Fatal_i
...
I would like output such as:
geoid days_between(1,2) days_between(1,3)
123 0 0
156 2 10
112 Nan Nan
...
Here is my code:
dt2 <- data.table(table)
dt22 <- dt2[,list(diff1 = CRASH_DATE - shift(CRASH_TIME, fill =
first(CRASH_TIME)),diff2 = CRASH_DATE - shift(CRASH_TIME, fill =
first(CRASH_TIME))),by = c("geoid")]
But this is wrong.
Answer 0: (score: 1)
df = read.table(text = "
CRASH_DATE geoid CRASH_TIME type
2015-12-10 123 1650 Fatal_i
2015-12-06 156 1722 Fatal_i
2015-12-10 123 1956 Fatal_i
2015-11-29 156 705 Fatal_i
2015-11-21 156 1756 Fatal_i
2015-12-10 123 1936 Fatal_i
2015-11-19 156 712 Fatal_i
2015-11-21 112 1706 Fatal_i
", header=T)
library(dplyr)
library(lubridate)
df %>%
mutate(CRASH_DATE = ymd(CRASH_DATE)) %>% # update to date variable (if needed)
arrange(CRASH_DATE) %>%
group_by(geoid) %>%
summarise(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1]))
# # A tibble: 3 x 3
# geoid days_between_1_2 days_between_1_3
# <int> <dbl> <dbl>
# 1 112 NA NA
# 2 123 0 0
# 3 156 2 10
Answer 1: (score: 1)
A base R approach using aggregate():
df = read.table(text =
'CRASH_DATE geoid CRASH_TIME type
2015-12-10 123 1650 Fatal_i
2015-12-06 156 1722 Fatal_i
2015-12-10 123 1956 Fatal_i
2015-11-29 156 705 Fatal_i
2015-11-21 156 1756 Fatal_i
2015-12-10 123 1936 Fatal_i
2015-11-19 156 712 Fatal_i
2015-11-21 112 1706 Fatal_i',
header=TRUE,
stringsAsFactors=FALSE)
df$CRASH_DATE <- as.Date(df$CRASH_DATE) # convert to date
df <- df[order(df$geoid, df$CRASH_DATE), ] #sort by geoid, CRASH_DATE
# group by geoid and calculate cumsum(diff(CRASH_DATE)) within each group:
aggregate( df$CRASH_DATE,
by=df["geoid"],
FUN=function(x) cumsum(as.integer(diff(x))))
geoid x
1 112
2 123 0, 0
3 156 2, 10, 17
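The result above is a variable-length list per geoid rather than the two fixed columns the question asks for. One possible way to get those columns in base R is sketched below. The helper name days_from_first is hypothetical, introduced only for this illustration, and the data frame is rebuilt inline so the snippet runs on its own.

```r
# Sample data from the question (only the columns needed here)
df <- data.frame(
  CRASH_DATE = as.Date(c("2015-12-10", "2015-12-06", "2015-12-10", "2015-11-29",
                         "2015-11-21", "2015-12-10", "2015-11-19", "2015-11-21")),
  geoid = c(123, 156, 123, 156, 156, 123, 156, 112)
)

# Hypothetical helper: days from the earliest date to the k-th earliest date.
# Indexing past the end of a Date vector yields NA, so short groups give NA.
days_from_first <- function(dates, k) {
  dates <- sort(dates)
  as.numeric(dates[k] - dates[1])
}

# One Date vector per geoid
by_geoid <- split(df$CRASH_DATE, df$geoid)

result <- data.frame(
  geoid = as.numeric(names(by_geoid)),
  days_between_1_2 = vapply(by_geoid, days_from_first, numeric(1), k = 2),
  days_between_1_3 = vapply(by_geoid, days_from_first, numeric(1), k = 3),
  row.names = NULL
)
result
```

Groups with fewer than two or three crashes end up with NA in the corresponding column, which matches the intent of the Nan rows in the question.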
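The anonymous function uses diff() to get the difference between consecutive dates and cumsum() to take the cumulative sum of those differences, which gives the number of days from the first date to each later date.
Answer 2: (score: 0)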
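To complete the set of answers, here is a data.table solution, since that is what you were originally using:
# assumes dt2 is a data.table and CRASH_DATE is of class Date
setorderv(dt2, c('geoid','CRASH_DATE'), c(1, 1))   # sort by geoid, then by date
dt2[, date_order := 1:.N, by = c('geoid')]         # rank the crashes within each geoid
dt2_wide = dcast(dt2, geoid ~ date_order, value.var = "CRASH_DATE")  # one column per rank
dt2_wide[,days_between_1_2 := abs(`1` - `2`)]
dt2_wide[,days_between_1_3 := abs(`1` - `3`)]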
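Answer 3: (score: 0)
I suggest the following in data.table style, assuming the CRASH_DATE column is of class Date and dt is a data.table object. I understand that you want the row order left unchanged, that is, exactly as the rows appear in the file: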
dt[,.(days_between_1_2=.SD[2,CRASH_DATE]-.SD[1,CRASH_DATE],
days_between_1_3=.SD[3,CRASH_DATE]-.SD[1,CRASH_DATE]),geoid]
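Whether the rows are left in file order or sorted by date first changes the result, so it may help to compare the two. The sketch below assumes the data.table package is installed and rebuilds the sample data inline. The names file_order and date_order are illustrative only, and it indexes CRASH_DATE directly instead of going through .SD, which should give the same values.

```r
library(data.table)

# Sample data from the question, in the order the rows appear in the file
dt <- data.table(
  CRASH_DATE = as.Date(c("2015-12-10", "2015-12-06", "2015-12-10", "2015-11-29",
                         "2015-11-21", "2015-12-10", "2015-11-19", "2015-11-21")),
  geoid = c(123, 156, 123, 156, 156, 123, 156, 112)
)

# Differences taken in file order: values can be negative
file_order <- dt[, .(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
                     days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1])),
                 by = geoid]

# Differences taken after sorting the rows by date: values are non-negative
date_order <- dt[order(CRASH_DATE),
                 .(days_between_1_2 = as.numeric(CRASH_DATE[2] - CRASH_DATE[1]),
                   days_between_1_3 = as.numeric(CRASH_DATE[3] - CRASH_DATE[1])),
                 by = geoid]

file_order
date_order
```

For geoid 156 the file-order version gives -7 and -15, because the file lists that group's crashes from latest to earliest, while the sorted version gives the 2 and 10 shown in the desired output.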