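I'm working with some data that has id, month and date columns. I'd like to get the average difference between dates for each id and each month (so two grouping variables). I've read this post and tried to adapt its answer (which groups only by id, not by month), without luck.
My data look like this: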
test <-structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1", class = "factor"),
month = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4),
date = structure(c(17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555,
17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555, 17555,
17555, 17579, 17579, 17579, 17579, 17579, 17579, 17579, 17579,
17579, 17579, 17579, 17579, 17618, 17618, 17618, 17618, 17618,
17618, 17618, 17618, 17618, 17618, 17618, 17621, 17621, 17621,
17621, 17621, 17621, 17621, 17621, 17621, 17621, 17621, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649, 17649,
17649, 17649, 17649, 17649, 17649), class = "Date")),class="data.frame",row.names = c(NA,-98L))
It looks like this (sorry for the dput(), but it's the least painful way to share a data example):
head(test)
id month date
1 1 1 2018-01-24
2 1 1 2018-01-24
3 1 1 2018-01-24
4 1 1 2018-01-24
5 1 1 2018-01-24
6 1 1 2018-01-24
So I tried this:
library(dplyr)
test %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(mean(diff(date))))%>%data.frame()
The result is:
> result
id month avg
1 1 1 0.0000000
2 1 2 0.0000000
3 1 3 0.1428571
4 1 4 0.0000000
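But looking at the data, something is off for March: the dates in March are the 28th and the 31st, their difference is 3, so the mean of the differences should be 3 (there is only one gap).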
> table(test[which(test$month==3),]$date)
2018-03-28 2018-03-31
11 11
What am I doing wrong?
Thanks in advance.
Answer 0 (score: 3)
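The result you get is correct: diff(date) computes the differences between all pairs of consecutive dates in your data (within each group, after sorting by date). In March you have 2018-03-28 eleven times and 2018-03-31 eleven times. So for March, diff(date) is ten 0s, then a single 3, then ten more 0s, and the mean is therefore 3/21 = 0.143.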
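The arithmetic can be checked in base R without dplyr. This is only a minimal sketch: the vector names march and gaps are made up for illustration, and the dates are rebuilt by hand from the counts shown in the question's table() output.

```r
# Rebuild the March dates from the question: 11 copies of each date.
march <- as.Date(c(rep("2018-03-28", 11), rep("2018-03-31", 11)))

# 22 sorted dates give 21 consecutive differences (in days).
gaps <- as.numeric(diff(sort(march)))

table(gaps)   # 20 zeros and a single 3
mean(gaps)    # 3 / 21, which is what summarize() reported for March

# Once duplicate dates are dropped, only one gap is left.
as.numeric(diff(unique(march)))
```

The last line is the base-R counterpart of deduplicating before taking the differences.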
Maybe you want to keep only the distinct combinations of (id, month, date) first:
test %>%
distinct(id, month, date) %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(mean(diff(date)))) %>%
data.frame()
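Note that this gives 3 for March but NaN for the other months, because you are calling diff on a vector of length 1, which returns a vector of length 0, and the mean of an empty vector is NaN. To avoid this, you can use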
test %>%
distinct(id, month, date) %>%
group_by(id,month)%>%
arrange(date) %>%
summarize(avg = as.numeric(max(date)-min(date)) / max(1, n()-1))
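Why this works: the consecutive differences of sorted dates telescope, so their sum is max(date) - min(date), and dividing by n() - 1 gives their mean. The max(1, ...) guard only avoids dividing by zero when a group has a single date. A small base-R sketch of the same idea follows; avg_gap is a hypothetical helper name, not something from dplyr, and the sample dates are taken from the question.

```r
# A single date: diff() has nothing to compare, so the mean is NaN.
single <- as.Date("2018-01-24")
length(diff(single))             # 0
mean(as.numeric(diff(single)))   # NaN

# Hypothetical helper mirroring the summarize() expression above.
avg_gap <- function(dates) {
  dates <- unique(dates)                     # same role as distinct()
  span  <- as.numeric(max(dates) - min(dates))
  span / max(1, length(dates) - 1)           # guard against 0 gaps
}

march <- as.Date(c(rep("2018-03-28", 11), rep("2018-03-31", 11)))
avg_gap(single)   # 0 instead of NaN
avg_gap(march)    # 3
```

Whether 0 is the right value for a month with a single date is a judgment call; returning NA_real_ for such groups would be an equally reasonable choice.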