I have a time-series dataset with 10000 rows covering one year. The data looks like this:
2012-01-01 06:23:02 c d10
2012-01-01 08:12:12 d d2
...........................
2012-12-31 08:22:24 s d5
It has 3 fields:
date_time, category1, category2, where category1 contains 4 categorical values (c, v, d, s) and category2 contains 10 categorical values (d1 ... d10).
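I want to count the occurrences of each category1 value (c, v, d, s) separately for each category2 value d1 ... d10. That is, for each of d1, d2, ..., d10, how many c, v, d, s fall into each hour-of-day range 0-1, 1-2, ..., 22-23.
How can I do this for the ranges 1-2, 2-3, 3-4, ..., 23-24?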
The sample output should look like this:
      1-2               2-3   3-4   ........   23-24
d1    c=2,d=3,v=3,s=4
d2    c=3,d=3,v=2,s=2
..................
d10
I have tried the lubridate and data.table packages but could not find a solution that gives the expected result.
Answer 0 (score: 0)
The expected result is not clear. This may help:
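# Hour of day (0-23) for each timestamp
indx <- with(dat1, as.numeric(format(as.POSIXct(cut(date_time,
                 breaks='hour')), '%H')))
# Label each row with its hour range, e.g. "6-7"
dat1$indx1 <- interaction(indx, indx+1, sep="-",
                          lex.order=TRUE, drop=TRUE)
dat1$date_time <- as.character(dat1$date_time)
library(reshape2)
# Wide format: one row per category1/category2 pair, one column per hour range
res1 <- dcast(dat1, category1+category2~indx1, value.var='date_time')
# Convert the timestamp columns back to POSIXct
res1[,-(1:2)] <- lapply(res1[,-(1:2)], as.POSIXct)
head(res1,2)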
# category1 category2 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8
#1 c1 d1 <NA> 2012-01-03 01:43:02 <NA> <NA> <NA> <NA> <NA> <NA>
#2 c1 d10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
# 8-9 9-10 10-11 11-12 12-13 13-14 14-15 15-16
#1 <NA> 2012-01-01 09:13:02 <NA> <NA> <NA> <NA> <NA> <NA>
#2 <NA> 2012-01-02 09:43:02 <NA> 2012-01-02 11:03:02 <NA> <NA> <NA> <NA>
# 16-17 17-18 18-19 19-20 20-21 21-22 22-23 23-24
#1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
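The hour-range label can also be derived directly from the timestamp, without going through `cut`. A minimal base-R sketch, assuming `date_time` is already POSIXct; the helper name `hour_bin` and the vector `x` are made up for illustration and are not part of the answer above:

```r
# Hypothetical helper: map timestamps to "h-(h+1)" hour-of-day labels.
hour_bin <- function(date_time) {
  h <- as.integer(format(date_time, "%H"))  # hour of day, 0..23
  # Fix the level order so the bins sort 0-1, 1-2, ..., 23-24.
  factor(paste(h, h + 1, sep = "-"), levels = paste(0:23, 1:24, sep = "-"))
}

# Three timestamps taken from the question's sample rows.
x <- as.POSIXct(c("2012-01-01 06:23:02", "2012-01-01 08:12:12",
                  "2012-12-31 08:22:24"), tz = "UTC")
as.character(hour_bin(x))
```

Unlike `interaction(..., drop=TRUE)`, this keeps all 24 levels, so hours with no observations would still appear as (empty) columns after reshaping.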
If you want counts instead:
res2 <- dcast(dat1, category1+category2~indx1, value.var='date_time', length)
res2[1:3,1:3]
# category1 category2 0-1
#1 c1 d1 0
#2 c1 d10 0
#3 c1 d11 0
# Sample data used in the examples above
set.seed(24)
dat1 <- data.frame(date_time=seq(as.POSIXct('2012-01-01 06:23:02',
    format='%Y-%m-%d %H:%M:%S'), length.out=300, by='10 min'),
  category1 = sample(paste0('c',1:20), 300, replace=TRUE),
  category2 = sample(paste0('d', 1:20), 300, replace=TRUE))
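To get closer to the layout asked for in the question (one row per `category2` value, one column per hour range, each cell listing the counts per `category1` value), here is a sketch in base R. The data frame `toy`, its values, and the `"c=2,s=1"` cell format are assumptions based on the question's sample output, not something the answer above provides:

```r
# Toy data in the shape described in the question (values are made up).
toy <- data.frame(
  date_time = as.POSIXct(c("2012-01-01 06:23:02", "2012-01-01 06:40:00",
                           "2012-01-01 08:12:12", "2012-03-05 06:05:00",
                           "2012-12-31 08:22:24"), tz = "UTC"),
  category1 = c("c", "c", "d", "s", "s"),
  category2 = c("d10", "d10", "d2", "d10", "d5"),
  stringsAsFactors = FALSE
)

# Hour-of-day label for each row, e.g. "6-7".
h <- as.integer(format(toy$date_time, "%H"))
toy$bin <- factor(paste(h, h + 1, sep = "-"),
                  levels = paste(0:23, 1:24, sep = "-"))

# Count rows per (category2, bin, category1) and keep non-empty combinations.
cnt <- as.data.frame(table(category2 = toy$category2, bin = toy$bin,
                           category1 = toy$category1),
                     stringsAsFactors = FALSE)
cnt <- cnt[cnt$Freq > 0, ]

# Collapse the per-category1 counts of each cell into one string.
cnt$label <- paste(cnt$category1, cnt$Freq, sep = "=")
cells <- aggregate(label ~ category2 + bin, data = cnt,
                   FUN = paste, collapse = ",")

# Spread into a category2 x bin character matrix; empty cells stay "".
res <- matrix("", nrow = length(unique(toy$category2)), ncol = 24,
              dimnames = list(sort(unique(toy$category2)), levels(toy$bin)))
res[cbind(cells$category2, cells$bin)] <- cells$label
res[, c("6-7", "8-9")]
```

With this toy data the cell for `d10` in the `6-7` column reads `c=2,s=1`. Note that the counts are pooled over all days of the year, which is how the question's hour ranges are interpreted here.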