Animals at the zoo: can we aggregate a daily time series of factors and flag activity by ID?

Time: 2013-05-12 06:20:22

Tags: r data.table zoo sqldf

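Suppose a zoo keeps a daily time series of animal activity spanning several years. A subset of the very large dataset might look like this: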

library(data.table)
type <- c(rep('giraffe',90),rep('monkey',90),rep('anteater',90))
status <- as.factor(c(rep('display',31),rep('caged',28),rep('display',31),
rep('caged',25), rep('display',35),rep('caged',30),rep('caged',10),
rep('display',10),rep('caged',10),rep('display',60)))
date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"), "day" ),3)

Here 'type' is the kind of animal and 'status' indicates what the animal was doing that day, e.g., caged or on display.

animals <-  data.table(type,status,date);animals
         type  status       date
  1:  giraffe display 2001-01-01
  2:  giraffe display 2001-01-02
  3:  giraffe display 2001-01-03
  4:  giraffe display 2001-01-04
  5:  giraffe display 2001-01-05
 ---                            
266: anteater display 2001-03-27
267: anteater display 2001-03-28
268: anteater display 2001-03-29
269: anteater display 2001-03-30
270: anteater display 2001-03-31

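Suppose we want to aggregate this into a monthly series that summarizes each animal's status over the whole month. In the new series, 'status' reflects the animal's status at the start of that month. 'fullmonth' is a binary variable (1 = TRUE, 0 = FALSE) indicating whether that status persisted for the entire month, and 'anydisp' is a binary variable (1 = TRUE, 0 = FALSE) indicating whether the animal was on display at any time during the month (>= 1 day). So, because the giraffe was on display for all of January and March but caged throughout February, it is flagged accordingly.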

date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"),"month"),3)
type <- c(rep('giraffe',3),rep('monkey',3),rep('anteater',3))
status <- as.factor(c('display','caged','display','caged','display','caged',
'caged','display','display'))
fullmonth <- c(1,1,1,0,1,0,0,1,1)
anydisp <- c(1,0,1,1,1,1,1,1,1)

animals2 <- data.table(date,type,status,fullmonth,anydisp);animals2
      date     type  status fullmonth anydisp
2001-01-01  giraffe display         1       1
2001-02-01  giraffe   caged         1       0
2001-03-01  giraffe display         1       1
2001-01-01   monkey   caged         0       1
2001-02-01   monkey display         1       1
2001-03-01   monkey   caged         0       1
2001-01-01 anteater   caged         0       1
2001-02-01 anteater display         1       1
2001-03-01 anteater display         1       1

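I thought zoo might be the way to go, but after playing with it I found that it doesn't handle non-numeric values well, and even if I assign arbitrary values to the qualitative component (status), it isn't clear how that would solve the problem.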

##aggregate function with zoo? 
library(zoo)
animals$activity <- as.numeric(ifelse(status=='display',1,0))
animals2 <- subset(animals, select=c(date,activity))
datas <- zoo(animals2)
monthlyzoo <- aggregate(datas,as.yearmon,sum)
Error in Summary.factor(1L, na.rm = FALSE) : 
  sum not meaningful for factors

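Does anyone know of a solution using sqldf or data.table?

Update

I'd like to add a new requirement: the date shown should be the first day of the month, even if the data start later in that month. For example, this dataset illustrates the situation: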

animals2 <- animals[30:270,];head(animals2)

setkey(animals2, "type", "date")

oo <- animals2[, list(date=date[1], status = status[1],
                      fullmonth = 1 * all(status == status[1]),
                      anydisplay = any(status == "display") * 1 ),
               by = list(month(date), type)][, month := NULL]
oo

      type       date  status fullmonth anydisplay
1: anteater 2001-01-30   caged         0          1
2: anteater 2001-02-01 display         1          1
3: anteater 2001-03-01 display         1          1
4:  giraffe 2001-01-01 display         1          1
5:  giraffe 2001-02-01   caged         1          0
6:  giraffe 2001-03-01 display         1          1
7:   monkey 2001-01-01   caged         0          1 
8:   monkey 2001-02-01 display         1          1
9:   monkey 2001-03-01 display         0          1

sqldf("select 
    min(date) date, 
    type,
    status, 
    max(status) = min(status) fullmonth,
    sum(status = 'display') > 0 anydisp
from animals2
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")

        date     type  status fullmonth anydisp
1 2001-01-30 anteater   caged         0       1
2 2001-02-01 anteater display         1       1
3 2001-03-01 anteater display         1       1
4 2001-01-01  giraffe display         1       1
5 2001-02-01  giraffe   caged         1       0
6 2001-03-01  giraffe display         1       1
7 2001-01-01   monkey   caged         0       1
8 2001-02-01   monkey display         1       1
9 2001-03-01   monkey   caged         0       1

This can be achieved with either solution by modifying the dates in a post-processing step:

dateswitch <- paste(year(animals2$date),month(animals2$date),1,sep='/')
dateswitch <- as.Date(dateswitch, "%Y/%m/%d")
animals2$date <- as.Date(dateswitch)

2 answers:

Answer 0 (score: 3)

Something like this?

setkey(animals, "type", "date")
oo <- animals[, list(date=date[1], status = status[1], 
                     fullmonth = 1 * all(status == status[1]), 
                     anydisplay = any(status == "display") * 1), 
by = list(month(date), type)][, month := NULL]
#        type       date  status fullmonth anydisplay
# 1: anteater 2001-01-01   caged         0          1
# 2: anteater 2001-02-01 display         1          1
# 3: anteater 2001-03-01 display         1          1
# 4:  giraffe 2001-01-01 display         1          1
# 5:  giraffe 2001-02-01   caged         1          0
# 6:  giraffe 2001-03-01 display         1          1
# 7:   monkey 2001-01-01   caged         0          1
# 8:   monkey 2001-02-01 display         1          1
# 9:   monkey 2001-03-01 display         0          1
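
Since the question mentions several years of data, note that grouping by `month(date)` alone would pool, say, January 2001 with January 2002. A minimal sketch of a variant that also groups by year is shown below; the toy table `dt` and the helper column names `yr` and `mon` are assumptions made up for illustration and are not part of the original answer.

```r
library(data.table)

# Hypothetical toy data: one animal, on display in January 2001
# and caged in January 2002.
dt <- data.table(
  type   = "giraffe",
  status = factor(c("display", "display", "caged", "caged")),
  date   = as.Date(c("2001-01-01", "2001-01-02", "2002-01-01", "2002-01-02"))
)
setkey(dt, type, date)

# Group by year as well as month so the two Januaries stay separate.
monthly <- dt[, list(date       = date[1],
                     status     = status[1],
                     fullmonth  = 1 * all(status == status[1]),
                     anydisplay = 1 * any(status == "display")),
              by = list(yr = year(date), mon = month(date), type)]

# Drop the helper grouping columns, as the original answer does with `month`.
monthly[, c("yr", "mon") := NULL]
monthly
```

With month-only grouping, the four rows above would fall into a single group and be reported as not holding one status for the full month; with the year included, each January is summarized on its own.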

Answer 1 (score: 2)

Here is an sqldf solution:

library(sqldf)

# define input data.frame where type, status and date variables are defined in question
animals <-  data.frame(type,status,date)

sqldf("select 
    min(date) date, 
    type,
    status, 
    max(status) = min(status) fullmonth,
    sum(status = 'display') > 0 anydisp
from animals
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")

The output of this command is:

        date     type  status fullmonth anydisp
1 2001-01-01 anteater   caged         0       1
2 2001-02-01 anteater display         1       1
3 2001-03-01 anteater display         1       1
4 2001-01-01  giraffe display         1       1
5 2001-02-01  giraffe   caged         1       0
6 2001-03-01  giraffe display         1       1
7 2001-01-01   monkey   caged         0       1
8 2001-02-01   monkey display         1       1
9 2001-03-01   monkey   caged         0       1
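
The `date * 3600 * 24` term in the `group by` clause appears to rely on R storing a `Date` as the number of days since 1970-01-01, so that multiplying by the seconds in a day gives a Unix timestamp that SQLite's `strftime(..., 'unixepoch')` can format. The base-R sketch below reproduces only that arithmetic, without sqldf; the example date and the variable names are assumptions for illustration.

```r
d <- as.Date("2001-02-15")

days <- as.numeric(d)      # days since 1970-01-01
secs <- days * 3600 * 24   # seconds since the Unix epoch

# Format the timestamp as "year month", mirroring the '%Y %m' grouping key.
key <- format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"), "%Y %m")
key
```

Every date in February 2001 maps to the same key, which is what lets the query collapse the daily rows into one row per animal and month.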

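Added: The poster later added a further requirement to the question: the date should be shown as the first day of the month, even if the data do not start until later in that month. If DF is the result of the sqldf statement above, convert it like this: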

library(zoo)
transform(DF, date = as.Date(as.yearmon(date)))

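Or, perhaps better, drop the day part altogether (it could be seen as misleading when there is no data for that date) and use the "yearmon" class, which gives only the year and month: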

library(zoo)
transform(DF, date = as.yearmon(date))
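
To see what the two conversions do to a single value, here is a small sketch using the 2001-01-30 date from the anteater row in the question's update; the variable names are assumptions for illustration.

```r
library(zoo)

d <- as.Date("2001-01-30")

ym    <- as.yearmon(d)   # keeps only the year and month
first <- as.Date(ym)     # back to a Date, on the first day of that month

format(ym, "%Y-%m")
first
```

The `yearmon` value carries no day at all, while the round trip through `as.Date` pins the row to the first of the month as the updated question asks.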