Using dcast.data.table with date values and aggregation

Time: 2016-12-23 22:49:36

Tags: r data.table dcast

I'm trying to figure this out. Suppose you have a data.table:

library(data.table)

dt <- data.table(person = c('bob', 'bob', 'bob'),
                 door = c('front door', 'front door', 'front door'),
                 type = c('timeIn', 'timeIn', 'timeOut'),
                 time = c(
                   as.POSIXct('2016 12 02 06 05 01', format = '%Y %m %d %H %M %S'),
                   as.POSIXct('2016 12 02 06 05 02', format = '%Y %m %d %H %M %S'),
                   as.POSIXct('2016 12 02 06 05 03', format = '%Y %m %d %H %M %S')
                 )
)

I'd like to cast it into something like this:

person  door        timeIn            timeOut
bob     front door  min(<date/time>)  max(<date/time>)

I can't seem to get the right syntax for dcast.data.table. I tried

dcast.data.table(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

which throws an error:

Aggregating function should take vector inputs and return a single value (length=1).

I also tried:

dcast.data.table(dt, person + door ~ type, value.var = 'time')

But the result throws away my dates.

Any suggestions would be greatly appreciated. TIA

2 answers:

Answer 0 (score: 7)

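There are several ways to achieve the desired result using dcast. jazzurro's solution aggregates before reshaping the result. The approaches here use dcast directly but may require some post-processing. We are using jazzurro's data, adjusted to the UTC time zone, and CRAN version 1.10.0 of data.table.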

1. Getting ifelse to work

As stated in the question,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

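returns an error message. The full text of the error message includes a hint to use the fill parameter. Unfortunately, ifelse() does not respect the POSIXct class (see ?ifelse for details), so the class needs to be enforced explicitly: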

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) 
    lubridate::as_datetime(ifelse(type == 'timeIn', min(x), max(x))),
  fill = 0
)

With this, we get

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05
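The class loss mentioned above can be seen in isolation with base R alone. This is only a minimal sketch: the two timestamps are copied from the example data, and the names raw and restored are illustrative, not part of the answer.

```r
# Two POSIXct timestamps in UTC (values copied from the example data)
x <- as.POSIXct(c("2016-12-02 06:05:01", "2016-12-02 06:05:02"), tz = "UTC")

# ifelse() builds its result from the logical test vector, so the
# POSIXct class of min(x) is dropped and a bare number comes back
raw <- ifelse(TRUE, min(x), max(x))
class(raw)

# Restoring the class by hand; the number is seconds since the Unix epoch
restored <- as.POSIXct(raw, origin = "1970-01-01", tz = "UTC")
restored
```

lubridate::as_datetime() in the call above plays the role of the manual as.POSIXct() conversion here.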

2. An alternative to ifelse

The help page of ifelse suggests

(tmp <- yes; tmp[!test] <- no[!test]; tmp)

as an alternative. Following this advice,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) {
    test <- type == "timeIn"; tmp <- min(x); tmp[!test] = max(x)[!test]; tmp
    }
)

returns

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Note that neither the fill parameter nor a coercion to POSIXct is needed.
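Why no coercion is needed can be checked outside dcast. The following base R sketch assumes a single group whose label is "timeOut"; the variable type here is a hypothetical stand-in for the grouping column, and the timestamps are copied from the example data.

```r
# POSIXct values of one group (copied from the example data)
x <- as.POSIXct(c("2016-12-02 06:05:03", "2016-12-02 06:05:05"), tz = "UTC")
type <- "timeOut"  # hypothetical group label, length 1 as within one group

# The idiom from ?ifelse: start from the 'yes' value, then overwrite
# the positions where the test is FALSE with the 'no' value
test <- type == "timeIn"
tmp <- min(x)
tmp[!test] <- max(x)[!test]

# tmp was a POSIXct object from the start, so subassignment keeps its class
class(tmp)
tmp
```

Because the result vector is created from min(x) rather than from the logical test, the POSIXct attributes are never stripped.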

3. Using the enhanced dcast

With recent versions of dcast.data.table, we can supply a list of functions to fun.aggregate:

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))

returns

#   person       door     time_min_timeIn    time_min_timeOut     time_max_timeIn    time_max_timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:03 2016-12-02 07:06:02 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:03 2016-12-02 06:05:02 2016-12-02 06:05:05

We can remove the unwanted columns and rename the others with

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))[
  , .(person, door, timeIn = time_min_timeIn, timeOut = time_max_timeOut)]

which gives us

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Data

As mentioned above, we are using jazzurro's data

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

but coerce the time zone to UTC using

dt[, time := lubridate::with_tz(time, "UTC")]

so that we have

dt
#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

independent of the local time zone.
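The reason the printed times no longer depend on the local time zone can be sketched in base R. A POSIXct value stores seconds since the Unix epoch, and the tzone attribute only controls how that instant is displayed. Setting the attribute directly, as below, is an assumption about what the with_tz() call amounts to for a POSIXct vector, not a description of lubridate's implementation.

```r
# First timestamp of the example data, created without a time zone attribute
t <- structure(1480658701, class = c("POSIXct", "POSIXt"))

# Attach an explicit display time zone; the stored number is untouched
attr(t, "tzone") <- "UTC"

as.numeric(t)  # still 1480658701
format(t)      # "2016-12-02 06:05:01"
```

Without the tzone attribute, format(t) would use the time zone of the machine running the code, which is why the same data can print differently on different systems.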

Answer 1 (score: 6)

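Here is one way to achieve your goal. I modified your dt and created the following data set. For each person, I looked for the minimum time for timeIn and the maximum time for timeOut. Then, I applied dcast() to the result.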

#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

library(data.table)

dcast(
   dt[, .SD[(type == "timeIn" & time == min(time))|(type == "timeOut" & time == max(time))], by = person],
   person + door ~ type,
   value.var = "time")  # named explicitly so dcast does not have to guess the value column

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

DATA

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))