Time series of categorical data - how to calculate the percentage of each category over time?

Time: 2016-05-27 02:11:50

Tags: r time-series categorical-data

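I have a data frame of timestamps, each marking a categorical status. A status remains in effect until the next timestamp, at which point the category may change.

I would like to determine the percentage of time spent in each category over regular periods, such as monthly, quarterly, or yearly.

This seems like a common problem, but I have not been able to find an elegant solution or library for it.

For example, with the following sample data frame: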

            date    status
2016-02-20 09:11:00      a
2016-03-06 02:38:00      c
2016-03-10 15:20:00      b
2016-03-10 21:20:00      a
2016-03-11 11:51:00      b
2016-03-12 01:19:00      c
2016-03-22 14:39:00      c
2016-03-23 11:37:00      b
2016-03-25 17:38:00      c
2016-03-26 01:24:00      c
2016-03-26 12:40:00      a
2016-04-12 10:28:00      c

...I might want to report weekly, for 3/1-3/7, 3/8-3/14, and 3/15-3/21, the percentage of time each week spent in the 'a', 'b', and 'c' states.
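One way to sketch the weekly report described above in base R is to treat each row as an interval that lasts until the next timestamp, clip every interval to each reporting window, and sum the overlap per status. The helper name `pct_by_window`, the UTC time zone, and the choice to drop the last row (its end time is unknown) are assumptions made for illustration, not part of the original question.

```r
# Example data from the question; the UTC time zone is an assumption.
d <- data.frame(
  date = as.POSIXct(c(
    "2016-02-20 09:11:00", "2016-03-06 02:38:00", "2016-03-10 15:20:00",
    "2016-03-10 21:20:00", "2016-03-11 11:51:00", "2016-03-12 01:19:00",
    "2016-03-22 14:39:00", "2016-03-23 11:37:00", "2016-03-25 17:38:00",
    "2016-03-26 01:24:00", "2016-03-26 12:40:00", "2016-04-12 10:28:00"
  ), tz = "UTC"),
  status = c("a", "c", "b", "a", "b", "c", "c", "b", "c", "c", "a", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: percentage of each window spent in each status.
# Assumes df is sorted by date and breaks is an increasing POSIXct vector.
pct_by_window <- function(df, breaks) {
  n      <- nrow(df)
  start  <- as.numeric(df$date[-n])  # interval starts, in seconds
  end    <- as.numeric(df$date[-1])  # a status lasts until the next timestamp
  status <- factor(df$status[-n])    # last row dropped: its end is unknown
  b      <- as.numeric(breaks)

  rows <- lapply(seq_len(length(b) - 1), function(i) {
    # Seconds of each interval that fall inside window i (0 if no overlap).
    overlap <- pmax(0, pmin(end, b[i + 1]) - pmax(start, b[i]))
    secs <- tapply(overlap, status, sum)
    data.frame(window = breaks[i], status = names(secs),
               pct = round(100 * as.vector(secs) / sum(secs), 1))
  })
  do.call(rbind, rows)
}

# Weekly windows starting 3/1, 3/8, and 3/15 (four breaks give three windows).
weeks  <- seq(as.POSIXct("2016-03-01", tz = "UTC"), by = "1 week", length.out = 4)
weekly <- pct_by_window(d, weeks)
print(weekly, row.names = FALSE)
```

In a local time zone with daylight saving, a window that spans a clock change is an hour shorter or longer, so totals can differ slightly from the UTC figures.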

I started solving this with brute-force coding (which is ugly...), then decided to ask here whether there is a more elegant way to do it.

======== Edit: adding an inelegant brute-force solution below ========

library(plyr) # provides ddply(), used at the end of the function

time_analysis <- function(df, starttime, endtime) {
  # - assumes sorted by date

  startindex <- sum(df$date <= starttime) # find the index of the entry which contains the start time
  endindex <- sum(df$date <= endtime) + 1 # find the index of the entry which contains the end time

  if ( (startindex == 0) || (endindex > nrow(df) ) ) {
    print("Date outside of available data")
    return(NULL)
  }

  df2 <- df[ startindex:endindex, ] # subset the dataframe to include the range, but still need to trim ends

  df2$date[1] <- starttime # trim to the start time
  df2$date[nrow(df2)] <- endtime # trim back the end time
  df2$status[nrow(df2)] <- df2$status[nrow(df2)-1] # status hasn't changed yet, so still the previous status

  duration <- diff(df2$date) # vector of the time within each segment, 1 fewer elements than the dataframe
  units(duration) <- 'days'
  duration <- as.numeric(duration) # need to convert to numeric, or else can't divide by total duration

  df2 <- df2[ -nrow(df2), ] # remove the last row, to make length same as the duration vector
  df2$duration <- duration # add the duration column

  total <- sum(df2$duration) # to allow calculations within the ddply
  return(ddply(df2[, c('status','duration')], 'status', function(x) { # calculate by each status category
    return( c(
      date = starttime,
      totaldays = round(sum(x$duration), 2),
      fraction = round(sum(x$duration) / total, 3)) )
  } ))
}
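The index lookup `sum(df$date <= starttime)` in the function above counts how many timestamps fall at or before the query time. Base R's `findInterval()` gives the same count for a sorted vector and accepts several query times at once. A minimal sketch, using the first three rows of the example data with an assumed UTC time zone and illustrative query times:

```r
# First three rows of the example data; the UTC time zone is an assumption.
dates <- as.POSIXct(c("2016-02-20 09:11:00", "2016-03-06 02:38:00",
                      "2016-03-10 15:20:00"), tz = "UTC")
status <- c("a", "c", "b")

# Illustrative query times (starts of two weekly windows).
query <- as.POSIXct(c("2016-03-01", "2016-03-08"), tz = "UTC")

# Row in effect at each query time: the count of timestamps <= query.
idx_sum <- sapply(seq_along(query), function(i) sum(dates <= query[i]))
idx_fi  <- findInterval(as.numeric(query), as.numeric(dates))

print(idx_fi)          # row indices, one per query time
print(status[idx_fi])  # status in effect at each query time
```

`findInterval()` returns 0 for a query time before the first timestamp, which corresponds to the `startindex == 0` check in the function.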

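Below is an example use that splits the report into chunks of roughly 2 weeks. I hate hand-coding dates and using loops in R, but I am not experienced enough to know a better way.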

times <- c("2016-03-01","2016-03-15","2016-04-01","2016-04-15","2016-05-01","2016-05-15")
result <- data.frame()
for (i in 1:(length(times) - 1)) {
  result <- rbind( result, time_analysis(d, times[i], times[i+1]) )
}
print(result, row.names = FALSE)

This yields (apart from some errors for dates outside the available range):

status       date totaldays fraction
     a 2016-03-01      5.71    0.409
     b 2016-03-01      0.81    0.058
     c 2016-03-01      7.43    0.532
     a 2016-03-15      5.47    0.322
     b 2016-03-15      2.25    0.132
     c 2016-03-15      9.28    0.546

===== After posting, I found a better way to generate the times:

times <- as.character( seq( as.Date("2016-03-01"), as.Date("2016-05-15"), by = '2 weeks' ) )
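As a small sketch of how the generated breaks could feed the analysis without an explicit loop, consecutive breaks can be paired into (start, end) windows with `head()` and `tail()`. The `windows` data frame is a name introduced here for illustration. Note that `by = '2 weeks'` steps exactly 14 days, so the breaks do not land on the 1st and 15th of each month as the hand-written vector did.

```r
# Breaks every 14 days, as in the line above.
times <- as.character(seq(as.Date("2016-03-01"), as.Date("2016-05-15"),
                          by = "2 weeks"))

# Pair each break with the next one to get (start, end) windows.
windows <- data.frame(start = head(times, -1), end = tail(times, -1),
                      stringsAsFactors = FALSE)
print(windows)

# With time_analysis() and d defined as in the question, the loop could
# then be written as (untested against the original data frame):
# result <- do.call(rbind, Map(function(s, e) time_analysis(d, s, e),
#                              windows$start, windows$end))
```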

1 answer:

Answer 0 (score: 0)

Here is an approach that combines the cut.POSIXt() S3 method with a nested data.table aggregation.

## define data
library(data.table);
dt <- data.table(date=as.POSIXct(c('2016-02-20 09:11:00','2016-03-06 02:38:00','2016-03-10 15:20:00','2016-03-10 21:20:00','2016-03-11 11:51:00','2016-03-12 01:19:00','2016-03-22 14:39:00','2016-03-23 11:37:00','2016-03-25 17:38:00','2016-03-26 01:24:00','2016-03-26 12:40:00','2016-04-12 10:28:00')),status=c('a','c','b','a','b','c','c','b','c','c','a','c'));

## solution
dt[,{ n1 <- .N; .SD[,.(pct=.N/n1*100),.(status)]; },.(month=cut(date,'month'))];
##         month status pct
## 1: 2016-02-01      a 100
## 2: 2016-03-01      c  50
## 3: 2016-03-01      b  30
## 4: 2016-03-01      a  20
## 5: 2016-04-01      c 100
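For readers without data.table, the same grouping can be sketched in base R with `cut()`, `table()`, and `prop.table()`. Like the `.N`-based aggregation above, this counts rows, so the percentages are shares of the timestamps recorded in each month rather than shares of elapsed time. The variable names and the UTC time zone are assumptions for illustration.

```r
# Example data from the question; the UTC time zone is an assumption.
stamp <- as.POSIXct(c(
  "2016-02-20 09:11:00", "2016-03-06 02:38:00", "2016-03-10 15:20:00",
  "2016-03-10 21:20:00", "2016-03-11 11:51:00", "2016-03-12 01:19:00",
  "2016-03-22 14:39:00", "2016-03-23 11:37:00", "2016-03-25 17:38:00",
  "2016-03-26 01:24:00", "2016-03-26 12:40:00", "2016-04-12 10:28:00"
), tz = "UTC")
status <- c("a", "c", "b", "a", "b", "c", "c", "b", "c", "c", "a", "c")

# cut() dispatches to cut.POSIXt and labels each row with its month.
month <- cut(stamp, "month")

# Row counts per month and status, converted to row-wise percentages.
counts <- table(month, status)
pct    <- 100 * prop.table(counts, margin = 1)
print(round(pct, 1))
```

Weighting by duration instead would mean summing the length of each interval per status, rather than counting rows.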