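I have a data frame of timestamps, each marking a categorical status. A status stays in effect until the next timestamp, at which point the category may change.
I would like to determine the percentage of time spent in each category over regular periods, such as monthly, quarterly, or yearly.
This seems like a common problem, but I haven't been able to find an elegant solution or a library that handles it.
For example, given the following sample data frame: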
date status
2016-02-20 09:11:00 a
2016-03-06 02:38:00 c
2016-03-10 15:20:00 b
2016-03-10 21:20:00 a
2016-03-11 11:51:00 b
2016-03-12 01:19:00 c
2016-03-22 14:39:00 c
2016-03-23 11:37:00 b
2016-03-25 17:38:00 c
2016-03-26 01:24:00 c
2016-03-26 12:40:00 a
2016-04-12 10:28:00 c
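...I might want to report, for each of the weeks 3/1-3/7, 3/8-3/14, and 3/15-3/21, the percentage of time spent in the 'a', 'b', and 'c' statuses.
I started brute-force coding a solution (it's ugly...), then decided to ask here whether there is a more elegant way to do this.
======== Edit: adding an inelegant brute-force solution below ========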
library(plyr)  # provides ddply()

time_analysis <- function(df, starttime, endtime) {
  # - assumes sorted by date
  startindex <- sum(df$date <= starttime)   # index of the entry that contains the start time
  endindex <- sum(df$date <= endtime) + 1   # index of the entry that contains the end time
  if ( (startindex == 0) || (endindex > nrow(df)) ) {
    print("Date outside of available data")
    return(NULL)
  }
  df2 <- df[ startindex:endindex, ]         # subset to the range; the ends still need trimming
  df2$date[1] <- starttime                  # trim to the start time
  df2$date[nrow(df2)] <- endtime            # trim back the end time
  df2$status[nrow(df2)] <- df2$status[nrow(df2)-1]  # status hasn't changed yet, so still the previous status
  duration <- diff(df2$date)                # time within each segment, 1 fewer element than the data frame
  units(duration) <- 'days'
  duration <- as.numeric(duration)          # convert to numeric so it can be divided by the total duration
  df2 <- df2[ -nrow(df2), ]                 # drop the last row so the length matches the duration vector
  df2$duration <- duration                  # add the duration column
  total <- sum(df2$duration)                # total duration, used inside the ddply
  return(ddply(df2[, c('status','duration')], 'status', function(x) {  # calculate for each status category
    return( c(
      date = starttime,
      totaldays = round(sum(x$duration), 2),
      fraction = round(sum(x$duration) / total, 3)) )
  } ))
}
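Below is an example use that breaks the report into chunks of roughly 2 weeks. I hate hand-coding the dates and using a loop in R, but I don't have the experience to know a better way.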
times <- c("2016-03-01","2016-03-15","2016-04-01","2016-04-15","2016-05-01","2016-05-15")
result <- data.frame()
for (i in 1:(length(times) - 1)) {
  result <- rbind( result, time_analysis(d, times[i], times[i+1]) )
}
print(result, row.names = FALSE)
This yields (apart from some errors for dates outside the available range):
status date totaldays fraction
a 2016-03-01 5.71 0.409
b 2016-03-01 0.81 0.058
c 2016-03-01 7.43 0.532
a 2016-03-15 5.47 0.322
b 2016-03-15 2.25 0.132
c 2016-03-15 9.28 0.546
===== After posting, I found a better way to generate the times:
times <- as.character( seq( as.Date("2016-03-01"), as.Date("2016-05-15"), by = '2 weeks' ) )
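The same seq() idea extends to the monthly and quarterly periods mentioned at the top. The following is only a small sketch; the variable names are illustrative and not part of the original code. One thing to keep in mind is that '2 weeks' steps are exactly 14 days apart, so they drift away from the 1st and 15th of each month used in the hand-coded vector.

```r
# Period boundaries generated with seq(); all variable names are illustrative.
biweekly  <- seq(as.Date("2016-03-01"), as.Date("2016-05-15"), by = "2 weeks")
monthly   <- seq(as.Date("2016-03-01"), by = "month",   length.out = 4)
quarterly <- seq(as.Date("2016-01-01"), by = "quarter", length.out = 5)

print(biweekly)   # steps of exactly 14 days
print(monthly)    # first day of consecutive months
print(quarterly)  # first day of consecutive quarters
```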
Answer 0 (score: 0)
Here's an approach that combines the cut.POSIXt() S3 method with a nested data.table aggregation.
## define data
library(data.table);
dt <- data.table(
  date=as.POSIXct(c(
    '2016-02-20 09:11:00','2016-03-06 02:38:00','2016-03-10 15:20:00','2016-03-10 21:20:00',
    '2016-03-11 11:51:00','2016-03-12 01:19:00','2016-03-22 14:39:00','2016-03-23 11:37:00',
    '2016-03-25 17:38:00','2016-03-26 01:24:00','2016-03-26 12:40:00','2016-04-12 10:28:00')),
  status=c('a','c','b','a','b','c','c','b','c','c','a','c'));
## solution
dt[,{ n1 <- .N; .SD[,.(pct=.N/n1*100),.(status)]; },.(month=cut(date,'month'))];
## month status pct
## 1: 2016-02-01 a 100
## 2: 2016-03-01 c 50
## 3: 2016-03-01 b 30
## 4: 2016-03-01 a 20
## 5: 2016-04-01 c 100
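Note that the aggregation above counts rows per month, so pct is the share of observations rather than the share of elapsed time. If time-weighted shares are what's wanted, one possible base-R sketch is to clip each status interval to each period and sum the overlaps. The helper name status_share and its arguments are hypothetical, and the sketch assumes the data are sorted by date and uses UTC to avoid daylight-saving shifts.

```r
# Hypothetical helper: time-weighted share of each status per period.
# Assumes `date` is sorted; each status holds until the next timestamp,
# so the final row (whose end is unknown) contributes no duration.
status_share <- function(date, status, breaks) {
  n     <- length(date)
  start <- as.numeric(date[-n])   # interval starts, seconds since epoch
  end   <- as.numeric(date[-1])   # interval ends
  held  <- status[-n]             # status in effect during each interval
  edges <- as.numeric(breaks)
  pieces <- lapply(seq_len(length(breaks) - 1), function(i) {
    # Overlap of every interval with [breaks[i], breaks[i + 1]), in days
    overlap <- pmax(0, pmin(end, edges[i + 1]) - pmax(start, edges[i])) / 86400
    days <- tapply(overlap, held, sum)
    data.frame(period = breaks[i],
               status = names(days),
               days   = as.numeric(days),
               # share of the covered time; NaN if the data do not reach this period
               pct    = 100 * as.numeric(days) / sum(days))
  })
  do.call(rbind, pieces)
}

date <- as.POSIXct(c(
  '2016-02-20 09:11:00','2016-03-06 02:38:00','2016-03-10 15:20:00','2016-03-10 21:20:00',
  '2016-03-11 11:51:00','2016-03-12 01:19:00','2016-03-22 14:39:00','2016-03-23 11:37:00',
  '2016-03-25 17:38:00','2016-03-26 01:24:00','2016-03-26 12:40:00','2016-04-12 10:28:00'),
  tz = 'UTC')
status <- c('a','c','b','a','b','c','c','b','c','c','a','c')

# Weekly boundaries: 3/1, 3/8, 3/15, 3/22
weeks <- seq(as.POSIXct('2016-03-01', tz = 'UTC'), by = '1 week', length.out = 4)
res <- status_share(date, status, weeks)
print(res, row.names = FALSE)
```

The percentages are relative to the time actually covered by the data inside each period, so a period that extends past the first or last timestamp is normalized by less than its full length.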