Question

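I have a data file with a time column and an output column. The output column consists of the values 1 and 2. For each run of the output column in which it takes the value 2, I want to calculate the total time elapsed during the run, i.e. the end time minus the start time. For example: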

    time          output       total
      2                2           4-2=2
      4                2
      6                1
      8                2           10-8=2
      10               2
      12               1
      14               1
      16               1
      18               2           22-18=4
      20               2
      22               2

Is there a simple way to do this for a large data frame?

Answer 1

It sounds like you want the time elapsed in each run of the output variable where that variable equals 2.

One approach is to use dplyr to group by runs, keep only the runs where output is 2, and then compute the elapsed time:

library(dplyr)
dat %>%
  group_by(run={x = rle(output) ; rep(seq_along(x$lengths), x$lengths)}) %>%
  filter(output == 2) %>%
  summarize(total=max(time)-min(time))
# Source: local data frame [3 x 2]
# 
#     run total
#   (int) (dbl)
# 1     1     2
# 2     3     2
# 3     5     4

This can also be done in base R using the rle function:

x <- rle(dat$output)
unname(tapply(dat$time, rep(seq_along(x$lengths), x$lengths), function(x) max(x)-min(x))[x$values == 2])
# [1] 2 2 4
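
Neither snippet above defines dat. The following is a minimal, self-contained sketch that assumes dat holds the example data from the question. It builds the run ids with cumsum instead of rle (a substitution made here for illustration, not part of the answer) and filters to output == 2 before aggregating:

```r
# Assumed example data, copied from the question
dat <- data.frame(time   = seq(2, 22, by = 2),
                  output = c(2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2))

# A new run starts wherever output differs from the previous row
run <- cumsum(c(TRUE, diff(dat$output) != 0))

# Keep only rows where output is 2, then take the time range within each run
is_two  <- dat$output == 2
elapsed <- tapply(dat$time[is_two], run[is_two], function(t) max(t) - min(t))
elapsed   # named by run id (1, 3, 5); values 2 2 4
```

Filtering before tapply avoids computing ranges for the runs of 1, which may matter on a large data frame.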

Answer 2

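Here is another way. I used rleid() to create a grouping variable named foo. For each group, I subtracted the first time value from the last time value, which gives total. Then I replaced every value of total with NA where output is not 2. Next, for each group, I assigned a vector consisting of the first value of total followed by NAs. Finally, I dropped the grouping variable.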

library(data.table)

mydf <- data.frame(time = c(2,4,6,8,10,12,14,16,18,20,22),
                   output = c(2,2,1,2,2,1,1,1,2,2,2))

setDT(mydf)[, foo := rleid(output)][,
    total := last(time) - first(time), by = "foo"][,
    total := replace(total, which(output !=2), NA)][,
    total := c(total[1L], rep(NA, .N - 1)), by = "foo"][, -3, with = FALSE][]

#    time output total
# 1:    2      2     2
# 2:    4      2    NA
# 3:    6      1    NA
# 4:    8      2     2
# 5:   10      2    NA
# 6:   12      1    NA
# 7:   14      1    NA
# 8:   16      1    NA
# 9:   18      2     4
#10:   20      2    NA
#11:   22      2    NA
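
For readers without data.table, the same layout (the elapsed time on the first row of each run of 2, NA elsewhere) can probably be reproduced in base R. This is only a sketch: the name ex is hypothetical, and like the answer above it assumes time is already sorted within each run, so that last minus first is the elapsed time.

```r
# Assumed example data (same values as mydf above), under a hypothetical name
ex <- data.frame(time   = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22),
                 output = c(2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2))

r      <- rle(ex$output)
ends   <- cumsum(r$lengths)        # row number of the last row of each run
starts <- ends - r$lengths + 1     # row number of the first row of each run
keep   <- r$values == 2            # runs where output is 2

# Fill total only on the first row of each run of 2; everything else stays NA
ex$total <- NA_real_
ex$total[starts[keep]] <- ex$time[ends[keep]] - ex$time[starts[keep]]
ex
```

Because it works on run boundaries rather than grouping rows, this version is fully vectorized.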

Answer 3

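As I understand it, you want to group by 'runs' of output?

First we need to index the runs. I wrote a function based on rle (I couldn't find an existing function that does this, but one may already exist).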

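# Assign a consecutive index to each run of identical values in x
indexer <- function(x) {
  run <- rle(x)$lengths       # length of each run
  rep(seq_along(run), run)    # repeat each run's index by its length
}

library(dplyr)

df$index <- indexer(df$output)

# Elapsed time within every run (runs of 1 as well as runs of 2)
df %>% group_by(index) %>% mutate(total = max(time) - min(time))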

    time output index total
1      2      2     1     2
2      4      2     1     2
3      6      1     2     0
4      8      2     3     2
5     10      2     3     2
6     12      1     4     4
7     14      1     4     4
8     16      1     4     4
9     18      2     5     4
10    20      2     5     4
11    22      2     5     4
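
Note that the result above also reports a total for the runs where output is 1, which the question did not ask for. A minimal base-R sketch of one way to blank those out, assuming df holds the question's example data (ave is used here in place of the dplyr pipeline, as an illustration only):

```r
# Assumed example data, copied from the question
df <- data.frame(time   = seq(2, 22, by = 2),
                 output = c(2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2))

# Run index: one integer per run of identical output values
r     <- rle(df$output)
index <- rep(seq_along(r$lengths), r$lengths)

# Time range within each run, repeated on every row of that run
span <- ave(df$time, index, FUN = function(t) max(t) - min(t))

# Keep the range only for runs where output is 2
df$total <- ifelse(df$output == 2, span, NA)
df
```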
