Reformat a data frame by moving per-group begin and end times into a new data frame, without a loop

Time: 2017-05-08 22:54:35

Tags: r for-loop dataframe dplyr

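I have working code that processes a data frame with the for loop below, but I would like to optimize it to run without the for loop if possible. I have searched for a while for something like this but must not know the right search terms. Thanks for your help.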

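The example data frame (a longer version is at the bottom) has a datetime column and a bottle column. The bottle column starts at some number (1 below), repeats for each sample added, then switches to 2, and so on up to bottle 7 (in this case). It then RESTARTS at 1 and goes to 14 (in this case), over and over. (Note that the real file has more than 2 rows per bottle.)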

datetime        bottle
6/9/2016 0:00   1
6/9/2016 0:15   1
6/9/2016 0:30   1
6/9/2016 0:45   1
6/9/2016 1:00   2
6/9/2016 1:15   2
6/9/2016 1:30   2
6/9/2016 1:45   3
6/9/2016 2:00   3
6/9/2016 2:15   4
6/9/2016 2:30   4
6/9/2016 2:45   5
6/9/2016 3:00   5
6/9/2016 3:15   6
6/9/2016 3:30   6
6/9/2016 3:45   7
6/9/2016 4:00   7
6/9/2016 4:15   7
6/9/2016 4:30   1
6/9/2016 4:45   1
6/9/2016 5:00   1
6/9/2016 5:15   2
6/9/2016 5:30   2
6/9/2016 5:45   2
6/9/2016 6:00   3
6/9/2016 6:15   3
6/9/2016 6:30   3
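In case it helps anyone who wants to try this, the short sample above can be rebuilt in base R roughly as follows. This is only a sketch: the name `sample.df` and the column name `bottle` are made up for illustration, and the UTC time zone is an assumption.

```r
# Rebuild the short sample: 27 readings, 15 minutes apart, starting at midnight.
# The time zone (UTC) is assumed; the printed sample does not state one.
datetime <- seq(as.POSIXct("2016-06-09 00:00", tz = "UTC"),
                by = "15 min", length.out = 27)

# Bottle numbers in order of appearance, each repeated for its number of rows.
# Bottle 1, 2 and 3 appear twice because the sequence restarts after bottle 7.
bottle <- rep(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L),
              times = c(4, 3, 2, 2, 2, 2, 3, 3, 3, 3))

# sample.df is a hypothetical name, not the object used in my real code
sample.df <- data.frame(datetime = datetime, bottle = bottle)
str(sample.df)
```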

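I need to create a new data frame with the begin and end time of each bottle. Note that each bottle sequence repeats, so the same bottle number appears more than once.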

  bottle begin         end
    1   6/9/2016 0:00   6/9/2016 0:45
    2   6/9/2016 1:00   6/9/2016 1:30
    3   6/9/2016 1:45   6/9/2016 2:00
    4   6/9/2016 2:15   6/9/2016 2:30
    5   6/9/2016 2:45   6/9/2016 3:00
    6   6/9/2016 3:15   6/9/2016 3:30
    7   6/9/2016 3:45   6/9/2016 4:15
    1   6/9/2016 4:30   6/9/2016 5:00
    2   6/9/2016 5:15   6/9/2016 5:45
    3   6/9/2016 6:00   6/9/2016 6:30

What I have so far is the commented code below. It works fine, but it takes a long time on the full data frame.

library(data.table)
library(dplyr)

# create an id number for each bottle using data.table
setDT(t2s_bottle_timing.df)[, id := .GRP, by = t2s_bottle]

#declare/set variables
x1 <- 1
x2 <- 1
x3 <- 1
i <- 1
N <- length(t2s_bottle_timing.df$t2s_bottle)

#renumber id column to have unique id for each bottle run
for (i in 2:(N-1)) {
  x1 <- t2s_bottle_timing.df[(i) , 2] #load bottle numbers
  x2 <- t2s_bottle_timing.df[(i+1) , 2] #load bottle numbers
  if (x2 == x1)  {   t2s_bottle_timing.df[(i),3] <- x3 } #set id number
  if (x2 != x1)  {   x3 <- x3 +1} #increment id number
  t2s_bottle_timing.df[(i+1),3] <- x3 #load new id number into table
}

# get rid of unused stuff
rm(x1, x2, i, N, x3)

# summarise the raw data frame to produce the bottle, begin, end data frame
# (the pipe must end a line, not start one, or R stops parsing too early)
t2s_timing_output.df <- t2s_bottle_timing.df %>%
  group_by(id, t2s_bottle) %>%
  summarize(
    begin = min(datetime),
    end = max(datetime))

So this works, but I am eager to learn another, more efficient way to do it.

t2s_bottle_timing.df <- structure(list(datetime = structure(c(1465514100, 1465515000, 
1465515900, 1465516800, 1465517700, 1465518600, 1465519500, 1465520400, 
1465521300, 1465522200, 1465523100, 1465524000, 1465524900, 1465525800, 
1465526700, 1465527600, 1465528500, 1465529400, 1465530300, 1465531200, 
1465532100, 1465533000, 1465533900, 1465534800, 1465535700, 1465536600, 
1465537500, 1465538400, 1465539300, 1465540200, 1465541100, 1465542000, 
1465542900, 1465543800, 1465544700, 1465545600, 1465546500, 1465547400, 
1465548300, 1465549200, 1465550100, 1465551000, 1465551900, 1465552800, 
1465553700, 1465554600, 1465555500, 1465556400, 1465557300, 1465558200, 
1465559100, 1465560000, 1465560900, 1465561800, 1465562700, 1465563600, 
1465564500, 1465565400, 1465566300, 1465567200, 1465568100, 1465569000, 
1465569900, 1465570800, 1465571700, 1465572600, 1465573500, 1465574400, 
1465575300, 1465576200, 1465577100, 1465578000, 1465578900, 1465579800, 
1465580700, 1465581600, 1465582500, 1465583400, 1465584300, 1465585200, 
1465586100, 1465587000, 1465587900, 1465588800, 1465589700, 1465590600, 
1465591500), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    t2s_bottle = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 
    4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 7L, 
    7L, 7L, 7L, 7L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
    6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L)), .Names = c("datetime", "t2s_bottle"), row.names = c(NA, 
-87L), spec = structure(list(cols = structure(list(datetime = structure(list(), class = c("collector_character", 
"collector")), t2s_bottle = structure(list(), class = c("collector_integer", 
"collector"))), .Names = c("datetime", "t2s_bottle")), default = structure(list(), class = c("collector_guess", 
"collector"))), .Names = c("cols", "default"), class = "col_spec"), class = c("tbl_df", 
"tbl", "data.frame"))

1 Answer:

Answer 0: (score: 2)

Your example confuses me a bit, but if what you want is to create an index, cumsum on a logical vector may help:

t2s_bottle_timing.df %>%
  mutate(index = cumsum(t2s_bottle != dplyr::lag(t2s_bottle, default = 0))) %>%
  group_by(index, t2s_bottle) %>%
  summarise(begin = min(datetime), end = max(datetime))

   index t2s_bottle               begin                 end
   <int>      <int>              <dttm>              <dttm>
1      1          1 2016-06-09 23:15:00 2016-06-10 00:15:00
2      2          2 2016-06-10 00:30:00 2016-06-10 02:15:00
3      3          3 2016-06-10 02:30:00 2016-06-10 04:30:00
4      4          4 2016-06-10 04:45:00 2016-06-10 06:00:00
5      5          5 2016-06-10 06:15:00 2016-06-10 07:45:00
6      6          6 2016-06-10 08:00:00 2016-06-10 09:00:00
7      7          7 2016-06-10 09:15:00 2016-06-10 10:15:00
8      8          1 2016-06-10 10:30:00 2016-06-10 11:15:00
9      9          2 2016-06-10 11:30:00 2016-06-10 13:00:00
10    10          3 2016-06-10 13:15:00 2016-06-10 13:30:00
11    11          4 2016-06-10 13:45:00 2016-06-10 15:00:00
12    12          5 2016-06-10 15:15:00 2016-06-10 15:45:00
13    13          6 2016-06-10 16:00:00 2016-06-10 17:15:00
14    14          7 2016-06-10 17:30:00 2016-06-10 18:45:00
15    15          8 2016-06-10 19:00:00 2016-06-10 20:45:00
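
To show why the cumsum trick works, here is a small base R sketch on made-up toy data (the vectors `datetime` and `bottle` below are illustrative, not your real columns). It assumes the rows are already sorted by time and that no bottle is numbered 0, since 0 stands in for "no previous row". The second half gets the same runs from `rle()`, which may be worth trying if you would rather avoid grouping altogether.

```r
# Toy data: bottle numbers restart at 1, as in the question (values are made up)
datetime <- seq(as.POSIXct("2016-06-09 00:00", tz = "UTC"),
                by = "15 min", length.out = 9)
bottle <- c(1L, 1L, 2L, 2L, 2L, 3L, 1L, 1L, 2L)

# TRUE wherever the bottle differs from the previous row; the first row is
# compared against 0, mirroring lag(..., default = 0)
changed <- bottle != c(0L, head(bottle, -1))

# Summing the TRUEs gives a counter that steps up by one at every change,
# so a restart at bottle 1 gets a fresh id instead of merging with the first run
run_id <- cumsum(changed)
print(run_id)

# Alternative: rle() returns the value and length of each run directly
runs      <- rle(bottle)
last_row  <- cumsum(runs$lengths)            # index of the last row of each run
first_row <- last_row - runs$lengths + 1L    # index of the first row of each run

result <- data.frame(bottle = runs$values,
                     begin  = datetime[first_row],
                     end    = datetime[last_row])
print(result)
```

Both versions rely on the data being in time order: the first and last row of each run are taken as its begin and end, so unsorted input would need sorting by datetime first.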