Summarizing data across multiple groups in a time series

Date: 2019-06-23 22:03:34

Tags: r dplyr tidyverse posixct mutate

I have a series of observations of birds at different sites and times. The data frame looks like this:

birdID   site          ts
1       A          2013-04-15 09:29
1       A          2013-04-19 01:22
1       A          2013-04-20 23:13
1       A          2013-04-22 00:03
1       B          2013-04-22 14:02
1       B          2013-04-22 17:02
1       C          2013-04-22 14:04
1       C          2013-04-22 15:18
1       C          2013-04-23 00:54
1       A          2013-04-23 01:20
1       A          2013-04-24 23:07
1       A          2013-04-30 23:47
1       B          2013-04-30 03:51
1       B          2013-04-30 04:26
2       C          2013-04-30 04:29
2       C          2013-04-30 18:49
2       A          2013-05-01 01:03
2       A          2013-05-01 23:15
2       A          2013-05-02 00:09
2       C          2013-05-03 07:57
2       C          2013-05-04 07:21
2       C          2013-05-05 02:54
2       A          2013-05-05 03:27
2       A          2013-05-14 00:16
2       D          2013-05-14 10:00
2       D          2013-05-14 15:00

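I want to summarize the data to show the first and last detection of each bird at each site, and the duration at each site, while preserving information about multiple visits to a site (i.e. if a bird moved between sites A > B > C > A > B, I want to show each visit to sites A and B independently, not lump the two visits together).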

I was hoping to produce output like this, where the start (min_ts), end (max_ts), and duration (days) of each visit are preserved:

birdID  site      min_ts                max_ts          days
1      A      2013-04-15 09:29    2013-04-22 00:03  6.6
1      B      2013-04-22 14:02    2013-04-22 17:02  0.1
1      C      2013-04-22 14:04    2013-04-23 00:54  0.5
1      A      2013-04-23 01:20    2013-04-30 23:47  7.9
1      B      2013-04-30 03:51    2013-04-30 04:26  0.02
2      C      2013-04-30 4:29     2013-04-30 18:49  0.6
2      A      2013-05-01 01:03    2013-05-02 00:09  0.96
2      C      2013-05-03 07:57    2013-05-05 02:54  1.8
2      A      2013-05-05 03:27    2013-05-14 00:16  8.8
2      D      2013-05-14 10:00    2013-05-14 15:00  0.2

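I tried this code, which produces the correct variables but lumps all the information about a single site together instead of preserving the separate visits: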

df <- df %>%
  group_by(birdID, site) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  arrange(birdID, min_ts)

birdID  site    min_ts               max_ts            days
1   A   2013-04-15 09:29   2013-04-30 23:47    15.6
1   B   2013-04-22 14:02   2013-04-30 4:26     7.6
1   C   2013-04-22 14:04   2013-04-23 0:54     0.5
2   C   2013-04-30 04:29   2013-05-05 2:54     4.9
2   A   2013-05-01 01:03   2013-05-14 0:16     12.9
2   D   2013-05-14 10:00   2013-05-14 15:00    0.2

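I realize that grouping by site is the problem, but if I remove it as a grouping variable, the summarized data no longer contains the site information. I have also tried the following. It does not run, but I think it is close to a solution: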

df <- df %>% 
   group_by(birdID) %>% 
   summarize(min_ts = if_else((birdID == lag(birdID) & site != lag(site)), min(ts), NA_real_), 
             max_ts = if_else((birdID == lag(birdID) & site != lag(site)), max(ts), NA_real_), 
            min_d = min(yday(ts)),
            max_d = max(yday(ts)),
            days = max_d - min_d)) 

2 answers:

Answer 0: (score: 5)

One possibility is:

df %>%
  group_by(birdID, site, rleid = with(rle(site), rep(seq_along(lengths), lengths))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)

   birdID site  min_ts              max_ts              days
    <int> <chr> <dttm>              <dttm>              <drtn>
 1      1 A     2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
 2      1 B     2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
 3      1 C     2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
 4      1 A     2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
 5      1 B     2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
 6      2 C     2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
 7      2 A     2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
 8      2 C     2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
 9      2 A     2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
10      2 D     2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days

This creates an rleid()-like grouping variable, which numbers each run of consecutive identical site values, and then computes the difference.
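To see what the with(rle(site), ...) expression produces, here is a small base R sketch on a made-up site vector (the values are illustrative and not taken from the data above):

```r
# Hypothetical site sequence for a single bird
site <- c("A", "A", "B", "C", "C", "A")

# rle() compresses the vector into runs of identical consecutive values
runs <- rle(site)
runs$values   # "A" "B" "C" "A"
runs$lengths  # 2 1 2 1

# Repeat a run counter once per element of each run,
# which gives one id per uninterrupted visit
run_id <- with(runs, rep(seq_along(lengths), lengths))
run_id        # 1 1 2 3 3 4
```

The second visit to A gets a different id (4) than the first (1), which is what keeps the two visits apart once the id is used as a grouping variable.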

The same can be done explicitly with rleid() from data.table, used in place of the rle() expression.
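A minimal sketch of the rleid() variant, assuming both dplyr and data.table are installed. The small df below is a reduced sample built for illustration (a few of bird 1's rows from the question), and rleid() is called with its namespace prefix so that loading data.table does not mask dplyr functions:

```r
library(dplyr)

# Reduced illustrative sample, sorted in observation order
df <- tibble(
  birdID = c(1L, 1L, 1L, 1L, 1L, 1L),
  site   = c("A", "A", "B", "B", "A", "A"),
  ts     = as.POSIXct(
    c("2013-04-15 09:29", "2013-04-22 00:03",
      "2013-04-22 14:02", "2013-04-22 17:02",
      "2013-04-23 01:20", "2013-04-30 23:47"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
)

visits <- df %>%
  # rleid() numbers each run of consecutive identical site values
  group_by(birdID, site, rleid = data.table::rleid(site)) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-rleid) %>%
  arrange(birdID, min_ts)

visits
```

On this sample the result has three rows (A, B, A), one per uninterrupted visit.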

Answer 1: (score: 1)

Another alternative is to use lag and cumsum to create the grouping variable.

library(dplyr)

df %>%
  group_by(birdID, group = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-group)

# A tibble: 10 x 4
#   birdID min_ts              max_ts              days           
#    <int> <dttm>              <dttm>              <drtn>         
# 1      1 2013-04-15 09:29:00 2013-04-22 00:03:00 6.60694444 days
# 2      1 2013-04-22 14:02:00 2013-04-22 17:02:00 0.12500000 days
# 3      1 2013-04-22 14:04:00 2013-04-23 00:54:00 0.45138889 days
# 4      1 2013-04-23 01:20:00 2013-04-30 23:47:00 7.93541667 days
# 5      1 2013-04-30 03:51:00 2013-04-30 04:26:00 0.02430556 days
# 6      2 2013-04-30 04:29:00 2013-04-30 18:49:00 0.59722222 days
# 7      2 2013-05-01 01:03:00 2013-05-02 00:09:00 0.96250000 days
# 8      2 2013-05-03 07:57:00 2013-05-05 02:54:00 1.78958333 days
# 9      2 2013-05-05 03:27:00 2013-05-14 00:16:00 8.86736111 days
#10      2 2013-05-14 10:00:00 2013-05-14 15:00:00 0.20833333 days
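The output above has no site column, because site is not among the grouping variables. If site should be kept, as in the desired output in the question, one option is to add it to group_by(). The sketch below is a variation on this answer, not the answerer's own code. The small df is a reduced illustrative sample and the column name visit is an arbitrary choice:

```r
library(dplyr)

# Reduced illustrative sample, sorted in observation order
df <- tibble(
  birdID = c(1L, 1L, 1L, 1L, 1L, 1L),
  site   = c("A", "A", "B", "B", "A", "A"),
  ts     = as.POSIXct(
    c("2013-04-15 09:29", "2013-04-22 00:03",
      "2013-04-22 14:02", "2013-04-22 17:02",
      "2013-04-23 01:20", "2013-04-30 23:47"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  )
)

visits <- df %>%
  # visit increases by 1 every time site differs from the previous row
  group_by(birdID, site,
           visit = cumsum(site != lag(site, default = first(site)))) %>%
  summarise(min_ts = min(ts),
            max_ts = max(ts),
            days = difftime(max_ts, min_ts, units = "days")) %>%
  ungroup() %>%
  select(-visit) %>%
  arrange(birdID, min_ts)

visits
```

Because summarise() returns rows in grouping-key order once site is a grouping variable, the final arrange(birdID, min_ts) restores the chronological order of visits.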