如何根据行之间的时间间隔以及与其他变量相关的其他两个条件选择DF的特定行

时间:2019-04-30 08:45:32

标签: r dplyr lubridate

我有一个数据框df,该数据框总结了特定动物物种个体的观察结果。 DateTime列告诉您何时看到动物,Observer列见过动物,Animal列告诉您哪个特定的人(可以被识别)。

df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")

df
              DateTime Observer Animal
1  2016-08-01 12:04:07    Peter      1
2  2016-08-01 12:06:07   Sophie      2
3  2016-08-01 12:06:58    Peter      1
4  2016-08-01 13:12:12    Peter      1
5  2016-08-01 14:04:07   Sophie      2
6  2016-08-01 13:12:45   Sophie      1
7  2016-08-01 15:04:07    Peter      2
8  2016-08-01 17:13:16   Sophie      1
9  2016-08-01 17:21:16   Sophie      2
10 2016-08-01 17:21:34   Sophie      1
11 2016-08-01 17:23:42    Peter      1
12 2016-08-01 17:27:16    Peter      2
13 2016-08-01 17:27:22    Peter      2
14 2016-08-01 17:28:01   Andreu      2
15 2016-08-01 17:29:28   Sophie      1
16 2016-08-01 17:28:08     Anna      2
17 2016-08-01 17:28:15    Peter      2

由于计算动物的方法不同,同一个人在不到60秒的时间内看不到同一个人,但是另一个人看到了。

出于特定的目的,我需要创建一个df,每次有人看到一个特定的人时,我都会在接下来60秒内观察其他人的行(如果同一人看到的是相同的人)在不到60秒的时间内我直接删除了该行。我们可以在第12和13行中看到此示例),但是我在列Other_observers中添加了这些已删除行的信息,该行汇总了其他看到了这只动物,还有Who,它概括了它们的名字。

我想得到的是这样的:

df
              DateTime Observer Ind Other_observers         Who
1  2016-08-01 12:04:07    Peter   1               0          NA
2  2016-08-01 12:06:07   Sophie   2               0          NA
3  2016-08-01 12:06:58    Peter   1               0          NA
4  2016-08-01 13:12:12    Peter   1               1      Sophie
5  2016-08-01 14:04:07   Sophie   2               0          NA
6  2016-08-01 15:04:07    Peter   2               0          NA
7  2016-08-01 17:13:16   Sophie   1               0          NA
8  2016-08-01 17:21:16   Sophie   2               0          NA
9  2016-08-01 17:21:34   Sophie   1               0          NA
10 2016-08-01 17:23:42    Peter   1               0          NA
11 2016-08-01 17:27:16    Peter   2               2 Andreu Anna
12 2016-08-01 17:28:15    Peter   2               0          NA
13 2016-08-01 17:29:28   Sophie   1               0          NA

有人知道怎么做吗?

1 个答案:

答案 0 :(得分:1)

我再三考虑了一下,我认为我有一个(也很简单)的解决方案,它没有我们讨论的限制。我添加了一些其他观察来检查这种极端情况。

library(tidyverse)

df <- tribble(
  ~DateTime, ~Observer, ~Animal,
  "2016-08-01 12:04:07",   "Peter",       1,
  "2016-08-01 12:06:07",  "Sophie",       2,
  "2016-08-01 12:06:58",   "Peter",       1,
  "2016-08-01 13:12:12",   "Peter",       1,
  "2016-08-01 14:04:07",  "Sophie",       2,
  "2016-08-01 13:12:45",  "Sophie",       1,
  "2016-08-01 15:04:07",   "Peter",       2,
  "2016-08-01 17:13:16",  "Sophie",       1,
  "2016-08-01 17:21:16",  "Sophie",       2,
  "2016-08-01 17:21:34",  "Sophie",       1,
  "2016-08-01 17:23:42",   "Peter",       1,
  "2016-08-01 17:27:16",   "Peter",       2,
  "2016-08-01 17:27:22",   "Peter",       2,
  "2016-08-01 17:28:01",  "Andreu",       2,
  "2016-08-01 17:29:28",  "Sophie",       1,
  "2016-08-01 17:28:08",    "Anna",       2,
  "2016-08-01 17:28:15",   "Peter",       2,
  "2016-08-01 17:28:17",   "Peter",       2,
  "2016-08-01 17:28:21",   "Peter",       2,
  "2016-08-01 17:28:21",   "Anna",        2,
) %>%
  mutate(DateTime = as.POSIXct(DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC"))

min_diff = as.difftime(60, units = c("secs"))

cumsum_reset <- function(s, x, reset) {
  ns <- s + x
  if (ns > reset) return(0)
  ns
}

df_wrangled <- df %>%
  arrange(DateTime) %>%
  group_by(Animal) %>%
  mutate(
    # Time difference to laste observation of this animal
    Diff = replace_na(DateTime - lag(DateTime, 1), 0),
    # Cumulative time since first observation, resets to 0 when more than `min_diff`
    CumDiff = accumulate(Diff, cumsum_reset, reset = min_diff),
    # Group observations within the `min_diff` period
    ObsGroup = cumsum(CumDiff == 0)
  ) %>%
  group_by(ObsGroup, add = TRUE) %>%
  summarize(
    Other_observers = length(unique(Observer)) - 1,
    Who = paste(unique(setdiff(Observer, Observer[1])), collapse = " "),
    DateTime = DateTime[1],
    Observer = Observer[1]
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 13 x 6
#>    Animal ObsGroup Other_observers Who         DateTime            Observer
#>     <dbl>    <int>           <dbl> <chr>       <dttm>              <chr>   
#>  1      1        1               0 ""          2016-08-01 12:04:07 Peter   
#>  2      1        2               0 ""          2016-08-01 12:06:58 Peter   
#>  3      1        3               1 Sophie      2016-08-01 13:12:12 Peter   
#>  4      1        4               0 ""          2016-08-01 17:13:16 Sophie  
#>  5      1        5               0 ""          2016-08-01 17:21:34 Sophie  
#>  6      1        6               0 ""          2016-08-01 17:23:42 Peter   
#>  7      1        7               0 ""          2016-08-01 17:29:28 Sophie  
#>  8      2        1               0 ""          2016-08-01 12:06:07 Sophie  
#>  9      2        2               0 ""          2016-08-01 14:04:07 Sophie  
#> 10      2        3               0 ""          2016-08-01 15:04:07 Peter   
#> 11      2        4               0 ""          2016-08-01 17:21:16 Sophie  
#> 12      2        5               2 Andreu Anna 2016-08-01 17:27:16 Peter   
#> 13      2        6               1 Anna        2016-08-01 17:28:17 Peter

reprex package(v0.2.1)于2019-04-30创建

旧解决方案:

这是使用出色的fuzzyjoin package的一种解决方案。本质上,只要它们之间的距离小于min_dist,我就将它们加入自身。

这里有些棘手的情况我没有解决。例如,如果观察者记录了一个观察到的动物的观察结果,例如每30秒钟观察一次,持续5分钟,那么我相信,只要第一次观察到的时间小于1分钟,它们都会被过滤掉。这可能不是您想要的,但是我现在不确定如何解决这个问题。

library(tidyverse)
library(fuzzyjoin)
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
               Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
               Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")


min_diff = as.difftime(1, units = c("mins"))

df_wrangled <- df %>%
  as_tibble() %>%
  arrange(DateTime) %>%
  # Add a unique id for each observation
  mutate(id = 1:n()) %>%
  fuzzy_left_join(
    x = .,
    y = .,
    by = c("Animal", "DateTime"),
    match_fun = list(
      `==`,
      function(x, y) y - x < min_diff & y - x > 0
    )
  ) %>%
  # Remove observations that occured within `min_diff`
  filter(!(id.x %in% id.y)) %>%
  # Remove observations by same observer within `min_diff`
  filter(ifelse(is.na(Observer.y), TRUE, Observer.x != Observer.y)) %>%
  group_by(DateTime.x, Observer.x, Animal.x, id.x) %>%
  summarize(
    Other_observers = length(na.omit(Observer.y)),
    Who = paste(Observer.y, collapse = " ")
  ) %>%
  ungroup()

print(df_wrangled, n = Inf)
#> # A tibble: 12 x 6
#>    DateTime.x          Observer.x Animal.x  id.x Other_observers Who       
#>    <dttm>              <fct>         <dbl> <int>           <int> <chr>     
#>  1 2016-08-01 12:04:07 Peter             1     1               0 NA        
#>  2 2016-08-01 12:06:07 Sophie            2     2               0 NA        
#>  3 2016-08-01 12:06:58 Peter             1     3               0 NA        
#>  4 2016-08-01 13:12:12 Peter             1     4               1 Sophie    
#>  5 2016-08-01 14:04:07 Sophie            2     6               0 NA        
#>  6 2016-08-01 15:04:07 Peter             2     7               0 NA        
#>  7 2016-08-01 17:13:16 Sophie            1     8               0 NA        
#>  8 2016-08-01 17:21:16 Sophie            2     9               0 NA        
#>  9 2016-08-01 17:21:34 Sophie            1    10               0 NA        
#> 10 2016-08-01 17:23:42 Peter             1    11               0 NA        
#> 11 2016-08-01 17:27:16 Peter             2    12               2 Andreu An…
#> 12 2016-08-01 17:29:28 Sophie            1    17               0 NA

reprex package(v0.2.1)于2019-04-30创建