我有一个数据框df
,该数据框总结了特定动物物种个体的观察结果。 DateTime
列告诉您何时看到动物,Observer
列见过动物,Animal
列告诉您哪个特定的人(可以被识别)。
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")
df
DateTime Observer Animal
1 2016-08-01 12:04:07 Peter 1
2 2016-08-01 12:06:07 Sophie 2
3 2016-08-01 12:06:58 Peter 1
4 2016-08-01 13:12:12 Peter 1
5 2016-08-01 14:04:07 Sophie 2
6 2016-08-01 13:12:45 Sophie 1
7 2016-08-01 15:04:07 Peter 2
8 2016-08-01 17:13:16 Sophie 1
9 2016-08-01 17:21:16 Sophie 2
10 2016-08-01 17:21:34 Sophie 1
11 2016-08-01 17:23:42 Peter 1
12 2016-08-01 17:27:16 Peter 2
13 2016-08-01 17:27:22 Peter 2
14 2016-08-01 17:28:01 Andreu 2
15 2016-08-01 17:29:28 Sophie 1
16 2016-08-01 17:28:08 Anna 2
17 2016-08-01 17:28:15 Peter 2
由于计算动物的方法不同,同一个人在不到60秒的时间内看不到同一个人,但是另一个人看到了。
出于特定的目的,我需要创建一个df
,每次有人看到一个特定的人时,我都会在接下来60秒内观察其他人的行(如果同一人看到的是相同的人)在不到60秒的时间内我直接删除了该行。我们可以在第12和13行中看到此示例),但是我在列Other_observers
中添加了这些已删除行的信息,该行汇总了其他看到了这只动物,还有Who
,它概括了它们的名字。
我想得到的是这样的:
df
DateTime Observer Ind Other_observers Who
1 2016-08-01 12:04:07 Peter 1 0 NA
2 2016-08-01 12:06:07 Sophie 2 0 NA
3 2016-08-01 12:06:58 Peter 1 0 NA
4 2016-08-01 13:12:12 Peter 1 1 Sophie
5 2016-08-01 14:04:07 Sophie 2 0 NA
6 2016-08-01 15:04:07 Peter 2 0 NA
7 2016-08-01 17:13:16 Sophie 1 0 NA
8 2016-08-01 17:21:16 Sophie 2 0 NA
9 2016-08-01 17:21:34 Sophie 1 0 NA
10 2016-08-01 17:23:42 Peter 1 0 NA
11 2016-08-01 17:27:16 Peter 2 2 Andreu Anna
12 2016-08-01 17:28:15 Peter 2 0 NA
13 2016-08-01 17:29:28 Sophie 1 0 NA
有人知道怎么做吗?
答案 0 :(得分:1)
我再三考虑了一下,我认为我有一个(也很简单)的解决方案,它没有我们讨论的限制。我添加了一些其他观察来检查这种极端情况。
library(tidyverse)
df <- tribble(
~DateTime, ~Observer, ~Animal,
"2016-08-01 12:04:07", "Peter", 1,
"2016-08-01 12:06:07", "Sophie", 2,
"2016-08-01 12:06:58", "Peter", 1,
"2016-08-01 13:12:12", "Peter", 1,
"2016-08-01 14:04:07", "Sophie", 2,
"2016-08-01 13:12:45", "Sophie", 1,
"2016-08-01 15:04:07", "Peter", 2,
"2016-08-01 17:13:16", "Sophie", 1,
"2016-08-01 17:21:16", "Sophie", 2,
"2016-08-01 17:21:34", "Sophie", 1,
"2016-08-01 17:23:42", "Peter", 1,
"2016-08-01 17:27:16", "Peter", 2,
"2016-08-01 17:27:22", "Peter", 2,
"2016-08-01 17:28:01", "Andreu", 2,
"2016-08-01 17:29:28", "Sophie", 1,
"2016-08-01 17:28:08", "Anna", 2,
"2016-08-01 17:28:15", "Peter", 2,
"2016-08-01 17:28:17", "Peter", 2,
"2016-08-01 17:28:21", "Peter", 2,
"2016-08-01 17:28:21", "Anna", 2,
) %>%
mutate(DateTime = as.POSIXct(DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC"))
min_diff = as.difftime(60, units = c("secs"))
cumsum_reset <- function(s, x, reset) {
ns <- s + x
if (ns > reset) return(0)
ns
}
df_wrangled <- df %>%
arrange(DateTime) %>%
group_by(Animal) %>%
mutate(
# Time difference to laste observation of this animal
Diff = replace_na(DateTime - lag(DateTime, 1), 0),
# Cumulative time since first observation, resets to 0 when more than `min_diff`
CumDiff = accumulate(Diff, cumsum_reset, reset = min_diff),
# Group observations within the `min_diff` period
ObsGroup = cumsum(CumDiff == 0)
) %>%
group_by(ObsGroup, add = TRUE) %>%
summarize(
Other_observers = length(unique(Observer)) - 1,
Who = paste(unique(setdiff(Observer, Observer[1])), collapse = " "),
DateTime = DateTime[1],
Observer = Observer[1]
) %>%
ungroup()
print(df_wrangled, n = Inf)
#> # A tibble: 13 x 6
#> Animal ObsGroup Other_observers Who DateTime Observer
#> <dbl> <int> <dbl> <chr> <dttm> <chr>
#> 1 1 1 0 "" 2016-08-01 12:04:07 Peter
#> 2 1 2 0 "" 2016-08-01 12:06:58 Peter
#> 3 1 3 1 Sophie 2016-08-01 13:12:12 Peter
#> 4 1 4 0 "" 2016-08-01 17:13:16 Sophie
#> 5 1 5 0 "" 2016-08-01 17:21:34 Sophie
#> 6 1 6 0 "" 2016-08-01 17:23:42 Peter
#> 7 1 7 0 "" 2016-08-01 17:29:28 Sophie
#> 8 2 1 0 "" 2016-08-01 12:06:07 Sophie
#> 9 2 2 0 "" 2016-08-01 14:04:07 Sophie
#> 10 2 3 0 "" 2016-08-01 15:04:07 Peter
#> 11 2 4 0 "" 2016-08-01 17:21:16 Sophie
#> 12 2 5 2 Andreu Anna 2016-08-01 17:27:16 Peter
#> 13 2 6 1 Anna 2016-08-01 17:28:17 Peter
由reprex package(v0.2.1)于2019-04-30创建
旧解决方案:
这是使用出色的fuzzyjoin package的一种解决方案。本质上,只要它们之间的距离小于min_dist
,我就将它们加入自身。
这里有些棘手的情况我没有解决。例如,如果观察者记录了一个观察到的动物的观察结果,例如每30秒钟观察一次,持续5分钟,那么我相信,只要第一次观察到的时间小于1分钟,它们都会被过滤掉。这可能不是您想要的,但是我现在不确定如何解决这个问题。
library(tidyverse)
library(fuzzyjoin)
df<-data.frame(DateTime=c("2016-08-01 12:04:07","2016-08-01 12:06:07","2016-08-01 12:06:58","2016-08-01 13:12:12","2016-08-01 14:04:07","2016-08-01 13:12:45","2016-08-01 15:04:07","2016-08-01 17:13:16","2016-08-01 17:21:16","2016-08-01 17:21:34","2016-08-01 17:23:42","2016-08-01 17:27:16","2016-08-01 17:27:22","2016-08-01 17:28:01","2016-08-01 17:29:28","2016-08-01 17:28:08","2016-08-01 17:28:15"),
Observer=c("Peter","Sophie","Peter","Peter","Sophie","Sophie","Peter","Sophie","Sophie","Sophie","Peter","Peter","Peter","Andreu","Sophie","Anna","Peter"),
Animal=c(1,2,1,1,2,1,2,1,2,1,1,2,2,2,1,2,2))
df$DateTime<- as.POSIXct(df$DateTime, format= "%Y-%m-%d %H:%M:%S", tz= "UTC")
min_diff = as.difftime(1, units = c("mins"))
df_wrangled <- df %>%
as_tibble() %>%
arrange(DateTime) %>%
# Add a unique id for each observation
mutate(id = 1:n()) %>%
fuzzy_left_join(
x = .,
y = .,
by = c("Animal", "DateTime"),
match_fun = list(
`==`,
function(x, y) y - x < min_diff & y - x > 0
)
) %>%
# Remove observations that occured within `min_diff`
filter(!(id.x %in% id.y)) %>%
# Remove observations by same observer within `min_diff`
filter(ifelse(is.na(Observer.y), TRUE, Observer.x != Observer.y)) %>%
group_by(DateTime.x, Observer.x, Animal.x, id.x) %>%
summarize(
Other_observers = length(na.omit(Observer.y)),
Who = paste(Observer.y, collapse = " ")
) %>%
ungroup()
print(df_wrangled, n = Inf)
#> # A tibble: 12 x 6
#> DateTime.x Observer.x Animal.x id.x Other_observers Who
#> <dttm> <fct> <dbl> <int> <int> <chr>
#> 1 2016-08-01 12:04:07 Peter 1 1 0 NA
#> 2 2016-08-01 12:06:07 Sophie 2 2 0 NA
#> 3 2016-08-01 12:06:58 Peter 1 3 0 NA
#> 4 2016-08-01 13:12:12 Peter 1 4 1 Sophie
#> 5 2016-08-01 14:04:07 Sophie 2 6 0 NA
#> 6 2016-08-01 15:04:07 Peter 2 7 0 NA
#> 7 2016-08-01 17:13:16 Sophie 1 8 0 NA
#> 8 2016-08-01 17:21:16 Sophie 2 9 0 NA
#> 9 2016-08-01 17:21:34 Sophie 1 10 0 NA
#> 10 2016-08-01 17:23:42 Peter 1 11 0 NA
#> 11 2016-08-01 17:27:16 Peter 2 12 2 Andreu An…
#> 12 2016-08-01 17:29:28 Sophie 1 17 0 NA
由reprex package(v0.2.1)于2019-04-30创建