我有一个很大的数据框,其中包含每个月放置在不同位置的相机陷阱的相机陷阱观察结果。一种观察包括由一只动物触发的五张照片。 Excerpt of the dataframe
前20行中的 dput
:
>structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L,
>2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng",
"I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972,
1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982,
1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025,
1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705,
1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid",
"species", "time"), row.names = c(NA, 20L), class = "data.frame")
为了进行分析,我确定了连续观察之间的2分钟间隔被认为是独立的。为此,我为每个摄像机部署计算了两张连续照片之间的时间差。之后,我选择的所有时间相差大于2分钟。然后,我将数据框子集化为仅包含在所选时间拍摄的照片:
1)首先,我使用dplyr来计算到同一部署的上一张照片的时间间隔。对于每个部署的首次观察,我随机选择1000作为大于120的数字,因此稍后将它们包括在我的选择中。
library(dplyr)
deerobs_tbl<-tbl_df(Deerobs)
deerobs_gr<-group_by(deerobs_tbl,deploymentid)
deerobs_or<-arrange(deerobs_gr$time,.by_group = T)
deerobs_2<-mutate(deerobs_or,diff=c(1000,diff(time)))
deerobs2_df<-data.frame(deerobs_2)
2)我想dplyr也可以做到这一点,但是plyr更易于使用。我只用一个列来创建数据框,其中包含部署ID,时间和上一张图片的时间差。然后,我为每个部署选择了相距2分钟以上的时间,并选择了这些时间的所有行。
library (plyr)
deerobs_times<-data.frame(deerobs2_df$time,deerobs2_df$deploymentid,deerobs2_df$diff)
deerobs_times_apart<-ddply(deerobs_times,"deerobs2_df.deploymentid",subset,deerobs2_df.diff>120)
deerobs_t<-deerobs_times_apart[,1]
Deerobs_subset<-subset(deerobs2_df,deerobs2_df$time%in%deerobs_t)
唯一的问题是,这消除了远远超出必要的观察结果。照片的数量从9000多个减少到少于3000个。例如,如果十个观察以1.5分钟的间隔彼此跟随,则所有照片都将被删除,尽管五张相距超过两分钟。是否有可能避免这个问题并选择相距两分钟以上的所有观测值?
答案 0 :(得分:0)
如果数据集不是太大,则聚类是解决此问题的一种方法。
library(dplyr)
data <- structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng", "I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972, 1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982, 1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025, 1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705, 1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid", "species", "time"), row.names = c(NA, 20L), class = "data.frame")
data %>%
mutate(
# Create a numeric vector on minute scale
minutes = difftime(time, min(time), units = 'min') %>% as.numeric(),
# Cluster and group based on 2 minute height
time_group = cutree(hclust(dist(minutes)), h = 2)
) %>%
# Collapse the groups of images
group_by(deploymentid, species, time_group) %>%
summarise(n = n(), mean_time = mean(time))
# # A tibble: 3 x 5
# # Groups: deploymentid, species [?]
# deploymentid species time_group n mean_time
# <fct> <fct> <int> <int> <dttm>
# 1 B4-Wintergatter_Riedlhäng Rotwild 1 5 2018-03-13 00:30:25
# 2 I3-Wintergatter_Riedlhäng Rotwild 1 10 2018-03-13 00:29:37
# 3 I3-Wintergatter_Riedlhäng Rotwild 2 5 2018-03-13 21:15:05
答案 1 :(得分:0)
感谢@Eric,您的想法为我解决了很多问题。 所以这是最终的结果:
# Add a column "eventid", which is unique for each event
Deerobs$eventid<-as.factor(paste(Deerobs$Kamera_ID,Deerobs$time,sep='-'))
# Group the pictures by deployment and order them
library(dplyr)
deerobs_tbl<-tbl_df(Deerobs)
deerobs_gr<-group_by(deerobs_tbl,deploymentid)
deerobs_or<-arrange(deerobs_gr,deerobs_gr$time,.by_group = T)
# Add two minute time groups for each deployment
deerobs2<-deerobs_or%>%mutate(
minu=difftime(time, min(time), units = 'min') %>% as.numeric(),
time_group_minu = cutree(hclust(dist(minu)), h = 2))
# Add a unique ID for each time group
deerobs2$twomin_periodid<-as.factor(paste(deerobs2$Kamera_ID,deerobs2$time_group_minu,sep='-'))
# Select only the first eventid of each time group
deerobs_twominsub<-deerobs2[!duplicated(deerobs2$twomin_periodid),]
# Select all the rows with these event IDs
Deerobs_subset<-subset(deerobs2,deerobs2$eventid%in%deerobs_twominsub$eventid)