如何在R中选择时间间隔固定的行

时间:2018-08-15 15:05:05

标签: r time

我有一个很大的数据框,其中包含每个月放置在不同位置的相机陷阱的相机陷阱观察结果。一种观察包括由一只动物触发的五张照片。 Excerpt of the dataframe

enter image description here

前20行中的

dput

>structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L, 
>2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng",
"I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972, 
1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982, 
1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025, 
1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705, 
1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid", 
"species", "time"), row.names = c(NA, 20L), class = "data.frame")

为了进行分析,我确定了连续观察之间的2分钟间隔被认为是独立的。为此,我为每个摄像机部署计算了两张连续照片之间的时间差。之后,我选择的所有时间相差大于2分钟。然后,我将数据框子集化为仅包含在所选时间拍摄的照片:

1)首先,我使用dplyr来计算到同一部署的上一张照片的时间间隔。对于每个部署的首次观察,我随机选择1000作为大于120的数字,因此稍后将它们包括在我的选择中。

library(dplyr)
deerobs_tbl<-tbl_df(Deerobs)
deerobs_gr<-group_by(deerobs_tbl,deploymentid)
deerobs_or<-arrange(deerobs_gr$time,.by_group = T)
deerobs_2<-mutate(deerobs_or,diff=c(1000,diff(time)))
deerobs2_df<-data.frame(deerobs_2)

2)我想dplyr也可以做到这一点,但是plyr更易于使用。我只用一个列来创建数据框,其中包含部署ID,时间和上一张图片的时间差。然后,我为每个部署选择了相距2分钟以上的时间,并选择了这些时间的所有行。

library (plyr)
deerobs_times<-data.frame(deerobs2_df$time,deerobs2_df$deploymentid,deerobs2_df$diff)
deerobs_times_apart<-ddply(deerobs_times,"deerobs2_df.deploymentid",subset,deerobs2_df.diff>120)
deerobs_t<-deerobs_times_apart[,1]
Deerobs_subset<-subset(deerobs2_df,deerobs2_df$time%in%deerobs_t)

唯一的问题是,这消除了远远超出必要的观察结果。照片的数量从9000多个减少到少于3000个。例如,如果十个观察以1.5分钟的间隔彼此跟随,则所有照片都将被删除,尽管五张相距超过两分钟。是否有可能避免这个问题并选择相距两分钟以上的所有观测值?

2 个答案:

答案 0 :(得分:0)

如果数据集不是太大,则聚类是解决此问题的一种方法。

library(dplyr)

data <- structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng", "I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972, 1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982, 1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025, 1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705, 1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid", "species", "time"), row.names = c(NA, 20L), class = "data.frame")

data %>%
  mutate(
    # Create a numeric vector on minute scale
    minutes    = difftime(time, min(time), units = 'min') %>% as.numeric(),
    # Cluster and group based on 2 minute height
    time_group = cutree(hclust(dist(minutes)), h = 2)
  ) %>%
  # Collapse the groups of images
  group_by(deploymentid, species, time_group) %>%
  summarise(n = n(), mean_time = mean(time))

# # A tibble: 3 x 5
# # Groups:   deploymentid, species [?]
#   deploymentid              species time_group     n mean_time          
#   <fct>                     <fct>        <int> <int> <dttm>             
# 1 B4-Wintergatter_Riedlhäng Rotwild          1     5 2018-03-13 00:30:25
# 2 I3-Wintergatter_Riedlhäng Rotwild          1    10 2018-03-13 00:29:37
# 3 I3-Wintergatter_Riedlhäng Rotwild          2     5 2018-03-13 21:15:05

答案 1 :(得分:0)

感谢@Eric,您的想法为我解决了很多问题。 所以这是最终的结果:

# Add a column "eventid", which is unique for each event
   Deerobs$eventid<-as.factor(paste(Deerobs$Kamera_ID,Deerobs$time,sep='-'))

# Group the pictures by deployment and order them 
  library(dplyr)
  deerobs_tbl<-tbl_df(Deerobs)
  deerobs_gr<-group_by(deerobs_tbl,deploymentid)
  deerobs_or<-arrange(deerobs_gr,deerobs_gr$time,.by_group = T)

# Add two minute time groups for each deployment
  deerobs2<-deerobs_or%>%mutate(
  minu=difftime(time, min(time), units = 'min') %>% as.numeric(),
  time_group_minu = cutree(hclust(dist(minu)), h = 2))

# Add a unique ID for each time group
    deerobs2$twomin_periodid<-as.factor(paste(deerobs2$Kamera_ID,deerobs2$time_group_minu,sep='-'))
# Select only the first eventid of each time group
    deerobs_twominsub<-deerobs2[!duplicated(deerobs2$twomin_periodid),]
# Select all the rows with these event IDs
    Deerobs_subset<-subset(deerobs2,deerobs2$eventid%in%deerobs_twominsub$eventid)