Subset data based on date-time difference or window

Asked: 2013-12-19 13:59:37

Tags: r datetime diff subset

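My location data has date-time stamps. The locations were supposed to be collected at regular intervals, but that did not always happen. I need to extract the locations that fall within a time window, for example locations 12 hours apart. Starting from the date-time of location 1, find the next location 12 hours later. If there is none at exactly 12 hours, take the one closest to that new target time. Then, starting from that new location, find the next location 12 hours on. I have to do this for each unique ID.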

COLLAR_ID                    dt
2159    2006-01-27 13:02:55
2159    2006-01-27 14:01:12
2159    2006-01-27 15:01:04
2159    2006-01-27 16:01:09

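That is what the data look like, and here is a small piece of the data that you can cut and paste. Note that it is all the same ID; I have 5 different IDs with different start dates/times.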

structure(list(COLLAR_ID = c(2159L, 2159L, 2159L, 2159L, 2159L, 
2159L, 2159L, 2159L, 2159L, 2159L, 2159L, 2159L, 2159L, 2159L, 
2159L, 2159L, 2159L, 2159L, 2159L, 2159L), dt = structure(c(1138366975, 
1138370472, 1138374064, 1138377669, 1138381264, 1138384873, 1138388503, 
1138399312, 1138402842, 1138406507, 1138413700, 1138417261, 1138420848, 
1138424444, 1138428071, 1138431695, 1138435287, 1138438938, 1138442428, 
1138446098), class = c("POSIXct", "POSIXt"), tzone = "GMT")), .Names = c("COLLAR_ID", 
"dt"), class = "data.frame", row.names = c(NA, 20L))

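So, from the sample data, if my start date-time is 2006-01-27 00:00:00, the next location recorded should be at 12:00:00, but that location does not exist, so it should record 13:02:55. Even that, however, is 2 minutes "outside" a strict 1-hour buffer window.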

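I thought about converting the date-times to Julian decimal numbers to make them easier to work with, but I don't know how to do that. Rounding the date-times to the hour would work, except that sometimes there are 2 or 3 locations within a 1-hour interval, so I would somehow need to choose the 1 location that is "closest" to the original start.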
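A minimal sketch of the decimal conversion mentioned above, assuming the time stamps are already POSIXct in GMT as in the sample data. POSIXct values are stored as seconds since 1970-01-01, so dividing by 86400 gives decimal days; `julian()` from base R returns the same quantity. The variable names are illustrative only.

```r
# One fix from the sample data, stored as POSIXct in GMT
dt <- as.POSIXct("2006-01-27 13:02:55", tz = "GMT")

# Seconds since 1970-01-01 00:00:00 GMT
secs <- as.numeric(dt)

# Decimal days since the same origin (86400 seconds per day)
decimal.days <- secs / 86400

# julian() gives the same number of days as a difftime object
julian.days <- as.numeric(julian(dt))

# Elapsed hours between two fixes can also be taken directly with difftime()
later <- as.POSIXct("2006-01-28 02:01:40", tz = "GMT")
gap.hours <- as.numeric(difftime(later, dt, units = "hours"))
```

For comparing fixes, `difftime()` with explicit `units` is usually enough, so the conversion to decimal days may not be needed at all.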

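So, to add new details that probably make things more confusing: some of the data were originally collected at 1-hour intervals and then switched to 12 hours after 3 weeks. However, I don't know the programmed time at which each individual should have changed. Others started at 12 hours, with a start time of 00:00:00, but switched to 1-hour intervals and then back to 12 hours a few days later, and again I don't know on which day the switch happened. So the 12-hour schedule could start at, say, 2 pm.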

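I tried looking at this stack overflow conversation, but couldn't see how it would apply. So below is my attempt, which I have now updated since the original posting of the question. It does not work. I am still working on it... it still seems like rather clunky code.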

 test2 = test2[order(test2$COLLAR_ID,test2$dt),]
test2$dt <- as.POSIXct(strptime((test2$dt), "%Y-%m-%d %H:%M:%S"), tz="GMT")
MinInterval = 12 #minimum time interval (in hours) between consecutive locations
row = 0           # Keeps track of row within alldata
Endtest2 = 2                  #keeps track of row within individual within all data
SubData1 = test2[1,]
IDNames = levels(as.factor(test2$COLLAR_ID))
test22 = data.frame()

for (n in 1:length(IDNames)){
  IndivData = test2[test2$COLLAR_ID==IDNames[n],]
  row = row+1               #Continues to track next row between individuals
  Endtest2 = 2               #restarts counting the rows for NEXT individual
  SubData1[row,]=IndivData[1,]

  while (Endtest2<nrow(IndivData) ){
    timediff = difftime(IndivData$dt[Endtest2],SubData1$dt[row],units = "hours")

    if (timediff>MinInterval){          #If time difference is greater than MinInterval hours then do
      row = row+1
      SubData1=rbind(SubData1,IndivData[Endtest2,])
      Endtest2 = Endtest2+1                
    } else{
      Endtest2 = Endtest2+1
    }
  } #end while loop

} #end loop through individuals
test22 = SubData1

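I apologize and am embarrassed to say that I completely forgot I had posted a question about this (using similar code) a long time ago, but never got any solution. I gave up on the whole effort, but am now revisiting it with new (messier) data and new needs. The script does not filter out the right data.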
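For reference, the selection rule described above can be sketched in base R as follows. This is a minimal sketch under one reading of the rule, not a tested solution: it keeps the first fix of each collar and then every next fix that is at least 12 hours after the last kept one. The helper `thin_fixes` and the result names are hypothetical, and choosing the fix "closest" to the 12-hour target instead would need a different comparison.

```r
# Hypothetical helper: `secs` are sorted time stamps in seconds.
# Returns TRUE for the first fix and for each later fix that is at least
# `min.hours` after the previously kept fix.
thin_fixes <- function(secs, min.hours = 12) {
  keep <- logical(length(secs))
  if (length(secs) == 0) return(keep)
  keep[1] <- TRUE
  last <- secs[1]
  for (i in seq_along(secs)[-1]) {
    if (secs[i] - last >= min.hours * 3600) {
      keep[i] <- TRUE
      last <- secs[i]
    }
  }
  keep
}

# The 20 sample fixes for collar 2159 from the question
test2 <- data.frame(
  COLLAR_ID = 2159L,
  dt = as.POSIXct(
    c(1138366975, 1138370472, 1138374064, 1138377669, 1138381264,
      1138384873, 1138388503, 1138399312, 1138402842, 1138406507,
      1138413700, 1138417261, 1138420848, 1138424444, 1138428071,
      1138431695, 1138435287, 1138438938, 1138442428, 1138446098),
    origin = "1970-01-01", tz = "GMT"
  )
)

# Sort by collar and time, then apply the helper within each collar.
# ave() returns the per-collar results in the original row order.
test2 <- test2[order(test2$COLLAR_ID, test2$dt), ]
keep <- as.logical(ave(as.numeric(test2$dt), test2$COLLAR_ID, FUN = thin_fixes))
subset12 <- test2[keep, ]
```

On the sample this keeps the fixes at 2006-01-27 13:02:55 and 2006-01-28 02:01:40. Because each interval is measured from the last kept fix, the kept times drift later whenever no fix falls exactly on the 12-hour mark.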

2 answers:

Answer 0 (score: 2)

Using the nifty roll feature in data.table, you can get the time stamps closest to midnight/noon:

# Make data (hourly time stamps +- random noise with 30 min standard dev)

len <- 30  # number of 12 hour intervals
stamps <- seq(as.POSIXct("2013-12-01"), by="-1 hour", length.out=len*12) + rnorm(len*12, 0, 1800)
stamps.target <- seq(as.POSIXct("2013-12-01"), by="-12 hour", length.out=len)

# Use data table to join stamps.target (midnight/noon) to stamps (hourly w/ noise)

library(data.table)
dt.data <- data.table(stamps, closest.match=stamps, key="stamps")
dt.target <- data.table(stamps.target)
dt.data[dt.target, roll="nearest"]

#                    stamps       closest.match
#    1: 2013-12-01 00:00:00 2013-12-01 00:24:20
#    2: 2013-11-30 12:00:00 2013-11-30 11:57:10
#    3: 2013-11-30 00:00:00 2013-11-29 23:41:29
#    4: 2013-11-29 12:00:00 2013-11-29 11:39:32
#    5: 2013-11-29 00:00:00 2013-11-28 23:31:32
#   ....
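To make concrete what a "nearest" match selects, here is a base-R sketch run on a few fixes from the question's sample data. It is not how data.table implements the rolling join; it simply picks, for each target time, the fix with the smallest absolute time difference. The 60-minute buffer check and all variable names are assumptions added for illustration.

```r
# A few fixes from the question's sample (seconds since 1970-01-01, GMT)
fixes <- as.POSIXct(
  c(1138366975, 1138370472, 1138374064, 1138399312,
    1138402842, 1138406507, 1138413700, 1138417261),
  origin = "1970-01-01", tz = "GMT"
)

# Target times at noon and midnight
targets <- as.POSIXct(c("2006-01-27 12:00:00", "2006-01-28 00:00:00"), tz = "GMT")

# For each target, index of the fix with the smallest absolute difference
nearest.idx <- vapply(
  as.numeric(targets),
  function(t) which.min(abs(as.numeric(fixes) - t)),
  integer(1)
)

result <- data.frame(
  target = targets,
  closest.match = fixes[nearest.idx],
  gap.mins = abs(as.numeric(difftime(fixes[nearest.idx], targets, units = "mins")))
)

# Assumed tolerance: flag matches more than 60 minutes from the target
result$within.buffer <- result$gap.mins <= 60
```

The noon target matches the 13:02:55 fix, which falls just outside the 60-minute buffer, while the midnight target matches the fix at 2006-01-28 00:01:47.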

Edit: a solution that handles multiple collars

Although the following is quite a bit of code, most of it just generates the data. The actual work is really only the last three lines:

# Make data (hourly time stamps +- random noise with 30 min standard dev)

len <- 30  # number of 12 hour intervals
pets <- c("fido", "rosie", "felix")
start.date <- as.POSIXct("2013-12-01")

# Create random roughly 1 hour apart time stamps for
# our pets and store in data table.  

library(data.table)
stamps.data <- 
  do.call(
    rbind,
    lapply(
      pets,
      function(x) {
        data.table(
          pet=rep(x, len * 12), 
          stamp.join=seq(
            start.date, 
            by="-1 hour", 
            length.out=len*12
          ) + rnorm(len*12, 0, 1800)
  ) } ) )
# The above looks complicated, but just creates our
# data, a 2 column data table with roughly hourly time
# stamps for each pet: 
#         pet          stamp.join
#    1: rosie 2013-11-16 01:16:32
#    2:  fido 2013-11-16 01:24:28
#    3: felix 2013-11-16 01:24:40
#    4:  fido 2013-11-16 01:50:54
#    5: rosie 2013-11-16 02:33:49
#   ---                          
# 1076: felix 2013-11-30 22:50:22
# 1077: rosie 2013-11-30 23:10:52
# 1078: felix 2013-11-30 23:52:32
# 1079:  fido 2013-12-01 00:24:01
# 1080: rosie 2013-12-01 00:34:36   

# Now add a copy of stamp.join to the data table; necessary
# because we will lose the stamp.join column in the join

stamps.data[, closest.match:=stamp.join]

# Now, for each pet, create a data.table with the target
# times (CJ does a cartesian join of our pets and our target
# times vectors and returns a data table, this is necessary
# because we are doing a rolling join, if it was an exact
# join we wouldn't need to CJ with pets, could just use
# target stamps)

stamps.target <- CJ(pets, seq(as.POSIXct("2013-12-01"), by="-12 hour", length.out=len))
setkey(stamps.data, pet, stamp.join)  # join on pet and stamp.join

# Use data table to join stamps.target (midnight/noon) to stamps (hourly w/ noise)

stamps.data[stamps.target, roll="nearest"][order(stamp.join)]

#       pet          stamp.join       closest.match
#  1: felix 2013-11-16 12:00:00 2013-11-16 12:03:31
#  2:  fido 2013-11-16 12:00:00 2013-11-16 12:20:55
#  3: rosie 2013-11-16 12:00:00 2013-11-16 11:36:37
#  4: felix 2013-11-17 00:00:00 2013-11-17 00:01:48
#  5:  fido 2013-11-17 00:00:00 2013-11-17 00:12:11
#  6: rosie 2013-11-17 00:00:00 2013-11-16 23:47:56
#  ----

Answer 1 (score: 0)

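Using the data set you provided (I created an object called temp from your structure), this is what I came up with. This code creates a 12-hour marker for every observation and finally selects the first observation in each 12-hour window by dropping all observations after the first one.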

# create an xts object, I just find them easier to work with
library(xts)
xts_object<-xts(temp$COLLAR_ID, order.by=temp$dt)

# extract time and floor to 12 hours
time<-temp$dt
time_numeric<-as.numeric(time)
# 43200 is the number of seconds in 12 hours
floored_time<-c(floor(time_numeric/43200)*43200)
floored_time<-as.POSIXct(floored_time, origin="1970-01-01 00:00:00", tz="GMT")

# create a new xts object with the floored index
floored_xts_object<-xts(xts_object, order.by=floored_time)

# drop double time stamps, leaving just the first observation in those 12 hours
unique_xts_object<-make.index.unique(floored_xts_object, drop=TRUE)

Feel free to experiment with ceiling or round instead of floor. Hope this helps.
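As a quick illustration of how the three choices differ, here is a minimal base-R sketch applied to the first fix in the sample data. The helper `to_time` is a hypothetical convenience wrapper, and GMT is assumed because the sample data carries that time zone.

```r
# First fix from the question's sample data
fix <- as.POSIXct("2006-01-27 13:02:55", tz = "GMT")
bin <- 12 * 60 * 60  # 43200 seconds in 12 hours

# Hypothetical helper: seconds since 1970-01-01 back to POSIXct in GMT
to_time <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "GMT")

floored <- to_time(floor(as.numeric(fix) / bin) * bin)    # start of the 12-hour window
ceiled  <- to_time(ceiling(as.numeric(fix) / bin) * bin)  # end of the 12-hour window
rounded <- to_time(round(as.numeric(fix) / bin) * bin)    # nearest 12-hour mark

format(c(floored, ceiled, rounded), "%Y-%m-%d %H:%M:%S")
```

Here floor and round both give 2006-01-27 12:00:00, while ceiling gives 2006-01-28 00:00:00. With round, fixes up to 6 hours either side of a mark share a label, so the windows are centred on noon and midnight rather than starting there.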

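I have added some code that selects, for each unique 12-hour mark, the time stamp with the smallest time difference to it, keeping the original time stamps. It returns a POSIXct object of time stamps spaced about 12 hours apart.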

# make floored times unique
unique_time<-unique(floored_time)

# use difftime in lapply to get time differences for each unique time to all time stamps
time_diffrences<-lapply(unique_time, difftime, time)
small<-lapply(time_diffrences, abs)
small<-as.data.frame(small)
names(small)<-NULL

# get back into an xts object of time differences
small<-xts(small, order.by=time)
# using apply on the xts object, find the minimum for each unique time, selecting
# with which, and just extracting the index instead of the entire array
smallest<-index(small[arrayInd(which(as.array(small)%in% apply(small, 2, min), arr.ind=T), dim(small))[,1]])

This lets you select those time stamps from your xts data

# select from your original xts_object those lines
selected<-xts_object[smallest]

Best, Ben