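I have a data frame showing when animals were detected at different sites. I want to filter rows out of the detection file (df) for site A only: a site A detection should be removed if the same individual animal was not also detected at site B within a time window (5 minutes). I need to do this for every animal and across multiple sites. My real data has many animals and more than a million detection observations, so I am looking for an efficient data.table solution.
The two variables are the individual (animal) and the site where it was detected.
Example: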
obs.num <- 1:21 # a simple observation number
# a fake list of animal IDs (my data has many)
animal <- c(rep("RBT 1", 10), rep("RBT 2", 7), rep("RBT 3", 2), "RBT 4", "RBT 2")
now <- Sys.time()
# create a fake series of time stamps
ts <- seq(from = now, length.out = 16, by = "mins")
ts <- c(ts, seq(from = tail(ts, 1), length.out = 3, by = "hour"))
ts <- c(ts, seq(from = tail(ts, 1), length.out = 2, by = "hour"))
df <- data.frame(obs.num, animal, ts) # make data frame
# make a fake series of sites where detection occurred
df$site <- c("A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","A","B","B")
str(df)
df # my example data frame
In this example, I want to remove the entire row for observation 19.
I am looking for a data.table solution similar to this one:
library(sqldf)
sqldf("with B as (select * from df where site == 'B')
select distinct df.* from df
join B on df.animal = B.animal and
B.ts - df.ts between -5 * 60 and 5 * 60
order by 1")
Answer 0 (score: 1)
It's a bit clunky, but you can do this with a non-equi join in data.table:
library(data.table)
setDT(df)
nm = names(df)
# unfortunately non-equi-joins don't support on-the-fly
# columns yet, so we have to first define them explicitly; see:
# https://github.com/Rdatatable/data.table/issues/1639
df[ , ts_minus_5 := ts - 5*60]
df[ , ts_plus_5 := ts + 5*60]
# identify the observations _matching_ your criteria (i.e. those to keep)
found_at_b = unique(
  df[site == 'A'][df[site == 'B'], .(x.obs.num, x.animal),
                  on = .(animal == animal, ts >= ts_minus_5, ts <= ts_plus_5),
                  # allow.cartesian allows this join to return any
                  #   number of rows, necessary since any "B" row
                  #   might match multiple "A" rows;
                  # nomatch = 0L drops any "B" row without a
                  #   match found in "A" rows
                  allow.cartesian = TRUE, nomatch = 0L]
)
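# to filter, define a "drop" flag (could also call it "filter");
# FALSE marks the rows to keep: every "B" row, plus the "A" rows matched above
df[site == 'B', drop := FALSE]
df[found_at_b, on = c(obs.num = 'x.obs.num', animal = 'x.animal'),
   drop := FALSE]
# could define drop = TRUE for the other rows, but no need:
# their drop is NA, and data.table treats NA in i as FALSE
df = df[(!drop)]
# remove the helper columns added along the way
df[ , c('ts_minus_5', 'ts_plus_5', 'drop') := NULL]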
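The last filtering step leaves the flag undefined (NA) on the rows to be removed, which works because data.table treats NA in a logical i as FALSE. A minimal sketch of that behavior, using a hypothetical three-row table called toy (not part of the example data):

```r
library(data.table)

# Hypothetical three-row table; `drop` is only ever set on the rows to keep
toy <- data.table(id = 1:3)
toy[id != 2L, drop := FALSE]  # row 2 never gets a value, so its drop is NA

# !NA is NA, and data.table does not select rows where i is NA
kept <- toy[(!drop)]
print(kept$id)
```

Only the rows explicitly flagged FALSE survive, so there is no need to fill in TRUE for the rest.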
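There are a few ways to clean this code up by being more careful about where copies might get created, perhaps by first splitting the data by site, combining the [] steps into a single call, and so on, but this should get you started.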
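As one possible tidy-up along those lines, here is a rough sketch that builds the window columns on a small copy of the site B rows only, so the main table is never modified. It assumes only sites A and B are present. The table dt, the fixed timestamps, and the names window, b_win and a_keep are made up for illustration and are not the asker's data.

```r
library(data.table)

# Hypothetical toy data with fixed timestamps (not the asker's df)
t0 <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
dt <- data.table(
  obs.num = 1:6,
  animal  = c("RBT 1", "RBT 1", "RBT 1", "RBT 2", "RBT 2", "RBT 3"),
  site    = c("A", "B", "A", "A", "B", "A"),
  ts      = t0 + c(0, 60, 3600, 0, 600, 0)
)

window <- 5 * 60  # five minutes, in seconds

# The window bounds live on a copy of the "B" rows only
b_win <- dt[site == "B", .(animal, lo = ts - window, hi = ts + window)]

# Non-equi join: "A" rows whose ts falls inside a "B" window
# for the same animal; nomatch = 0L drops windows with no "A" match
a_keep <- dt[site == "A"][b_win,
  .(obs.num = x.obs.num),
  on = .(animal, ts >= lo, ts <= hi),
  nomatch = 0L, allow.cartesian = TRUE]

# Keep every "B" row plus the matched "A" rows
res <- dt[site == "B" | obs.num %in% a_keep$obs.num]
print(res)
```

In this toy data only observation 1 has a site B detection of the same animal within five minutes, so the site A observations 3, 4 and 6 are filtered out. With more than two sites, the final filter would need adjusting to decide what happens to rows from the other sites.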