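I have two data frames. One contains the times at which each net was set (start) and lifted (end). I need to use that time period to calculate the mean DO from the second data frame. I am having trouble matching the times, because the times in the catch data frame are exact while the times in the DO data frame are hourly. I tried matching to the nearest time but could not get it to work, so my next idea is to use every reading that falls within the time interval. The mean needs to be specific to the boat and net within that interval. I am doing this for several very large datasets, so I need something I can apply across all rows rather than writing code for each row.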
catchdf:
Boat Net Set Lift
Dawn 26 2016-05-19 12:20:11 2016-05-27 11:48:36
Nip 26 2016-07-28 07:25:47 2016-08-07 06:13:10
Dawn 26.3 2016-08-01 13:24:51 2016-08-03 07:48:52
dodf:
time DO Boat Net
2016-05-19 13:00:00 10.2 Dawn 26
2016-05-21 15:00:00 10.4 Dawn 26
2016-05-26 09:00:00 10.9 Dawn 26
2016-05-28 10:00:00 9.4 Dawn 26
2016-07-28 09:00:00 11.9 Nip 26
2016-07-28 19:00:00 12.4 Nip 26
2016-08-04 04:00:00 5.4 Nip 26
2016-08-01 05:00:00 13.2 Dawn 26.3
2016-08-02 16:00:00 12.3 Dawn 26.3
2016-08-05 22:00:00 2.4 Dawn 26.3
My goal is to add a column to catchdf holding the mean DO within each time period.
Boat Net Set Lift MeanDO
Dawn 26 2016-05-19 12:20:11 2016-05-27 11:48:36 10.50
Nip 26 2016-07-28 07:25:47 2016-08-07 06:13:10 9.90
Dawn 26.3 2016-08-01 13:24:51 2016-08-03 07:48:52 12.3
Here is the code I used to create the sample data frames:
catchdf <- structure(list(Boat = c("Dawn", "Nip", "Dawn"), Net = c("26", "26", "26.3"), Set = c("2016-05-19 12:20:11", "2016-07-28 07:25:47", "2016-08-01 13:24:51"), Lift = c("2016-05-27 11:48:36", "2016-08-07 06:13:10", "2016-08-03 07:48:52")), .Names = c("Boat", "Net", "Set", "Lift"), class = "data.frame", row.names = c(NA, -3L))
dodf <- structure(list(time = c("2016-05-19 13:00:00", "2016-05-21 15:00:00", "2016-05-26 09:00:00", "2016-05-28 10:00:00", "2016-07-28 09:00:00", "2016-07-28 19:00:00", "2016-08-04 04:00:00", "2016-08-01 05:00:00", "2016-08-02 16:00:00", "2016-08-05 22:00:00"), DO = c("10.2", "10.4", "10.9", "9.4", "11.9", "12.4", "5.4", "13.2", "12.3", "2.4"), Boat = c("Dawn", "Dawn", "Dawn", "Dawn", "Nip", "Nip", "Nip", "Dawn", "Dawn", "Dawn"), Net = c("26", "26", "26", "26", "26", "26", "26", "26.3", "26.3", "26.3")), .Names = c("time", "DO", "Boat", "Net"), class = "data.frame", row.names = c(NA, -10L))
I have been stuck on this for quite a while and have not made any progress. Any help would be greatly appreciated.
Answer 0 (score: 2)
Here is a possible data.table solution. First, we fix the column formats:
library(data.table) #v1.10.4
cols <- c("Set", "Lift")
setDT(catchdf)[, (cols) := lapply(.SD, as.POSIXct), .SDcols = cols]
setDT(dodf)[, `:=`(time = as.POSIXct(time), DO = as.numeric(DO))]
Then we can do a non-equi join between the two data sets while calculating the means on the fly:
dodf[catchdf, .(MeanDO = mean(DO)), on = .(Boat, Net, time > Set, time < Lift), by = .EACHI]
# Boat Net time time MeanDO
# 1: Dawn 26 2016-05-19 12:20:11 2016-05-27 11:48:36 10.5
# 2: Nip 26 2016-07-28 07:25:47 2016-08-07 06:13:10 9.9
# 3: Dawn 26.3 2016-08-01 13:24:51 2016-08-03 07:48:52 12.3
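The question asks for MeanDO as a new column of catchdf, which the join above does not do by itself. A minimal sketch of one way to do it, assuming the data.table package is installed and rebuilding the sample data so the snippet runs on its own (the UTC time zone and the intermediate name `res` are assumptions for illustration):

```r
library(data.table)

# Sample data from the question, with proper types (UTC is an assumption)
catchdf <- data.table(
  Boat = c("Dawn", "Nip", "Dawn"),
  Net  = c("26", "26", "26.3"),
  Set  = as.POSIXct(c("2016-05-19 12:20:11", "2016-07-28 07:25:47",
                      "2016-08-01 13:24:51"), tz = "UTC"),
  Lift = as.POSIXct(c("2016-05-27 11:48:36", "2016-08-07 06:13:10",
                      "2016-08-03 07:48:52"), tz = "UTC")
)
dodf <- data.table(
  time = as.POSIXct(c("2016-05-19 13:00:00", "2016-05-21 15:00:00",
                      "2016-05-26 09:00:00", "2016-05-28 10:00:00",
                      "2016-07-28 09:00:00", "2016-07-28 19:00:00",
                      "2016-08-04 04:00:00", "2016-08-01 05:00:00",
                      "2016-08-02 16:00:00", "2016-08-05 22:00:00"), tz = "UTC"),
  DO   = c(10.2, 10.4, 10.9, 9.4, 11.9, 12.4, 5.4, 13.2, 12.3, 2.4),
  Boat = c("Dawn", "Dawn", "Dawn", "Dawn", "Nip", "Nip", "Nip",
           "Dawn", "Dawn", "Dawn"),
  Net  = c("26", "26", "26", "26", "26", "26", "26", "26.3", "26.3", "26.3")
)

# by = .EACHI yields one result row per row of catchdf, in catchdf's order
res <- dodf[catchdf, .(MeanDO = mean(DO)),
            on = .(Boat, Net, time > Set, time < Lift), by = .EACHI]

# Add the means to catchdf by reference; rows line up one to one
catchdf[, MeanDO := res$MeanDO]
print(catchdf)
```

A deployment with no DO readings inside its interval should come out as NA rather than being dropped, since the join keeps every row of catchdf.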
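Answer 1 (score: 1)
Note that the DO values in your dodf data.frame are stored as character strings rather than numbers. I would suggest building it with data.frame() and entering the values as numeric from the get-go.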
dodf <- data.frame(
  time = c("2016-05-19 13:00:00", "2016-05-21 15:00:00", "2016-05-26 09:00:00",
           "2016-05-28 10:00:00", "2016-07-28 09:00:00", "2016-07-28 19:00:00",
           "2016-08-04 04:00:00", "2016-08-01 05:00:00", "2016-08-02 16:00:00",
           "2016-08-05 22:00:00"),
  DO = c(10.2, 10.4, 10.9, 9.4, 11.9, 12.4, 5.4, 13.2, 12.3, 2.4), # numeric, not quoted
  Boat = c("Dawn", "Dawn", "Dawn", "Dawn", "Nip", "Nip", "Nip", "Dawn", "Dawn", "Dawn"),
  Net = c("26", "26", "26", "26", "26", "26", "26", "26.3", "26.3", "26.3"),
  stringsAsFactors = FALSE) # keep time, Boat and Net as character in R < 4.0
library(tidyverse)
library(lubridate)
dodf %>%
  left_join(catchdf, by = c("Boat", "Net")) %>% # join the data.frames
  mutate(time = ymd_hms(time), # parse the columns as date-times
         Set = ymd_hms(Set),
         Lift = ymd_hms(Lift)) %>%
  filter(time >= Set & time <= Lift) %>% # keep only readings inside the sample periods
  group_by(Boat, Net, Set, Lift) %>% # one group per deployment, even if a net is set more than once
  summarise(meanDO = mean(DO))
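The pipeline above returns a summary table and drops any deployment that has no readings in its interval. If the aim is one row per catchdf row with the mean attached, one option is to start the join from catchdf instead. A minimal sketch, assuming dplyr 1.0 or later (for the `.groups` argument), a UTC time zone, and that Boat, Net, Set and Lift together identify a deployment; `mean_do` is a name made up for illustration:

```r
library(dplyr)

# Sample data from the question, rebuilt so the snippet runs on its own
catchdf <- data.frame(
  Boat = c("Dawn", "Nip", "Dawn"),
  Net  = c("26", "26", "26.3"),
  Set  = c("2016-05-19 12:20:11", "2016-07-28 07:25:47", "2016-08-01 13:24:51"),
  Lift = c("2016-05-27 11:48:36", "2016-08-07 06:13:10", "2016-08-03 07:48:52"),
  stringsAsFactors = FALSE
)
dodf <- data.frame(
  time = c("2016-05-19 13:00:00", "2016-05-21 15:00:00", "2016-05-26 09:00:00",
           "2016-05-28 10:00:00", "2016-07-28 09:00:00", "2016-07-28 19:00:00",
           "2016-08-04 04:00:00", "2016-08-01 05:00:00", "2016-08-02 16:00:00",
           "2016-08-05 22:00:00"),
  DO   = c(10.2, 10.4, 10.9, 9.4, 11.9, 12.4, 5.4, 13.2, 12.3, 2.4),
  Boat = c("Dawn", "Dawn", "Dawn", "Dawn", "Nip", "Nip", "Nip",
           "Dawn", "Dawn", "Dawn"),
  Net  = c("26", "26", "26", "26", "26", "26", "26", "26.3", "26.3", "26.3"),
  stringsAsFactors = FALSE
)

mean_do <- catchdf %>%
  mutate(Set  = as.POSIXct(Set,  tz = "UTC"),
         Lift = as.POSIXct(Lift, tz = "UTC")) %>%
  # attach every DO reading for the same boat and net
  left_join(mutate(dodf, time = as.POSIXct(time, tz = "UTC")),
            by = c("Boat", "Net")) %>%
  group_by(Boat, Net, Set, Lift) %>%
  # average only the readings that fall inside the set-lift interval
  summarise(MeanDO = mean(DO[time >= Set & time <= Lift]), .groups = "drop")

print(mean_do)
```

Note that the join builds every boat/net pairing of deployments and readings before filtering, so on very large datasets the non-equi join in the data.table answer is likely to use less memory.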