已更新为更现实的示例;这次在interp_b中添加了重复项。
我试图使用第二个数据帧(interp_b
)中的值填充一个数据帧(bait
)中的字段。我想查看obs_datetime
中每一行的interp_b
,并确定在obs_datetime
之前最后一次诱捕绘图站年份的时间。以后将用它来计算每个obs_datetime
的诱饵时间。诱饵时间位于bait
列的bait_datetime
数据框中。结果应放在latestbait_datetime
数据框中的interp_b
字段中。
我正在可视化一个迭代过程,其中interp_b“ latestbait_datetime”不断进行重新计算,直到到达诱饵数据帧中的最后一行。我尝试过的for循环显然在所有行中运行并进行了指定的计算,但是我似乎无法获得所需格式的输出。它为每个循环生成输出,而不是重写和更新interp_b数据帧。
这里有一些代码可以构建两个数据框; interp_b和诱饵(请原谅)
# interp_b dataframe----
structure(list(plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
"RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600,
1544954400, 1541084400, 1515160800, 1567756800), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"))), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
# bait dataframe----
structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400,
1559746800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(plot_station_year = structure(list(), class = c("collector_character",
"collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
预期的结果将如下所示
以下是我的两次尝试。第一个结果导致一个数据框仅包含循环的最后一次运行,第二个尝试导致一个数据框包含所有运行结果(与绑定所期望的一样)。
library(tidyverse)
#attempt #1----
for (i in 1:nrow(bait)) {
print(paste("row =",i))
interpbait <- interp_b %>%
mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))
}
#attempt #2----
resultb <- data.frame()
for (i in 1:nrow(bait)) {
print(paste("row =",i))
interpbait2 <- interp_b %>%
mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))
resultb <- bind_rows(resultb, interpbait2)
print(resultb)
}
任何帮助将不胜感激。
答案 0 :(得分:1)
我不确定这需要多长时间,但这是一个整洁的解决方案。对于interp_b
中的每一行,我们将bait
数据帧过滤到正确的plot_station_year
,并确保所有日期时间都比{{ 1}}。然后,我们通过将datetime降序排列过滤后的interp_b
数据(以使最新日期在最前面)。我们对数据框的第一行进行切片,以便仅获取最新日期。然后,我们从数据框中“提取”日期时间,并将其添加到bait
中的相应行中。
interp_b
结果表与您的预期输出(library(tidyverse)
library(progress) # for progress bar
# create progress bar to update, so that you can estimate the amount of time it will take to finish the entire loop
pb <- progress_bar$new(total = nrow(interp_b))
for (i in 1:nrow(interp_b)) {
last_time_baited <- bait %>%
#filter bait dataframe to appropriate plot, station, year based on
# the row in interp_b
filter(plot_station_year == interp_b$plot_station_year[i],
# ensure all datetimes are less than that row in interp_b
bait_datetime < interp_b$obs_datetime[i]) %>%
# arrange by datetime (most recent datetimes first)
arrange(desc(bait_datetime)) %>%
# take the top row - this will be the most recent date-time that
# the plot-station was baited
slice(1) %>%
# "pull" that value out of the dataframe so you have a value,
# not a tibble
pull(bait_datetime) #
# update the row in interp_b with the date_time baited
interp_b$latestbait_datetime[i] <- last_time_baited
pb$tick() # print progress
}
)相匹配:
interp_b
答案 1 :(得分:0)
您可以使用data.table
进行外部联接,然后为每个plot_station_year选择最高的bait_datetime。
编辑:我编辑了答案,以反映对于obs_datetime
中给定的唯一plot_station_year
可能存在多个interp2
的可能性。为了保留这些索引,我们将它们编入索引,并将索引包括在过滤步骤中。
大文件(未经测试)的一项潜在改进可能是使用roll
进行合并,而不是执行外部合并然后进行过滤。
该版本显示在可复制示例的末尾:
library(data.table)
interp2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018",
"Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 1559487300,
1544954400, 1515160800, 1567756800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_), class = c("POSIXct", "POSIXt"))), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))
bait2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "Cow_C2_2019",
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400,
1496674800, 1576674800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(plot_station_year = structure(list(), class = c("collector_character",
"collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
# add index idx by plot_station_year, remove empty column, set keys
setDT(interp2)[, "latestbait_datetime" := NULL][, idx := 1:.N, by=plot_station_year]
setkeyv(interp2, c("plot_station_year", "idx", "obs_datetime"))
# same for bait2: set as data.table, set keys
setDT(bait2, key=c("plot_station_year", "bait_datetime"))
## option 1: merge files, then filter
# outer join on interp2 and bait2 on first column (and order by bait_datetime)
expected_out <- merge(interp2, bait2, by="plot_station_year", all=TRUE)
# set keys for sorting
setkey(expected_out, plot_station_year, idx, bait_datetime)
# select highest bait_datetime below obs_datetime by plot_station_year and idx
expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][,
tail(.SD, 1), by=.(plot_station_year, idx)]
# rename and sort columns
setnames(expected_out, old="bait_datetime", new="latestbait_datetime")
setorder(expected_out, -latestbait_datetime, idx, na.last = TRUE)[]
#> plot_station_year idx obs_datetime latestbait_datetime
#> 1: Cow_C2_2019 1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4: Raf_C1_2018 1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 1 2019-09-06 08:00:00 <NA>
## option 2 (might use less memory): rolling join
bait2[, latestbait_datetime := bait_datetime]
out_alt <- bait2[interp2, .(plot_station_year, obs_datetime, idx, latestbait_datetime),
on=c("plot_station_year", "bait_datetime==obs_datetime"), roll=Inf]
# order
setorder(out_alt, -latestbait_datetime, idx, na.last = TRUE)[]
#> plot_station_year obs_datetime idx latestbait_datetime
#> 1: Cow_C2_2019 2019-06-02 15:00:00 1 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2019-06-02 14:55:00 2 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 2018-12-16 10:00:00 1 2018-12-01 15:00:00
#> 4: Raf_C1_2018 2018-01-05 14:00:00 1 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 2019-09-06 08:00:00 1 <NA>
setcolorder(out_alt, c(1,3,2,4))[]
#> plot_station_year idx obs_datetime latestbait_datetime
#> 1: Cow_C2_2019 1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2: Cow_C2_2019 2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4: Raf_C1_2018 1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5: Metcalfe_C2_2019 1 2019-09-06 08:00:00 <NA>
## test that both options give the same result:
identical(expected_out, out_alt)
#> [1] TRUE