I am trying to get the cumulative sum of a variable between two dates.
I have two data frames, shown below:
Data frame 1: event start/end:
# A tibble: 128 x 3
event_start event_end year
<dttm> <dttm> <int>
1 2003-07-03 04:00:00 2003-07-04 10:00:00 2003
2 2003-07-04 13:00:00 2003-07-05 18:00:00 2003
3 2003-07-07 23:00:00 2003-07-09 17:00:00 2003
4 2003-07-20 03:00:00 2003-07-22 19:00:00 2003
5 2003-07-29 17:00:00 2003-07-30 18:00:00 2003
6 2003-07-31 22:00:00 2003-08-03 20:00:00 2003
7 2003-08-23 01:00:00 2003-08-24 13:00:00 2003
8 2004-07-31 22:00:00 2004-08-05 03:00:00 2004
9 2004-08-11 13:00:00 2004-08-12 17:00:00 2004
10 2004-08-26 01:00:00 2004-08-29 12:00:00 2004
...
Data frame 2: hourly data from which I want to take the cumulative sum; it has nearly 30,000 rows:
datetime Date month_day julian_day year rain_mm Temp_C discharge Air_Temp Net_Radiation Incoming_Shortwave_Radiation
1 2003-07-01 00:00:00 2003-07-01 07-01 182 2003 0.0 5.300 0.183 7.99 -40.31 2.91
2 2003-07-01 01:00:00 2003-07-01 07-01 182 2003 0.0 4.910 0.178 7.15 -41.36 1.63
3 2003-07-01 02:00:00 2003-07-01 07-01 182 2003 0.0 4.440 0.174 6.08 -42.76 1.57
4 2003-07-01 03:00:00 2003-07-01 07-01 182 2003 0.0 4.210 0.168 5.61 -43.03 1.63
5 2003-07-01 04:00:00 2003-07-01 07-01 182 2003 0.0 3.970 0.164 4.26 -41.51 2.84
6 2003-07-01 05:00:00 2003-07-01 07-01 182 2003 0.0 3.740 0.155 3.58 -30.97 15.27
7 2003-07-01 06:00:00 2003-07-01 07-01 182 2003 0.0 3.580 0.148 5.90 -3.40 67.20
8 2003-07-01 07:00:00 2003-07-01 07-01 182 2003 0.0 3.660 0.141 9.47 75.78 191.00
9 2003-07-01 08:00:00 2003-07-01 07-01 182 2003 0.0 4.130 0.136 12.52 180.31 303.65
10 2003-07-01 09:00:00 2003-07-01 07-01 182 2003 0.0 4.755 0.129 14.47 303.49 425.95
11 2003-07-01 10:00:00 2003-07-01 07-01 182 2003 0.0 5.925 0.125 15.41 433.01 555.10
12 2003-07-01 11:00:00 2003-07-01 07-01 182 2003 0.0 7.095 0.122 16.66 536.61 656.30
...
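I am trying to get the cumulative sum of a variable, specifically "rain_mm", over each period between the event_start and event_end datetimes. The output data frame I am hoping for would look like this (note: the cumsum_rain_mm values are made up for this example):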
# A tibble: 11 x 4
event_start event_end year cumsum_rain_mm
<dttm> <dttm> <int> <dbl>
1 2005-07-04 09:00:00 2005-07-05 12:00:00 2005 11.2
2 2005-07-06 22:00:00 2005-07-08 00:00:00 2005 7.1
3 2005-07-10 22:00:00 2005-07-11 23:00:00 2005 7.1
...
10 2005-08-27 02:00:00 2005-08-29 04:00:00 2005 5.8
11 2005-08-30 17:00:00 2007-07-01 20:00:00 2005 6.4
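I cannot simply aggregate by hour using the two datetime columns, and I am not sure where to start, especially since the two data frames have very different numbers of rows.
Edit: The initial solution worked, but now the sum_rain_mm column appears to be incorrect; it looks like this: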
# A tibble: 80 x 2
interval sum_rain_mm
<Interval> <dbl>
1 2003-07-20 14:00:00 PDT--2003-07-21 03:00:00 PDT 412.
2 2003-07-21 05:00:00 PDT--2003-07-22 01:00:00 PDT 412.
3 2003-07-29 12:00:00 PDT--2003-07-30 02:00:00 PDT 412.
4 2003-07-31 18:00:00 PDT--2003-08-01 15:00:00 PDT 412.
5 2003-08-05 01:00:00 PDT--2003-08-05 14:00:00 PDT 412.
6 2003-08-22 23:00:00 PDT--2003-08-23 23:00:00 PDT 412.
7 2003-08-30 17:00:00 PDT--2003-08-31 06:00:00 PDT 412.
8 2004-07-09 09:00:00 PDT--2004-07-11 03:00:00 PDT 412.
9 2004-07-30 09:00:00 PDT--2004-07-31 09:00:00 PDT 412.
10 2004-08-02 02:00:00 PDT--2004-08-02 20:00:00 PDT 412.
# ... with 70 more rows
Answer 0 (score: 2)
A tidyverse-based solution could look like this:
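library(dplyr)
library(lubridate)
library(tidyr)
# Note: the demo data below names the timestamp column dateTime; the question's data frame 2 calls it datetime.
df1 %>%
  crossing(df2) %>%  # pair every event with every hourly row
  # the demo data stores timestamps as character strings, so parse them first
  mutate(across(c(event_start, event_end, dateTime), ymd_hms),
         interval = interval(event_start, event_end)) %>%
  filter(dateTime %within% interval) %>%  # keep rows that fall inside the event
  group_by(interval) %>%
  mutate(sum_rain_mm = sum(rain_mm)) %>%  # total rain per event
  ungroup() %>%
  select(interval, sum_rain_mm) %>%
  distinct()  # one row per event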
# interval sum_rain_mm
# <Interval> <dbl>
# 1 2003-07-03 04:00:00 UTC--2003-07-04 10:00:00 UTC 1
# 2 2003-07-04 13:00:00 UTC--2003-07-05 18:00:00 UTC 2
# 3 2003-07-07 23:00:00 UTC--2003-07-09 17:00:00 UTC 6
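One detail of the filter step is worth checking: %within% treats both ends of the interval as included, so an hourly reading stamped exactly at event_start or event_end is counted. A minimal sketch, using the timestamps of the first demo event (the names iv, at_start, at_end and after_end are made up for illustration):

```r
library(lubridate)

# Interval for the first demo event.
iv <- interval(ymd_hms("2003-07-03 04:00:00"), ymd_hms("2003-07-04 10:00:00"))

at_start  <- ymd_hms("2003-07-03 04:00:00") %within% iv  # exactly at the start
at_end    <- ymd_hms("2003-07-04 10:00:00") %within% iv  # exactly at the end
after_end <- ymd_hms("2003-07-04 10:00:01") %within% iv  # one second past the end

print(c(at_start = at_start, at_end = at_end, after_end = after_end))
```

If two events ever share a boundary timestamp, that reading would therefore be counted in both of them.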
Arbitrary demo data:
df1 <- structure(list(event_start = c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
"2003-07-07 23:00:00"), event_end = c("2003-07-04 10:00:00",
"2003-07-05 18:00:00", "2003-07-09 17:00:00")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(dateTime = c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
"2003-07-07 23:00:00", "2003-07-07 23:00:01"), rain_mm = c(1,
2, 3, 3)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
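For the full data (128 events against roughly 30,000 hourly rows), crossing() builds every event/row pair before filtering. As a rough alternative sketch, not part of the original answer, the same per-event sum can be computed in base R by looping over the events and summing the rows whose timestamp lies between event_start and event_end, both ends included. It reuses the demo values above; the names events, hourly and in_event are assumptions made for illustration, as is the choice of UTC as time zone.

```r
# Demo data mirroring df1 and df2 above, parsed as POSIXct (UTC is an assumption).
events <- data.frame(
  event_start = as.POSIXct(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                             "2003-07-07 23:00:00"), tz = "UTC"),
  event_end   = as.POSIXct(c("2003-07-04 10:00:00", "2003-07-05 18:00:00",
                             "2003-07-09 17:00:00"), tz = "UTC")
)
hourly <- data.frame(
  dateTime = as.POSIXct(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                          "2003-07-07 23:00:00", "2003-07-07 23:00:01"), tz = "UTC"),
  rain_mm  = c(1, 2, 3, 3)
)

# For each event, sum rain_mm over rows with event_start <= dateTime <= event_end.
events$sum_rain_mm <- vapply(seq_len(nrow(events)), function(i) {
  in_event <- hourly$dateTime >= events$event_start[i] &
              hourly$dateTime <= events$event_end[i]
  sum(hourly$rain_mm[in_event])
}, numeric(1))

print(events)
```

On the demo data this gives 1, 2 and 6, matching the output shown in the answer. An event with no matching rows gets a sum of 0 here, whereas the filter-based pipeline drops such an event from the result.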