Cumulative sum between dates using two data frames

Time: 2020-10-23 15:45:26

Tags: r datetime cumsum

I am trying to get the cumulative sum of a variable between two dates.

I have two data frames, shown below.

Data frame 1: event start/end:

# A tibble: 128 x 3
   event_start         event_end            year
   <dttm>              <dttm>              <int>
 1 2003-07-03 04:00:00 2003-07-04 10:00:00  2003
 2 2003-07-04 13:00:00 2003-07-05 18:00:00  2003
 3 2003-07-07 23:00:00 2003-07-09 17:00:00  2003
 4 2003-07-20 03:00:00 2003-07-22 19:00:00  2003
 5 2003-07-29 17:00:00 2003-07-30 18:00:00  2003
 6 2003-07-31 22:00:00 2003-08-03 20:00:00  2003
 7 2003-08-23 01:00:00 2003-08-24 13:00:00  2003
 8 2004-07-31 22:00:00 2004-08-05 03:00:00  2004
 9 2004-08-11 13:00:00 2004-08-12 17:00:00  2004
10 2004-08-26 01:00:00 2004-08-29 12:00:00  2004
...

Data frame 2: the hourly data I want the cumulative sum from; it has nearly 30,000 rows:

              datetime       Date month_day julian_day year rain_mm Temp_C discharge Air_Temp Net_Radiation Incoming_Shortwave_Radiation
1  2003-07-01 00:00:00 2003-07-01     07-01        182 2003     0.0  5.300     0.183     7.99        -40.31                         2.91
2  2003-07-01 01:00:00 2003-07-01     07-01        182 2003     0.0  4.910     0.178     7.15        -41.36                         1.63
3  2003-07-01 02:00:00 2003-07-01     07-01        182 2003     0.0  4.440     0.174     6.08        -42.76                         1.57
4  2003-07-01 03:00:00 2003-07-01     07-01        182 2003     0.0  4.210     0.168     5.61        -43.03                         1.63
5  2003-07-01 04:00:00 2003-07-01     07-01        182 2003     0.0  3.970     0.164     4.26        -41.51                         2.84
6  2003-07-01 05:00:00 2003-07-01     07-01        182 2003     0.0  3.740     0.155     3.58        -30.97                        15.27
7  2003-07-01 06:00:00 2003-07-01     07-01        182 2003     0.0  3.580     0.148     5.90         -3.40                        67.20
8  2003-07-01 07:00:00 2003-07-01     07-01        182 2003     0.0  3.660     0.141     9.47         75.78                       191.00
9  2003-07-01 08:00:00 2003-07-01     07-01        182 2003     0.0  4.130     0.136    12.52        180.31                       303.65
10 2003-07-01 09:00:00 2003-07-01     07-01        182 2003     0.0  4.755     0.129    14.47        303.49                       425.95
11 2003-07-01 10:00:00 2003-07-01     07-01        182 2003     0.0  5.925     0.125    15.41        433.01                       555.10
12 2003-07-01 11:00:00 2003-07-01     07-01        182 2003     0.0  7.095     0.122    16.66        536.61                       656.30
...

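I am trying to get the cumulative sum of a variable, specifically "rain_mm", over each period between event_start and event_end. The output data frame I hope to end up with would look like this (note: the cumsum_rain_mm values are made up for this example):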

# A tibble: 11 x 3
   event_start         event_end            year   cumsum_rain_mm
   <dttm>              <dttm>              <int>
 1 2005-07-04 09:00:00 2005-07-05 12:00:00  2005   11.2
 2 2005-07-06 22:00:00 2005-07-08 00:00:00  2005   7.1
 3 2005-07-10 22:00:00 2005-07-11 23:00:00  2005   7.1
...
10 2005-08-27 02:00:00 2005-08-29 04:00:00  2005   5.8
11 2005-08-30 17:00:00 2007-07-01 20:00:00  2005   6.4

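I can't simply summarise the hourly data by the two datetime columns, and I'm not sure where to start, especially because the two data frames have very different numbers of rows.

Edit: The initial solution worked, but the sum_rain_mm column now appears to be wrong; it looks like this: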

# A tibble: 80 x 2
   interval                                         sum_rain_mm
   <Interval>                                             <dbl>
 1 2003-07-20 14:00:00 PDT--2003-07-21 03:00:00 PDT        412.
 2 2003-07-21 05:00:00 PDT--2003-07-22 01:00:00 PDT        412.
 3 2003-07-29 12:00:00 PDT--2003-07-30 02:00:00 PDT        412.
 4 2003-07-31 18:00:00 PDT--2003-08-01 15:00:00 PDT        412.
 5 2003-08-05 01:00:00 PDT--2003-08-05 14:00:00 PDT        412.
 6 2003-08-22 23:00:00 PDT--2003-08-23 23:00:00 PDT        412.
 7 2003-08-30 17:00:00 PDT--2003-08-31 06:00:00 PDT        412.
 8 2004-07-09 09:00:00 PDT--2004-07-11 03:00:00 PDT        412.
 9 2004-07-30 09:00:00 PDT--2004-07-31 09:00:00 PDT        412.
10 2004-08-02 02:00:00 PDT--2004-08-02 20:00:00 PDT        412.
# ... with 70 more rows

1 Answer:

Answer 0 (score: 2)

A tidyverse-based solution could look like this:

library(dplyr)
library(lubridate)
library(tidyr)

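df1 %>%
  # pair every event with every hourly row
  crossing(df2) %>%
  # parse the character timestamps and build one interval per event
  mutate(across(c(event_start, event_end, dateTime), ymd_hms),
         interval = interval(event_start, event_end)) %>%
  # keep only the hourly rows that fall inside their event (endpoints included)
  filter(dateTime %within% interval) %>%
  # total the rain per event
  group_by(interval) %>%
  mutate(sum_rain_mm = sum(rain_mm)) %>%
  ungroup() %>%
  # reduce to one row per event
  select(interval, sum_rain_mm) %>%
  distinct()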

#   interval                                         sum_rain_mm
#   <Interval>                                             <dbl>
# 1 2003-07-03 04:00:00 UTC--2003-07-04 10:00:00 UTC           1
# 2 2003-07-04 13:00:00 UTC--2003-07-05 18:00:00 UTC           2
# 3 2003-07-07 23:00:00 UTC--2003-07-09 17:00:00 UTC           6
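
As a variation on the pipeline above, here is a minimal sketch that groups by the event's own start and end columns and collapses each group with summarise(), so the result keeps event_start and event_end as ordinary date-time columns. The names `events`, `hourly` and `event_totals` are placeholders chosen for this illustration, and the data simply mirror the demo data from this answer; this is one possible arrangement, not the only one.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Placeholder data mirroring the demo data in this answer,
# with the timestamps already parsed to date-times
events <- tibble(
  event_start = ymd_hms(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                          "2003-07-07 23:00:00")),
  event_end   = ymd_hms(c("2003-07-04 10:00:00", "2003-07-05 18:00:00",
                          "2003-07-09 17:00:00"))
)
hourly <- tibble(
  dateTime = ymd_hms(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                       "2003-07-07 23:00:00", "2003-07-07 23:00:01")),
  rain_mm  = c(1, 2, 3, 3)
)

event_totals <- events %>%
  # every event paired with every hourly row
  crossing(hourly) %>%
  # keep rows whose timestamp lies inside the event, endpoints included
  filter(dateTime >= event_start, dateTime <= event_end) %>%
  # one row per event with the rain total
  group_by(event_start, event_end) %>%
  summarise(sum_rain_mm = sum(rain_mm), .groups = "drop")

event_totals
```

Note that an event with no hourly rows inside it does not appear in `event_totals` at all, because the filter removes every row for that event before the grouping step.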

Arbitrary demo data:

df1 <- structure(list(event_start = c("2003-07-03 04:00:00", "2003-07-04 13:00:00", 
"2003-07-07 23:00:00"), event_end = c("2003-07-04 10:00:00", 
"2003-07-05 18:00:00", "2003-07-09 17:00:00")), row.names = c(NA, 
-3L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(dateTime = c("2003-07-03 04:00:00", "2003-07-04 13:00:00", 
"2003-07-07 23:00:00", "2003-07-07 23:00:01"), rain_mm = c(1, 
2, 3, 3)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", 
"data.frame"))
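
If the goal is to attach the total directly to the event table, as in the desired output in the question, a base R loop over the events is another option that avoids building the full cross join. The sketch below is only an illustration under assumptions: `events` and `hourly` are placeholder names holding the same values as the demo data, parsed as UTC date-times, and `sum_rain_mm` is the new column.

```r
# Placeholder data with the same values as the demo data, parsed as UTC
events <- data.frame(
  event_start = as.POSIXct(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                             "2003-07-07 23:00:00"), tz = "UTC"),
  event_end   = as.POSIXct(c("2003-07-04 10:00:00", "2003-07-05 18:00:00",
                             "2003-07-09 17:00:00"), tz = "UTC")
)
hourly <- data.frame(
  dateTime = as.POSIXct(c("2003-07-03 04:00:00", "2003-07-04 13:00:00",
                          "2003-07-07 23:00:00", "2003-07-07 23:00:01"),
                        tz = "UTC"),
  rain_mm  = c(1, 2, 3, 3)
)

# For each event, sum the rain of the hourly rows inside [event_start, event_end]
events$sum_rain_mm <- vapply(
  seq_len(nrow(events)),
  function(i) {
    inside <- hourly$dateTime >= events$event_start[i] &
      hourly$dateTime <= events$event_end[i]
    sum(hourly$rain_mm[inside])
  },
  numeric(1)
)

events
```

Here every event keeps its row, and an event with no hourly observations gets a total of 0 rather than being dropped.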