Question

I have two data frames. The first holds the company's forecast demand for each item over the next 27 days, as shown below:

library(tidyverse)
library(lubridate)

daily_forecast <- data.frame(
  item=c("A","B","A","B"),
  date_fcsted=c("2020-8-1","2020-8-1","2020-8-15","2020-8-15"),
  fcsted_qty=c(100,200,200,100)
) %>% 
  mutate(date_fcsted=ymd(date_fcsted)) %>% 
  mutate(extended_date=date_fcsted+days(27))

The other data set holds the actual daily demand for each item:

actual_orders <- data.frame(
  order_date=rep(seq(ymd("2020-8-3"),ymd("2020-9-15"),by = "1 week"),2),
  item=rep(c("A","B"),7),
  order_qty=round(rnorm(n=14,mean=50,sd=10),0)
)

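What I want to do is get the total actual demand for each item between date_fcsted and extended_date in the first data set, and then join the two so I can calculate forecast accuracy.

A tidyverse solution would be highly appreciated.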

Answer 1

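You can try the following:

library(dplyr)

daily_forecast %>%
  # Pair every forecast row with every order for the same item.
  # (dplyr >= 1.1.0 warns about a many-to-many relationship here; that is expected.)
  left_join(actual_orders, by = 'item') %>%
  # Keep only the orders that fall inside each forecast window (inclusive).
  filter(order_date >= date_fcsted & order_date <= extended_date) %>%
  # One row per forecast, with the summed actual demand.
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(value = sum(order_qty))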

#  item  date_fcsted extended_date fcsted_qty value
#  <chr> <date>      <date>             <dbl> <dbl>
#1 A     2020-08-01  2020-08-28           100   179
#2 A     2020-08-15  2020-09-11           200   148
#3 B     2020-08-01  2020-08-28           200   190
#4 B     2020-08-15  2020-09-11           100   197
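
If your dplyr is version 1.1.0 or newer, the join and the filter above can probably be folded into a single non-equi join with join_by(). The following is only a sketch under that assumption, not part of the original answer. The set.seed() call and the names range_totals and actual_total_demand are my own additions for illustration.

```r
library(dplyr)      # join_by() requires dplyr >= 1.1.0
library(lubridate)

set.seed(42)  # assumption: fixed seed so the random order quantities are reproducible

# Same data as in the question
daily_forecast <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = ymd(c("2020-8-1", "2020-8-1", "2020-8-15", "2020-8-15")),
  fcsted_qty = c(100, 200, 200, 100)
) %>%
  mutate(extended_date = date_fcsted + days(27))

actual_orders <- data.frame(
  order_date = rep(seq(ymd("2020-8-3"), ymd("2020-9-15"), by = "1 week"), 2),
  item = rep(c("A", "B"), 7),
  order_qty = round(rnorm(n = 14, mean = 50, sd = 10), 0)
)

# Match on item and keep only orders with date_fcsted <= order_date <= extended_date
range_totals <- daily_forecast %>%
  left_join(
    actual_orders,
    join_by(item, date_fcsted <= order_date, extended_date >= order_date)
  ) %>%
  group_by(item, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty), .groups = "drop")

range_totals
```

Unlike join-then-filter, a left join written this way keeps forecast rows that have no orders in their window. Their total would come out as NA unless you pass na.rm = TRUE to sum().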

Answer 2

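You can also try fuzzy_join, as @Gregor Thomas suggested. I added a row-number column to make sure every row stays unique regardless of item and date range (though this may not be necessary).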

library(fuzzyjoin)
library(dplyr)

daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))

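Output (the totals differ from Answer 1 because actual_orders is generated with rnorm() and no seed is set):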

     rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
  <int> <chr>  <date>      <date>             <dbl>               <dbl>
1     1 A      2020-08-01  2020-08-28           100                 221
2     2 B      2020-08-01  2020-08-28           200                 219
3     3 A      2020-08-15  2020-09-11           200                 212
4     4 B      2020-08-15  2020-09-11           100                 216
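
The question mentions forecast accuracy but does not say which metric is wanted. As a rough sketch, assuming absolute error and absolute percentage error relative to actual demand are acceptable, the joined totals could be scored as below. The numbers are copied from the output above. The names joined, abs_error and ape are hypothetical.

```r
library(dplyr)

# Totals copied from the output shown above
joined <- data.frame(
  item = c("A", "B", "A", "B"),
  date_fcsted = as.Date(c("2020-08-01", "2020-08-01", "2020-08-15", "2020-08-15")),
  fcsted_qty = c(100, 200, 200, 100),
  actual_total_demand = c(221, 219, 212, 216)
)

accuracy <- joined %>%
  mutate(
    abs_error = abs(actual_total_demand - fcsted_qty),  # absolute forecast error
    ape = abs_error / actual_total_demand               # assumption: error relative to actuals
  )

accuracy
```

Dividing by actual demand breaks down when a window has zero orders, so a different denominator or metric may suit sparse items better.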

Is there a way to join two data frames by date range?
