Suppose I have two datasets. One contains a list of promotions with start/end dates, and the other contains monthly sales data for each program.
promotions = data.frame(
start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
program = c("a", "a", "a", "b", "b"))
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200))
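Note that sales$year.month.day is used to represent the year/month. The day is included only so that R can treat the column as a vector of Date objects; it has no bearing on the actual sales.
I need to determine the number of promotions that were running in each month for each program. Here is an example of a loop that produces the output I want: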
sales$count = rep(0, nrow(sales))
sub = list()
for (i in 1:nrow(sales)) {
  sub[[i]] = promotions[which(promotions$program == sales$program[i]),]
  if (nrow(sub[[i]]) > 1) {
    for (j in 1:nrow(sub[[i]])) {
      if (sales$year.month.day[i] %in% seq(from = as.Date(sub[[i]]$start.date[j]), to = as.Date(sub[[i]]$end.date[j]), by = "day")) {
        sales$count[i] = sales$count[i] + 1
      }
    }
  }
}
Example output:
sales = data.frame(
year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
program = c("a", "b", "a", "a", "b"),
monthly.sales = c(200, 200, 200, 400, 200),
count = c(3, 1, 3, 3, 2)
)
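However, my actual dataset is very large, and this loop crashes when I run it in R.
Is there a more efficient way to achieve the same result? Perhaps something with dplyr?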
Answer 0: (score: 5)
You can do this with SQL.
library(sqldf)
# Assumes the periods were removed from the column names first (ymd, startdate, enddate, monthlysales)
sqldf("select s.ymd, p.program, s.monthlysales, count(*)
       from promotions p left outer join sales s on p.program = s.program
       where s.ymd between p.startdate and p.enddate and p.program = s.program
       group by s.ymd, s.program")
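This first joins the two datasets where ymd in sales falls between the start and end dates of a promotion and the program is the same in both. It then groups by ymd and program and counts the instances. I have removed the periods from the variable names.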
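For readers without sqldf, the same join, filter, and count steps can be sketched in base R on the original column names. This is only an illustration of the logic described above, not the answerer's code; the names joined, hits, counts, and n are made up here. Note that it materialises every sale/promotion pair within a program, so it may be heavy on very large data.

```r
promotions <- data.frame(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))
sales <- data.frame(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))

# Step 1: pair every sale with every promotion of the same program
joined <- merge(sales, promotions, by = "program")

# Step 2: keep only pairs where the sale date lies inside the promotion window
hits <- joined[joined$year.month.day >= joined$start.date &
               joined$year.month.day <= joined$end.date, ]

# Step 3: count the surviving pairs per sale (hypothetical helper column n)
hits$n <- 1L
counts <- aggregate(n ~ year.month.day + program + monthly.sales,
                    data = hits, FUN = sum)
counts
```

As with the SQL query, sales rows that fall inside no promotion window drop out of the result instead of appearing with a count of zero.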
Answer 1: (score: 5)
Using the non-equi joins newly implemented in the current development version of data.table:
require(data.table) # v1.9.7+
setDT(promotions) # convert to data.table by reference
setDT(sales)
ans = promotions[sales, .(monthly.sales, .N), by=.EACHI, allow.cartesian=TRUE,
on=.(program, start.date<=year.month.day, end.date>=year.month.day), nomatch=0L]
ans[, end.date := NULL]
setnames(ans, "start.date", "year.month.date")
#    program year.month.date monthly.sales N
# 1:       a      2013-02-01           200 3
# 2:       b      2014-09-01           200 1
# 3:       a      2013-08-01           200 3
# 4:       a      2013-04-01           400 3
# 5:       b      2012-11-01           200 2
See the installation instructions for the development version here.
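In the join result the join columns keep the names from promotions (start.date, end.date) but hold the values of year.month.day from sales, which is why the answer drops one and renames the other. If the goal is a count column on sales itself, as in the question, one possible sketch is to join the counts back by program and date. The names cnt and count are chosen here for illustration, and the code assumes a data.table version that supports non-equi joins.

```r
library(data.table)

promotions <- data.table(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))
sales <- data.table(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))

# Same non-equi join as in the answer, keeping only the per-row match count
cnt <- promotions[sales, .N, by = .EACHI, allow.cartesian = TRUE,
                  on = .(program, start.date <= year.month.day, end.date >= year.month.day),
                  nomatch = 0L]

# cnt$start.date now holds the sale date, so use it to update sales by reference
sales[cnt, count := i.N, on = .(program, year.month.day = start.date)]
sales
```

Sales rows without any matching promotion are not present in cnt, so their count stays NA after the update.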
Answer 2: (score: 3)
You could try ?data.table::foverlaps
library(data.table)
setDT(sales)[, c("start.date", "end.date") := year.month.day] # Add overlap cols
setkey(sales, program, start.date, end.date) # Key for join
res <- foverlaps(setDT(promotions), sales)[, .N, by = year.month.day] # Count joins
sales[res, count := i.N, on = "year.month.day"] # Update `sales` with results
sales
#    year.month.day program monthly.sales start.date   end.date count
# 1:     2013-02-01       a           200 2013-02-01 2013-02-01     3
# 2:     2013-04-01       a           400 2013-04-01 2013-04-01     3
# 3:     2013-08-01       a           200 2013-08-01 2013-08-01     3
# 4:     2012-11-01       b           200 2012-11-01 2012-11-01     2
# 5:     2014-09-01       b           200 2014-09-01 2014-09-01     1
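This basically creates interval columns in sales, joins by program + overlap, counts the overlaps, and then joins the counts back onto sales. If the extra columns really bother you, you can remove them with sales[, c("start.date", "end.date") := NULL]. Google foverlaps and data.table for more examples.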
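One thing to watch in the code above: the counts are grouped and joined back by year.month.day alone, which is fine for the sample data because every month appears under a single program. If two programs can have sales in the same month, a variant that also carries program through the count and the final join may be safer. The following is a sketch of that variant, not the answerer's code.

```r
library(data.table)

promotions <- data.table(
  start.date = as.Date(c("2012-01-01", "2012-06-14", "2012-02-01", "2012-03-31", "2012-07-13")),
  end.date = as.Date(c("2014-04-05", "2014-11-13", "2014-02-25", "2014-08-02", "2014-09-30")),
  program = c("a", "a", "a", "b", "b"))
sales <- data.table(
  year.month.day = as.Date(c("2013-02-01", "2014-09-01", "2013-08-01", "2013-04-01", "2012-11-01")),
  program = c("a", "b", "a", "a", "b"),
  monthly.sales = c(200, 200, 200, 400, 200))

# Treat each sale date as a zero-width interval so it can be overlap-joined
sales[, `:=`(start.date = year.month.day, end.date = year.month.day)]
setkey(sales, program, start.date, end.date)

# Overlap join within program, then count per program AND month
res <- foverlaps(promotions, sales)[, .N, by = .(program, year.month.day)]

# Join the counts back using both columns
sales[res, count := i.N, on = .(program, year.month.day)]
sales
```

Because setkey sorts sales by program and date, the rows come back in key order, as in the output shown above.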
Answer 3: (score: 3)
I'm a fan of Hadley's packages:
library(dplyr)
library(lubridate)
Floor the dates to the first of the month so that they are in the same format as the sales data frame:
df <- promotions %>%
mutate(start.date = floor_date(start.date, unit = "month"),
end.date = floor_date(end.date, unit = "month"))
Expand the date intervals:
df$output <- mapply(function(x, y) seq(x, y, by = "month"),
                    df$start.date,
                    df$end.date,
                    SIMPLIFY = FALSE)  # always return a list, even if all ranges have equal length
Expand the data frame by the date ranges, group and count, and merge onto the sales dates and programs:
df %>% tidyr::unnest(output) %>%
group_by(output, program) %>%
summarise(prom_num = n()) %>%
merge(sales, .,
by.x = c("year.month.day", "program"),
by.y = c("output", "program"))
Output:
  year.month.day program monthly.sales prom_num
1     2012-11-01       b           200        2
2     2013-02-01       a           200        3
3     2013-04-01       a           400        3
4     2013-08-01       a           200        3
5     2014-09-01       b           200        1