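I am trying to find the historical sales peak of items over consecutive years. My problem is that some items were sold in the past and have since been discontinued, but they still need to be part of the analysis. I have looked at some for loops in R, but I am not sure how to sum over consecutive years and compare that sum with the other local maxima in the same data set. For example: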
Year  Item         Sales
2001  Trash Can    100
2002  Trash Can    125
2003  Trash Can    90
2004  Trash Can    97
2002  Red Balloon  23
2003  Red Balloon  309
2004  Red Balloon  67
2005  Red Balloon  8
1998  Blue Bottle  600
1999  Blue Bottle  565
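Based on the data above, if I want the 2-year sales peak, the output should be Blue Bottle 1165 (sum of 1998 and 1999), Red Balloon 376 (sum of 2003 and 2004) and Trash Can 225 (sum of 2001 and 2002). If I want a 3-year peak, however, Blue Bottle would not qualify, because it only has 2 years of data.
For the 3-year sales peak, the output should be Red Balloon 399 (sum of 2002 through 2004) and Trash Can 315 (sum of 2001 through 2003).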
Answer 0 (score: 0)
In SQL, you can use window functions. For the qualifying two-year sales:
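-- Assumes one row per item per year with no gaps: the frame counts rows, not calendar years.
select item, year, two_year_sales
from (select t.*,
             sum(sales) over (partition by item order by year
                              rows between 1 preceding and current row) as two_year_sales,
             row_number() over (partition by item order by year) as seqnum
      from t
     ) t
where seqnum >= 2;  -- drop each item's first row, whose frame holds only one year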
And to get the peak:
select t.*
from (select item, two_year_sales, year,
             max(two_year_sales) over (partition by item) as max_two_year_sales
      from (select t.*,
                   sum(sales) over (partition by item order by year
                                    rows between 1 preceding and current row) as two_year_sales,
                   row_number() over (partition by item order by year) as seqnum
            from t
           ) t
      where seqnum >= 2
     ) t
where two_year_sales = max_two_year_sales;
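To see what the window frame computes, here is a minimal R sketch that emulates `rows between 1 preceding and current row` for the Red Balloon rows from the question. The variable names only mirror the SQL column names for illustration; this is not how a database evaluates the query.

```r
# Red Balloon sales for 2002-2005, already sorted by year (values from the question)
sales <- c(23, 309, 67, 8)

# Frame "1 preceding and current row": the first row has no preceding row,
# so its sum is just its own value.
two_year_sales <- sales + c(0, head(sales, -1))
seqnum <- seq_along(sales)          # plays the role of row_number()

two_year_sales                      # 23 332 376 75
qualifying <- two_year_sales[seqnum >= 2]
qualifying                          # 332 376 75
max(qualifying)                     # 376
```

The first value (23) covers a single year only, which is why the queries filter on `seqnum >= 2` before taking the maximum.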
Answer 1 (score: 0)
A solution in R using the tidyverse and RcppRoll:
# Loading the packages and your data as a `tibble`
library("RcppRoll")
library("dplyr")
tbl <- tribble(
  ~Year, ~Item, ~Sales,
  2001, "Trash Can", 100,
  2002, "Trash Can", 125,
  2003, "Trash Can", 90,
  2004, "Trash Can", 97,
  2002, "Red Balloon", 23,
  2003, "Red Balloon", 309,
  2004, "Red Balloon", 67,
  2005, "Red Balloon", 8,
  1998, "Blue Bottle", 600,
  1999, "Blue Bottle", 565
)
# Set the number of consecutive years
n <- 2
# Compute the rolling sums (assumes data to be sorted) and take max
res <- tbl %>%
  group_by(Item) %>%
  mutate(rollingsum = roll_sumr(Sales, n)) %>%
  summarize(best_sum = max(rollingsum, na.rm = TRUE))
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle 1165
#2 Red Balloon 376
#3 Trash Can 225
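Setting n <- 3 produces a different res (Blue Bottle has only two years of data, so its rolling sums are all NA and max(..., na.rm = TRUE) returns -Inf with a warning):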
print(res)
## A tibble: 3 x 2
# Item best_sum
# <chr> <dbl>
#1 Blue Bottle -Inf
#2 Red Balloon 399
#3 Trash Can 315
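The -Inf for Blue Bottle above marks an item with too little history. As a minimal base-R sketch (not part of the answer above), the same peak can be computed while dropping such items, sorting by year, and ignoring windows that span a gap in years. The helper names `peak_sum` and `peaks` are hypothetical, and the sketch assumes at most one row per item per year.

```r
# Data from the question
dat <- data.frame(
  Year  = c(2001, 2002, 2003, 2004, 2002, 2003, 2004, 2005, 1998, 1999),
  Item  = c(rep("Trash Can", 4), rep("Red Balloon", 4), rep("Blue Bottle", 2)),
  Sales = c(100, 125, 90, 97, 23, 309, 67, 8, 600, 565)
)

# Hypothetical helper: best sum over n consecutive calendar years for one item
peak_sum <- function(year, sales, n) {
  ord <- order(year)
  year <- year[ord]
  sales <- sales[ord]
  m <- length(sales)
  if (m < n) return(NA_real_)              # not enough history
  cs <- c(0, cumsum(sales))
  sums <- cs[(n + 1):(m + 1)] - cs[1:(m - n + 1)]   # rolling sums of n rows
  span <- year[n:m] - year[1:(m - n + 1)]           # years covered by each window
  sums <- sums[span == n - 1]                       # keep gap-free windows only
  if (length(sums) == 0) NA_real_ else max(sums)
}

# Hypothetical helper: apply peak_sum per item and drop items that do not qualify
peaks <- function(data, n) {
  res <- sapply(split(data, data$Item),
                function(d) peak_sum(d$Year, d$Sales, n))
  res[!is.na(res)]
}

peaks(dat, 2)   # Blue Bottle 1165, Red Balloon 376, Trash Can 225
peaks(dat, 3)   # Red Balloon 399, Trash Can 315
```

Returning NA for items that do not qualify and filtering afterwards avoids both the -Inf value and the warning.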
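Answer 2 (score: 0)
I can only help you with the SQL part: use GROUP BY together with HAVING. HAVING filters out every item that does not have the specified minimum number of years of history.
Check whether this query can be adapted to your requirements.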
select
    item
  , count(*) as num_years
  , sum(Sales) as local_max
from [your_table]
where year between [year_ini] and [year_end]
group by item
having count(*) >= [number_of_years]
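The HAVING step can be illustrated with a small R sketch that counts the rows per item and keeps only items with enough history. The variable names are illustrative, with `number_of_years` standing in for the `[number_of_years]` placeholder. As written, the query appears to sum every year in the chosen range, so finding the best window would still require trying each range.

```r
# Item column from the question's data
item <- c(rep("Trash Can", 4), rep("Red Balloon", 4), rep("Blue Bottle", 2))

number_of_years <- 3                 # stands in for [number_of_years]
num_years <- table(item)             # like count(*) ... group by item

# like: having count(*) >= [number_of_years]
qualifying_items <- names(num_years)[num_years >= number_of_years]
qualifying_items                     # "Red Balloon" "Trash Can"
```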
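Answer 3 (score: 0)
Read the data dat (shown reproducibly in the Note at the end) into a zoo series with one column per Item, then convert it to a ts series tt (which fills in missing years with NA). Then use rollsumr to take the sum of every k consecutive years for each Item, find the maximum for each Item, stack the result into a data frame and omit any NA rows. The function Max is like max(x, na.rm = TRUE) except that if x is all NA it returns NA instead of -Inf and does not issue a warning. stack outputs the item column second, so reverse the columns using 2:1 and add nicer names.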
library(zoo)
Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)
peak <- function(data, k) {
  tt <- as.ts(read.zoo(data, split = "Item"))
  s <- na.omit(stack(apply(rollsumr(tt, k), 2, Max)))
  setNames(s[2:1], c("Item", "Sum"))
}
peak(dat, 2)
## Item Sum
## 1 Blue Bottle 1165
## 2 Red Balloon 376
## 3 Trash Can 225
peak(dat, 3)
## Item Sum
## 2 Red Balloon 399
## 3 Trash Can 315
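The difference between Max and max(x, na.rm = TRUE) can be checked in base R with a few illustrative vectors (the vectors are made up for this sketch; Max is copied from the answer):

```r
Max <- function(x) if (all(is.na(x))) NA else max(x, na.rm = TRUE)

all_na <- c(NA_real_, NA_real_)   # e.g. rolling sums of an item with too few years
some_na <- c(NA, 332, 376, 75)    # e.g. 2-year rolling sums for Red Balloon

# Base max() on an all-NA vector returns -Inf and warns; suppressed here
base_result <- suppressWarnings(max(all_na, na.rm = TRUE))
base_result      # -Inf

Max(all_na)      # NA, no warning
Max(some_na)     # 376
```

Because Max returns NA for such items, na.omit can then drop them from the stacked result.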
Note: the input, in reproducible form, is assumed to be:
dat <-
structure(list(Year = c(2001L, 2002L, 2003L, 2004L, 2002L, 2003L,
2004L, 2005L, 1998L, 1999L), Item = c("Trash Can", "Trash Can",
"Trash Can", "Trash Can", "Red Balloon", "Red Balloon", "Red Balloon",
"Red Balloon", "Blue Bottle", "Blue Bottle"), Sales = c(100L,
125L, 90L, 97L, 23L, 309L, 67L, 8L, 600L, 565L)), row.names = c(NA,
-10L), class = "data.frame")