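Below is the sales data available for calculating max_price. The logic for the max price is:
Max(last 3 weeks price)
For the first 3 weeks, where no earlier weeks of data exist, the max price should be
max of (week 1, week 2, week 3)
In the example below, that is the max of (rank 5, 6, 7).
How can I implement this in Spark using window functions?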
Answer 0 (score: 1)
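Here is a solution using a PySpark Window (lead/udf).
Note that I changed the prices for ranks 5, 6, 7 to 1, 2, 3 so that they stand apart from the other values for the purposes of explanation. The logic is the one you described.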
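from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import array, coalesce, col, lead, udf
from pyspark.sql.types import IntegerType
spark = SparkSession.builder.getOrCreate()
# UDF that returns the largest price in the array column
max_price_udf = udf(lambda prices_list: max(prices_list), IntegerType())
df = spark.createDataFrame([(1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)], ["product_id", "week", "year","rank","price"])
# Most recent week first; there is no partitionBy, so all rows share one window partition
window = Window.orderBy(col("year").desc(),col("week").desc())
# For x = 1..3, take the price x rows ahead; if that row does not exist,
# fall back to offset x-3 (two rows back, one row back, then the current row)
df = df.withColumn("prices_list", array([coalesce(lead(col("price"),x, None).over(window),lead(col("price"),x-3, None).over(window)) for x in range(1, 4)]))
df = df.withColumn("max_price",max_price_udf(col("prices_list")))
df.show()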
Result
+----------+----+----+----+-----+------------+---------+
|product_id|week|year|rank|price| prices_list|max_price|
+----------+----+----+----+-----+------------+---------+
|         1|   5|2019|   1|   20|[18, 21, 20]|       21|
|         2|   4|2019|   2|   18| [21, 20, 1]|       21|
|         3|   3|2019|   3|   21|  [20, 1, 2]|       20|
|         4|   2|2019|   4|   20|   [1, 2, 3]|        3|
|         5|   1|2019|   5|    1|   [2, 3, 1]|        3|
|         6|  52|2018|   6|    2|   [3, 1, 2]|        3|
|         7|  51|2018|   7|    3|   [1, 2, 3]|        3|
+----------+----+----+----+-----+------------+---------+
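The lead/coalesce fallback can be hard to follow from the one-liner, so here is a minimal plain-Python sketch of the same idea, with no Spark involved. It assumes the rows are already sorted by year and week descending, as in the sample data. The helper names `lead` and `prices_list` are only illustrative stand-ins for the Spark expressions, not Spark APIs.

```python
# Plain-Python model of the lead/coalesce window; no Spark required.
# Prices are listed in window order (year desc, week desc), i.e. rank 1..7.
prices = [20, 18, 21, 20, 1, 2, 3]

def lead(values, i, offset):
    """Value `offset` rows after index i, or None when that row does not exist."""
    j = i + offset
    return values[j] if 0 <= j < len(values) else None

def prices_list(values, i, width=3):
    """For x in 1..width take lead(x); if it is missing, fall back to lead(x - width)."""
    out = []
    for x in range(1, width + 1):
        v = lead(values, i, x)
        if v is None:
            # Negative or zero offsets reach back to earlier rows or the current row
            v = lead(values, i, x - width)
        out.append(v)
    return out

windows = [prices_list(prices, i) for i in range(len(prices))]
max_prices = [max(w) for w in windows]

for rank, (w, m) in enumerate(zip(windows, max_prices), start=1):
    print(rank, w, m)
```

For the seven sample rows this yields the same prices_list and max_price values as the table above. With fewer than three rows in total a fallback could itself be missing, which this sketch does not handle.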
Here is the solution in Scala:
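import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, greatest, lead}
import spark.implicits._ // needed for toDF and the $"..." syntax outside spark-shell
var df = Seq((1, 5, 2019, 1, 20), (2, 4, 2019, 2, 18),
(3, 3, 2019, 3, 21), (4, 2, 2019, 4, 20),
(5, 1, 2019, 5, 1), (6, 52, 2018, 6, 2),
(7, 51, 2018, 7, 3)).toDF("product_id", "week", "year", "rank", "price")
val window = Window.orderBy($"year".desc, $"week".desc)
// greatest() over the three candidate prices replaces the array + UDF step used in the PySpark version
df = df.withColumn("max_price", greatest((for (x <- 1 to 3) yield coalesce(lead(col("price"), x, null).over(window), lead(col("price"), x - 3, null).over(window))):_*))
df.show()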
Answer 1 (score: 0)
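You can combine SQL window functions with greatest(). When the window frame contains fewer than 3 rows, you have to take the current row, and even the preceding rows, into account. So you need to compute lag1_price and lag2_price in the inner subquery. In the outer query, branch on the count_row value (2, 1, or 0) and pass the current price, together with lag1_price, lag2_price, and the windowed maximum as needed, to greatest().
Check this out: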
import spark.implicits._ // needed for toDF outside spark-shell
val df = Seq((1, 5, 2019,1,20),(2, 4, 2019,2,18),
(3, 3, 2019,3,21),(4, 2, 2019,4,20),
(5, 1, 2019,5,1),(6, 52, 2018,6,2),
(7, 51, 2018,7,3)).toDF("product_id", "week", "year","rank","price")
df.createOrReplaceTempView("sales")
// Inner query: count the rows in the "next 3" frame and collect the fallback values.
// lag2_price uses sum() over a one-row frame (2 preceding), which gives the same value as lag(price, 2).
val df2 = spark.sql("""
select product_id, week, year, price,
count(*) over(order by year desc, week desc rows between 1 following and 3 following ) as count_row,
lag(price) over(order by year desc, week desc ) as lag1_price,
sum(price) over(order by year desc, week desc rows between 2 preceding and 2 preceding ) as lag2_price,
max(price) over(order by year desc, week desc rows between 1 following and 3 following ) as max_price1 from sales
""")
df2.show(false)
df2.createOrReplaceTempView("sales_inner")
// Outer query: when fewer than 3 rows follow, widen the comparison to the current and previous rows
spark.sql("""
select product_id, week, year, price,
case
when count_row=2 then greatest(price,max_price1)
when count_row=1 then greatest(price,lag1_price,max_price1)
when count_row=0 then greatest(price,lag1_price,lag2_price)
else max_price1
end as max_price
from sales_inner
""").show(false)
Result:
+----------+----+----+-----+---------+----------+----------+----------+
|product_id|week|year|price|count_row|lag1_price|lag2_price|max_price1|
+----------+----+----+-----+---------+----------+----------+----------+
|1         |5   |2019|20   |3        |null      |null      |21        |
|2         |4   |2019|18   |3        |20        |null      |21        |
|3         |3   |2019|21   |3        |18        |20        |20        |
|4         |2   |2019|20   |3        |21        |18        |3         |
|5         |1   |2019|1    |2        |20        |21        |3         |
|6         |52  |2018|2    |1        |1         |20        |3         |
|7         |51  |2018|3    |0        |2         |1         |null      |
+----------+----+----+-----+---------+----------+----------+----------+
+----------+----+----+-----+---------+
|product_id|week|year|price|max_price|
+----------+----+----+-----+---------+
|1         |5   |2019|20   |21       |
|2         |4   |2019|18   |21       |
|3         |3   |2019|21   |20       |
|4         |2   |2019|20   |3        |
|5         |1   |2019|1    |3        |
|6         |52  |2018|2    |3        |
|7         |51  |2018|3    |3        |
+----------+----+----+-----+---------+
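To make the case logic easier to check by hand, here is a small plain-Python sketch of the same branching, without Spark. It assumes the prices are already in window order (year desc, week desc). The function name `max_price_case` is purely illustrative, and filtering out `None` is meant to mirror how greatest() skips nulls in Spark SQL.

```python
# Plain-Python model of the count_row / greatest() case expression; no Spark required.
prices = [20, 18, 21, 20, 1, 2, 3]  # window order: rank 1..7

def max_price_case(values, i):
    following = values[i + 1:i + 4]            # rows between 1 following and 3 following
    count_row = len(following)                 # count(*) over that frame
    lag1 = values[i - 1] if i >= 1 else None   # lag1_price
    lag2 = values[i - 2] if i >= 2 else None   # lag2_price
    if count_row == 3:
        candidates = following                 # else branch: max_price1 only
    elif count_row == 2:
        candidates = [values[i]] + following
    elif count_row == 1:
        candidates = [values[i], lag1] + following
    else:
        candidates = [values[i], lag1, lag2]
    # Ignore missing values, as greatest() does with nulls
    return max(v for v in candidates if v is not None)

result = [max_price_case(prices, i) for i in range(len(prices))]
print(result)
```

For the sample data this gives the same max_price column as the final table above.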