Using a window-function tie-breaker on another field to pick the top record

Date: 2019-06-10 23:28:17

Tags: sql apache-spark pyspark apache-spark-sql pyspark-sql

I have the following data, which I partition by store and month ID and order by amount to get the top brand for each store.

If two brands tie on the amount, I need a tie-breaker: if one of the tied brands was the top seller in the previous month, make that brand the top seller for the current month.

If there is still a tie, the look-back is extended by another month. A fixed 1-month lag alone won't work: in the worst case the previous month has even more duplicates.
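The tie-break rule above can be sketched in plain Python (no Spark). This is my own illustration, not the asker's code: the helpers `top_brand` and `prev_months` are hypothetical names, and ranking a month lexicographically on (current sales, sales 1 month back, sales 2 months back, ...) is one way to encode "extend the look-back on every tie".

```python
rows = [  # (MTH_ID, store_id, brand, brndSales) — subset of the sample data
    (201801, 10941, 115, 80890.449),
    (201801, 10941, 3, 80890.449),
    (201712, 10941, 3, 517440.745),
    (201712, 10941, 115, 517440.745),
    (201711, 10941, 3, 371501.921),
    (201710, 10941, 115, 552435.578),
]

def sales(month, brand):
    """Sales for a brand in a given month, 0.0 if absent."""
    for m, _, b, s in rows:
        if m == month and b == brand:
            return s
    return 0.0

def prev_months(month, n):
    """Step a YYYYMM key back n months."""
    y, m = divmod(month, 100)
    m -= n
    while m <= 0:
        y, m = y - 1, m + 12
    return y * 100 + m

def top_brand(month, lookback=3):
    """Compare tied brands on current sales, then 1, 2, ... months back."""
    brands = [b for m, _, b, _ in rows if m == month]
    key = lambda b: tuple(sales(prev_months(month, k), b) for k in range(lookback + 1))
    return max(brands, key=key)

print(top_brand(201801))  # → 3
```

For 201801 both brands tie at 80890.449 and again at 201712's 517440.745; the comparison falls through to 201711, where brand 3 sold and brand 115 did not, so brand 3 wins.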

Sample data:

val data = Seq((201801,      10941,            115,  80890.44900, 135799.66400),
               (201801,      10941,            3,  80890.44900, 135799.66400) ,
               (201712,      10941,            3, 517440.74500, 975893.79000),
               (201712,      10941,            115, 517440.74500, 975893.79000),
               (201711,      10941,            3 , 371501.92100, 574223.52300),
               (201710,      10941,            115, 552435.57800, 746912.06700),
               (201709,      10941,            115,1523492.60700,1871480.06800),
               (201708,      10941,            115,1027698.93600,1236544.50900),
               (201707,      10941,            33 ,1469219.86900,1622949.53000)
               ).toDF("MTH_ID", "store_id" ,"brand" ,"brndSales","TotalSales")

Code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales")
val res = data.withColumn("rank", rank over window)

Output:

    +------+--------+-----+-----------+-----------+----+
    |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
    +------+--------+-----+-----------+-----------+----+
    |201801|   10941|  115|  80890.449| 135799.664|   1|
    |201801|   10941|    3|  80890.449| 135799.664|   1|
    |201712|   10941|    3| 517440.745|  975893.79|   1|
    |201712|   10941|  115| 517440.745|  975893.79|   1|
    |201711|   10941|    3| 371501.921| 574223.523|   1|
    |201710|   10941|  115| 552435.578| 746912.067|   1|
    |201709|   10941|  115|1523492.607|1871480.068|   1|
    |201708|   10941|  115|1027698.936|1236544.509|   1|
    |201707|   10941|   33|1469219.869| 1622949.53|   1|
    +------+--------+-----+-----------+-----------+----+

Records 1 and 2 both get rank 1, but based on the previous month's amounts, my second record (brand 3) should be rank 1 and the first should be rank 2.

I expect the following output:

    +------+--------+-----+-----------+-----------+----+
    |MTH_ID|store_id|brand|  brndSales| TotalSales|rank|
    +------+--------+-----+-----------+-----------+----+
    |201801|   10941|  115|  80890.449| 135799.664|   2|
    |201801|   10941|    3|  80890.449| 135799.664|   1|
    |201712|   10941|    3| 517440.745|  975893.79|   1|
    |201712|   10941|  115| 517440.745|  975893.79|   1|
    |201711|   10941|    3| 371501.921| 574223.523|   1|
    |201710|   10941|  115| 552435.578| 746912.067|   1|
    |201709|   10941|  115|1523492.607|1871480.068|   1|
    |201708|   10941|  115|1027698.936|1236544.509|   1|
    |201707|   10941|   33|1469219.869| 1622949.53|   1|
    +------+--------+-----+-----------+-----------+----+

Should I write a UDAF? Any suggestions would be helpful.

2 Answers:

Answer 0 (score: 3)

You can do this with 2 windows. First, you need the lag() function to carry the previous month's sales value onto each row so it can be used in the ranking window. Here is the PySpark part:

from pyspark.sql import Window
from pyspark.sql.functions import lag, rank

lag_window = Window.partitionBy("store_id", "brand").orderBy("MTH_ID")
lag_df = data.withColumn("last_month_sales", lag("brndSales").over(lag_window))

Then edit the ranking window to include that new column:

window = Window.partitionBy("store_id","MTH_ID").orderBy("brndSales", "last_month_sales")
lag_df.withColumn("rank",rank().over(window)).show()
+------+--------+-----+-----------+-----------+----------------+----+
|MTH_ID|store_id|brand|  brndSales| TotalSales|last_month_sales|rank|
+------+--------+-----+-----------+-----------+----------------+----+
|201711|   10941|   99| 371501.921| 574223.523|            null|   1|
|201709|   10941|  115|1523492.607|1871480.068|     1027698.936|   1|
|201707|   10941|   33|1469219.869| 1622949.53|            null|   1|
|201708|   10941|  115|1027698.936|1236544.509|            null|   1|
|201710|   10941|  115| 552435.578| 746912.067|     1523492.607|   1|
|201712|   10941|    3| 517440.745|  975893.79|            null|   1|
|201801|   10941|    3|  80890.449| 135799.664|      517440.745|   1|
|201801|   10941|  115|  80890.449| 135799.664|      552435.578|   2|
+------+--------+-----+-----------+-----------+----------------+----+
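The two-window idea above can be mimicked in plain Python (no Spark) to show how the extra ordering column breaks the tie. This is my own sketch, not part of the answer: the lag is computed per (store_id, brand) ordered by month, then rows within a (store_id, MTH_ID) group are numbered by (brndSales, last_month_sales), with a missing lag treated as smallest, matching how nulls sort first here.

```python
from itertools import groupby

rows = [  # (MTH_ID, store_id, brand, brndSales) — the tied month plus history
    (201710, 10941, 115, 552435.578),
    (201712, 10941, 3, 517440.745),
    (201801, 10941, 3, 80890.449),
    (201801, 10941, 115, 80890.449),
]

# Step 1: lag("brndSales") over Window.partitionBy(store_id, brand).orderBy(MTH_ID)
lagged = []
brand_key = lambda r: (r[1], r[2])
for _, grp in groupby(sorted(rows, key=lambda r: (r[1], r[2], r[0])), key=brand_key):
    prev = None
    for m, s, b, amt in grp:
        lagged.append((m, s, b, amt, prev))
        prev = amt

# Step 2: number rows per (store_id, MTH_ID), ordering by (brndSales, last_month_sales)
month_key = lambda r: (r[1], r[0])
ranks = {}
for _, grp in groupby(sorted(lagged, key=month_key), key=month_key):
    ordered = sorted(grp, key=lambda r: (r[3], r[4] if r[4] is not None else float("-inf")))
    for i, r in enumerate(ordered, start=1):
        ranks[(r[0], r[2])] = i  # (month, brand) -> rank

print(ranks[(201801, 3)], ranks[(201801, 115)])  # → 1 2
```

In 201801 both brands tie at 80890.449, so the comparison falls through to the lag column (517440.745 vs 552435.578) and the tie is broken, just as in the answer's output table.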

Answer 1 (score: 0)

For each row, collect an array of that brand's previous sales as (month, sales) structs.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val storeAndBrandWindow = Window.partitionBy("store_id", "brand").orderBy($"MTH_ID")
val df1 = data.withColumn("brndSales_list", collect_list(struct($"MTH_ID", $"brndSales")).over(storeAndBrandWindow))

Reverse that array with a UDF:

val returnType = ArrayType(StructType(Array(StructField("month", IntegerType), StructField("sales", DoubleType))))
val reverseUdf = udf((list: Seq[Row]) => list.reverse, returnType)
val df2 = df1.withColumn("brndSales_list", reverseUdf($"brndSales_list"))

Then rank, ordering by the array:

val window = Window.partitionBy("store_id", "MTH_ID").orderBy($"brndSales_list".desc)
val df3 = df2.withColumn("rank", rank over window).orderBy("MTH_ID", "brand")
df3.show(false)

Result:

+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|MTH_ID|store_id|brand|brndSales  |TotalSales |brndSales_list                                                                           |rank|
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
|201707|10941   |33   |1469219.869|1622949.53 |[[201707, 1469219.869]]                                                                  |1   |
|201708|10941   |115  |1027698.936|1236544.509|[[201708, 1027698.936]]                                                                  |1   |
|201709|10941   |115  |1523492.607|1871480.068|[[201709, 1523492.607], [201708, 1027698.936]]                                           |1   |
|201710|10941   |115  |552435.578 |746912.067 |[[201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]                     |1   |
|201711|10941   |99   |371501.921 |574223.523 |[[201711, 371501.921]]                                                                   |1   |
|201712|10941   |3    |517440.745 |975893.79  |[[201712, 517440.745]]                                                                   |1   |
|201801|10941   |3    |80890.449  |135799.664 |[[201801, 80890.449], [201712, 517440.745]]                                              |1   |
|201801|10941   |115  |80890.449  |135799.664 |[[201801, 80890.449], [201710, 552435.578], [201709, 1523492.607], [201708, 1027698.936]]|2   |
+------+--------+-----+-----------+-----------+-----------------------------------------------------------------------------------------+----+
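The array ordering above relies on struct arrays comparing element-wise, field by field. A plain-Python sketch of that comparison (my illustration, not the answer's code) — note that because MTH_ID is the first struct field, a tie on the newest (month, sales) pair is broken by which brand has the more *recent* prior entry, before sales amounts are ever compared:

```python
# Reversed histories, newest (MTH_ID, brndSales) first, as in brndSales_list
history = {
    3:   [(201801, 80890.449), (201712, 517440.745)],
    115: [(201801, 80890.449), (201710, 552435.578),
          (201709, 1523492.607), (201708, 1027698.936)],
}

# orderBy($"brndSales_list".desc): Python lists of tuples also compare
# element-wise, so a descending sort mimics the struct-array ordering.
ranked = sorted(history, key=lambda b: history[b], reverse=True)
print(ranked)  # → [3, 115]
```

The first pairs tie, so the second pairs decide: (201712, ...) beats (201710, ...) on the month field, giving brand 3 rank 1, which matches the result table.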