PySpark: create a column with the average over the week one year earlier

Date: 2019-03-22 12:21:22

Tags: python pyspark apache-spark-sql pyspark-sql

My goal is to build a "reference value" column for each distinct (product, store, day) combination.

More precisely: for product 15 in store 1 on 2018-10-10, I want a column returning the quantity of product 15 sold in store 1 on 2017-10-10. But that value can be missing, so when no previous-year value exists I want to impute the new column with the average quantity sold between 2017-10-10 − 7 days and 2017-10-10 + 7 days, and keep widening the window with the same method (up to −1 month / +1 month).

from pyspark.sql import Row

# --> new column previous_year_qty as ...
data = [Row(store=1, product=1, date="2017-01-01", quantity=5, previous_year_qty=None),
        Row(store=1, product=1, date="2016-12-29", quantity=8, previous_year_qty=None),
        Row(store=1, product=1, date="2017-01-03", quantity=12, previous_year_qty=None),
        Row(store=1, product=1, date="2018-01-01", quantity=10, previous_year_qty=5)]

df = sqlContext.createDataFrame(data)

+----------+-----------------+-------+--------+-----+                           
|      date|previous_year_qty|product|quantity|store|
+----------+-----------------+-------+--------+-----+
|2017-01-01|             null|      1|       5|    1|
|2016-12-29|             null|      1|       8|    1|
|2017-01-03|             null|      1|      12|    1|
|2018-01-01|                5|      1|      10|    1|
+----------+-----------------+-------+--------+-----+

"""
--> if previous year is None so : 

previous year qty for the last row should be (8 + 12)/2 = 10
"""

I tried to do it this way:

def ref_window(keys, span):
    """Range window of +/- span (days) around REF_DAY."""
    return (Window.partitionBy(keys)
            .orderBy(F.col("REF_DAY").cast(IntegerType()))
            .rangeBetween(-span, span))

w7 = ref_window(["id_sku", "id_store", "REF_DAY"], 7)
w15 = ref_window(["id_sku", "id_store", "REF_DAY"], 15)
w30 = ref_window(["id_sku", "id_store", "REF_DAY"], 30)
wlarge = ref_window(["id_sku", "id_store", "REF_DAY"], 60)
wsos7 = ref_window(["id_sku", "REF_DAY"], 7)
wsos15 = ref_window(["id_sku", "REF_DAY"], 15)
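One thing worth checking in the window specs above: REF_DAY appears both in partitionBy and in orderBy, so every partition holds exactly one day and rangeBetween(-7, 7) can never see neighbouring days. A plain-Python sketch of the partitioning semantics (toy tuples, illustrative only):

```python
from datetime import date
from itertools import groupby

# (id_sku, id_store, REF_DAY) rows; REF_DAY is part of the partition key
rows = [(1, 1, date(2016, 12, 29)),
        (1, 1, date(2017, 1, 1)),
        (1, 1, date(2017, 1, 3))]

# Window functions first split rows into partitions, and the frame defined
# by rangeBetween only ever spans rows INSIDE one partition.
key = lambda r: (r[0], r[1], r[2])
partitions = [list(g) for _, g in groupby(sorted(rows, key=key), key=key)]
# Every partition holds a single row, so a +/- 7-day frame degenerates to it.
```

Partitioning by (id_sku, id_store) only, and ordering by an explicit day number (e.g. F.datediff against a fixed epoch date, since casting a DateType straight to IntegerType may not give days in every Spark version), would let the ±7-day frame actually gather neighbouring days.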

# if qty_ref is still > 50% null
self_ticket_join = (ticket
    .withColumn("REF_DAY", F.date_sub("dt_ticket", 365))
    .withColumn("prev_qty_7", F.avg("f_qty_recalc").over(w7))
    .withColumn("prev_qty_15", F.avg("f_qty_recalc").over(w15))
    .withColumn("prev_qty_30", F.avg("f_qty_recalc").over(w30))
    .withColumn("prev_qty_large", F.avg("f_qty_recalc").over(wlarge))
    .withColumn("prev_qty_sos7", F.avg("f_qty_recalc").over(wsos7))
    .withColumn("prev_qty_sos15", F.avg("f_qty_recalc").over(wsos15))
    .withColumn("prev_prc_7", F.avg("prc_sku").over(w7))
    .withColumn("prev_prc_15", F.avg("prc_sku").over(w15))
    .withColumn("prev_prc_30", F.avg("prc_sku").over(w30))
    .withColumn("prev_prc_large", F.avg("prc_sku").over(wlarge))
    .withColumn("prev_prc_sos7", F.avg("prc_sku").over(wsos7))
    .withColumn("prev_prc_sos15", F.avg("prc_sku").over(wsos15))
    .select(
        F.col("id_sku").alias("prev_id_sku"),
        F.col("id_store").alias("prev_id_store"),
        "REF_DAY",
        F.col("f_qty_recalc").alias("prev_year_qty"),
        F.col("prc_sku").alias("prev_year_price"),
        "prev_qty_7", "prev_qty_15", "prev_qty_30",
        "prev_qty_sos7", "prev_qty_sos15", "prev_qty_large",
        "prev_prc_7", "prev_prc_15", "prev_prc_30",
        "prev_prc_sos7", "prev_prc_sos15", "prev_prc_large")
    ).cache()

First I compute a reference day as the date minus 365 days, then I compute several moving averages over windows centred on that reference day. I then self-join the same dataframe on the following keys:

  • (id_store = id_store) (store ID)
  • (id_sku = id_sku) (product ID)
  • (REF_DAY = dt_ticket) (date − 365 = date)

    ticket = (ticket
        .join(self_ticket_join,
              [self_ticket_join.prev_id_store == ticket.id_store,
               self_ticket_join.prev_id_sku == ticket.id_sku,
               self_ticket_join.REF_DAY == ticket.dt_ticket],
              how="left")
        # fall through from the exact day to ever-wider averages
        .withColumn("qty_ref", F.coalesce(
            F.col("prev_year_qty"), F.col("prev_qty_7"), F.col("prev_qty_15"),
            F.col("prev_qty_30"), F.col("prev_qty_large"),
            F.col("prev_qty_sos7"), F.col("prev_qty_sos15")))
        .withColumn("price_ref", F.coalesce(
            F.col("prev_year_price"), F.col("prev_prc_7"), F.col("prev_prc_15"),
            F.col("prev_prc_30"), F.col("prev_prc_large"),
            F.col("prev_prc_sos7"), F.col("prev_prc_sos15")))
        .drop("prev_prc_7", "prev_prc_15", "prev_prc_30", "prev_prc_sos7",
              "prev_prc_sos15", "prev_qty_7", "prev_qty_15", "prev_qty_30",
              "prev_qty_sos7", "prev_qty_sos15")
        ).cache()
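The coalesce chains above fall through from the narrowest estimate to the widest, taking the first non-null value. The tiny helper below mimics that semantics in plain Python (the function name is made up for illustration):

```python
def coalesce(*vals):
    """First non-None value, like F.coalesce over the fallback columns."""
    return next((v for v in vals if v is not None), None)

# prev_year_qty missing, 7-day average present -> use the 7-day average
qty_ref = coalesce(None, 10.0, 9.5, 9.0, None)  # -> 10.0
```

This is also why keeping the chain ordered narrow-to-wide matters: listing a wide average before a narrow one would silently prefer the coarser estimate.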
    

I tried to achieve this with a self-join on the same dataframe, but when the reference day does not exist for a given product/store combination, the previous-year value is None and every value computed by the window functions is None as well.

So I am looking for a way to compute the "reference column" that does not lose the window information even when the reference day itself is absent.

0 Answers:

There are no answers yet.