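My goal is to compute a "reference value" column for each (product, store, day) combination.
More precisely, for product 15 in store 1 on 2018-10-10, I want a column that returns the quantity sold of product 15 in store 1 on 2017-10-10. That value can be missing, so if there is no value for the same day one year earlier, I want to fill the new column with the average quantity sold between 2017-10-10 minus 7 days and 2017-10-10 plus 7 days, and then keep widening the window in the same way (up to minus 1 month / plus 1 month).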
from pyspark.sql import Row

# --> new columns as ...
data = [Row(store=1, product=1, date="2017-01-01", quantity=5, previous_year_qty=None),
        Row(store=1, product=1, date="2016-12-29", quantity=8, previous_year_qty=None),
        Row(store=1, product=1, date="2017-01-03", quantity=12, previous_year_qty=None),
        Row(store=1, product=1, date="2018-01-01", quantity=10, previous_year_qty=5)
        ]
df = sqlContext.createDataFrame(data)
+----------+-----------------+-------+--------+-----+
|      date|previous_year_qty|product|quantity|store|
+----------+-----------------+-------+--------+-----+
|2017-01-01|             null|      1|       5|    1|
|2016-12-29|             null|      1|       8|    1|
|2017-01-03|             null|      1|      12|    1|
|2018-01-01|                5|      1|      10|    1|
+----------+-----------------+-------+--------+-----+
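The fallback rule described above can be written down as a small plain-Python reference, which may help when checking a Spark implementation against it. This is only a sketch of the intended semantics, not PySpark code: the `history` dict, the `reference_qty` helper and the window half-widths (7, 15, 30) are illustrative assumptions, and "one year earlier" is taken as 365 days, as in my attempt below.

```python
from datetime import date, timedelta

# Hypothetical in-memory history: (store, product, day) -> quantity sold.
history = {
    (1, 1, date(2016, 12, 29)): 8,
    (1, 1, date(2017, 1, 1)): 5,
    (1, 1, date(2017, 1, 3)): 12,
    (1, 1, date(2018, 1, 1)): 10,
}


def reference_qty(history, store, product, day, half_widths=(7, 15, 30)):
    """Quantity on day - 365, else the mean over progressively wider windows."""
    ref_day = day - timedelta(days=365)
    exact = history.get((store, product, ref_day))
    if exact is not None:
        return exact
    for half_width in half_widths:
        # All sales of the same store/product within +/- half_width days.
        nearby = [
            qty
            for (s, p, d), qty in history.items()
            if s == store and p == product
            and abs((d - ref_day).days) <= half_width
        ]
        if nearby:
            return sum(nearby) / len(nearby)
    return None  # nothing found even in the widest window


# Exact match: 2017-01-01 exists, so its quantity is used directly.
print(reference_qty(history, 1, 1, date(2018, 1, 1)))

# Drop the exact reference day to exercise the +/- 7 day fallback.
without_exact = {k: v for k, v in history.items() if k[2] != date(2017, 1, 1)}
print(reference_qty(without_exact, 1, 1, date(2018, 1, 1)))
```

With the sample rows, the first call uses the exact match and the second averages the two neighbouring days, 2016-12-29 and 2017-01-03.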
"""
--> if the previous-year value for the last row were None,
its previous_year_qty should be (8 + 12) / 2 = 10
"""
Here is what I tried:
from pyspark.sql import Window, functions as F
from pyspark.sql.types import IntegerType

# w*: partitioned by SKU, store and reference day; wsos*: by SKU and reference day only
w7 = (Window.partitionBy(["id_sku", "id_store", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-7, 7))
w15 = (Window.partitionBy(["id_sku", "id_store", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-15, 15))
w30 = (Window.partitionBy(["id_sku", "id_store", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-30, 30))
wlarge = (Window.partitionBy(["id_sku", "id_store", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-60, 60))
wsos7 = (Window.partitionBy(["id_sku", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-7, 7))
wsos15 = (Window.partitionBy(["id_sku", "REF_DAY"]).orderBy(F.col("REF_DAY").cast(IntegerType())).rangeBetween(-15, 15))
#if qty ref is still > 50% null
self_ticket_join = (ticket
    .withColumn("REF_DAY", F.date_sub("dt_ticket", 365))
    .withColumn("prev_qty_7", F.avg("f_qty_recalc").over(w7))
    .withColumn("prev_qty_15", F.avg("f_qty_recalc").over(w15))
    .withColumn("prev_qty_30", F.avg("f_qty_recalc").over(w30))
    .withColumn("prev_qty_large", F.avg("f_qty_recalc").over(wlarge))
    .withColumn("prev_qty_sos7", F.avg("f_qty_recalc").over(wsos7))
    .withColumn("prev_qty_sos15", F.avg("f_qty_recalc").over(wsos15))
    .withColumn("prev_prc_7", F.avg("prc_sku").over(w7))
    .withColumn("prev_prc_15", F.avg("prc_sku").over(w15))
    .withColumn("prev_prc_30", F.avg("prc_sku").over(w30))
    .withColumn("prev_prc_large", F.avg("prc_sku").over(wlarge))
    .withColumn("prev_prc_sos7", F.avg("prc_sku").over(wsos7))
    .withColumn("prev_prc_sos15", F.avg("prc_sku").over(wsos15))
    .select(
        F.col('id_sku').alias("prev_id_sku"),
        F.col('id_store').alias("prev_id_store"),
        F.col('REF_DAY').alias("REF_DAY"),
        F.col("f_qty_recalc").alias("prev_year_qty"),
        F.col("prc_sku").alias("prev_year_price"),
        F.col("prev_qty_7").alias("prev_qty_7"),
        F.col("prev_qty_30").alias("prev_qty_30"),
        F.col("prev_qty_15").alias("prev_qty_15"),
        F.col("prev_qty_sos7").alias("prev_qty_sos7"),
        F.col("prev_qty_sos15").alias("prev_qty_sos15"),
        F.col("prev_prc_7").alias("prev_prc_7"),
        F.col("prev_prc_15").alias("prev_prc_15"),
        F.col("prev_prc_30").alias("prev_prc_30"),
        F.col("prev_prc_sos7").alias("prev_prc_sos7"),
        F.col("prev_prc_sos15").alias("prev_prc_sos15"),
        F.col("prev_qty_large").alias("prev_qty_large"),
        F.col("prev_prc_large").alias("prev_prc_large"))
).cache()
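First, I compute a reference day as the date minus 365 days, then I compute several moving averages over windows around that reference day. Then I self-join the same DataFrame on the following key:
(REF_DAY = dt_ticket), i.e. (date - 365 = date)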
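The reference-day key can be checked in plain Python. The sketch below mirrors a fixed 365-day subtraction with a hypothetical helper, `ref_day_fixed_offset`, which is not part of my code. A fixed offset lands on the same calendar date only when no 29 February falls in between, so the join key may be shifted by one day around leap years.

```python
from datetime import date, timedelta


def ref_day_fixed_offset(day):
    """Mirror of a fixed 365-day subtraction: always exactly 365 days earlier."""
    return day - timedelta(days=365)


# No 29 February in between: same calendar date one year earlier.
print(ref_day_fixed_offset(date(2018, 10, 10)))

# 2016 is a leap year, so 365 days before 2017-01-01 is 2016-01-02.
print(ref_day_fixed_offset(date(2017, 1, 1)))
```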
ticket = (ticket
    .join(self_ticket_join,
          ([self_ticket_join.prev_id_store == ticket.id_store,
            self_ticket_join.prev_id_sku == ticket.id_sku,
            self_ticket_join.REF_DAY == ticket.dt_ticket]), how="left")
    # fallback order: exact day, then the 7, 15, 30 and large windows, then the wsos windows
    .withColumn("qty_ref", F.coalesce(F.col("prev_year_qty"), F.col("prev_qty_7"), F.col("prev_qty_15"), F.col("prev_qty_30"), F.col("prev_qty_large"),
                                      F.col("prev_qty_sos7"), F.col("prev_qty_sos15")))
    .withColumn("price_ref", F.coalesce(F.col("prev_year_price"), F.col("prev_prc_7"), F.col("prev_prc_15"), F.col("prev_prc_30"),
                                        F.col("prev_prc_large"), F.col("prev_prc_sos7"), F.col("prev_prc_sos15")))
    .drop("prev_prc_7", "prev_prc_15", "prev_prc_30", "prev_prc_sos7", "prev_prc_sos15",
          "prev_qty_7", "prev_qty_15", "prev_qty_30", "prev_qty_sos7", "prev_qty_sos15")
).cache()
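I tried to do this with a self-join on the same DataFrame, but if no row exists for the reference day of a given product/store combination, the join finds nothing: the previous-year value is null, and every windowed column computed for that day is null as well.
So I am looking for a way to compute the reference column without losing the window information, even when the reference day itself does not exist.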
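To make the problem concrete, here is a plain-Python sketch (not PySpark) that reproduces it and shows one direction I am considering: adding a placeholder row with a null quantity for every reference day before computing the windowed averages, so that the exact-key join always finds a row that carries them. The names `sales`, `window_avg` and `precompute` are hypothetical, and a single +/- 7 day window stands in for the whole set of windows. Whether this translates efficiently to Spark, for example through a union with the missing keys, is an assumption I have not verified.

```python
from datetime import date, timedelta

# Hypothetical history for one (store, product): day -> quantity.
# There is no row for 2017-01-01, the reference day of 2018-01-01.
sales = {
    date(2016, 12, 29): 8,
    date(2017, 1, 3): 12,
    date(2018, 1, 1): 10,
}


def window_avg(rows, center, half_width):
    """Mean of non-null quantities within +/- half_width days of center."""
    values = [q for d, q in rows.items()
              if q is not None and abs((d - center).days) <= half_width]
    return sum(values) / len(values) if values else None


def precompute(rows, half_width=7):
    """Windowed average attached to each existing row, keyed by its day."""
    return {d: window_avg(rows, d, half_width) for d in rows}


ref_day = date(2018, 1, 1) - timedelta(days=365)

# Exact-key lookup on existing rows only: nothing to join on.
print(precompute(sales).get(ref_day))  # None: no row for 2017-01-01

# Add a placeholder row (null quantity) for each reference day first.
padded = dict(sales)
for d in sales:
    padded.setdefault(d - timedelta(days=365), None)

# The placeholder row now carries the window average of its neighbours.
print(precompute(padded).get(ref_day))  # mean of 8 and 12
```

The placeholder rows do not distort the averages here because null quantities are skipped, which matches how an average aggregate ignores nulls.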