I have an orders table, and I can summarize the cumulative daily order amount like this:
date amount
2017/5/1 1000
2017/5/5 2000
But what I actually want is:
date amount
2017/5/1 1000
2017/5/2 1000
2017/5/3 1000
2017/5/4 1000
2017/5/5 2000
There were no orders between 2017-05-02 and 2017-05-04, so the amount stays at 1000. How can I do this?
Answer 0 (score: 3)
The snippet below should do the trick. Here we perform a left join between our dataset and another DataFrame that simply lists every date between our start and end dates.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}   // already imported automatically in spark-shell
// Daily order amounts (only the days that actually had orders)
val df1 = Seq(("2017/5/1", 1000), ("2017/5/5", 1000)).toDF("day", "value")
// Calendar of every date we want to see in the result
val df2 = Seq("2017/5/1", "2017/5/2", "2017/5/3", "2017/5/4", "2017/5/5").toDF("date")
val result = df2
  .join(df1, df1("day") === df2("date"), "left_outer")                 // keep every calendar date
  .withColumn("value", when($"value".isNull, 0).otherwise($"value"))   // days with no orders contribute 0
  .select("date", "value")
  .withColumn("value", sum($"value").over(Window.orderBy($"date")))    // running total over dates
result.show()
{"level": "WARN ", "timestamp": "2017-05-22 05:01:28,693", "classname": "org.apache.spark.sql.execution.WindowExec", "body": "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation."}
+--------+-----+
| date|value|
+--------+-----+
|2017/5/1| 1000|
|2017/5/2| 1000|
|2017/5/3| 1000|
|2017/5/4| 1000|
|2017/5/5| 2000|
+--------+-----+
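If the calendar spans more than a few days, hard-coding df2 quickly becomes impractical. Below is a minimal sketch of building the same single-column date DataFrame from a start and end date on the driver; it assumes the same "yyyy/M/d" string format used above and that spark.implicits._ is in scope for toDF (it is in spark-shell).

import java.time.LocalDate
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy/M/d")
val start = LocalDate.parse("2017/5/1", fmt)
val end   = LocalDate.parse("2017/5/5", fmt)

// Enumerate every date from start to end (inclusive) and format each one
// back to the same string representation used by df1
val allDates = Iterator.iterate(start)(_.plusDays(1))
  .takeWhile(!_.isAfter(end))
  .map(_.format(fmt))
  .toSeq

val df2 = allDates.toDF("date")   // drop-in replacement for the hard-coded df2 above

The rest of the answer (the left join, the null handling, and the running sum over Window.orderBy) stays exactly the same.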