How to fill in empty dates in Spark?

Time: 2017-05-22 04:10:02

Tags: sql apache-spark hive

I have an orders table, and I can summarize the daily cumulative order amount like this:

date        amount
2017/5/1    1000
2017/5/5    2000

But what I want is:

date        amount
2017/5/1    1000
2017/5/2    1000
2017/5/3    1000
2017/5/4    1000
2017/5/5    2000

There are no orders between 2017-05-02 and 2017-05-04, so the amount stays at 1000. How can I do this?

1 Answer:

Answer 0 (score: 3)

The snippet below should do the trick. Here we perform a left outer join between our dataset and another dataframe that simply lists every date between our start and end dates.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}
import spark.implicits._ // for toDF and the $"..." column syntax

// Daily order amounts: 1000 on 2017/5/1 and another 1000 on 2017/5/5
val df1 = Seq(("2017/5/1", 1000), ("2017/5/5", 1000)).toDF("day", "value")

// Calendar dataframe listing every date between the start and end dates
val df2 = Seq("2017/5/1", "2017/5/2", "2017/5/3", "2017/5/4", "2017/5/5").toDF("date")

val result = df2
      .join(df1, df1("day") === df2("date"), "left_outer")               // keep every calendar date
      .withColumn("value", when($"value".isNull, 0).otherwise($"value")) // days with no orders contribute 0
      .select("date", "value")
      .withColumn("value", sum($"value").over(Window.orderBy($"date")))  // running total fills the gaps

result.show()
{"level": "WARN ", "timestamp": "2017-05-22 05:01:28,693", "classname": "org.apache.spark.sql.execution.WindowExec", "body": "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation."}
+--------+-----+
|    date|value|
+--------+-----+
|2017/5/1| 1000|
|2017/5/2| 1000|
|2017/5/3| 1000|
|2017/5/4| 1000|
|2017/5/5| 2000|
+--------+-----+
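
Two follow-up notes. The WARN line above appears because Window.orderBy without a partitionBy moves all rows into a single partition; for a handful of dates that is harmless, but it is worth knowing about on real data. Also, hardcoding df2 gets tedious for longer ranges. As a sketch beyond the original answer: on Spark 2.4+ the calendar can be derived from the orders themselves with sequence and explode, and parsing the strings into real DateType values with to_date keeps the ordering correct once days reach two digits (as a plain string, "2017/5/10" sorts before "2017/5/2"). This assumes a SparkSession named spark is in scope, as in spark-shell.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// Same orders as above, but with `day` parsed into a proper DateType column
val orders = Seq(("2017/5/1", 1000), ("2017/5/5", 1000)).toDF("day", "value")
  .withColumn("day", to_date($"day", "yyyy/M/d"))

// One row per date between the earliest and latest order dates (sequence requires Spark 2.4+)
val calendar = orders
  .agg(min($"day").as("start"), max($"day").as("end"))
  .select(explode(sequence($"start", $"end", expr("interval 1 day"))).as("date"))

val filled = calendar
  .join(orders, $"date" === $"day", "left_outer")
  .na.fill(0, Seq("value"))                                         // days without orders contribute 0
  .withColumn("value", sum($"value").over(Window.orderBy($"date"))) // running total
  .select("date", "value")

filled.show()

The running total still needs a global ordering, so the unpartitioned-window warning remains; that is inherent to a cumulative sum over the whole table and only becomes a concern when the calendar grows large.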