I'd like to understand the best way to handle date-related problems in Spark SQL. I'm trying to solve a simple problem: one file contains date ranges, as follows:
startdate,enddate
01/01/2018,30/01/2018
01/02/2018,28/02/2018
01/03/2018,30/03/2018
and another table has dates and counts:
date,counts
03/01/2018,10
25/01/2018,15
05/02/2018,23
17/02/2018,43
What I'm looking for is the sum of counts within each date range, so the expected output is:
startdate,enddate,sum(count)
01/01/2018,30/01/2018,25
01/02/2018,28/02/2018,66
01/03/2018,30/03/2018,0
Here is the code I wrote, but it gives me a Cartesian result set:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DateBasedCount").master("local").getOrCreate()
import spark.implicits._
val df1 = spark.read.option("header","true").csv("dateRange.txt").toDF("startdate","enddate")
val df2 = spark.read.option("header","true").csv("dateCount").toDF("date","count")
df1.createOrReplaceTempView("daterange")
df2.createOrReplaceTempView("datecount")
val res = spark.sql("select startdate,enddate,date,count from daterange left join datecount on date >= startdate and date <= enddate")
res.show()
The output is:
+----------+----------+----------+-----+
| startdate|   enddate|      date|count|
+----------+----------+----------+-----+
|01/01/2018|30/01/2018|03/01/2018|   10|
|01/01/2018|30/01/2018|25/01/2018|   15|
|01/01/2018|30/01/2018|05/02/2018|   23|
|01/01/2018|30/01/2018|17/02/2018|   43|
|01/02/2018|28/02/2018|03/01/2018|   10|
|01/02/2018|28/02/2018|25/01/2018|   15|
|01/02/2018|28/02/2018|05/02/2018|   23|
|01/02/2018|28/02/2018|17/02/2018|   43|
|01/03/2018|30/03/2018|03/01/2018|   10|
|01/03/2018|30/03/2018|25/01/2018|   15|
|01/03/2018|30/03/2018|05/02/2018|   23|
|01/03/2018|30/03/2018|17/02/2018|   43|
+----------+----------+----------+-----+
Now, if I group by startdate and enddate and sum the count, I see the following incorrect result:
+----------+----------+----------+
| startdate|   enddate|sum(count)|
+----------+----------+----------+
|01/01/2018|30/01/2018|      91.0|
|01/02/2018|28/02/2018|      91.0|
|01/03/2018|30/03/2018|      91.0|
+----------+----------+----------+
So how should this be handled? What is the best way to work with dates in Spark SQL? Should we build the columns as DateType from the start, or read them as strings and cast them to dates when necessary?
Answer 0 (score: 1):
The problem is that Spark does not automatically interpret your dates as dates; they are just strings. The solution is therefore to convert them to dates:
import org.apache.spark.sql.functions._

// Parse the dd/MM/yyyy strings into proper DateType columns
val df1 = spark.read.option("header","true").csv("dateRange.txt")
  .toDF("startdate","enddate")
  .withColumn("startdate", to_date(unix_timestamp($"startdate", "dd/MM/yyyy").cast("timestamp")))
  .withColumn("enddate", to_date(unix_timestamp($"enddate", "dd/MM/yyyy").cast("timestamp")))

val df2 = spark.read.option("header","true").csv("dateCount")
  .toDF("date","count")
  .withColumn("date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
  .withColumn("count", $"count".cast("int"))  // cast to a numeric type so it can be summed later
Then use the same code as before. The output of the SQL command is now:
+----------+----------+----------+-----+
| startdate|   enddate|      date|count|
+----------+----------+----------+-----+
|2018-01-01|2018-01-30|2018-01-03|   10|
|2018-01-01|2018-01-30|2018-01-25|   15|
|2018-02-01|2018-02-28|2018-02-05|   23|
|2018-02-01|2018-02-28|2018-02-17|   43|
|2018-03-01|2018-03-30|      null| null|
+----------+----------+----------+-----+
If that last row should be ignored, simply change the join to an inner join.
Using df.groupBy("startdate", "enddate").sum() on this new dataframe will then give the desired output.
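For completeness, here is a minimal sketch of that final aggregation, assuming the joined result from the SQL above is still called res and that count has been cast to a numeric type as shown earlier. The coalesce(..., lit(0)) is only needed if, as in the expected output, empty ranges should show 0 rather than null; the name sum_count is just illustrative:

import org.apache.spark.sql.functions.{coalesce, lit, sum}

// Aggregate per date range; coalesce turns the null sum of an empty range (e.g. March) into 0
val summed = res.groupBy("startdate", "enddate")
  .agg(coalesce(sum($"count"), lit(0)).as("sum_count"))

summed.orderBy("startdate").show()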