I have two Spark SQL temporary views, created as follows:
spark.sql("select cast('2017-03-01 00:00:00' as timestamp) as col1").createOrReplaceTempView("firstTable")
spark.sql("select '2017-03-01' as col2").createOrReplaceTempView("secondTable")
When I run the following query in spark-shell, I get different results on Spark 2.1.1 and Spark 2.2.0.

Result on Spark 2.1.1:
spark.sql("select * from firstTable where firstTable.col1 in (select secondTable.col2 from secondTable)").show(false)
+---------------------+
|col1                 |
+---------------------+
|2017-03-01 00:00:00.0|
+---------------------+
== Physical Plan ==
*Project [1488326400000000 AS col1#93]
+- BroadcastNestedLoopJoin BuildRight, LeftSemi, (1488326400000000 = cast(col2#97 as timestamp))
   :- Scan OneRowRelation[]
   +- BroadcastExchange IdentityBroadcastMode
      +- *Project [2017-03-01 AS col2#97]
         +- Scan OneRowRelation[]
Result on Spark 2.2.0:
spark.sql("select * from firstTable where firstTable.col1 in (select secondTable.col2 from secondTable)").show(false)
+----+
|col1|
+----+
+----+

The result set is empty here.
== Physical Plan ==
*Project [1488326400000000 AS col1#12]
+- BroadcastNestedLoopJoin BuildRight, LeftSemi, (2017-03-01 00:00:00 = col2#16)
   :- Scan OneRowRelation[]
   +- BroadcastExchange IdentityBroadcastMode
      +- *Project [2017-03-01 AS col2#16]
         +- Scan OneRowRelation[]
In Spark 2.1.1, since firstTable.col1 is a timestamp, secondTable.col2 is cast to timestamp and the WHERE-clause comparison is done on timestamps.

In Spark 2.2.0, col1, although it is a timestamp in firstTable, appears to be treated as a string, and the WHERE-clause comparison is done on strings, so nothing matches.

This is my analysis based on the physical plans posted above.
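One way to probe this analysis (a sketch I have not verified on 2.2.0, not a confirmed fix) is to make the cast explicit in the subquery, so the comparison cannot silently fall back to strings:

```scala
// Force col2 to timestamp explicitly, instead of relying on implicit
// type coercion between the IN expression and the subquery column.
spark.sql("""
  select * from firstTable
  where firstTable.col1 in
    (select cast(secondTable.col2 as timestamp) from secondTable)
""").show(false)
```

If the explicit cast brings the row back on 2.2.0, that would support the theory that the implicit timestamp/string coercion rule changed between the two versions.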
Is this a bug in Spark 2.2.0? Is there a configuration flag that restores the 2.1.1 behavior on 2.2.0?

Has anyone else seen this issue? Any workaround would be appreciated.

Regards,
Srini