I ran into an interesting problem with Spark 2.0. Here is my situation:
select
    a.*,
    b.bcol3
from
(
    select
        col1,
        col2,
        sum(col3) over (
            partition by col1, col2
            order by col3 desc
            rows unbounded preceding
        ) as col3
    from V1
) a
join
(
    select
        col1,
        col2,
        sum(col3) over (
            partition by col1, col2
            order by col3 desc
            rows unbounded preceding
        ) as bcol3
    from V1
) b
on a.col1 = b.col1 and a.col2 = b.col2
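For reference, here is a simplified sketch (in Scala) of how I drive this query. The load path, the variable name windowSelfJoinSql, and the session setup are placeholders for illustration, not my exact job code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("window-self-join-repro")
  .getOrCreate()

// Register the source data as V1 (placeholder path; in my job V1 already exists).
spark.read.parquet("/path/to/v1").createOrReplaceTempView("V1")

// The self-join query shown above, abbreviated into a single string.
val windowSelfJoinSql = """
  select a.*, b.bcol3
  from (select col1, col2,
               sum(col3) over (partition by col1, col2 order by col3 desc rows unbounded preceding) as col3
        from V1) a
  join (select col1, col2,
               sum(col3) over (partition by col1, col2 order by col3 desc rows unbounded preceding) as bcol3
        from V1) b
  on a.col1 = b.col1 and a.col2 = b.col2
"""

// Register the result as V2; the exception appears once an action runs.
spark.sql(windowSelfJoinSql).createOrReplaceTempView("V2")
spark.sql("select count(*) from V2").show()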
When I fetch results from V2 (the query above registered as a view) using count or select *, I get an exception like this:
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#73739L])
+- *Project
+- *SortMergeJoin [market#69662, timegrouptype#67731, periodunit#67733, period#67732, hcp_system_id#69357], [market#73717, timegrouptype#73491, periodunit#73493, period#73492, hcp_system_id#73608], Inner
:- *Sort [market#69662 ASC, timegrouptype#67731 ASC, periodunit#67733 ASC, period#67732 ASC, hcp_system_id#69357 ASC], false, 0
: +- Exchange hashpartitioning(market#69662, timegrouptype#67731, periodunit#67733, period#67732, hcp_system_id#69357, 200)
: +- *HashAggregate(keys=[hcp_system_id#69357, market#69662, timegrouptype#67731, periodunit#67733, period#67732, displayname#67734], functions=[], output=[hcp_system_id#69357, market#69662, timegrouptype#67731, periodunit#67733, period#67732])
: +- Exchange hashpartitioning(hcp_system_id#69357, market#69662, timegrouptype#67731, periodunit#67733, period#67732, displayname#67734, 200)
: +- *HashAggregate(keys=[hcp_system_id#69357, market#69662, timegrouptype#67731, periodunit#67733, period#67732, displayname#67734], functions=[], output=[hcp_system_id#69357, market#69662, timegrouptype#67731, periodunit#67733, period#67732, displayname#67734])
: +- *Project [timegrouptype#67731, periodunit#67733, period#67732, displayname#67734, hcp_system_id#69357, market#69662]
: +- *SortMergeJoin [product_system_id#69514], [product#69666], Inner
: :- *Sort [product_system_id#69514 ASC], false, 0
: : +- Exchange hashpartitioning(product_system_id#69514, 200)
: : +- *Project [timegrouptype#67731, periodunit#67733, period#67732, displayname#67734, hcp_system_id#69357, product_system_id#69514]
: : +- *SortMergeJoin [productgroup_system_id#69829], [productgroup_system_id#69636], Inner
: : :- *Sort [productgroup_system_id#69829 ASC], false, 0
: : : +- Exchange hashpartitioning(productgroup_system_id#69829, 200)
: : : +- *Project [timegrouptype#67731, periodunit#67733, period#67732, displayname#67734, hcp_system_id#69357, product_system_id#69514, productgroup_system_id#69829]
: : : +- *SortMergeJoin [product_system_id#69514], [product_system_id#69827], Inner
: : : :- *Sort [product_system_id#69514 ASC], false, 0
: : : : +- *Project [timegrouptype#67731, periodunit#67733, period#67732, displayname#67734, hcp_system_id#69357, product_system_id#69514]
: : : : +- *SortMergeJoin [product_system_id#69242], [product_system_id#69514], Inner
: : : : :- *Sort [product_system_id#69242 ASC], false, 0
: : : : : +- Exchange hashpartitioning(product_system_id#69242, 200)
...... and more of the plan follows.
The same kind of query runs without error on Spark 1.6.2; the exception only shows up on Spark 2.0.
Has anyone run into this problem? Do you know why it throws this exception?
Note: a workaround that avoids the exception is to cache V1, or whatever table you use for the self-join.
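Continuing the sketch above, this is roughly what the workaround looks like (assuming V1 is reachable via spark.table; how V1 is originally built is omitted):

// Cache the table that is joined to itself before running the query.
val v1 = spark.table("V1")            // however V1 is originally produced
v1.cache()                            // equivalently: spark.sql("CACHE TABLE V1")
v1.createOrReplaceTempView("V1")

// With V1 cached, count/select * on the self-join result completes without the exception.
spark.sql(windowSelfJoinSql).count()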