Below is a snippet from a fairly complex query I am running on Spark 1.3.1 (window functions are not an option in this version). The query reads roughly 18K blocks from HDFS and then shuffles across 18K partitions.
Since it is a self-join, and since both sides are grouped and joined on the same keys, I assumed all the keys for the join would be co-located on the same partition, potentially avoiding the shuffle.
Is there a way to avoid reading the table twice and to avoid the shuffle? Could I repartition the input set with the default partitioner, or run the GROUP BY separately on a DataFrame instead of executing everything as a single query (see the sketch after the query)? Thanks.
val df = hiveContext.sql("""SELECT
EVNT.COL1
,EVNT.COL2
,EVNT.COL3
,MAX(CASE WHEN (EVNT.COL4 = EVNT_DRV.min_COL4) THEN EVNT.COL5
ELSE -2147483648 END) AS COL5
FROM
TRANS_EVNT EVNT
INNER JOIN (SELECT
COL1
,COL2
,COL3
,COL6
,MIN(COL4) AS min_COL4
FROM
TRANS_EVNT
WHERE partition_key between '2015-01-01' and '2015-01-31'
GROUP BY
COL1
,COL2
,COL3
,COL6) EVNT_DRV
ON
EVNT.COL1 = EVNT_DRV.COL1
AND EVNT.COL2 = EVNT_DRV.COL2
AND EVNT.COL3 = EVNT_DRV.COL3
AND EVNT.COL6 = EVNT_DRV.COL6
WHERE EVNT.partition_key between '2015-01-01' and '2015-01-31'
GROUP BY
EVNT.COL1
,EVNT.COL2
,EVNT.COL3
,EVNT.COL6""")