Spark SQL self-join from HDFS blocks

Asked: 2015-10-06 05:51:55

Tags: apache-spark apache-spark-sql

Below is a sample snippet of a fairly complex query I am running on Spark 1.3.1 (window functions are not an option in this version). The query reads about 18K blocks from HDFS and then shuffles across 18K partitions.

Since it is a self-join, and since both tables are grouped by and joined on the same keys, I assumed all of the keys would end up in the same join partitions, possibly avoiding the shuffle.

Is there a way to avoid reading the data twice and to avoid the shuffle? Could I repartition the input set with the default partitioner, or run the GROUP BY separately on a DataFrame rather than executing everything as a single query? Thanks.

val df = hiveContext.sql("""SELECT  
          EVNT.COL1  
         ,EVNT.COL2  
         ,EVNT.COL3  
         ,MAX(CASE WHEN (EVNT.COL4 = EVNT_DRV.min_COL4) THEN EVNT.COL5  
             ELSE -2147483648 END) AS COL5  
   FROM  
    TRANS_EVNT EVNT  
    INNER JOIN (SELECT  
      COL1  
     ,COL2  
     ,COL3  
     ,COL6  
     ,MIN(COL4) AS min_COL4  
    FROM  
     TRANS_EVNT  
    WHERE partition_key between '2015-01-01' and '2015-01-31'   
     GROUP BY  
      COL1  
     ,COL2  
     ,COL3  
     ,COL6) EVNT_DRV  
   ON   
        EVNT.COL1 = EVNT_DRV.COL1   
    AND EVNT.COL2 = EVNT_DRV.COL2   
    AND EVNT.COL3 = EVNT_DRV.COL3   
    AND EVNT.COL6 = EVNT_DRV.COL6  
   WHERE partition_key between '2015-01-01' and '2015-01-31'   
   GROUP BY  
    EVNT.COL1  
   ,EVNT.COL2  
   ,EVNT.COL3  
   ,EVNT.COL6""")

0 Answers
