Spark predicate pushdown not working on a partitioned Hive table

Asked: 2017-02-23 07:45:26

Tags: apache-spark apache-spark-sql

When I run some queries, Spark does not seem to push predicates down to the partitions of a particular Hive table.

Setting `spark.sql.orc.filterPushdown` to `true` did not help. The Spark version is 1.6 and the Hive version is 1.2; the Hive table is stored as ORC and partitioned by date.

val sc = new SparkContext(new SparkConf())
var hql = new org.apache.spark.sql.hive.HiveContext(sc)
hql.setConf("spark.sql.orc.filterPushdown", "true")
hql.sql("""
    SELECT i.*, 
        from_unixtime(unix_timestamp('20170220','yyyyMMdd'),"yyyy-MM-dd'T'HH:mm:ssZ") bounce_date
        FROM 
          (SELECT country,
                 device_id,
                 os_name,
                 app_ver
          FROM jpl_band_orc
          WHERE yyyymmdd='20170220'
                  AND scene_id='app_intro'
                  AND action_id='scene_enter'
                  AND classifier='app_intro'
          GROUP BY  country, device_id, os_name, app_ver ) i
        LEFT JOIN 
          (SELECT device_id
          FROM jpl_band_orc
          WHERE yyyymmdd='20170220'
                  AND scene_id='band_list'
                  AND action_id='scene_enter') s
              ON i.device_id = s.device_id
        WHERE s.device_id is null
""")
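One detail worth checking (an assumption about this setup, not something verified against it): `spark.sql.orc.filterPushdown` only controls ORC row-group/stripe filtering inside files, while partition pruning through the Hive metastore in Spark 1.6 is gated by a separate flag, `spark.sql.hive.metastorePartitionPruning`, which defaults to `false`. A minimal sketch, reusing the `hql` context from above:

```scala
// Sketch (untested on this cluster): ask Spark to push the partition
// predicate (yyyymmdd = '20170220') to the Hive metastore, so only the
// matching partition directories are listed, instead of all of them.
hql.setConf("spark.sql.hive.metastorePartitionPruning", "true")
```

This flag is independent of the ORC filter pushdown setting, so enabling one does not imply the other.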

Here is the `SHOW CREATE TABLE` output:

CREATE TABLE `jpl_band_orc`(
    ... many fields ...
    )
PARTITIONED BY ( 
  `yyyymmdd` string)
CLUSTERED BY ( 
  ac_hash) 
SORTED BY ( 
  ac_hash ASC) 
INTO 256 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'BLAH~BLAH~/jpl_band_orc'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='ac_hash,action_id,classifier', 
  'orc.bloom.filter.fpp'='0.05', 
  'orc.compress'='SNAPPY', 
  'orc.row.index.stride'='30000', 
  'orc.stripe.size'='268435456', 
  'transient_lastDdlTime'='1464922691')

Spark job output:

17/02/22 17:05:32 INFO HadoopFsRelation: Listing leaf files and directories in parallel under:
hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160604, hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160608, 
...
hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160620, hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160621, 

It eventually ends with an OOM:

Exception in thread "qtp1779914089-88" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.HashMap$KeySet.iterator(HashMap.java:912)
    at java.util.HashSet.iterator(HashSet.java:172)
    at sun.nio.ch.Util$2.iterator(Util.java:243)
    at org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:600)
    at org.spark-project.jetty.io.nio.SelectorManager$1.run(SelectorManager.java:290)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.substring(String.java:1969)
    at java.net.URI$Parser.substring(URI.java:2869)

It appears that Spark lists all partitions and then runs out of memory. How can I check whether the partitions were actually pruned?
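One low-effort way to check pruning without running the full job (a sketch under the assumption that the `hql` context above is available; not verified on this cluster) is to inspect the physical plan of a query that filters on the partition column:

```scala
// Sketch: build the DataFrame but only print its plans, don't execute it.
val df = hql.sql("""
  SELECT country, device_id
  FROM jpl_band_orc
  WHERE yyyymmdd = '20170220'
""")
// explain(true) prints the parsed, analyzed, optimized and physical plans.
// If partition pruning works, the scan node (e.g. HiveTableScan) should
// carry a partition predicate such as (yyyymmdd = 20170220); if instead
// the driver log shows "Listing leaf files" over every yyyymmdd=... path,
// all partitions are being enumerated.
df.explain(true)
```

Comparing the driver log's "Listing leaf files and directories" line before and after enabling pruning is another quick signal: with pruning it should mention only the `yyyymmdd=20170220` directory.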

0 Answers

There are no answers yet.