When I run certain queries, Spark does not seem to push the partition predicate down to a specific Hive table.
Setting `spark.sql.orc.filterPushdown` to `true` does not help. The Spark version is 1.6 and the Hive version is 1.2; the Hive table is stored as ORC and partitioned by date.
val sc = new SparkContext(new SparkConf())
val hql = new org.apache.spark.sql.hive.HiveContext(sc)
hql.setConf("spark.sql.orc.filterPushdown", "true")
hql.sql("""
SELECT i.*,
from_unixtime(unix_timestamp('20170220','yyyyMMdd'),"yyyy-MM-dd'T'HH:mm:ssZ") bounce_date
FROM
(SELECT country,
device_id,
os_name,
app_ver
FROM jpl_band_orc
WHERE yyyymmdd='20170220'
AND scene_id='app_intro'
AND action_id='scene_enter'
AND classifier='app_intro'
GROUP BY country, device_id, os_name, app_ver ) i
LEFT JOIN
(SELECT device_id
FROM jpl_band_orc
WHERE yyyymmdd='20170220'
AND scene_id='band_list'
AND action_id='scene_enter') s
ON i.device_id = s.device_id
WHERE s.device_id is null
""")
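To see whether the partition filter actually reaches the scan, one check I know of (a sketch, not part of my original job; the simplified query is just for illustration) is to print the query plan and look at the table-scan node:

```scala
// Hypothetical check: print the logical and physical plans.
// If partition pruning works, the HiveTableScan node should carry the
// yyyymmdd='20170220' predicate rather than scanning every partition.
val df = hql.sql("SELECT count(*) FROM jpl_band_orc WHERE yyyymmdd = '20170220'")
df.explain(true)
```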
Here is the `SHOW CREATE TABLE` output:
CREATE TABLE `jpl_band_orc`(
  ... many fields ...
)
PARTITIONED BY (
  `yyyymmdd` string)
CLUSTERED BY (
  ac_hash)
SORTED BY (
  ac_hash ASC)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'BLAH~BLAH~/jpl_band_orc'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='ac_hash,action_id,classifier',
  'orc.bloom.filter.fpp'='0.05',
  'orc.compress'='SNAPPY',
  'orc.row.index.stride'='30000',
  'orc.stripe.size'='268435456',
  'transient_lastDdlTime'='1464922691')
Spark job output:
17/02/22 17:05:32 INFO HadoopFsRelation: Listing leaf files and directories in parallel under:
hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160604, hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160608,
...
hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160620, hdfs://banda/apps/hive/warehouse/jpl_band_orc/yyyymmdd=20160621,
It eventually ends with an OOM:
Exception in thread "qtp1779914089-88" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.HashMap$KeySet.iterator(HashMap.java:912)
at java.util.HashSet.iterator(HashSet.java:172)
at sun.nio.ch.Util$2.iterator(Util.java:243)
at org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:600)
at org.spark-project.jetty.io.nio.SelectorManager$1.run(SelectorManager.java:290)
at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.substring(String.java:1969)
at java.net.URI$Parser.substring(URI.java:2869)
It seems Spark lists all the partitions before the OOM occurs. How can I verify whether the partitions are actually being pruned?
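From what I have read, `spark.sql.orc.filterPushdown` only controls ORC row-group/stripe skipping inside files that are already selected; it does not affect which partitions are listed. One setting I am considering (assuming it behaves as documented for Spark 1.6, where it is off by default) is:

```scala
// Assumption: with this enabled, HiveContext asks the metastore only for
// the partitions matching the predicate (yyyymmdd='20170220' here),
// instead of listing every partition directory under the table location.
hql.setConf("spark.sql.hive.metastorePartitionPruning", "true")
```

I am not sure whether this alone would avoid the parallel leaf-file listing shown in the log above, so confirmation would be appreciated.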