To state up front: this is not a production Hadoop environment. It's a single-node setup where we are testing our workflow.
The problem: the Hive query below fails when loading a Parquet table with a single data partition. The source table/partition holds only 142MB of files. The insert statement spawns a single mapper job, which eventually dies with a Java out-of-memory error. It doesn't seem like such a small test case should generate that kind of overhead?
We only hit this problem when inserting as Parquet. Inserting as Avro, ORC, or Text works fine, and queries against the data also run without any issues.
I tried the settings below, but they only adjust the mappers used in the initial select; the insert stage still runs with a single mapper. (A sketch of how they were applied follows the list.)
set mapreduce.input.fileinputformat.split.minsize
set mapreduce.input.fileinputformat.split.maxsize
set mapreduce.job.maps
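For reference, a typical way to apply these in a Hive session looks like the following; the values here are illustrative, not the ones we actually needed:
-- 16MB lower bound per input split (illustrative value)
set mapreduce.input.fileinputformat.split.minsize=16777216;
-- 32MB upper bound per input split (illustrative value)
set mapreduce.input.fileinputformat.split.maxsize=33554432;
-- a hint only; MR2 largely derives the mapper count from the splits
set mapreduce.job.maps=8;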
I'm on CDH 5.8 / Hadoop 2.6. The VM instance has 4 cores / 24GB RAM allocated.
DROP TABLE IF EXISTS web.traffic_pageviews;
CREATE TABLE web.traffic_pageviews(
SESSION_ID STRING,
COOKIE_ID STRING,
TS TIMESTAMP,
PAGE STRING,
PAGE_URL_BASE STRING,
PAGE_URL_QUERY STRING,
PAGE_REFERRAL_URL_BASE STRING,
PAGE_REFERRAL_URL_QUERY STRING)
PARTITIONED BY (DS STRING)
STORED AS PARQUET;
INSERT OVERWRITE TABLE web.traffic_pageviews PARTITION(ds='2016-12-28')
select
session_id,
cookie_id,
ts,
page,
SPLIT(PAGE_URL,'\\?')[0] PAGE_URL_BASE,
SPLIT(PAGE_URL,'\\?')[1] PAGE_URL_QUERY,
SPLIT(PAGE_REFERRAL_URL,'\\?')[0] PAGE_REFERRAL_URL_BASE,
SPLIT(PAGE_REFERRAL_URL,'\\?')[1] PAGE_REFERRAL_URL_QUERY
from
web.stg_traffic_pageviews
where
ds='2016-12-28';
The error output is shown below. I feel like we're doing something fundamentally wrong and shouldn't have to tune the Java memory allocation?
2017-01-03 07:11:02,053 INFO [main] org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper: real writer: parquet.hadoop.ParquetRecordWriter@755cce4b
2017-01-03 07:11:02,057 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1
2017-01-03 07:11:02,062 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 1
2017-01-03 07:11:02,064 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10
2017-01-03 07:11:02,064 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 10
2017-01-03 07:11:02,082 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 100
2017-01-03 07:11:02,082 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 100
2017-01-03 07:11:02,356 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 1000
2017-01-03 07:11:02,356 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 1000
2017-01-03 07:11:03,775 INFO [main] org.apache.hadoop.hive.ql.exec.FileSinkOperator: FS[1]: records written - 10000
2017-01-03 07:11:03,775 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: MAP[2]: records read - 10000
2017-01-03 07:12:03,679 FATAL [LeaseRenewer:cloudera@quickstart.cloudera:8020] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[LeaseRenewer:cloudera@quickstart.cloudera:8020,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: Java heap space
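For what it's worth, the standard knobs for raising the map-task heap on MR2 would be something like the following (illustrative values; the point of this question is that we were hoping not to need them for a 142MB input):
-- YARN container size per map task, in MB (illustrative)
set mapreduce.map.memory.mb=4096;
-- JVM heap for the task, conventionally ~80% of the container (illustrative)
set mapreduce.map.java.opts=-Xmx3276m;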
Answer 0 (score: 0)
The problem resolved itself once compression was specified on the table. Specifically:
CREATE TABLE web.traffic_pageviews(
...
)
PARTITIONED BY (DS STRING)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
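If you'd rather not bake the codec into the DDL, the same codec can reportedly be set at the session level instead of as a table property; this is untested in my setup, so treat it as an assumption:
-- Assumed alternative: session-level codec, picked up by subsequent Parquet inserts
set parquet.compression=SNAPPY;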
While this is the answer, I don't understand why it works. If anyone has insight into that, it would be appreciated.