Set parquet snappy output file size in hive?

Date: 2015-06-15 15:13:16

Tags: hive impala parquet snappy

I'm trying to split the parquet/snappy files created by Hive INSERT OVERWRITE TABLE... on the dfs.block.size boundary, because Impala issues a warning when a file in a partition is larger than the block size.

Impala logs warnings like the following:

Parquet files should not be split into multiple hdfs-blocks. file=hdfs://<SERVER>/<PATH>/<PARTITION>/000000_0 (1 of 7 similar)

Code:

CREATE TABLE <TABLE_NAME>(<FIELDS>)
PARTITIONED BY (
    year SMALLINT,
    month TINYINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\037'
STORED AS PARQUET TBLPROPERTIES ("parquet.compression"="SNAPPY");

As for the INSERT hql script:

SET dfs.block.size=134217728;
SET hive.exec.reducers.bytes.per.reducer=134217728;
SET hive.merge.mapfiles=true;
SET hive.merge.size.per.task=134217728;
SET hive.merge.smallfiles.avgsize=67108864;
SET hive.exec.compress.output=true;
SET mapred.max.split.size=134217728;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
from <ANOTHER_TABLE> where year=<YEAR> and month=<MONTH>;

The issue is that the file sizes are all over the place:

partition 1: 1 file:  size  = 163.9 M
partition 2: 2 files: sizes = 207.4 M, 128.0 M
partition 3: 3 files: sizes = 166.3 M, 153.5 M, 162.6 M
partition 4: 3 files: sizes = 151.4 M, 150.7 M, 45.2 M

The issue is the same regardless of whether dfs.block.size (and the other settings above) is increased to 256M, 512M or 1G (for different data sets).

Is there a way/setting to make sure the output parquet/snappy files are split just below the HDFS block size?

3 answers:

Answer 0 (score: 2)

There is no way to close a file once it has grown to the size of a single HDFS block and then start a new one. That would go against how HDFS normally works: having files that span multiple blocks.

The right solution is for Impala to schedule its tasks where the blocks are local, instead of complaining about files spanning more than one block. This was recently done as IMPALA-1881 and will be released in Impala 2.3.

Answer 1 (score: 1)

You need to set both the parquet block size and the dfs block size:

SET dfs.block.size=134217728;  
SET parquet.block.size=134217728; 

Both need to be set to the same value, because you want a parquet block to fit inside an HDFS block.
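For example, a sketch of the asker's INSERT script with the two sizes aligned (the table, field, and partition names are placeholders carried over from the question):

```sql
-- Keep the parquet row-group size and the HDFS block size equal,
-- so that each parquet row group fits into a single HDFS block.
SET dfs.block.size=134217728;      -- 128 MB HDFS block
SET parquet.block.size=134217728;  -- 128 MB parquet row group

INSERT OVERWRITE TABLE <TABLE_NAME>
PARTITION (year=<YEAR>, month=<MONTH>)
SELECT <FIELDS>
FROM <ANOTHER_TABLE> WHERE year=<YEAR> AND month=<MONTH>;
```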

Answer 2 (score: 0)

In some cases you can set the parquet block size by setting mapred.max.split.size (parquet 1.4.2+), which you have already done. You can set it lower than the HDFS block size to increase parallelism. Parquet tries to align to HDFS blocks, when possible:

https://github.com/Parquet/parquet-mr/pull/365
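As a sketch, assuming a 128 MB HDFS block size, a smaller split size might look like this (the 64 MB value is an illustrative choice, not taken from the answer):

```sql
-- Split size below the 128 MB HDFS block to get more parallel tasks;
-- per this answer, with parquet 1.4.2+ this also caps the parquet block size.
SET dfs.block.size=134217728;        -- 128 MB HDFS block
SET mapred.max.split.size=67108864;  -- 64 MB splits -> more mappers
```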

Edit 2015-11-16: According to https://github.com/Parquet/parquet-mr/pull/365#issuecomment-157108975, this may also have been IMPALA-1881, which is fixed in Impala 2.3.