Copying table settings from ORC to Parquet

Time: 2018-02-08 18:50:34

Tags: hadoop hive parquet

I have the following ORC table definition, which I want to replicate in Parquet (additional fields are omitted here):

CREATE EXTERNAL TABLE `test_a`(
  `some_id` int,
  `sha_sum` string,
  `parent_sha_sum` string,
  `md5_sum` string
)
PARTITIONED BY (
  `server_date` date
)
CLUSTERED BY (
  sha_sum
)
SORTED BY (
  sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://cluster/user/myuser/test_a'
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.create.index'='true',
  'orc.stripe.size'='130023424',
  'orc.row.index.stride'='64000'
);

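I would like to know how to replicate this in Parquet. I want to use ZLIB or a similar codec for compression, I want to have indexes, and I may want to tune some of the TBLPROPERTIES that apply to Parquet. This is what I have so far: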

CREATE EXTERNAL TABLE `test_b`(
  `some_id` int,
  `sha_sum` string,
  `parent_sha_sum` string,
  `md5_sum` string
)
PARTITIONED BY (
  `server_date` date
)
CLUSTERED BY (
  sha_sum
)
SORTED BY (
  sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
 'COLUMN_STATS_ACCURATE'='true'
);

Is there a list of all the options available for Parquet through TBLPROPERTIES?
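
For illustration, here is a minimal sketch of how the Parquet definition might carry over the ORC settings. It rests on assumptions the post does not confirm: that the Hive version in use reads `parquet.compression`, `parquet.block.size`, and `parquet.enable.dictionary` from TBLPROPERTIES, and that GZIP (DEFLATE-based, like ZLIB) is an acceptable substitute, since Parquet has no codec named ZLIB. The 128 MiB block size is an illustrative value chosen to sit close to the ORC stripe size above, not a recommendation.

```sql
-- Sketch only: test_b with Parquet-specific table properties.
-- Assumption: this Hive release honors the parquet.* keys below when they
-- are set as table properties; verify against the release actually in use.
--   parquet.compression       : GZIP is the closest Parquet codec to ORC's ZLIB
--   parquet.block.size        : row group size in bytes (134217728 = 128 MiB),
--                               roughly the counterpart of orc.stripe.size
--   parquet.enable.dictionary : dictionary encoding for repeated values
CREATE EXTERNAL TABLE `test_b`(
  `some_id` int,
  `sha_sum` string,
  `parent_sha_sum` string,
  `md5_sum` string
)
PARTITIONED BY (
  `server_date` date
)
CLUSTERED BY (
  sha_sum
)
SORTED BY (
  sha_sum, parent_sha_sum, md5_sum
)
INTO 256 BUCKETS
STORED AS PARQUET
LOCATION 'hdfs://cluster/user/myuser/test_b'
TBLPROPERTIES (
  'parquet.compression'='GZIP',
  'parquet.block.size'='134217728',
  'parquet.enable.dictionary'='true'
);
```

As far as I know, Parquet has no table property corresponding to `orc.create.index`. Instead, it records min/max statistics per row group in the file footer, so the SORTED BY clause is what would let readers skip row groups when filtering on sha_sum. The `parquet.*` keys come from the output format settings of the Parquet Java library; which of them a given Hive release honors as table properties should be checked against that release.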

0 answers:

No answers yet