Question

My Hive table is partitioned by date over a 2-year span, and each partition contains 200 files of 2 MB each.

I can concatenate the files of one partition by running the following command: "ALTER TABLE table_name PARTITION (partition_column_name='2017-12-31') CONCATENATE"

Running this query manually for every partition takes a lot of time, so is there an easier way?

Answer 1

Option 1: Select and overwrite the same Hive table:

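Hive supports insert overwrite into the same table. Use this option if you are sure that data is inserted into the Hive table with insert statements only (not by loading files through HDFS).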

hive> SET hive.exec.dynamic.partition = true;
hive> SET hive.exec.dynamic.partition.mode = nonstrict;
hive> Insert overwrite table <partition_table_name> partition(<partition_col>) select * from <db>.<partition_table_name>;

You can also use sort by, distribute by, and these additional params to control the number of files created in the table.
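As a rough sketch of the distribute by idea, the shell helper below prints an Option 1 statement with a DISTRIBUTE BY on the partition column. With that clause, all rows of a given partition are routed to one reducer, which typically leaves one file per partition. The function name build_compact_hql, the file name compact.hql, and the table and column names in the usage comment are hypothetical; try it on a test table first and check the resulting file sizes.

```shell
#!/bin/bash
# Print an INSERT OVERWRITE statement that rewrites a partitioned table
# onto itself, distributing rows by the partition column.
# $1 = table name (hypothetical example: default.partitn)
# $2 = partition column (hypothetical example: dt)
build_compact_hql() {
  local table="$1" part_col="$2"
  cat <<EOF
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE ${table} PARTITION (${part_col})
SELECT * FROM ${table}
DISTRIBUTE BY ${part_col};
EOF
}

# Usage (assumes the hive CLI is on the PATH):
#   build_compact_hql default.partitn dt > compact.hql
#   hive -f compact.hql
```

Because select * on a partitioned table returns the partition column last, the column order matches what a dynamic-partition insert expects.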

Option 2: Using a shell script:

bash$ cat cnct.hql
alter table default.partitn1 partition(${hiveconf:var1} = '${hiveconf:var2}') concatenate

Trigger the above .hql script from a shell script (using a for loop):

bash$ cat trigg.sh
#!/bin/bash
id=`hive -e "show partitions default.partitn"`
echo "partitions: " $id
for f in $id; do
  echo "select query for: " $f
  #split the partitions on = then assigning to two variables
  IFS="=" read var1 var2 <<< $f
  #pass the variables and execute the cnct.hql script
  hive --hiveconf var1=$var1 --hiveconf var2=$var2 -f cnct.hql
done
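The loop above starts a separate hive process for every partition, which can be slow with two years of daily partitions. One possible variation, sketched below, is to turn the "show partitions" output into a single .hql file and run it in one hive session. The function name gen_concat_hql and the file name concat_all.hql are hypothetical. The sketch assumes partition values contain no "/" or "=" characters and that quoting every value as a string is acceptable for the table.

```shell
#!/bin/bash
# Read partition specs on stdin, one per line, in the form printed by
# "show partitions" (dt=2017-12-31 or year=2017/month=12), and print one
# ALTER TABLE ... CONCATENATE statement per partition.
# $1 = table name; it must not contain "/" or "&" (used by sed below).
gen_concat_hql() {
  local table="$1"
  # 1) drop blank lines
  # 2) quote each value:          dt=2017-12-31 -> dt='2017-12-31'
  # 3) join multi-level specs:    a='1'/b='2'   -> a='1', b='2'
  # 4) wrap the spec in the ALTER TABLE statement
  sed -E -e '/^[[:space:]]*$/d' \
         -e "s/=([^/]*)/='\\1'/g" \
         -e 's#/#, #g' \
         -e "s/.*/ALTER TABLE ${table} PARTITION (&) CONCATENATE;/"
}

# Usage (assumes the hive CLI is on the PATH):
#   hive -S -e "show partitions default.partitn" \
#     | gen_concat_hql default.partitn > concat_all.hql
#   hive -f concat_all.hql
```

Writing the statements to a file first also makes it easy to review them, or to split the file into batches, before anything is executed.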

Concatenate all partitions in a Hive dynamically partitioned table
