Question

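I have an external table and I want to add partitions to it. There are 224 unique city IDs, and I would like to write just alter table my_table add partition (cityid) location /path;, but Hive complains that I have not supplied a value for the city ID, as in alter table my_table add partition (cityid=VALUE) location /path;. I don't want to run an alter table command for every city ID value. How can I do this for all IDs in one go?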

This is what it looks like on the Hive command line:

hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';

FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}

Answer 1

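At the physical level, a partition is a location containing data files (a separate location for each value, usually named key=value). If you already have a partition directory structure with files in it, all you need to do is create the partitions in the Hive Metastore: point the table at the root directory with ALTER TABLE SET LOCATION, then run the MSCK REPAIR TABLE command. The equivalent command on the Amazon Elastic MapReduce (EMR) version of Hive is ALTER TABLE table_name RECOVER PARTITIONS. This adds the Hive partition metadata. See the manual section on RECOVER PARTITIONS.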
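As a rough sketch of that metastore-only route: it assumes the directory from the question already contains one subdirectory per city named in key=value form (for example cityid=1), and that the table is declared with cityid as a partition column. The table name pavel.browserdata_part and the data columns are hypothetical placeholders, not the real schema.

```sql
-- Hypothetical partitioned external table over the existing directory tree.
-- col1 and col2 are placeholders for the real data columns.
CREATE EXTERNAL TABLE pavel.browserdata_part (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (cityid INT)
LOCATION '/user/maria_dev/data/cityidPartition';

-- Scan the table location and register every cityid=... directory found.
MSCK REPAIR TABLE pavel.browserdata_part;

-- Check which partitions the metastore now knows about.
SHOW PARTITIONS pavel.browserdata_part;
```

If the directories are not named cityid=VALUE, MSCK REPAIR TABLE will not pick them up, and the reload approach described next is the safer choice.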

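If you only have an unpartitioned table with data in its location, adding partitions will not work, because the data has to be reloaded. In that case you need to:

Create another, partitioned table and use insert overwrite with dynamic partitioning to load the partition data: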

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table table2 partition(cityid)
select col1, ... colN,
       cityid
  from table1; -- partition columns must come last in the select

This is a very efficient way to reorganize the data.
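The dynamic partition load assumes the partitioned target table already exists. A minimal sketch of what would be run beforehand, with placeholder column names and a storage format picked purely for illustration; the limit value of 300 is likewise an assumption, chosen only to sit above the 224 city IDs:

```sql
-- Hive's default per-node limit on dynamic partitions is 100, which
-- 224 city IDs can exceed when one task writes many partitions.
-- 300 is an illustrative value, not a recommendation.
set hive.exec.max.dynamic.partitions.pernode=300;

-- Hypothetical target table: col1 and col2 stand in for col1 ... colN.
-- cityid appears only in PARTITIONED BY, not in the column list.
CREATE TABLE table2 (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (cityid INT)
STORED AS ORC;

-- Once the insert overwrite has finished, list the partitions it created.
SHOW PARTITIONS table2;
```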

After that, you can drop the source table and rename the target table.
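That final swap might look like the following, using the same table1 and table2 names as the insert statement. It is worth checking the row counts in table2 before dropping anything.

```sql
-- Remove the old unpartitioned table. If table1 is EXTERNAL, this drops
-- only the metastore entry and leaves its files in place.
DROP TABLE table1;

-- Give the partitioned table the original name.
ALTER TABLE table2 RENAME TO table1;
```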
