Question

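I am trying to load an RDBMS table into Hive, and I need to partition the Hive table dynamically based on column data. The Greenplum table has the following schema: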

forecast_id:bigint
period_year:numeric(15,0)
period_num:numeric(15,0)
period_name:character varying(15)
drm_org:character varying(10)
ledger_id:bigint
currency_code:character varying(15)
source_system_name:character varying(30)
source_record_type:character varying(30)
xx_last_update_log_id:integer
xx_data_hash_code:character varying(32)
xx_data_hash_id:bigint
xx_pk_id:bigint

When I checked the schema of the same table in Hive (the table is usually replicated to Hive) by running describe extended tablename, I got the following schema:

forecast_id             bigint
period_year             bigint
period_num              bigint
period_name             string
drm_org                 string
ledger_id               bigint
currency_code           string
source_record_type      string
xx_last_update_log_id   int
xx_data_hash_code       string
xx_data_hash_id         bigint
xx_pk_id                bigint
source_system_name      string

So I asked my lead why the column source_system_name appears at the end of the Hive table, and I got this answer: "The columns that are used to partition the hive table dynamically, comes at the end of the table"

Is that true? Do the columns used to dynamically partition a Hive table have to come at the end of the schema?

Answer 1

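Column order matters when you use dynamic partitioning in Hive. The Hive documentation on dynamic partition inserts has more details. From the documentation: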

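In INSERT ... SELECT ... queries, the dynamic partition columns must be specified last among the columns in the SELECT statement, and in the same order in which they appear in the PARTITION() clause.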
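
This also explains the describe output in the question: Hive lists partition columns after the regular columns, which is why source_system_name shows up last. As a rough illustration of the rule quoted above, here is a small Python sketch that takes the column order of the Greenplum table, moves the partition column to the end, and prints the resulting INSERT ... SELECT statement. The table names forecast_hive and forecast_staging and both helper functions are hypothetical, made up for this example; only the column names come from the question.

```python
from typing import List, Sequence


def order_for_dynamic_partition(columns: Sequence[str],
                                partition_cols: Sequence[str]) -> List[str]:
    """Return the columns with the partition columns moved to the end,
    in the same order in which they appear in the PARTITION() clause."""
    missing = [c for c in partition_cols if c not in columns]
    if missing:
        raise ValueError(f"partition columns not in source: {missing}")
    regular = [c for c in columns if c not in partition_cols]
    return regular + list(partition_cols)


def build_insert(target: str, source: str,
                 columns: Sequence[str],
                 partition_cols: Sequence[str]) -> str:
    """Build a HiveQL dynamic-partition INSERT ... SELECT statement."""
    select_list = ",\n       ".join(
        order_for_dynamic_partition(columns, partition_cols))
    partition_clause = ", ".join(partition_cols)
    return (
        f"INSERT INTO TABLE {target} PARTITION ({partition_clause})\n"
        f"SELECT {select_list}\n"
        f"FROM {source}"
    )


# Column order as it exists in the Greenplum source table (from the question).
source_columns = [
    "forecast_id", "period_year", "period_num", "period_name", "drm_org",
    "ledger_id", "currency_code", "source_system_name", "source_record_type",
    "xx_last_update_log_id", "xx_data_hash_code", "xx_data_hash_id", "xx_pk_id",
]

# forecast_hive and forecast_staging are hypothetical table names.
sql = build_insert("forecast_hive", "forecast_staging",
                   source_columns, ["source_system_name"])
print(sql)
```

Before running a statement like this in Hive, you would normally enable dynamic partitioning with SET hive.exec.dynamic.partition=true; and, because every partition column here is dynamic, SET hive.exec.dynamic.partition.mode=nonstrict;.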
