I am trying to get dynamic partition inserts working on a partitioned managed (internal) Hive table. The table schema is as follows:
hive> describe formatted saibhargav_history;
OK
# col_name data_type comment
appid string
appstatus string
apptype string
submittime bigint
starttime bigint
finishtime bigint
launchtime bigint
jobcounters map<string,string>
# Partition Information
# col_name data_type comment
finishyear string
finishmonth string
finishday string
finishhour string
# Detailed Table Information
Database: saibhvar
Owner: bhargav
CreateTime: Thu Sep 26 09:54:48 GMT 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://user/saibhargav/jobhistory
Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version 2
orc.compress SNAPPY
transient_lastDdlTime 1569491688
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
line.delim \n
serialization.format 1
Time taken: 0.401 seconds, Fetched: 54 row(s)
The table is populated by a history-fetch service that runs every 4 hours (covering all ad-hoc and scheduled Hive queries). It partitions the records by the finishtime of the jobs that ran in that window (finishyear, finishmonth, finishday and finishhour). The problem: in a later iteration, if records belonging to an already-existing partition (say finishyear=2020, finishmonth=04, finishday=28 and finishhour=04) are dynamically added to the table, the insert overwrites that partition's existing contents with the contents from that run.
The managed table is populated with the following insert query:
insert into table `saibhvar.saibhargav_history`
partition(`finishyear`,`finishmonth`,`finishday`,`finishhour`)
select `appId`,`appStatus`,`appType`,`submitTime`,`startTime`,`finishTime`,
       str_to_map(`jobCounters`,'\\006','\\005'),
       `finishYear`,`finishMonth`,`finishDay`,`finishHour`
from `saibhvar.temp_table_1588226958`
`saibhvar.temp_table_1588226958` is a temporary table into which the history-fetch service streams data; it serves as the source for the dynamic-partition insert into the managed table.
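For reference, a fully dynamic partition insert like the one above normally needs the dynamic-partition session settings, and the `INSERT INTO` form is documented to append to a partition while `INSERT OVERWRITE` replaces it. A minimal sketch of both forms, using the table and column names from the schema above (the SET values shown are the usual ones for all-dynamic partition columns, not something taken from my setup):

```sql
-- Needed when every partition column is dynamic:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- INSERT INTO is the append form: new files are added to each target partition.
INSERT INTO TABLE saibhvar.saibhargav_history
PARTITION (finishyear, finishmonth, finishday, finishhour)
SELECT appId, appStatus, appType, submitTime, startTime, finishTime,
       str_to_map(jobCounters, '\\006', '\\005'),
       finishYear, finishMonth, finishDay, finishHour
FROM saibhvar.temp_table_1588226958;

-- INSERT OVERWRITE, by contrast, replaces the contents of every partition
-- that the SELECT produces rows for.
INSERT OVERWRITE TABLE saibhvar.saibhargav_history
PARTITION (finishyear, finishmonth, finishday, finishhour)
SELECT appId, appStatus, appType, submitTime, startTime, finishTime,
       str_to_map(jobCounters, '\\006', '\\005'),
       finishYear, finishMonth, finishDay, finishHour
FROM saibhvar.temp_table_1588226958;
```

Since my query already uses `INSERT INTO`, the overwrite behaviour I am seeing is what looks wrong to me.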
I have followed the documentation at https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions.
Any ideas on how to debug this and prevent the data already in a partition from being overwritten?
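One way I could check whether a partition's files are being replaced rather than appended to (a diagnostic sketch, using an example partition from above) is to inspect the partition metadata before and after a run and compare `numFiles` and `transient_lastDdlTime`:

```sql
-- List the partitions the table currently has:
SHOW PARTITIONS saibhvar.saibhargav_history;

-- Inspect one partition's details (file count, location, last DDL time):
DESCRIBE FORMATTED saibhvar.saibhargav_history
  PARTITION (finishyear='2020', finishmonth='04', finishday='28', finishhour='04');
```

If `numFiles` resets instead of growing after each 4-hourly run, the partition is being rewritten rather than appended to.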