Question

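I have a table named zhihu_answer, which I use as the repository for data scraped every day. Each day a zhihu_answer_tmp table is created to hold the newly scraped data; it has the same schema as zhihu_answer.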

The DDL is:

createtab_stmt  
CREATE TABLE `zhihu_answer`(    
  `admin_closed_comment` boolean,   
  `answer_content` string,  
  `answer_created` string,  
  `answer_id` string,   
  `insert_time` string,         
  `voteup_count` int)   
PARTITIONED BY (`year_month` string)

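I treat answer_id and insert_time together as the unique key. My question is: how do I merge the new data in zhihu_answer_tmp into the history table zhihu_answer based on insert_time and answer_id?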

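Specifically, if zhihu_answer already contains a row with the same answer_id and insert_time, do nothing and just ignore it (for idempotency, so that re-running the load does not insert the same data more than once).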

On the other hand, if zhihu_answer has no row with the same answer_id and insert_time as a row in zhihu_answer_tmp, insert that row (newly scraped data).
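
The "ignore if exists, insert if not" behavior described above is commonly written as an anti-join. Here is a minimal sketch, not a verified solution. It assumes that zhihu_answer_tmp exposes year_month as a column (the question only says the two tables share the same structure) and that dynamic partitioning may be enabled for the session. The aliases t and h are only illustrative.

```sql
-- Allow the partition value to be taken from the SELECT output
-- (assumption: the session is permitted to change these settings).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Append only those tmp rows whose (answer_id, insert_time) pair
-- is not yet present in the history table.
INSERT INTO TABLE zhihu_answer PARTITION (year_month)
SELECT
  t.admin_closed_comment,
  t.answer_content,
  t.answer_created,
  t.answer_id,
  t.insert_time,
  t.voteup_count,
  t.year_month              -- partition column must come last
FROM zhihu_answer_tmp t
LEFT JOIN zhihu_answer h
  ON  h.answer_id   = t.answer_id
  AND h.insert_time = t.insert_time
WHERE h.answer_id IS NULL;  -- no match in history: the row is new
```

Running the statement a second time should insert nothing, because every key in zhihu_answer_tmp then finds a match in zhihu_answer. Two caveats: rows whose answer_id or insert_time is NULL never match the join and would be inserted on every run, and duplicates inside zhihu_answer_tmp itself are not collapsed by this query. If zhihu_answer were a transactional (ACID) table, Hive's MERGE statement with a WHEN NOT MATCHED THEN INSERT clause might be an alternative.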

Thanks for any advice or solutions.

How can Hive merge a tmp table into a history table based on some columns: ignore if exists, insert if not exists

0 answers: