Question

I have a users table in Hive with the following format:

User: 
Id    String,
Name  String,
Col1  String,
UpdateTimestamp Timestamp

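I am inserting data into this table from files with the following format:

I/U, timestamp when the record was written to the file, Id, Name, Col1, UpdateTimestamp

For example, to insert a user with Id 1: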

I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456

And to update col1 for the same user with Id 1:

U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457

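Columns that were not updated come through as null.

A simple insert is easy in Hive: load the file into a staging table with LOAD DATA INPATH, then ignore the first two fields of the staging table.

But how would I handle the update records, so that my final row in Hive looks like this: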

1,Bob,updatedstuff,123457

I was thinking of inserting all rows into a staging table and then running some sort of merge query. Any ideas?

Answer 1

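Typically, for a merge statement, your "file" would still be unique on ID, and the merge statement would determine whether each row needs to be inserted as a new record or used to update the values of an existing record.

However, if the file format is non-negotiable and will always have the I/U layout, you can split the process into two steps, insert and then update, as you suggested.

To perform updates in Hive, the users table must be stored as ORC and ACID must be enabled on your cluster. For my example, I create the users table with a cluster key and the transactional table property: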

create table test.orc_acid_example_users
(
  id int
  ,name string
  ,col1 string
  ,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
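Enabling ACID, as mentioned above, usually involves a few configuration properties. The following is only a sketch of the session-level settings; whether they are already set cluster-wide depends on your distribution and Hive version, so treat it as an assumption about your environment rather than a required step.

```sql
-- Client/session settings commonly needed for transactional (ACID) tables.
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
```

Compaction is configured on the metastore side (for example hive.compactor.initiator.on and hive.compactor.worker.threads), so it normally cannot be switched on from a client session.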

After the insert statement, your Bob record will have "stuff" in col1.
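The insert step itself could look roughly like the following. This is a sketch that assumes the file has already been loaded into test.orc_acid_example_staging and that the staging table exposes the columns type, id, name, col1 and updatetimestamp; the updatetimestamp column name in staging is my assumption.

```sql
-- Copy only the 'I' (insert) records into the transactional users table,
-- dropping the I/U flag and the file-write timestamp.
insert into table test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';
```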

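As for the updates, you can handle them with either an update or a merge statement. I think the key here is the null values: when the staging table loaded from the file has a null, it is important to keep the original name, col1, or whatever column it is. Here is a merge example that coalesces the staging table fields. Basically, if there is a value in the staging table, take it; otherwise fall back to the original value.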

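-- Assumes the staging table also exposes the file's UpdateTimestamp as updatetimestamp.
merge into test.orc_acid_example_users as t
  using test.orc_acid_example_staging as s
on t.id = s.id
  and s.type = 'U'
when matched
  then update set
    name = coalesce(s.name, t.name),
    col1 = coalesce(s.col1, t.col1),
    updatetimestamp = coalesce(s.updatetimestamp, t.updatetimestamp);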

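Now Bob will show "updatedstuff".

Quick disclaimer: if the staging table contains more than one update for Bob, things will get messy. Before running the update/merge you will need a pre-processing step that collects the latest non-null value of each column across all of the updates. Hive isn't really a complete transactional database; it would be preferable for the source to send full user records whenever there is an update, rather than only the changed fields.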

Answer 2

You can reconstruct each record in the table by using last_value() with the ignore-nulls option:

select h.id,
       coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
       coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
       h.update_timestamp
from history h;

If you only need the latest record, you can use row_number() and a subquery.
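A rough sketch of that idea, assuming the same history table and column names as the query above (they are placeholders for your own staging or history table). The explicit full-partition frame is my choice, so that every row sees the last non-null value regardless of its position:

```sql
select id, name, col1, update_timestamp
from (
  select h.id,
         -- last non-null value per user, ordered by the file-write timestamp
         last_value(h.name, true) over (partition by h.id order by h.timestamp
             rows between unbounded preceding and unbounded following) as name,
         last_value(h.col1, true) over (partition by h.id order by h.timestamp
             rows between unbounded preceding and unbounded following) as col1,
         h.update_timestamp,
         -- rn = 1 marks the most recent row for each user
         row_number() over (partition by h.id order by h.timestamp desc) as rn
  from history h
) latest
where rn = 1;
```

For the two sample rows in the question this should produce a single row for Id 1 with Bob, updatedstuff and 123457, which could then serve as a one-row-per-user source for a merge.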

