Question

This should be simple, but I can't figure it out. I need a select that picks up the newer date value for certain accounts.

I start with this, T1:

+----------+---------+
|   date   | account |
+----------+---------+
| 4/1/2018 |       1 |
| 4/1/2018 |       2 |
| 4/1/2018 |       3 |
| 4/1/2018 |       4 |
| 4/1/2018 |       5 |
+----------+---------+

Then some of the dates are updated in T2:

+----------+---------+
|   date   | account |
+----------+---------+
| 7/1/2018 |       1 |
| 7/1/2018 |       2 |
+----------+---------+

How can I get the output into T3, with only those accounts updated?

+----------+---------+
|   date   | account |
+----------+---------+
| 7/1/2018 |       1 |
| 7/1/2018 |       2 |
| 4/1/2018 |       3 |
| 4/1/2018 |       4 |
| 4/1/2018 |       5 |
+----------+---------+

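I can join on the account number, but what about the accounts that did not change? How do I capture those?

Also, T1 has about 8 million records, so performance will be a factor. The data is extracted from Teradata and loaded into Hive.

Thanks!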

Answer 1

Just an addition to the earlier good answers. Try it with coalesce and let me know whether it improves performance.

select t1.Account, coalesce(t2.Date, t1.Date)
from t1
left outer join t2
  on t2.Account = t1.Account;
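
Since the question asks for the result to land in T3, here is a minimal sketch of the same coalesce query wrapped in a CREATE TABLE ... AS SELECT. It assumes each account appears at most once in t2 (otherwise the join would return duplicate rows for that account). The backticks around date are also an assumption, added in case date is treated as a reserved word on your Hive setup.

```sql
-- Sketch only: builds t3 from t1, taking the date from t2 where one exists.
-- Assumes t2 holds at most one row per account.
CREATE TABLE t3 AS
SELECT coalesce(t2.`date`, t1.`date`) AS `date`,  -- t2's date wins when present
       t1.account
FROM t1
LEFT OUTER JOIN t2
  ON t2.account = t1.account;
```

With the sample data from the question, accounts 1 and 2 should come out with 7/1/2018 and accounts 3 to 5 with 4/1/2018.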

Answer 2

I think you want:

select t2.*
from t2
union all
select t1.*
from t1
where not exists (select 1 from t2 where t2.account = t1.account);

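This selects everything from t2 first. Then it pulls in the remaining accounts from t1, that is, the ones with no matching row in t2.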
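
In case your Hive version does not accept a NOT EXISTS subquery, the "remaining accounts" step can also be written as an anti-join. This is a sketch that I have not run against your tables. It assumes both tables contain only the date and account columns shown in the question, and the backticks around date are a precaution against reserved-word handling.

```sql
-- Updated accounts come straight from t2.
SELECT t2.`date`, t2.account
FROM t2
UNION ALL
-- Unchanged accounts: rows of t1 that find no partner in t2.
SELECT t1.`date`, t1.account
FROM t1
LEFT OUTER JOIN t2
  ON t2.account = t1.account
WHERE t2.account IS NULL;
```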

Answer 3

Here is another solution using a left outer join:

select t1.Account,
       case when t2.Date is null then t1.Date else t2.Date end
from t1
left outer join t2 on t2.Account = t1.Account;
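
To try any of the queries above on a small scale before running them against 8 million rows, the sample data from the question can be loaded like this. The column types are assumptions (the question does not state them), and INSERT ... VALUES needs a Hive version that supports it.

```sql
-- Hypothetical table definitions; types are guesses based on the sample rows.
CREATE TABLE t1 (`date` STRING, account INT);
CREATE TABLE t2 (`date` STRING, account INT);

-- T1: the original dates for all five accounts.
INSERT INTO TABLE t1 VALUES
  ('4/1/2018', 1),
  ('4/1/2018', 2),
  ('4/1/2018', 3),
  ('4/1/2018', 4),
  ('4/1/2018', 5);

-- T2: the newer dates for accounts 1 and 2 only.
INSERT INTO TABLE t2 VALUES
  ('7/1/2018', 1),
  ('7/1/2018', 2);
```

Each query should then return the five rows shown for T3 in the question.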

How to join tables by comparing two fields, with performance in mind
