I have 4 columns:
date  number  Estimate  Client
----  ------  --------  ------
1     3       10        A
2     NULL    10        NULL
3     5       10        A
4     NULL    10        NULL
5     NULL    10        NULL
6     2       10        A
.......
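I need to replace each NULL with the last known value, taken from the nearest earlier date in the date column. For example: for date = 2, number = 3; for dates 4 and 5, number = 5 and 5. The NULL values appear at random.
This needs to be done in Hive.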
Answer 0 (score: 4)
This is a job for a sliding window.
Here is my table content:
hive> select * from my_table;
OK
1 3 10 A
2 NULL 10 NULL
3 5 10 A
4 NULL 10 NULL
5 NULL 10 NULL
6 2 10 A
Time taken: 0.06 seconds, Fetched: 6 row(s)
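All you need to do is slide a window from the first row up to the current row and pick the most recent non-null value. The LAST_VALUE windowing function takes a second, boolean argument that tells it to skip nulls: LAST_VALUE(<field>, <ignore_nulls> as boolean).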
SELECT
COALESCE(`date`, LAST_VALUE(`date`, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(number, LAST_VALUE(number, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(estimate, LAST_VALUE(estimate, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)),
COALESCE(client, LAST_VALUE(client, TRUE) OVER(ORDER BY `date` ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM my_table;
The result will be:
OK
1 3 10 A
2 3 10 A
3 5 10 A
4 5 10 A
5 5 10 A
6 2 10 A
Time taken: 19.177 seconds, Fetched: 6 row(s)
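To make the window semantics concrete, here is a minimal Python sketch (not Hive code) of what COALESCE(x, LAST_VALUE(x, TRUE) OVER (... UNBOUNDED PRECEDING AND CURRENT ROW)) computes for one column that is already ordered by date. The function name forward_fill is made up for illustration.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def forward_fill(values: Sequence[Optional[T]]) -> List[Optional[T]]:
    """Replace each None with the most recent non-None value before it."""
    filled: List[Optional[T]] = []
    last_seen: Optional[T] = None  # stays None until the first non-null value
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# The number column from the sample table, ordered by date.
numbers = [3, None, 5, None, None, 2]
print(forward_fill(numbers))  # [3, 3, 5, 5, 5, 2]
```

Note that a NULL appearing before any non-null value stays NULL, in this sketch and in the Hive query alike, because there is nothing earlier to copy from.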
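Answer 1 (score: 1)
Here is a solution using standard HiveQL joins. It should work in all versions of Hive. Subquery c finds, for each row, the most recent date on which the same client has a non-null number. Subquery d then joins in the number associated with that date. Thanks to COALESCE, the joined value is used only when number is NULL.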
select c.`date`
     , coalesce(c.number, d.number) as number
     , c.client
     , c.estimate
from
  (select a.`date`
        , max(b.prior_date) as prior_date -- nearest date with a non-null number
        , a.number
        , a.estimate
        , a.client
   from
     (select `date`
           , number
           , estimate
           , client
      from table_have
     ) a
   left outer join
     (select `date` as prior_date -- dates whose number is not null
           , client
      from table_have
      where number is not null
     ) b
   on a.client = b.client -- rows with a NULL client never match
   where a.`date` >= b.prior_date -- >= keeps rows whose own number is non-null
   group by a.client, a.`date`, a.number, a.estimate
  ) c
left outer join
  (select `date`
        , number
        , client
   from table_have
   where number is not null
  ) d
on c.client = d.client and c.prior_date = d.`date`
-- assumes (client, date) is unique, so no outer GROUP BY is needed
;
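This query could probably be optimized further with common table expressions like the ones used in the alternate solution. However, this solution does not need a step repeated N times and should apply generally. The N required by the other solution may not be static, so this one may suit the more general case.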
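Answer 2 (score: 0)
This is actually a fairly tricky problem, because Hive does not support recursive CTEs or correlated subqueries, which are the usual ways to solve this kind of problem.
The only pure-Hive approach I can think of is to do a bunch of self-joins. You need as many of them as the longest run of consecutive NULLs in your data.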
-- add in row numbers
with T as
(select *, row_number() over (order by `date`) as rn
from mytable)
-- main query
select T.`date`,
case when T.number is not null then T.number
when T1.number is not null then T1.number
when T2.number is not null then T2.number end as number
-- repeat this N times
-- where N is the length of the longest sequence of consecutive nulls
-- add in your other columns here
from T
-- left joins on rn keep the first rows, which have no predecessor
left join T T1 on T1.rn = T.rn - 1
left join T T2 on T2.rn = T.rn - 2
-- repeat this N times
;
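The N in the comments above is the length of the longest run of consecutive NULLs. A minimal Python sketch of how that number could be computed from the column, once it is ordered by date, follows; the helper name longest_null_run is hypothetical and not part of Hive.

```python
from typing import Optional, Sequence


def longest_null_run(values: Sequence[Optional[int]]) -> int:
    """Return the length of the longest run of consecutive None values."""
    longest = 0
    current = 0
    for value in values:
        if value is None:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a non-null value ends the current run
    return longest


# The number column from the question, ordered by date.
print(longest_null_run([3, None, 5, None, None, 2]))  # 2
```

For the sample data this gives 2, which matches the two self-joins (T1 and T2) in the query. If new data ever contains a longer run, the query needs another join.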
Answer 3 (score: 0)
If you are using SQL, the query below can help. Alternatively, you can use the ffill and bfill functions in pandas.
select primary_key_val, country,
COALESCE(country, LAST_VALUE(country, TRUE) OVER(partition by primary_key_val order by eff_start_dt ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)) as upd_country,
eff_start_dt from dim_acct_keys order by primary_key_val, eff_start_dt
For example, with some data:
+------------------+----------+--------------+---------------+
| primary_key_val | country | upd_country | eff_start_dt |
+------------------+----------+--------------+---------------+
| act1010 | USA | USA | 20190101 |
| act1010 | NULL | USA | 20190102 |
| act1010 | NULL | USA | 20190103 |
| act1012 | USA | USA | 20190101 |
| act1012 | NULL | USA | 20190102 |
| act1012 | MEX | MEX | 20190103 |
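The PARTITION BY clause is what stops a value from one account leaking into the next. As a rough illustration only, here is a plain-Python sketch of the same per-key forward fill; the function name fill_country and the tuple layout are assumptions made for this example, not part of Hive or pandas.

```python
from typing import Dict, List, Optional, Tuple

# (primary_key_val, country, eff_start_dt)
Row = Tuple[str, Optional[str], str]
# (primary_key_val, country, upd_country, eff_start_dt)
FilledRow = Tuple[str, Optional[str], Optional[str], str]


def fill_country(rows: List[Row]) -> List[FilledRow]:
    """Forward-fill country within each primary_key_val, ordered by eff_start_dt."""
    last_seen: Dict[str, str] = {}  # last non-null country per key
    result: List[FilledRow] = []
    for key, country, start in sorted(rows, key=lambda r: (r[0], r[2])):
        if country is not None:
            last_seen[key] = country
        result.append((key, country, last_seen.get(key), start))
    return result


rows = [
    ("act1010", "USA", "20190101"),
    ("act1010", None, "20190102"),
    ("act1010", None, "20190103"),
    ("act1012", "USA", "20190101"),
    ("act1012", None, "20190102"),
    ("act1012", "MEX", "20190103"),
]
for row in fill_country(rows):
    print(row)
```

Keeping one "last seen" value per key plays the role of the partition: act1012 starts from its own history rather than inheriting from act1010.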