I have a dataset with null values in one column:
price time id
1 12:00:00 id1
10 12:00:00 id2
NULL 12:05:00 id1
NULL 12:05:00 id2
NULL 12:10:00 id2
2 12:10:00 id1
3 12:15:00 id1
NULL 12:20:00 id1
NULL 12:25:00 id1
4 12:30:00 id1
In Pig or Hive, I want to fill each row whose price is null with the previous known price for the same id, ordered by time.
So the output should be:
price time id
1 12:00:00 id1
10 12:00:00 id2
**1** 12:05:00 id1
**10** 12:05:00 id2
**10** 12:10:00 id2
2 12:10:00 id1
3 12:15:00 id1
**3** 12:20:00 id1
**3** 12:25:00 id1
4 12:30:00 id1
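For reference, the fill rule described above can be sketched in plain Python. This is only an illustration of the expected semantics, not Pig or Hive code; the function name `forward_fill` and the tuple layout `(price, time, id)` are assumptions made for this sketch, and `None` stands in for NULL.

```python
# Sample rows from the question as (price, time, id); None stands in for NULL.
rows = [
    (1, "12:00:00", "id1"),
    (10, "12:00:00", "id2"),
    (None, "12:05:00", "id1"),
    (None, "12:05:00", "id2"),
    (None, "12:10:00", "id2"),
    (2, "12:10:00", "id1"),
    (3, "12:15:00", "id1"),
    (None, "12:20:00", "id1"),
    (None, "12:25:00", "id1"),
    (4, "12:30:00", "id1"),
]


def forward_fill(rows):
    """Replace each None price with the last non-None price of the same id.

    Rows are processed in (id, time) order. Times are fixed-width HH:MM:SS
    strings, so plain string comparison orders them correctly.
    """
    filled = []
    last_price = {}  # id -> most recent non-null price seen so far
    for price, time, id_ in sorted(rows, key=lambda r: (r[2], r[1])):
        if price is not None:
            last_price[id_] = price
        # A row with no earlier known price for its id stays None.
        filled.append((last_price.get(id_), time, id_))
    return filled


for row in forward_fill(rows):
    print(*row)
```

The result comes back sorted by id and then time, so the row order differs from the listing above, but each (time, id) pair carries the same price.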
Thanks a lot in advance.
Edit: this is what I am running in Hive:
Select price,time, id,last_value(price,true) over (partition by id order by time) as LatestPrice from table;
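It works fine for a small number of rows (thousands), but for a larger set (24 million rows) the job has been running for the past day, even though the mappers and reducers both show 100% complete. Any suggestions?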
Answer 0: (score: 0)
-- Fill each NULL price with the latest earlier non-NULL price of the same id.
select notNullTmp.price, tmp.time, tmp.id
from
(
-- For every NULL-price row, find the time of the latest earlier non-NULL row with the same id.
select b.id as id, b.time as time, max(a.time) as prev_time
from
(
select time, id
from table
where price is NOT NULL
) a
JOIN
(
select time, id
from table
where price is NULL
) b
ON (a.id = b.id)
where a.time < b.time
group by b.id, b.time
) tmp
JOIN
(
select price, time, id
from table
where price is NOT NULL
) notNullTmp
ON (tmp.id = notNullTmp.id AND tmp.prev_time = notNullTmp.time)
UNION ALL
-- Rows that already have a price pass through unchanged.
select price, time, id
from table
where price is NOT NULL;
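The idea is to separate the null-price records from the non-null-price records, then look up, for each null-price record, the preceding non-null-price record with the same id. Once a price has been selected for each null-price entry, we combine this data with the non-null-price data.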
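The three steps described above (split, look up, combine) can be sketched in plain Python to make the logic easier to check. This is an illustrative sketch under assumptions, not Hive code: the function name `fill_by_lookup` and the `(price, time, id)` tuple layout are hypothetical, `None` stands in for NULL, and the sample rows are the first six rows of the question's data.

```python
from bisect import bisect_left

# First six rows of the question's data as (price, time, id); None stands in for NULL.
rows = [
    (1, "12:00:00", "id1"),
    (10, "12:00:00", "id2"),
    (None, "12:05:00", "id1"),
    (None, "12:05:00", "id2"),
    (None, "12:10:00", "id2"),
    (2, "12:10:00", "id1"),
]


def fill_by_lookup(rows):
    # Step 1: split rows into known (non-null price) and missing (null price).
    known = [r for r in rows if r[0] is not None]
    missing = [r for r in rows if r[0] is None]

    # Index the known rows per id as parallel, time-sorted lists.
    # Times are fixed-width HH:MM:SS strings, so string order is time order.
    times, prices = {}, {}
    for price, time, id_ in sorted(known, key=lambda r: (r[2], r[1])):
        times.setdefault(id_, []).append(time)
        prices.setdefault(id_, []).append(price)

    # Step 2: for each missing row, find the latest known time strictly before it.
    looked_up = []
    for _, time, id_ in missing:
        pos = bisect_left(times.get(id_, []), time)
        if pos > 0:
            looked_up.append((prices[id_][pos - 1], time, id_))
        # Rows with no earlier known price are dropped, as an inner join would do.

    # Step 3: combine the filled rows with the rows that already had a price.
    return looked_up + known


for row in fill_by_lookup(rows):
    print(*row)
```

One design point worth noting: a null-price row that has no earlier non-null row for its id does not appear in the result at all, which mirrors the behaviour of an inner join in the lookup step.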