I just learned about the collect_set() function in Hive and kicked off a job on a 3-node development cluster.
I only have about 10 GB to process, yet the job is literally taking forever. I figure there is either a bug in the implementation of collect_set(), a bug in my code, or collect_set() really is that resource-intensive.
Here is my SQL for Hive (no pun intended):
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
sess.remote_address as remote_address,
sess.hit_count as hit_count,
COLLECT_SET(evt.event_id) as event_set,
hit.rsp_timestamp as hit_timestamp,
sess.site_link as site_link
FROM site_session sess
JOIN (SELECT * FROM site_event
WHERE event_id = 274 OR event_id = 284 OR event_id = 55 OR event_id = 151) evt
ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = evt.session_key)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
There are 4 MR passes. The first took about 30 seconds. The second map took about 1 minute. Most of the second reduce took about 2 minutes. Over the last two hours it has crept from 97.71% to 97.73%. Is this right? I think something must be wrong. I looked at the log, and I can't tell whether this is normal.
[Log sample]
2011-06-21 16:32:22,715 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:22,758 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5142000000 rows
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5142000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5143000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5143000000 rows
2011-06-21 16:32:24,725 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:24,768 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 6 forwarding 42000000 rows
2011-06-21 16:32:24,771 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5144000000 rows
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5144000000 rows
2011-06-21 16:32:26,467 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5145000000 rows
2011-06-21 16:32:26,468 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5145000000 rows
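As a rough sanity check on the progress numbers quoted above, extrapolating the observed rate shows why this job will effectively never finish (a back-of-the-envelope sketch, not a Hadoop metric):

```python
# Extrapolate the reduce-phase progress quoted above:
# 97.71% -> 97.73% over two hours, i.e. 0.02 percentage points per 2 h.
progress_gained = 97.73 - 97.71        # percentage points gained
hours_elapsed = 2.0
remaining = 100.0 - 97.73              # ~2.27 points still to go
eta_hours = remaining / progress_gained * hours_elapsed
print(round(eta_hours))                # roughly 227 more hours at this rate
```

At the observed rate, the remaining 2.27% would take on the order of nine days, which strongly suggests the row count is exploding rather than the job simply being slow.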
I'm very new at this, and trying to use collect_set() and Hive arrays is driving me up the wall.
Thanks in advance :)
Answer 0 (score: 2)
Epic fail. My solution is below. COLLECT_SET had no problem after all; it was just trying to collect all the items, of which there were infinitely many.
Why? Because I was joining on something that wasn't even part of the set. The second join used to have the same ON condition as the first; now it correctly says sess.session_key = hit.session_key.
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
sess.remote_address as remote_address,
sess.hit_count as hit_count,
COLLECT_SET(evt.event_id) as event_set,
hit.rsp_timestamp as hit_timestamp,
sess.site_link as site_link
FROM tealeaf_session sess
JOIN site_event evt ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = hit.session_key)
WHERE evt.event_id IN (274, 284, 55, 151)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
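To see why the original ON clause blew up, here is a tiny in-memory model of the bug (table contents are made up for illustration): when a join condition never mentions the table being joined, every left row pairs with every right row, i.e. a Cartesian product.

```python
# Model of the bug: the original second join compared sess to evt and
# never mentioned hit, so the condition was constant per session and
# every session row was paired with EVERY hit row (a cross join).
sessions = [{"session_key": k} for k in (1, 2)]
hits = [{"session_key": k, "ts": t} for k, t in [(1, 10), (1, 11), (2, 20)]]

# Broken join: the condition ignores `hit`, so it degenerates to a cross join.
broken = [(s, h) for s in sessions for h in hits]

# Fixed join: hit rows are matched on their own session_key.
fixed = [(s, h) for s in sessions for h in hits
         if s["session_key"] == h["session_key"]]

print(len(broken), len(fixed))  # 6 vs 3 here; the gap grows with table size
```

With real table sizes the cross join multiplies row counts into the billions, which matches the `forwarding 5145000000 rows` lines in the log, and COLLECT_SET then tries to aggregate that flood.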
Answer 1 (score: 0)
The first thing I'd try is to get rid of the sub-select: join directly to site_event, move the event_id filter into an outer WHERE clause, and change it to an IN(). Something like this:
SELECT sess.session_key as session_key,
sess.remote_address as remote_address,
sess.hit_count as hit_count,
COLLECT_SET(evt.event_id) as event_set,
hit.rsp_timestamp as hit_timestamp,
sess.site_link as site_link
FROM site_session sess
JOIN site_event evt ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = hit.session_key)
WHERE evt.event_id IN (274, 284, 55, 151)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
Also, I don't know the size of each table, but generally in Hive you want to keep the largest table (usually your fact table) on the right-hand side of a join to reduce memory usage. The reason is that Hive tries to hold the left-hand side of the join in memory and streams the right-hand side through to complete the join.
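The build-left / stream-right behaviour described above can be sketched as a simple hash join (a toy model of the idea, not Hive's actual implementation):

```python
# Toy hash join mirroring the advice above: the LEFT input is buffered
# in a hash table in memory, and the RIGHT input is streamed past it.
# That is why the small table belongs on the left and the big fact
# table on the right.
from collections import defaultdict

def hash_join(small_left, big_right, key):
    table = defaultdict(list)          # left side held entirely in memory
    for row in small_left:
        table[row[key]].append(row)
    for row in big_right:              # right side streamed row by row
        for match in table.get(row[key], []):
            yield {**match, **row}

left = [{"session_key": 1, "link": "a"}, {"session_key": 2, "link": "b"}]
right = [{"session_key": 1, "event_id": 274}, {"session_key": 1, "event_id": 55}]
joined = list(hash_join(left, right, "session_key"))
print(joined)
```

Memory usage scales with the left input, while the right input only needs to fit one row at a time, so putting the fact table on the right keeps the in-memory footprint small.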
Answer 2 (score: 0)
My guess at what's happening is that it's producing the COLLECT_SET() for every row returned. So for every row you return, it returns the entire array produced by COLLECT_SET. That could be taxing and take a long time.
Check the performance with COLLECT_SET out of the query. If that's fast enough, push the computation of COLLECT_SET into a subquery and then use that column instead of computing it in place.
I haven't used COLLECT_SET or run any tests; from your post, that's just what I'd suspect first.
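For reference, what COLLECT_SET(evt.event_id) computes per GROUP BY key is simply the set of distinct values seen in each group. A rough Python analogue (the sample rows are made up for illustration; Hive does not guarantee element order):

```python
# Per-group distinct-value aggregation, analogous to
# SELECT session_key, COLLECT_SET(event_id) ... GROUP BY session_key.
from collections import defaultdict

rows = [  # (session_key, event_id) pairs, illustrative only
    (1, 274), (1, 274), (1, 55),
    (2, 284),
]
event_set = defaultdict(set)
for session_key, event_id in rows:
    event_set[session_key].add(event_id)   # duplicates collapse into the set

print(dict(event_set))                     # {1: {274, 55}, 2: {284}}
```

Since the result is bounded by the number of distinct event_ids per group, a healthy COLLECT_SET over four event types should stay tiny; unbounded growth points at the join producing duplicate-exploded input rather than at the aggregate itself.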