I have the following table:
hive> describe tv_counter_stats;
OK
day string
event string
query_id string
userid string
headers string
I want to run the following query:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, this query fails, while the following query works fine:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
where I have removed the collect_set call. So my question is: does anyone know why collect_set might fail in this case?
Update: added the error message:
Diagnostic Messages for this Task:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 10.49 sec HDFS Read: 109136387 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 10 seconds 490 msec
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
Update 2: I changed the query so that it looks like this:
hive -e '
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC";
SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, I now get the following error:
Diagnostic Messages for this Task:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
Answer 0 (score: 1)
This is probably a memory issue, since collect_set aggregates data in memory. Try increasing the heap size and enabling the concurrent GC, by setting the Hadoop mapred.child.java.opts option to e.g. -Xmx1g -XX:+UseConcMarkSweepGC.
This answer has more information about the "GC overhead limit" error.
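One pitfall worth checking against Update 2 above (this is my assumption, not something the answer states): Hive passes the value of a SET command through literally, so the surrounding double quotes can end up on the child JVM's command line and make the task exit with status 1. A minimal sketch of the same setting without quotes:

```sql
-- Sketch (assumption): set the child JVM options without surrounding quotes,
-- since Hive forwards the SET value verbatim to the task JVM.
SET mapred.child.java.opts=-server -Xmx1g -XX:+UseConcMarkSweepGC;
```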
Answer 1 (score: 1)
I ran into the same problem and came across this question, so I thought I would share the solution I found.
The underlying problem is most likely that Hive is trying to do map-side aggregation, and the heuristics it uses to manage the in-memory hash map for that approach get thrown off by data that is "wide but shallow" - i.e., in your case, if there are very few userid values per day/event/query_id group.
I found an article explaining various ways to address this, but most of them are just refinements of the full nuclear option: disabling map-side aggregation entirely.
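For reference, the softer tuning knobs typically shrink the map-side hash map rather than disabling it; the exact parameter names and defaults below are my assumption and should be verified against your Hive version:

```sql
-- Hypothetical tuning sketch: limit the map-side aggregation hash map
-- instead of turning map-side aggregation off outright.
SET hive.map.aggr.hash.percentmemory=0.25;   -- fraction of mapper heap the hash map may use
SET hive.map.aggr.hash.min.reduction=0.5;    -- fall back to plain forwarding if aggregation barely reduces rows
```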
Using set hive.map.aggr = false;
should fix the problem.
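Putting that together with the original query, the full invocation might look like this (a sketch, reusing the table and column names from the question):

```sql
-- Disable map-side aggregation so the COLLECT_SET work happens in the
-- reducers, bypassing the mapper's in-memory hash-map heuristics.
SET hive.map.aggr=false;

SELECT
  day,
  event,
  query_id,
  COUNT(1) AS count,
  COLLECT_SET(userid)
FROM tv_counter_stats
GROUP BY day, event, query_id;
```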