We received an error report from a client: an Oozie job was failing with an out-of-memory error. The Oozie workflow has three or four actions, one of which is a Hive action.
The Hive action performs a join, which internally does a full table scan. Because of maintenance activity on the client side the regular purge did not run, data accumulated for a few days, and the Hive action ended up scanning those extra days of data.
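To give an idea of the shape of the query, here is only a rough sketch with made-up table and column names (the real query uses the alias ntwk that appears in the log below, but we cannot share it here):

-- Hypothetical sketch of the failing join; names are invented for illustration.
-- The key point is that there is no date/partition filter, so every day of data
-- accumulated under /data/csv/7342/ gets scanned.
SELECT ntwk.msisdn, ref.subscriber_name
FROM network_usage ntwk
JOIN subscriber_ref ref
  ON ntwk.msisdn = ref.msisdn;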
Below is the error stack trace:
2018-06-15 00:54:28,977 INFO org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://xxx:8020/data/csv/7342/2018-06-14/17/1/Network_xxx.dat
2018-06-15 00:54:29,005 INFO org.apache.hadoop.hive.ql.exec.MapOperator: Processing alias ntwk for file hdfs://xxx:8020/data/csv/7342/2018-06-14/17/1
2018-06-15 00:55:04,029 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 finished. closing...
2018-06-15 00:55:04,129 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 forwarded 6672342 rows
2018-06-15 00:55:04,266 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 finished. closing...
2018-06-15 00:55:04,266 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 forwarded 0 rows
2018-06-15 00:55:04,513 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 2 finished. closing...
2018-06-15 00:55:04,538 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 2 forwarded 0 rows
2018-06-15 00:55:04,563 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 3 Close done
2018-06-15 00:55:04,589 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2018-06-15 00:55:04,616 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 1 finished. closing...
2018-06-15 00:55:04,641 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 1 forwarded 6672342 rows
2018-06-15 00:55:04,666 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 0 finished. closing...
2018-06-15 00:55:04,691 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 0 forwarded 0 rows
2018-06-15 00:55:04,716 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 1 Close done
2018-06-15 00:55:04,741 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 7 Close done
2018-06-15 00:55:04,792 INFO ExecMapper: ExecMapper: processed 6672342 rows: used memory = 412446808
2018-06-15 00:55:10,316 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2018-06-15 00:55:10,852 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.io.compress.DecompressorStream.<init>(DecompressorStream.java:50)
at org.apache.hadoop.io.compress.BlockDecompressorStream.<init>(BlockDecompressorStream.java:50)
at org.apache.hadoop.io.compress.SnappyCodec.createInputStream(SnappyCodec.java:173)
at org.apache.hadoop.hive.ql.io.RCFile$Reader.nextKeyBuffer(RCFile.java:1447)
at org.apache.hadoop.hive.ql.io.RCFile$Reader.next(RCFile.java:1602)
at org.apache.hadoop.hive.ql.io.RCFileRecordReader.next(RCFileRecordReader.java:98)
at org.apache.hadoop.hive.ql.io.RCFileRecordReader.next(RCFileRecordReader.java:85)
at org.apache.hadoop.hive.ql.io.RCFileRecordReader.next(RCFileRecordReader.java:39)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:101)
at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.doNext(CombineHiveRecordReader.java:41)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.doNextWithExceptionHandler(HadoopShimsSecure.java:329)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.next(HadoopShimsSecure.java:247)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:215)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:200)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
We got statistics about the data volume from the customer. It seems they receive about 300 MB of data per day. The failing Hive query had to process three days of data, so close to 1 GB in total, roughly 30 million records across the three days.
We tried to reproduce the same error in a lab setup: we loaded 100 million rows of simulated data (about 5 GB in total, spread over five days), yet the Hive query and the MapReduce jobs behind it still ran without any problem.
We are not sure why, with the same map JVM options, we cannot reproduce the OutOfMemory error. Note that we do not have a dump of the customer's data; we are working with simulated data only.
Why might we not be hitting the same problem as the customer, even though we increased the data volume five-fold? We cannot figure out the reason.
Below is the configuration:
mapred.map.child.java.opts : -Xmx512M
mapred.job.reduce.memory.mb : -1
mapred.job.map.memory.mb : -1
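For completeness, these limits can also be inspected or overridden per session from the Hive script itself; the snippet below is only a sketch of the syntax (the 1024M value is an arbitrary example, not something we have validated against the customer's data):

-- Show the currently effective mapper JVM options (both the customer's job and
-- our lab runs use -Xmx512M, as listed above).
SET mapred.map.child.java.opts;
-- Hypothetical per-session override of the mapper heap, shown only to illustrate
-- where the limit is applied.
SET mapred.map.child.java.opts=-Xmx1024M;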