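I am running a Hadoop cluster of 24 servers. It had been running for several months, but since the last restart the DataNodes keep dying with the following in their logs: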
2016-02-05 11:35:56,615 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 21719288540
2016-02-05 11:35:56,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 22149605332
2016-02-05 11:35:56,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 22460210591
2016-02-05 11:35:57,359 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 22978732747
2016-02-05 11:35:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 23063230631
2016-02-05 11:36:00,295 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 26044953281
2016-02-05 11:36:00,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 26288883806
2016-02-05 11:36:01,371 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/
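Every time I restart the cluster, it comes up and all nodes are live. But after running MapReduce jobs for a while, some of the nodes die with this error. The nodes that die are different each time.
Do you have any idea what is happening? I am using Hadoop 2.4.1 and, as I said, the cluster had been running for months without problems.
Thanks.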