I'm running HBase on a CDH 5.7.0 cluster. It ran for several months without any problem, then the hbase service stopped, and now it's impossible to start the HBase master (1 master and 4 region servers).
When I try to start it, at some point the machine hangs, and the last thing I see in the master log is:
2016-10-24 12:17:15,150 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log
2016-10-24 12:17:15,152 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log after 2ms
2016-10-24 12:17:15,177 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log
2016-10-24 12:17:15,179 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log after 2ms
2016-10-24 12:17:15,394 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log
2016-10-24 12:17:15,397 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log after 3ms
2016-10-24 12:17:15,405 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log
2016-10-24 12:17:15,409 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log after 3ms
2016-10-24 12:17:15,414 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: No buffer space available
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:465)
at sun.nio.ch.Net.connect(Net.java:457)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,427 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xxx.xxx.xxx.xxx:50010 for block, add to deadNodes and continue. java.net.SocketException: No buffer space available
java.net.SocketException: No buffer space available
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:465)
at sun.nio.ch.Net.connect(Net.java:457)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,436 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /xxx.xxx.xxx.xxx:50010 for BP-813663273-xxx.xxx.xxx.xxx-1460963038761:blk_1079056868_5316127
2016-10-24 12:17:15,442 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log
2016-10-24 12:17:15,444 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log after 2ms
2016-10-24 12:17:15,669 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log
2016-10-24 12:17:15,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log after 2ms
I'm afraid something in the WALProcedureStore is corrupted, but I don't know how to keep digging to find the problem. Any clues? Can I start the Master fresh, without it trying to load the previous (possibly corrupted) state?
Edit:
I just saw this bug, and I think it's the same issue that's happening to me. Can I safely remove everything under /hbase/MasterProcWALs without deleting the old data stored in HBase?
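A quick way to gauge how many procedure WAL files have accumulated there is the standard HDFS shell (/hbase is the default root dir, adjust to your hbase.rootdir):

# How many files are sitting in the procedure WAL directory?
hdfs dfs -count /hbase/MasterProcWALs
# Equivalent rough count via a listing
hdfs dfs -ls /hbase/MasterProcWALs | wc -l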
Thanks
Answer 0 (score: 1)
The WAL, or Write-Ahead Log, is the HBase mechanism for recovering data modifications when everything crashes. Basically, every write operation to HBase is recorded in the WAL first; if the system crashes while some data is not yet persisted, HBase can recreate those writes from the WAL.
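As a minimal sketch of what such a write looks like from the client side (HBase 1.x Java API, which is what CDH 5.x ships; the table and column names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            // SYNC_WAL is already the default: the edit is appended to the
            // region server's WAL before being placed into the MemStore,
            // so it can be replayed if the server crashes before a flush.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}

Setting Durability.SKIP_WAL instead would skip that safety net, which is exactly what makes a crash unrecoverable.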
This article helped me understand the whole process better:
The WAL is the lifeline that is needed when disaster strikes. Similar to the BIN log in MySQL, it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes, it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails, the whole operation must be considered a failure.
Let's look at a high-level view of how this is done in HBase. First, a client initiates an action that modifies data: currently a call to put(Put), delete(Delete) or incrementColumnValue(). Each of these modifications is wrapped into a KeyValue object instance and sent over the wire using RPC calls. The calls go, ideally batched, to the HRegionServer that serves the affected regions. Once the payload arrives, the said KeyValue is routed to the HRegion that is responsible for the affected row. The data is written to the WAL and then put into the MemStore of the actual Store that holds the record. That also pretty much describes the write path of HBase.
Eventually, when the MemStore reaches a certain size or after a specific time, the data is asynchronously persisted to the file system. In between, the data is stored volatile in memory. If the HRegionServer hosting that memory crashes, the data is lost... and that is exactly what the topic of this post, the WAL, is there for!
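For reference, the thresholds that "a certain size or after a specific time" alludes to are configurable in hbase-site.xml; the values below are the HBase 1.x defaults (128 MB and 1 hour), shown only as an illustration:

<property>
  <!-- MemStore size (bytes) at which a flush to HDFS is triggered -->
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>
</property>
<property>
  <!-- Maximum time (ms) edits may sit in memory before a forced flush -->
  <name>hbase.regionserver.optionalcacheflushinterval</name>
  <value>3600000</value>
</property>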
The problem here was that, because of this bug (HBASE-14712), the WAL ended up holding thousands of logs. Every time the master tried to become active, it had that many different logs to recover, lease and read... it ended up hammering the namenode, using up all the TCP buffer space, and everything crashed.
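To confirm this kind of socket exhaustion on the master host, standard Linux tools give a rough picture (how "No buffer space available" manifests varies by OS, so treat this as a diagnostic sketch):

# Kernel socket usage summary
ss -s
# Number of open TCP connections
netstat -ant | wc -l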
To be able to start the master, I had to manually remove the logs under /hbase/MasterProcWALs and /hbase/WALs. After doing that, the master was able to become active and the HBase cluster came back online.
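The removal itself was along these lines (a sketch assuming the default /hbase root dir; stop HBase first, and be aware that wiping /hbase/WALs throws away any edits that had not yet been flushed to disk):

# With HBase stopped (assumes hbase.rootdir is /hbase)
hdfs dfs -rm -r -skipTrash '/hbase/MasterProcWALs/*'
hdfs dfs -rm -r -skipTrash '/hbase/WALs/*'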