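I'm running into a very strange error when restoring HBase from an S3 snapshot onto an Amazon EMR cluster.
Sometimes the HBase restore works fine and other times it doesn't, which is the puzzling part. It doesn't seem to depend on my instance type or the number of nodes. It happens sporadically, I can't pin down what exactly is failing (other than being unable to obtain the master lock and timing out), and every attempt to Google the problem has come up empty...
My workflow is as follows: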
Master: 1x M1.XL
Core: 15x M1.XL *spot instances*
Bootstrap: Setup HBase (s3://elasticmapreduce/bootstrap-actions/setup-hbase)
Step 1: Start HBase (/home/hadoop/lib/hbase.jar emr.hbase.backup.Main --start-master)
Step 2: Restore HBase (/home/hadoop/lib/hbase.jar emr.hbase.backup.Main --restore --backup-dir s3://mybackupdir --backup-version mybackupversion)
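During the restore step, the restore either fails or succeeds (seemingly at random, but I believe there may be a timeout/latency issue here).
The restores that time out seem to keep retrying to lock the master, and after 10 minutes of failures the step fails: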
2014-04-02 17:46:57,028 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-02 17:46:57,029 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-02 17:47:07,039 INFO emr.hbase.backup.Main (main): Unable to obtain master lock, attempting to shutdown master. java.lang.RuntimeException: Timeout while performing operation, expireTime=1396460818029 msg=obtaining write lock waiting for notification
2014-04-02 17:47:07,039 INFO emr.hbase.backup.HBaseConnector (main): Listing nodes at beginning of shutdown
2014-04-02 17:47:07,039 INFO emr.hbase.backup.HBaseConnector (main): Get master
2014-04-02 17:47:07,043 INFO emr.hbase.backup.ZooKeeperConnection (main-EventThread): Event received WatchedEvent state:SyncConnected type:None path:null
2014-04-02 17:48:18,370 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-02 17:48:18,370 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-02 17:48:18,370 INFO emr.hbase.backup.Main (main): Releasing the lock
2014-04-02 17:48:18,374 FATAL emr.hbase.backup.Main (main): Exception raised in main
java.lang.RuntimeException: Timeout while performing operation, expireTime=1396460846811 msg=Attempting to shutdown master
at emr.hbase.fs.Utils.throwIfExpired(Utils.java:67)
at emr.hbase.backup.PerformBackup.restore(PerformBackup.java:201)
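On the other hand, when it works, the restore takes only about 3 minutes to start, despite timing out a few times while locking the master: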
2014-04-01 19:29:43,720 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-01 19:29:53,730 INFO emr.hbase.backup.Main (main): Unable to obtain master lock, attempting to shutdown master. java.lang.RuntimeException: Timeout while performing operation, expireTime=1396380584720 msg=obtaining write lock waiting for notification
2014-04-01 19:29:53,730 INFO emr.hbase.backup.HBaseConnector (main): Listing nodes at beginning of shutdown
2014-04-01 19:29:53,731 INFO emr.hbase.backup.HBaseConnector (main): Get master
2014-04-01 19:29:53,734 INFO emr.hbase.backup.ZooKeeperConnection (main-EventThread): Event received WatchedEvent state:SyncConnected type:None path:null
2014-04-01 19:30:32,963 WARN emr.hbase.backup.HBaseConnector (main): Master is not running, proceeding
2014-04-01 19:30:32,963 INFO emr.hbase.backup.Main (main): Attempting to aquire the master lock
2014-04-01 19:30:33,028 INFO emr.hbase.backup.Main (main): Distributed copy from s3://myhbasebackup
2014-03-17 16:30:14,502 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2014-03-17 16:30:22,645 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0%
2014-03-17 16:30:33,753 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 1%
2014-03-17 16:30:36,778 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 2%
2014-03-17 16:30:39,809 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 5%
2014-03-17 16:30:40,817 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 8%
*...and it works...*
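To compare the two runs, it helps to measure where the time goes between log lines. In the excerpts above, the wait after the ZooKeeper `SyncConnected` event is about 71 seconds in the failing run (17:47:07,043 to 17:48:18,370) versus about 39 seconds in the successful one (19:29:53,734 to 19:30:32,963). Below is a minimal Python sketch for extracting those gaps from a step log. The function names are made up for illustration, and it assumes that `expireTime` is milliseconds since the Unix epoch, which the logs do not confirm.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the leading "YYYY-MM-DD HH:MM:SS,mmm " timestamp of a log line.
LOG_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) ")
EXPIRE = re.compile(r"expireTime=(\d+)")


def parse_ts(line):
    """Return the line's timestamp as a naive datetime, or None if it has none."""
    m = LOG_TS.match(line)
    if m is None:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base + timedelta(milliseconds=int(m.group(2)))


def gaps(lines):
    """Return (seconds_since_previous_timestamped_line, line) pairs."""
    out = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # stack-trace lines and other untimestamped output
        delta = 0.0 if prev is None else (ts - prev).total_seconds()
        out.append((delta, line))
        prev = ts
    return out


def expire_time_utc(line):
    """Convert expireTime to UTC, ASSUMING it is epoch milliseconds."""
    m = EXPIRE.search(line)
    if m is None:
        return None
    ms = int(m.group(1))
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)
```

Sorting the output of `gaps()` by the first element shows the longest stalls first, which makes it easier to line up a failing run against a successful one.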
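I know there are some timeout parameters I could change, such as the ZooKeeper timeout, but I'm not sure the timeout limits are actually the problem, because I have seen this literally fail once and then work when I retried with exactly the same settings.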
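Since a plain retry with identical settings often succeeds, one stopgap is to wrap the restore in a retry loop. The following is only a sketch of that workaround, not a fix for the underlying lock timeout. The `run_with_retries` helper is hypothetical, and the example command in the comment assumes the restore can be launched with `hadoop jar` from the master node, which I have not verified.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_seconds=60,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with status 0 or the attempts are used up.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. `runner` and `sleep` are
    injectable so the loop can be exercised without a real cluster.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the master and ZooKeeper time to settle before retrying.
            sleep(delay_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")


# Hypothetical invocation (ASSUMPTION: the step can be run this way):
# run_with_retries([
#     "hadoop", "jar", "/home/hadoop/lib/hbase.jar",
#     "emr.hbase.backup.Main", "--restore",
#     "--backup-dir", "s3://mybackupdir",
#     "--backup-version", "mybackupversion",
# ])
```

This hides the symptom rather than explaining it, so I would still like to understand why the master lock cannot be obtained in the failing runs.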
Any help is appreciated! Thanks!