Question

I have set up Hadoop 2.7.2 on a cluster with one master (Ubuntu 15.10) and two slaves (slave2 and slave3), which run as virtual machines hosted on the master.

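I have run several examples such as wordcount, and they all work fine. But when I run my own job, say Myjob, it starts out fine and then, after a while, is invariably interrupted by this error: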

INFO ipc.Client: Retrying connect to server: slave3/xxx.216.227.176(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

Sometimes it is slave2, sometimes slave3. My ssh connection to that slave then shows the connection is closed by remote.

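However, VirtualBox shows that the slave is still running fine, and I can log back in to it, but all the Hadoop processes on it have been killed. I should mention that my own job runs much longer than the example jobs.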
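
To tell "the VM is unreachable" apart from "the VM is up but the Hadoop daemons are gone", the relevant TCP ports on the slave can be probed from the master. The following is a minimal sketch, not part of my original setup; the helper names `port_is_open` and `probe` are made up for illustration, and the ports in the usage comment (22 for sshd, 8042 for the NodeManager web UI, the Hadoop 2.x default) are assumptions that may differ in other configurations.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe(host, ports):
    """Probe each named port on host; return {name: reachable} as a dict."""
    return {name: port_is_open(host, port) for name, port in ports.items()}


# Hypothetical usage from the master (hostname and ports are assumptions):
#   probe("slave3", {"ssh": 22, "nodemanager_web": 8042})
```

If ssh answers while the Hadoop port does not, that would match what I observed: the machine is alive but the daemons have been killed.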

At first I thought this was caused by a mistake in my configuration files, so I reinstalled Hadoop on the master and the slaves. But the error persisted.

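Then I thought it might be caused by the network configuration on the slave nodes, so I changed the last field of the slave's IP, e.g. from xxx.xxx.xxx.183 to xxx.xxx.xxx.176, and reinstalled Hadoop.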

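I started the job again, and this time it ran longer than usual. But in the end, when the map phase was mostly finished (map 86% reduce 28%), it failed with the same error!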

INFO ipc.Client: Retrying connect to server: slave3/125.xxx.227.xxx(the ip of slave):38046. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)

There are also some entries in the log yarn-user-resourcemanager-Master.log:

java.net.ConnectException: Call From Master/xxx.216.227.186 to slave2:44592 failed on connection exception: java.net.ConnectException: refuse to connect; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

It seems that the longer the application runs, the more likely it is to fail.
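
To check which slave and port the retries actually hit, the "Retrying connect to server" messages can be tallied from the job output. This is only a sketch: the regular expression is modelled on the two log lines quoted above, and `RETRY_RE` and `count_retries` are illustrative names, not anything provided by Hadoop.

```python
import re
from collections import Counter

# Matches e.g. "Retrying connect to server: slave3/<ip>:38046. Already tried 2 time(s)".
RETRY_RE = re.compile(
    r"Retrying connect to server:\s*(?P<host>[^/\s]+)/[^:]+?:(?P<port>\d+)\.\s+"
    r"Already tried (?P<tried>\d+) time\(s\)"
)


def count_retries(lines):
    """Count 'Retrying connect' messages per (host, port) pair."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts
```

Feeding it the lines of a saved job log (for example `count_retries(open(path))`, with `path` being wherever the output was saved) gives a per-node count that can be compared across runs.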

This is my hosts file:

127.0.0.1       localhost
#127.0.1.1      Master
xxx.216.227.186 Master
xxx.216.227.185 slave1 # the slave1 has some problem thus do not connect to the cluster
xxx.216.227.176 slave2
xxx.216.227.166 slave3

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
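
A commonly cited cause of ConnectionRefused errors in Hadoop is a cluster hostname that maps to a loopback address such as 127.0.1.1, which is why that line is commented out in the file above. The sketch below shows one way to check a hosts file for this on every node; `parse_hosts` and `loopback_names` are hypothetical helpers written for this post, and masked addresses like `xxx.216.227.186` are simply skipped because they cannot be parsed.

```python
import ipaddress


def parse_hosts(text):
    """Map each hostname to the addresses an /etc/hosts-style text assigns it."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop full-line and inline comments
        if not line:
            continue
        addr, *names = line.split()
        for name in names:
            mapping.setdefault(name, []).append(addr)
    return mapping


def loopback_names(mapping, cluster_names):
    """Return the cluster hostnames that map to a loopback address."""
    bad = []
    for name in cluster_names:
        for addr in mapping.get(name, []):
            try:
                if ipaddress.ip_address(addr).is_loopback:
                    bad.append(name)
                    break
            except ValueError:
                # Masked or malformed addresses cannot be classified; skip them.
                continue
    return bad
```

Running `loopback_names(parse_hosts(open("/etc/hosts").read()), ["Master", "slave2", "slave3"])` on each machine should return an empty list if none of the cluster names resolve to loopback.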

Why does this happen, and how can I fix it? Thanks!

Why does a slave node suddenly lose its connection to the master in Hadoop?

0 answers: