Running Hadoop in fully-distributed mode on a 5-machine cluster takes more time than on a single machine

Date: 2015-03-22 08:34:31

Tags: algorithm hadoop mapreduce cluster-computing

I am running Hadoop on a cluster of 5 machines (1 master and 4 slaves). I am running a MapReduce algorithm for friends-in-common recommendation, and I am using a file with 49,995 lines (one person per line, followed by his friends).

The problem is that executing the algorithm on the cluster takes more time than executing it on a single machine!

I don't know whether this is normal because the file is not big enough (so the latency between machines slows things down), or whether I have to change something to make the algorithm run in parallel across the different nodes, but I thought that was done automatically.

Typically, running the algorithm on a single machine takes:

   real 3m10.044s
   user 2m53.766s
   sys  0m4.531s

while on the cluster it takes:

    real    3m32.727s
    user    3m10.229s
    sys 0m5.545s

Here is the output when I execute the start-all.sh script on the master:

    ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/start-all.sh 
    This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
    Starting namenodes on [master]
    master: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-namenode-ip-172-31-37-184.out
    slave1: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave1.out
    slave2: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave2.out
    slave3: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave3.out
    slave4: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave4.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-37-184.out
    starting yarn daemons
    starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-resourcemanager-ip-172-31-37-184.out
    slave4: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave4.out
    slave1: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave1.out
    slave3: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave3.out
    slave2: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave2.out

And here is the output when I execute the stop-all.sh script:

   ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/stop-all.sh 
   This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
   Stopping namenodes on [master]
   master: stopping namenode
   slave4: no datanode to stop
   slave3: stopping datanode
   slave1: stopping datanode
   slave2: stopping datanode
   Stopping secondary namenodes [0.0.0.0]
   0.0.0.0: stopping secondarynamenode
   stopping yarn daemons
   stopping resourcemanager
   slave2: no nodemanager to stop
   slave3: no nodemanager to stop
   slave4: no nodemanager to stop
   slave1: no nodemanager to stop
   no proxyserver to stop

Thanks in advance!

1 Answer:

Answer 0 (score: 0)

One possible reason is that your file was not uploaded to HDFS. In other words, it is stored on a single machine, and all the other machines have to fetch the data from that one machine. Before running your MapReduce program, you can do the following steps:

1- Make sure HDFS is up and running. Open the link http://master:50070, where master is the IP of the node running the namenode, and check on that page that all the nodes are live. So if you have 4 datanodes you should see: datanodes (4 live).
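As an alternative to the web UI, the same liveness check can be done from the command line (a sketch; it assumes the Hadoop 2.6.0 `bin` directory, e.g. /usr/local/hadoop-2.6.0/bin, is on your PATH):

```shell
# Print the datanode summary line, e.g. "Live datanodes (4):".
hdfs dfsadmin -report | grep -E "Live datanodes|Datanodes available"

# List the hostnames of the datanodes that have reported in;
# all four slaves should appear here.
hdfs dfsadmin -report | grep "^Hostname:"
```

If fewer than 4 datanodes show up as live (as the `slave4: no datanode to stop` line in your output suggests may be the case), the missing nodes hold no data and do no map work.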

2- Call:

    hdfs dfs -put yourfile /someFolderOnHDFS/yourfile

This way you upload your input file to HDFS, and the data is now distributed across multiple nodes.
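A sketch of the upload plus a check that the blocks really landed on the datanodes (the paths are the placeholder names from the answer; substitute your own):

```shell
# Create the target directory and upload the input file.
hdfs dfs -mkdir -p /someFolderOnHDFS
hdfs dfs -put yourfile /someFolderOnHDFS/yourfile

# Confirm the upload.
hdfs dfs -ls /someFolderOnHDFS

# Show which datanodes hold each block of the file.
hdfs fsck /someFolderOnHDFS/yourfile -files -blocks -locations
```

Note that a 49,995-line text file is almost certainly smaller than the default HDFS block size (128 MB in Hadoop 2.6), so it occupies a single block and the job gets a single input split. That alone can explain the timings: with essentially one map task, the cluster adds scheduling and network overhead without adding parallelism.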

Now try running your program and see whether it is faster.
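Submitting the job against the HDFS path might then look like the following (the jar name, main class, and output path are hypothetical; use your own job's names):

```shell
# Run the MapReduce job reading from HDFS and writing to a fresh output dir.
hadoop jar friend-recommender.jar FriendRecommender \
    /someFolderOnHDFS/yourfile /someFolderOnHDFS/output

# Inspect the first results once the job finishes.
hdfs dfs -cat /someFolderOnHDFS/output/part-r-00000 | head
```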

Good luck!