I am running Hadoop on a cluster of 5 machines (1 master and 4 slaves). I am running a map-reduce algorithm for friends-in-common recommendation, using a file with 49,995 lines (one person per line, followed by his friends).
The problem is that it takes more time to execute the algorithm on the cluster than on a single machine!
I don't know if this is normal because the file is not big enough (so the latency between the machines slows things down), or whether I have to change something so that the algorithm runs in parallel on the different nodes, but I thought that was done automatically.
Normally, running the algorithm on a single machine takes:
real 3m10.044s
user 2m53.766s
sys 0m4.531s
On the cluster it takes:
real 3m32.727s
user 3m10.229s
sys 0m5.545s
Here is the output when I run the start-all.sh script on the master:
ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-namenode-ip-172-31-37-184.out
slave1: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave1.out
slave2: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave2.out
slave3: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave3.out
slave4: starting datanode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-datanode-slave4.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0/logs/hadoop-ubuntu-secondarynamenode-ip-172-31-37-184.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-resourcemanager-ip-172-31-37-184.out
slave4: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave4.out
slave1: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave1.out
slave3: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave3.out
slave2: starting nodemanager, logging to /usr/local/hadoop-2.6.0/logs/yarn-ubuntu-nodemanager-slave2.out
And here is the output when I run the stop-all.sh script:
ubuntu@ip:/usr/local/hadoop-2.6.0$ sbin/stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [master]
master: stopping namenode
slave4: no datanode to stop
slave3: stopping datanode
slave1: stopping datanode
slave2: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
slave2: no nodemanager to stop
slave3: no nodemanager to stop
slave4: no nodemanager to stop
slave1: no nodemanager to stop
no proxyserver to stop
Thanks in advance!
Answer 0 (score: 0)
One possible cause is that your file is not uploaded to HDFS. In other words, it is stored on a single machine, and all the other machines have to fetch the data from it. Before running your mapreduce program, you can do the following steps:
1- Make sure HDFS is up and running. Open the link master:50070, where master is the IP of the node running the namenode, and check on that page that all the nodes are live and running. So if you have 4 datanodes you should see: datanodes (4 live).
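If you prefer the command line to the web UI, the same check can be done with `hdfs dfsadmin -report`, a standard HDFS admin command; the expected count of 4 matches the cluster described in the question:

```shell
# Print the HDFS status report; it lists every live datanode
# along with its capacity and last-contact time.
hdfs dfsadmin -report

# Each datanode entry in the report starts with "Name:", so this
# counts the datanodes (should print 4 on this cluster):
hdfs dfsadmin -report | grep -c "^Name:"
```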
2- Run:
hdfs dfs -put yourfile /someFolderOnHDFS/yourfile
This uploads your input file to HDFS, so the data is now distributed across multiple nodes.
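To confirm the blocks of the uploaded file really are spread across the datanodes, you can inspect it with `hdfs fsck` (the path below reuses the placeholder /someFolderOnHDFS/yourfile from step 2):

```shell
# Show the file's blocks and which datanodes hold a replica of each one.
hdfs fsck /someFolderOnHDFS/yourfile -files -blocks -locations
```

If every block lists several datanode locations, the input is distributed and the map tasks can read their splits from local disks instead of pulling everything from one machine.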
Now try running your program and see if it is faster.
Good luck!