hadoop - definitive guide - why is a block in hdfs so large

Time: 2017-04-09 23:39:23

Tags: hadoop mapreduce

I came across the following paragraph in the Definitive Guide (HDFS Concepts - Blocks) and could not understand it.

Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

I am wondering how the jobs would be slower when there are fewer tasks than nodes in the cluster. Say there are 1000 nodes in the cluster and 3 tasks (by tasks I mean blocks, since each block is sent to a node as a single task); wouldn't the time it takes to get the result always be less than in the scenario with, say, 1000 nodes and 1000 tasks?

I'm not convinced by the paragraph given in the Definitive Guide.
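For concreteness, here is roughly how I picture the relationship between block size and task count (made-up numbers, not actual Hadoop code; I'm assuming the usual one-map-task-per-block behaviour):

```java
// Hypothetical illustration: with one map task per HDFS block (the common
// case with the default input formats), the block size sets the task count.
public class TasksPerBlockSize {
    public static void main(String[] args) {
        long fileSize = 1L << 40;                               // 1 TB of input (made-up)
        long[] blockSizes = {128L << 20, 1L << 30, 256L << 30}; // 128 MB, 1 GB, 256 GB

        for (long blockSize : blockSizes) {
            // Ceiling division: a short last block still gets its own task.
            long mapTasks = (fileSize + blockSize - 1) / blockSize;
            System.out.printf("block size %,d bytes -> %,d map tasks%n",
                    blockSize, mapTasks);
        }
    }
}
```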

1 Answer:

Answer 0 (score: 1):

The paragraph you quoted from the book is essentially saying "use as many nodes as possible." If you have 1000 nodes and only 3 blocks or tasks, only 3 of those nodes are running your tasks while the other 997 nodes do nothing. If you have 1000 nodes and 1000 tasks, and each of those 1000 nodes holds a part of the data, all 1000 nodes will be put to work on your job. You also benefit from data locality, since each node processes its local data first.
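A back-of-the-envelope sketch of that point (hypothetical data size and per-node throughput, not actual Hadoop code): the total input is fixed, so fewer tasks means bigger blocks, more data per task, and most of the cluster sitting idle.

```java
// Hypothetical back-of-the-envelope comparison: 3 tasks vs. 1000 tasks
// on a 1000-node cluster, for the same total amount of input data.
public class Parallelism {
    public static void main(String[] args) {
        double totalDataGb = 1000.0;      // made-up total input size (GB)
        double gbPerSecondPerNode = 0.1;  // made-up per-node scan rate (GB/s)
        int nodes = 1000;

        for (int tasks : new int[]{3, 1000}) {
            int busyNodes = Math.min(tasks, nodes);
            double dataPerTaskGb = totalDataGb / tasks;
            // Busy nodes work in parallel; the job finishes when the slowest
            // (here: any) task finishes scanning its share of the data.
            double seconds = dataPerTaskGb / gbPerSecondPerNode;
            System.out.printf("%4d tasks: %4d nodes busy, %4d idle, ~%.0f s%n",
                    tasks, busyNodes, nodes - busyNodes, seconds);
        }
    }
}
```

With 3 tasks the job can finish no sooner than the time it takes one node to scan a third of the input; with 1000 tasks each node only scans a thousandth of it, so the job completes much sooner even though the total work is the same.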