Hadoop: How are reducer nodes selected?

Asked: 2014-09-08 16:54:58

Tags: hadoop mapreduce

I have just started learning Hadoop, and I don't understand how a datanode becomes a reducer node.

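  • After a map task finishes, the contents of its sort buffer are flushed to local disk, once the KV pairs have been sorted and partitioned.
  • The JobTracker is then notified about the spilled partitions.
  • After that, the reducers start requesting the data of a particular partition.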

How does the JobTracker decide which node becomes a reducer node? I am reading Hadoop: The Definitive Guide, but the book does not mention this step.

Thanks, Bruckwald

1 answer:

Answer 0 (score: 6)

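It is pretty much first-come, first-served. Tasks are assigned through heartbeats, so whenever a TaskTracker pings the JobTracker, it gets a response that may contain new tasks to run: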

List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
if (tasks == null ) {
   tasks = taskScheduler.assignTasks(taskTrackerStatus);
}
if (tasks != null) {
   for (Task task : tasks) {
     expireLaunchingTasks.addNewTask(task.getTaskID());
     LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
     actions.add(new LaunchTaskAction(task));
   }
}

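That is the relevant source code of the JobTracker. So besides which TaskTracker comes first, the task scheduler also checks resource conditions (for example, whether there is a free slot, or whether a single node is already overloaded).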

The relevant scheduler code follows (it is not particularly exciting):

//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity = 
  Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity), 
           trackerReduceCapacity);
final int availableReduceSlots = 
  Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
  exceededReducePadding = exceededPadding(false, clusterStatus, 
                                          trackerReduceCapacity);
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() != JobStatus.RUNNING ||
          job.numReduceTasks == 0) {
        continue;
      }

      Task t = job.obtainNewReduceTask(taskTracker, numTaskTrackers,
          taskTrackerManager.getNumberOfUniqueHosts());
      if (t != null) {
        assignedTasks.add(t);
        break;
      }

      // Don't assign reduce tasks to the hilt!
      // Leave some free slots in the cluster for future task-failures,
      // speculative tasks etc. beyond the highest priority job
      if (exceededReducePadding) {
        break;
      }
    }
  }
}
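
To make the slot arithmetic in that snippet concrete, here is a minimal standalone sketch. Only the two `Math.min` expressions are copied from the code above. The class name `ReduceSlotMath`, the method name, and the sample numbers are made up for illustration, and `reduceLoadFactor` is assumed to be a value between 0 and 1 describing how much of the cluster's reduce capacity is still needed.

```java
// Hypothetical helper, not part of Hadoop: it only replays the two
// Math.min expressions from the scheduler snippet with plain numbers.
final class ReduceSlotMath {

    static int availableReduceSlots(double reduceLoadFactor,
                                    int trackerReduceCapacity,
                                    int trackerRunningReduces) {
        // Scale the tracker's capacity by the load factor, capped at the
        // tracker's configured reduce capacity.
        int trackerCurrentReduceCapacity = Math.min(
            (int) Math.ceil(reduceLoadFactor * trackerReduceCapacity),
            trackerReduceCapacity);
        // Never more than 1 reduce task per heartbeat; a result <= 0
        // means nothing is assigned on this heartbeat.
        return Math.min(trackerCurrentReduceCapacity - trackerRunningReduces, 1);
    }

    public static void main(String[] args) {
        System.out.println(availableReduceSlots(0.5, 4, 1)); // prints 1
        System.out.println(availableReduceSlots(0.5, 4, 2)); // prints 0
        System.out.println(availableReduceSlots(1.0, 4, 0)); // prints 1
    }
}
```

Note that even a completely idle tracker with four slots gets at most one reduce task per heartbeat, which is what the comment in the scheduler code says.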

Basically, the first TaskTracker that sends a heartbeat to the JobTracker and has enough free slots gets the reduce task.
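
As a rough illustration of that first-come, first-served behaviour, here is a toy simulation. It is not Hadoop code: `HeartbeatDemo`, `Tracker`, `onHeartbeat`, and the tracker and task names are all hypothetical, and the slot check is simplified to "running reduces below capacity" without the load factor or padding logic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model of heartbeat-driven assignment; every name here is hypothetical.
final class HeartbeatDemo {

    static final class Tracker {
        final String name;
        final int reduceCapacity;
        int runningReduces;

        Tracker(String name, int reduceCapacity, int runningReduces) {
            this.name = name;
            this.reduceCapacity = reduceCapacity;
            this.runningReduces = runningReduces;
        }
    }

    // Handles one heartbeat: hands out at most one pending reduce task,
    // and only if the calling tracker still has a free reduce slot.
    static String onHeartbeat(Tracker tracker, Deque<String> pendingReduces) {
        if (pendingReduces.isEmpty()
                || tracker.runningReduces >= tracker.reduceCapacity) {
            return null;
        }
        tracker.runningReduces++;
        return pendingReduces.poll();
    }

    static List<String> simulate() {
        Deque<String> pending =
            new ArrayDeque<>(Arrays.asList("reduce_0", "reduce_1"));
        Tracker a = new Tracker("trackerA", 1, 1); // already full
        Tracker b = new Tracker("trackerB", 2, 0);
        Tracker c = new Tracker("trackerC", 2, 0);

        List<String> log = new ArrayList<>();
        // Heartbeats arrive in the order A, B, C, B.
        for (Tracker t : new Tracker[] {a, b, c, b}) {
            String task = onHeartbeat(t, pending);
            if (task != null) {
                log.add(t.name + " <- " + task);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        simulate().forEach(System.out::println);
    }
}
```

In this run trackerA is skipped because it has no free slot, trackerB and trackerC each receive one reduce task in heartbeat order, and trackerB's second heartbeat gets nothing because no reduce tasks are left.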