Question

I would like to learn more about the DistributedCache concept in MapReduce.

In my Mapper class, I wrote the following logic to read the files available in the cache.

    @Override
    protected void setup(Context context) throws IOException,
            InterruptedException {
        super.setup(context);

        // Local paths of the files the job placed in the distributed cache.
        localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (localFiles == null) {
            return; // nothing was cached for this job
        }

        for (Path myfile : localFiles) {
            // Open the localized copy by its full local path, not just its name.
            File file = new File(myfile.toString());
            try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Each line is expected to be "key<TAB>value".
                    String[] arr = line.split("\t");
                    if (arr.length >= 2) {
                        myMap.put(arr[0], arr[1]);
                    }
                }
            }
        }
    }

Can someone tell me when the setup(context) method above is called? Is setup(context) called only once for the whole job, or is it called for every map task that runs?

Answer 1

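It is called once per Mapper task or Reducer task. So if a job spawns 10 mappers or reducers, it is called once for each of those mappers and reducers. As a general guideline, put in this method any work that only needs to happen once per task, for example getting the paths of the distributed cache files or reading parameters passed to the mappers and reducers. The cleanup method works the same way.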
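
To make the call order concrete, here is a minimal plain-Java sketch of the per-task lifecycle. It does not use Hadoop's classes: `LifecycleSketch` and its counters are hypothetical names made up for illustration, and the `run` method only mirrors the order that Hadoop's `Mapper.run()` follows (setup once, map for each record in the split, cleanup once).

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for one map task; NOT a Hadoop class (hypothetical name).
class LifecycleSketch {
    int setupCalls = 0;
    int mapCalls = 0;
    int cleanupCalls = 0;

    // One-time work per task, e.g. loading a cached lookup file into a map.
    void setup() { setupCalls++; }

    // Per-record work.
    void map(String record) { mapCalls++; }

    // One-time teardown per task.
    void cleanup() { cleanupCalls++; }

    // Same order as Mapper.run(): setup, then map per record, then cleanup.
    void run(List<String> split) {
        setup();
        try {
            for (String record : split) {
                map(record);
            }
        } finally {
            cleanup();
        }
    }

    public static void main(String[] args) {
        // Two tasks, each handling its own input split.
        LifecycleSketch task1 = new LifecycleSketch();
        task1.run(Arrays.asList("r1", "r2", "r3"));
        LifecycleSketch task2 = new LifecycleSketch();
        task2.run(Arrays.asList("r4", "r5"));

        System.out.println("task1: setup=" + task1.setupCalls
                + " map=" + task1.mapCalls + " cleanup=" + task1.cleanupCalls);
        System.out.println("task2: setup=" + task2.setupCalls
                + " map=" + task2.mapCalls + " cleanup=" + task2.cleanupCalls);
    }
}
```

Running it prints `task1: setup=1 map=3 cleanup=1` and `task2: setup=1 map=2 cleanup=1`. Each task runs setup and cleanup once regardless of how many records it processes, which is why a cache file loaded in setup is read once per task rather than once per record.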
