Question

I want to use the distributed cache so that my mappers can access some data. In main, I'm using the command

DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

where /user/peter/cacheFile/testCache1 is a file that exists in HDFS.

Then, my setup function looks like this:

public void setup(Context context) throws IOException, InterruptedException{
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    //etc
}

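However, this localFiles array is always null.

I was initially running on a single-host cluster for testing, but I read that this prevents the distributed cache from working. I then tried a pseudo-distributed setup, but that didn't work either.

I'm using hadoop 1.0.3

Thanks, Peter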

Answer 1

The problem was that I was doing the following:

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

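Since the Job constructor makes an internal copy of the conf instance, adding the cache file to conf afterwards has no effect on the job. Instead, I should do this: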

Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");

And now it works. Thanks to Harsh on the hadoop user list for the help.
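To make the copy behaviour concrete, here is a toy model in plain Java. ToyConfiguration and ToyJob are hypothetical stand-ins written for this illustration, not Hadoop's real classes; the only thing they are assumed to share with Hadoop is the behaviour described above, namely that the job keeps its own copy of the configuration it was given.

```java
import java.util.HashMap;
import java.util.Map;

class CopySemanticsDemo {
    // Hypothetical stand-in for a configuration object (not Hadoop's Configuration).
    static class ToyConfiguration {
        private final Map<String, String> props = new HashMap<>();

        ToyConfiguration() {}

        // Copy constructor: the new object gets its own independent map.
        ToyConfiguration(ToyConfiguration other) {
            props.putAll(other.props);
        }

        void set(String key, String value) { props.put(key, value); }

        String get(String key) { return props.get(key); }
    }

    // Hypothetical stand-in for a job that copies the configuration it receives.
    static class ToyJob {
        private final ToyConfiguration conf;

        ToyJob(ToyConfiguration conf) {
            this.conf = new ToyConfiguration(conf);
        }

        ToyConfiguration getConfiguration() { return conf; }
    }

    public static void main(String[] args) {
        ToyConfiguration conf = new ToyConfiguration();
        conf.set("before", "x");                   // set before the job is built: copied
        ToyJob job = new ToyJob(conf);
        conf.set("after", "y");                    // set on the caller's object: too late
        job.getConfiguration().set("viaJob", "z"); // set on the job's own copy: visible

        System.out.println(job.getConfiguration().get("before")); // x
        System.out.println(job.getConfiguration().get("after"));  // null
        System.out.println(job.getConfiguration().get("viaJob")); // z
    }
}
```

In this model there are two ways to make a setting reach the job: set it before constructing the job, or set it on the object returned by getConfiguration().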

Answer 2

Configuration conf = new Configuration();  
Job job = new Job(conf, "wordcount");
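DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());

You can also do it this way: add the cache file to the job's own configuration, obtained from job.getConfiguration().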

Answer 3

Once the Job has been created from a configuration object, i.e.

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");

and you then change properties on conf as shown below, e.g.

conf.set("demiliter","|");

or

DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);

such changes are not reflected in a pseudo-distributed cluster or a real cluster, although they do work in a local environment.

Answer 4

This version of the code (slightly different from the constructs mentioned above) has always worked for me.

//in main(String [] args)
Job job = new Job(conf,"Word Count"); 
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());

I don't see the complete setup() function in your Mapper code. Mine looks like this:

public void setup(Context context) throws IOException, InterruptedException {

    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);

    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);

    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    // now one can use BufferedReader's readLine() to read data

}
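As a follow-up to the readLine() comment, here is a rough sketch of turning the cached file into an in-memory lookup table. The class name CacheFileParser and the tab-separated key/value line format are assumptions made for this example; the thread does not say what testCache1 contains. The method takes a plain java.io.Reader so it can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

class CacheFileParser {
    // Reads "key<TAB>value" lines into a map. The line format is an assumption.
    static Map<String, String> parse(Reader source) {
        Map<String, String> lookup = new HashMap<>();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that have no tab separator
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = parse(new StringReader("a\t1\nmalformed\nb\t2\n"));
        System.out.println(lookup.get("a")); // 1
        System.out.println(lookup.size());   // 2
    }
}
```

Inside setup(), the StringReader would be replaced by the same new InputStreamReader(fs.open(dataFile[0])) used above. Note that this sketch needs Java 8 or later (UncheckedIOException), which may be newer than what a Hadoop 1.0.3 installation runs on.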

Accessing files in the Hadoop distributed cache
