Question

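I am writing a Hadoop application with MrJob and need to use the distributed cache to access some files. I know Hadoop Streaming has a -files option, but I don't know how to access those files from within my program.

Thanks for your help.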

Answer 1

I think you have to use

mrjob.compat.supports_new_distributed_cache_options(version)

and then use -files and -archives instead of -cacheFile and -cacheArchive.

You may find more details here

Answer 2

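You should read the file in your program as if it were available locally, i.e. as if it sat in the same directory as the running code.

I'm not good at Python, so here is an example in Ruby, mapper.rb: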

# The distributed-cache file is in the task's working directory,
# so a relative path is enough.
File.foreach("my-distributed-cache-file.txt") do |line|
  # do something with each line of the file
end
# Rest of mapper code
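
For readers working in Python, the same idea as the Ruby example above could look roughly like this. It is only a minimal sketch, assuming the file was shipped with -files and is therefore reachable by its bare name from the task's working directory. The helper name load_cached_lines is hypothetical and not part of mrjob.

```python
import os


def load_cached_lines(filename="my-distributed-cache-file.txt"):
    """Read a file that is expected to sit in the task's working directory.

    Assumption: the file was shipped with -files, so a relative path
    (just the file name) is enough to open it.
    """
    if not os.path.exists(filename):
        raise FileNotFoundError(
            f"{filename} not found in {os.getcwd()}; was it passed with -files?"
        )
    with open(filename, encoding="utf-8") as handle:
        # Strip the trailing newline from each line.
        return [line.rstrip("\n") for line in handle]
```

In an mrjob job, a helper like this would typically be called once per task, for example from a setup hook such as mapper_init, rather than once per input record.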

Accessing the distributed cache from MrJob
