Hadoop's DistributedCache documentation doesn't seem to describe in enough detail how to actually use the distributed cache. Here is the example it gives:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
        implements Mapper<K, V, K, V> {

    private Path[] localArchives;
    private Path[] localFiles;

    public void configure(JobConf job) {
        // Get the cached archives/files
        File f = new File("./map.zip/some/file/in/zip.txt");
    }

    public void map(K key, V value,
                    OutputCollector<K, V> output, Reporter reporter)
            throws IOException {
        // Use data from the cached archives/files here
        // ...
        output.collect(k, v);
    }
}
I've been searching for over an hour trying to figure out how to use this. After piecing together a few other SO questions, this is what I came up with:
public static void main(String[] args) throws Exception {
    Job job = new Job(new JobConf(), "Job Name");
    Configuration conf = job.getConfiguration();
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip"), conf);
    // *Rest of configuration code*
}
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

    private Path[] localArchives;

    public void configure(JobConf job) {
        // Get the cached archive
        File file1 = new File("./LookupTable.zip/file1.dat");
        BufferedReader br1index = new BufferedReader(new FileReader(file1));
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // *Map code*
    }
}
My questions: What do I need to do in the void configure(JobConf job) function? What is the private Path[] localArchives object for? And is the code in the configure() function the correct way to access a file within the archive and to link a file to a BufferedReader?

Answer (score: 1):
I will answer your questions with regard to the new API for the distributed cache and common usage patterns.
The framework calls the protected void setup(Context context) method at the beginning of every map task, and the logic associated with using the cached files is usually handled there, for example reading a file and storing its data in a variable to be used in the map() function, which is called after setup(). The paths of the cached files are typically retrieved in the setup() method, with something like this:
Path[] localArchives = DistributedCache.getLocalCacheArchives(context.getConfiguration());
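
To make the setup()/map() interaction concrete, here is a minimal sketch of a complete Mapper using the new API. The class name LookupMapper, the lookup field, and the assumption that file1.dat (the name from your code) holds tab-separated key/value pairs are all illustrative, not prescribed by the API:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<Object, Text, Text, IntWritable> {

    // Lookup table loaded once per task in setup(), then read in map()
    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The framework localizes and unpacks the archive on this node;
        // each returned Path is the directory the corresponding zip was extracted into
        Path[] localArchives =
                DistributedCache.getLocalCacheArchives(context.getConfiguration());
        File file1 = new File(localArchives[0].toString(), "file1.dat");
        BufferedReader br = new BufferedReader(new FileReader(file1));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split("\t"); // assumed format: key<TAB>value
                lookup.put(parts[0], parts[1]);
            }
        } finally {
            br.close();
        }
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // The table loaded in setup() is available here via lookup.get(...)
    }
}
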
Your code is missing the call to the method that retrieves the path where the cached files are stored (as shown above). Once the path has been retrieved, a file inside the archive can be read as follows:
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
FSDataInputStream in = fs.open(new Path(localArchives[0], "file1.dat"));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
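
Putting the two halves together, a minimal driver along the lines of the code in your question might look like the following. The Driver class name is illustrative, LookupMapper is the sketch above, and the HDFS path is the one from your question:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "Job Name");
        // Job copies the Configuration it is given, so the cache entry must be
        // added to job.getConfiguration(), not to the original Configuration object
        Configuration conf = job.getConfiguration();
        DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip"), conf);
        job.setJarByClass(Driver.class);
        job.setMapperClass(LookupMapper.class);
        // *Rest of configuration code*
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that on Hadoop 2.x and later the DistributedCache class is deprecated; the equivalent call there is job.addCacheArchive(uri), with the same localization behavior at task startup.
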