Question

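There are two things I don't understand about the code below:

DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())

My understanding is that the URI path has to exist in HDFS. Correct me if I am wrong.

And what does p.getName().equals() do in the following code: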

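import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDC {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Abbreviation -> state lookup table, loaded once per task in setup().
        private final Map<String, String> abMap = new HashMap<String, String>();
        private final Text outputKey = new Text();
        private final Text outputValue = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Local copies (on this task node) of every file added to the cache.
            Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (files != null) {
                for (Path p : files) {
                    // Select the one cached file this mapper needs, by file name.
                    if (p.getName().equals("abc.dat")) {
                        BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                        try {
                            String line = reader.readLine();
                            while (line != null) {
                                String[] tokens = line.split("\t");
                                // Skip malformed lines that lack a second column.
                                if (tokens.length >= 2) {
                                    abMap.put(tokens[0], tokens[1]);
                                }
                                line = reader.readLine();
                            }
                        } finally {
                            reader.close();
                        }
                    }
                }
            }
            if (abMap.isEmpty()) {
                throw new IOException("Unable to load abbreviation data.");
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String row = value.toString();
            String[] tokens = row.split("\t");
            String state = abMap.get(tokens[0]);
            if (state == null) {
                // No entry for this abbreviation; Text.set(null) would throw, so skip the row.
                return;
            }
            outputKey.set(state);
            outputValue.set(row);
            context.write(outputKey, outputValue);
        }
    }

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
        Job job = new Job();
        job.setJarByClass(MyDC.class);
        job.setJobName("DCTest");
        job.setNumReduceTasks(0); // map-only job

        // Register the lookup file; a bad URI now fails the job instead of being swallowed.
        DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}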

Answer 1

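The idea of the distributed cache is to make some static data available on the task nodes before the tasks start executing.

The file must already exist in HDFS so that the framework can add it to the distributed cache (that is, copy it to each task node).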
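To make the path question concrete, here is a minimal sketch that uses only java.net.URI to show what the argument to addCacheFile contains. It rests on two assumptions from my reading of the Hadoop documentation: a URI with no scheme, such as /abc.dat, is resolved against the job's default filesystem (normally HDFS on a cluster), and a fragment after # can name a symlink for the cached file. The host namenode.example, the port 8020 and the fragment lookup are hypothetical values chosen for illustration.

```java
import java.net.URI;

class CacheUriDemo {

    // Summarises the parts of a cache URI: scheme, path and optional fragment.
    static String describe(String raw) {
        URI uri = URI.create(raw);
        // Assumption: a missing scheme means "use the default filesystem".
        String scheme = uri.getScheme() == null ? "(none, default filesystem)" : uri.getScheme();
        String fragment = uri.getFragment() == null ? "(none)" : uri.getFragment();
        return "scheme=" + scheme + " path=" + uri.getPath() + " fragment=" + fragment;
    }

    public static void main(String[] args) {
        // The form used in the question: path only.
        System.out.println(describe("/abc.dat"));
        // Hypothetical fully qualified forms of the same file.
        System.out.println(describe("hdfs://namenode.example:8020/abc.dat"));
        System.out.println(describe("hdfs://namenode.example:8020/abc.dat#lookup"));
    }
}
```

Either way the URI only names a location; nothing in this call uploads a local file, which is why abc.dat has to be in place before the job is submitted.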

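DistributedCache.getLocalCacheFiles basically returns all the cache files present on that task node. With if (p.getName().equals("abc.dat")) { you pick out the cache file that your application needs to process.

See the following documentation: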

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache

https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)

Answer 2

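DistributedCache is an API for shipping a file or a set of files to every node that runs tasks for the job, so the data is available locally before the map and reduce tasks start. The files are copied to each node's local disk; it is up to your code, as in setup() in the question, to load them into memory. One example of using DistributedCache is a map-side join.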
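For readers who have not met the term, a map-side join can be sketched without Hadoop at all: load the small table into a HashMap once, then look up each input row against it. The class and method names below (MapSideJoinSketch, loadLookup, join) and the sample rows are hypothetical; they mirror what setup() and map() do in the question's mapper rather than reproduce any Hadoop API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideJoinSketch {

    // Builds the lookup table from tab-separated "abbreviation<TAB>state" lines,
    // the way setup() reads the cached file. Lines without two columns are skipped.
    static Map<String, String> loadLookup(List<String> lines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.split("\t");
            if (tokens.length >= 2) {
                lookup.put(tokens[0], tokens[1]);
            }
        }
        return lookup;
    }

    // Joins one input row against the lookup, the way map() does.
    // Returns "state<TAB>row", or null when the first field has no match.
    static String join(Map<String, String> lookup, String row) {
        String state = lookup.get(row.split("\t")[0]);
        return state == null ? null : state + "\t" + row;
    }

    public static void main(String[] args) {
        // Hypothetical contents of the small cached file.
        List<String> cached = new ArrayList<>();
        cached.add("CA\tCalifornia");
        cached.add("NY\tNew York");
        Map<String, String> lookup = loadLookup(cached);

        // Hypothetical rows from the large input.
        System.out.println(join(lookup, "CA\t42"));
        System.out.println(join(lookup, "ZZ\t7"));
    }
}
```

The join needs no reducer because every mapper holds the whole small table, which fits the question's job.setNumReduceTasks(0).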

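DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration()) adds the abc.dat file to the cache. There can be n files in the cache, and p.getName().equals("abc.dat") checks for the one you need. Every location in HDFS is represented as a Path for map-reduce processing. For example: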

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

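The first Path (args[0]) is the first argument you pass when executing the jar (the input file location), and Path(args[1]) is the second argument, the output file location. Everything is handled as a Path.

In the same way, every file you add to the cache ends up in a Path array, which you can retrieve with the following code.

Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());

It returns all the files present in the cache, and you pick out your file by name with the p.getName().equals() method.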
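The reason the check compares against "abc.dat" and not "/abc.dat" is that Hadoop's Path.getName() returns only the final component of the path, and the local copy lives in some node-specific directory. The sketch below illustrates that idea with java.nio.file.Paths as a stand-in for Hadoop's Path (so getFileName() plays the role of getName()); the local directory names are made up, since the real location depends on the cluster configuration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class CacheFileLookup {

    // Returns the first path whose final component equals wantedName, or null.
    static String findByName(List<String> localPaths, String wantedName) {
        for (String raw : localPaths) {
            // Stand-in for Hadoop's p.getName(): the last element of the path.
            Path fileName = Paths.get(raw).getFileName();
            if (fileName != null && fileName.toString().equals(wantedName)) {
                return raw;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical localized cache paths on a task node.
        List<String> localPaths = Arrays.asList(
                "/tmp/example-cache/1234/other.dat",
                "/tmp/example-cache/5678/abc.dat");

        System.out.println(findByName(localPaths, "abc.dat"));
        // The original HDFS path does not match, because only the name is compared.
        System.out.println(findByName(localPaths, "/abc.dat"));
    }
}
```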

Not understanding the path in the distributed cache
