Question

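I have a set of files, say 10, plus one large file that combines the contents of all 10.

I add them to the distributed cache through the job conf.

When I read them in reduce, I observe the following:

1. I read only selected files from the distributed cache in the reduce method. I expected this to be faster, because each reduce call reads a smaller file than it would if every reduce call read the large file. Instead, it is slower.
2. When I split the data into even smaller files and add those to the distributed cache, the problem gets worse. The job itself takes a long time to start running.

I can't figure out why. Please help.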

Answer 1

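I think your problem is that you read the files in reduce(). You should read them in configure() (old API) or setup() (new API) instead. That way each file is read only once per reducer, rather than once for every input group passed to the reducer (that is, on every call to the reduce method).

You can write it like this with the new mapreduce API (org.apache.hadoop.mapreduce.*):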

    import java.io.IOException;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class ReduceJob extends Reducer<Text, Text, Text, Text> {

        ...
        Path file1;
        Path file2;
        ...

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Get the local paths of the files in the distributed cache (looked up once).
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            file1 = cacheFiles[0];
            file2 = cacheFiles[1];

            // Parse the files and keep their data in memory for use in the reduce method,
            // probably in some ArrayList or HashMap.
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            ...
        }
    }

With the old mapred API (org.apache.hadoop.mapred.*):

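    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

        ...
        Path file1;
        Path file2;
        ...

        @Override
        public void configure(JobConf job) {
            try {
                // Get the local paths of the files in the distributed cache (looked up once).
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                file1 = cacheFiles[0];
                file2 = cacheFiles[1];
                ...

                // Parse the files and keep their data in memory for use in the reduce method,
                // probably in some ArrayList or HashMap.
            } catch (IOException e) {
                // configure() cannot declare checked exceptions, so wrap and rethrow.
                throw new RuntimeException("Could not read distributed cache files", e);
            }
        }

        @Override
        public synchronized void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            ...
        }
    }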
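
The "parse the files and keep their data in memory" step mentioned in the comments above could look roughly like the sketch below. It is only an illustration in plain Java: `CacheFileLoader` and `load` are hypothetical names, not part of Hadoop, and the tab-separated `key<TAB>value` line format is an assumption about the cached files, which the question does not describe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Hadoop class): loads a local file into memory once.
final class CacheFileLoader {

    private CacheFileLoader() {
    }

    // Assumes each useful line looks like "key<TAB>value"; lines without a tab are skipped.
    static Map<String, String> load(String localPath) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // blank or malformed line
                }
                // If a key repeats, the later line overwrites the earlier one.
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }
}
```

In setup() or configure() you would call something like `CacheFileLoader.load(file1.toString())` once and store the returned map in a field, assuming the local cache path is a plain filesystem path. The reduce method then does only in-memory lookups and never touches the file again.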

