Question

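I'm writing a MapReduce job that may end up feeding a very large number of values to a reducer. I'm worried that all of these values will be loaded into memory at once.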

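Does the underlying implementation of Iterable<VALUEIN> values load the values into memory only as they are needed? Hadoop: The Definitive Guide seems to suggest that this is the case, but it doesn't give a definitive answer.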

The reducer's output will be far larger than its input values, but I believe the output is written to disk as needed.

Answer 1

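You read the book correctly. The reducer does not store all of the values in memory. Instead, as you loop over the Iterable of values, the same object instance is reused on every iteration, so only one instance is held at any given time.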

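For example, in the code below, the objs ArrayList will have the expected size after the loop, but every element will be the same, because the Text val instance is reused on each iteration.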

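public static class ReducerExample extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) {
        ArrayList<Text> objs = new ArrayList<Text>();
        for (Text val : values) {
            // Pitfall: this stores a reference to the one reused Text instance,
            // so every element of objs ends up with the last value's contents.
            objs.add(val);
        }
    }
}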

(If for some reason you do need to hold on to each val for further processing, make a deep copy and store that instead.)
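The difference between storing the reused reference and storing a copy can be sketched without a Hadoop dependency. In the example below, MutableText and reusingIterable are hypothetical stand-ins (not Hadoop classes) that imitate a mutable value object and an iterator that reuses a single instance; the real Text class offers a comparable copy constructor, new Text(value).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class ReuseDemo {
    /** Hypothetical stand-in for a mutable value type such as Text. */
    static final class MutableText {
        private String value = "";
        MutableText() {}
        MutableText(MutableText other) { this.value = other.value; } // copy constructor
        void set(String v) { this.value = v; }
        @Override public String toString() { return value; }
    }

    /** Hypothetical iterable that hands out the SAME instance on every next() call. */
    static Iterable<MutableText> reusingIterable(List<String> raw) {
        return () -> new Iterator<MutableText>() {
            private final MutableText shared = new MutableText();
            private int i = 0;
            @Override public boolean hasNext() { return i < raw.size(); }
            @Override public MutableText next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.set(raw.get(i++)); // only the contents change
                return shared;
            }
        };
    }

    /** Keeps references to the reused instance: every entry shows the last value. */
    public static List<String> collectReferences(List<String> raw) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText val : reusingIterable(raw)) kept.add(val);
        return render(kept);
    }

    /** Keeps a copy per iteration: every entry retains its own value. */
    public static List<String> collectCopies(List<String> raw) {
        List<MutableText> kept = new ArrayList<>();
        for (MutableText val : reusingIterable(raw)) kept.add(new MutableText(val));
        return render(kept);
    }

    private static List<String> render(List<MutableText> items) {
        List<String> out = new ArrayList<>();
        for (MutableText t : items) out.add(t.toString());
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "c");
        System.out.println(collectReferences(raw)); // [c, c, c]
        System.out.println(collectCopies(raw));     // [a, b, c]
    }
}
```

Keep in mind that copying every value brings back the memory cost the framework was avoiding, so it only makes sense when the number of retained values is known to be small.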

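Of course, even a single value may be larger than memory. In that case, the developer is advised to trim the data in the preceding Mapper so that individual values do not become too large.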

Update: see pages 199-200 of Hadoop: The Definitive Guide, 2nd Edition.

This code snippet makes it clear that the same key and value objects are used on each
invocation of the map() method -- only their contents are changed (by the reader's
next() method). This can be a surprise to users, who might expect keys and values to be
immutable. This causes problems when a reference to a key or value object is retained
outside the map() method, as its value can change without warning. If you need to do
this, make a copy of the object you want to hold on to. For example, for a Text object,
you can use its copy constructor: new Text(value).

The situation is similar with reducers. In this case, the value objects in the reducer's
iterator are reused, so you need to copy any that you need to retain between calls to
the iterator.

Answer 2

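The values are not held entirely in memory; some of them come from disk. Looking at the code, the framework appears to break the Iterable into segments and load them from disk into memory one at a time.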

org.apache.hadoop.mapreduce.task.ReduceContextImpl
org.apache.hadoop.mapred.BackupStore
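To illustrate the general idea of consuming values from disk one at a time, here is a minimal sketch in plain Java. It is not how ReduceContextImpl or BackupStore are implemented; StreamingValues, LineIterator, and the one-value-per-line temp file are assumptions made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

public class StreamingValues {
    /** One-pass iterator that buffers a single line of lookahead, never the whole file. */
    static final class LineIterator implements Iterator<String>, AutoCloseable {
        private final BufferedReader reader;
        private String nextLine;

        LineIterator(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            nextLine = reader.readLine();
        }

        @Override public boolean hasNext() { return nextLine != null; }

        @Override public String next() {
            if (nextLine == null) throw new NoSuchElementException();
            String current = nextLine;
            try {
                nextLine = reader.readLine(); // fetch the next value lazily
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return current;
        }

        @Override public void close() throws IOException { reader.close(); }
    }

    /** Sums integer values, reading them from disk one at a time. */
    public static long sumValues(Path file) throws IOException {
        long total = 0;
        try (LineIterator it = new LineIterator(file)) {
            while (it.hasNext()) total += Long.parseLong(it.next());
        }
        return total;
    }

    /** Writes three sample values to a temp file and sums them by streaming. */
    public static long demo() {
        try {
            Path spill = Files.createTempFile("values", ".txt");
            try {
                Files.write(spill, Arrays.asList("1", "2", "3"), StandardCharsets.UTF_8);
                return sumValues(spill);
            } finally {
                Files.delete(spill);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 6
    }
}
```

A reducer that aggregates as it iterates, as sumValues does here, keeps its memory use independent of how many values arrive for a key.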

Answer 3

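As other users have noted, the entire data set is not loaded into memory. Take a look at some of the mapred-site.xml parameters described in the Apache documentation.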

mapreduce.reduce.merge.inmem.threshold

Default: 1000. The threshold, in terms of the number of files, for the in-memory merge process.

mapreduce.reduce.shuffle.merge.percent

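Default: 0.66. The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.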

mapreduce.reduce.shuffle.input.buffer.percent

Default: 0.70. The percentage of the maximum heap size to be allocated to storing map outputs during the shuffle.

mapreduce.reduce.input.buffer.percent

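Default: 0.0. The percentage of memory, relative to the maximum heap size, used to retain map outputs during the reduce. When the shuffle has finished, any map outputs remaining in memory must consume less than this threshold before the reduce can begin.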

mapreduce.reduce.shuffle.memory.limit.percent

Default: 0.25. The maximum percentage of the in-memory limit that a single shuffle can consume.
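As a rough worked example of how these percentages relate to each other, the sketch below applies the default values to a heap size. The class and method names are hypothetical, and the arithmetic is a simplified reading of the parameter descriptions above, not Hadoop's exact internal calculation.

```java
public class ShuffleMemoryEstimate {
    /** Memory for holding map outputs during the shuffle (shuffle.input.buffer.percent). */
    public static long shuffleBufferBytes(long maxHeapBytes, double inputBufferPercent) {
        return (long) (maxHeapBytes * inputBufferPercent);
    }

    /** Buffer usage at which an in-memory merge would start (shuffle.merge.percent). */
    public static long mergeTriggerBytes(long shuffleBufferBytes, double mergePercent) {
        return (long) (shuffleBufferBytes * mergePercent);
    }

    /** Largest share of the buffer a single shuffle may take (shuffle.memory.limit.percent). */
    public static long singleShuffleLimitBytes(long shuffleBufferBytes, double limitPercent) {
        return (long) (shuffleBufferBytes * limitPercent);
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024L * 1024L; // assumed 1 GiB reducer heap
        long buffer = shuffleBufferBytes(heap, 0.70);
        long mergeAt = mergeTriggerBytes(buffer, 0.66);
        long perShuffle = singleShuffleLimitBytes(buffer, 0.25);

        long mib = 1024L * 1024L;
        System.out.println("shuffle buffer (MiB):       " + buffer / mib);
        System.out.println("merge starts at (MiB):      " + mergeAt / mib);
        System.out.println("single shuffle limit (MiB): " + perShuffle / mib);
    }
}
```

Under these assumptions, a 1 GiB heap gives roughly 717 MiB of shuffle buffer, an in-memory merge starting at about 473 MiB of usage, and a limit of about 179 MiB for any single shuffle.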

Hadoop Reducer Values in Memory?
