Question

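I am running into a GC overhead problem because my reducer's input is very large. I cannot filter out any of the input data, because every entry in the Iterable parameter is useful. I tried increasing the heap size on the EMR cluster that runs this job, but even that did not help.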

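Basically, my reducer takes a list of strings and converts them into a list of objects A. These objects are then combined to form a larger object B. The solution I came up with is to dump the entire Iterable input to disk and release the Iterable object completely. I would then read the dumped strings back from disk one at a time and keep building the larger object B, constructing an intermediate object A for each string and releasing that A before fetching the next string from disk. That way I would hold only about half of the data on the heap. However, I don't think I am actually managing to release the Iterable input, because I still get the GC overhead problem.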

This is how I tried to do it:

public void reduce(final Text key, Iterable<MapWritable> inputValues,
                   final Context context)
        throws IOException, InterruptedException
{
    BufferedWriter bw = new BufferedWriter(
                            new FileWriterWithEncoding(this.fileName, this.encoding));
    for (Iterator<MapWritable> iterator = inputValues.iterator(); iterator.hasNext();)
    {
        MapWritable mapWritable = iterator.next();
        // ----
        // Put string contained in mapWritable into a file on disk
        // ----
    }
    bw.close();

    // Release the input Iterable instance
    inputValues = null;  //This doesn't seem to work :'(

    BufferedReader br = new BufferedReader(new InputStreamReader(
                            new FileInputStream(this.fileName), this.encoding));
    for (String line; (line = br.readLine()) != null;)
    {
        // ----
        // Then read from the file saved in disk one line at a time
        // and process it to build the object B
        // ----
    }
    br.close();
}
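
For reference, the spill-and-replay idea described above can be sketched outside Hadoop as plain Java. This is only a minimal sketch under stated assumptions: `A`, `B`, the `name=amount` string format and the `build` method are hypothetical stand-ins (the real A and B are not shown in this question), the input is a plain `Iterator<String>` rather than the reducer's `Iterable<MapWritable>`, and the strings are assumed to contain no line breaks.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

final class SpillAndReplay {

    /** Hypothetical stand-in for "object A": one parsed input string. */
    static final class A {
        final String name;
        final long amount;

        A(String name, long amount) {
            this.name = name;
            this.amount = amount;
        }

        /** Parses a line of the assumed form "name=amount". */
        static A parse(String line) {
            int eq = line.indexOf('=');
            return new A(line.substring(0, eq),
                         Long.parseLong(line.substring(eq + 1)));
        }
    }

    /** Hypothetical stand-in for "object B": the combined result. */
    static final class B {
        final Map<String, Long> totals = new TreeMap<>();

        /** Folds one A into B; the A can be garbage collected afterwards. */
        void absorb(A a) {
            totals.merge(a.name, a.amount, Long::sum);
        }
    }

    static B build(Iterator<String> values) throws IOException {
        Path spill = Files.createTempFile("reducer-spill", ".txt");
        try {
            // Pass 1: write every input string to disk, one per line.
            try (BufferedWriter bw =
                     Files.newBufferedWriter(spill, StandardCharsets.UTF_8)) {
                while (values.hasNext()) {
                    bw.write(values.next());
                    bw.newLine();
                }
            }

            // Pass 2: replay the file; only one A is live at a time.
            B result = new B();
            try (BufferedReader br =
                     Files.newBufferedReader(spill, StandardCharsets.UTF_8)) {
                for (String line; (line = br.readLine()) != null; ) {
                    result.absorb(A.parse(line));
                }
            }
            return result;
        } finally {
            // Remove the temporary file even if parsing fails.
            Files.deleteIfExists(spill);
        }
    }
}
```

In this sketch the second pass keeps at most one A alive at any moment, but B still grows with the input, so it only helps if B itself fits on the heap.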

My question is: is there any way to release the memory held by the reducer's Iterable input from within the reduce method?
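
One Java detail that seems relevant to the `inputValues = null` line above: object references are passed by value, so assigning null to a parameter clears only the method's own copy of the reference. The sketch below illustrates this with hypothetical names and a plain list in place of Hadoop's Iterable. It assumes, without showing Hadoop's internals, that whatever calls reduce keeps its own reference to the values.

```java
import java.util.ArrayList;
import java.util.List;

final class NullParameterDemo {

    /** Mimics reduce(): nulls its own parameter, as in the attempt above. */
    static void reduceLike(Iterable<String> inputValues) {
        inputValues = null; // clears only this method's local reference
    }

    /** Mimics a caller that holds its own reference across the call. */
    static boolean callerStillHoldsValues() {
        List<String> values = new ArrayList<>();
        values.add("payload");
        reduceLike(values);
        // The caller's reference is untouched, so the list stays reachable.
        return values != null && values.size() == 1;
    }
}
```

As long as the caller's reference is reachable, the garbage collector cannot reclaim the object, regardless of what the callee does with its parameter.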

Releasing the memory of the reducer's Iterable input

0 answers: