Question

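I have an input file (about 31GB in size) containing consumer reviews of certain products, which I am trying to lemmatize in order to find the corresponding lemma counts. The approach is somewhat similar to the WordCount example that ships with Hadoop. I have four classes in all to carry out the processing: StanfordLemmatizer [contains the lemmatizing goodies from Stanford's coreNLP package v3.3.0], WordCount [the driver], WordCountMapper [the mapper], and WordCountReducer [the reducer].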

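I tested the program on a subset (in the MBs) of the original dataset and it ran fine. Unfortunately, when I ran the job on the complete dataset of ~31GB, the job failed. I checked the syslog for the job and it contained this: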

java.lang.OutOfMemoryError: Java heap space
    at edu.stanford.nlp.sequences.ExactBestSequenceFinder.bestSequence(ExactBestSequenceFinder.java:109) [...]

Any suggestions on how to handle this?

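Note: I am using Yahoo's VM, which comes pre-configured with hadoop-0.18.0. I have also tried the solution of allocating more heap, as mentioned in this thread: out of Memory Error in Hadoop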

WordCountMapper code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  private final StanfordLemmatizer slem = new StanfordLemmatizer();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {

    String line = value.toString();

    if(line.matches("^review/(summary|text).*"))    //if the current line represents a summary/text of a review, process it! 
    {
        for(String lemma: slem.lemmatize(line.replaceAll("^review/(summary|text):.", "").toLowerCase()))
        {
            word.set(lemma);
            output.collect(word, one);
        }
    }
  }
}

Answer 1

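You need to make the size of the individual units being processed (i.e., each Map job in the map-reduce) reasonable. The first unit is the size of the document that you pass to StanfordCoreNLP's annotate() call. The whole piece of text you provide there is tokenized and processed in memory. In tokenized and processed form, it is more than an order of magnitude larger than its size on disk. So the document size needs to be reasonable. For example, you might pass in one consumer review at a time (and not a 31GB file of text!).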

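Secondly, one level down, the POS tagger (which runs before lemmatization) annotates one sentence at a time, and it uses large temporary dynamic programming data structures to tag a sentence, which may be 3 orders of magnitude larger than the sentence itself. So the length of individual sentences also needs to be reasonable. If there are long stretches of text or junk that do not divide into sentences, you may run into problems at this level too. One simple way to fix that is to use the pos.maxlen property to avoid POS tagging super long sentences.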

P.S. And of course, if you only need the lemmatizer, you shouldn't run annotators you are not using, such as parse and dcoref.
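As a rough illustration of the two settings mentioned above, here is a minimal sketch that builds the properties for a lemma-only pipeline. The class name LemmaPipelineConfig and the limit of 100 tokens are made up for this example; only the annotators and pos.maxlen property names come from CoreNLP, and how they get wired into the asker's StanfordLemmatizer class is an assumption, since that class was not posted.

```java
import java.util.Properties;

// Hypothetical helper: the class name and the chosen limit are illustrative only.
final class LemmaPipelineConfig {

    /** Builds CoreNLP-style properties for a pipeline that stops at lemmatization. */
    static Properties build(int maxSentenceTokens) {
        Properties props = new Properties();
        // lemma needs pos, which needs ssplit and tokenize; parse and dcoref are left out.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        // pos.maxlen: upper bound on the sentence length the POS tagger will attempt.
        props.setProperty("pos.maxlen", Integer.toString(maxSentenceTokens));
        return props;
    }
}
```

The resulting Properties object would then presumably be handed to the StanfordCoreNLP constructor wherever the lemmatizer creates its pipeline, e.g. new StanfordCoreNLP(LemmaPipelineConfig.build(100)). A suitable limit depends on the data and the available heap.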

Answer 2

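Configuring the Hadoop heap space may not help you if your StanfordLemmatizer is not part of the mapreduce job. Can you provide the code of the job? What I believe limits you is the Java heap space in general.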

Before considering configuring it, check this first:

I had a look at the code of edu.stanford.nlp.sequences.ExactBestSequenceFinder (you should have a look at it too).

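I don't know which version of stanford.nlp you are using, and I am not familiar with it, but it seems to perform some operations based on the "SequenceModel" you give it as input. It starts like this: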

private int[] bestSequenceNew(SequenceModel ts) {
    // Set up tag options
    int length = ts.length();
    int leftWindow = ts.leftWindow();
    int rightWindow = ts.rightWindow();
    int padLength = length + leftWindow + rightWindow;
    int[][] tags = new int[padLength][];  //operations based on the length of ts
    int[] tagNum = new int[padLength];   //this is the guilty line 109 according to grepcode

So it appears that ts.length() is huge (or there is no Java heap space left for this array). Can you make it smaller?

Edit

So apparently the String

 line.replaceAll("^review/(summary|text):.", "").toLowerCase()

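is just too much for the Java heap. Can you check that this is really the string you intend to process? Can you print its length? Maybe you should consider reorganizing your 31GB dataset so that it has more, and smaller, lines than it does now (if that is possible). It could be that one line is too big by mistake, and that this is the cause of the problem.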

If this cannot be done, please print the full stack trace of the exception.
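To make the length check concrete, here is a minimal sketch of a guard that could sit in front of the lemmatizer call in the mapper. The class ReviewLineFilter, the method extract, and the maxChars threshold are hypothetical names invented for this example; the prefix pattern is the one from the question's mapper. Unlike the original mapper, this version requires the colon and one following character to be present before it accepts a line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Hadoop or CoreNLP.
final class ReviewLineFilter {

    // Same prefix the question's mapper strips; the trailing "." consumes the space after the colon.
    private static final Pattern PREFIX = Pattern.compile("^review/(summary|text):.");

    /**
     * Returns the lowercased review text, or null if the line is not a
     * review summary/text field or if its text is longer than maxChars.
     */
    static String extract(String line, int maxChars) {
        Matcher m = PREFIX.matcher(line);
        if (!m.find()) {
            return null; // not a review/summary or review/text line
        }
        String text = line.substring(m.end());
        if (text.length() > maxChars) {
            // Report the length so oversized records can be found in the task logs.
            System.err.println("Skipping oversized line, length=" + text.length());
            return null;
        }
        return text.toLowerCase();
    }
}
```

In the mapper, the result would be passed to slem.lemmatize(...) only when it is not null. Logging the skipped lengths should show whether a few abnormally long lines are what exhausts the heap.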

java.lang.OutOfMemoryError when running a Hadoop job
