Question

I'm trying to integrate OpenNLP into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:

```
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;

    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);

    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }

    logger.info("number of sentences: " + sentences.length);

    <snip>
}
```

When I run my job, I get an error in the logs saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:

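I've verified that the model file exists in the location that sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.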

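Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would prevent me from reading the model?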

Answer 1

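Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This tries to load the file from the classpath. Neither the file in HDFS nor the file on the client's local file system is part of the classpath when the mapper/reducer runs, which is why you're seeing the null error: getResourceAsStream() returns null if the resource cannot be found.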
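To make that failure mode concrete, here is a minimal sketch of a helper that checks the result of getResourceAsStream() and fails with a descriptive exception, instead of passing a null stream on to the SentenceModel constructor. The class name `ModelResources` and its method are hypothetical and are not part of OpenNLP or Hadoop.

```java
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class ModelResources {
    private ModelResources() {}

    /**
     * Opens a classpath resource relative to the given class. Throws a
     * FileNotFoundException naming the resource when the lookup returns
     * null, so the caller never receives a null stream.
     */
    static InputStream openClasspathResource(Class<?> anchor, String name)
            throws FileNotFoundException {
        InputStream in = anchor.getResourceAsStream(name);
        if (in == null) {
            throw new FileNotFoundException(
                    "Resource not found on classpath: " + name);
        }
        return in;
    }
}
```

With a check like this, a missing model would surface as a FileNotFoundException naming the resource, which is the case the first catch block in the question was written for.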

To get around this, you have a number of options:

Amend your code to load the file from HDFS:

```
modelIn = FileSystem.get(context.getConfiguration()).open(
                 new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
```

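Amend your code to load the file from the local directory, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and then back down to the local working directory of the running mapper/reducer):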
```
modelIn = new FileInputStream("en-sent.bin");
```
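A relative name such as "en-sent.bin" is resolved against the working directory of the running process. As a rough sketch, resolving the name to an absolute path before opening it makes it easier to see in the task logs where the file was expected. The `LocalModelFile` class below is a hypothetical helper, not a Hadoop API.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class LocalModelFile {
    private LocalModelFile() {}

    /**
     * Opens a file by name, resolving relative names against the current
     * working directory. The exception message carries the absolute path
     * that was actually checked.
     */
    static InputStream open(String fileName) throws FileNotFoundException {
        File file = new File(fileName).getAbsoluteFile();
        if (!file.canRead()) {
            throw new FileNotFoundException(
                    "Model file not readable: " + file.getPath());
        }
        return new FileInputStream(file);
    }
}
```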
Bake the file into the job jar (in the root directory of the jar), and amend the code to include a leading slash:
```
modelIn = getClass().getResourceAsStream("/en-sent.bin");
```
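The leading slash matters because Class.getResourceAsStream() resolves a name without one relative to the package of the class, while a name with one is looked up from the root of the classpath. Here is a small illustration using a JDK class. The `ResourceNameDemo` class is made up for this sketch.

```java
import java.net.URL;

// Hypothetical demo class, for illustration only.
final class ResourceNameDemo {
    private ResourceNameDemo() {}

    /** Returns true if the named resource can be located via the given class. */
    static boolean exists(Class<?> anchor, String name) {
        URL url = anchor.getResource(name);
        return url != null;
    }

    public static void main(String[] args) {
        // No leading slash: resolved against String's own package (java/lang).
        System.out.println(exists(String.class, "String.class"));
        // Leading slash: looked up from the root, where String.class does not live.
        System.out.println(exists(String.class, "/String.class"));
        // Leading slash plus the full package path.
        System.out.println(exists(String.class, "/java/lang/String.class"));
    }
}
```

By the same rule, a model stored at the root of the job jar has to be requested as "/en-sent.bin" unless the calling class is in the default package.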

Unable to load OpenNLP sentence model in Hadoop map-reduce job
