Question

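I want to use OpenNLP with Hadoop for sentence detection. I have already implemented this successfully in plain Java, and now I want to do the same thing on the MapReduce platform. Can anyone help me?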

Answer 1

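I've done this two different ways. One way is to push your sentence detection model out to a standard directory on each node (e.g. /opt/opennlpmodels/), read the serialized model at the class level in your mapper class, and then use it as appropriate in your map or reduce function.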
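To make the "read it once at the class level, then reuse it" idea concrete, here is a minimal sketch using only the JDK. The `NodeLocalResource` class and its `Loader` callback are hypothetical names invented for this illustration, not part of Hadoop or OpenNLP. In a real job you would most likely create something like this in the mapper's `setup(Context)` method and pass a loader that wraps the stream in `new SentenceModel(in)`.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Loads an object from a node-local file on first use and reuses it afterwards. */
public class NodeLocalResource<T> {

  /** Hypothetical callback that turns the file's contents into a usable object. */
  public interface Loader<T> {
    T load(InputStream in) throws IOException;
  }

  private final String path;      // same path on every node, e.g. pushed out by Puppet
  private final Loader<T> loader;
  private T cached;               // stays null until the first successful load

  public NodeLocalResource(String path, Loader<T> loader) {
    this.path = path;
    this.loader = loader;
  }

  /** Returns the loaded object, reading the file only on the first call. */
  public synchronized T get() throws IOException {
    if (cached == null) {
      // try-with-resources closes the file once the loader has consumed it
      try (InputStream in = new FileInputStream(path)) {
        cached = loader.load(in);
      }
    }
    return cached;
  }
}
```

The point of the pattern is that the expensive deserialization happens once per task rather than once per record, so every later call to `get()` is just a field read.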

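Another way is to put the model in a database or in the distributed cache (as a blob or something similar; I have used Accumulo to store document categorization models this way before). Then, at the class level, make the connection to the database and fetch the model as a ByteArrayInputStream.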
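As a rough sketch of the blob approach, the JDK-only helper below drains whatever stream the store hands back into memory and then serves it as a `ByteArrayInputStream`. The `ModelBlob` class is a hypothetical name for this illustration, and the actual database or cache client call that produces the source stream is deliberately left out because it depends on your store. The resulting stream is what you would pass to `new SentenceModel(...)`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Holds a model's raw bytes in memory after fetching them from some external store. */
public class ModelBlob {

  private final byte[] bytes;

  private ModelBlob(byte[] bytes) {
    this.bytes = bytes;
  }

  /** Reads the source stream (e.g. a blob column or cache entry) fully into memory. */
  public static ModelBlob readFrom(InputStream source) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = source.read(chunk)) != -1) {
      buffer.write(chunk, 0, n);
    }
    return new ModelBlob(buffer.toByteArray());
  }

  /** Returns a fresh stream positioned at the start of the model bytes. */
  public ByteArrayInputStream open() {
    return new ByteArrayInputStream(bytes);
  }

  /** Number of bytes held. */
  public int size() {
    return bytes.length;
  }
}
```

Keeping the bytes rather than the stream means the database connection can be closed immediately, and the model can be re-read without another round trip.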

I have used Puppet to push out the models, but use whatever you normally use to keep files up to date across your cluster.

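Depending on your Hadoop version, you may be able to sneak the model in as a property at job setup, so that only the master (or wherever you launch jobs from) needs to have the actual model file on it. I have never tried this.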
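One way the property idea could look, purely as an untested sketch: since job properties are strings, the model bytes would have to be text-encoded, and Base64 is assumed here. A plain `Map<String, String>` stands in for the job configuration, and the `ModelAsProperty` class and its property key are hypothetical names made up for this example. Whether a model of realistic size fits comfortably in a job configuration is something you would need to check on your own cluster.

```java
import java.util.Base64;
import java.util.Map;

/** Sketch: carry a serialized model inside a string-valued job property. */
public class ModelAsProperty {

  /** Hypothetical property key, chosen only for this example. */
  public static final String MODEL_KEY = "example.sentence.model.base64";

  /** Launcher side: encode the model bytes and store them under MODEL_KEY. */
  public static void store(Map<String, String> jobProperties, byte[] modelBytes) {
    jobProperties.put(MODEL_KEY, Base64.getEncoder().encodeToString(modelBytes));
  }

  /** Task side: read the property back and decode it into the original bytes. */
  public static byte[] load(Map<String, String> jobProperties) {
    String encoded = jobProperties.get(MODEL_KEY);
    if (encoded == null) {
      throw new IllegalStateException("Model property " + MODEL_KEY + " is not set");
    }
    return Base64.getDecoder().decode(encoded);
  }
}
```

On the task side the decoded bytes can be wrapped in a `ByteArrayInputStream` and handed to the model constructor, so the nodes never need the file on disk.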

If you need to know how to actually use the OpenNLP sentence detector, let me know and I'll post an example. HTH

import java.io.File;
import java.io.FileInputStream;
import opennlp.tools.sentdetect.SentenceDetector;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

public class SentenceDetection {

  // Created lazily on first use, then reused for every document this task handles.
  private SentenceDetector sd;

  public Span[] getSentences(String docTextFromMapFunction) throws Exception {

    if (sd == null) {
      // Load the serialized model from the path that exists on every node;
      // try-with-resources closes the stream once the model has been read.
      try (FileInputStream modelIn = new FileInputStream(new File("/standardized-on-each-node/path/to/en-sent.zip"))) {
        sd = new SentenceDetectorME(new SentenceModel(modelIn));
      }
    }
    // This gives you the actual sentences as a string array:
    // String[] sentences = sd.sentDetect(docTextFromMapFunction);

    // This gives you the spans (the character offsets of the start and end
    // of each sentence in the doc).
    Span[] sentenceSpans = sd.sentPosDetect(docTextFromMapFunction);

    // You can also turn the spans back into the actual sentence strings:
    // String[] spansToStrings = Span.spansToStrings(sentenceSpans, docTextFromMapFunction);
    return sentenceSpans;
  }
}

HTH... just make sure the file is in place. There are more elegant ways to do this, but this works and it's simple.

Sentence detection using OpenNLP on Hadoop
