Grouping all named entities in a document

Time: 2014-02-04 12:30:41

Tags: n-gram named-entity-recognition part-of-speech

I want to group all the named entities in a given document. For example,

**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office.

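I do not want to use the OpenNLP API, because it might not recognize every named entity. Is there a way to generate such n-grams with another service, or perhaps a way to group all noun terms together?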

1 answer:

Answer 0 (score: 4)

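If you want to avoid NER, you can use a sentence chunker or a parser instead. Either one extracts noun phrases in general, not only named entities. OpenNLP has both a sentence chunker and a parser, but if you are averse to OpenNLP for some reason, you can try other toolkits. In case you are interested in the OpenNLP chunker, I am posting some code that uses it to extract noun phrases.

Here is the code. You need to download the models from SourceForge: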

http://opennlp.sourceforge.net/models-1.5/

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 *
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

  static final int N = 2;

  public static void main(String[] args) {

    String modelPath = "c:\\temp\\opennlpmodels\\";
    // Adjust the file names to match your local copies (the 1.5 models are distributed as .bin files).
    // try-with-resources closes the three model streams when the block exits.
    try (InputStream tokenIn = new FileInputStream(new File(modelPath + "en-token.zip"));
         InputStream posIn = new FileInputStream(new File(modelPath + "en-pos-maxent.zip"));
         InputStream chunkerIn = new FileInputStream(new File(modelPath + "en-chunker.zip"))) {
      TokenizerModel tm = new TokenizerModel(tokenIn);
      TokenizerME wordBreaker = new TokenizerME(tm);
      POSModel pm = new POSModel(posIn);
      POSTaggerME posme = new POSTaggerME(pm);
      ChunkerModel chunkerModel = new ChunkerModel(chunkerIn);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
      //this is your sentence
      String sentence = "Barack Hussein Obama II  is the 44th and current President of the United States, and the first African American to hold the office.";
      //words is the tokenized sentence
      String[] words = wordBreaker.tokenize(sentence);
      //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
      String[] posTags = posme.tag(words);
      //chunks are the start end "spans" indices to the chunks in the words array
      Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
      //chunkStrings are the actual chunks
      String[] chunkStrings = Span.spansToStrings(chunks, words);
      for (int i = 0; i < chunks.length; i++) {
        if (chunks[i].getType().equals("NP")) {
          System.out.println("NP: \n\t" + chunkStrings[i]);
          String[] split = chunkStrings[i].split(" ");

          List<String> ngrams = ngram(Arrays.asList(split), N, " ");
          System.out.println("ngrams:");
          for (String gram : ngrams) {
            System.out.println("\t" + gram);
          }

        }
      }


    } catch (IOException e) {
      // Report model-loading failures instead of silently swallowing them
      e.printStackTrace();
    }
  }

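  /**
   * Builds n-grams of n consecutive tokens joined by the separator. If the input
   * has n or fewer tokens it is returned unchanged, so a two-word phrase with
   * n = 2 yields its two words rather than one bigram.
   */
  public static List<String> ngram(List<String> input, int n, String separator) {
    if (input.size() <= n) {
      return input;
    }
    List<String> outGrams = new ArrayList<String>();
    // i + n <= input.size() keeps every window inside the list
    for (int i = 0; i + n <= input.size(); i++) {
      StringBuilder gram = new StringBuilder(input.get(i));
      for (int x = i + 1; x < i + n; x++) {
        gram.append(separator).append(input.get(x));
      }
      outGrams.add(gram.toString());
    }
    return outGrams;
  }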
}

The output I get for your sentence is this (with N set to 2, i.e. bigrams):

NP: 
    Barack Hussein Obama II
ngrams:
    Barack Hussein
    Hussein Obama
    Obama II
NP: 
    the 44th and current President
ngrams:
    the 44th
    44th and
    and current
    current President
NP: 
    the United States
ngrams:
    the United
    United States
NP: 
    the first African American
ngrams:
    the first
    first African
    African American
NP: 
    the office
ngrams:
    the
    office
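
Note that the last noun phrase, "the office", is printed as two single words rather than one bigram, because the `ngram` method returns its input unchanged whenever the phrase has N or fewer tokens. If you would rather get exactly one bigram in that case, a variant along the following lines might work. This is only a sketch: the class name `StrictNgrams` and the method name `ngrams` are made up for illustration and are not part of the answer's code or of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original answer or of OpenNLP.
public class StrictNgrams {

  // Returns every window of n consecutive tokens, joined by the separator.
  // Fewer than n tokens gives an empty list; exactly n tokens gives one n-gram.
  public static List<String> ngrams(List<String> tokens, int n, String separator) {
    if (n <= 0) {
      throw new IllegalArgumentException("n must be positive");
    }
    List<String> out = new ArrayList<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      out.add(String.join(separator, tokens.subList(i, i + n)));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(ngrams(Arrays.asList("the", "office"), 2, " "));
    System.out.println(ngrams(Arrays.asList("the", "United", "States"), 2, " "));
  }
}
```

With this variant, "the office" yields the single bigram `the office`, and a one-word phrase yields no bigrams at all. Whether that is preferable depends on what you do with the n-grams afterwards.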

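This does not explicitly handle cases where an adjective falls outside the NP. If that happens, you can get the information from the POS tags and merge it in. What I have given you should point you in the right direction.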
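
One way to read "get the information from the POS tags" is to skip the chunker and group consecutive tokens whose tags are adjectives or nouns. The sketch below assumes Penn Treebank tags (`JJ*` for adjectives, `NN*` for nouns) and works on two parallel arrays like the `words` and `posTags` arrays in the code above. The class name `NominalRunGrouper` is invented for illustration, and the tags in `main` are written by hand, so a real tagger may label some tokens differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer or of OpenNLP.
public class NominalRunGrouper {

  // Assumes Penn Treebank tags: JJ, JJR, JJS are adjectives; NN, NNS, NNP, NNPS are nouns.
  static boolean isNominal(String tag) {
    return tag.startsWith("JJ") || tag.startsWith("NN");
  }

  // Joins each maximal run of adjective/noun tokens into one phrase.
  public static List<String> group(String[] words, String[] tags) {
    if (words.length != tags.length) {
      throw new IllegalArgumentException("words and tags must have the same length");
    }
    List<String> groups = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (int i = 0; i < words.length; i++) {
      if (isNominal(tags[i])) {
        if (current.length() > 0) {
          current.append(' ');
        }
        current.append(words[i]);
      } else if (current.length() > 0) {
        // A non-nominal token ends the current run
        groups.add(current.toString());
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      groups.add(current.toString());
    }
    return groups;
  }

  public static void main(String[] args) {
    // Hand-written tags for illustration only
    String[] words = {"Barack", "Hussein", "Obama", "is", "the", "current",
                      "President", "of", "the", "United", "States"};
    String[] tags = {"NNP", "NNP", "NNP", "VBZ", "DT", "JJ",
                     "NNP", "IN", "DT", "NNP", "NNPS"};
    System.out.println(group(words, tags));
  }
}
```

For the hand-tagged input this prints `[Barack Hussein Obama, current President, United States]`. Unlike the chunker, this approach drops determiners and splits on anything that is not an adjective or noun, so it is a rough complement to the NP chunks rather than a replacement for them.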