I want to group all the named entities in a given document. For example,
**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office.
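I don't want to use the OpenNLP API because it may fail to recognize all named entities. Is there a way to generate such n-grams with another service, or perhaps a way to group all the noun terms together?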
Answer 0 (score: 4)
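If you want to avoid NER, you can use a sentence chunker or a parser, either of which will extract noun phrases generically. OpenNLP has both a sentence chunker and a parser, but if you are averse to OpenNLP for some reason, you can try others. If you are interested in the OpenNLP chunker, I will post some code that uses it to extract noun phrases.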
Here is the code. You need to download the models from SourceForge:
http://opennlp.sourceforge.net/models-1.5/
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 * Extracts noun phrases from a sentence. To split text into sentences with
 * OpenNLP, use the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

    // Size of the n-grams printed for each noun phrase (2 = bigrams).
    static final int N = 2;

    public static void main(String[] args) {
        // Adjust the directory and file names to match the models you downloaded.
        String modelPath = "c:\\temp\\opennlpmodels\\";
        // try-with-resources closes the model streams once the models are loaded.
        try (InputStream tokenIn = new FileInputStream(modelPath + "en-token.zip");
             InputStream posIn = new FileInputStream(modelPath + "en-pos-maxent.zip");
             InputStream chunkerIn = new FileInputStream(modelPath + "en-chunker.zip")) {
            TokenizerME wordBreaker = new TokenizerME(new TokenizerModel(tokenIn));
            POSTaggerME posme = new POSTaggerME(new POSModel(posIn));
            ChunkerME chunkerME = new ChunkerME(new ChunkerModel(chunkerIn));

            // this is your sentence
            String sentence = "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office.";
            // words is the tokenized sentence
            String[] words = wordBreaker.tokenize(sentence);
            // posTags are the parts of speech of every word (the chunker needs them)
            String[] posTags = posme.tag(words);
            // chunks are start/end "spans": indices into the words array
            Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
            // chunkStrings are the actual chunk texts
            String[] chunkStrings = Span.spansToStrings(chunks, words);

            for (int i = 0; i < chunks.length; i++) {
                // keep only the noun-phrase chunks
                if (chunks[i].getType().equals("NP")) {
                    System.out.println("NP: \n\t" + chunkStrings[i]);
                    String[] split = chunkStrings[i].split(" ");
                    List<String> ngrams = ngram(Arrays.asList(split), N, " ");
                    System.out.println("ngrams:");
                    for (String gram : ngrams) {
                        System.out.println("\t" + gram);
                    }
                }
            }
        } catch (IOException e) {
            // report model-loading failures instead of swallowing them
            e.printStackTrace();
        }
    }

    public static List<String> ngram(List<String> input, int n, String separator) {
        // Inputs with n or fewer tokens are returned unchanged, one entry per token.
        if (input.size() <= n) {
            return input;
        }
        List<String> outGrams = new ArrayList<String>();
        // Slide a window of n tokens across the input.
        for (int i = 0; i + n <= input.size(); i++) {
            StringBuilder gram = new StringBuilder();
            for (int x = i; x < i + n; x++) {
                if (x > i) {
                    gram.append(separator);
                }
                gram.append(input.get(x));
            }
            outGrams.add(gram.toString());
        }
        return outGrams;
    }
}
The output I get with your sentence is this (with N set to 2, i.e. bigrams):
NP:
    Barack Hussein Obama II
ngrams:
    Barack Hussein
    Hussein Obama
    Obama II
NP:
    the 44th and current President
ngrams:
    the 44th
    44th and
    and current
    current President
NP:
    the United States
ngrams:
    the United
    United States
NP:
    the first African American
ngrams:
    the first
    first African
    African American
NP:
    the office
ngrams:
    the
    office
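Note that the last phrase, "the office", is printed as two separate tokens rather than one bigram. That happens because `ngram` returns its input unchanged whenever it has N or fewer tokens. If you would rather get a single n-gram in that case, a sliding-window variant could look roughly like the sketch below. The class name `SlidingNGrams` and its method are made up for illustration and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
final class SlidingNGrams {

    /** Returns every run of n consecutive tokens, joined by separator. */
    static List<String> ngrams(List<String> tokens, int n, String separator) {
        List<String> grams = new ArrayList<>();
        if (n <= 0) {
            return grams; // nothing sensible to produce
        }
        // The last valid window starts at tokens.size() - n, so an input
        // shorter than n yields an empty list and an input of exactly n
        // tokens yields one gram.
        for (int start = 0; start + n <= tokens.size(); start++) {
            grams.add(String.join(separator, tokens.subList(start, start + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(Arrays.asList("the", "office"), 2, " "));
        System.out.println(ngrams(Arrays.asList("the", "United", "States"), 2, " "));
    }
}
```

With this variant a phrase shorter than N produces no n-grams at all, so decide whether you want to keep such phrases separately.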
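This does not explicitly handle adjectives that fall outside the NP. If that happens, you can get that information from the POS tags and merge it in. What I have given you should point you in the right direction.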
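As a rough sketch of that last idea: once you have the token array, the POS tags and the NP spans, you can move each span's start to the left over any adjectives that sit directly before it. The code below is plain Java with no OpenNLP dependency. The class `AdjectiveAbsorber`, its methods, and the hand-written tags and spans are all assumptions for illustration, not real chunker output. It assumes Penn Treebank tags (JJ, JJR, JJS for adjectives) and spans given as {start, end} with an exclusive end, which is the same convention as the `Span` objects in the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
final class AdjectiveAbsorber {

    /**
     * Moves a span start left over directly preceding adjective tags
     * (JJ, JJR, JJS), never going below floor.
     */
    static int extendLeft(String[] posTags, int start, int floor) {
        while (start > floor && posTags[start - 1].startsWith("JJ")) {
            start--;
        }
        return start;
    }

    /** npSpans holds {start, end} pairs, end exclusive, in sentence order. */
    static List<String> phrases(String[] words, String[] posTags, int[][] npSpans) {
        List<String> out = new ArrayList<>();
        int floor = 0; // do not reach back into the previous noun phrase
        for (int[] span : npSpans) {
            int start = extendLeft(posTags, span[0], floor);
            out.add(String.join(" ", Arrays.copyOfRange(words, start, span[1])));
            floor = span[1];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example in which the adjective "red" was left outside the NP.
        String[] words = {"He", "bought", "red", "apples"};
        String[] tags = {"PRP", "VBD", "JJ", "NNS"};
        int[][] npSpans = {{0, 1}, {3, 4}};
        System.out.println(phrases(words, tags, npSpans));
    }
}
```

In the extractor above you would take the start and end values from `chunks[i].getStart()` and `chunks[i].getEnd()` for the chunks whose type is "NP", and pass `words` and `posTags` as they are.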