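I'm trying to build a chatbot for our company: we send it a message, and it uses opennlp to parse the string and run some scripts.
For example, a query would be
"I'm going to work on ProjectY, can you close ProjectX?"
That should run the script closeRepo.sh with the argument ProjectX.
The problem I'm having is that it correctly parses the sentence above into 2 parts:
"I'm going to work on ProjectY"
and "can you close ProjectX"
But not every possible project name is parsed correctly. I have one project name that opennlp does not treat as an NP but as an ADVP or something similar; I think it reads the sentence as "can you close quickly" or something like that.
Here is my parsing code. The models are already loaded (I use the standard models provided here: http://opennlp.sourceforge.net/models-1.5/)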
    String[] sentences = sentenceDetector.sentDetect(input);
    for (int i = 0; i < sentences.length; i++) {
        // Re-join the tokens with single spaces, because ParserTool.parseLine
        // splits its input on whitespace. String.join also copes with an
        // empty token array, where deleting a trailing space would throw.
        String[] tokens = tokenizer.tokenize(sentences[i]);
        sentences[i] = String.join(" ", tokens);
    }
    ArrayList<Parse> parses = new ArrayList<>();
    for (String s : sentences) {
        // Request only the single best parse for each sentence.
        Parse[] topParses = ParserTool.parseLine(s, parser, 1);
        if (topParses.length > 0) {
            parses.add(topParses[0]);
        }
    }
    return parses;
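I'm willing to switch to Stanford's NLP if that would be easier. But my question is:
Is there a way to give opennlp a list of my projects and have it detect them as NP or NN?
Answer 0 (score: 1)
You may be better off using the OpenNLP sentence chunker, which works well, and checking whether any of the noun phrases contains one of your project names. Something like this: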
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    /**
     * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
     * the SentenceDetector classes.
     */
    public class OpenNLPNounPhraseExtractor {

        public static void main(String[] args) {
            // adjust the directory and file names to match your local model files
            String modelPath = "c:\\temp\\opennlpmodels\\";
            // try-with-resources closes the model streams once the models are loaded
            try (InputStream tokenIn = new FileInputStream(new File(modelPath + "en-token.zip"));
                 InputStream posIn = new FileInputStream(new File(modelPath + "en-pos-maxent.zip"));
                 InputStream chunkerIn = new FileInputStream(new File(modelPath + "en-chunker.zip"))) {
                TokenizerME wordBreaker = new TokenizerME(new TokenizerModel(tokenIn));
                POSTaggerME posme = new POSTaggerME(new POSModel(posIn));
                ChunkerME chunkerME = new ChunkerME(new ChunkerModel(chunkerIn));
                //this is your sentence
                String sentence = "Barack Hussein Obama II is the 44th President of the United States, and the first African American to hold the office.";
                //words is the tokenized sentence
                String[] words = wordBreaker.tokenize(sentence);
                //posTags are the parts of speech of every word in the sentence (the chunker needs this info)
                String[] posTags = posme.tag(words);
                //chunks are the start/end "span" indices of the chunks in the words array
                Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
                //chunkStrings are the actual chunks
                String[] chunkStrings = Span.spansToStrings(chunks, words);
                for (int i = 0; i < chunks.length; i++) {
                    //the chunker returns every phrase type (NP, VP, PP, ...); keep noun phrases only
                    if (!"NP".equals(chunks[i].getType())) {
                        continue;
                    }
                    String np = chunkStrings[i];
                    if (np.contains("some project name")) {
                        System.out.println(np);
                        //do something here
                    }
                }
            } catch (IOException e) {
                //report model-loading failures instead of silently swallowing them
                e.printStackTrace();
            }
        }
    }
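The hard-coded `np.contains("some project name")` check in the loop above can be extended to a whole list of project names. Here is a minimal sketch in plain Java, assuming every project name is a single token and that the chunk strings are space-separated tokens (which is what `Span.spansToStrings` produces). The class `ProjectNameMatcher` and its method `findProjects` are hypothetical names for illustration, not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ProjectNameMatcher {
    // Hypothetical list of known project names (assumed to be single tokens).
    private final List<String> projectNames;

    public ProjectNameMatcher(List<String> projectNames) {
        this.projectNames = new ArrayList<>(projectNames);
    }

    /** Returns each known project name that occurs as a whole token in any chunk. */
    public List<String> findProjects(String[] chunkStrings) {
        List<String> found = new ArrayList<>();
        for (String chunk : chunkStrings) {
            // Compare whole tokens, case-insensitively, so that "ProjectX"
            // does not match inside a longer token such as "ProjectXY".
            List<String> tokens = Arrays.asList(chunk.toLowerCase(Locale.ROOT).split("\\s+"));
            for (String name : projectNames) {
                if (tokens.contains(name.toLowerCase(Locale.ROOT)) && !found.contains(name)) {
                    found.add(name);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        ProjectNameMatcher matcher = new ProjectNameMatcher(Arrays.asList("ProjectX", "ProjectY"));
        // Hand-written stand-ins for chunker output, not real OpenNLP results.
        String[] chunks = {"I", "ProjectY", "you", "ProjectX"};
        System.out.println(matcher.findProjects(chunks));
    }
}
```

Because the lookup only needs strings, the same method could also be fed the raw tokens instead of the chunks, which would sidestep the case where the chunker does not put a project name inside an NP at all.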
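By the way, what you are trying to do puts very high expectations on statistical NLP methods. Sentence chunking is model-based, and if your chat messages don't match the general shape of the data the model was built from, your results will be problematic no matter whether you use opennlp, stanford, or anything else. It sounds like you are also trying to extract an "action to take" related to the project-name NP; for that you could experiment with verb phrase extraction. I would not recommend automatically triggering sh scripts based on a probabilistic parse of potentially noisy sentences!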
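One way to act on that caution is to have the bot only propose a command and wait for a human to confirm it. The sketch below is plain Java and makes several assumptions: the class `CommandProposer`, the verb-to-script whitelist, and the rule "a whitelisted verb followed later by a known project name" are all hypothetical; only `closeRepo.sh` and the example sentence come from the question. It does not use a parser at all, so it is a deliberately simple stand-in for verb phrase extraction, not an implementation of it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class CommandProposer {
    // Hypothetical whitelist of verbs and the scripts they map to.
    private final Map<String, String> scriptsByVerb = new HashMap<>();
    // Lower-cased project name -> canonical spelling.
    private final Map<String, String> projectsByLowerName = new HashMap<>();

    public CommandProposer(List<String> knownProjects) {
        scriptsByVerb.put("close", "closeRepo.sh");
        for (String project : knownProjects) {
            projectsByLowerName.put(project.toLowerCase(Locale.ROOT), project);
        }
    }

    /**
     * Proposes [script, project] when a whitelisted verb is followed, later in
     * the token sequence, by a known project name. Nothing is executed here.
     */
    public Optional<List<String>> propose(String[] tokens) {
        String script = null;
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (scriptsByVerb.containsKey(lower)) {
                script = scriptsByVerb.get(lower);
            } else if (script != null && projectsByLowerName.containsKey(lower)) {
                return Optional.of(Arrays.asList(script, projectsByLowerName.get(lower)));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        CommandProposer proposer = new CommandProposer(Arrays.asList("ProjectX", "ProjectY"));
        // Hand-written tokens standing in for tokenizer output.
        String[] tokens = {"I", "'m", "going", "to", "work", "on", "ProjectY", ",",
                           "can", "you", "close", "ProjectX", "?"};
        Optional<List<String>> command = proposer.propose(tokens);
        if (command.isPresent()) {
            System.out.println("Proposed (awaiting confirmation): " + String.join(" ", command.get()));
        } else {
            System.out.println("No command recognized");
        }
    }
}
```

Returning the command as a list of arguments rather than a single shell string means a confirmed command could later be handed to `ProcessBuilder` without shell interpolation of user-supplied text.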