I am training a named entity recognition model, but it does not recognize person names correctly.
My training data is as follows:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . A nonexecutive director has many similar responsibilities as an executive director.However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.
The former chairman of the society <START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities.
Mr . <START:person> Vinken <END> is the most right person in the industry.
His competitior <START:person> Steve <END> is vice chairman of Himbeldon N.V., the Ericson publishing group.
<START:person> Vinken <END> will also be assisted by <START:person> Angelina Tucci <END> who has been recognized many times For Her Good Work.
<START:person> Juilie <END> vp of Weterwood A.B., THE ZS publishing group also supported him.
Mr . <START:person> Stewart <END> is a recruiter of Metric C.D., the Drishti publishing.
He recruited <START:person> Adam <END> who will work on nlp for <START:person> Vinken <END> .
The lead conference for appointing him as a director was held by <START:person> Daniel Smith <END> at Boston.
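Because the name finder training format depends on exact tags, a quick sanity check on each line can be useful before training. The following is a minimal sketch in plain Java and is not part of OpenNLP: the class name `AnnotationCheck` and its `check` method are hypothetical, and it assumes that tags must be written exactly as `<START:type>` and `<END>` and must be balanced within a line. Under that assumption it would flag the `<End>` in the fifth training sentence above; how the real parser treats that token has not been verified here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AnnotationCheck {
    // Exact tag shapes assumed for the name finder training format.
    private static final Pattern START = Pattern.compile("<START(:[^:>\\s]+)?>");
    private static final String END = "<END>";
    // Anything that merely looks like a tag, in any letter case.
    private static final Pattern TAG_LIKE = Pattern.compile("(?i)</?(START|END)[^>]*>");

    /** Returns human-readable problems found in one whitespace-tokenized training line. */
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (START.matcher(token).matches()) {
                if (open) problems.add("nested start tag: " + token);
                open = true;
            } else if (token.equals(END)) {
                if (!open) problems.add("end tag without start");
                open = false;
            } else if (TAG_LIKE.matcher(token).matches()) {
                // Looks like a tag but does not match the exact spelling.
                problems.add("malformed tag: " + token);
            }
        }
        if (open) problems.add("unclosed start tag");
        return problems;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group.",
            "<START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities."
        };
        for (String line : lines) {
            System.out.println(check(line) + " <- " + line);
        }
    }
}
```

Running `check` over every line of the training file and printing the non-empty results gives a list of lines to inspect by hand.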
The Java file used to train the model is:
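import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NamedEntityModel {
    public static void train(String inputfile, String modelfile) throws IOException {
        // Read the annotated training file line by line, one sentence per line
        Charset charset = Charset.forName("UTF-8");
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(new File(inputfile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            // Train a "person" name finder with the default training parameters
            model = NameFinderME.train("en", "person", sampleStream, TrainingParameters.defaultParams(),
                    new TokenNameFinderFactory());
        } finally {
            sampleStream.close();
        }
        // Write the trained model to disk
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelfile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}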
This is what the main class looks like:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class NameFinder {
    public static void main(String[] args) throws IOException {
        String inputfile = "C:/setup/apache-opennlp-1.7.2/bin/ner_training_data.txt";
        String modelfile = "C:/setup/apache-opennlp-1.7.2/bin/en-tr-ner-person.bin";
        NamedEntityModel.train(inputfile, modelfile);

        String sentence = "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate . Peter is on leave today . "
                + "Steve is his competitor . Daniel Smith lead the ceremony. Kristen is svery happpy to know about it. Thomas will u please look into the matter as Ruby is busy";

        // Tokenize the given paragraph on whitespace
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);
        for (String str : tokens)
            System.out.println(str);

        // Load the trained model; the stream is closed once the model has been read
        TokenNameFinderModel model;
        try (InputStream inputStreamNameFinder = new FileInputStream(modelfile)) {
            model = new TokenNameFinderModel(inputStreamNameFinder);
        }
        NameFinderME nameFinder = new NameFinderME(model);

        // Find the name spans and print them with their first token
        Span[] nameSpans = nameFinder.find(tokens);
        System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));
        for (Span s : nameSpans)
            System.out.println(s.toString() + " " + tokens[s.getStart()]);
    }
}
The output is:
[Pierre Vinken, Vinken, Peter, Steve, Daniel Smith, Kristen, Thomas]
The trained model fails to recognize names such as Rudolph Agnew and Ruby. How can I train it more accurately so that it recognizes names more reliably?
Answer 0 (score: 1)
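+1 to @caffeinator13's answer. In addition, there are some training parameters (https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/TrainingParameters.html; the link is for an older version, but I expect the parameters still exist in newer versions) that control the number of iterations and, probably more relevant for you, the cutoff, i.e. the minimum number of times a feature must occur in the training data to be taken into account. This setting more or less controls the trade-off between precision and recall, and you should probably set it more leniently (as far as I know, the defaults are 100 iterations and a cutoff of 5). So you could try overriding the default parameters: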
TrainingParameters tp = new TrainingParameters();
tp.put(TrainingParameters.CUTOFF_PARAM, "1");
tp.put(TrainingParameters.ITERATIONS_PARAM, "100");
TokenNameFinderFactory tnff = new TokenNameFinderFactory();
model = NameFinderME.train(language, modelName, sampleStream, tp, tnff);
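To see why the cutoff matters for a training file of only a dozen sentences, here is a toy illustration in plain Java. It is not OpenNLP's feature generation: it uses raw token counts as a stand-in for features, and the `CutoffDemo` class and `survivingFeatures` method are made up for this sketch. It assumes a feature is kept when it occurs at least `cutoff` times.

```java
import java.util.HashMap;
import java.util.Map;

class CutoffDemo {
    /** Counts how many distinct tokens occur at least `cutoff` times across all sentences. */
    static int survivingFeatures(String[] sentences, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            for (String token : sentence.trim().split("\\s+")) {
                if (token.isEmpty()) continue; // skip the empty token produced by a blank line
                counts.merge(token, 1, Integer::sum);
            }
        }
        int surviving = 0;
        for (int count : counts.values()) {
            if (count >= cutoff) surviving++;
        }
        return surviving;
    }

    public static void main(String[] args) {
        // A few sentences in the spirit of the training data from the question.
        String[] sentences = {
            "Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group .",
            "Mr . Stewart is a recruiter of Metric C.D. , the Drishti publishing .",
            "He recruited Adam who will work on nlp for Vinken ."
        };
        System.out.println("kept with cutoff 1: " + survivingFeatures(sentences, 1));
        System.out.println("kept with cutoff 5: " + survivingFeatures(sentences, 5));
    }
}
```

With so little text, almost nothing repeats five times, so a high cutoff leaves the trainer very little to learn from; a cutoff of 1 keeps everything, at the cost of learning from features that were seen only once.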
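Answer 1 (score: 0)
According to the OpenNLP documentation, the training data should contain at least 15000 sentences to create a model that performs well. So train it with more data, and try testing with different names instead of keeping the test data the same as the training data!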
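Keeping test sentences separate from training sentences can be done with a simple shuffle-and-split over the annotated lines. This is a minimal sketch, assuming one annotated sentence per line; `HoldOutSplit`, `Split`, and the 20% held-out fraction are illustrative choices and not anything provided by OpenNLP:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class HoldOutSplit {
    /** Holds the sentences used for training and the sentences kept back for testing. */
    static final class Split {
        final List<String> train;
        final List<String> heldOut;

        Split(List<String> train, List<String> heldOut) {
            this.train = train;
            this.heldOut = heldOut;
        }
    }

    /** Shuffles a copy of the lines with a fixed seed and holds back the given fraction. */
    static Split split(List<String> lines, double heldOutFraction, long seed) {
        if (heldOutFraction < 0.0 || heldOutFraction > 1.0) {
            throw new IllegalArgumentException("heldOutFraction must be between 0 and 1");
        }
        List<String> shuffled = new ArrayList<>(lines);
        Collections.shuffle(shuffled, new Random(seed));
        int heldOutSize = (int) Math.round(shuffled.size() * heldOutFraction);
        List<String> heldOut = new ArrayList<>(shuffled.subList(0, heldOutSize));
        List<String> train = new ArrayList<>(shuffled.subList(heldOutSize, shuffled.size()));
        return new Split(train, heldOut);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            lines.add("<START:person> Person" + i + " <END> attended meeting " + i + " .");
        }
        Split split = split(lines, 0.2, 42L);
        System.out.println("training sentences: " + split.train.size());
        System.out.println("held-out sentences: " + split.heldOut.size());
    }
}
```

The `train` lines would be written to the file passed to the trainer, while the `heldOut` lines are only used to check what the model finds, so names seen during training do not inflate the result.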