Question

I am building a 15k-line training data file named en-ner-person.train, following the online manual (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html).

My question is: do I include the whole report in my training file, or only the lines that contain names, such as <START:person> John Smith <END>?

For example, do I use the whole report in my training data:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
A nonexecutive director has many similar responsibilities as an executive director.
However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Or do I include only these two lines in the training file:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Answer 1

You should use the whole report. This helps the system learn when not to tag an entity, which should reduce false positives.

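You can measure this with the evaluation tool. Hold out some of the sentences in your corpus for testing, for example 1/10 of the total, and train your model on the other 9/10. Try training one model on the whole report and another on only the sentences that contain names. The results are reported as precision and recall.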
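As a rough illustration of the 9/10 and 1/10 split described above, here is a minimal sketch in plain Java. The class name `HoldOutSplit` and its `split` method are made up for this example and are not part of OpenNLP. It assumes one sentence per line and simply sends every tenth sentence to the test set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class HoldOutSplit {
    /** Holds the two parts of a split corpus. */
    static final class Split {
        public final List<String> train;
        public final List<String> test;

        Split(List<String> train, List<String> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * Sends every folds-th sentence to the test set and all other
     * sentences to the training set, preserving the original order.
     */
    static Split split(List<String> sentences, int folds) {
        if (folds < 2) {
            throw new IllegalArgumentException("folds must be at least 2");
        }
        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (i % folds == folds - 1) {
                test.add(sentences.get(i));
            } else {
                train.add(sentences.get(i));
            }
        }
        return new Split(train, test);
    }

    public static void main(String[] args) {
        // Placeholder sentences; in practice these would be the lines
        // of the annotated corpus, with and without names.
        List<String> sentences = Arrays.asList(
            "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9");
        Split parts = split(sentences, 10);
        System.out.println("train: " + parts.train.size()
            + ", test: " + parts.test.size());
    }
}
```

The two lists could then be written to separate files, one for training and one for the evaluation tool. The split is done before any filtering, so the test part keeps sentences without names.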

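Remember to draw the test sample from the whole report, not just from the sentences with names. Otherwise you cannot accurately measure how the model performs on sentences without names.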
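To see why the test set matters, here is a small sketch of the arithmetic behind precision and recall. The class `PrecisionRecall` and all counts in `main` are invented purely for illustration; they are not results from any real model.

```java
// Hypothetical helper, not part of OpenNLP.
class PrecisionRecall {
    /** Fraction of predicted names that are correct: tp / (tp + fp). */
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Fraction of real names that were found: tp / (tp + fn). */
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    public static void main(String[] args) {
        // Made-up counts on a test set limited to sentences with names.
        int tp = 8, fn = 2, fpInNameSentences = 1;
        // Made-up extra false positives that only show up when
        // sentences without names are part of the test set.
        int fpInOtherSentences = 3;

        System.out.println("names only, precision: "
            + precision(tp, fpInNameSentences));
        System.out.println("whole report, precision: "
            + precision(tp, fpInNameSentences + fpInOtherSentences));
        System.out.println("recall in both cases: " + recall(tp, fn));
    }
}
```

With these invented counts, precision is 8/9 on the names-only test set but 8/12 on the whole report, while recall stays at 8/10. A names-only test set would hide exactly the mistakes made on sentences that have no names.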

Answer 2

I would include everything, even though not all of it may contribute to the weights in the trained model.

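What actually gets used from the training file is determined by the feature generators used to train the model. If you get to the point of tuning the feature generators, then at least you won't need to rebuild your training file if it already contains everything.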

The example feature generator in the documentation (section "Custom Feature Generation") also happens to be the default used in the name finder code:

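import opennlp.tools.util.featuregen.*;

// Feature generator setup from the manual, also the name finder's default.
AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
    new AdaptiveFeatureGenerator[] {
        // Token text in a window of 2 tokens before and 2 tokens after
        new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
        // Token class in the same window
        new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
        new OutcomePriorFeatureGenerator(),
        new PreviousMapFeatureGenerator(),
        new BigramNameFeatureGenerator(),
        new SentenceFeatureGenerator(true, false)
    });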

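I can't explain everything this code does, and I haven't found good documentation or worked through the source to understand it, but the WindowFeatureGenerators take into account the token and the token's class (for example, whether the token has been tagged as a person) at +/- 2 positions before and after the token being examined.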
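To make the +/- 2 window concrete, here is a simplified sketch in plain Java. It is not OpenNLP's implementation: the class `WindowSketch`, the method `windowFeatures`, and the feature name prefixes (`w`, `p1`, `n1`, ...) are assumptions chosen for illustration. It only shows how the tokens around the current position end up as features for that position.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration, not the OpenNLP WindowFeatureGenerator.
class WindowSketch {
    /**
     * Returns features for tokens[index]: the token itself plus up to
     * prev tokens before it and up to next tokens after it. Positions
     * that fall outside the sentence are skipped.
     */
    static List<String> windowFeatures(String[] tokens, int index,
                                       int prev, int next) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index]);
        for (int i = 1; i <= prev; i++) {
            if (index - i >= 0) {
                features.add("p" + i + "=" + tokens[index - i]);
            }
        }
        for (int i = 1; i <= next; i++) {
            if (index + i < tokens.length) {
                features.add("n" + i + "=" + tokens[index + i]);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens =
            "Mr . Vinken is chairman of Elsevier N.V. .".split(" ");
        // Features for "Vinken" with a window of 2 before and 2 after.
        System.out.println(windowFeatures(tokens, 2, 2, 2));
        // prints [w=Vinken, p1=., p2=Mr, n1=is, n2=chairman]
    }
}
```

In this sketch, untagged tokens such as "Mr" and "is" become part of the features for "Vinken", which is one way the surrounding text can matter even though it carries no tags itself.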

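So tokens in sentences without entities can still have an impact on a sentence that has them. By trimming out the extra sentences, you may end up training your model on unnatural patterns, such as a sentence ending with a name followed by a sentence beginning with a name, like this: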

The car fell on <START:person> Pierre Vinken <END> . <START:person> Pierre Vinken <END> is the chairman .

OpenNLP Name Finder training
