How many lines and documents should the training data for the OpenNLP document categorizer contain?

Time: 2015-05-11 13:10:25

Tags: opennlp

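I am following the documentation for Apache OpenNLP. I was able to understand the sentence detector, the tokenizer, and the name finder, but I am stuck on the categorizer. What I cannot work out is how to create a model for categorization.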

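I understand that I need to create a training file. The format is clear: each line holds a category, a space, and then one document. The file is saved with a .train extension.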
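As a sketch of that line format, assuming whitespace-separated tokens: the helper name parse_train_line is hypothetical and not part of OpenNLP. It only illustrates how one training line breaks down into a category and the document tokens.

```python
def parse_train_line(line):
    """Split 'Category document text ...' into (category, tokens).

    Assumption: the first whitespace-separated token is the category
    and everything after it is the document.
    """
    parts = line.split()
    if len(parts) < 2:
        raise ValueError("expected a category followed by document text")
    return parts[0], parts[1:]


category, tokens = parse_train_line(
    "Refund What is the refund status for my order #342 ?"
)
print(category, len(tokens))  # prints: Refund 10
```

The ten tokens correspond to the ten bow= features that the trainer log reports for this line.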

So I created the following file:

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?

I ran this command:

opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8

It starts doing something and then returns an error. This is the output in the command prompt:

Indexing events using cutoff of 5

    Computing event counts...  done. 2 events
    Indexing...  Dropped event Refund:[bow=What, bow=is, bow=the, bow=refund, bow=status, bow=for, bow=my, bow=order, bow=#342, bow=?]
Dropped event NewOffers:[bow=Are, bow=there, bow=any, bow=new, bow=offers, bow=for, bow=your, bow=products, bow=?]
done.
Sorting and merging events... Done indexing.
Incorporating indexed data for training...  
Exception in thread "main" java.lang.NullPointerException
    at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
    at opennlp.maxent.GIS.trainModel(GIS.java:256)
    at opennlp.model.TrainUtil.train(TrainUtil.java:184)
    at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:162)
    at opennlp.tools.cmdline.doccat.DoccatTrainerTool.run(DoccatTrainerTool.java:61)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)

I just cannot figure out why this throws a NullPointerException. I also tried adding two more lines, but with the same result.

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?  

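I found this blog, but the same thing is done there too. When I tried the author's training file, it worked like a charm. What is wrong with my file, and how can I fix the error?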

When I run opennlp DoccatTrainer on its own, it prints the help text, so the path is not the problem. Any help is appreciated.

Edit: I changed the file to

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ? 

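and it worked. I assumed it had something to do with the documents (apparently each one should contain two sentences), so I removed the last two lines,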

making it

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products? 

But it failed again. So the question now boils down to this: what kind of data, format, and documents does the trainer need?

Thanks

2 answers:

Answer 0: (score: 4)

You need to add more than 5 samples for each category, because the default cutoff is 5.

See this blog post: http://madhawagunasekara.blogspot.com/2014/11/nlp-categorizer.html
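To see why a small file fails, here is a rough Python approximation of the indexing step. It rests on an assumption suggested by the "Dropped event" lines in the log: each bag-of-words feature is counted across all training lines, and a line is dropped when none of its features reaches the cutoff. The function name dropped_lines is hypothetical, and this is not OpenNLP's actual code.

```python
from collections import Counter


def dropped_lines(lines, cutoff=5):
    """Return the (category, tokens) samples that keep no feature.

    Assumption: a feature survives when it occurs at least `cutoff`
    times over the whole file, counting repeats within a line.
    """
    samples = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or category-only lines
        samples.append((parts[0], parts[1:]))

    # Count every token occurrence across all samples.
    counts = Counter(tok for _, toks in samples for tok in toks)

    # A sample is dropped when no token reaches the cutoff.
    return [
        (cat, toks)
        for cat, toks in samples
        if not any(counts[t] >= cutoff for t in toks)
    ]


two_lines = [
    "Refund What is the refund status for my order #342 ?",
    "NewOffers Are there any new offers for your products ?",
]
print(len(dropped_lines(two_lines, cutoff=5)))  # prints: 2
print(len(dropped_lines(two_lines, cutoff=1)))  # prints: 0
```

Under this assumption, both lines of the two-line file are dropped at the default cutoff of 5, leaving the trainer with no events. The four-line file from the question's edit keeps every line, because the "?" token occurs five times in total.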

Answer 1: (score: 0)

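You can use the -cutoff flag in the DoccatTrainer command to change the default value. In your case, you would add -cutoff 1 to set the minimum number of documents per category to 1.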