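I am new to NLP and to the OpenNLP library, and I am currently experimenting with some of its features, in particular its ability to extract organization names. If I use a simple string such as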
"Bill worked at Microsoft Corp., JP Morgan Chase, Monsanto and General Motors and was amazed at what went on in Congress. "
my code outputs:
Detected name "Bill". Type person with probability of 0.9604452678787172
Detected name "Microsoft Corp .". Type organization with probability of 0.9976452599132802
Detected name "JP Morgan Chase". Type organization with probability of 0.9064399433766583
Detected name "Monsanto". Type organization with probability of 0.7429123227376515
Detected name "General Motors". Type organization with probability of 0.965472905375375
Detected name "Congress". Type organization with probability of 0.9940809804351413
Everything looks fine. However, if I switch to a more British view of the world, for example
"Mark worked at The University of London, HSBC, The Royal Bank of Scotland, Dyson and GlaxoSmithKline."
I get
Detected name "Mark". Type person with probability of 0.7496973664676362
Detected name "London". Type location with probability of 0.6625435519843291
Detected name "Scotland". Type location with probability of 0.9564118675997605
Detected name "University of London". Type organization with probability of 0.8516268558212053
Detected name "Royal Bank". Type organization with probability of 0.8953174632171774
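This is clearly much less successful. Is this because the organization finder does not know about British institutions, or am I just unlucky? If it is the former, is there a way to take the existing model and extend its knowledge so that it covers British institutions better? I had a quick look for the training data behind the existing organization model but could not find anything.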
Answer 0 (score: 2)
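I could not find any documentation on the training data with a quick search either, but the model was probably trained on US newspaper text (Wall Street Journal or Reuters, perhaps from the MUC or CoNLL datasets), which would explain why it performs poorly on British entities.
The existing model cannot be extended, but if you have annotated data you can train your own model that includes British entities.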
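To give an idea of what "annotated data" means here: OpenNLP's name finder trainer reads plain text with one whitespace-tokenized sentence per line, where each entity is wrapped in `<START:type> ... <END>` markers. Below is a minimal sketch that produces such a line from tokens plus entity positions. The class name `NerTrainingLine`, the `Entity` helper and the `annotate` method are hypothetical names made up for this illustration, not part of OpenNLP, and treating "The" as part of the organization name is just one possible annotation choice.

```java
import java.util.List;

public class NerTrainingLine {

    /** Hypothetical helper: an entity covering tokens [start, end) with a type label. */
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Builds one training line in the <START:type> ... <END> format. */
    static String annotate(String[] tokens, List<Entity> entities) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Open a marker if an entity begins at this token.
            for (Entity e : entities) {
                if (e.start == i) {
                    sb.append("<START:").append(e.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            // Close the marker if an entity ends right after this token.
            for (Entity e : entities) {
                if (e.end == i + 1) {
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Mark", "worked", "at", "The", "Royal", "Bank", "of",
                           "Scotland", ",", "Dyson", "and", "GlaxoSmithKline", "."};
        List<Entity> entities = List.of(
                new Entity(0, 1, "person"),
                new Entity(3, 8, "organization"),
                new Entity(9, 10, "organization"),
                new Entity(11, 12, "organization"));
        System.out.println(annotate(tokens, entities));
    }
}
```

For the sample sentence this prints:

<START:person> Mark <END> worked at <START:organization> The Royal Bank of Scotland <END> , <START:organization> Dyson <END> and <START:organization> GlaxoSmithKline <END> .

A file of such lines can then be given to OpenNLP's name finder training (the `TokenNameFinderTrainer` command-line tool, or `NameFinderME.train` from code). In practice you would need a large number of annotated sentences, not just a handful, before the resulting model becomes useful.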