Question

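I am trying to extract named entities from my text using NLTK. I find that NLTK's NER is not accurate enough for my purposes, and I also want to add some tags of my own. I have been trying to find a way to train my own NER, but I can't seem to find the right resources. I have a few questions about NLTK:

Can I train a named entity recognizer in NLTK using my own data?
If I can train on my own data, is named_entity.py the file to modify?
Does the input file have to be in IOB format, e.g. Eric NNP B-PERSON?
Are there any resources I can use, apart from the NLTK Cookbook and NLP with Python?

I would really appreciate help with this.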

Answer 1

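Are you committed to using NLTK/Python? I ran into the same problems as you and got much better results with Stanford's named entity recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml. The FAQ describes the process of training the classifier on your own data in great detail.

If you really need to use NLTK, I would look to the mailing list for advice from other users: http://groups.google.com/group/nltk-users.

Hope this helps!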

Answer 2

You can easily use Stanford NER together with NLTK. The Python script looks like this:

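import os
# NERTagger was renamed to StanfordNERTagger in later NLTK 3.x releases
from nltk.tag.stanford import StanfordNERTagger

# Tell NLTK where the Java executable lives (adjust for your system)
java_path = "/Java/jdk1.8.0_45/bin/java.exe"
os.environ['JAVAHOME'] = java_path

# Arguments: path to the trained model, then path to the Stanford NER jar
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
text = "Eric lives in Berlin"  # placeholder input; substitute your own text
tagging = st.tag(text.split())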

To train on your own data and create a model, refer to the first question in the Stanford NER FAQ.

The link is http://nlp.stanford.edu/software/crf-faq.shtml
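
As a rough illustration of the data preparation step, the sketch below converts hand-labelled tokens into the tab-separated "token, label" layout that the FAQ's training example uses. The function name, the sample sentences, and the blank line between sentences are assumptions made for this example, not something taken from the FAQ.

```python
def to_stanford_tsv(tagged_sentences):
    """Render sentences as 'token<TAB>label' lines, one token per line.

    A blank line is emitted between sentences (an assumption of this sketch).
    """
    lines = []
    for sentence in tagged_sentences:
        for token, label in sentence:
            lines.append(f"{token}\t{label}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)


# Hypothetical hand-labelled data; "O" marks tokens outside any entity.
sentences = [
    [("Eric", "PERSON"), ("lives", "O"), ("in", "O"), ("Berlin", "LOCATION")],
    [("He", "O"), ("likes", "O"), ("horses", "ANIMAL")],
]

print(to_stanford_tsv(sentences))
```

Saving that string to a file and pointing the `trainFile` entry of a properties file at it (alongside `serializeTo` and `map = word=0,answer=1`, as described in the FAQ) should then let you train with `java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop <your.prop>`; check the FAQ for the full list of feature flags.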

Answer 3

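I had this problem too, but I managed to solve it. You can use your own training data. I documented the main requirements and steps in a GitHub repository.

I use NLTK-trainer, so basically you have to get your training data into the right format (token NNP B-tag) and then run the training script. Check my repository for more information.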
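
To make the "token NNP B-tag" format concrete, here is a minimal sketch that reads such lines and groups the B-/I- tags back into entities. It uses only the standard library; the function names and the sample lines are made up for illustration and are not part of NLTK or NLTK-trainer.

```python
def read_iob(lines):
    """Parse 'token POS IOB-tag' lines into sentences of (token, pos, iob) triples.

    Blank lines separate sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, iob = line.split()
        current.append((token, pos, iob))
    if current:
        sentences.append(current)
    return sentences


def iob_to_spans(sentence):
    """Group consecutive B-X / I-X tokens into (entity_text, label) pairs."""
    spans, tokens, label = [], [], None
    for token, _pos, iob in sentence:
        starts_new = iob.startswith("B-") or (iob.startswith("I-") and label != iob[2:])
        if starts_new:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [token], iob[2:]
        elif iob.startswith("I-"):
            tokens.append(token)
        else:  # "O": outside any entity
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


sample = [
    "Eric NNP B-PERSON",
    "Smith NNP I-PERSON",
    "visited VBD O",
    "Berlin NNP B-GPE",
]
parsed = read_iob(sample)
print(iob_to_spans(parsed[0]))
```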

Answer 4

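The nltk.chunk.named_entity module contains some functions for training an NER tagger. However, they were written specifically for the ACE corpus and have not been fully cleaned up, so you will need to write your own training procedure, using them as a reference.

There are also two relatively recent guides online (1 2) that walk through training on the GMB corpus with NLTK in detail.

However, as the answers above mention, with so many tools available now there is really no need to use NLTK if you want a streamlined training process. Toolkits such as CoreNLP and spaCy do a much better job. Since using NLTK is not much different from writing your own training code from scratch, there is little value in doing so. NLTK and OpenNLP can, to some extent, be seen as belonging to an era before the recent developments in NLP.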
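
One piece of such a training procedure is the feature function. The sketch below is a hypothetical feature detector written with the `(tokens, index, history)` signature that `nltk.tag.ClassifierBasedTagger` accepts through its `feature_detector` argument, where `tokens` is a list of (word, POS) pairs and `history` holds the IOB tags already predicted for the sentence. The particular features are an illustrative choice, not the ones used in named_entity.py.

```python
def ner_features(tokens, index, history):
    """Return a feature dict for the token at `index`.

    tokens  : list of (word, pos) pairs for one sentence
    history : IOB tags already assigned to tokens[:index]
    """
    word, pos = tokens[index]
    if index > 0:
        prev_word, prev_pos = tokens[index - 1]
    else:
        prev_word, prev_pos = "<START>", "<START>"
    if index < len(tokens) - 1:
        next_word, next_pos = tokens[index + 1]
    else:
        next_word, next_pos = "<END>", "<END>"
    prev_iob = history[-1] if history else "<START>"

    return {
        "word": word,
        "lower": word.lower(),
        "pos": pos,
        "is-capitalized": word[:1].isupper(),
        "is-digit": word.isdigit(),
        "suffix-3": word[-3:],
        "prev-word": prev_word,
        "prev-pos": prev_pos,
        "next-word": next_word,
        "next-pos": next_pos,
        "prev-iob": prev_iob,
    }


sentence = [("Eric", "NNP"), ("lives", "VBZ"), ("in", "IN"), ("Berlin", "NNP")]
print(ner_features(sentence, 3, ["B-PERSON", "O", "O"]))
```

Training would then amount to passing this function, together with sentences of ((word, pos), iob) pairs, to the tagger; the default classifier there is Naive Bayes.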

Answer 5

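Are there any resources I can use, apart from the NLTK Cookbook and NLP with Python?

You could consider using spaCy to train a custom NER model on your own data. Below is an example from this thread that trains a model on a custom training set to detect a new entity, ANIMAL. The code has been fixed and updated for readability.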

import random
import spacy
from spacy.training import Example

LABEL = 'ANIMAL'
TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("Do they bite?", {'entities': []}),
    ("horses are too tall and they pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("horses pretend to care about your feelings", {'entities': [(0, 6, LABEL)]}),
    ("they pretend to care about your feelings, those horses", {'entities': [(48, 54, LABEL)]}),
    ("horses?", {'entities': [(0, 6, LABEL)]})
]
nlp = spacy.load('en_core_web_sm')  # load existing spaCy model
ner = nlp.get_pipe('ner')
ner.add_label(LABEL)

optimizer = nlp.create_optimizer()

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], drop=0.35, sgd=optimizer, losses=losses)
        print(losses)

# test the trained model
test_text = 'Do you like horses?'
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, " -- ", ent.text)

The output looks like this:

{'ner': 9.60289144264557}
{'ner': 8.875474230820478}
{'ner': 6.370401408220459}
{'ner': 6.687456469517201}
... 
{'ner': 1.3796682589133492e-05}
{'ner': 1.7709562613218738e-05}

Entities in 'Do you like horses?'
ANIMAL  --  horses
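
The `(start, end, label)` tuples in TRAIN_DATA are character offsets into each sentence, and counting them by hand is error-prone. As a small convenience, the sketch below derives them with a regular expression; `find_entities` and `make_example` are hypothetical helper names for this example, not part of spaCy.

```python
import re


def find_entities(text, term, label):
    """Return (start, end, label) character offsets for each whole-word,
    case-insensitive occurrence of `term` in `text`."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end(), label) for m in pattern.finditer(text)]


def make_example(text, term, label):
    """Build one (text, annotations) pair in the TRAIN_DATA layout used above."""
    return (text, {"entities": find_entities(text, term, label)})


print(make_example("they pretend to care about your feelings, those horses",
                   "horses", "ANIMAL"))
```

This only covers entities that can be matched by a fixed term; anything more varied still needs manual or tool-assisted annotation.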

Answer 6

To complement @Thang M. Pham's answer: you need to label your data before training. You can use spacy-annotator for that.

Here is an example taken from another answer: Train Spacy NER on Indian Names

NLTK Named Entity Recognition with custom data