Question

I have training data for a new NER type, taken from the "Training an additional entity type" section of the spaCy documentation.

TRAIN_DATA = [
    ("Horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("Do they bite?", {
        'entities': []
    }),

    ("horses are too tall and they pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("horses pretend to care about your feelings", {
        'entities': [(0, 6, 'ANIMAL')]
    }),

    ("they pretend to care about your feelings, those horses", {
        'entities': [(48, 54, 'ANIMAL')]
    }),

    ("horses?", {
        'entities': [(0, 6, 'ANIMAL')]
    })
]

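I want to train an NER model on this data using the spacy command line application. This requires the data to be in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?

After looking at the documentation for that format, it's not clear to me how to write data in this format by hand. (For example, do I have to partition everything into paragraphs?) There is also a convert command line utility that converts non-spaCy data formats to spaCy's format, but it doesn't accept a spaCy format like the one above as input.

I understand the NER training code examples that use the "simple training style", but I'd like to be able to use the command line utility for training. (Though, as is apparent from my previous spaCy question, I'm unclear on when you're supposed to use that style and when you're supposed to use the command line.)

Can someone show me an example of the above data in spaCy's JSON format, or point me to documentation that explains how to do this conversion?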

Answer 1

spaCy has a built-in function that will get you most of the way there:

from spacy.gold import biluo_tags_from_offsets

It takes the "offset"-style annotations you have there and converts them to the token-by-token BILUO format.
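To make that conversion concrete, here is a minimal sketch of the idea in plain Python. It is not spaCy's implementation: `simple_biluo_tags` is a hypothetical helper, and it tokenizes with a crude regex instead of spaCy's tokenizer, so results may differ on real text. In this sketch, a span that does not line up with token boundaries is simply left untagged.

```python
import re

def simple_biluo_tags(text, entities):
    """Toy stand-in for offset-to-BILUO conversion (hypothetical helper).

    Tokens are runs of word characters or single punctuation marks;
    this is an assumption for illustration, not spaCy's tokenizer.
    """
    # Each token is stored as (start_char, end_char, token_text).
    tokens = [(m.start(), m.end(), m.group())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)

    for start, end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span.
        inside = [i for i, (s, e, _) in enumerate(tokens)
                  if s >= start and e <= end]
        if not inside:
            continue
        # Skip spans that do not start and end exactly on token boundaries.
        if tokens[inside[0]][0] != start or tokens[inside[-1]][1] != end:
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label      # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label      # Begin
            tags[inside[-1]] = "L-" + label     # Last
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # Inside

    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]

print(simple_biluo_tags("horses?", [(0, 6, "ANIMAL")]))
# [('horses', 'U-ANIMAL'), ('?', 'O')]
```

Every entity in the question's data covers a single token, so each one ends up with a U- tag; B-, I- and L- only appear for multi-token entities.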

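To get the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill in the other slots the data requires: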

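import spacy

# Assumption: a blank English pipeline is enough, since only the tokenizer is needed.
nlp = spacy.blank("en")

sentences = []
for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    # One BILUO tag per token, derived from the character offsets.
    tags = biluo_tags_from_offsets(doc, annotations["entities"])
    tokens = []
    for n, (token, tag) in enumerate(zip(doc, tags)):
        tokens.append({
            "head": 0,           # placeholder, no parse is available
            "dep": "",
            "tag": "",
            "orth": token.text,  # token text without trailing whitespace
            "ner": tag,
            "id": n,
        })
    sentences.append(tokens)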
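The loop stops at a list of token lists, so one wrapping step remains. The sketch below shows one way to do it, assuming the document / paragraph / sentence nesting described in the spaCy v2 JSON format documentation; check the key names against the docs for your version. `wrap_for_spacy_json` is a hypothetical helper, and the token dicts are written by hand so the example runs without spaCy.

```python
import json

def wrap_for_spacy_json(texts, sentences):
    """Nest per-text token lists as documents -> paragraphs -> sentences.

    Hypothetical helper; treats every text as one document containing
    one paragraph with one sentence.
    """
    docs = []
    for doc_id, (text, tokens) in enumerate(zip(texts, sentences)):
        docs.append({
            "id": doc_id,
            "paragraphs": [{
                "raw": text,
                "sentences": [{"tokens": tokens, "brackets": []}],
            }],
        })
    return docs

# Hand-written tokens for the shortest example in the question, "horses?".
example_tokens = [
    {"id": 0, "orth": "horses", "tag": "", "head": 0, "dep": "", "ner": "U-ANIMAL"},
    {"id": 1, "orth": "?", "tag": "", "head": 0, "dep": "", "ner": "O"},
]

training_json = wrap_for_spacy_json(["horses?"], [example_tokens])
print(json.dumps(training_json, indent=2))
```

With the real data you would pass the texts from TRAIN_DATA together with the sentences list built above, then write the result to a file with json.dump.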

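Make sure you disable the non-NER pipeline components before training on this data. I've run into some issues using spacy train on NER-only data. For possible workarounds, see #1907 and this discussion on the Prodigy forum.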
