This is the classic training format.
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
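For readers unfamiliar with this format: each entity is a (start, end, label) triple of character offsets into the text, with an exclusive end, like a Python slice. A minimal sketch to check that the offsets select the intended text (the helper name `entity_texts` is made up for illustration and is not part of spaCy):

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def entity_texts(train_data):
    """Return (label, surface text) pairs for every annotated span."""
    found = []
    for text, annot in train_data:
        for start, end, label in annot["entities"]:
            # Offsets are character positions; `end` is exclusive, like a slice.
            found.append((label, text[start:end]))
    return found


print(entity_texts(TRAIN_DATA))
```

Printing the selected text this way is a cheap sanity check before attempting any conversion.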
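I used to train from code, but as I understand it, training with the CLI is the better approach. My data, however, is in the format above.
I have found code snippets for this kind of conversion, but every one of them calls spacy.load('en') instead of starting from a blank model. That makes me wonder: are they training an existing model rather than a blank one?
This block looks simple enough: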
import spacy
from spacy.gold import docs_to_json
import srsly

nlp = spacy.load('en', disable=["ner"])  # as you see it's loading 'en' which I don't have
# TRAIN_DATA is the list shown above
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
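Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I am confused about how to use this with spacy train on a blank model. Do I just use spacy.blank('en')? And what about the disable=["ner"] flag?
Edit:
If I try spacy.blank('en') instead, I get: Can't import language goal from spacy.lang: No module named 'spacy.lang.en'
Edit 2:
I tried loading en_core_web_sm: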
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
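TypeError: object of type 'NoneType' has no len()
Ailton - print(text[start:end])
Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - header from the centre of the box to the centre of the goal. Assist - Ailton - print(text)
None - the doc.ents = ... line, which raises TypeError: object of type 'NoneType' has no len()
Edit 3: From Ines' comment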
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)
srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])
This creates the JSON, but I don't see any of my tagged entities in the generated file.
Answer 0: (score: 2)
import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # Convert character offsets to per-token BILUO tags, then to Span objects
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    entities = biluo_tags_to_spans(doc, tags)
    # Attach the entities to the doc so they end up in the JSON
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
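The code above works as of spaCy v3.1. Some of the relevant functions from spacy.gold were renamed and moved to spacy.training.
Answer 1: (score: 1)
Your Edit 3 is close, but you are missing the step that adds the entities to the doc. This should work: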
import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    entities = spans_from_biluo_tags(doc, tags)
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
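To make the intermediate step less opaque, here is a rough pure-Python sketch of what a BILUO conversion does conceptually: B = begin, I = inside, L = last, U = single-token unit, O = outside. It is not spaCy's implementation. The regex tokenizer and the names `simple_tokens` and `biluo_tags` are stand-ins made up for illustration, and spans that do not line up with token boundaries are simply left untagged here.

```python
import re


def simple_tokens(text):
    """Hypothetical tokenizer: words and single punctuation marks, as (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_tags(text, entities):
    """Assign one BILUO tag per token from (start, end, label) character offsets."""
    tokens = simple_tokens(text)
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        # Indices of the tokens that lie fully inside the entity span.
        inside = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
        # The span has to begin and end exactly on token boundaries.
        if not inside or tokens[inside[0]][0] != start or tokens[inside[-1]][1] != end:
            continue  # misaligned span: left untagged in this sketch
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label
        else:
            tags[inside[0]] = "B-" + label
            for i in inside[1:-1]:
                tags[i] = "I-" + label
            tags[inside[-1]] = "L-" + label
    return tags


print(biluo_tags("Who is Shaka Khan?", [(7, 17, "PERSON")]))
print(biluo_tags("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]))
```

The tags are only a per-token description of the entities. Nothing is stored on the doc until the spans built from them are assigned to doc.ents, which is why the JSON from Edit 3 came out without entities.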
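It would be good to have a built-in function for this conversion, since people often want to move from the example scripts (which are only meant as simple demos) to the train CLI.
Edit:
You can also skip the detour through the built-in BILUO converters and use what you had above: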
doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
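One caveat with this shortcut: doc.char_span returns None when the character offsets do not line up with token boundaries, and a None in the list assigned to doc.ents is the likely cause of the TypeError in Edit 2. A minimal sketch for spotting such offsets before converting, assuming a simple regex tokenizer as a stand-in for spaCy's (so results can differ from the real tokenizer); the helper name `misaligned_entities` and the PLAYER label are made up for illustration:

```python
import re


def misaligned_entities(text, entities):
    """Return the entities whose offsets do not start and end on a token boundary."""
    starts, ends = set(), set()
    # Hypothetical stand-in tokenizer: words and single punctuation marks.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(m.start())
        ends.add(m.end())
    return [(s, e, label) for s, e, label in entities if s not in starts or e not in ends]


text = "Assist - Ailton -"
entities = [
    (9, 15, "PLAYER"),  # exactly "Ailton"
    (9, 16, "PLAYER"),  # "Ailton " with a trailing space
]
print(misaligned_entities(text, entities))
```

Offsets that include a stray leading or trailing space are a common source of this kind of misalignment, so trimming the annotations or dropping the None spans before assigning doc.ents is worth trying.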