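I am trying to retrain the existing POS tagger so that it assigns the correct tags to some misclassified words, using the code below. But it gives me this error:
Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (0, 0))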
import spacy
from spacy.vocab import Vocab
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load('en_core_web_sm')
optimizer = nlp.begin_training()
vocab = Vocab(tag_map={})
doc = Doc(vocab, words=[word for word in ['ThermostatFailedOpen','ThermostatFailedClose','BlahDeBlah']])
gold = GoldParse(doc, tags=['NNP']*3)
nlp.update([doc], [gold], drop=0, sgd=optimizer)
Also, when I check whether the words are now classified correctly, using the code below:
doc = nlp('If ThermostatFailedOpen moves from false to true, we are going to party')
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
ThermostatFailedOpen ThermostatFailedopen VERB VB nsubj XxxxxXxxxxXxxx True False
The words are still not classified correctly (as expected, I guess)! Any insights on how to fix this?
Answer 0 (score: 0)
#!/usr/bin/env python
# coding: utf8
import random
from pathlib import Path
import spacy
# You need to define a mapping from your data's part-of-speech tag names to the
# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags.
# See here for the Universal Tag Set:
# http://universaldependencies.github.io/docs/u/pos/index.html
# You may also specify morphological features for your tags, from the universal
# scheme.
TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'}
}
# Usually you'll read this in, of course. Data formats vary. Ensure your
# strings are unicode and that the number of tags assigned matches spaCy's
# tokenization. If not, you can always add a 'words' key to the annotations
# that specifies the gold-standard tokenization, e.g.:
# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']})
TRAIN_DATA = [
    ("ThermostatFailedOpen", {'tags': ['V']}),
    ("ThermostatFailedClose", {'tags': ['V']})
]
def main(lang='en', output_dir=None, n_iter=25):
    """Create a new model, set up the pipeline and train the tagger. In order to
    train the tagger with a custom tag map, we're creating a new Language
    instance with a custom vocab.
    """
    nlp = spacy.blank(lang)
    # add the tagger to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    tagger = nlp.create_pipe('tagger')
    # Add the tags. This needs to be done before you start training.
    for tag, values in TAG_MAP.items():
        tagger.add_label(tag, values)
    nlp.add_pipe(tagger)
    nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(losses)
    # test the trained model
    test_text = "If ThermostatFailedOpen moves from false to true, we are going to party"
    doc = nlp(test_text)
    print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print('Tags', [(t.text, t.tag_, t.pos_) for t in doc])
if __name__ == '__main__':
    main('en', 'customPOS')
Note: if you try to add new labels to the tagger of a pre-trained model, you will get the following error:
  File "pipeline.pyx", line 550, in spacy.pipeline.Tagger.add_label
ValueError: [T003] Resizing pre-trained Tagger models is not currently supported.
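Initially I tried the following (the add_label call on the pre-trained tagger is what raises the error above):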
nlp = spacy.load('en_core_web_sm')
tagger = nlp.get_pipe('tagger')
# Add the tags. This needs to be done before you start training.
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'tagger']
with nlp.disable_pipes(*other_pipes): # only train TAGGER
    nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, losses=losses)
        print(losses)
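The comments in the training script say that the number of tags must match the tokenization and that every tag must be defined in TAG_MAP. A minimal sketch of how that could be checked before training is shown below. The helper name find_annotation_problems is hypothetical and not part of spaCy; it is plain Python and only compares lengths when an example carries an explicit 'words' key, because the real token count otherwise depends on spaCy's tokenizer.

```python
# Hypothetical pre-training check; this is NOT a spaCy API.
# It assumes training examples shaped like (text, {'tags': [...], 'words': [...]}),
# the same shape used by TRAIN_DATA in the script above.

TAG_MAP = {
    'N': {'pos': 'NOUN'},
    'V': {'pos': 'VERB'},
    'J': {'pos': 'ADJ'},
}


def find_annotation_problems(train_data, tag_map):
    """Return a list of human-readable problems found in train_data."""
    problems = []
    for index, (text, annotations) in enumerate(train_data):
        tags = annotations.get('tags', [])
        words = annotations.get('words')
        # Length can only be verified here when gold-standard words are given.
        if words is not None and len(words) != len(tags):
            problems.append(
                "example %d (%r): %d words but %d tags"
                % (index, text, len(words), len(tags))
            )
        # Every tag must have been declared in the tag map.
        for tag in tags:
            if tag not in tag_map:
                problems.append(
                    "example %d (%r): tag %r is not in TAG_MAP"
                    % (index, text, tag)
                )
    return problems


if __name__ == '__main__':
    sample = [
        ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']}),
        ("ThermostatFailedOpen", {'tags': ['V']}),
    ]
    for problem in find_annotation_problems(sample, TAG_MAP):
        print(problem)
```

Running a check like this before the training loop makes a mismatched example easier to locate than an error raised from inside nlp.update.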
Answer 1 (score: 0)
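If you are using the same tags and only need the tagger to be trained better on them, there is no need to add new tags. However, if you are using a different tag set, you need to train a new model.
For the first case, call get_pipe('tagger'), skip the add_label loop, and continue with training.
For the second case, you need to create a new tagger, train it, and then add it to the pipeline. To do this, you also need to disable the tagger when loading the model (because you will be training a new one). I have also answered this here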
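The choice between the two cases comes down to whether every tag in your training data already exists in the pre-trained tagger. The sketch below expresses that rule in plain Python. The function name choose_tagger_strategy and the returned strings are hypothetical, and the existing label set is passed in as an ordinary list; reading it from a loaded model is left out on purpose.

```python
# Hypothetical decision helper; this is NOT a spaCy API.
# existing_labels: tags the pre-trained tagger already knows (assumed input).
# training_tags:   all tags that occur in your training annotations.

def choose_tagger_strategy(existing_labels, training_tags):
    """Return (strategy, unknown_tags) for the given tag sets."""
    unknown = sorted(set(training_tags) - set(existing_labels))
    if unknown:
        # Different tag set: a new tagger has to be created and trained.
        return 'train_new_tagger', unknown
    # Same tag set: keep the existing tagger and skip the add_label loop.
    return 'update_existing_tagger', []


if __name__ == '__main__':
    print(choose_tagger_strategy(['NNP', 'VB', 'NN'], ['NNP', 'NNP', 'NNP']))
    print(choose_tagger_strategy(['NNP', 'VB', 'NN'], ['V', 'N', 'V']))
```

With the question's data (three 'NNP' tags, which the English model already uses) this points to the first case, while the custom 'N', 'V', 'J' tag map from the other answer points to the second.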