Hugging Face: saving a tokenizer

Time: 2020-10-27 08:20:04

Tags: huggingface-transformers huggingface-tokenizers

I am trying to save a Hugging Face tokenizer so that I can later load it from a container that has no internet access.

from transformers import AutoTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
# The next line is the one that raises the error shown below.
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

However, the last line gives the error:

OSError: Can't load config for './models/tokenizer3/'. Make sure that:

- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer3/' is the correct path to a directory containing a config.json file

transformers version: 3.1.0

The question "How to load the saved tokenizer from pretrained model in Pytorch" did not help.

Edit 1

Following @ashwin's answer below, I switched to save_pretrained, but I get the following error:

OSError: Can't load config for './models/tokenizer/'. Make sure that:

- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'

- or './models/tokenizer/' is the correct path to a directory containing a config.json file

The contents of the tokenizer folder were shown in a screenshot (image not preserved).

I tried renaming tokenizer_config.json to config.json, and then got the error:

ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder

3 Answers:

Answer 0 (score: 1)

save_vocabulary() saves only the tokenizer's vocabulary file (the list of tokens, which for DistilBERT is a WordPiece vocab.txt).

To save the entire tokenizer, use save_pretrained().

So, as follows:

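from transformers import AutoTokenizer, DistilBertTokenizer

BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
# save_pretrained writes the vocabulary plus the tokenizer's own config files.
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")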

Edit:

For some unknown reason, instead of

tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")

using

tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")

works.

Answer 1 (score: 1)

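You need to save the model and the tokenizer in the same directory. HuggingFace is actually looking for your model's config.json file, so renaming tokenizer_config.json will not solve the problem.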
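
A minimal sketch of that idea, assuming that only the model's config.json (not the full weights) has to sit next to the tokenizer files. The helper name save_tokenizer_with_config is hypothetical; the calls it makes (AutoConfig.from_pretrained, AutoTokenizer.from_pretrained, save_pretrained) are regular transformers APIs. Calling model.save_pretrained() on the same directory would also write config.json, along with the weights.

```python
import os


def save_tokenizer_with_config(base_model, save_dir):
    """Save tokenizer files and the model's config.json into one directory.

    Hypothetical helper, not part of transformers. It needs network access
    (or a populated cache) when called, because it fetches `base_model`.
    """
    # Imported lazily so that defining the helper does not require the library.
    from transformers import AutoConfig, AutoTokenizer

    os.makedirs(save_dir, exist_ok=True)

    # config.json records model_type, which AutoTokenizer uses to pick a class.
    AutoConfig.from_pretrained(base_model).save_pretrained(save_dir)

    # Writes the vocabulary together with the tokenizer's own config files.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(save_dir)

    # Reload from the local directory; this is the call that failed in the question.
    return AutoTokenizer.from_pretrained(save_dir)
```

One way to use it would be to call save_tokenizer_with_config("distilbert-base-multilingual-cased", "./models/tokenizer/") once on a machine with internet access, then copy the resulting directory into the container.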

Answer 2 (score: 0)

Renaming the "tokenizer_config.json" file (the one created by the save_pretrained() function) to "config.json" solved the same problem in my environment.
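
Whether the rename helps may depend on the directory name. The ValueError quoted in the question says that config.json needs a `model_type` key, or the path must contain a known model name. The sketch below only mirrors those two rules with the standard library. The function name diagnose_tokenizer_dir and its return codes are made up for illustration, and the name list is copied from the 3.1.0 error message, so other versions may behave differently.

```python
import json
import os

# Name fragments copied from the ValueError in the question (transformers 3.1.0).
KNOWN_NAME_FRAGMENTS = (
    "retribert", "t5", "mobilebert", "distilbert", "albert", "camembert",
    "xlm-roberta", "pegasus", "marian", "mbart", "bart", "reformer",
    "longformer", "roberta", "flaubert", "bert", "openai-gpt", "gpt2",
    "transfo-xl", "xlnet", "xlm", "ctrl", "electra", "encoder-decoder",
)


def diagnose_tokenizer_dir(path):
    """Guess how loading `path` would go, based only on the quoted error text."""
    config_path = os.path.join(path, "config.json")
    if not os.path.isfile(config_path):
        # Matches the OSError: no config.json in the directory.
        return "missing_config"
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    if "model_type" in config:
        return "ok_model_type"
    # Fallback described by the error message: a known name inside the path.
    if any(fragment in path for fragment in KNOWN_NAME_FRAGMENTS):
        return "ok_name"
    # Matches the ValueError: config.json exists but nothing identifies the model.
    return "unrecognized"
```

Under these rules, "./models/tokenizer/" with a renamed tokenizer_config.json would come back as "unrecognized", while a directory such as "./models/distilbert-tokenizer/" (a hypothetical name) would come back as "ok_name".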