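I am following a tutorial to train DialoGPT on my own dataset.
When I follow the tutorial exactly, using the dataset it provides, I have no problems. I then replaced the example dataset with my own. The only difference between the example and my code is that my dataset is 256397 rows long, compared with 1906 rows in the tutorial.
I am not sure whether the error is related to the column labels in my dataset, to a text value in one particular row, or to the size of the data.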
06/12/2020 09:23:08 - WARNING - __main__ - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:10 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:11 - INFO - transformers.configuration_utils - Model config GPT2Config {
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.1,
"bos_token_id": 50256,
"embd_pdrop": 0.1,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 12,
"n_positions": 1024,
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"vocab_size": 50257
}
06/12/2020 09:23:11 - INFO - transformers.tokenization_utils - Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json from cache at cached/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt from cache at cached/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/added_tokens.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/special_tokens_map.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils - loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/tokenizer_config.json from cache at None
06/12/2020 09:23:19 - INFO - filelock - Lock 140392381680496 acquired on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:19 - INFO - transformers.file_utils - https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin not found in cache or force_download set to True, downloading to /content/drive/My Drive/Colab Notebooks/cached/tmpj1dveq14
Downloading: 100%
351M/351M [00:34<00:00, 10.2MB/s]
06/12/2020 09:23:32 - INFO - transformers.file_utils - storing https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin in cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:32 - INFO - transformers.file_utils - creating metadata file for cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:33 - INFO - filelock - Lock 140392381680496 released on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:33 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin from cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:39 - INFO - transformers.modeling_utils - Weights of GPT2LMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias']
06/12/2020 09:23:54 - INFO - __main__ - Training/evaluation parameters <__main__.Args object at 0x7fafa60a00f0>
06/12/2020 09:23:54 - INFO - __main__ - Creating features from dataset file at cached
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-523c0d2a27d3> in <module>()
----> 1 main(trn_df, val_df)
7 frames
<ipython-input-11-d6dfa312b1f5> in main(df_trn, df_val)
59 # Training
60 if args.do_train:
---> 61 train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)
62
63 global_step, tr_loss = train(args, train_dataset, model, tokenizer)
<ipython-input-9-3c4f1599e14e> in load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate)
40
41 def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
---> 42 return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)
43
44 def set_seed(args):
<ipython-input-9-3c4f1599e14e> in __init__(self, tokenizer, args, df, block_size)
24 self.examples = []
25 for _, row in df.iterrows():
---> 26 conv = construct_conv(row, tokenizer)
27 self.examples.append(conv)
28
<ipython-input-9-3c4f1599e14e> in construct_conv(row, tokenizer, eos)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
<ipython-input-9-3c4f1599e14e> in <listcomp>(.0)
1 def construct_conv(row, tokenizer, eos = True):
2 flatten = lambda l: [item for sublist in l for item in sublist]
----> 3 conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
4 conv = flatten(conv)
5 return conv
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, **kwargs)
1432 pad_to_max_length=pad_to_max_length,
1433 return_tensors=return_tensors,
-> 1434 **kwargs,
1435 )
1436
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1574 )
1575
-> 1576 first_ids = get_input_ids(text)
1577 second_ids = get_input_ids(text_pair) if text_pair is not None else None
1578
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
1554 else:
1555 raise ValueError(
-> 1556 "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
1557 )
1558
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
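
The check in `get_input_ids` shown in the traceback rejects any input that is not a string or a list/tuple of strings or integers. One plausible cause, then, is a cell in the dataset that is not a string, for example an empty cell that pandas loads as a float NaN. This is a guess; the log does not confirm it. Below is a minimal sketch for locating such cells. `find_non_string_cells` and `sample_rows` are hypothetical names used only for illustration, and the toy rows stand in for the real dataset:

```python
def find_non_string_cells(rows):
    """Return (row_index, column_index, value) for every cell that is not a str."""
    bad = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # tokenizer.encode(x) in construct_conv is called on each cell as-is,
            # so any non-str cell (NaN, int, None, ...) is a candidate culprit.
            if not isinstance(value, str):
                bad.append((i, j, value))
    return bad


# Toy stand-in for the real data (an assumption, not the actual dataset):
# one empty cell represented as NaN and one numeric cell.
sample_rows = [
    ["hi there", "hello"],
    ["how are you?", float("nan")],
    [42, "fine"],
]

bad_cells = find_non_string_cells(sample_rows)
print(len(bad_cells), "non-string cells found")
for row_idx, col_idx, value in bad_cells:
    print(row_idx, col_idx, repr(value))
```

With the DataFrames from the traceback, the same helper could be called as `find_non_string_cells(trn_df.values.tolist())`. If it reports any cells, dropping those rows with `trn_df.dropna()` is probably safer than casting with `astype(str)`, since the cast would turn NaN into the literal text "nan" and feed that to the model.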