How should I format my dataset to avoid this? "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers"

Time: 2020-06-12 09:42:35

Tags: huggingface-transformers

Following the tutorial, I am training DialoGPT on my own dataset.

When I follow the tutorial exactly with the dataset it provides, I have no problems. I then replaced the example dataset with my own. The only difference between the example and my code is that my dataset is 256397 rows long, compared to the tutorial's 1906 rows.

I am not sure whether the error is related to the column labels in my dataset, to a text value in a particular row, or to the size of the data.
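
For reference, a quick sanity check (my own sketch, not from the tutorial) is to look for non-string cells before training: tokenizer.encode() only accepts a string or a list/tuple of strings/integers, so a NaN from an empty CSV field (a float) in any row triggers exactly this error. Here, trn_df is assumed to be the DataFrame passed to main(trn_df, val_df) in the traceback below.

# Assumption: trn_df is the pandas DataFrame handed to main(trn_df, val_df).
# tokenizer.encode() accepts only str or list/tuple of str/int, so any NaN
# (a float) or other non-string cell raises this exact ValueError.
non_string = trn_df.applymap(lambda x: not isinstance(x, str))
print("rows with non-string cells:", trn_df.index[non_string.any(axis=1)].tolist()[:20])

# One possible cleanup: drop incomplete rows and cast the rest to str.
trn_df = trn_df.dropna().astype(str)

Full log and traceback: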

06/12/2020 09:23:08 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, 16-bits training: False
06/12/2020 09:23:10 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:10 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 50257
}

06/12/2020 09:23:11 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/config.json from cache at cached/c3a09526c725b854c685b72cf60c50f1fea9b0e4d6227fa41573425ef4bd4bc6.4c1d7fc2ac6ddabeaf0c8bec2ffc7dc112f668f5871a06efcff113d2797ec7d5
06/12/2020 09:23:11 - INFO - transformers.configuration_utils -   Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "vocab_size": 50257
}

06/12/2020 09:23:11 - INFO - transformers.tokenization_utils -   Model name 'microsoft/DialoGPT-small' not found in model shortcut name list (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2). Assuming 'microsoft/DialoGPT-small' is a path, a model identifier, or url to a directory containing tokenizer files.
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/vocab.json from cache at cached/78725a31b87003f46d5bffc3157ebd6993290e4cfb7002b5f0e52bb0f0d9c2dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/merges.txt from cache at cached/570e31eddfc57062e4d0c5b078d44f97c0e5ac48f83a2958142849b59df6bbe6.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/added_tokens.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/special_tokens_map.json from cache at None
06/12/2020 09:23:15 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/DialoGPT-small/tokenizer_config.json from cache at None
06/12/2020 09:23:19 - INFO - filelock -   Lock 140392381680496 acquired on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:19 - INFO - transformers.file_utils -   https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin not found in cache or force_download set to True, downloading to /content/drive/My Drive/Colab Notebooks/cached/tmpj1dveq14
Downloading: 100%
351M/351M [00:34<00:00, 10.2MB/s]
06/12/2020 09:23:32 - INFO - transformers.file_utils -   storing https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin in cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:32 - INFO - transformers.file_utils -   creating metadata file for cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483

06/12/2020 09:23:33 - INFO - filelock -   Lock 140392381680496 released on cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483.lock
06/12/2020 09:23:33 - INFO - transformers.modeling_utils -   loading weights file https://cdn.huggingface.co/microsoft/DialoGPT-small/pytorch_model.bin from cache at cached/9eab12d0b721ee394e9fe577f35d9b8b22de89e1d4f6a89b8a76d6e1a82bceae.906a78bee3add2ff536ac7ef16753bb3afb3a1cf8c26470f335b7c0e46a21483
06/12/2020 09:23:39 - INFO - transformers.modeling_utils -   Weights of GPT2LMHeadModel not initialized from pretrained model: ['transformer.h.0.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.11.attn.masked_bias']
06/12/2020 09:23:54 - INFO - __main__ -   Training/evaluation parameters <__main__.Args object at 0x7fafa60a00f0>
06/12/2020 09:23:54 - INFO - __main__ -   Creating features from dataset file at cached
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-523c0d2a27d3> in <module>()
----> 1 main(trn_df, val_df)

7 frames
<ipython-input-11-d6dfa312b1f5> in main(df_trn, df_val)
     59    # Training
     60    if args.do_train:
---> 61        train_dataset = load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False)
     62 
     63        global_step, tr_loss = train(args, train_dataset, model, tokenizer)

<ipython-input-9-3c4f1599e14e> in load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate)
     40 
     41 def load_and_cache_examples(args, tokenizer, df_trn, df_val, evaluate=False):
---> 42     return ConversationDataset(tokenizer, args, df_val if evaluate else df_trn)
     43 
     44 def set_seed(args):

<ipython-input-9-3c4f1599e14e> in __init__(self, tokenizer, args, df, block_size)
     24             self.examples = []
     25             for _, row in df.iterrows():
---> 26                 conv = construct_conv(row, tokenizer)
     27                 self.examples.append(conv)
     28 

<ipython-input-9-3c4f1599e14e> in construct_conv(row, tokenizer, eos)
      1 def construct_conv(row, tokenizer, eos = True):
      2     flatten = lambda l: [item for sublist in l for item in sublist]
----> 3     conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
      4     conv = flatten(conv)
      5     return conv

<ipython-input-9-3c4f1599e14e> in <listcomp>(.0)
      1 def construct_conv(row, tokenizer, eos = True):
      2     flatten = lambda l: [item for sublist in l for item in sublist]
----> 3     conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
      4     conv = flatten(conv)
      5     return conv

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, **kwargs)
   1432             pad_to_max_length=pad_to_max_length,
   1433             return_tensors=return_tensors,
-> 1434             **kwargs,
   1435         )
   1436 

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
   1574             )
   1575 
-> 1576         first_ids = get_input_ids(text)
   1577         second_ids = get_input_ids(text_pair) if text_pair is not None else None
   1578 

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
   1554             else:
   1555                 raise ValueError(
-> 1556                     "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
   1557                 )
   1558 

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
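
From the frames above, the failure happens inside construct_conv, which calls tokenizer.encode(x) on every cell of a DataFrame row. A defensive variant (my own sketch, not the tutorial's code) that encodes only string cells would look like this:

def construct_conv(row, tokenizer, eos=True):
    # Same flattening logic as the notebook's construct_conv, but non-string
    # cells (e.g. NaN floats from empty fields) are skipped instead of being
    # passed to tokenizer.encode(), which is what raises the ValueError above.
    flatten = lambda l: [item for sublist in l for item in sublist]
    turns = [x for x in row if isinstance(x, str)]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in turns]))
    return flatten(conv)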

0 Answers