Incremental training || Pausing and resuming training, GPT-2 language modeling

Time: 2021-01-30 23:33:47

Tags: python tensorflow tensorflow2.0 huggingface-transformers gpt-2

I am currently trying to learn Python, and at the same time machine learning with GPT-2 language modeling. I ran into a number of problems, solved most of them, and finally got some decent runs.


But... as most of you probably know, training a model takes a lot of CPU/GPU power and time. I can spare the time, but the problem is that I can't let it run non-stop on my home computer (yes, I know I could rent a GPU @ Google), because I'd like to be able to do other things while my model is training.

So I have the following questions:

  • Can I somehow stop and restart my model training? I have read a bit about checkpoints, but there is so much information on the topic that I haven't been able to figure it out. (A rough sketch of what I have in mind follows right after this list.)
  • Can I feed my model incrementally, e.g. 10% of my dataset, let it finish, and then feed it another 10% next week, and so on? If so, how?
  • Bonus question... is it better to aim for more epochs with a smaller dataset, or a bigger dataset and more epochs? What is a good number of epochs?
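
To make the first two questions concrete, here is a minimal sketch of the kind of pause/resume flow I have in mind, assuming tf.keras.callbacks.ModelCheckpoint and the Hugging Face save_pretrained / from_pretrained methods behave as documented (model, dataset, optimizer etc. refer to the training code further down; the paths are placeholders):

import os
import tensorflow as tf
from transformers import TFGPT2LMHeadModel

# save the weights at the end of every epoch so a run can be interrupted
os.makedirs('./training_checkpoints/', exist_ok=True)
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='./training_checkpoints/ckpt-{epoch:02d}.h5',
    save_weights_only=True)

# first session (model/dataset/optimizer built as in the code below):
# model.fit(dataset, epochs=5, callbacks=[ckpt_callback])
# model.save_pretrained('./model_bn_custom/')

# later session: rebuild from the saved directory, recompile and keep training;
# note that from_pretrained does NOT restore the Adam optimizer state.
# model = TFGPT2LMHeadModel.from_pretrained('./model_bn_custom/')
# model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
# model.fit(dataset, epochs=5, callbacks=[ckpt_callback])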

Packages:

  • Python 3.7.9
  • tensorflow-gpu 2.3.0
  • tensorflow-estimator 2.3.0
  • transformers 4.2.2
  • tokenizers 0.9.4
  • cudatoolkit 10.1

Code - Tokenizer

from pathlib import Path
import os

from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer


class BPE_token(object):
    def __init__(self):
        # byte-level BPE tokenizer with NFKC normalization (GPT-2 style)
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([
            NFKC()
        ])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        trainer = BpeTrainer(vocab_size=50000, show_progress=True,
                             initial_alphabet=ByteLevel.alphabet(),
                             special_tokens=[
                                 "<s>",
                                 "<pad>",
                                 "</s>",
                                 "<unk>",
                                 "<mask>"
                             ])
        # in tokenizers 0.9.x the trainer is passed first: train(trainer, files)
        self.tokenizer.train(trainer, paths)

    def save_tokenizer(self, location, prefix=None):
        if not os.path.exists(location):
            os.makedirs(location)
        # writes vocab.json and merges.txt into `location`
        self.tokenizer.model.save(location, prefix)


# ////////// TOKENIZE DATA ////////////
# the folder 'da_corpus' contains all the text files
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]

# train the tokenizer model
tokenizer = BPE_token()
tokenizer.bpe_train(paths)

# saving the tokenizer files in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
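
A quick sanity check of the saved files could look like this (just a sketch; I am assuming GPT2Tokenizer.from_pretrained can read the vocab.json / merges.txt written by save_tokenizer, which is also how the trainer code below loads them; the Danish sentence is just sample text):

from transformers import GPT2Tokenizer

# load the freshly saved BPE files and round-trip a sample sentence
check_tokenizer = GPT2Tokenizer.from_pretrained('tokenized_data')
ids = check_tokenizer.encode("Dette er en lille test.")
print(ids)
print(check_tokenizer.decode(ids))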

Code -- Model trainer

import os
import tensorflow as tf
from pathlib import Path
from transformers import (CONFIG_NAME, WEIGHTS_NAME, GPT2Config,
                          GPT2Tokenizer, TFGPT2LMHeadModel)

save_path = 'tokenized_data'
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]
# tokenizer = Tokenizer.from_file("./tokenized_data/tokenizer-wiki.json")
tokenizer.add_special_tokens({
  "eos_token": "</s>",
  "bos_token": "<s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
})

# creating the configuration from which the model can be made
config = GPT2Config(
  vocab_size=tokenizer.vocab_size,
  bos_token_id=tokenizer.bos_token_id,
  eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)

single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)
# print(string_tokenized)



examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 2000

# slice the token stream into fixed-size blocks
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

# shift by one token: the model predicts the next token at every position
inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model (the extra None losses skip the model's secondary outputs)
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
num_epoch = 20
history = model.fit(dataset, epochs=num_epoch)


output_dir = './model_bn_custom/'

if not os.path.exists(output_dir):
    os.mkdir(output_dir)


model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)

# save tokenizer
tokenizer.save_pretrained(output_dir)
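
And to make the incremental idea from question 2 concrete, here is a rough sketch of how I imagine a later session could pick up from the directory saved above and continue on the next slice of the corpus (a sketch under my assumptions, not something I have verified; the folder ./da_corpus_part2/ is a hypothetical placeholder for the next 10% of the data):

import os
import tensorflow as tf
from pathlib import Path
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

output_dir = './model_bn_custom/'

# reload the tokenizer and the previously trained weights
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)

# build a dataset from the *next* slice of the corpus (placeholder path)
new_paths = [str(x) for x in Path("./da_corpus_part2/").glob("**/*.txt")]
single_string = ''
for filename in new_paths:
    with open(filename, "r", encoding='utf-8') as f:
        single_string += f.read() + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)

block_size = 100
examples = [string_tokenized[i:i + block_size]
            for i in range(0, len(string_tokenized) - block_size + 1, block_size)]
inputs = [ex[:-1] for ex in examples]
labels = [ex[1:] for ex in examples]
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(2000).batch(12, drop_remainder=True)

# recompile (the optimizer state from the first run is not restored) and continue
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
model.fit(dataset, epochs=5)

# save again so the next increment can start from here
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)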

0 Answers:

No answers