微调方法

Question

我正在研究一个TextClassification问题，为此我试图在我的facesface-transformers库中给出的TFBertForSequenceClassification上训练我的模型。

我遵循了他们在github页上给出的示例，我能够使用tensorflow_datasets.load('glue/mrpc')对给定的示例数据运行示例代码。但是，我找不到如何加载自己的自定义数据并将其传递给我的示例。 model.fit(train_dataset, epochs=2, steps_per_epoch=115, validation_data=valid_dataset, validation_steps=7)。

如何定义自己的X，对X进行标记化，并使用X和Y准备train_dataset。其中X表示输入文本，Y表示给定X的分类类别。

样本培训数据框：

    text    category_index
0   Assorted Print Joggers - Pack of 2 ,/ Gray Pri...   0
1   "Buckle" ( Matt ) for 35 mm Width Belt  0
2   (Gagam 07) Barcelona Football Jersey Home 17 1...   2
3   (Pack of 3 Pair) Flocklined Reusable Rubber Ha...   1
4   (Summer special Offer)Firststep new born baby ...   0

Answer 1

微调方法

有多种方法可以为目标任务微调 BERT。

进一步预训练基础 BERT 模型
在可训练的基础 BERT 模型之上的自定义分类层
位于基础 BERT 模型之上的自定义分类层不可训练（冻结）

请注意，BERT 基础模型仅针对原始论文中的两个任务进行了预训练。

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

<块引用>

3.1 预训练 BERT ...我们使用两个无监督任务预训练 BERT

任务 1：蒙面 LM
任务 2：下一句预测 (NSP)

因此，基本的 BERT 模型就像半生不熟的，它可以完全针对目标域（第一种方式）。我们可以将其作为自定义模型训练的一部分，使用基础可训练（第 2 位）或不可训练（第 3 位）。

第一种方法

How to Fine-Tune BERT for Text Classification? 展示了进一步预训练的第一种方法，并指出学习率是避免灾难性遗忘的关键，即在学习新知识的过程中预训练的知识被擦除知识。

<块引用>

我们发现较低的学习率，例如 2e-5，是使 BERT 克服灾难性遗忘问题所必需的。在 4e-4 的积极学习率下，训练集无法收敛。

可能这就是 BERT paper 使用 5e-5、4e-5、3e-5 和 2e-5 进行微调的原因。

<块引用>

我们使用 32 的批量大小并对所有 GLUE 任务的数据进行 3 个时期的微调。对于每个任务，我们在开发集上选择了最佳的微调学习率（在 5e-5、4e-5、3e-5 和 2e-5 中）

请注意，基础模型预训练本身使用了更高的学习率。

bert-base-uncased - pretraining

<块引用>

该模型在 Pod 配置的 4 个云 TPU（总共 16 个 TPU 芯片）上训练了 100 万步，批量大小为 256。序列长度限制为 90% 的步骤的 128 个标记，其余的 512 个标记10%。使用的优化器是 Adam，学习率为 1e-4，β1=0.9 和 β2=0.999，权重衰减为 0.01，10,000 步的学习率预热和线性之后的学习率衰减。

将描述第一种方法作为下面第三种方法的一部分。

仅供参考： TFDistilBertModel 是名称为 distilbert 的基本模型。

Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0

第二种方法

Huggingface 采用第二种方法，如 Fine-tuning with native PyTorch/TensorFlow 中的 TFDistilBertForSequenceClassification 在可训练的基础 classifier 模型之上添加了自定义分类层 distilbert。小学习率要求也将适用以避免灾难性遗忘。

from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)

Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_59 (Dropout)         multiple                  0         
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010  <--- All parameters are trainable
Non-trainable params: 0

第二种方法的实现

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
)


DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3


# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),
    validation_label
)).batch(BATCH_SIZE).prefetch(1)

# --------------------------------------------------------------------------------
# training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,
    y=None,
    validation_data=validation_dataset,
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
)

第三种方法

基础

请注意，图片取自 A Visual Guide to Using BERT for the First Time 并进行了修改。

分词器

Tokenizer 生成 BatchEncoding 的实例，它可以像 Python 字典和 BERT 模型的输入一样使用。

BatchEncoding

<块引用>

保存 encode_plus() 和 batch_encode() 方法（令牌、attention_masks 等）的输出。
此类派生自 Python 字典，可用作字典。此外，该类公开了从单词/字符空间映射到标记空间的实用方法。

参数

data (dict) – 由 encode/batch_encode 方法（‘input_ids’、‘attention_mask’等）返回的列表/数组/张量字典。

类的 data 属性是生成的令牌，其中包含 input_ids 和 attention_mask 元素。

input_ids

input_ids

<块引用>

输入 id 通常是作为输入传递给模型的唯一必需参数。它们是标记索引、标记的数字表示，构建将用作模型输入的序列。

attention_mask

Attention mask

<块引用>

这个参数向模型表明哪些令牌应该被注意，哪些不应该。

如果 attention_mask 为 0，则忽略令牌 ID。例如，如果填充一个序列以调整序列长度，则应忽略填充的单词，因此它们的 attention_mask 为 0。

特殊令牌

BertTokenizer 添加特殊标记，用 [CLS] 和 [SEP] 包围一个序列。 [CLS] 代表分类，[SEP] 分隔序列。对于问答或释义任务，[SEP] 将两个句子分开以进行比较。

BertTokenizer

<块引用>

cls_token (str, optional, defaults to "[CLS]")
进行序列分类时使用的Classifier Token（对整体进行分类序列而不是每个令牌分类）。当使用特殊标记构建时，它是序列的第一个标记。
sep_token (str, optional, defaults to "[SEP]")
分隔符，用于从多个序列构建序列，例如用于序列分类或用于文本的两个序列和用于问答的问题。它还用作使用特殊标记构建的序列的最后一个标记。

A Visual Guide to Using BERT for the First Time 显示标记化。

[CLS]

基础模型最后一层输出中 [CLS] 的嵌入向量表示基础模型已学习到的分类。因此，将 [CLS] 标记的嵌入向量输入添加到基础模型之上的分类层。

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

<块引用>

每个序列的第一个标记总是 a special classification token ([CLS])。与此标记对应的最终隐藏状态用作分类任务的聚合序列表示。句子对被打包成一个序列。我们以两种方式区分句子。首先，我们用一个特殊的标记 ([SEP]) 将它们分开。其次，我们向每个标记添加一个学习嵌入，指示它属于句子 A 还是句子 B。

模型结构如下图所示。

矢量大小

在模型 distilbert-base-uncased 中，每个标记都嵌入到大小为 768 的向量中。基础模型的输出形状为 (batch_size, max_sequence_length, embedding_vector_size=768)。这与关于 BERT/BASE 模型的 BERT 论文一致（如 distilbert-base-uncased 所示）。

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

<块引用>

BERT/BASE（L=12，H=768，A=12，总参数=110M）和 BERT/LARGE（L=24，H=1024，A=16，总参数） =340M）。

基础模型 - TFDistilBertModel

Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks

<块引用>

TFDistilBertModel 类，用于实例化基础 DistilBERT 模型顶部没有任何特定的头部（与其他类相反，例如 TFDistilBertForSequenceClassification 确实具有添加的分类头部）。

我们不希望附加任何特定任务的头部，因为我们只是希望基础模型的预训练权重能够提供对英语语言的一般理解，并且在微调期间添加我们自己的分类头部将是我们的工作过程以帮助模型区分有毒评论。

TFDistilBertModel 生成 TFBaseModelOutput 的实例，其 last_hidden_state 参数是模型最后一层的输出。

TFBaseModelOutput([(
    'last_hidden_state',
    <tf.Tensor: shape=(batch_size, sequence_lendgth, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])

TFBaseModelOutput

<块引用>

参数

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – 模型最后一层输出的隐藏状态序列。

实施

Python 模块

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertModel,
)

配置

TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()

DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'

MAX_SEQUENCE_LENGTH = 512   # Max length allowed for BERT is 512.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())

MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768

# Flag to freeze base model
FREEZE_BASE = True

# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if USE_CUSTOM_HEAD == False:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False


BATCH_SIZE = 16
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01

分词器

tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of string to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )

输入层

基本模型期望 input_ids 和 attention_mask 的形状为 (max_sequence_length,)。分别用 Input 层为它们生成 Keras 张量。

# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')

基础模型层

从基础模型生成输出。基本模型生成 TFBaseModelOutput。将 [CLS] 的嵌入提供给下一层。

base = TFDistilBertModel.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)

# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False
    base.summary()

# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]

分类层

if USE_CUSTOM_HEAD:
    # -------------------------------------------------------------------------------
    # Classifiation leayer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)

    # --------------------------------------------------------------------------------
    # Classifiation leayer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)

Softmax 层

output = tf.keras.layers.Dense(
    units=NUM_LABELS,
    kernel_initializer='glorot_uniform',
    kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
    activation='softmax',
    name="softmax"
)(output)

最终自定义模型

name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    metrics=['accuracy']
)
model.summary()
---
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_ids (InputLayer)          [(None, 256)]        0                                            
__________________________________________________________________________________________________
attention_mask (InputLayer)     [(None, 256)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880    input_ids[0][0]                  
                                                                 attention_mask[0][0]             
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768)          0           tf_distil_bert_model[1][0]       
__________________________________________________________________________________________________
01_dropout (Dropout)            (None, 768)          0           tf.__operators__.getitem_1[0][0] 
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768)          590592      01_dropout[0][0]                 
__________________________________________________________________________________________________
01_bn (BatchNormalization)      (None, 768)          3072        01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation)            (None, 768)          0           01_bn[0][0]                      
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768)          590592      01_relu[0][0]                    
__________________________________________________________________________________________________
02_bn (BatchNormalization)      (None, 768)          3072        02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation)            (None, 768)          0           02_bn[0][0]                      
__________________________________________________________________________________________________
softmax (Dense)                 (None, 2)            1538        02_relu[0][0]                    
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952   <--- Base BERT model is frozen

数据分配

# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).batch(BATCH_SIZE).prefetch(1)

V = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),  # Convert BatchEncoding instance to dictionary
    validation_label
)).batch(BATCH_SIZE).prefetch(1)

火车

# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors, 
# if the model has named inputs. Beware of the "names". y should be consistent with x 
# (you cannot have Numpy inputs and tensor targets, or inversely). 
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,    # dictionary 
    # y=Y,
    y=None,
    epochs=NUM_EPOCHS,
    batch_size=BATCH_SIZE,
    validation_data=V,
)

要实现第一种方法，请按如下方式更改配置。

USE_CUSTOM_HEAD = False

然后将 FREEZE_BASE 更改为 False 并将 LEARNING_RATE 更改为 5e-5，这将在基础 BERT 模型上运行进一步的预训练。

保存模型

对于第 3 种方法，保存模型会导致问题。不能使用 Huggingface 模型的 save_pretrained 方法，因为该模型不是 Huggingface PreTrainedModel 的直接子类。

Keras save_model 导致默认 save_traces=True 出错，或在使用 Keras load_model 加载模型时导致 save_traces=True 出现不同错误。

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
 
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
    865       'recorded when the object is called, and used when saving. To manually '
    866       'specify the input shape/dtype, decorate the call function with '
--> 867       '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
    868 
    869 
 
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel.Please try one of the following methods to fix this issue:
 
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
 
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.

据我测试，只有 Keras Model save_weights 有效。

实验

就我使用 Toxic Comment Classification Challenge 进行的测试而言，第一种方法提供了更好的回忆（识别真正的有毒评论，真正的无毒评论）。代码可以访问如下。如有任何问题，请提供更正/建议。

Code for 1st and 3rd approach

BERT Document Classification Tutorial with Code - 使用 TFDistilBertForSequenceClassification 和 Pytorch 进行微调
Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks - 使用 TFDistilBertModel 进行微调

Answer 2

您需要使用预期的架构以tf.data格式转换输入数据，以便首先创建要素，然后训练分类模型。

如果您查看即将用于tensorflow_datasets link的胶水数据集，您将看到数据具有特定的模式：

dataset_ops.get_legacy_output_classes(data['train'])

{'idx': tensorflow.python.framework.ops.Tensor,
 'label': tensorflow.python.framework.ops.Tensor,
 'sentence': tensorflow.python.framework.ops.Tensor}

如果您想使用convert_examples_to_features准备准备注入模型的数据，则应使用这种模式。

例如，转换数据不像使用熊猫那样灵活，并且在很大程度上取决于输入数据的结构。

例如，您可以逐步找到here做这种转换。可以使用tf.data.Dataset.from_generator完成。

Answer 3

使用自定义数据集文件的HuggingFace转换器的确没有很多好的例子。

让我们先导入所需的库：

import numpy as np
import pandas as pd

import sklearn.model_selection as ms
import sklearn.preprocessing as p

import tensorflow as tf
import transformers as trfs

并定义所需的常量：

# Max length of encoded string(including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64 

# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased' 

# Batch size for fitting:
BATCH_SIZE = 16 

# Number of epochs:
EPOCHS=5

现在是时候读取数据集了：

df = pd.read_csv('data.csv')

然后从预训练的BERT中定义所需的模型以进行序列分类：

def create_model(max_sequence, model_name, num_labels):
    bert_model = trfs.TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    
    # This is the input for the tokens themselves(words from the dataset after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask - is a binary mask which tells BERT which tokens to attend and which not to attend.
    # Encoder will add the 0 tokens to the some sequence which smaller than MAX_SEQUENCE_LENGTH, 
    # and attention_mask, in this case, tells BERT where is the token from the original data and where is 0 pad token:
    attention_mask = tf.keras.layers.Input((max_sequence,), dtype=tf.int32, name='attention_mask')
    
    # Use previous inputs as BERT inputs:
    output = bert_model([input_ids, attention_mask])[0]

    # We can also add dropout as regularization technique:
    #output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Provide number of classes to the final layer:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model

现在，我们需要使用定义的函数实例化模型，并编译我们的模型：

model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, df.target.nunique())

opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

创建用于标记化的功能（将文本转换为标记）：

def batch_encode(X, tokenizer):
    return tokenizer.batch_encode_plus(
    X,
    max_length=MAX_SEQUENCE_LENGTH, # set the length of the sequences
    add_special_tokens=True, # add [CLS] and [SEP] tokens
    return_attention_mask=True,
    return_token_type_ids=False, # not needed for this type of ML task
    pad_to_max_length=True, # add 0 pad tokens to the sequences less than max_length
    return_tensors='tf'
)

加载令牌生成器：

tokenizer = trfs.BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)

将数据分为训练和验证部分：

X_train, X_val, y_train, y_val = ms.train_test_split(df.text.values, df.category_index.values, test_size=0.2)

编码我们的集合：

X_train = batch_encode(X_train)
X_val = batch_encode(X_val)

最后，我们可以使用训练集拟合模型，并在每个时期之后使用验证集进行验证：

model.fit(
    x=X_train.values(),
    y=y_train,
    validation_data=(X_val.values(), y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
)

Answer 4

扩展来自 konstantin_doncov 的答案。

配置文件

在初始化模型时，您需要定义在 Transformers 配置文件中定义的模型初始化参数。基类是 PretrainedConfig。

PretrainedConfig

<块引用>

所有配置类的基类。处理所有模型配置通用的一些参数以及加载/下载/保存配置的方法。

每个子类都有自己的参数。例如，Bert 预训练模型具有 BertConfig。

BertConfig

<块引用>

这是用于存储 BertModel 或 TFBertModel 配置的配置类。它用于根据指定的参数实例化 BERT 模型，定义模型架构。使用默认值实例化配置将产生与 BERT bert-base-uncased 架构类似的配置。

例如，num_labels 参数来自 PretrainedConfig

<块引用>

num_labels (int, optional) – 添加到模型的最后一层中使用的标签数量，通常用于分类任务。

TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

模型 bert-base-uncased 的配置文件在 Huggingface model - bert-base-uncased - config.json 发布。

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

微调（迁移学习）

Huggngface 提供了几个示例，用于对您自己的自定义数据集进行微调。例如，利用 BERT 的 Sequence Classification 功能进行文本分类。

Fine-tuning with custom datasets

<块引用>

本教程将带您了解将 ? Transformers 模型与您自己的数据集结合使用的几个示例。

Fine-tuning a pretrained model

<块引用>

如何微调来自 Transformers 库的预训练模型。在 TensorFlow 中，可以使用 Keras 和 fit 方法直接训练模型。

但是，文档中的示例是概述，缺乏详细信息。

Fine-tuning with native PyTorch/TensorFlow

from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)

github 提供了完整的代码。

HuggingFace Text classification examples

<块引用>

此文件夹包含一些脚本，显示了使用 Hugs Transformers 库进行文本分类的示例。

run_text_classification.py 是 TensorFlow 文本分类微调的示例。

然而，这并不简单也不直接，因为它旨在作为通用和通用用途。因此没有一个很好的例子供人们开始，导致人们需要提出这样的问题。

分类层

您会看到迁移学习（微调）文章解释在预训练的基础模型之上添加分类层，答案也是如此。

output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

然而，文档中的拥抱脸示例并没有添加任何分类层。

from transformers import TFDistilBertForSequenceClassification

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)

这是因为 TFBertForSequenceClassification 已经添加了图层。

Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks

<块引用>

顶部没有任何特定头部的基础 DistilBERT 模型（相对于其他类，例如 TFDistilBertForSequenceClassification 确实添加了分类头部）。

如果您显示 Keras 模型摘要，例如 TFDistilBertForSequenceClassification，它会显示在基础 BERT 模型之上添加的 Dense 和 Dropout 层。

Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
distilbert (TFDistilBertMain multiple                  66362880  
_________________________________________________________________
pre_classifier (Dense)       multiple                  590592    
_________________________________________________________________
classifier (Dense)           multiple                  1538      
_________________________________________________________________
dropout_59 (Dropout)         multiple                  0         
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0

冻结基础模型参数

有一些讨论，例如Fine Tune BERT Models 但显然 Huggingface 的方式不是冻结基本模型参数。如图所示，Keras 模型摘要 abobe Non-trainable params: 0。

冻结基础 distilbert 层。

for _layer in model:
    if _layer.name == 'distilbert':
        print(f"Freezing model layer {_layer.name}")
        _layer.trainable = False
    print(_layer.name)
    print(_layer.trainable)
---
Freezing model layer distilbert
distilbert
False      <----------------
pre_classifier
True
classifier
True
dropout_99
True

资源

其他需要研究的资源是 Kaggle。搜索关键词“huggingface”“BERT”，你会找到比赛发布的工作代码。

使用自定义X和Y数据训练TFBertForSequenceClassification

4 个答案:

微调方法

第一种方法

第二种方法

第二种方法的实现

第三种方法

基础

分词器

input_ids

attention_mask

特殊令牌

[CLS]

矢量大小

基础模型 - TFDistilBertModel

实施

Python 模块

配置

分词器

输入层

基础模型层

分类层

Softmax 层

最终自定义模型

数据分配

火车

保存模型

实验

配置文件

微调（迁移学习）

分类层

冻结基础模型参数

资源

使用自定义X和Y数据训练TFBertForSequenceClassification

4 个答案:

微调方法

第一种方法

第二种方法

第二种方法的实现

第三种方法

基础

分词器

input_ids

attention_mask

特殊令牌

[CLS]

矢量大小

基础模型 - TFDistilBertModel

实施

Python 模块

配置

分词器

输入层

基础模型层

分类层

Softmax 层

最终自定义模型

数据分配

火车

保存模型

实验

相关

配置文件

微调（迁移学习）

分类层

冻结基础模型参数

资源