Question

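I am trying to fine-tune the HuggingFace TFBertModel to classify some text into a single label. I have the model up and running, but the accuracy is extremely low from the start. My expectation was that accuracy would be high, since the model uses the pre-trained BERT weights as its starting point. I was hoping to get some advice on where I am going wrong.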

I am using the BBC text dataset from here:

Loading the data

df = pd.read_csv(open(<s3 url>),encoding='utf-8', error_bad_lines=False)
df = df.sample(frac=1)
df = df.dropna(how='any')

Value counts

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: label, dtype: int64

Preprocessing

def preprocess_text(sen):
    # Convert html entities to normal
    sentence = unescape(sen)

    # Remove html tags
    sentence = remove_tags(sentence)

    # Remove newline chars
    sentence = remove_newlinechars(sentence)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Convert to lowercase
    sentence = sentence.lower()

    return sentence


def remove_newlinechars(text):
    return " ".join(text.splitlines()) 

def remove_tags(text):
    TAG_RE = re.compile(r'<[^>]+>')
    return TAG_RE.sub('', text)

df['text_prepd'] = df['text'].apply(preprocess_text)

Splitting the data

train, val = train_test_split(df, test_size=0.30, shuffle=True, stratify=df['label'])

Encoding the labels

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train = np.asarray(le.fit_transform(train['label']))
y_val = np.asarray(le.fit_transform(val['label']))

Defining the BERT input function

# Initialise Bert Tokenizer
bert_tokenizer_transformer = BertTokenizer.from_pretrained('bert-base-cased')

def create_input_array(df, tokenizer, args):
    sentences = df.text_prepd.values

    input_ids = []
    attention_masks = []
    token_type_ids = []

    for sent in tqdm(sentences):
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
            sent,  # Sentence to encode.
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            max_length=args.max_seq_len,  # Pad & truncate all sentences.
            pad_to_max_length=True,
            return_attention_mask=True,  # Construct attn. masks.
            return_tensors='tf',  # Return tf tensors.
        )

        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

        token_type_ids.append(encoded_dict['token_type_ids'])

    input_ids = tf.convert_to_tensor(input_ids)
    attention_masks = tf.convert_to_tensor(attention_masks)
    token_type_ids = tf.convert_to_tensor(token_type_ids)

    return input_ids, attention_masks, token_type_ids

Converting the data to BERT inputs

train_inputs = [create_input_array(train[:], tokenizer=bert_tokenizer_transformer, args=args)]
val_inputs = [create_input_array(val[:], tokenizer=bert_tokenizer_transformer, args=args)]

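For train_inputs, y_train and val_inputs, y_val, I then apply the function below, which reshapes the inputs and converts them to numpy arrays. The list returned from this function is then passed as arguments to the keras fit method. I realise that converting to tf.tensor and then to numpy is a bit of overkill, but I don't think it has any impact. I originally tried to use tf.datasets but switched to numpy.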

def convert_inputs_to_tf_dataset(inputs,y, args):
    # args.max_seq_len = 256
    ids = inputs[0][1]
    masks = inputs[0][1]
    token_types = inputs[0][2]

    ids = tf.reshape(ids, (-1, args.max_seq_len))
    print("Input ids shape: ", ids.shape)
    masks = tf.reshape(masks, (-1, args.max_seq_len))
    print("Input Masks shape: ", masks.shape)
    token_types = tf.reshape(token_types, (-1, args.max_seq_len))
    print("Token type ids shape: ", token_types.shape)

    ids=ids.numpy()
    masks = masks.numpy()
    token_types = token_types.numpy()

    return [ids, masks, token_types, y]

Keras model

# args.max_seq_len = 256
# n_classes = 6
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', trainable=True, num_labels=n_classes)

input_ids_layer = Input(shape=(args.max_seq_len, ), dtype=np.int32)
input_mask_layer = Input(shape=(args.max_seq_len, ), dtype=np.int32)
input_token_type_layer = Input(shape=(args.max_seq_len,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer, input_token_type_layer])[0]
flat_layer = Flatten()(bert_layer)
dropout= Dropout(0.3)(flat_layer)
dense_output = Dense(n_classes, activation='softmax')(dropout)

model_ = Model(inputs=[input_ids_layer, input_mask_layer, input_token_type_layer], outputs=dense_output)

Compile and fit

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer='adam', loss=loss, metrics=[metric])
model.fit(inputs=..., outputs=..., validation_data=..., epochs=50, batch_size = 32, metrics=metric, verbose=1)


Epoch 32/50
1401/1401 [==============================] - 42s 30ms/sample - loss: 1.6103 - accuracy: 0.2327 - val_loss: 1.6042 - val_accuracy: 0.2308

Since BERT normally needs only a few epochs, I was expecting something much higher than 23% after 32 epochs.

Answer 1

The main problem is in this line: ids = inputs[0][1]. The ids are actually the first element of inputs[0], so it should be ids = inputs[0][0].
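To make the indexing mistake concrete, here is a minimal sketch in plain Python. The nested lists are made-up placeholder values standing in for the tensors that create_input_array returns, not real tokenizer output. It shows that index 1 picks the attention mask, and that unpacking the tuple by name avoids this kind of off-by-one slip.

```python
# Stand-in for the question's `train_inputs`: a one-element list wrapping the
# (input_ids, attention_masks, token_type_ids) tuple from create_input_array.
# The values below are illustrative placeholders only.
inputs = [(
    [[101, 7592, 102, 0]],  # input_ids
    [[1, 1, 1, 0]],         # attention_masks
    [[0, 0, 0, 0]],         # token_type_ids
)]

# Line from the question: index 1 is the attention mask, not the ids.
buggy_ids = inputs[0][1]

# Unpacking by name makes the order explicit and harder to get wrong.
ids, masks, token_types = inputs[0]

print("buggy ids:  ", buggy_ids)
print("correct ids:", ids)
```

With the original indexing the model is fed the attention mask as its token ids, so every article looks like a sequence of 0s and 1s, which would be consistent with accuracy stuck near chance level.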

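There is another issue, which may lead to inconsistent validation accuracy: the LabelEncoder should be fitted only once to construct the label mapping, so you should use the transform method, instead of fit_transform, on the validation labels.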
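The sketch below illustrates why refitting matters. It does not use scikit-learn; fit_label_ids and transform are hypothetical helpers that mimic how LabelEncoder assigns integer ids to the sorted distinct labels it sees while fitting. The validation list is an invented example that happens to miss two classes.

```python
def fit_label_ids(labels):
    """Assign an integer id to each distinct label, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def transform(labels, mapping):
    """Encode labels with an already fitted mapping."""
    return [mapping[label] for label in labels]


train_labels = ["sport", "business", "politics", "tech", "entertainment"]
val_labels = ["sport", "tech", "politics"]  # hypothetical: two classes absent

# Fit once on the training labels, then reuse that mapping everywhere.
train_mapping = fit_label_ids(train_labels)
y_val_consistent = transform(val_labels, train_mapping)

# Refitting on the validation labels builds a different mapping.
refit_mapping = fit_label_ids(val_labels)
y_val_refit = transform(val_labels, refit_mapping)

print(y_val_consistent)  # ids that agree with the training labels
print(y_val_refit)       # ids that no longer mean the same classes
```

In the question's code the equivalent change is to call fit_transform on train['label'] and then transform (not fit_transform) on val['label'] with the same encoder object.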

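Another point is that you may need a lower learning rate for the optimizer. The default learning rate of the Adam optimizer is 1e-3, which might be too high considering that you are fine-tuning a pretrained model. Try a lower learning rate such as 1e-4 or 1e-5, e.g. tf.keras.optimizers.Adam(learning_rate=1e-4). A high learning rate when fine-tuning a pretrained model may destroy the learned weights and disrupt the fine-tuning process (because of the large gradient values generated, especially at the start of fine-tuning).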
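As a rough intuition for how a large step size can wreck a good starting point, here is a toy sketch. It uses plain gradient descent on a one-dimensional quadratic, not Adam and not BERT; the function gradient_descent, the curvature value and the two learning rates are all assumptions chosen purely for illustration.

```python
def gradient_descent(w, learning_rate, steps, curvature=50.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    for _ in range(steps):
        grad = curvature * w          # derivative of f at w
        w = w - learning_rate * grad  # one update step
    return w


# Pretend this weight is "pretrained": it already sits close to the optimum at 0.
w_pretrained = 0.1

w_high = gradient_descent(w_pretrained, learning_rate=0.05, steps=20)
w_low = gradient_descent(w_pretrained, learning_rate=0.005, steps=20)

print("high learning rate:", w_high)  # overshoots on every step and moves away
print("low learning rate: ", w_low)   # moves steadily towards the optimum
```

The larger step overshoots the minimum on every update and ends up far from where it started, while the smaller one improves on the starting point. Fine-tuning a real network is far more complicated, but the same effect is one reason to start from a small learning rate.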

Answer 2

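I am not sure I understand all of your steps, especially where you use the tokenizer, and I don't know where the problem might be, but there is an easier way to do this. Hugging Face transformers provides some simple solutions for text classification: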

model = TFBertForSequenceClassification.from_pretrained('bert-base-cased')

There are also functions that put the data into the format the model expects:

glue_convert_examples_to_features

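You will need to dig into the documentation to see all the parameters you can set, such as the number of classes, the type of GLUE task used for preprocessing...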

You can find some examples here: https://pypi.org/project/transformers/

Tensorflow / Keras / BERT multi-class text classification accuracy
