I want to classify documents with a BERT model. Each document consists of multiple sequences, so my plan is to first classify every sequence on its own and take its fine-tuned "pooled" embedding (the embedding of the CLS token). I then collect all "pooled" embeddings that belong to the same document into a list and classify those lists with an LSTM model.
The data (train and val) has three columns: text, text_split and label. To turn the documents into sequences (text_split is the list holding a document's sequences), I build a new dataframe as follows (same for val, not shown here):
# Flatten the documents into one row per sequence, remembering the
# label and the original document index of every sequence.
train_l = []
train_label_l = []
train_index_l = []
for idx, row in train.iterrows():
    for l in row['text_split']:
        train_l.append(l)
        train_label_l.append(row['label'])
        train_index_l.append(idx)
train_df = pd.DataFrame({'text': train_l, 'label': train_label_l})
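(For reference, a minimal pandas equivalent of this loop, assuming text_split holds plain Python lists; the hypothetical doc_index column here plays the role of the separate train_index_l list:)

train_df = (train[['text_split', 'label']]
            .explode('text_split')                      # one row per sequence
            .rename(columns={'text_split': 'text'})
            .reset_index()                              # keep the document index
            .rename(columns={'index': 'doc_index'}))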
I first trained the classification model for individual sequences with the code below, using the four numpy arrays derived from these dataframes (train_x, train_y, val_x and val_y) as input:
def create_model(max_seq_len, bert_ckpt_file):
    # Build the BERT layer from the stock Google checkpoint config
    # (bert_config_file is defined globally, next to bert_ckpt_file).
    with tf.io.gfile.GFile(bert_config_file, 'r') as reader:
        bc = StockBertConfig.from_json_string(reader.read())
        bert_params = map_stock_config_to_params(bc)
        bert_params.adapter_size = None
        bert = BertModelLayer.from_params(bert_params, trainable=True, name='bert')

    input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype='int32', name='input_ids')
    bert_output = bert(input_ids)
    # Keep only the [CLS] token's embedding as the "pooled" representation.
    cls_out = tf.keras.layers.Lambda(lambda seq: seq[:, 0, :], name='lambda')(bert_output)
    dropout_out = tf.keras.layers.Dropout(0.2, name='dropout_one')(cls_out)
    # Softmax over the 10 classes, so the output is probabilities, not raw logits.
    logits = tf.keras.layers.Dense(units=10,
                                   kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02),
                                   bias_initializer=tf.zeros_initializer(),
                                   activation='softmax',
                                   name='output')(dropout_out)

    model = tf.keras.Model(inputs=input_ids, outputs=logits, name='BERT_finetuning')
    load_stock_weights(bert, bert_ckpt_file)
    return model
model = create_model(200, bert_ckpt_file)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, beta_1=0.9, beta_2=0.999,
                                                 epsilon=1e-07, amsgrad=False),
              # from_logits=False because the model already ends in a softmax
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')])
model.fit(x=train_x, y=train_y, validation_data=(val_x, val_y), batch_size=16, shuffle=True, epochs=3)
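(For completeness, since the preprocessing is not shown above: the four arrays can be built roughly like this. This is only a sketch using the FullTokenizer that ships with bert-for-tf2, and vocab_file is an assumption about where the checkpoint's vocab.txt lives:)

import numpy as np
from bert.tokenization.bert_tokenization import FullTokenizer

# vocab_file is assumed to be the vocab.txt next to bert_ckpt_file.
tokenizer = FullTokenizer(vocab_file, do_lower_case=True)

def encode(texts, max_seq_len):
    all_ids = []
    for text in texts:
        # Truncate, wrap in [CLS]/[SEP], convert to ids, pad with 0 ([PAD]).
        tokens = ['[CLS]'] + tokenizer.tokenize(text)[:max_seq_len - 2] + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        all_ids.append(ids + [0] * (max_seq_len - len(ids)))
    return np.array(all_ids)

train_x = encode(train_df['text'], 200)
train_y = train_df['label'].values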
Training went well and reached roughly 85% accuracy on the validation set. No problems so far.
Next, I collected the fine-tuned "pooled" embedding of every sample as follows (the code for val is analogous; it runs in batches of 100 because of OOM errors):
# model.layers[2] is the Lambda layer, i.e. the [CLS] embedding.
CLS_model = tf.keras.Model(inputs=model.input, outputs=model.layers[2].output)

CLS_layer_training_total = np.empty(shape=(0, 768))
for i in range(int(np.ceil(train_x.shape[0] / 100))):
    CLS_layer_training = CLS_model.predict(train_x[i * 100: (i + 1) * 100])
    CLS_layer_training_total = np.concatenate((CLS_layer_training_total, np.array(CLS_layer_training)), axis=0)
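As an aside, Keras can do the batching itself: predict accepts a batch_size argument, which also avoids re-copying the growing array on every np.concatenate:

# Equivalent to the loop above in a single call.
CLS_layer_training_total = CLS_model.predict(train_x, batch_size=100)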
The embeddings of the sequences that belong to the same original document are then collected into lists, like this (again the same for train and val, only train shown):
# Group the per-sequence [CLS] embeddings by their original document index.
train_final_x = {}
for l, embedding in zip(train_index_l, CLS_layer_training_total):
    if l in train_final_x:
        train_final_x[l] = np.vstack([train_final_x[l], embedding])
    else:
        train_final_x[l] = [embedding]

train_data_final = []
train_label_final = []
for k in train_final_x.keys():
    train_data_final.append(train_final_x[k])
    train_label_final.append(train.loc[k, 'label'])
df_train = pd.DataFrame({'embedding': train_data_final, 'label': train_label_final})
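One thing worth checking at this point is the shape of the grouped entries: with the code above, a multi-sequence document is a 2D array of shape (n_sequences, 768), while a single-sequence document stays a plain list holding one vector. A quick sanity check:

# Print the shapes of the first few grouped documents; np.asarray
# normalizes both representations to (n_sequences, 768) for inspection.
for k in list(train_final_x)[:3]:
    print(k, np.asarray(train_final_x[k]).shape)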
I then trained the final model on this data:
inputs = tf.keras.layers.Input(shape=(None, 768), dtype='float32', name='input')
# Timesteps padded with -99. by the generator below are masked out here.
l_mask = tf.keras.layers.Masking(mask_value=-99., name='mask')(inputs)
encoded_text = tf.keras.layers.LSTM(100, name='LSTM')(l_mask)
out_dense = tf.keras.layers.Dense(30, activation='relu', name='dense')(encoded_text)
outputs = tf.keras.layers.Dense(10, activation='softmax', name='output')(out_dense)
model_final = tf.keras.Model(inputs, outputs, name='final_model')

model_final.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
model_final.fit(train_generator(df_train), steps_per_epoch=batches_per_epoch, epochs=10,
                validation_data=val_generator(df_val), validation_steps=batches_per_epoch_val)
with the following generator (the validation generator is analogous):
num_sequences = len(df_train.embedding.to_list())
batch_size = 3
# Integer division: range() below needs an int (any partial last batch is dropped).
batches_per_epoch = num_sequences // batch_size
num_features = 768

def train_generator(df):
    x_list = df.embedding.to_list()
    y_list = df.label.to_list()
    # generate batches, padding each batch to its longest document
    while True:
        for b in range(batches_per_epoch):
            batch_x = x_list[b * batch_size: (b + 1) * batch_size]
            timesteps = len(max(batch_x, key=len))
            # Fill with the Masking layer's mask value, then overwrite the
            # real timesteps of each document in the batch.
            x_train = np.full((batch_size, timesteps, num_features), -99.)
            y_train = np.zeros((batch_size, 1))
            for i in range(batch_size):
                li = b * batch_size + i
                x_train[i, 0: len(x_list[li]), :] = x_list[li]
                y_train[i] = y_list[li]
            yield x_train, y_train
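As an alternative to the generator, everything could also be padded once up front with pad_sequences (the padding value has to match the Masking layer's mask_value) and passed to fit as plain arrays; a minimal sketch:

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad every document to the length of the longest one; the -99. padding
# steps are then ignored thanks to the Masking layer.
x_all = pad_sequences(df_train.embedding.to_list(), dtype='float32',
                      padding='post', value=-99.)
y_all = np.array(df_train.label.to_list())
model_final.fit(x_all, y_all, batch_size=3, shuffle=True, epochs=10)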
The final model only reaches about 10% accuracy, i.e. chance level given the 10 classes. I have the same model implemented in TensorFlow, where the final model reaches 92% accuracy (and the fine-tuned model also reaches 85% there), so I know there must be a bug somewhere. Any suggestions?
PS: I tried to keep the code as minimal as possible, but I think all of the above is necessary to understand what is going on.