Combining features and labels correctly to produce a tf dataset for model.fit (using tf.data.Dataset.from_tensor_slices)

Date: 2020-07-13 10:35:36

Tags: python tensorflow dataset tensorflow2.0

I have created a model that takes input of shape (None, 512). Below is my model summary:

(screenshot of model summary)

Shape of the training features:

train_ids.shape
(10, 512)

Shape of the training response variable:

indus_cat_train.shape
(10, 49)

My model runs perfectly if I use:

history = model.fit(
    train_ids, indus_cat_train, epochs=2, validation_data=(
        valid_ids, indus_cat_valid))

However, my actual dataset is quite large, and feeding the whole dataset in at once consumes so much RAM that it kills all my processes.

I want to feed the data in batches rather than all at once. To accomplish this I tried the tf.data.Dataset.from_tensor_slices function:

# training data
tf_train_data = tf.data.Dataset.from_tensor_slices((train_ids, indus_cat_train))

# validation data
tf_valid_data = tf.data.Dataset.from_tensor_slices((valid_ids, indus_cat_valid))

The above code runs fine and, on inspection, yields the desired shapes:

for elem in tf_train_data:
    print(elem[0].shape)  # for features
    print(elem[1].shape)  # for response

Printed output:

    (512,) # for features
    (49,)  # for response variable
    # (remaining output omitted to save space)

But calling model.fit on tf_train_data gives me an error:

bert_history = model.fit(
    tf_train_data, epochs=2, validation_data=tf_valid_data)

WARNING:tensorflow:Model was constructed with shape (None, 512) for input Tensor("input_ids_1:0", shape=(None, 512), dtype=int32), but it was called on an input with incompatible shape (512, 1).
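The warning becomes easier to interpret once you inspect the dataset's element_spec: from_tensor_slices slices along the first (sample) axis, so each element is a single example with no batch dimension. A minimal reproduction, using toy zero-filled arrays with the shapes from the question (the data itself is an assumption):

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for train_ids (10, 512) and indus_cat_train (10, 49)
features = np.zeros((10, 512), dtype=np.int32)
labels = np.zeros((10, 49), dtype=np.float32)

ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Each element is one (features, label) pair: shapes (512,) and (49,),
# not the batched (None, 512) input the model was built for.
print(ds.element_spec)
```

This is why model.fit complains: Keras expects a dataset to yield batches, while this one yields individual examples.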

Sharing the model code for further understanding, as requested by Prateek:
import os

import tensorflow as tf
from tensorflow import keras

# imports below are from the bert-for-tf2 package
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights
from bert.tokenization.bert_tokenization import FullTokenizer

# training data
tf_train_data = tf.data.Dataset.from_tensor_slices((train_ids, indus_cat_train))

# validation data
tf_valid_data = tf.data.Dataset.from_tensor_slices((valid_ids, indus_cat_valid))

# model downloaded from bert
bert_model_name = "uncased_L-12_H-768_A-12"
bert_ckpt_dir = "bert_model"
bert_ckpt_file = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")

# creating tokenizer
tokenizer = FullTokenizer(vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt"))


# create function for model
def create_model(max_seq_len, bert_ckpt_file, n_classes):
    with tf.io.gfile.GFile(bert_config_file, "r") as reader:
        # get bert configurations
        bert_configurations = StockBertConfig.from_json_string(reader.read())
        bert_params = map_stock_config_to_params(bert_configurations)
        bert_params.adapter_size = None  # fixed: was `bert_params_adapter_size = None`
        bert = BertModelLayer.from_params(bert_params, name="bert")

    input_ids = keras.layers.Input(shape=(max_seq_len,), dtype="int32",
                                   name="input_ids")
    bert_output = bert(input_ids)
    print("bert shape", bert_output.shape)

    # take the [CLS] token's representation for classification
    cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
    cls_out = keras.layers.Dropout(0.5)(cls_out)
    logits = keras.layers.Dense(units=765, activation="tanh")(cls_out)
    logits = keras.layers.Dropout(0.5)(logits)
    logits = keras.layers.Dense(units=n_classes, activation="softmax")(logits)

    model = keras.Model(inputs=input_ids, outputs=logits)
    model.build(input_shape=(None, max_seq_len))
    load_stock_weights(bert, bert_ckpt_file)
    return model


n_cats = 49  # number of output categories
model = create_model(max_seq_len=512, bert_ckpt_file=bert_ckpt_file,
                     n_classes=n_cats)
model.summary()

optimizer = tf.keras.optimizers.Adam(
    learning_rate=learning_rate, epsilon=1e-08)  # learning_rate defined elsewhere

loss = tf.keras.losses.CategoricalCrossentropy()
metric = tf.keras.metrics.CategoricalCrossentropy(name='categorical_crossentropy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

bert_history = model.fit(tf_train_data, epochs=2, validation_data=tf_valid_data)

1 Answer:

Answer 0 (score: 1)

I solved it using dataset.batch. The tf.data.Dataset pipeline was missing a batch size, so the tensors it provided were never batched; that is, the shapes I was getting were (512, 1) instead of (512,), and (49, 1) instead of (49,).
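The answer's original code snippet was not preserved in this dump; a minimal sketch of the fix, with toy zero-filled arrays in the question's shapes and an assumed batch size of 2, might look like:

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the question's arrays; the batch size of 2 is an assumption.
train_ids = np.zeros((10, 512), dtype=np.int32)
indus_cat_train = np.zeros((10, 49), dtype=np.float32)

tf_train_data = (
    tf.data.Dataset.from_tensor_slices((train_ids, indus_cat_train))
    .batch(2)  # the missing step: group individual examples into batches
)

# Elements now carry a leading batch dimension, matching the model's
# (None, 512) input signature expected by model.fit.
for x, y in tf_train_data.take(1):
    print(x.shape, y.shape)  # (2, 512) (2, 49)
```

The same `.batch(...)` call would be applied to the validation dataset before passing both to model.fit.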
