Question

I am training a network on a distributed cluster. I consulted [official API: tf.data Performance]

and wrote my input_fn as follows:

    import tensorflow as tf


    def input_fn(file_pattern, feature_spec, label_name, batch_size=32,
                 num_epochs=1, shuffle_buffer_size=2048, num_processor=8):
        """base input_fn."""
        # list_files shuffles the file names by default.
        dataset = tf.data.Dataset.list_files(file_pattern)
        # num_parallel_calls must be less than or equal to cycle_length.
        dataset = dataset.interleave(tf.data.TFRecordDataset,
                                     cycle_length=15, num_parallel_calls=15)
        # _parse_fn is defined elsewhere (not shown here).
        dataset = dataset.map(lambda x: _parse_fn(x, feature_spec, label_name),
                              num_parallel_calls=15)
        if shuffle_buffer_size > 1:
            dataset = dataset.shuffle(buffer_size=shuffle_buffer_size)
        if num_epochs > 1:
            dataset = dataset.repeat(num_epochs)
        dataset = dataset.batch(batch_size)
        # prefetch takes buffer_size; placed after batch(), it counts batches.
        dataset = dataset.prefetch(buffer_size=15 * batch_size)
        return dataset

I tried different values for parameters such as cycle_length and num_parallel_calls, but training still does not fully load my CPU, GPU, memory, or NET_BRANCH.

I cannot figure out where the bottleneck in my code is. Thanks for your replies!

Here is my compute resource log.

Also, my data is stored on HDFS in TFRecord format, and the classifier is a custom Estimator.
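One way to narrow down the bottleneck described above is to time the input pipeline on its own, without the model. The following is only a minimal sketch: `measure_throughput` is a hypothetical helper (not part of TensorFlow) that works on any Python iterable of batches, and it assumes a TensorFlow version in which the dataset returned by input_fn can be iterated directly in eager mode.

```python
import time

_END = object()  # sentinel marking an exhausted iterator


def measure_throughput(batches, max_batches=100, warmup=5):
    """Return (batches_timed, batches_per_second) for an iterable of batches.

    Hypothetical helper for illustration; it is not a TensorFlow API.
    """
    iterator = iter(batches)
    # Consume a few batches first so one-time setup cost (opening files,
    # filling the shuffle buffer) does not distort the measurement.
    for _ in range(warmup):
        if next(iterator, _END) is _END:
            return 0, 0.0
    start = time.perf_counter()
    count = 0
    for _ in iterator:
        count += 1
        if count >= max_batches:
            break
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else 0.0
    return count, rate


if __name__ == "__main__":
    # Stand-in iterable; under the eager-mode assumption one would pass
    # the dataset returned by input_fn(...) instead.
    timed, per_second = measure_throughput(range(1000), max_batches=100)
    print(timed, "batches timed at", round(per_second), "batches/s")
```

If the pipeline alone delivers many more batches per second than a training step consumes, the input side is probably not the limiting factor, and the model or the distribution setup would be the next place to look. In graph mode (TF 1.x) the same idea would need a session and an iterator rather than direct iteration.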

Improve training speed using tf.estimator and tf.data.Dataset

0 answers: