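I am running a distributed TensorFlow application with 20 parameter servers (PS) and 100 workers.
There is a chief worker, worker0, which does some extra work such as variable initialization.
When I try to initialize some very large variables with sess.run(tf.variables_initializer([large_variable_list])), the initialization is very slow. It looks as if the variables are created on worker0 and then sent to the PS.
So my question is: what is the exact mechanism of variable initialization in distributed TensorFlow? Are the variables created on the worker and then sent to the PS?
How can I trace the variable initialization process?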
Log of worker0:
2019-09-09 08:20:12.238889 Start to initialize all variables
2019-09-09 08:20:20.549082: W tensorflow/core/framework/allocator.cc:124] Allocation of 12271335136 exceeds 10% of system memory.
2019-09-09 08:20:36.105050: W tensorflow/core/framework/allocator.cc:124] Allocation of 12271335136 exceeds 10% of system memory.
2019-09-09 08:21:06.638078: W tensorflow/core/framework/allocator.cc:124] Allocation of 12271335136 exceeds 10% of system memory.
2019-09-09 08:21:27.582467: W tensorflow/core/framework/allocator.cc:124] Allocation of 12271335136 exceeds 10% of system memory.
2019-09-09 08:21:53.365509: W tensorflow/core/framework/allocator.cc:124] Allocation of 12271335136 exceeds 10% of system memory.
2019-09-09 08:42:32.114437 Initialize all variables success.
2019-09-09 08:42:33.471465 Start to initialize local variables
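To put a number on "slow", here is a minimal sketch that computes the elapsed time between the two application log lines that bracket the global initialization. The timestamps are copied from the log; the names `FMT`, `start`, `end` and `elapsed` are only illustrative.

```python
from datetime import datetime

# Timestamp layout used by the application log lines above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# "Start to initialize all variables" and "Initialize all variables success."
start = datetime.strptime("2019-09-09 08:20:12.238889", FMT)
end = datetime.strptime("2019-09-09 08:42:32.114437", FMT)

elapsed = (end - start).total_seconds()
print(f"global variable initialization took {elapsed:.1f} s ({elapsed / 60:.1f} min)")
```

So the single sess.run call for the global initializer takes a bit over 22 minutes.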
# code snippet
import tensorflow as tf  # TensorFlow 1.x API

# variable_initializer, ps_count and reusable_barrier are defined elsewhere in the application.
term_embeddings = tf.get_variable(
    "term_embeddings", [383479222, 5], dtype=tf.float32,
    initializer=variable_initializer,
    partitioner=tf.fixed_size_partitioner(ps_count))
global_var_init_op = tf.variables_initializer(
    [v for v in tf.global_variables() if v not in reusable_barrier.get_variables()])
sess.run([global_var_init_op])
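For scale, here is a back-of-the-envelope estimate of how large term_embeddings is, in total and per partition. It assumes that ps_count equals the 20 PS tasks mentioned above and that the partitioner splits along the first axis into roughly equal shards. The logged allocation size is printed only for comparison; I do not know which tensor that allocation corresponds to.

```python
ROWS, COLS = 383479222, 5   # shape of term_embeddings in the snippet above
BYTES_PER_FLOAT32 = 4
NUM_SHARDS = 20             # assumption: ps_count == number of PS tasks

# Total size of the full embedding table.
total_bytes = ROWS * COLS * BYTES_PER_FLOAT32

# Largest shard when the rows are split as evenly as possible (integer ceiling).
rows_per_shard = -(-ROWS // NUM_SHARDS)
shard_bytes = rows_per_shard * COLS * BYTES_PER_FLOAT32

# Size reported by the allocator warning in the worker0 log.
logged_alloc_bytes = 12271335136

GIB = 2 ** 30
MIB = 2 ** 20
print(f"full variable:     {total_bytes / GIB:.2f} GiB")
print(f"largest shard:     {shard_bytes / MIB:.1f} MiB ({rows_per_shard} rows)")
print(f"logged allocation: {logged_alloc_bytes / GIB:.2f} GiB")
```

Under these assumptions each PS should only need to hold a few hundred MiB for this variable, while the allocation warned about on worker0 is larger than the whole table, which is part of why I suspect the values are being materialized on worker0 first.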