Question

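I have one CPU with 64 cores, and I installed TensorFlow from Anaconda. I know that if I had multiple CPUs, I could distribute the computation by specifying the CPU id, as shown below (adapted from here):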

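import tensorflow as tf

# Pin each part of the graph to its own CPU device (assumes three CPU devices exist)
with tf.device("/cpu:0"):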
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:1"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:2"):
    loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    loss0, _ = sess.run([loss, train_op])
    print("loss", loss0)

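The sample code above assumes there are three CPUs. What I want to know is whether I can use my existing hardware (1 CPU, 64 cores) efficiently for some deep learning practice. Can anyone help or point me in the right direction?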

Update:

The processor is an Intel Xeon Phi model.
Also note that I don't have admin rights, so I can't compile any libraries. I installed every Python library through Anaconda.

I tried to understand what is going on. I applied the timeline technique (from here) to the code given above, as follows:

import tensorflow as tf
from tensorflow.python.client import timeline


with tf.device("/cpu:0"):
    a = tf.Variable(tf.ones(()))
    a = tf.square(a)
with tf.device("/cpu:0"):
    b = tf.Variable(tf.ones(()))
    b = tf.square(b)
with tf.device("/cpu:0"):
    loss = a+b
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(10):
    # Fresh trace options and metadata for each step; only the last step's trace is written out below
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    loss0, _ = sess.run([loss, train_op], options=run_options, run_metadata=run_metadata)
    print("loss", loss0)

# Create the Timeline object and write it to a JSON file that Chrome's tracing view can load
tl = timeline.Timeline(run_metadata.step_stats)
ctf = tl.generate_chrome_trace_format()
with open('timeline_execution1.json', 'w') as f:
    f.write(ctf)

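I then generated several JSON files by passing config=tf.ConfigProto(intra_op_parallelism_threads=#, inter_op_parallelism_threads=#) in the tf.Session() line with different values, and viewed the timelines in Chrome. The outputs differed, but I could make sense of only one thing: this program uses 4 cores no matter which options I pass to tf.Session(), as the timeline shows.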
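For reference, the session configuration described above might be built as in the sketch below. The helper names thread_settings and make_session are hypothetical, and the rule for splitting cores between the two thread pools is only an illustrative assumption, not a recommendation from TensorFlow. The code assumes the TensorFlow 1.x API used in the rest of this question.

```python
import os


def thread_settings(total_cores, inter_op=2):
    """Split cores between ops that run concurrently (inter-op)
    and threads used inside a single op (intra-op).

    The even split is an illustrative assumption; good values depend on the model.
    """
    if total_cores < 1 or inter_op < 1:
        raise ValueError("total_cores and inter_op must be positive")
    inter_op = min(inter_op, total_cores)
    intra_op = max(1, total_cores // inter_op)
    return {
        "intra_op_parallelism_threads": intra_op,
        "inter_op_parallelism_threads": inter_op,
    }


def make_session(total_cores=None, inter_op=2):
    """Create a TF 1.x session with the thread settings computed above."""
    import tensorflow as tf  # imported lazily; assumes the TensorFlow 1.x API

    if total_cores is None:
        total_cores = os.cpu_count() or 1
    config = tf.ConfigProto(**thread_settings(total_cores, inter_op))
    return tf.Session(config=config)


# The settings that would be passed to tf.ConfigProto on a 64-core machine
print(thread_settings(64, inter_op=2))
```

Replacing sess = tf.Session() with sess = make_session(64) would apply these settings. A graph as small as the one in this question has very little work to parallelize, so the thread pool sizes may not change how many cores end up busy.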

Answer 1

If you have an Intel CPU (possibly a Xeon Phi), compiling TensorFlow with MKL may speed things up.

You can see how it is done here
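If the installed TensorFlow build does use MKL, its OpenMP threading is commonly tuned through environment variables rather than through code. The sketch below is a minimal illustration under that assumption. OMP_NUM_THREADS, KMP_BLOCKTIME and KMP_AFFINITY are real Intel OpenMP runtime variables, but the helper names mkl_env and apply_env are hypothetical and the values are only starting points to experiment with.

```python
import os


def mkl_env(physical_cores, blocktime_ms=1):
    """Return Intel OpenMP settings as environment-variable strings.

    The values are illustrative assumptions, not tuned recommendations.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be positive")
    return {
        "OMP_NUM_THREADS": str(physical_cores),
        "KMP_BLOCKTIME": str(blocktime_ms),
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }


def apply_env(settings, environ=os.environ):
    """Set each variable unless it is already defined.

    Call this before importing tensorflow so the runtime sees the values.
    """
    for name, value in settings.items():
        environ.setdefault(name, value)
    return environ


# Demonstrate on a plain dict so the real process environment is left untouched
env = apply_env(mkl_env(64), environ={})
print(env["OMP_NUM_THREADS"])
```

Calling apply_env(mkl_env(64)) at the top of a script, before import tensorflow, would apply the settings to the real environment. Values already set in the shell are left unchanged.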

How can I use TensorFlow efficiently if I have a CPU with 64 cores?
