Distributed TensorFlow: InternalError - Blas GEMM launch failed

Date: 2018-08-27 14:02:29

Tags: python tensorflow distributed-computing

I am experimenting with distributed TensorFlow, starting with two processes on localhost (Windows 10, Python 3.6.6, TensorFlow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer), trained on a subset of the UrbanSounds dataset (5268 samples per model, 193 features per sample).

Following this well-written post: https://learningtensorflow.com/lesson11/ I was able to reproduce their basic example of computing a mean from the results of two different processes. For my dataset, I modified the code as shown below, splitting the total samples into two halves and letting each of the two processes compute the cost function on its own half. However, after the RPC servers start successfully, both processes eventually fail with the following error:

  

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193

     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

It seems to me there is some basic mistake in the neural network configuration or in how the dataset is prepared for feed_dict, but I cannot see it, so I need another pair of eyes. Another observation during this experiment is that the GPU usage mostly maxes out right before the code aborts. Could you please help me find whatever is wrong in my code or in my strategy for distributing TensorFlow?
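In case it is relevant to the GPU observation: as far as I understand, in this setup the MatMul actually executes inside the tf.train.Server process, and that server creates its GPU devices when it starts, so GPU options passed only to the client tf.Session may not constrain it. Below is a minimal sketch of what I mean (it reuses cluster, task_number and graph from the code sample further down; I have not confirmed that this avoids the error):

import tensorflow as tf

# give the GPU memory options to the in-process server as well,
# not only to the client session
gpu_opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpu_opts)

server = tf.train.Server(cluster, job_name="local",
                         task_index=task_number, config=config)

with tf.Session(server.target, graph=graph, config=config) as ss:
    ss.run(tf.global_variables_initializer())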

Thanks.

### ERROR TRACE (duplicate rows removed ...) ###
train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
    return fn(*args)
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
### CODE SAMPLE ###
import sys
from datetime import datetime

import numpy as np
import tensorflow as tf

# dataset loading (train_data/train_labels, test_data/test_labels) and
# datetime_format are defined earlier in the script and omitted here

# selected UrbanSounds dataset
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1/np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0,1), as:
#>python this_script.py  0
#>python this_script.py  1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({"local": ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():    
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean = 0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean = 0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")
    
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)
    
    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0
    print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# attempted workaround for the GPU out-of-memory issue,
# but it does not help: the GPU still maxes out :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)
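# (hypothetical alternative, not used in the run that produced the trace above:
#  let the allocator grow on demand instead of reserving a fixed fraction;
#  whether this avoids the CUBLAS_STATUS_ALLOC_FAILED error here is untested)
# gpuops = tf.GPUOptions(allow_growth=True)
# config = tf.ConfigProto(gpu_options=gpuops)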

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # the session from the "with" statement already targets the RPC host with
    # this graph and GPU config, so no second tf.Session(rpc_server) is needed
    # (a second session would silently drop both the graph and the config)
    ss.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0 on the first half of the training data
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[:batch_size-1], Y:train_labels[:batch_size-1]})

        # run session for task#1 on the second half of the training data
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[batch_size:-1], Y:train_labels[batch_size:-1]})

        # record the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: cost {:.3f}".format(epoch, task_number, cost))

    print("Accuracy SGD ({}): {:.3f}".format(
        epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))

1 Answer:

Answer 0: (score: 0)

Simply moving the given code to Ubuntu 16.04.4 LTS solved the above problem for me.

I am not sure, but it seems to be related to gRPC + firewall on Windows 10.

If anyone runs into this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.

Cheers.