TensorFlow py_func raises an unknown error and crashes the IPython kernel

Asked: 2018-08-16 20:09:25

Tags: python tensorflow while-loop gpu

Here is the situation: I am forced to use py_func in order to call SciPy's multivariate normal CDF function inside a tf.while_loop. The mere fact of doing so seems to crash the IPython kernel outright, with no visible error:

2018-08-16 16:49:32.188716: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-16 16:49:32.452280: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 1.61GiB
2018-08-16 16:49:32.454223: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1471] Adding visible gpu devices: 0
2018-08-16 16:49:33.216908: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-16 16:49:33.217288: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:958] 0 
2018-08-16 16:49:33.217534: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: N
2018-08-16 16:49:33.217871: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1365 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)

The offending code is as follows:

def mvn_cdf_tf(x, mean, cov):
    with tf.name_scope("CDF"):
        n_obs = tf.cast(mean.get_shape().as_list()[0], tf.int32)
        q = tf.constant(0, dtype=tf.int32, name="q_iterator")
        _cdf = tf.Variable(tf.zeros([1]), name="_cdf")
        _cov, uni_cdf = _cov_for_cdf_tf(x, mean, cov)

        # inner loop: extend the CDF one dimension at a time
        def _loop_i(q, i, _final):
            aux = uni_cdf[q][i] + tf.matmul(
                tf.matmul(tf.matrix_inverse(_cov[q, 0:i, 0:i]),
                          [_cov[q][i][0:i]], transpose_b=True),
                tf.ones([i, 1]) - uni_cdf[q][0:i], transpose_a=True)
            return [q, tf.add(i, 1), tf.multiply(_final, tf.squeeze(aux))]

        # outer loop: one iteration per observation
        def _loop_q(q, _cdf):
            I = tf.cast(mean.get_shape().as_list()[1], tf.int32)
            _final = tf.py_func(_bivariate_cdf,
                                [x[q][0:2], mean[q][0:2], cov[q, 0:2, 0:2]],
                                tf.float32)
            i = tf.constant(2, dtype=tf.int32, name="i_iterator")
            return [tf.add(q, 1),
                    tf.concat((_cdf,
                               [tf.while_loop(lambda q, i, _final: tf.less(i, I),
                                              _loop_i, [q, i, _final],
                                              parallel_iterations=2000)[2]]),
                              axis=0)]

        return tf.while_loop(lambda q, _cdf: tf.less(q, n_obs), _loop_q, [q, _cdf],
                             shape_invariants=[q.get_shape(), tf.TensorShape([None])],
                             parallel_iterations=2000)[1][1:]

Through trial and error I narrowed it down to the line `_final = tf.py_func(_bivariate_cdf, [x[q][0:2], mean[q][0:2], cov[q, 0:2, 0:2]], tf.float32)`: if I set `_final` to a constant (e.g. 0.02) instead of the py_func call, everything runs fine.

I am also confident that the _bivariate_cdf function itself is defined correctly, since it is called in exactly the same way in another part of the graph without any problem.

I suspected the GPU might be running out of RAM, but I would expect that to show up in an error message.

I am using tensorflow-gpu 1.9 with CUDA 9.0, installed via pip, on a GTX 1050 with 2 GB of VRAM.

If you can spot the error, I would be very grateful.

0 Answers:

There are no answers yet.