I'm using the Python version of TensorFlow, which wraps C++ code.
I'm looking for a way, for a given C++ function in TensorFlow, to trace the full chain of calling functions from the initial Python function down to the C++ function where the call lands (Swig being transparent along the way).
I've read about mixed-language debugging, but the problem is that I don't know the call path to the C++ function where I would place a breakpoint. If I knew it, the problem would already be solved and I wouldn't be asking.
Is there a way to do this?
Answer 0: (score: 1)
Here is my general strategy for debugging/inspecting TensorFlow.
First, make sure it is compiled with debugging enabled. This is one of the scripts I use:
#!/bin/bash
export CC_OPT_FLAGS="-march=native"
export GCC_HOST_COMPILER_PATH=/usr/bin/gcc
export TF_NEED_GCP=0
export TF_NEED_GDR=0
export TF_NEED_S3=0
export TF_NEED_HDFS=0
export TF_NEED_MKL=0
export TF_NEED_MPI=0
export TF_NEED_OPENCL=0
export TF_NEED_CUDA=0
export TF_ENABLE_XLA=1
export TF_NEED_JEMALLOC=1
export TF_NEED_VERBS=0
TF_NEED_CUDA=0 bazel clean --expunge_async
PYTHON_LIB_PATH=${PYTHON_LIB_PATH} \
PYTHON_BIN_PATH=${PYTHON_BIN_PATH} ./configure
bazel build -c dbg --copt=-msse4.2 //tensorflow/tools/pip_package:build_pip_package && \
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/pkg
pip install /tmp/pkg/*.whl
Now write a small Python program to run. Make sure device placement logging is enabled - it provides some very useful information.
import numpy as np
import tensorflow as tf
from tensorflow.core.protobuf import config_pb2
config = config_pb2.ConfigProto()
config.log_device_placement = True
ar = []
ar.append(np.array(
    [1.0, 1.0, 1.0, 1.0, 1.0],
    dtype='f'))
ar.append(np.array(
    [1.0, 1.0, 1.0, 1.0, 1.0],
    dtype='f'))
ar.append(np.array(
    [1.0, 1.0, 1.0, 1.0, 1.0],
    dtype='f'))
sess = tf.Session(graph=None, config=config)
with sess.graph.as_default(), sess.as_default():
    with sess.graph.device('/device:CPU:0'):
        c = tf.add_n(ar)
        print(sess.run(c))
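As an aside, if all you need is the Python half of the call chain, the standard traceback module can capture it without any debugger. A minimal, TensorFlow-independent sketch (the function names here are made up for illustration):

```python
import traceback

def op_wrapper():
    # Capture the Python call stack at the point where execution
    # would cross into native code (e.g. just before TF_Run).
    stack = traceback.extract_stack()
    return [frame.name for frame in stack]

def model_fn():
    return op_wrapper()

callers = model_fn()
print(callers[-2:])  # prints ['model_fn', 'op_wrapper']
```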
Build and install TensorFlow, then make sure the log output is turned up when you run the Python script:
export TF_CPP_MIN_LOG_LEVEL=0
python my_tf_program.py
This should give you some useful output. Something like:
2017-10-20 16:47:47.194105: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 FMA
2017-10-20 16:47:49.354363: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
2017-10-20 16:47:49.358699: I tensorflow/core/common_runtime/placer.cc:874] AddN: (AddN)/job:localhost/replica:0/task:0/device:CPU:0
2017-10-20 16:47:49.358766: I tensorflow/core/common_runtime/placer.cc:874] Const_2: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2017-10-20 16:47:49.358806: I tensorflow/core/common_runtime/placer.cc:874] Const_1: (Const)/job:localhost/replica:0/task:0/device:CPU:0
2017-10-20 16:47:49.358842: I tensorflow/core/common_runtime/placer.cc:874] Const: (Const)/job:localhost/replica:0/task:0/device:CPU:0
[ 3. 3. 3. 3. 3.]
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0
Const_2: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
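The placement lines share a fixed "name: (type): device" shape, so when a graph produces a lot of them you can filter the output with a few lines of plain Python (a convenience sketch, not part of TensorFlow; it handles both the logged and the printed variants of the format):

```python
def parse_placement(line):
    """Split a placement line such as
    'AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0'
    (or the logged 'AddN: (AddN)/job:...' variant) into
    (op_name, op_type, device)."""
    name, rest = line.split(": (", 1)
    op_type, device = rest.split(")", 1)
    return name.strip(), op_type, device.lstrip(": ").strip()

line = "AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0"
print(parse_placement(line))
# prints ('AddN', 'AddN', '/job:localhost/replica:0/task:0/device:CPU:0')
```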
Now you have a hint! It looks like AddN is being called. Let's find it in the code:
[smckenney@xxx tensorflow (develop)]$ grep -nr "\"AddN\"" tensorflow/core/kernels/* | grep REGISTER
tensorflow/core/kernels/aggregate_ops.cc:237:REGISTER_KERNEL_BUILDER(Name("AddN")
tensorflow/core/kernels/aggregate_ops.cc:246:REGISTER_KERNEL_BUILDER(Name("AddN")
tensorflow/core/kernels/aggregate_ops.cc:262:REGISTER_KERNEL_BUILDER(Name("AddN")
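The grep above can also be scripted if you do this often. A sketch, assuming src_root points at a local TensorFlow source checkout:

```python
import os
import re

def find_kernel_registrations(src_root, op_name):
    """Return (path, line_number) pairs where REGISTER_KERNEL_BUILDER
    registers a kernel for op_name under tensorflow/core/kernels."""
    pattern = re.compile(r'REGISTER_KERNEL_BUILDER\(Name\("%s"\)' % re.escape(op_name))
    kernels_dir = os.path.join(src_root, "tensorflow", "core", "kernels")
    hits = []
    for fname in sorted(os.listdir(kernels_dir)):
        if not fname.endswith((".cc", ".h")):
            continue
        path = os.path.join(kernels_dir, fname)
        with open(path, errors="replace") as f:
            for lineno, text in enumerate(f, 1):
                if pattern.search(text):
                    hits.append((path, lineno))
    return hits
```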
Looking in aggregate_ops.cc, there is an AddNOp class with a Compute() method. That is where the breakpoint goes:
41 template <typename Device, typename T>
42 class AddNOp : public OpKernel {
43 public:
44 explicit AddNOp(OpKernelConstruction* context) : OpKernel(context) {}
45
46 void Compute(OpKernelContext* ctx) override {
47 if (!ctx->ValidateInputsAreSameShape(this)) return;
48
49 const Tensor& input0 = ctx->input(0);
Now you have all the information you need. Run the Python script again under a debugger, with a breakpoint at aggregate_ops.cc line 47:
(tf-runner) spmckenney@host:/scratch/bitbucket$ lldb -- python my_tf_program.py
(lldb) target create "python"
Current executable set to 'python' (x86_64).
(lldb) settings set -- target.run-args "/scratch/bitbucket/tmp/so_answer.py"
(lldb) b aggregate_ops.cc:47
Breakpoint 1: no locations (pending).
WARNING: Unable to resolve breakpoint to any actual locations.
(lldb) run
You may need to continue a few times while lldb loads symbols. After that, the breakpoint is hit and you get your backtrace:
(lldb) bt
* thread #1: tid = 38544, 0x00007fff72f4875d _pywrap_tensorflow_internal.so`tensorflow::AddNOp<Eigen::ThreadPoolDevice, float>::Compute(this=0x00000000013a0150, ctx=0x00007fffffff9750) + 49 at aggregate_ops.cc:47, name = 'python', stop reason = breakpoint 1.6
* frame #0: 0x00007fff72f4875d _pywrap_tensorflow_internal.so`tensorflow::AddNOp<Eigen::ThreadPoolDevice, float>::Compute(this=0x00000000013a0150, ctx=0x00007fffffff9750) + 49 at aggregate_ops.cc:47
frame #1: 0x00007fff734844d1 _pywrap_tensorflow_internal.so`tensorflow::grappler::ConstantFolding::EvaluateNode(this=0x0000000001419320, node=0x000000000144dad0, inputs=0x00007fffffff9d10, output=0x00007fffffff9d60) const + 731 at constant_folding.cc:407
frame #2: 0x00007fff73484b9b _pywrap_tensorflow_internal.so`tensorflow::grappler::ConstantFolding::EvaluateOneFoldable(this=0x0000000001419320, node=0x000000000144dad0, outputs=0x00007fffffff9f90) + 1079 at constant_folding.cc:448
--- snip ---
frame #17: 0x00007fff6f7e1b6b _pywrap_tensorflow_internal.so`tensorflow::TF_Run_wrapper(session=0x000000000144cae0, run_options=0x0000000000000000, feed_dict=0x00007fffdf7afd88, output_names=0x00007fffffffd0b0, target_nodes=0x00007fffffffd100, out_status=0x0000000001840aa0, out_values=0x00007fffffffd150, run_outputs=0x0000000000000000) + 97 at tf_session_helper.cc:149
frame #18: 0x00007fff6f76d84c _pywrap_tensorflow_internal.so`::_wrap_TF_Run((null)=0x00007fff7e1ed138, args=0x00007fffd5823458) + 2835 at pywrap_tensorflow_internal.cc:15057
frame #19: 0x00000000004866fb python`PyEval_EvalFrameEx + 1099
frame #20: 0x000000000048f2df python`___lldb_unnamed_symbol1826$$python + 383
frame #21: 0x00000000004f14fa python`PyObject_Call + 58
frame #22: 0x0000000000488252 python`PyEval_EvalFrameEx + 8098
frame #23: 0x000000000048e45b python`PyEval_EvalCodeEx + 347
frame #24: 0x000000000048a673 python`PyEval_EvalFrameEx + 17347
frame #25: 0x000000000048e45b python`PyEval_EvalCodeEx + 347
frame #26: 0x000000000048a673 python`PyEval_EvalFrameEx + 17347
frame #27: 0x000000000048a19d python`PyEval_EvalFrameEx + 16109
frame #28: 0x000000000048e45b python`PyEval_EvalCodeEx + 347
frame #29: 0x000000000048a673 python`PyEval_EvalFrameEx + 17347
frame #30: 0x000000000048e45b python`PyEval_EvalCodeEx + 347
frame #31: 0x000000000048f15b python`PyEval_EvalCode + 59
frame #32: 0x0000000000559730 python`___lldb_unnamed_symbol2877$$python + 48
frame #33: 0x00000000004793c5 python`PyRun_FileExFlags + 167
frame #34: 0x00000000004797a2 python`PyRun_SimpleFileExFlags + 872
frame #35: 0x00000000005bfaa0 python`Py_Main + 1280
frame #36: 0x000000000047d9f4 python`main + 308
frame #37: 0x00007ffff7814f45 libc.so.6`__libc_start_main(main=(python`main), argc=2, argv=0x00007fffffffe138, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe128) + 245 at libc-start.c:287
frame #38: 0x000000000056d585 python`_start + 41
I wasn't interested in the Python side of the backtrace for what I was working on, but I don't think it would be hard to get the full Python trace as well.
Hope that helps.
Answer 1: (score: 0)
Your C++ function is most likely called through TF_Run on the Python side. To get the Python backtrace, use pdb: modify session.py on your system and add import pdb; pdb.set_trace() right before the call to TF_Run. To find session.py, go to where TF is installed (python -c "import tensorflow as tf; import inspect; print(inspect.getsourcefile(tf).rsplit('/',1)[0])") and search for the session.py file.
This drops you into a pdb prompt, where you can run bt to get the backtrace.
To get the C++ side of the backtrace, use gdb:
gdb python
run myscript.py
Set a breakpoint inside the C++ file with the gdb break command, then use bt to get the backtrace. You will get better results with a debug build of TensorFlow - https://github.com/mind/wheels/releases/tag/tf1.3.1-cpu-debug
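Instead of editing session.py in place, you can also experiment with wrapping the boundary function from your own script. A generic sketch (the sess.run patch in the final comment is a hypothetical usage, not tested against any specific TF version):

```python
import functools
import traceback

def trace_entry(fn):
    """Print the Python call stack each time fn is entered - handy for
    wrapping the function that crosses into C++ (e.g. Session.run)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print("--- Python stack entering %s ---" % fn.__name__)
        traceback.print_stack()
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical usage: sess.run = trace_entry(sess.run)
```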