AWS tensorboard Segmentation fault (core dumped)

Time: 2019-06-26 09:37:08

Tags: python amazon-web-services amazon-ec2 pytorch tensorboardx

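I am trying to use tensorboardX to debug a PyTorch NN running on an AWS p2.xlarge instance.

I followed this tutorial to open port 6006.

The model is running, and tensorboardX is creating its writer files. I get the following warning there. I am not sure how important it is.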

  

WARNING:root:tuple appears in op that does not forward tuples (VisitNode at /pytorch/torch/csrc/jit/passes/lower_tuples.cpp:117)
frame #0: std::function::operator()() const + 0x11 (0x7fbe3dd04441 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fbe3dd03d7a in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: + 0xaf61f5 (0x7fbe3cdc41f5 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #3: + 0xaf6464 (0x7fbe3cdc4464 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: torch::jit::LowerAllTuples(std::shared_ptr&) + 0x13 (0x7fbe3cdc44a3 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #5: + 0x3f84b4 (0x7fbe7d2cb4b4 in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x130cfc (0x7fbe7d003cfc in /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #40: __libc_start_main + 0xf0 (0x7fbe8d69c830 in /lib/x86_64-linux-gnu/libc.so.6)

     

The problem is that I cannot access the TensorBoard browser UI. I take the following steps:

$ cd PATH_TO_FOLDER_CONTAINING_runs
$ source activate pytorch_p36
$ tensorboard --logdir=runs

At that point I get the error message:

  

Segmentation fault (core dumped)

When I check the system log /var/log/syslog, I see the following:

  

Jun 26 09:06:40 ip-172-xx-xx-xxx kernel: [515315.598917] tensorboard[1446]: segfault at 0 ip (null) sp 00007ffd64c5f178 error 14 in python2.7[55d8673d1000+1000]
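
For anyone reading the kernel line above, a minimal sketch of pulling out its fields may help. The regular expression and the `parse_segfault` helper are illustrative assumptions, not part of any tool, and the sample string is a cleaned-up transcription of the log entry. The point is only to make visible which process crashed and inside which binary.

```python
import re

# Assumed pattern for the "segfault at ..." line format seen in the log above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]:\s*segfault at (?P<addr>[0-9a-f]+) "
    r"ip\s+(?P<ip>\S+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<binary>[^\[\s]+)\[)?"
)


def parse_segfault(line):
    """Return a dict of fields from a kernel segfault line, or None if no match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])  # all other fields stay as raw strings
    return fields


# Cleaned-up transcription of the syslog entry from the question.
line = (
    "Jun 26 09:06:40 ip-172-xx-xx-xxx kernel: [515315.598917] "
    "tensorboard[1446]: segfault at 0 ip (null) sp 00007ffd64c5f178 "
    "error 14 in python2.7[55d8673d1000+1000]"
)
info = parse_segfault(line)
print(info["comm"], info["pid"], info["binary"])
```

The `binary` field comes out as `python2.7`, which suggests the `tensorboard` command was being executed by a Python 2.7 interpreter rather than by the Python 3.6 of `pytorch_p36`.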

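My Googling skills have come up short. How can I access TensorBoard through the browser when running on an AWS instance?

Please let me know if anything is unclear or if any information is missing.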

1 Answer:

Answer 0: (score: 0)

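Even though the code has to run in the pytorch_p36 environment, TensorBoard actually has to be run in a different environment.

The sequence of commands in the terminal should be: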
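
One way to see which environment a given `tensorboard` command belongs to is to look at the script it resolves to on `PATH` and at the interpreter named on its first line. The sketch below assumes the `tensorboard` entry point is a text script with a shebang line, as pip-installed console scripts usually are. The helper names `interpreter_of` and `describe_command` are made up for illustration.

```python
import shutil


def interpreter_of(script_path):
    """Return the interpreter named on the script's shebang line, or None."""
    with open(script_path, "rb") as f:
        first = f.readline().decode("utf-8", errors="replace").strip()
    if not first.startswith("#!"):
        return None  # no shebang, for example a compiled binary
    return first[2:].strip()


def describe_command(name="tensorboard"):
    """Return (path, interpreter) for the first `name` found on PATH."""
    path = shutil.which(name)
    if path is None:
        return None, None
    return path, interpreter_of(path)


if __name__ == "__main__":
    path, interp = describe_command("tensorboard")
    print("tensorboard resolves to:", path)
    print("shebang interpreter:", interp)
```

Running this after each `source activate ...` shows whether the command changes with the active environment.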

$ cd PATH_TO_FOLDER_CONTAINING_runs
$ source activate tensorflow_p27
$ tensorboard --logdir=runs

Then open the indicated port.
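
Before trying the browser, it can be useful to confirm that something is actually listening on the port. The following is a minimal sketch using only the standard library. The host `localhost` and the port `6006` are assumptions taken from the question, and `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: run on the instance itself, with the port from the question.
    print(port_is_open("localhost", 6006))
```

If this prints `True` on the instance but the page still does not load from your own machine, the remaining suspect is the network path to the instance (for example the port rules set up in the tutorial step) rather than TensorBoard itself.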