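I am using an SGE cluster, with the IPython controller (ipcontroller) running on the head node and about 50 engines running on the other nodes (submitted with qsub). The engines connect to and register with the controller without any problems. I can also SSH into the head node, see the engine IDs, and run simple code. For example, this works fine: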
%px %pylab inline
parallel_result = lbView.map_sync(lambda x: x*rand(), range(32))
However, the engines crash when I try to run the following line:
%px from sklearn.svm import LinearSVC
with the following error:
importing LinearSVC from sklearn.svm on engine(s)
[Engine Exception]
Traceback (most recent call last):
  File "/usr/global/anaconda/lib/python2.7/site-packages/ipyparallel/client/client.py", line 713, in _handle_stranded_msgs
    raise error.EngineError("Engine %r died while running task %r"%(eid, msg_id))
EngineError: Engine 0 died while running task '48c99848-0784-4ea1-a8c9-900685e955a3
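The exact same command works fine when I run it in an IPython instance on the cluster head node, and it also works with ipyparallel on another server (without SGE) where 12 engines run locally.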
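Since the engine process dies before it can report anything, one way to narrow this down is to run the import in a child process on each engine, so that the engine itself survives and can return the child's exit status. The following is only a minimal sketch under that assumption; the helper name `try_import_in_subprocess` and the returned dictionary keys are made up for illustration and are not part of ipyparallel.

```python
def try_import_in_subprocess(statement):
    """Run `statement` in a fresh Python child process and report how it ended."""
    # Imports live inside the function so that it still works when the
    # function is shipped on its own to a remote engine.
    import socket
    import subprocess
    import sys

    proc = subprocess.Popen(
        [sys.executable, "-c", statement],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _, err = proc.communicate()
    return {
        # Which node the child ran on.
        "host": socket.gethostname(),
        # 0 means success; on POSIX a negative value -N means the child
        # was killed by signal N (for example -11 for a segfault).
        "returncode": proc.returncode,
        # Last part of stderr, decoded defensively.
        "stderr_tail": err.decode("utf-8", "replace")[-2000:],
    }
```

With a connected `ipyparallel.Client` called `rc` (an assumed name), something like `rc[:].apply_sync(try_import_in_subprocess, "from sklearn.svm import LinearSVC")` should return one dictionary per engine, which would show whether the import fails on every compute node or only on some of them, and whether it fails with a Python exception or a signal.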
I have set the logging level to debug, and this is what the engine and the controller output:
Snippet IPENGINE OUTPUT:
2016-05-28 18:18:48.403 [IPEngineApp] apply_request: {'parent_header': {}, 'msg_type': u'apply_request', 'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', 'content': {}, 'header': {u'username': u'ABC', u'version': u'5.0', u'msg_type': u'apply_request', u'msg_id': u'4ca3bef9-5cbf-4b56-a232-b3f289dcf6a6', u'session': u'83df95f4-e961-4e8f-aa3c-2540719e08f4', u'date': datetime.datetime(2016, 5, 28, 18, 18, 48, 392750)}, 'buffers': [<memory at 0x2aaab7c348a0>, <memory at 0x2aaab7c34d60>, <memory at 0x2aaab7c34df8>, <memory at 0x2aaab7c34e90>, <memory at 0x2aaab7ba3218>], 'metadata': {}}
Snippet IPCONTROLLER OUTPUT:
2016-05-28 18:19:26.043 [IPControllerApp] registration::unregister_engine(8)
2016-05-28 18:19:26.043 [IPControllerApp] save engine state to /data1/home/kamesh/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-28 18:19:26.045 [IPControllerApp] heartbeat::handle_heart_failure('37d9bc53-66f8-4d14-9501-02c56a0ff1f0')
2016-05-28 18:19:26.045 [IPControllerApp] registration::unregister_engine(2)
2016-05-28 18:19:26.046 [IPControllerApp] save engine state to /data1/home/ABC/.ipython/profile_KK_Fiji_SGE/log/engines.json
2016-05-29 07:31:35.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 1
2016-05-29 07:31:38.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 2
2016-05-29 07:31:41.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 3
2016-05-29 07:31:44.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 4
2016-05-29 07:31:47.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 5
2016-05-29 07:31:50.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 6
2016-05-29 07:31:53.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 7
2016-05-29 07:31:56.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 8
2016-05-29 07:31:59.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 9
2016-05-29 07:32:02.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 10
2016-05-29 07:32:05.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 11
2016-05-29 07:32:05.031 [IPControllerApp] heartbeat::handle_heart_failure('ec0b5d83-b354-43c6-b7ec-909f6fd403fc')
2016-05-29 07:32:05.031 [IPControllerApp] registration::unregister_engine(4)
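To match the heart UUIDs in the controller log to engine IDs, the log can be summarized with a short script. This is only a sketch: it assumes the line format shown above and assumes that each `handle_heart_failure` line is followed by the `unregister_engine` line for the same engine, which is how the lines appear in this snippet. The function name `summarize_heart_failures` is hypothetical.

```python
import re

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
HEART_FAILURE = re.compile(r"heartbeat::handle_heart_failure\('([0-9a-f-]+)'\)")
UNREGISTER = re.compile(r"registration::unregister_engine\((\d+)\)")


def summarize_heart_failures(lines):
    """Pair each heart failure with the next unregister_engine line."""
    failures = []
    pending = None  # (timestamp, heart uuid) waiting for its unregister line
    for line in lines:
        failed = HEART_FAILURE.search(line)
        if failed:
            stamp = TIMESTAMP.match(line)
            pending = (stamp.group(1) if stamp else None, failed.group(1))
            continue
        unregistered = UNREGISTER.search(line)
        if unregistered and pending is not None:
            failures.append({
                "time": pending[0],
                "heart": pending[1],
                "engine_id": int(unregistered.group(1)),
            })
            pending = None
    return failures


# Lines copied from the controller output above.
sample = [
    "2016-05-28 18:19:26.043 [IPControllerApp] registration::unregister_engine(8)",
    "2016-05-28 18:19:26.045 [IPControllerApp] heartbeat::handle_heart_failure('37d9bc53-66f8-4d14-9501-02c56a0ff1f0')",
    "2016-05-28 18:19:26.045 [IPControllerApp] registration::unregister_engine(2)",
    "2016-05-29 07:32:02.030 [IPControllerApp] heartbeat::missed ec0b5d83-b354-43c6-b7ec-909f6fd403fc : 11",
    "2016-05-29 07:32:05.031 [IPControllerApp] heartbeat::handle_heart_failure('ec0b5d83-b354-43c6-b7ec-909f6fd403fc')",
    "2016-05-29 07:32:05.031 [IPControllerApp] registration::unregister_engine(4)",
]
summary = summarize_heart_failures(sample)
```

An `unregister_engine` line with no preceding heart failure in the input (such as the one for engine 8 here) is skipped rather than guessed at.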