UnavailableError: OS Error when training on GC ML Engine

Time: 2018-03-21 10:07:16

Tags: tensorflow machine-learning object-detection google-cloud-ml

I have been trying to train my model on the GC ML Engine platform, but I keep getting this non-descriptive error:

Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 360, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 746, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error

This is my ML Engine configuration file:

    trainingInput:
      runtimeVersion: "1.6"
      scaleTier: CUSTOM
      masterType: standard_gpu
      workerCount: 5
      workerType: standard_gpu
      parameterServerCount: 3
      parameterServerType: standard_gpu

I have been using the ssd_mobilenet_v1 model. I modified the default pipeline config file from the object detection GitHub repo by reducing batch_size to 4, and by setting queue_capacity to 400 and min_after_dequeue to 200 in train_input_reader, on a hunch that this might be a memory problem, but no luck.
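For reference, the pipeline changes described above would look roughly like the fragment below. This is only a sketch: the three values are the ones given in the question, the placement of batch_size under train_config is an assumption based on the sample configs in the object detection repo, and every other field (model, optimizer, input paths, label map) is omitted rather than guessed.

```
# Sketch only: the three values come from the question; all other fields are omitted.
train_config: {
  batch_size: 4            # reduced from the sample config's default
}

train_input_reader: {
  queue_capacity: 400      # size of the input queue
  min_after_dequeue: 200   # must stay below queue_capacity
}
```

Smaller values here mainly reduce how many examples are buffered in host memory, so they only help if the failure is really memory-related.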

I get exactly the same stack trace on the master replica and on all of the workers. Has anyone run into a similar problem?

0 answers:

No answers