I have been trying to train my model on the GC ML Engine platform, but I keep getting this non-descriptive error:
```
Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 360, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 746, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 1000, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 828, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 989, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: OS Error
```
Here is my ML Engine config file:
```
trainingInput:
  runtimeVersion: "1.6"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard_gpu
```
I have been using the `ssd_mobilenet_v1` model, and I modified the default pipeline config file from the object detection GitHub repo by reducing `batch_size` to 4, and by reducing `queue_capacity` to 400 and `min_after_dequeue` to 200 in `train_input_reader`.
I did this on a hunch that it might be a memory issue, but no luck.
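
In pipeline-config terms, the edits described above amount to roughly the following fragment. This is only a sketch: it assumes the field layout of the sample ssd_mobilenet_v1 config (batch_size under train_config, the two queue settings under train_input_reader), and every other field is elided rather than reproduced.

```
train_config: {
  batch_size: 4
  # ... remaining train_config fields unchanged from the sample config
}

train_input_reader: {
  queue_capacity: 400
  min_after_dequeue: 200
  # ... remaining train_input_reader fields unchanged from the sample config
}
```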
I get the exact same stack trace on the master replica and on all of the workers. Has anyone run into a similar problem?