Training Mask R-CNN on a Google TPU: "Socket closed" error

Time: 2019-05-23 11:02:55

Tags: tensorflow tpu

I followed the Google tutorial to train Mask R-CNN on the COCO data without any problems (https://cloud.google.com/tpu/docs/tutorials/mask-rcnn).

I then went through the tutorial steps again, this time with my own data.

My dataset has about 3000 samples.

This is how I start the training script:

python ~/tpu/models/official/mask_rcnn/mask_rcnn_main.py \
  --use_tpu=True \
  --tpu="tputputpu" \
  --model_dir="gs://my/path/mask-rcnn-model" \
  --config="damconfig.yaml" \
  --mode="train"

This is my config (damconfig.yaml):

num_classes: 9
backbone: 'resnet50'
use_bfloat16: True
train_batch_size: 16
eval_batch_size: 8
training_file_pattern: gs://my/path/TFRecords/train-*
validation_file_pattern: gs://my/path/TFRecords/val-*
val_json_file: gs://my/path/val_annotations.json
total_steps: 3000
num_steps_per_eval: 150
eval_samples: 311
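
For context on the schedule these values imply, here is a small arithmetic sketch. It assumes all of the roughly 3000 samples are used for training (the question does not say how they are split), so the epoch figure is only approximate.

```python
# Values copied from the config above; the dataset size is the approximate
# figure from the question and is assumed to be entirely training data.
total_steps = 3000
train_batch_size = 16
num_train_samples = 3000  # "about 3000 samples" (assumption: all used for training)

# Each step consumes one global batch.
examples_seen = total_steps * train_batch_size
approx_epochs = examples_seen / num_train_samples

print("examples seen:", examples_seen)
print("approximate epochs:", approx_epochs)
```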

I get the following error when I start training:

INFO:tensorflow:Enqueue next (2500) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (2500) batch(es) of data from outfeed.
INFO:tensorflow:Error recorded from infeed: assertion failed: [103]
[[{{node parser/Assert_2/Assert}}]]
[[node input_pipeline_task0/while/IteratorGetNext_4 (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
INFO:tensorflow:An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
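
The first error in the log is the failed assertion in the input parser (`parser/Assert_2/Assert`, printing `[103]`). The "Socket closed" message only appears afterwards, when the session is torn down. One possible reading, which is an assumption and not confirmed by the log, is that 103 is the number of object instances in a single image and that it exceeds a fixed per-image cap in the data loader. The sketch below is a way to test that guess on a COCO-style annotation file. The function names and the default limit of 100 are hypothetical and not taken from the Mask R-CNN code, and the file must first be copied from GCS to a local path because `open()` does not read `gs://` URLs.

```python
import json
from collections import Counter


def images_over_instance_limit(coco_dict, max_instances=100):
    """Return {image_id: count} for images with more than max_instances annotations.

    coco_dict is a parsed COCO-style annotation file, i.e. a dict with an
    "annotations" list whose entries each carry an "image_id" key.
    max_instances=100 is a hypothetical limit chosen for illustration.
    """
    counts = Counter(ann["image_id"] for ann in coco_dict.get("annotations", []))
    return {img_id: n for img_id, n in counts.items() if n > max_instances}


def check_annotation_file(local_path, max_instances=100):
    """Load a local COCO-style JSON file and report images over the limit."""
    with open(local_path) as f:
        return images_over_instance_limit(json.load(f), max_instances)


# Tiny synthetic example: image 1 has 103 instances, image 2 has 5.
demo = {
    "annotations": [{"image_id": 1} for _ in range(103)]
    + [{"image_id": 2} for _ in range(5)]
}
offenders = images_over_instance_limit(demo)
print(offenders)
```

If an image with exactly 103 annotations turns up in the training annotations, that would support the guess. If none does, the assertion is checking something else and the parser code around `Assert_2` would need to be read directly.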

0 answers:

No answers yet