How to write a ClusterSpec for distributed YouTube-8M Challenge training?

Posted: 2017-04-13 02:31:33

Tags: youtube tensorflow

Can someone post a ClusterSpec for distributed training of the models defined in the YouTube-8m Challenge code? The code tries to load a cluster spec from the TF_CONFIG environment variable, but I am not sure what the value of TF_CONFIG should be. I have access to 2 GPUs on a single machine and just want to run the model with data-level parallelism.

1 answer:

Answer 0 (score: 0)

If you want to run the YouTube-8M challenge code in a distributed manner, you have to write a yaml file (there is an example yaml file provided by Google) and then pass the location of that yaml file as a parameter. TF_CONFIG refers to the configuration variables used to train the model.
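
For the scenario in the question (a single machine with 2 GPUs, no Cloud ML), you can also set TF_CONFIG yourself and launch one process per role. The following is only a minimal sketch, assuming the starter code's convention that TF_CONFIG is a JSON object with "cluster" and "task" keys (the Cloud ML Engine convention); the ports, the local data path, and the use of CUDA_VISIBLE_DEVICES to pin each trainer to one GPU are illustrative choices of mine, not part of the official instructions:

# Shared cluster definition: run this line first in every terminal
# (any free ports work).
CLUSTER='"cluster": {"master": ["localhost:2222"], "worker": ["localhost:2223"], "ps": ["localhost:2224"]}'

# Terminal 1: parameter server, kept off the GPUs
CUDA_VISIBLE_DEVICES="" TF_CONFIG="{$CLUSTER, \"task\": {\"type\": \"ps\", \"index\": 0}}" \
python train.py --train_data_pattern='/path/to/train*.tfrecord' --train_dir=/tmp/yt8m_train

# Terminal 2: chief trainer ("master"), pinned to GPU 0
CUDA_VISIBLE_DEVICES=0 TF_CONFIG="{$CLUSTER, \"task\": {\"type\": \"master\", \"index\": 0}}" \
python train.py --train_data_pattern='/path/to/train*.tfrecord' --train_dir=/tmp/yt8m_train

# Terminal 3: second trainer ("worker"), pinned to GPU 1
CUDA_VISIBLE_DEVICES=1 TF_CONFIG="{$CLUSTER, \"task\": {\"type\": \"worker\", \"index\": 0}}" \
python train.py --train_data_pattern='/path/to/train*.tfrecord' --train_dir=/tmp/yt8m_train

Each process builds a tf.train.ClusterSpec from the "cluster" entry and a tf.train.Server for its own "task", so the two GPU processes train in parallel against the shared parameter server (asynchronous data parallelism).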

For example, to run the starter code on Google Cloud in a distributed manner, I have used:

JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$BUCKET_NAME --region=us-east1 \
--config=youtube-8m/cloudml-gpu-distributed.yaml \
-- --train_data_pattern='gs://youtube8m-ml-us-east1/1/frame_level/train/train*.tfrecord' \
--frame_features=True --model=LstmModel --feature_names="rgb,audio" \
--feature_sizes="1024, 128" --batch_size=128 \
--train_dir=$BUCKET_NAME/${JOB_TO_EVAL}  

The --config parameter points to the yaml file cloudml-gpu-distributed.yaml, which contains the following specification:

trainingInput:
  runtimeVersion: "1.0" 
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 2
  workerType: standard_gpu 
  parameterServerCount: 2 
  parameterServerType: standard
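
With this yaml, Cloud ML Engine starts one master, two GPU workers, and two parameter servers, and injects a TF_CONFIG environment variable into each replica; that is the value the starter code reads, so no hand-written ClusterSpec is needed on the cloud. As a rough illustration (the host names below are placeholders; the real ones are generated by the service), the first worker would see something like:

TF_CONFIG='{
  "cluster": {
    "master": ["<master-host>:2222"],
    "worker": ["<worker-host-0>:2222", "<worker-host-1>:2222"],
    "ps": ["<ps-host-0>:2222", "<ps-host-1>:2222"]
  },
  "task": {"type": "worker", "index": 0},
  "environment": "cloud"
}'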