After training a model with my custom algorithm by running
estimator.fit({'training':'s3://abc/xxx/train.csv','validation':'s3://abc/xxx/val.csv'})
I got the following output:
2020-05-24 21:08:52 Starting - Starting the training job...
2020-05-24 21:08:54 Starting - Launching requested ML instances...
2020-05-24 21:10:07 Starting - Preparing the instances for training............
2020-05-24 21:12:17 Downloading - Downloading input data
2020-05-24 21:12:17 Training - Downloading the training image.........
2020-05-24 21:14:16 Uploading - Uploading generated training model...
2020-05-24 21:14:22 Completed - Training job completed
..Training seconds: 143
Billable seconds: 43
Managed Spot Training savings: 69.9%
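The savings figure in the last log line is consistent with the two timings printed just above it: only 43 of the 143 training seconds were billed. A minimal sketch of that arithmetic follows; the function name `spot_savings_percent` is made up for illustration and is not part of the SageMaker SDK.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Return the share of training time that was not billed, as a percentage."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Values taken from the log above: 143 training seconds, 43 billable seconds.
print(round(spot_savings_percent(143, 43), 1))  # 69.9
```

So the job really did occupy the instance for only about two minutes, most of which was spent before the container started.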
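This is very suspicious, because I am training a BERT model for 40 epochs, and that cannot finish in so little time. Also, I cannot see any logs in CloudWatch.
What exactly is going on here? Any help is much appreciated!
Even when I run estimator.fit() with no inputs at all, i.e. without any training or validation data, it still reports that training completed. Is my container not being invoked at all?
My Dockerfile: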
FROM python:3.6
RUN pip install numpy
RUN pip install pandas
RUN pip install torch
RUN pip install transformers
RUN pip install sklearn
#RUN pip install boto3
# my personal package
COPY crux /usr/local/lib/python3.6/site-packages/crux
COPY prepare_data_and_train.sh /prepare_data_and_train.sh
COPY train.py /opt/ml/code/train.py
#WORKDIR /opt/ml/code
#RUN cd $WORKDIR
ENTRYPOINT ["/bin/bash", "/prepare_data_and_train.sh"]
prepare_data_and_train.sh:
pip freeze
ls /opt/ml/input/data/training/
echo "train file"
ls /opt/ml/input/data/validation/
echo "validation file"
ls /opt/ml/input/data/emb/
echo "embeddings"
python /opt/ml/code/train.py
echo "file running"
I don't even get the output of the statements in the bash file.
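As a debugging aid, the same channel check could be done from Python at the very top of train.py, flushing each line so it is not lost in a buffer if the process exits early. This is only a sketch: it assumes the usual SageMaker layout in which each input channel is mounted as a directory under /opt/ml/input/data, and `list_channels` is a hypothetical helper name.

```python
import os


def list_channels(data_root="/opt/ml/input/data"):
    """Return {channel_name: sorted file names} for each channel directory under data_root."""
    channels = {}
    if not os.path.isdir(data_root):
        # Nothing was mounted (or the path is wrong); report no channels.
        return channels
    for name in sorted(os.listdir(data_root)):
        path = os.path.join(data_root, name)
        if os.path.isdir(path):
            channels[name] = sorted(os.listdir(path))
    return channels


if __name__ == "__main__":
    found = list_channels()
    if not found:
        print("no input channels found", flush=True)
    for channel, files in found.items():
        print(f"{channel}: {files}", flush=True)
```

If even these lines never show up in the job output, that would suggest the entry point itself is not being executed, rather than a problem inside the training code.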