Training a custom algorithm on SageMaker

Time: 2020-05-25 10:17:03

Tags: python machine-learning data-science amazon-sagemaker

After setting up a training job for my custom algorithm and running

estimator.fit({'training':'s3://abc/xxx/train.csv','validation':'s3://abc/xxx/val.csv'})

I get the following messages:-

2020-05-24 21:08:52 Starting - Starting the training job...
2020-05-24 21:08:54 Starting - Launching requested ML instances...
2020-05-24 21:10:07 Starting - Preparing the instances for training............
2020-05-24 21:12:17 Downloading - Downloading input data
2020-05-24 21:12:17 Training - Downloading the training image.........
2020-05-24 21:14:16 Uploading - Uploading generated training model...
2020-05-24 21:14:22 Completed - Training job completed
..Training seconds: 143
Billable seconds: 43
Managed Spot Training savings: 69.9%
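
The figures in this output can be checked by hand. The sketch below parses the status lines copied from the log, measures how long the job stayed in the `Training` status, and recomputes the spot savings from the two reported second counts. The savings formula, (1 - billable seconds / training seconds) * 100, is an assumption about how the percentage is derived; the helper names are made up for illustration.

```python
from datetime import datetime

# Status lines copied verbatim from the job output above.
STATUS_LINES = """\
2020-05-24 21:08:52 Starting - Starting the training job...
2020-05-24 21:08:54 Starting - Launching requested ML instances...
2020-05-24 21:10:07 Starting - Preparing the instances for training............
2020-05-24 21:12:17 Downloading - Downloading input data
2020-05-24 21:12:17 Training - Downloading the training image.........
2020-05-24 21:14:16 Uploading - Uploading generated training model...
2020-05-24 21:14:22 Completed - Training job completed
"""


def parse_status_lines(text):
    """Return a list of (timestamp, status) tuples, one per non-empty line."""
    events = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # The prefix looks like "2020-05-24 21:08:52 Starting".
        prefix, _, _message = line.partition(" - ")
        date_part, time_part, status = prefix.split(" ")
        stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
        events.append((stamp, status))
    return events


def seconds_between(events, start_status, end_status):
    """Seconds from the first event with start_status to the first with end_status."""
    start = next(t for t, s in events if s == start_status)
    end = next(t for t, s in events if s == end_status)
    return int((end - start).total_seconds())


def spot_savings_percent(training_seconds, billable_seconds):
    """Assumed formula: share of training time that was not billed."""
    return (1 - billable_seconds / training_seconds) * 100


events = parse_status_lines(STATUS_LINES)
print("Seconds in 'Training' status:", seconds_between(events, "Training", "Uploading"))
print("Spot savings: %.1f%%" % spot_savings_percent(143, 43))
```

The `Training` status lasts 119 seconds here, and its only message is about downloading the training image, so very little of that window can have been spent inside `train.py`. The recomputed savings come out at 69.9%, matching the reported value.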

This looks really suspicious, because I am training a BERT model for 40 epochs, and that cannot finish in so little time. Also, I cannot see any logs in CloudWatch.

What exactly is going on here? Any help is much appreciated!

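Even when I run estimator.fit() with no inputs at all, i.e. without supplying any training or validation data, it still reports that training completed. Is my container not being invoked at all?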
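
One way to check whether the container wrote anything at all is to read the job's log streams from CloudWatch Logs directly instead of through the console. This is a minimal sketch, assuming boto3 is installed with working credentials, that training jobs log to the `/aws/sagemaker/TrainingJobs` log group, and that stream names start with the job name. The job name passed to `main` is a placeholder, not the real one.

```python
# Assumed log group for SageMaker training jobs.
LOG_GROUP = "/aws/sagemaker/TrainingJobs"


def fetch_training_log_lines(logs_client, job_name, limit=100):
    """Collect up to `limit` log messages from each stream of a training job.

    `logs_client` is a CloudWatch Logs client, e.g. boto3.client("logs").
    An empty result means no stream matched or the streams were empty.
    If the log group itself does not exist, the client raises an error.
    """
    streams = logs_client.describe_log_streams(
        logGroupName=LOG_GROUP,
        logStreamNamePrefix=job_name + "/",
    )["logStreams"]

    lines = []
    for stream in streams:
        response = logs_client.get_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=stream["logStreamName"],
            startFromHead=True,  # read from the oldest event
            limit=limit,
        )
        lines.extend(event["message"] for event in response["events"])
    return lines


def main(job_name):
    """Print the log lines of one training job (requires AWS access)."""
    import boto3  # imported lazily so the helper above stays dependency-free

    for line in fetch_training_log_lines(boto3.client("logs"), job_name):
        print(line)


# Example call with a placeholder job name:
# main("my-training-job-name")
```

If this returns no lines even though the job reports `Completed`, the entrypoint script's `echo` and `ls` output never reached the log stream, which narrows the problem to how the image is started rather than to `train.py`.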

My Dockerfile:-

FROM python:3.6

RUN pip install numpy
RUN pip install pandas
RUN pip install torch
RUN pip install transformers
RUN pip install scikit-learn
#RUN pip install boto3

# my personal package
COPY crux /usr/local/lib/python3.6/site-packages/crux
COPY prepare_data_and_train.sh /prepare_data_and_train.sh
COPY train.py /opt/ml/code/train.py
#WORKDIR /opt/ml/code


#RUN cd $WORKDIR
ENTRYPOINT ["/bin/bash", "/prepare_data_and_train.sh"]

prepare_data_and_train.sh:-

pip freeze

ls /opt/ml/input/data/training/
echo "train file"

ls /opt/ml/input/data/validation/
echo "validation file"

ls /opt/ml/input/data/emb/
echo "embeddings"

python /opt/ml/code/train.py
echo "file running"

I am not even getting the output of the statements in the bash file.
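
The same checks as in the shell script can be done at the very top of `train.py`, with every line flushed immediately so that nothing is lost in a buffer if the process exits early. This is a sketch under assumptions: SageMaker mounts each channel under `/opt/ml/input/data/<channel>` (the paths the script above already lists) and starts the training container with the argument `train`. The function names are illustrative only.

```python
import os
import sys


def describe_channels(data_root="/opt/ml/input/data"):
    """Return {channel_name: sorted file names} for each channel directory.

    Returns an empty dict when data_root does not exist, which is what
    happens if no input channels were mounted into the container.
    """
    if not os.path.isdir(data_root):
        return {}
    channels = {}
    for name in sorted(os.listdir(data_root)):
        path = os.path.join(data_root, name)
        if os.path.isdir(path):
            channels[name] = sorted(os.listdir(path))
    return channels


def report(argv, data_root="/opt/ml/input/data"):
    """Print the container arguments and the input channels, flushing each line."""
    print("container args:", argv, flush=True)
    channels = describe_channels(data_root)
    if not channels:
        print("no input channels found under", data_root, flush=True)
    for name, files in channels.items():
        print("channel %s: %s" % (name, files), flush=True)
    return channels


if __name__ == "__main__":
    report(sys.argv[1:])
```

With the `fit` call from the question, this would be expected to report a `training` and a `validation` channel. If not even the `container args` line shows up in the logs, the script was never started.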

0 answers