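I have created an AWS Batch compute environment, a job queue, and a job definition to run GPU workloads. Unfortunately, my GPU Docker container (based on nvidia/cuda:9.0-cudnn7-runtime) fails with the following error: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. I tried many workarounds, but ended up with a different error: CUDA error: no kernel image is available for execution on the device. After spending more than three working days on these problems, I am now stuck. Can anyone help?
My compute environment:
MANAGED, EC2, [p2, p3] instances.
My job definition: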
{
    "jobDefinitionName": "B2JobB2LiveModelRowTomatoCount",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:019997017433:job-definition/B2JobB2LiveModelRowTomatoCount:10",
    "revision": 10,
    "status": "ACTIVE",
    "type": "container",
    "parameters": {},
    "retryStrategy": {
        "attempts": 3
    },
    "containerProperties": {
        "image": "******.dkr.ecr.*******.amazonaws.com/b2-live-model-row-tomato-count:latest",
        "vcpus": 1,
        "memory": 4000,
        "command": [],
        "volumes": [
            {
                "host": {
                    "sourcePath": "/tmp"
                },
                "name": "tempfolder"
            }
        ],
        "environment": [
            {
                "name": "LOG_LEVEL",
                "value": "INFO"
            }
        ],
        "mountPoints": [
            {
                "containerPath": "/tmp",
                "readOnly": false,
                "sourceVolume": "tempfolder"
            }
        ],
        "readonlyRootFilesystem": false,
        "privileged": true,
        "ulimits": [],
        "resourceRequirements": [
            {
                "value": "1",
                "type": "GPU"
            }
        ]
    }
}
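
To narrow down the first error, a check along these lines could be run as the container command. This is only a minimal sketch, assuming Python 3 is available in the image; the helper name can_load_shared_library is hypothetical and not part of my actual job. It only tests whether the dynamic loader can open libcuda.so.1 from inside the container, which is the step that fails with the OSError above.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library.

    Hypothetical helper for illustration only.
    """
    try:
        # ctypes.CDLL raises OSError when the library cannot be found or opened.
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    library = "libcuda.so.1"
    if can_load_shared_library(library):
        print(f"{library} can be loaded inside this container")
    else:
        print(f"{library} cannot be loaded inside this container")
```

If the check reports that the library cannot be loaded, the NVIDIA driver library is presumably not visible inside the container, independent of my application code.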