PyTorch error when running a PyTorch image on Kubernetes

Time: 2021-06-23 08:06:55

Tags: python pytorch airflow torchvision

I have a Docker image that uses PyTorch to perform object detection. The container runs fine locally and on Google Colab, but when it runs on Kubernetes (through Airflow) it raises the following error:

[2021-06-23 07:13:17,592] {pod_launcher.py:148} INFO - Traceback (most recent call last):
[2021-06-23 07:13:17,592] {pod_launcher.py:148} INFO -   File "/content/main.py", line 5, in <module>
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -     app()
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/typer/main.py", line 214, in __call__
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -     return get_command(self)(*args, **kwargs)
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 829, in __call__
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -     return self.main(*args, **kwargs)
[2021-06-23 07:13:17,594] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 782, in main
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -     rv = self.invoke(ctx)
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1259, in invoke
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 610, in invoke
[2021-06-23 07:13:17,595] {pod_launcher.py:148} INFO -     return callback(*args, **kwargs)
[2021-06-23 07:13:17,596] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/typer/main.py", line 497, in wrapper
[2021-06-23 07:13:17,596] {pod_launcher.py:148} INFO -     return callback(**use_params)  # type: ignore
[2021-06-23 07:13:17,596] {pod_launcher.py:148} INFO -   File "/content/app/__init__.py", line 52, in detect_from_file
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -     coco_path=coco_path,
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -   File "/content/app/__init__.py", line 126, in _detect_from_file
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -     tables = infer_page(page_filename, model)
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -   File "/content/app/utils.py", line 9, in infer_page
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -     result = inference_detector(model, str(img))
[2021-06-23 07:13:17,597] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/apis/inference.py", line 86, in inference_detector
[2021-06-23 07:13:17,598] {pod_launcher.py:148} INFO -     result = model(return_loss=False, rescale=True, **data)
[2021-06-23 07:13:17,598] {pod_launcher.py:148} INFO -   File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
[2021-06-23 07:13:17,598] {pod_launcher.py:148} INFO -     result = self.forward(*input, **kwargs)
[2021-06-23 07:13:17,598] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/core/fp16/decorators.py", line 49, in new_func
[2021-06-23 07:13:17,598] {pod_launcher.py:148} INFO -     return old_func(*args, **kwargs)
[2021-06-23 07:13:17,599] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/detectors/base.py", line 149, in forward
[2021-06-23 07:13:17,599] {pod_launcher.py:148} INFO -     return self.forward_test(img, img_metas, **kwargs)
[2021-06-23 07:13:17,599] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/detectors/base.py", line 130, in forward_test
[2021-06-23 07:13:17,599] {pod_launcher.py:148} INFO -     return self.simple_test(imgs[0], img_metas[0], **kwargs)
[2021-06-23 07:13:17,599] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/detectors/cascade_rcnn.py", line 324, in simple_test
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -     self.test_cfg.rpn) if proposals is None else proposals
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/detectors/test_mixins.py", line 34, in simple_test_rpn
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -     proposal_list = self.rpn_head.get_bboxes(*proposal_inputs)
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/core/fp16/decorators.py", line 127, in new_func
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -     return old_func(*args, **kwargs)
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/anchor_heads/anchor_head.py", line 276, in get_bboxes
[2021-06-23 07:13:17,600] {pod_launcher.py:148} INFO -     scale_factor, cfg, rescale)
[2021-06-23 07:13:17,601] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/models/anchor_heads/rpn_head.py", line 92, in get_bboxes_single
[2021-06-23 07:13:17,601] {pod_launcher.py:148} INFO -     proposals, _ = nms(proposals, cfg.nms_thr)
[2021-06-23 07:13:17,601] {pod_launcher.py:148} INFO -   File "/content/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 54, in nms
[2021-06-23 07:13:17,601] {pod_launcher.py:148} INFO -     inds = nms_cuda.nms(dets_th, iou_thr)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - RuntimeError: CUDA error: no kernel image is available for execution on the device (launch_kernel at /pytorch/aten/src/ATen/native/cuda/Loops.cuh:103)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7faf44434193 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - frame #1: void at::native::gpu_index_kernel<__nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>, __nv_dl_wrapper_t<__nv_dl_tag<void (*)(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>), &(void at::native::index_kernel_impl<at::native::OpaqueType<8> >(at::TensorIterator&, c10::ArrayRef<long>, c10::ArrayRef<long>)), 1u>> const&) + 0x7bb (0x7faefc45e87b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - frame #2: <unknown function> + 0x580fc32 (0x7faefc458c32 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - frame #3: <unknown function> + 0x580ff88 (0x7faefc458f88 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,602] {pod_launcher.py:148} INFO - frame #4: <unknown function> + 0x1a7493b (0x7faef86bd93b in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,603] {pod_launcher.py:148} INFO - frame #5: at::native::index(at::Tensor const&, c10::ArrayRef<at::Tensor>) + 0x47e (0x7faef86b96fe in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,603] {pod_launcher.py:148} INFO - frame #6: <unknown function> + 0x1fe06aa (0x7faef8c296aa in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,603] {pod_launcher.py:148} INFO - frame #7: <unknown function> + 0x1fe5173 (0x7faef8c2e173 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,603] {pod_launcher.py:148} INFO - frame #8: <unknown function> + 0x3bffe6a (0x7faefa848e6a in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #9: <unknown function> + 0x1fe5173 (0x7faef8c2e173 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #10: at::Tensor c10::KernelFunction::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(at::Tensor const&, c10::ArrayRef<at::Tensor>) const + 0xa3 (0x7faef3b7ec73 in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #11: c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}::operator()(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&) const + 0xc9 (0x7faef3b7c331 in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #12: std::result_of<c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1} (ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)>::type c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > >::read<c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}>(c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, 
std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > const&)#1}&&) const + 0x128 (0x7faef3b7eed2 in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #13: at::Tensor c10::Dispatcher::doCallUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::DispatchTable const&, c10::LeftRight<ska::flat_hash_map<c10::TensorTypeId, c10::KernelFunction, std::hash<c10::TensorTypeId>, std::equal_to<c10::TensorTypeId>, std::allocator<std::pair<c10::TensorTypeId, c10::KernelFunction> > > > const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const + 0x6a (0x7faef3b7c3ba in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #14: c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}::operator()(c10::DispatchTable const&) const + 0x80 (0x7faef3b7936a in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #15: std::result_of<c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1} (c10::DispatchTable const&)>::type c10::LeftRight<c10::DispatchTable>::read<c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}&&) const + 0x128 (0x7faef3b7f056 in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #16: c10::guts::infer_function_traits<c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}>::type::return_type c10::impl::OperatorEntry::readDispatchTable<c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}>(c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const::{lambda(c10::DispatchTable const&)#1}&&) const + 0x4a (0x7faef3b7c42c in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,604] {pod_launcher.py:148} INFO - frame #17: at::Tensor c10::Dispatcher::callUnboxedOnly<at::Tensor, at::Tensor const&, c10::ArrayRef<at::Tensor> >(c10::OperatorHandle const&, at::Tensor const&, c10::ArrayRef<at::Tensor>) const + 0x7c (0x7faef3b7941a in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,605] {pod_launcher.py:148} INFO - frame #18: at::Tensor::index(c10::ArrayRef<at::Tensor>) const + 0x16f (0x7faef3b74dad in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,605] {pod_launcher.py:148} INFO - frame #19: nms_cuda(at::Tensor, float) + 0x84f (0x7faef3b71a0c in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,605] {pod_launcher.py:148} INFO - frame #20: nms(at::Tensor const&, float) + 0xee (0x7faef3b6087e in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,605] {pod_launcher.py:148} INFO - frame #21: <unknown function> + 0x335ab (0x7faef3b6f5ab in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
[2021-06-23 07:13:17,605] {pod_launcher.py:148} INFO - frame #22: <unknown function> + 0x302b0 (0x7faef3b6c2b0 in /content/mmdetection/mmdet/ops/nms/nms_cuda.cpython-36m-x86_64-linux-gnu.so)
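
The traceback above is easier to read once the per-line Airflow prefix is removed. The following is a minimal sketch for doing that; `strip_airflow_prefix` is a hypothetical helper written for this post (not part of Airflow), and the regular expression assumes every line carries a prefix shaped like the one in the log above.

```python
import re

# Matches the prefix seen in the log above, for example:
# "[2021-06-23 07:13:17,592] {pod_launcher.py:148} INFO - "
AIRFLOW_PREFIX = re.compile(
    r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\] \{[^}]+\} [A-Z]+ - "
)


def strip_airflow_prefix(log_text):
    """Remove the Airflow prefix from each line, keeping the traceback indentation."""
    return "\n".join(AIRFLOW_PREFIX.sub("", line) for line in log_text.splitlines())


# Two lines taken from the traceback above.
prefix = "[2021-06-23 07:13:17,601] {pod_launcher.py:148} INFO - "
sample = (
    prefix + '  File "/content/mmdetection/mmdet/ops/nms/nms_wrapper.py", line 54, in nms\n'
    + prefix + "    inds = nms_cuda.nms(dets_th, iou_thr)"
)
print(strip_airflow_prefix(sample))
```

Lines that do not start with the prefix are left unchanged.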

Below is the output of nvidia-smi and of mmdetection's collect_env.py utility, run both on Kubernetes and on Google Colab (where the code runs fine).

Output from the container running on Kubernetes

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000001:00:00.0 Off |                    0 |
| N/A   23C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

python3 /content/mmdetection/mmdet/utils/collect_env.py

sys.platform: linux
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0: Tesla P100-PCIE-16GB
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.4.0+cu100
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.5.0+cu100
OpenCV: 4.5.2
MMCV: 0.4.3
MMDetection: 1.2.0+unknown
MMDetection Compiler: GCC 7.5
MMDetection CUDA Compiler: 10.0
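
One thing that can be checked from this output is whether the PyTorch binary itself ships kernels for the GPU. A minimal sketch, assuming the "NVCC architecture flags" string is copied verbatim from the report above; `sm_archs` is a hypothetical helper, and the Tesla P100 has compute capability 6.0.

```python
import re


def sm_archs(nvcc_flags):
    """Return sorted (major, minor) pairs for every code=sm_XY entry in the flags."""
    return sorted({(int(a), int(b)) for a, b in re.findall(r"code=sm_(\d)(\d)", nvcc_flags)})


# "NVCC architecture flags" line from the collect_env.py output above.
flags = (
    "-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;"
    "-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;"
    "-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;"
    "-gencode;arch=compute_37,code=compute_37"
)

P100 = (6, 0)  # compute capability of the Tesla P100
print(sm_archs(flags))          # [(3, 7), (5, 0), (6, 0), (6, 1), (7, 0), (7, 5)]
print(P100 in sm_archs(flags))  # True
```

This only describes the PyTorch wheel. It says nothing about which architectures the compiled mmdetection extension (nms_cuda, the last Python frame in the traceback) was built for, so that may be worth checking separately.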

Output from running the code on Google Colab

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    25W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

python3 /content/mmdetection/mmdet/utils/collect_env.py

sys.platform: linux
Python: 3.7.10 (default, May  3 2021, 02:48:31) [GCC 7.5.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GPU 0: Tesla V100-SXM2-16GB
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.4.0+cu100
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.5.0+cu100
OpenCV: 4.5.2
MMCV: 0.4.3
MMDetection: 1.2.0+0f33c08
MMDetection Compiler: GCC 7.5
MMDetection CUDA Compiler: 11.0

Note: the Google Colab output posted here was produced on a Tesla V100 GPU, but I am sometimes assigned a Tesla P100 (the same GPU used on Kubernetes). The code runs smoothly on Colab in both cases, yet it raises the error when run on Kubernetes.
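
To make the comparison between the two environments less error-prone, the two collect_env.py reports can be diffed key by key. This is a minimal sketch; `parse_env` and `diff_env` are hypothetical helpers, and the two strings contain only a subset of the lines from the reports above.

```python
def parse_env(report):
    """Parse top-level 'key: value' lines of a collect_env.py report into a dict."""
    env = {}
    for line in report.splitlines():
        key, sep, value = line.partition(": ")
        # Skip indented detail lines and lines without a "key: value" shape.
        if sep and not line.startswith(" "):
            env[key.strip()] = value.strip()
    return env


def diff_env(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


kubernetes = """\
Python: 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
NVCC: Cuda compilation tools, release 10.0, V10.0.130
GPU 0: Tesla P100-PCIE-16GB
PyTorch: 1.4.0+cu100
TorchVision: 0.5.0+cu100
MMCV: 0.4.3
MMDetection: 1.2.0+unknown
MMDetection CUDA Compiler: 10.0
"""

colab = """\
Python: 3.7.10 (default, May  3 2021, 02:48:31) [GCC 7.5.0]
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GPU 0: Tesla V100-SXM2-16GB
PyTorch: 1.4.0+cu100
TorchVision: 0.5.0+cu100
MMCV: 0.4.3
MMDetection: 1.2.0+0f33c08
MMDetection CUDA Compiler: 11.0
"""

differences = diff_env(parse_env(kubernetes), parse_env(colab))
for key, (k8s_value, colab_value) in differences.items():
    print(f"{key}: kubernetes={k8s_value!r} colab={colab_value!r}")
```

For these excerpts the differing keys are the GPU, the Python version, the NVCC version, the MMDetection version string, and the CUDA compiler used to build MMDetection; PyTorch, TorchVision and MMCV match.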

Any help is appreciated.

0 answers