在具有4个NVIDIA GPU的节点上,我在设备0上启用了ECC内存保护(所有其他都禁用了ECC)。由于我在设备0上启用了ECC,因此当我尝试在此设备0(驱动程序API)上创建上下文时,我的应用程序(CUDA,仅使用一个设备)会挂起。我不知道为什么它会挂起。如果我使用不同的设备将CUDA_VISIBLE_DEVICE设置为相应的另一台设备,它可以正常工作。它必须与启用ECC有关。有什么想法吗?
这里是nvidia-smi
的输出:
(为什么它报告99%的GPU利用率不稳定,没有在那里运行?)
+------------------------------------------------------+
| NVIDIA-SMI 4.304.54 Driver Version: 304.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20m | 0000:02:00.0 Off | 1 |
| N/A 29C P0 49W / 225W | 0% 12MB / 4799MB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20m | 0000:03:00.0 Off | 0 |
| N/A 22C P8 15W / 225W | 0% 12MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20m | 0000:83:00.0 Off | 0 |
| N/A 22C P8 24W / 225W | 0% 11MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20m | 0000:84:00.0 Off | 0 |
| N/A 23C P8 25W / 225W | 0% 11MB / 4799MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
编辑:nvidia-smi -a
报告在所有设备上启用了ECC。奇怪!
==============NVSMI LOG==============
Timestamp : Fri Apr 26 10:18:14 2013
Driver Version : 304.54
Attached GPUs : 4
GPU 0000:02:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044699
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:02:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 2
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 99 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Aggregate
Single Bit
Device Memory : 1
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 1
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 29 C
Power Readings
Power Management : Supported
Power Draw : 49.51 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:03:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044821
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x03
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:03:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 12 MB
Free : 4787 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 19 C
Power Readings
Power Management : Supported
Power Draw : 15.22 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:83:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044783
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:83:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 22 C
Power Readings
Power Management : Supported
Power Draw : 24.74 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
GPU 0000:84:00.0
Product Name : Tesla K20m
Display Mode : Disabled
Persistence Mode : Enabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324512044628
VBIOS Version : 80.10.11.00.0B
Inforom Version
Image Version : 2081.0208.01.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : Compute
Pending : Compute
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x102810DE
Bus Id : 0000:84:00.0
Sub System Id : 0x101510DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : N/A
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 11 MB
Free : 4788 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 23 C
Power Readings
Power Management : Supported
Power Draw : 25.47 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
答案 0 :(得分:4)
nvidia-smi输出显示设备上无法纠正的ECC错误。您可以使用nvidia-smi --reset-ecc-errors=0 -g 0
重置错误,然后重试。重置中的0
表示仅重置易失性计数器,聚合计数器仍将指示过去发生过错误。
如果您发现设备存在更多错误,则值得进一步调查原因。
请注意,在摘要视图中,您正在查看的ECC字段实际上是“Volatile Uncorr.ECC”,即它是错误计数而不是ECC启用/禁用标志。如果ECC被禁用,则会显示“N / A”。