无法在启用了ECC的NVIDIA设备上创建上下文

时间:2013-04-26 14:04:08

标签: cuda nvidia

在具有4个NVIDIA GPU的节点上,我在设备0上启用了ECC内存保护(所有其他都禁用了ECC)。由于我在设备0上启用了ECC,因此当我尝试在此设备0(驱动程序API)上创建上下文时,我的应用程序(CUDA,仅使用一个设备)会挂起。我不知道为什么它会挂起。如果我使用不同的设备将CUDA_VISIBLE_DEVICE设置为相应的另一台设备,它可以正常工作。它必须与启用ECC有关。有什么想法吗? 这里是nvidia-smi的输出: (为什么它报告99%的GPU利用率不稳定,没有在那里运行?)

+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:02:00.0     Off |                    1 |
| N/A   29C    P0    49W / 225W |   0%   12MB / 4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m               | 0000:03:00.0     Off |                    0 |
| N/A   22C    P8    15W / 225W |   0%   12MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20m               | 0000:83:00.0     Off |                    0 |
| N/A   22C    P8    24W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K20m               | 0000:84:00.0     Off |                    0 |
| N/A   23C    P8    25W / 225W |   0%   11MB / 4799MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

编辑:nvidia-smi -a报告在所有设备上启用了ECC。奇怪!

==============NVSMI LOG==============

Timestamp                       : Fri Apr 26 10:18:14 2013
Driver Version                  : 304.54

Attached GPUs                   : 4
GPU 0000:02:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044699
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x02
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:02:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 2
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 12 MB
        Free                    : 4787 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 99 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 1
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 1
        Aggregate
            Single Bit            
                Device Memory   : 1
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 1
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 29 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 49.51 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:03:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044821
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x03
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:03:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 12 MB
        Free                    : 4787 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 19 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 15.22 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:83:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044783
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x83
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:83:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 11 MB
        Free                    : 4788 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 22 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 24.74 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

GPU 0000:84:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0324512044628
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x84
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:84:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 1
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 11 MB
        Free                    : 4788 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 0 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
    Temperature
        Gpu                     : 23 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 25.47 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 324 MHz
        SM                      : 324 MHz
        Memory                  : 324 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

1 个答案:

答案 0 :(得分:4)

nvidia-smi输出显示设备上无法纠正的ECC错误。您可以使用nvidia-smi --reset-ecc-errors=0 -g 0重置错误,然后重试。重置中的0表示仅重置易失性计数器,聚合计数器仍将指示过去发生过错误。

如果您发现设备存在更多错误,则值得进一步调查原因。

请注意,在摘要视图中,您正在查看的ECC字段实际上是“Volatile Uncorr.ECC”,即它是错误计数而不是ECC启用/禁用标志。如果ECC被禁用,则会显示“N / A”。