mdadm RAID keeps failing - configuration or hardware?

Time: 2019-02-26 14:33:23

Tags: raid mdadm

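I am having no luck at all setting up a RAID with mdadm. At this point I suspect my hardware. Shortly after the initial setup, both during the sync and after a successful sync, the drives are marked "failed" and removed from the array. I have tried both the raw-drive method and the partition method. With partitions, I tried the full capacity as well as a smaller partition size (from the start of the disk to -100MB). From what I have read, the recommended way to build an mdadm RAID is to use a partition slightly smaller than the drive's actual capacity rather than the raw, unpartitioned drive. This simplifies administration, for example when replacing a failed drive.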

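My hardware starts with a Dell PowerEdge R410 server. I have an eSATA adapter (not high-end) connected to a 5-bay Sans Digital TowerRaid TR5M-(B) holding four 4TB WD Red NAS drives. I want to keep the datastore separate from the physical server. I have not yet tried moving the disks into the Dell server, because I do not want the operating system on the RAID array. I suppose I could try booting from an external drive, but that is too unconventional and I really do not want to go in that direction.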

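I have come across a post or two about "timing" issues and wonder whether that is really the root of my problem. But those posts talk about failures during the sync process. In my case, I have seen the array sync successfully to 100% and only then fall apart. I can post a series of mdadm examine and detail outputs.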

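So, before building the array again to post the setup, status details, and so on, I thought I would ask the community for your thoughts. In the meantime, this is what the array looked like after all the drives had been marked failed and removed.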

services-admin@mydomain:(172.20.0.9)~/DockerServices$ sudo mdadm --detail /dev/md0

       Version : 1.2
 Creation Time : Mon Feb 25 14:42:27 2019
    Raid Level : raid6
    Array Size : 7813566464 (7451.60 GiB 8001.09 GB)
 Used Dev Size : 3906783232 (3725.80 GiB 4000.55 GB)
  Raid Devices : 4
 Total Devices : 4
   Persistence : Superblock is persistent

 Intent Bitmap : Internal

   Update Time : Mon Feb 25 16:01:57 2019
         State : clean, FAILED 
Active Devices : 0
Failed Devices : 4
 Spare Devices : 0

        Layout : left-symmetric
    Chunk Size : 512K

Consistency Policy : bitmap

Number   Major   Minor   RaidDevice State
   -       0        0        0      removed
   -       0        0        1      removed
   -       0        0        2      removed
   -       0        0        3      removed

   0       8        1        -      faulty   /dev/sda1
   1       8       17        -      faulty   /dev/sdb1
   2       8       33        -      faulty   /dev/sdc1
   3       8       49        -      faulty   /dev/sdd1
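
For anyone who wants to pull the failed members out of output like the above in a script, here is a minimal sketch. The function name `faulty_members` is hypothetical, and it assumes the device table keeps the column layout shown above (state in the fifth column, device path last). It only reads text on stdin, so it does not touch the array.

```shell
#!/usr/bin/env bash
# faulty_members (hypothetical helper): read `mdadm --detail` output on stdin
# and print the device path of every member whose State column is "faulty".
# Assumes the table layout shown above: Number Major Minor RaidDevice State Device.
faulty_members() {
    awk '$5 == "faulty" { print $NF }'
}

# Example usage (requires root and an existing array):
#   sudo mdadm --detail /dev/md0 | faulty_members
```

On the output above this would list /dev/sda1 through /dev/sdd1, that is, every member of the array.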

1 answer:

Answer 0 (score: 0)

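I believe I have found the cause of the problem. Looking at the smartctl output for the individual drives reveals interface CRC errors. The sample below, from one of the drives, shows them at lines 100, 117 and 134, and the UDMA_CRC_Error_Count attribute at line 76 reads 3. Every drive shows similar errors. It is unlikely that all four drives have faulty interfaces of their own, especially after such a short time in service. So it looks like a bad eSATA cable, the server's PCI card, the TowerRaid interface, or some combination of these. I will start with the cable and work from there.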

sudo smartctl --all /dev/sdb | cat -n
 1      smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-45-generic] (local build)
 2        Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
 3  
 4        === START OF INFORMATION SECTION ===
 5        Device Model:     WDC WD4002FFWX-68TZ4N0
 6        Serial Number:    K4JHGWXB
 7        LU WWN Device Id: 5 000cca 25de33882
 8        Firmware Version: 83.H0A83
 9        User Capacity:    4,000,787,030,016 bytes [4.00 TB]
10        Sector Sizes:     512 bytes logical, 4096 bytes physical
11        Rotation Rate:    7200 rpm
12        Form Factor:      3.5 inches
13        Device is:        Not in smartctl database [for details use: -P showall]
14        ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4
15        SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
16        Local Time is:    Tue Feb 26 12:41:00 2019 MST
17        SMART support is: Available - device has SMART capability.
18        SMART support is: Enabled
19  
20        === START OF READ SMART DATA SECTION ===
21        SMART Status not supported: Incomplete response, ATA output registers missing
22        SMART overall-health self-assessment test result: PASSED
23        Warning: This result is based on an Attribute check.
24  
25        General SMART Values:
26        Offline data collection status:  (0x80)   Offline data collection activity
27                          was never started.
28                          Auto Offline Data Collection: Enabled.
29        Self-test execution status:      (   0)   The previous self-test routine completed
30                          without error or no self-test has ever
31                          been run.
32        Total time to complete Offline
33        data collection:      (  113) seconds.
34        Offline data collection
35        capabilities:              (0x5b) SMART execute Offline immediate.
36                          Auto Offline data collection on/off support.
37                          Suspend Offline collection upon new
38                          command.
39                          Offline surface scan supported.
40                          Self-test supported.
41                          No Conveyance Self-test supported.
42                          Selective Self-test supported.
43        SMART capabilities:            (0x0003)   Saves SMART data before entering
44                          power-saving mode.
45                          Supports SMART auto save timer.
46        Error logging capability:        (0x01)   Error logging supported.
47                          General Purpose Logging supported.
48        Short self-test routine
49        recommended polling time:      (   2) minutes.
50        Extended self-test routine
51        recommended polling time:      ( 571) minutes.
52        SCT capabilities:            (0x003d) SCT Status supported.
53        SCT Error Recovery Control supported.
54        SCT Feature Control supported.
55        SCT Data Table supported.
56  
57        SMART Attributes Data Structure revision number: 16
58        Vendor Specific SMART Attributes with Thresholds:
59        ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
60          1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
61          2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       104
62          3 Spin_Up_Time            0x0007   142   142   024    Pre-fail  Always       -       369 (Average 381)
63          4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       23
64          5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
65          7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
66          8 Seek_Time_Performance   0x0005   128   128   020    Pre-fail  Offline      -       18
67          9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       820
68         10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
69         12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
70        192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       55
71        193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       55
72        194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 19/42)
73        196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
74        197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
75        198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
76        199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       3
77  
78        SMART Error Log Version: 1
79        ATA Error Count: 3
80          CR = Command Register [HEX]
81          FR = Features Register [HEX]
82          SC = Sector Count Register [HEX]
83          SN = Sector Number Register [HEX]
84          CL = Cylinder Low Register [HEX]
85          CH = Cylinder High Register [HEX]
86          DH = Device/Head Register [HEX]
87          DC = Device Command Register [HEX]
88          ER = Error register [HEX]
89          ST = Status register [HEX]
90        Powered_Up_Time is measured from power on, and printed as
91        DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
92        SS=sec, and sss=millisec. It "wraps" after 49.710 days.
93  
94        Error 3 occurred at disk power-on lifetime: 715 hours (29 days + 19 hours)
95          When the command that caused the error occurred, the device was active or idle.
96  
97          After command completion occurred, registers were:
98          ER ST SC SN CL CH DH
99          -- -- -- -- -- -- --
100         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
101 
102         Commands leading to the command that caused the error were:
103         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
104         -- -- -- -- -- -- -- --  ----------------  --------------------
105         61 40 d8 c0 33 f8 40 08  10d+11:01:08.855  WRITE FPDMA QUEUED
106         61 40 f0 80 2e f8 40 08  10d+11:01:08.847  WRITE FPDMA QUEUED
107         61 40 e8 40 29 f8 40 08  10d+11:01:08.844  WRITE FPDMA QUEUED
108         61 40 e0 00 24 f8 40 08  10d+11:01:08.841  WRITE FPDMA QUEUED
109         61 a8 d8 18 20 f8 40 08  10d+11:01:08.840  WRITE FPDMA QUEUED
110 
111       Error 2 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
112         When the command that caused the error occurred, the device was active or idle.
113 
114         After command completion occurred, registers were:
115         ER ST SC SN CL CH DH
116         -- -- -- -- -- -- --
117         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
118 
119         Commands leading to the command that caused the error were:
120         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
121         -- -- -- -- -- -- -- --  ----------------  --------------------
122         60 00 c8 00 02 00 40 08      00:00:16.009  READ FPDMA QUEUED
123         47 00 01 12 00 00 a0 08      00:00:15.990  READ LOG DMA EXT
124         47 00 01 00 00 00 a0 08      00:00:15.989  READ LOG DMA EXT
125         ef 10 02 00 00 00 a0 08      00:00:15.987  SET FEATURES [Enable SATA feature]
126         27 00 00 00 00 00 e0 08      00:00:15.987  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
127 
128       Error 1 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours)
129         When the command that caused the error occurred, the device was active or idle.
130 
131         After command completion occurred, registers were:
132         ER ST SC SN CL CH DH
133         -- -- -- -- -- -- --
134         84 43 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0
135 
136         Commands leading to the command that caused the error were:
137         CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
138         -- -- -- -- -- -- -- --  ----------------  --------------------
139         60 00 b8 00 02 00 40 08      00:00:15.373  READ FPDMA QUEUED
140         60 80 b0 80 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
141         60 38 a8 40 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
142         60 08 a0 10 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
143         60 18 98 20 00 00 40 08      00:00:15.370  READ FPDMA QUEUED
144 
145       SMART Self-test log structure revision number 1
146       No self-tests have been logged.  [To run self-tests, use: smartctl -t]
147 
148       SMART Selective self-test log data structure revision number 1
149        SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
150           1        0        0  Not_testing
151           2        0        0  Not_testing
152           3        0        0  Not_testing
153           4        0        0  Not_testing
154           5        0        0  Not_testing
155       Selective self-test flags (0x0):
156         After scanning selected spans, do NOT read-scan remainder of disk.
157       If Selective self-test is pending on power-up, resume after 0 minute delay.
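
To compare the CRC counter across all four drives without reading each full report, something like the following could be used. This is a minimal sketch: the function name `crc_error_count` is hypothetical, and it assumes the attribute table has the same ten-column layout as the output above, with the raw value in the last column of the ID 199 row.

```shell
#!/usr/bin/env bash
# crc_error_count (hypothetical helper): read `smartctl -A` output on stdin
# and print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count).
# Prints nothing if the attribute is not present.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}

# Example usage on the four array members (requires root and smartmontools):
#   for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
#       printf '%s: %s\n' "$dev" "$(sudo smartctl -A "$dev" | crc_error_count)"
#   done
```

The raw value is a lifetime counter and does not reset, so after swapping the cable the thing to watch is whether the number keeps increasing, not whether it returns to zero.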