DRBD sync fails with ProtocolError

Time: 2016-08-10 18:26:45

Tags: synchronization drbd

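I currently have a pair of DRBD servers that have stopped syncing, and nothing I try gets them to sync again. Synchronization runs over a dedicated crossover cable (1 Gbps copper) between the two servers.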

Here is what I see in the logs on r02:

Aug  9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug  9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug  9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug  9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug  9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug  9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Aug  9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID ) 
Aug  9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug  9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug  9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget ) 
Aug  9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug  9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
Aug  9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug  9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug  9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug  9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug  9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug  9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug  9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected ) 
Aug  9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated
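
To put the numbers in this log into more familiar units, here is a small arithmetic sketch. It assumes that the "s" suffix means 512-byte sectors and that one bitmap bit covers 4 KiB; the second assumption is consistent with the log itself, which reports 32062701 bits as 128250804 KB. The helper names are illustrative only.

```python
SECTOR_BYTES = 512      # assumption: the "s" suffix denotes 512-byte sectors
BITMAP_BLOCK_KIB = 4    # assumption: one bitmap bit tracks 4 KiB of data

def sectors_to_gib(sectors: int) -> float:
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30

def bits_to_kib(bits: int) -> int:
    """Convert a count of out-of-sync bitmap bits to KiB."""
    return bits * BITMAP_BLOCK_KIB

small_gib = sectors_to_gib(256503768)    # the smaller backing device
large_gib = sectors_to_gib(1344982880)   # the larger backing device
resync_kib = bits_to_kib(32062701)       # matches "will sync 128250804 KB"

print(f"smaller device: {small_gib:.1f} GiB")
print(f"larger device:  {large_gib:.1f} GiB")
print(f"resync size:    {resync_kib} KiB ({resync_kib / 2**20:.1f} GiB)")
```

Under these assumptions the pending resync covers essentially the whole of the smaller device, which is what one would expect after wiping and rebuilding that node.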

And on r01:

Aug  9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug  9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection ) 
Aug  9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams ) 
Aug  9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug  9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug  9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug  9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug  9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug  9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource ) 
Aug  9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug  9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) 
Aug  9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug  9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug  9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug  9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected ) 
Aug  9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated

This just repeats over and over.
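
When comparing the two logs side by side, it can help to reduce them to just the connection-state changes. The following is a minimal sketch that pulls every `conn( A -> B )` entry out of syslog text; the function name and the sample string are illustrative, and the sample lines are copied from the logs above.

```python
import re

# Matches the connection-state part of a DRBD kernel message,
# e.g. "conn( SyncSource -> ProtocolError )".
CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")
# In syslog format the host name is the token just before "kernel:".
HOST_RE = re.compile(r"(\S+) kernel:")

def conn_transitions(log_text):
    """Return (host, old_state, new_state) for each conn() transition."""
    found = []
    for line in log_text.splitlines():
        conn = CONN_RE.search(line)
        if conn is None:
            continue
        host = HOST_RE.search(line)
        found.append((host.group(1) if host else "?", conn.group(1), conn.group(2)))
    return found

sample = """\
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource )
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError )
"""

for host, old, new in conn_transitions(sample):
    print(f"{host}: {old} -> {new}")
```

Read this way, r01 is the side that reports ProtocolError, while r02 only reports that the connection was shut down by its peer. That suggests the BrokenPipe on r02 is a consequence of the error on r01 rather than a separate fault.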

The configuration is identical on both servers:

r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/

sent 11 bytes  received 51 bytes  124.00 bytes/sec
total size is 615  speedup is 9.92 (DRY RUN)

This is what the configuration looks like:

r01:~$ cat /etc/drbd.conf
global {
   usage-count no;
}

resource drbd0 {
  protocol C;
  handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
  startup {
    degr-wfc-timeout 60;    # 1 minute.
    wfc-timeout 55;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on r01.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p3;
    address    10.0.255.253:7788;
    meta-disk  internal;
  }

  on r02.c07.mtsvc.net {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p6;
    address    10.0.255.254:7788;
    meta-disk  internal;
  }
}

Here is the network configuration on both sides:

r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:26:55:d6:f8:fc  
          inet addr:10.0.255.253  Bcast:10.0.255.255  Mask:255.255.255.0
          inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5512604514975 (5.0 TiB)  TX bytes:5820995499388 (5.2 TiB)
          Interrupt:24 Memory:fbe80000-fbea0000 

r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255

eth2      Link encap:Ethernet  HWaddr 00:1b:78:5c:a8:fd  
          inet addr:10.0.255.254  Bcast:10.0.255.255  Mask:255.255.255.252
          inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
          TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:332813827055 (309.9 GiB)  TX bytes:328142295363 (305.6 GiB)
          Interrupt:17 Memory:fdfa0000-fdfc0000 
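
One detail in the two outputs is that the netmasks differ (255.255.255.0 on the 10.0.255.253 interface, 255.255.255.252 on the 10.0.255.254 interface). The sketch below uses Python's standard `ipaddress` module to check whether each address still falls inside the other side's directly connected subnet. The variable and function names are illustrative, and this is only a sanity check, not a claim that the mismatch causes the sync failure.

```python
import ipaddress

# Address/netmask pairs copied from the two ifconfig outputs above.
r01_if = ipaddress.ip_interface("10.0.255.253/255.255.255.0")
r02_if = ipaddress.ip_interface("10.0.255.254/255.255.255.252")

def can_reach(local, remote):
    """True if remote's address lies inside local's directly connected subnet."""
    return remote.ip in local.network

masks_match = r01_if.network.prefixlen == r02_if.network.prefixlen

print("r01 subnet:", r01_if.network)
print("r02 subnet:", r02_if.network)
print("netmasks match:", masks_match)
print("r01 sees r02 as on-link:", can_reach(r01_if, r02_if))
print("r02 sees r01 as on-link:", can_reach(r02_if, r01_if))
```

Both addresses are on-link from either side, so the differing masks should not by themselves prevent the two nodes from talking to each other, which agrees with the successful handshake in the logs.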

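Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). I then rebuilt r02 with Debian Wheezy (drbd 8.3.13). Things ran smoothly for a few days, and then, after a restart of drbd, this problem started. I have several other drbd clusters that I have been upgrading in the same way. Some of them are fully upgraded to Wheezy; others are still half Squeeze, half Wheezy, and are working fine.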

Here is what I have tried so far to resolve this:

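  • Wiped the drbd volume on r02 and tried to resync.
  • Wiped, reinstalled, and reconfigured r02.
  • Replaced r02 with different hardware and rebuilt it from scratch.
  • Replaced the crossover cable (twice).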

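Over the next few days I will be replacing r01 with 100% different hardware. But even if that works, I am still at a loss. I really want to understand what caused this problem and the proper way to fix it.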

1 Answer:

Answer 0: (score: 0)

A lot changed in DRBD between 8.3.7 and 8.3.13, including major changes to how resyncs work: https://blogs.linbit.com/p/128/drbd-sync-rate-controller/

You could try removing any unneeded settings from the resource configuration (that is, the syncer {} section) and then adjusting DRBD: # drbdadm adjust all
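
As a rough illustration of that edit, the sketch below removes a named `{ ... }` section from drbd.conf-style text. It is a plain text-manipulation helper with a hypothetical name, not part of DRBD; it assumes the section opens on its own line as `syncer {`, as in the configuration shown in the question, and it does not validate the result.

```python
def drop_section(conf: str, name: str) -> str:
    """Remove every `name { ... }` block whose opening line is exactly `name {`."""
    kept = []
    depth = 0            # brace depth inside the block being skipped
    for line in conf.splitlines():
        stripped = line.strip()
        if depth == 0 and stripped.split() == [name, "{"]:
            depth = 1    # start skipping at the opening line
            continue
        if depth > 0:
            depth += stripped.count("{") - stripped.count("}")
            continue     # skip the body and the closing brace
        kept.append(line)
    return "\n".join(kept)

sample_conf = """\
resource drbd0 {
  protocol C;
  disk {
    on-io-error   detach;
  }
  syncer {
    rate 100M;
    al-extents 257;
  }
}"""

trimmed = drop_section(sample_conf, "syncer")
print(trimmed)
```

After changing the real /etc/drbd.conf in this way on both nodes, the `drbdadm adjust all` command from the answer would apply the new configuration.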

If it still fails to connect, you may need to upgrade the older node in order to sync: http://www.drbd.org/download/drbd/8.3/drbd-8.3.13.tar.gz