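I have a pair of DRBD servers that have decided to stop syncing, and nothing I do seems to get them syncing again. Replication runs over a dedicated crossover cable (1 Gbps copper) between the two servers.
Here is what I see in the logs on r02: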
Aug 9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug 9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug 9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug 9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug 9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug 9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug 9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug 9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug 9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug 9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug 9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug 9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug 9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug 9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug 9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug 9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug 9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug 9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug 9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected )
Aug 9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated
And on r01:
Aug 9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug 9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection )
Aug 9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug 9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams )
Aug 9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug 9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug 9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug 9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug 9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug 9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug 9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug 9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug 9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Aug 9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource )
Aug 9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug 9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug 9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug 9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError )
Aug 9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug 9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug 9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug 9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected )
Aug 9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated
This just repeats over and over.
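As a sanity check on the numbers in these logs, here is a minimal sketch that converts the reported sizes into comparable units. It assumes the `s` suffix means 512-byte sectors and that each DRBD bitmap bit covers 4 KiB. The second assumption reproduces the logged figure (32062701 bits giving 128250804 KB). The function names are my own.

```python
# Rough unit conversion for the sizes printed in the DRBD kernel log.
# Assumptions: "s" = 512-byte sectors, one bitmap bit = 4 KiB of data.
SECTOR_BYTES = 512
BITMAP_BLOCK_KIB = 4


def sectors_to_gib(sectors: int) -> float:
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30


def bits_to_kib(bits: int) -> int:
    """Convert a count of out-of-sync bitmap bits to KiB."""
    return bits * BITMAP_BLOCK_KIB


r02_sectors = 256503768    # lower-level device on r02 (from the log)
r01_sectors = 1344982880   # lower-level device on r01 (from the log)
out_of_sync_bits = 32062701

print(f"r02 backing device: {sectors_to_gib(r02_sectors):.2f} GiB")
print(f"r01 backing device: {sectors_to_gib(r01_sectors):.2f} GiB")
print(f"to resync: {bits_to_kib(out_of_sync_bits)} KiB")
```

Under those assumptions, the amount to resync is essentially the whole of r02's smaller backing device, so every attempt is a full resync.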
The configuration is identical on both servers:
r01:~$ rsync --dry-run --verbose --checksum --itemize-changes 10.0.255.254:/etc/drbd.conf /etc/
sent 11 bytes received 51 bytes 124.00 bytes/sec
total size is 615 speedup is 9.92 (DRY RUN)
This is what the configuration looks like:
r01:~$ cat /etc/drbd.conf
global {
    usage-count no;
}
resource drbd0 {
    protocol C;
    handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
    startup {
        degr-wfc-timeout 60; # 1 minute.
        wfc-timeout 55;
    }
    disk {
        on-io-error detach;
    }
    syncer {
        rate 100M;
        al-extents 257;
    }
    on r01.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p3;
        address 10.0.255.253:7788;
        meta-disk internal;
    }
    on r02.c07.mtsvc.net {
        device /dev/drbd0;
        disk /dev/cciss/c0d0p6;
        address 10.0.255.254:7788;
        meta-disk internal;
    }
}
Here is the network configuration on both sides:
r01:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:26:55:d6:f8:fc
inet addr:10.0.255.253 Bcast:10.0.255.255 Mask:255.255.255.0
inet6 addr: fe80::226:55ff:fed6:f8fc/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4062510240 errors:0 dropped:0 overruns:0 frame:0
TX packets:5692251259 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5512604514975 (5.0 TiB) TX bytes:5820995499388 (5.2 TiB)
Interrupt:24 Memory:fbe80000-fbea0000
r02:~$ sudo ifconfig -a | grep -B 2 -A 8 10.0.255
eth2 Link encap:Ethernet HWaddr 00:1b:78:5c:a8:fd
inet addr:10.0.255.254 Bcast:10.0.255.255 Mask:255.255.255.252
inet6 addr: fe80::21b:78ff:fe5c:a8fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:321977747 errors:0 dropped:0 overruns:0 frame:0
TX packets:264683964 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:332813827055 (309.9 GiB) TX bytes:328142295363 (305.6 GiB)
Interrupt:17 Memory:fdfa0000-fdfc0000
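The two interfaces carry different netmasks (255.255.255.0 on 10.0.255.253, 255.255.255.252 on 10.0.255.254). A minimal sketch using Python's standard `ipaddress` module checks whether each peer still falls inside the other's subnet. The helper name `peer_is_on_link` is my own.

```python
import ipaddress

# Addresses and netmasks copied from the ifconfig output above.
r01 = ipaddress.ip_interface("10.0.255.253/255.255.255.0")
r02 = ipaddress.ip_interface("10.0.255.254/255.255.255.252")


def peer_is_on_link(local: ipaddress.IPv4Interface,
                    peer: ipaddress.IPv4Interface) -> bool:
    """True if the peer's address lies inside the local interface's subnet."""
    return peer.ip in local.network


print("r01 subnet:", r01.network, "sees r02:", peer_is_on_link(r01, r02))
print("r02 subnet:", r02.network, "sees r01:", peer_is_on_link(r02, r01))
print("netmasks match:", r01.netmask == r02.netmask)
```

Each peer is inside the other's subnet despite the mismatch, so the netmask difference looks like a cosmetic inconsistency rather than an obvious cause of the disconnects. That is only an inference from the addresses and says nothing about the link itself.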
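Originally, both r01 and r02 were running Debian Squeeze (drbd 8.3.7). I then rebuilt r02 with Debian Wheezy (drbd 8.3.13). Things ran smoothly for a few days, and then, after a restart of drbd, this problem started. I have several other drbd clusters that I have been upgrading the same way. Some of them are fully upgraded to Wheezy; others are still half Squeeze, half Wheezy and are fine.
So far I have tried to resolve this issue, without success.
Over the next few days I will be replacing r01 with 100% different hardware. But even if that works, I am still at a loss. I really want to understand what caused this problem and the proper way to fix it.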
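Answer 0 (score: 0)
A lot changed in DRBD between 8.3.7 and 8.3.13, including significant changes to how resyncs work: https://blogs.linbit.com/p/128/drbd-sync-rate-controller/
You could try removing any unneeded settings from the resource configuration (that is, the syncer {} section) and then adjusting DRBD: # drbdadm adjust all
If it still fails to connect, you may need to upgrade the older node in order to sync: http://www.drbd.org/download/drbd/8.3/drbd-8.3.13.tar.gz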
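To compare what the two nodes actually run before and after an upgrade, a small sketch like the following may help. It assumes the first line of /proc/drbd has the form `version: X (api:N/proto:A-B)` and that the peers settle on the highest protocol version both support. The sample lines and their protocol ranges are illustrative assumptions, not output captured from these hosts, and the function names are my own.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the first line of /proc/drbd.
VERSION_RE = re.compile(
    r"version:\s*(\S+)\s*\(api:(\d+)/proto:(\d+)(?:-(\d+))?\)"
)


def parse_version_line(line: str) -> Tuple[str, int, int]:
    """Return (drbd_version, proto_min, proto_max) from a version line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized version line: {line!r}")
    version, _api, proto_min, proto_max = match.groups()
    low = int(proto_min)
    high = int(proto_max) if proto_max is not None else low
    return version, low, high


def agreed_protocol(a: Tuple[str, int, int],
                    b: Tuple[str, int, int]) -> Optional[int]:
    """Highest protocol version both ranges support, or None if disjoint."""
    low = max(a[1], b[1])
    high = min(a[2], b[2])
    return high if low <= high else None


# Hypothetical sample lines; the proto ranges are assumptions.
old_node = parse_version_line("version: 8.3.7 (api:88/proto:86-91)")
new_node = parse_version_line("version: 8.3.13 (api:88/proto:86-96)")
print("agreed protocol:", agreed_protocol(old_node, new_node))
```

With those assumed ranges the result matches the "Agreed network protocol version 91" line in the logs. That would mean the newer node falls back to the older node's protocol for the whole session.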