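For background, I have been tuning database platforms since the '80s, so I have dealt with plenty of asynchronous I/O problems over the years. This one is new, and strange.
First, I am running Oracle 12c with ASM on RHEL 7.1 64-bit (3.10.0-229). I had been using two EMC CX4-960 arrays with 72 SSDs in total, and was getting about 105K reads/sec and 65K writes/sec overall. (Yes, that is a very powerful storage backend!) Disk write latency was 2-3 ms. When the Oracle dbwriters flush buffers (usually in large batches, asynchronously), the following strace excerpt shows io_submit() and io_getevents() each returning within a few milliseconds; it then takes a few more milliseconds for all the writes to complete, and we move on to the next batch. (I have removed the details of the submitted blocks from the io_submit() lines.)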
294692 12:46:10.173955 io_submit(140662136606720, 301, ) = 301 <0.002482>
294692 12:46:10.178452 io_getevents(140662136606720, 38, 128, , {600, 0}) = 60 <0.000026>
294692 12:46:10.178766 times(NULL) = 439014359 <0.000016>
294692 12:46:10.178845 io_getevents(140662136606720, 128, 128, , {0, 0}) = 85 <0.000109>
294692 12:46:10.179352 io_getevents(140662136606720, 128, 128, , {0, 0}) = 62 <0.000118>
294692 12:46:10.180207 io_getevents(140662136606720, 94, 128, , {0, 0}) = 76 <0.000115>
294692 12:46:10.180743 io_getevents(140662136606720, 18, 128, , {0, 0}) = 16 <0.000122>
294692 12:46:10.181994 io_getevents(140662136606720, 2, 128, , {0, 0}) = 2 <0.000032>
294692 12:46:10.182393 times(NULL) = 439014359 <0.000016>
294692 12:46:10.182462 semtimedop(4718593, , 1, {3, 0}) = -1 EAGAIN (Resource temporarily unavailable) <2.999632>
294692 12:46:13.182193 times(NULL) = 439014659 <0.000015>
294692 12:46:13.188183 io_submit(140662136606720, 319, ) = 319 <0.002741>
294692 12:46:13.193078 io_getevents(140662136606720, 40, 128, , {600, 0}) = 128 <0.000021>
294692 12:46:13.193583 times(NULL) = 439014660 <0.000018>
294692 12:46:13.193663 io_getevents(140662136606720, 128, 128, , {0, 0}) = 119 <0.000116>
294692 12:46:13.194364 io_getevents(140662136606720, 72, 128, , {0, 0}) = 59 <0.000123>
294692 12:46:13.195876 io_getevents(140662136606720, 13, 128, , {0, 0}) = 13 <0.000021>
294692 12:46:13.196650 times(NULL) = 439014661 <0.000017>
294692 12:46:13.196725 semtimedop(4718593, , 1, {2, 990000000}) = -1 EAGAIN (Resource temporarily unavailable) <2.989363>
294692 12:46:16.186196 times(NULL) = 439014960 <0.000015>
294692 12:46:16.194006 io_submit(140662136606720, 276, ) = 276 <0.002434>
294692 12:46:16.198285 io_getevents(140662136606720, 36, 128, , {600, 0}) = 42 <0.000017>
294692 12:46:16.198518 times(NULL) = 439014961 <0.000014>
294692 12:46:16.198572 io_getevents(140662136606720, 128, 128, , {0, 0}) = 48 <0.000092>
294692 12:46:16.198893 io_getevents(140662136606720, 128, 128, , {0, 0}) = 37 <0.000070>
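So far, so good. I then switched to two Tegile t3600 arrays that I am testing. These are even faster and can give me more IOPS at lower latency. The problem is that I very quickly ran into Oracle "free buffer waits" at 50% or even higher. The dbwriters cannot keep up, which forces foreground writes and all sorts of bad things. It is surprising that the dbwriters cannot flush buffers fast enough with storage this fast, but strace shows why. Note that iostat shows an average disk write latency of about 0.7 ms.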
19131 18:35:06.903628 io_submit(140538814074880, 517, ) = 517 <0.505505>
19131 18:35:07.414281 io_getevents(140538814074880, 40, 128, , {600, 0}) = 128 <0.000014>
19131 18:35:07.415091 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000012>
19131 18:35:07.416139 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000010>
19131 18:35:07.417134 semctl(753668, 33, SETVAL, 0x1) = 0 <0.000017>
19131 18:35:07.417553 semctl(688130, 103, SETVAL, 0x1) = 0 <0.000014>
19131 18:35:07.417640 semctl(655361, 130, SETVAL, 0x1) = 0 <0.000013>
19131 18:35:07.419923 io_submit(140538814074880, 248, ) = 248 <0.250174>
19131 18:35:07.673864 io_getevents(140538814074880, 22, 128, , {600, 0}) = 128 <0.000019>
19131 18:35:07.674735 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000010>
19131 18:35:07.676021 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000020>
19131 18:35:07.676660 semctl(753668, 5, SETVAL, 0x1) = 0 <0.000021>
19131 18:35:07.680954 io_submit(140538814074880, 507, ) = 507 <0.503491>
19131 18:35:08.190096 io_getevents(140538814074880, 38, 128, , {600, 0}) = 128 <0.000010>
19131 18:35:08.190617 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000008>
19131 18:35:08.193571 io_getevents(140538814074880, 128, 128, , {0, 0}) = 128 <0.000025>
19131 18:35:08.196128 semctl(720899, 38, SETVAL, 0x1) = 0 <0.000026>
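So, for some reason, an io_submit() with 517 blocks takes 505 ms to return.
Any ideas why this is happening? It looks as if the array is somehow telling the OS to issue the writes serially. FWIW, I even enabled the write-back cache in the array controllers, so it seems to be something in the OS itself.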
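To put a number on the "serial" suspicion, here is a rough sketch that divides the time spent inside each io_submit() call by the number of requests it accepted. The trace lines are copied from the excerpts above; the function and variable names are only illustrative. On the first trace this works out to roughly 0.008 ms per request, while on the second it is close to 1 ms per request, which is in the same range as the 0.7 ms per-write latency iostat reports and is at least consistent with the writes being issued one at a time inside the call.

```python
import re

# Matches strace lines such as:
#   19131 18:35:06.903628 io_submit(140538814074880, 517, ) = 517 <0.505505>
# Groups: requested count, accepted count (return value), seconds spent in the call.
SUBMIT_RE = re.compile(r"io_submit\(\d+, (\d+), .*?\) = (\d+) <(\d+\.\d+)>")


def submit_cost_per_request(strace_text):
    """Return one (accepted, seconds, ms_per_request) tuple per io_submit call."""
    rows = []
    for line in strace_text.splitlines():
        match = SUBMIT_RE.search(line)
        if match is None:
            continue  # not an io_submit line
        accepted = int(match.group(2))
        seconds = float(match.group(3))
        if accepted > 0:
            rows.append((accepted, seconds, seconds * 1000.0 / accepted))
    return rows


# io_submit lines from the first (EMC) trace.
FAST_TRACE = """\
294692 12:46:10.173955 io_submit(140662136606720, 301, ) = 301 <0.002482>
294692 12:46:13.188183 io_submit(140662136606720, 319, ) = 319 <0.002741>
294692 12:46:16.194006 io_submit(140662136606720, 276, ) = 276 <0.002434>
"""

# io_submit lines from the second (Tegile) trace.
SLOW_TRACE = """\
19131 18:35:06.903628 io_submit(140538814074880, 517, ) = 517 <0.505505>
19131 18:35:07.419923 io_submit(140538814074880, 248, ) = 248 <0.250174>
19131 18:35:07.680954 io_submit(140538814074880, 507, ) = 507 <0.503491>
"""

if __name__ == "__main__":
    for label, trace in (("first trace", FAST_TRACE), ("second trace", SLOW_TRACE)):
        for accepted, seconds, ms_each in submit_cost_per_request(trace):
            print(f"{label}: {accepted} requests in {seconds:.6f}s "
                  f"= {ms_each:.4f} ms per request")
```

The time inside io_submit() scaling linearly with the batch size (517, 248 and 507 requests all landing near 1 ms each) is the notable part; a truly asynchronous submit should return in roughly the same time regardless of how long each write takes.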
Answer 0 (score: 1)
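The problem is that when Linux scans the LUNs, each LUN advertises itself with "write cache enabled". This tells Linux that, because Oracle opens the LUNs with O_SYNC (or O_DSYNC?), it must use Force Unit Access (FUA) so that data is not lost if the contents of the cache are lost. This rests on a number of assumptions (that the cache is in RAM, that it is volatile, and so on), but we just have to accept it. FUA is bad news for performance. It also defeats issuing asynchronous I/O in parallel.
It turns out the array has a setting that controls whether it advertises its write-back cache (WBC) to the Linux server. It does not change how the array operates; it only changes what the array reports to the host. After changing the WBC setting on the array to "disabled", the Linux host prints a "write cache disabled" line when it scans the LUNs, and asynchronous writes now behave normally.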
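If you want to check what the host currently believes without digging through the boot log, the sd driver also exposes the cache mode it detected for each SCSI disk in sysfs, as a `cache_type` attribute under `/sys/class/scsi_disk/`. The sketch below just reads those files; the function names are made up for illustration, and I am assuming the attribute is present on your kernel and that any value beginning with "write back" corresponds to the "write cache enabled" case described above.

```python
from pathlib import Path


def scsi_cache_types(root="/sys/class/scsi_disk"):
    """Map each SCSI disk address (H:C:T:L) to the cache mode the host detected."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result  # not Linux, or no SCSI disks present
    for entry in sorted(base.iterdir()):
        try:
            result[entry.name] = (entry / "cache_type").read_text().strip()
        except OSError:
            continue  # attribute missing or unreadable for this device
    return result


def advertises_write_back(cache_type):
    """True if the reported mode indicates a write-back cache (assumption: prefix match)."""
    return cache_type.startswith("write back")


if __name__ == "__main__":
    for address, mode in scsi_cache_types().items():
        flag = "write-back advertised" if advertises_write_back(mode) else "no write-back"
        print(f"{address}: {mode} ({flag})")
```

After changing the setting on the array you would need the host to rescan or re-read the device's caching mode before this value changes.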