Ignite with persistence enabled cannot consume the WAL log and release the OS buffer
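I have an Ignite server with 128 GB of RAM, and I enabled persistence for data safety.
From what I read in the official documentation, my understanding is this: with persistence enabled, Ignite first saves data changes to the OS buffer (which I check as buff/cache in the output of the Linux command free -mh), then writes them to the WAL log. A checkpoint process then periodically analyzes the WAL records, releases the disk space of the processed WAL segments, and releases the OS buffer that was used. Please correct me if I am wrong.
But in my tests, once Ignite starts handling traffic, the OS buffer grows rapidly. When I check the WAL directory, many WAL segments have been generated in sequence, and their total size is almost the same as buff/cache.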
[root@Redis1 apache-ignite]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 109G 995M 1.7G 109G
Swap: 127G 0B 127G
After only a few minutes, the free column drops sharply while buff/cache grows quickly:
[root@Redis1 apache-ignite]# free -mh
total used free shared buff/cache available
Mem: 125G 15G 85G 995M 25G 108G
Swap: 127G 0B 127G
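The size and the number of WAL segments also keep increasing, and they still nearly match the size of buff/cache.
I checked the Ignite log; the checkpoint process runs every 3 minutes: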
[05:30:05,818][INFO][db-checkpoint-thread-#107][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=9428aebc-f2b0-4d33-bed6-fb9a1ad49848, startPtr=FileWALPointer [idx=341, fileOff=50223036, len=420491], checkpointLockWait=0ms, checkpointLockHoldTime=860ms, walCpRecordFsyncDuration=245ms, pages=89627, reason='timeout']
[05:30:22,429][INFO][db-checkpoint-thread-#107][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=9428aebc-f2b0-4d33-bed6-fb9a1ad49848, pages=89627, markPos=FileWALPointer [idx=341, fileOff=50223036, len=420491], walSegmentsCleared=0, markDuration=1288ms, pagesWrite=844ms, fsync=15767ms, total=17899ms]
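A line like the one above is easier to read once its numeric fields are pulled out. The following is a minimal sketch, not part of Ignite: the class name CheckpointLogSummary and its parse method are hypothetical, and the regular expression assumes the log layout shown in this question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Ignite API): extracts the numeric fields from a
// "Checkpoint finished" log line in the format shown in the question.
final class CheckpointLogSummary {
    // key=<digits> with an optional "ms" suffix, followed by ',' or ']'.
    // Non-numeric values such as cpId=<uuid> do not match and are skipped.
    private static final Pattern NUMERIC_FIELD =
        Pattern.compile("(\\w+)=(\\d+)(?:ms)?(?=[,\\]])");

    static Map<String, Long> parse(String logLine) {
        Map<String, Long> fields = new LinkedHashMap<>();
        Matcher m = NUMERIC_FIELD.matcher(logLine);
        while (m.find()) {
            fields.put(m.group(1), Long.parseLong(m.group(2)));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "[05:30:22,429][INFO][db-checkpoint-thread-#107]"
            + "[GridCacheDatabaseSharedManager] Checkpoint finished "
            + "[cpId=9428aebc-f2b0-4d33-bed6-fb9a1ad49848, pages=89627, "
            + "markPos=FileWALPointer [idx=341, fileOff=50223036, len=420491], "
            + "walSegmentsCleared=0, markDuration=1288ms, pagesWrite=844ms, "
            + "fsync=15767ms, total=17899ms]";
        Map<String, Long> f = parse(line);
        // Share of the whole checkpoint that was spent in fsync.
        long fsyncPercent = 100 * f.get("fsync") / f.get("total");
        System.out.println("fsync share of checkpoint time: " + fsyncPercent + "%");
        System.out.println("WAL segments cleared: " + f.get("walSegmentsCleared"));
    }
}
```

For the line above, fsync accounts for 15767 ms of the 17899 ms total, and walSegmentsCleared is 0.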
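But in the output of free -mh, the free column is never released. The cache grows as traffic increases and does not shrink even when traffic stops. If I keep sending traffic, free memory keeps falling until only a few hundred megabytes are left: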
[root@Redis1 apache-ignite]# free -mh
total used free shared buff/cache available
Mem: 125G 16G 370M 971M 108G 107G
Swap: 127G 0B 127G
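When this happens (is memory exhausted?), all of my Ignite-based services stop and no longer handle new requests; Ignite itself hangs.
I also noticed that the checkpoint log lines carry reason='timeout'. I do not know whether that means the checkpoint fails to process the WAL log and free the OS cache buffer correctly. Is there any way to make the checkpoint work properly and release the memory?
My question is: what can I do to prevent Ignite from using up the free memory, so that my service keeps running? I found that with persistence turned off, Ignite handles the load very fast and cache usage stays below 1 GB under the same traffic. With the persistence flag enabled, the OS cache grows rapidly and uses up all free memory, and then Ignite hangs and cannot recover from that state.
I tried many parameters: setting walMode to LOG_ONLY or BACKGROUND, setting -DIGNITE_WAL_MMAP=false on the JVM, and setting checkpointPageBufferSize. None of them saved my Ignite service; it still consumes the OS cache until it is exhausted.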
https://apacheignite.readme.io/docs/write-ahead-log https://apacheignite.readme.io/docs/durable-memory-tuning#section-checkpointing-buffer-size
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <!-- 10 GB initial size. -->
                <property name="initialSize" value="#{10L * 1024 * 1024 * 1024}"/>
                <!-- 50 GB maximum size. -->
                <property name="maxSize" value="#{50L * 1024 * 1024 * 1024}"/>
                <property name="persistenceEnabled" value="true"/>
                <property name="checkpointPageBufferSize" value="#{1024L * 1024 * 1024}"/>
            </bean>
        </property>
        <property name="writeThrottlingEnabled" value="true"/>
        <property name="walMode" value="LOG_ONLY"/>
        <property name="walPath" value="/wal/ebc"/>
        <property name="walArchivePath" value="/wal/ebc"/>
    </bean>
</property>
Below is my cache configuration:
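// Assumes a static import of CacheAtomicityMode.ATOMIC and that `ignite` and
// `ebcLvOneTxCache` are fields of the enclosing class.
public void createLvOneTxCache() {
    CacheConfiguration<String, OrderInfo> cacheCfg =
        new CacheConfiguration<>("LvOneTxCache");
    // REPLICATED: every server node keeps a full copy of the cache.
    cacheCfg.setCacheMode(CacheMode.REPLICATED);
    //cacheCfg.setStoreKeepBinary(true);
    cacheCfg.setAtomicityMode(ATOMIC);
    ebcLvOneTxCache = ignite.getOrCreateCache(cacheCfg);
}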
I tried changing the parameters, but the OS cache still keeps growing:
<!-- Enabling Apache Ignite native persistence. -->
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <!-- 4 GB initial size. -->
                <property name="initialSize" value="#{4L * 1024 * 1024 * 1024}"/>
                <!-- 4 GB maximum size. -->
                <property name="maxSize" value="#{4L * 1024 * 1024 * 1024}"/>
                <property name="persistenceEnabled" value="true"/>
                <property name="checkpointPageBufferSize" value="#{4L * 1024 * 1024 * 1024}"/>
            </bean>
        </property>
        <property name="checkpointFrequency" value="6000"/>
        <property name="checkpointThreads" value="32"/>
        <property name="writeThrottlingEnabled" value="true"/>
        <property name="walMode" value="LOG_ONLY"/>
        <property name="walPath" value="/wal/ebc"/>
        <property name="walArchivePath" value="/wal/ebc"/>
    </bean>
</property>
The Ignite log shows that checkpoints now run frequently, but the cache is still not released.
[07:51:20,165][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=fd0c7e68-564a-4b40-9516-bb2a451869e7, startPtr=FileWALPointer [idx=23, fileOff=47849256, len=420491], checkpointLockWait=0ms, checkpointLockHoldTime=77ms, walCpRecordFsyncDuration=233ms, pages=7744, reason='timeout']
[07:51:20,219][INFO][sys-stripe-0-#1][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.36, markDirty=16378 pages/sec, checkpointWrite=3322 pages/sec, estIdealMarkDirty=673642 pages/sec, curDirty=0.00, maxDirty=0.40, avgParkTime=21501 ns, pages: (total=7744, evicted=0, written=7744, synced=229, cpBufUsed=0, cpBufTotal=1036430)]
[07:51:22,303][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=fd0c7e68-564a-4b40-9516-bb2a451869e7, pages=7744, markPos=FileWALPointer [idx=23, fileOff=47849256, len=420491], walSegmentsCleared=0, markDuration=317ms, pagesWrite=24ms, fsync=2114ms, total=2456ms]
[07:51:26,117][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=d64991bc-3d2f-4f2c-8175-d7e92f46f0bf, startPtr=FileWALPointer [idx=25, fileOff=35951286, len=420491], checkpointLockWait=0ms, checkpointLockHoldTime=49ms, walCpRecordFsyncDuration=200ms, pages=7605, reason='timeout']
[07:51:28,612][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=d64991bc-3d2f-4f2c-8175-d7e92f46f0bf, pages=7605, markPos=FileWALPointer [idx=25, fileOff=35951286, len=420491], walSegmentsCleared=0, markDuration=266ms, pagesWrite=23ms, fsync=2472ms, total=2761ms]
[07:51:32,118][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=07246861-57ae-4ef5-8419-cb7710d2f72d, startPtr=FileWALPointer [idx=27, fileOff=38042090, len=420491], checkpointLockWait=6ms, checkpointLockHoldTime=60ms, walCpRecordFsyncDuration=185ms, pages=7186, reason='timeout']
[07:51:32,121][INFO][service-#232][PageMemoryImpl] Throttling is applied to page modifications [percentOfPartTime=0.24, markDirty=10738 pages/sec, checkpointWrite=2757 pages/sec, estIdealMarkDirty=310976 pages/sec, curDirty=0.00, maxDirty=0.07, avgParkTime=358945 ns, pages: (total=7186, evicted=0, written=896, synced=0, cpBufUsed=565, cpBufTotal=1036430)]
[07:51:34,534][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=07246861-57ae-4ef5-8419-cb7710d2f72d, pages=7186, markPos=FileWALPointer [idx=27, fileOff=38042090, len=420491], walSegmentsCleared=0, markDuration=257ms, pagesWrite=29ms, fsync=2387ms, total=2679ms]
[07:51:38,169][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=44e6870a-e370-4bd3-8ad9-8252abb0acd3, startPtr=FileWALPointer [idx=29, fileOff=44462293, len=420491], checkpointLockWait=0ms, checkpointLockHoldTime=76ms, walCpRecordFsyncDuration=210ms, pages=7529, reason='timeout']
[07:51:40,668][INFO][db-checkpoint-thread-#108][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=44e6870a-e370-4bd3-8ad9-8252abb0acd3, pages=7529, markPos=FileWALPointer [idx=29, fileOff=44462293, len=420491], walSegmentsCleared=0, markDuration=303ms, pagesWrite=24ms, fsync=2475ms, total=2802ms]
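The startPtr values in the "Checkpoint started" lines above allow a rough estimate of how fast the WAL is being written. The sketch below is only an illustration under stated assumptions: the class WalWriteRate is hypothetical, idx is treated as a running segment counter, and the segment size is assumed to be 64 MiB because the posted configuration does not set walSegmentSize.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper (not an Ignite API): estimates the WAL write rate from
// two FileWALPointer positions taken from "Checkpoint started" log lines.
final class WalWriteRate {
    // Assumption: 64 MiB per WAL segment; adjust if walSegmentSize is configured.
    static final long ASSUMED_SEGMENT_BYTES = 64L * 1024 * 1024;

    // Bytes written between pointer A (idxA, offA) and a later pointer B (idxB, offB).
    static long bytesBetween(long idxA, long offA, long idxB, long offB, long segmentBytes) {
        return (idxB - idxA) * segmentBytes + (offB - offA);
    }

    static double mibPerSecond(long bytes, long millis) {
        return (bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
    }

    public static void main(String[] args) {
        // Checkpoints started at 07:51:20,165 (idx=23) and 07:51:38,169 (idx=29).
        LocalTime first = LocalTime.of(7, 51, 20, 165_000_000);
        LocalTime last = LocalTime.of(7, 51, 38, 169_000_000);
        long millis = Duration.between(first, last).toMillis();

        long bytes = bytesBetween(23, 47849256L, 29, 44462293L, ASSUMED_SEGMENT_BYTES);
        System.out.printf("WAL written: %d bytes in %d ms (%.1f MiB/s)%n",
            bytes, millis, mibPerSecond(bytes, millis));
    }
}
```

Comparing such an estimate with the growth of buff/cache in the free output is one way to check whether WAL writes alone can explain it.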
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 107G 995M 3.5G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 107G 995M 3.5G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 107G 995M 3.5G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 105G 995M 5.6G 109G
Swap: 127G 0B 127G
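When I stop the traffic that updates the cache, the OS cache does return to normal, but very slowly. It takes a long time to be released even with the checkpoint frequency shortened to 6 s. How can I make it release quickly?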
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 104G 995M 6.5G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 104G 995M 6.3G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 104G 995M 6.3G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 106G 995M 4.6G 109G
Swap: 127G 0B 127G
[root@Redis1 node00-296a5110-74ad-45e0-bf9c-5c075a4f5fdf]# free -mh
total used free shared buff/cache available
Mem: 125G 14G 106G 995M 4.4G 109G
Answer 0 (score: 1)
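It is perfectly fine for the OS to cache disk data; this is explained well at "Linux ate my RAM". If your kernel supports it, you can always set the amount of memory that is kept free, so that there are fewer stalls when Ignite allocates new memory blocks.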
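To put numbers on the difference between the free and available columns, here is a minimal sketch that parses the Mem: line of free -h as printed in the question. The class FreeOutput and its methods are hypothetical names, and the code assumes the six-column layout shown there (total, used, free, shared, buff/cache, available) with binary units.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses the "Mem:" line of `free -h` in the six-column
// layout shown in the question. Other layouts of free are not handled.
final class FreeOutput {
    private static final String[] COLUMNS =
        {"total", "used", "free", "shared", "buff/cache", "available"};

    // Converts a token such as "370M", "3.5G" or "0B" to bytes (binary units).
    static long toBytes(String token) {
        // Some versions print "Gi" instead of "G"; drop the trailing 'i'.
        String t = token.endsWith("i") ? token.substring(0, token.length() - 1) : token;
        char unit = t.charAt(t.length() - 1);
        double value = Double.parseDouble(t.substring(0, t.length() - 1));
        switch (unit) {
            case 'B': return (long) value;
            case 'K': return (long) (value * (1L << 10));
            case 'M': return (long) (value * (1L << 20));
            case 'G': return (long) (value * (1L << 30));
            case 'T': return (long) (value * (1L << 40));
            default: throw new IllegalArgumentException("Unknown unit in: " + token);
        }
    }

    static Map<String, Long> parseMemLine(String memLine) {
        String[] tokens = memLine.trim().split("\\s+");
        if (tokens.length != COLUMNS.length + 1 || !tokens[0].equals("Mem:")) {
            throw new IllegalArgumentException("Unexpected line: " + memLine);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            result.put(COLUMNS[i], toBytes(tokens[i + 1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // The Mem: line from the question at the point where the service hung.
        Map<String, Long> mem = parseMemLine("Mem: 125G 16G 370M 971M 108G 107G");
        long mib = 1L << 20;
        System.out.println("free:       " + mem.get("free") / mib + " MiB");
        System.out.println("buff/cache: " + mem.get("buff/cache") / mib + " MiB");
        System.out.println("available:  " + mem.get("available") / mib + " MiB");
    }
}
```

In that line the free column is only 370M, while available is still 107G, because the kernel can reclaim most of buff/cache when applications need memory.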