How to abandon Ceph PGs that are stuck in "incomplete"?

Asked: 2016-09-02 22:28:15

Tags: ceph

We have been working hard to recover our Ceph cluster after losing a large number of OSDs. All of our PGs are now active, except for 80 PGs that are stuck in the "incomplete" state. These PGs reference osd.8, which we removed two weeks ago due to corruption.

We would like to abandon the "incomplete" PGs, since they are not recoverable. We have tried the following:

  1. Per the documentation, we made sure min_size on the corresponding pools is set to 1 (see the command sketch after this list). That did not clear the condition.
  2. Ceph will not let us issue "ceph osd lost N", because osd.8 has already been removed from the cluster.
  3. We also tried "ceph pg force_create_pg X" on all of these PGs. The 80 PGs moved to "creating" for a few minutes, but then all went back to "incomplete".
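
For reference, the attempts above amounted to roughly the following commands; the pool name and PG id are placeholders rather than our real values:

    # 1. lower min_size on the affected pool
    ceph osd pool set <poolname> min_size 1

    # 2. rejected, because osd.8 is no longer in the cluster
    ceph osd lost 8 --yes-i-really-mean-it

    # 3. force-recreate one of the incomplete PGs
    ceph pg force_create_pg <pgid>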

How do we abandon these PGs to allow recovery to continue? Is there some way to force individual PGs to be marked "lost"?

To remove the OSD, we used the procedure from:

http://docs.ceph.com/docs/jewel/rados/operations/add-or-rm-osds/#removing-osds-manual

Essentially:

    ceph osd crush remove 8    # remove osd.8 from the CRUSH map
    ceph auth del osd.8        # delete its cephx key
    ceph osd rm 8              # remove the OSD from the cluster map
    

Here is some miscellaneous data:

    djakubiec@dev:~$ ceph osd lost 8 --yes-i-really-mean-it
    osd.8 is not down or doesn't exist
    
    
    djakubiec@dev:~$ ceph osd tree
    ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 58.19960 root default
    -2  7.27489     host node24
     1  7.27489         osd.1        up  1.00000          1.00000
    -3  7.27489     host node25
     2  7.27489         osd.2        up  1.00000          1.00000
    -4  7.27489     host node26
     3  7.27489         osd.3        up  1.00000          1.00000
    -5  7.27489     host node27
     4  7.27489         osd.4        up  1.00000          1.00000
    -6  7.27489     host node28
     5  7.27489         osd.5        up  1.00000          1.00000
    -7  7.27489     host node29
     6  7.27489         osd.6        up  1.00000          1.00000
    -8  7.27539     host node30
     9  7.27539         osd.9        up  1.00000          1.00000
    -9  7.27489     host node31
     7  7.27489         osd.7        up  1.00000          1.00000
    

However, even though OSD 8 no longer exists, I still see many references to OSD 8 in various ceph dumps and queries.
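
For example, one place these references show up is the per-PG query output (the PG id below is just a placeholder for one of the 80 incomplete PGs):

    # list PGs that are stuck inactive -- the incomplete PGs appear here
    ceph pg dump_stuck inactive

    # query an individual PG; the recovery_state section shows which OSDs
    # peering still wants to probe
    ceph pg <pgid> query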

Interestingly, we also still see odd entries in the CRUSH map (should I do something about these?):

    # devices
    device 0 device0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3
    device 4 osd.4
    device 5 osd.5
    device 6 osd.6
    device 7 osd.7
    device 8 device8
    device 9 osd.9
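
That device listing is from the decompiled CRUSH map; for reference, this is roughly how we dump it (the output paths are arbitrary):

    # fetch the current CRUSH map and decompile it to text
    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt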
    

For what it's worth, here is ceph -s:

    cluster 10d47013-8c2a-40c1-9b4a-214770414234
     health HEALTH_ERR
            212 pgs are stuck inactive for more than 300 seconds
            93 pgs backfill_wait
            1 pgs backfilling
            101 pgs degraded
            63 pgs down
            80 pgs incomplete
            89 pgs inconsistent
            4 pgs recovery_wait
            1 pgs repair
            132 pgs stale
            80 pgs stuck inactive
            132 pgs stuck stale
            103 pgs stuck unclean
            97 pgs undersized
            2 requests are blocked > 32 sec
            recovery 4394354/46343776 objects degraded (9.482%)
            recovery 4025310/46343776 objects misplaced (8.686%)
            2157 scrub errors
            mds cluster is degraded
     monmap e1: 3 mons at {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
            election epoch 266, quorum 0,1,2 core,dev,db
      fsmap e3627: 1/1/1 up {0=core=up:replay}
     osdmap e4293: 8 osds: 8 up, 8 in; 144 remapped pgs
            flags sortbitwise
      pgmap v1866639: 744 pgs, 10 pools, 7668 GB data, 20673 kobjects
            8339 GB used, 51257 GB / 59596 GB avail
            4394354/46343776 objects degraded (9.482%)
            4025310/46343776 objects misplaced (8.686%)
                 362 active+clean
                 112 stale+active+clean
                  89 active+undersized+degraded+remapped+wait_backfill
                  66 active+clean+inconsistent
                  63 down+incomplete
                  19 stale+active+clean+inconsistent
                  17 incomplete
                   5 active+undersized+degraded+remapped
                   4 active+recovery_wait+degraded
                   2 active+undersized+degraded+remapped+inconsistent+wait_backfill
                   1 stale+active+clean+scrubbing+deep+inconsistent+repair
                   1 active+remapped+inconsistent+wait_backfill
                   1 active+clean+scrubbing+deep
                   1 active+remapped+wait_backfill
                   1 active+undersized+degraded+remapped+backfilling
    

0 Answers:

No answers yet.