Distributed storage: Ceph cluster CRUSH advanced usage

Posted by mechamecha on Mon, 22 Nov 2021 09:52:52 +0100

9. Ceph cluster crush advanced usage

9.1 ceph cluster maps

  • The mon servers in a ceph cluster maintain five cluster maps (the commands after this list show how to inspect each one):
    1. Monitor map # map of the monitors;
    2. OSD map # map of the OSDs;
    3. PG map # map of the placement groups;
    4. CRUSH map (Controlled Replication Under Scalable Hashing) # the controllable, replicated, and scalable consistent-hashing placement map; when a new storage pool is created, a new list of PG-to-OSD mappings is built from the OSD map to store the data, and its state is updated dynamically;
    5. MDS map # cephfs metadata server map;
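Each of these maps can be inspected from the command line; a quick reference (run on a node with admin credentials, output omitted here):

ceph mon dump        # monitor map
ceph osd dump        # OSD map
ceph pg dump summary # PG map (summary form)
ceph osd crush dump  # crush map in JSON form
ceph fs dump         # MDS / file system map (only meaningful when cephfs is deployed)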

9.2 crush bucket algorithms

  1. Uniform
  2. List
  3. Tree
  4. Straw
  5. Straw2 # used by default; see the check below
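To confirm which bucket algorithm the current buckets use, the crush map can be dumped as JSON and filtered:

ceph osd crush dump | grep '"alg"'   # each bucket reports its algorithm, e.g. "alg": "straw2"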

9.3 PG and OSD mapping adjustment

By default, the crush algorithm assigns OSDs to the PGs of a newly created pool on its own, but the way crush distributes data can be tuned manually through weights. For example, a 1 TB disk would get a weight of 1 and a 2 TB disk a weight of 2. Using devices of the same size throughout the cluster is recommended.

9.3.1 view current status

  • Weight: the relative capacity of the device; 1 TB corresponds to 1.00, so a 500 GB OSD should have a weight of about 0.5. The weight determines how many PGs an OSD receives based on its disk space: the crush algorithm assigns more PGs to OSDs with larger disks and fewer PGs to OSDs with smaller disks (a worked example follows the output below).
  • Reweight: this parameter rebalances the PGs that the crush algorithm has already distributed. The default distribution is probabilistic, so even identical OSDs can end up with a somewhat uneven number of PGs. Adjusting reweight makes the ceph cluster immediately rebalance the PGs on that OSD so that data is distributed evenly; in other words, reweight acts on PGs that have already been allocated and asks the cluster to redistribute them. Its value range is 0–1.
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB  318 MiB   27 MiB    4 KiB  291 MiB   20 GiB  1.56  0.72  114      up
 1    hdd  0.01949   1.00000   20 GiB  388 MiB   28 MiB    6 KiB  360 MiB   20 GiB  1.89  0.87  115      up
 2    hdd  0.01949   1.00000   20 GiB  451 MiB   33 MiB   25 KiB  418 MiB   20 GiB  2.20  1.02  124      up
 3    hdd  0.01949   1.00000   20 GiB  434 MiB   31 MiB   25 KiB  403 MiB   20 GiB  2.12  0.98  128      up
 4    hdd  0.01949   1.00000   20 GiB  377 MiB   34 MiB    8 KiB  342 MiB   20 GiB  1.84  0.85  116      up
 5    hdd  0.01949   1.00000   20 GiB  545 MiB   23 MiB    2 KiB  522 MiB   19 GiB  2.66  1.23  109      up
 6    hdd  0.01949   1.00000   20 GiB  433 MiB   18 MiB    9 KiB  415 MiB   20 GiB  2.11  0.98  124      up
 7    hdd  0.01949   1.00000   20 GiB  548 MiB   45 MiB   24 KiB  503 MiB   19 GiB  2.68  1.24  120      up
 8    hdd  0.01949   1.00000   20 GiB  495 MiB   26 MiB    5 KiB  469 MiB   20 GiB  2.42  1.12  109      up
                       TOTAL  180 GiB  3.9 GiB  264 MiB  113 KiB  3.6 GiB  176 GiB  2.16                   
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.35
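The WEIGHT column is derived from disk capacity: by convention a weight of 1.0 corresponds to 1 TiB, so a 20 GiB OSD gets roughly 20/1024 ≈ 0.0195, which matches the 0.01949 shown above. A quick check of the arithmetic (assumes bc is installed):

echo "scale=5; 20/1024" | bc   # ≈ 0.01953, the expected crush weight of a 20 GiB OSD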

9.3.2 modify the weight value

ceph@ceph-deploy:~/ceph-cluster$ ceph osd crush reweight osd.7 0.07 
reweighted item id 7 name 'osd.7' to 0.07 in crush map

9.3.3 verify the modified weight value

ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB  338 MiB   32 MiB    4 KiB  306 MiB   20 GiB  1.65  0.74  109      up
 1    hdd  0.01949   1.00000   20 GiB  397 MiB   25 MiB    6 KiB  372 MiB   20 GiB  1.94  0.87  117      up
 2    hdd  0.01949   1.00000   20 GiB  446 MiB   34 MiB   25 KiB  412 MiB   20 GiB  2.18  0.98  127      up
 3    hdd  0.01949   1.00000   20 GiB  447 MiB   32 MiB   25 KiB  414 MiB   20 GiB  2.18  0.98  129      up
 4    hdd  0.01949   1.00000   20 GiB  378 MiB   29 MiB    8 KiB  350 MiB   20 GiB  1.85  0.83  112      up
 5    hdd  0.01949   1.00000   20 GiB  569 MiB   31 MiB    2 KiB  538 MiB   19 GiB  2.78  1.25  112      up
 6    hdd  0.01949   1.00000   20 GiB  439 MiB   16 MiB    9 KiB  423 MiB   20 GiB  2.14  0.96   65      up
 7    hdd  0.06999   1.00000   20 GiB  598 MiB   60 MiB   24 KiB  538 MiB   19 GiB  2.92  1.31  228      up
 8    hdd  0.01949   1.00000   20 GiB  493 MiB   16 MiB    5 KiB  477 MiB   20 GiB  2.41  1.08   60      up
                       TOTAL  180 GiB  4.0 GiB  274 MiB  113 KiB  3.7 GiB  176 GiB  2.23                   
MIN/MAX VAR: 0.74/1.31  STDDEV: 0.39

9.3.4 modify the reweight value

ceph@ceph-deploy:~/ceph-cluster$ ceph osd reweight 6 0.6
reweighted osd.6 to 0.6 (9999)
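The value in parentheses appears to be the hexadecimal form of the stored override: ceph seems to keep reweight as a fraction of 0x10000, so 0.6 is stored as roughly 0x9999 (= 39321), and 39321/65536 ≈ 0.59999, which is the value that later shows up in the REWEIGHT column. Checking the arithmetic:

printf '%d\n' 0x9999              # 39321
echo "scale=5; 39321/65536" | bc  # ≈ 0.59999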

9.3.5 verify the modified reweight value

ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000   20 GiB  339 MiB   32 MiB    4 KiB  307 MiB   20 GiB  1.65  0.74  109      up
 1    hdd  0.01949   1.00000   20 GiB  397 MiB   25 MiB    6 KiB  372 MiB   20 GiB  1.94  0.87  117      up
 2    hdd  0.01949   1.00000   20 GiB  451 MiB   34 MiB   25 KiB  417 MiB   20 GiB  2.20  0.98  127      up
 3    hdd  0.01949   1.00000   20 GiB  451 MiB   32 MiB   25 KiB  419 MiB   20 GiB  2.20  0.98  129      up
 4    hdd  0.01949   1.00000   20 GiB  383 MiB   29 MiB    8 KiB  354 MiB   20 GiB  1.87  0.83  112      up
 5    hdd  0.01949   1.00000   20 GiB  569 MiB   31 MiB    2 KiB  539 MiB   19 GiB  2.78  1.24  112      up
 6    hdd  0.01949   0.59999   20 GiB  443 MiB   16 MiB    9 KiB  427 MiB   20 GiB  2.16  0.97   38      up
 7    hdd  0.06999   1.00000   20 GiB  604 MiB   60 MiB   24 KiB  544 MiB   19 GiB  2.95  1.32  247      up
 8    hdd  0.01949   1.00000   20 GiB  493 MiB   16 MiB    5 KiB  477 MiB   20 GiB  2.41  1.07   64      up
                       TOTAL  180 GiB  4.0 GiB  274 MiB  113 KiB  3.8 GiB  176 GiB  2.24                   
MIN/MAX VAR: 0.74/1.32  STDDEV: 0.40

9.4 crush map management

The exported crush map is in binary format and cannot be opened directly with a text editor. It must first be decompiled into text with the crushtool utility; only then can it be opened and edited with vim or another text editor.

9.4.1 export the crush map

root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73

9.4.2 convert the crush map to text

root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt 
/data/ceph/crushmap.txt: ASCII text

9.4.3 example of a crush map

root@ceph-deploy:~# cat /data/ceph/crushmap.txt 

tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

   # Current device list
device 0 osd.0 class hdd 
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd

 # types  # currently supported bucket types
type 0 osd  # an OSD daemon, corresponding to one disk device
type 1 host # a host
type 2 chassis # the chassis of a blade server
type 3 rack # a rack/cabinet containing several hosts
type 4 row # a row of racks, containing several cabinets
type 5 pdu # a power distribution unit (cabinet power socket)
type 6 pod # a group of racks / a small room within a machine room
type 7 room # a room containing several racks; a data center consists of many such rooms
type 8 datacenter # a data center (IDC)
type 9 zone  # an availability zone
type 10 region # a region, such as an AWS region
type 11 root # the top of the bucket hierarchy

 # buckets
host ceph-node-01 {
        id -3           # do not change unnecessarily  # bucket ID generated by ceph; there is normally no need to change it
        id -4 class hdd         # do not change unnecessarily
        # weight 0.058
        alg straw2  # the crush bucket algorithm used to manage the OSDs in this bucket
        hash 0  # rjenkins1  # which hash algorithm to use; 0 selects rjenkins1
        item osd.0 weight 0.019  # weight of osd.0; crush calculates it automatically from the disk capacity, so disks of different sizes get different weights
        item osd.1 weight 0.019
        item osd.2 weight 0.019
}
host ceph-node-02 {
        id -5           # do not change unnecessarily
        id -6 class hdd         # do not change unnecessarily
        # weight 0.058
        alg straw2
        hash 0  # rjenkins1
        item osd.3 weight 0.019
        item osd.4 weight 0.019
        item osd.5 weight 0.019
}
host ceph-node-03 {
        id -7           # do not change unnecessarily
        id -8 class hdd         # do not change unnecessarily
        # weight 0.109
        alg straw2
        hash 0  # rjenkins1
        item osd.6 weight 0.019
        item osd.7 weight 0.070
        item osd.8 weight 0.019
}
root default {
        id -1           # do not change unnecessarily
        id -2 class hdd         # do not change unnecessarily
        # weight 0.226
        alg straw2
        hash 0  # rjenkins1
        item ceph-node-01 weight 0.058
        item ceph-node-02 weight 0.058
        item ceph-node-03 weight 0.109
}

 # rules
rule replicated_rule {  # default rule for replicated pools
        id 0
        type replicated
        min_size 1
        max_size 10 # the default maximum number of replicas is 10
        step take default # start selection from the bucket named default
        step chooseleaf firstn 0 type host # choose leaf OSDs; the failure domain type is host
        step emit # emit the result, which is returned to the client
}

rule erasure-code {  # default rule for erasure-coded pools
        id 1
        type erasure
        min_size 3
        max_size 4
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type host
        step emit
}
# end crush map

9.4.4 edit the crush map

In /data/ceph/crushmap.txt, change max_size 10 to max_size 8 in the replicated_rule section.
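The edit can be done in vim, or with a one-line substitution such as the sketch below (it assumes the string "max_size 10" occurs only in the rule you intend to change):

sed -i 's/max_size 10/max_size 8/' /data/ceph/crushmap.txt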

9.4.5 convert the text back to crush binary format

root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap
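Before importing, the compiled map can be sanity-checked with crushtool's test mode, which simulates mappings without touching the cluster; for example, testing rule 0 with 3 replicas:

crushtool -i /data/ceph/newcrushmap --test --show-statistics --rule 0 --num-rep 3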

9.4.6 import the new crush map

root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap 

9.4.7 verify that the new crush map has taken effect

root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

9.5 crush data classification management

When the ceph crush algorithm assigns PGs, it can place them on OSDs that live on different hosts to achieve host-based high availability, and this is the default mechanism. However, it does not by itself guarantee that the replicas of a PG end up in different racks or machine rooms, nor can it ensure that, say, project A's data lands on SSDs while project B's data lands on mechanical disks. To achieve rack-level (or higher, IDC-level) availability, or this kind of device-based data classification, you need to export the crush map, edit it by hand, and then import it to overwrite the original crush map.
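The steps below achieve this by hand-editing the crush map. On recent releases a similar result can often be reached without editing the map at all by using device classes; a minimal sketch, assuming the target OSDs already report the ssd device class (the rule name ssd_class_rule is arbitrary):

ceph osd crush rule create-replicated ssd_class_rule default host ssd   # arguments: rule name, root, failure domain, device class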

9.5.1 export the crush map

root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73

9.5.2 convert the crush map to text

root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt 
/data/ceph/crushmap.txt: ASCII text

9.5.3 add custom configuration

Notes:

  1. Host (bucket) names must not duplicate existing ones
  2. buckets must be defined before the rules that reference them

# ssd node
host ceph-sshnode-01 {
        id -103         # do not change unnecessarily
        id -104 class hdd               # do not change unnecessarily
        # weight 0.098
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.019
}
host ceph-sshnode-02 {
        id -105         # do not change unnecessarily
        id -106 class hdd               # do not change unnecessarily
        # weight 0.098
        alg straw2
        hash 0  # rjenkins1
        item osd.5 weight 0.019
}
host ceph-sshnode-03 {
        id -107         # do not change unnecessarily
        id -108 class hdd               # do not change unnecessarily
        # weight 0.098
        alg straw2
        hash 0  # rjenkins1
        item osd.8 weight 0.019
}

# bucket
root ssd {
        id -127         # do not change unnecessarily
        id -11 class hdd                # do not change unnecessarily
        # weight 1.952
        alg straw
        hash 0  # rjenkins1
        item ceph-sshnode-01 weight 0.088
        item ceph-sshnode-02 weight 0.088
        item ceph-sshnode-03 weight 0.088
}

#ssd rules
rule ssd_rule {
        id 20
        type replicated
        min_size 1
        max_size 5
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

9.5.4 convert to crush binary format

root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap-01

9.5.5 import the new crush map

root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap-01 
76

9.5.6 verify that the new crush map has taken effect

root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 20,
        "rule_name": "ssd_rule",
        "ruleset": 20,
        "type": 1,
        "min_size": 1,
        "max_size": 5,
        "steps": [
            {
                "op": "take",
                "item": -127,
                "item_name": "ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

9.5.7 create a test storage pool

root@ceph-deploy:~# ceph osd pool create ssdpool 32 32 ssd_rule
pool 'ssdpool' created
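An existing pool can also be switched to the new rule after the fact; a hedged example, where <poolname> stands for whichever pool you want to move:

ceph osd pool set <poolname> crush_rule ssd_rule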

9.5.8 verify pg status

root@ceph-deploy:~# ceph pg ls-by-pool  ssdpool | awk '{print $1,$2,$15}' 
PG OBJECTS ACTING
28.0 0 [8,0,5]p8
28.1 0 [5,8,0]p5
28.2 0 [8,0,5]p8
28.3 0 [8,5,0]p8
28.4 0 [0,5,8]p0
28.5 0 [5,8,0]p5
28.6 0 [5,8,0]p5
28.7 0 [8,0,5]p8
28.8 0 [0,5,8]p0
28.9 0 [8,5,0]p8
28.a 0 [5,0,8]p5
28.b 0 [0,5,8]p0
28.c 0 [8,5,0]p8
28.d 0 [8,5,0]p8
28.e 0 [0,5,8]p0
28.f 0 [5,0,8]p5
28.10 0 [5,0,8]p5
28.11 0 [0,5,8]p0
28.12 0 [5,0,8]p5
28.13 0 [0,8,5]p0
28.14 0 [0,5,8]p0
28.15 0 [0,8,5]p0
28.16 0 [8,0,5]p8
28.17 0 [5,0,8]p5
28.18 0 [5,8,0]p5
28.19 0 [5,0,8]p5
28.1a 0 [5,8,0]p5
28.1b 0 [5,0,8]p5
28.1c 0 [8,5,0]p8
28.1d 0 [5,0,8]p5
28.1e 0 [0,8,5]p0
28.1f 0 [5,0,8]p5

As the listing above shows, the PGs of the newly created ssdpool are all placed on osd.0, osd.5, and osd.8, which matches the newly added rule.
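The placement of an individual object name can also be computed directly; a quick sketch, where testobj is just a hypothetical object name (the object does not need to exist):

ceph osd map ssdpool testobj   # prints the PG and the acting set of OSDs for this object name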

9.6 correspondence between nodes and OSDs

root@ceph-deploy:~# ceph osd tree
ID    CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-127         0.26399  root ssd                                           
-103         0.08800      host ceph-sshnode-01                           
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
-105         0.08800      host ceph-sshnode-02                           
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
-107         0.08800      host ceph-sshnode-03                           
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000
  -1         0.22499  root default                                       
  -3         0.05800      host ceph-node-01                              
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
   1    hdd  0.01900          osd.1                 up   1.00000  1.00000
   2    hdd  0.01900          osd.2                 up   1.00000  1.00000
  -5         0.05800      host ceph-node-02                              
   3    hdd  0.01900          osd.3                 up   1.00000  1.00000
   4    hdd  0.01900          osd.4                 up   1.00000  1.00000
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
  -7         0.10899      host ceph-node-03                              
   6    hdd  0.01900          osd.6                 up   0.59999  1.00000
   7    hdd  0.06999          osd.7                 up   1.00000  1.00000
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000