9 ceph cluster crush advanced usage
9.1 ceph cluster maps
- Five maps maintained by the mon servers in a ceph cluster:
- Monitor map # map of the monitors;
- OSD map # map of the OSDs;
- PG map # map of the PGs;
- Crush map (Controlled Replication Under Scalable Hashing) # a controllable, replicable and scalable consistent-hashing algorithm. When a new storage pool is created, a new list of PG-to-OSD combinations is generated from the OSD map to store data, and its running status is updated dynamically;
- MDS map # map of the cephfs metadata servers;
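Each of these maps can be inspected from the command line; a quick reference sketch (output formats vary slightly between releases):

ceph mon dump          # monitor map
ceph osd dump          # OSD map
ceph pg dump pgs_brief # PG map (brief listing)
ceph osd crush dump    # crush map in JSON form
ceph fs dump           # cephfs/MDS map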
9.2 crush algorithm
- Uniform
- List
- Tree
- Straw
- Straw2 # is used by default
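Which bucket algorithm each bucket actually uses, together with the crush tunables, can be checked as below; a minimal sketch:

ceph osd crush show-tunables       # shows tunables such as allowed_bucket_algs
ceph osd crush dump | grep '"alg"' # shows the algorithm (e.g. straw2) per bucket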
9.3 PG and OSD mapping adjustment
By default, the crush algorithm assigns OSDs to the PGs of a newly created pool on its own, but the crush algorithm's tendency when placing data can be adjusted manually through weights. For example, a 1T disk gets a weight of 1 and a 2T disk gets a weight of 2. Using devices of the same size is recommended.
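Weights are expressed in TiB, so the 20 GiB lab OSDs used below end up with a crush weight of roughly 20/1024 ≈ 0.0195, which matches the 0.01949 values in the WEIGHT column of ceph osd df in 9.3.1.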
9.3.1 view current status
- Weight: indicates the relative capacity of the device. 1TB corresponds to 1.00, so a 500G OSD should have a weight of 0.5. Weight determines how many PGs are allocated based on disk space: the crush algorithm assigns more PGs to OSDs with larger disks and fewer PGs to OSDs with smaller disks.
- Reweight: this parameter rebalances the PGs that the crush algorithm has already distributed. The default distribution is balanced by probability, but even with identical OSDs the resulting PG distribution can be somewhat uneven. Adjusting reweight makes the ceph cluster immediately rebalance the PGs on that OSD, achieving the goal of evenly distributed data. In other words, reweight redistributes PGs that have already been allocated across the cluster. Its value ranges from 0 to 1.
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP     META     AVAIL   %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000  20 GiB  318 MiB  27 MiB    4 KiB  291 MiB  20 GiB  1.56  0.72  114      up
 1    hdd  0.01949   1.00000  20 GiB  388 MiB  28 MiB    6 KiB  360 MiB  20 GiB  1.89  0.87  115      up
 2    hdd  0.01949   1.00000  20 GiB  451 MiB  33 MiB   25 KiB  418 MiB  20 GiB  2.20  1.02  124      up
 3    hdd  0.01949   1.00000  20 GiB  434 MiB  31 MiB   25 KiB  403 MiB  20 GiB  2.12  0.98  128      up
 4    hdd  0.01949   1.00000  20 GiB  377 MiB  34 MiB    8 KiB  342 MiB  20 GiB  1.84  0.85  116      up
 5    hdd  0.01949   1.00000  20 GiB  545 MiB  23 MiB    2 KiB  522 MiB  19 GiB  2.66  1.23  109      up
 6    hdd  0.01949   1.00000  20 GiB  433 MiB  18 MiB    9 KiB  415 MiB  20 GiB  2.11  0.98  124      up
 7    hdd  0.01949   1.00000  20 GiB  548 MiB  45 MiB   24 KiB  503 MiB  19 GiB  2.68  1.24  120      up
 8    hdd  0.01949   1.00000  20 GiB  495 MiB  26 MiB    5 KiB  469 MiB  20 GiB  2.42  1.12  109      up
                       TOTAL  180 GiB 3.9 GiB  264 MiB  113 KiB 3.6 GiB  176 GiB 2.16
MIN/MAX VAR: 0.72/1.24  STDDEV: 0.35
9.3.2 modify the weight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd crush reweight osd.7 0.07
reweighted item id 7 name 'osd.7' to 0.07 in crush map
9.3.3 verify the modified weight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP     META     AVAIL   %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000  20 GiB  338 MiB  32 MiB    4 KiB  306 MiB  20 GiB  1.65  0.74  109      up
 1    hdd  0.01949   1.00000  20 GiB  397 MiB  25 MiB    6 KiB  372 MiB  20 GiB  1.94  0.87  117      up
 2    hdd  0.01949   1.00000  20 GiB  446 MiB  34 MiB   25 KiB  412 MiB  20 GiB  2.18  0.98  127      up
 3    hdd  0.01949   1.00000  20 GiB  447 MiB  32 MiB   25 KiB  414 MiB  20 GiB  2.18  0.98  129      up
 4    hdd  0.01949   1.00000  20 GiB  378 MiB  29 MiB    8 KiB  350 MiB  20 GiB  1.85  0.83  112      up
 5    hdd  0.01949   1.00000  20 GiB  569 MiB  31 MiB    2 KiB  538 MiB  19 GiB  2.78  1.25  112      up
 6    hdd  0.01949   1.00000  20 GiB  439 MiB  16 MiB    9 KiB  423 MiB  20 GiB  2.14  0.96   65      up
 7    hdd  0.06999   1.00000  20 GiB  598 MiB  60 MiB   24 KiB  538 MiB  19 GiB  2.92  1.31  228      up
 8    hdd  0.01949   1.00000  20 GiB  493 MiB  16 MiB    5 KiB  477 MiB  20 GiB  2.41  1.08   60      up
                       TOTAL  180 GiB 4.0 GiB  274 MiB  113 KiB 3.7 GiB  176 GiB 2.23
MIN/MAX VAR: 0.74/1.31  STDDEV: 0.39
9.3.4 modify the reweight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd reweight 6 0.6
reweighted osd.6 to 0.6 (9999)
9.3.5 verify the modified reweight value
ceph@ceph-deploy:~/ceph-cluster$ ceph osd df
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA    OMAP     META     AVAIL   %USE  VAR   PGS  STATUS
 0    hdd  0.01949   1.00000  20 GiB  339 MiB  32 MiB    4 KiB  307 MiB  20 GiB  1.65  0.74  109      up
 1    hdd  0.01949   1.00000  20 GiB  397 MiB  25 MiB    6 KiB  372 MiB  20 GiB  1.94  0.87  117      up
 2    hdd  0.01949   1.00000  20 GiB  451 MiB  34 MiB   25 KiB  417 MiB  20 GiB  2.20  0.98  127      up
 3    hdd  0.01949   1.00000  20 GiB  451 MiB  32 MiB   25 KiB  419 MiB  20 GiB  2.20  0.98  129      up
 4    hdd  0.01949   1.00000  20 GiB  383 MiB  29 MiB    8 KiB  354 MiB  20 GiB  1.87  0.83  112      up
 5    hdd  0.01949   1.00000  20 GiB  569 MiB  31 MiB    2 KiB  539 MiB  19 GiB  2.78  1.24  112      up
 6    hdd  0.01949   0.59999  20 GiB  443 MiB  16 MiB    9 KiB  427 MiB  20 GiB  2.16  0.97   38      up
 7    hdd  0.06999   1.00000  20 GiB  604 MiB  60 MiB   24 KiB  544 MiB  19 GiB  2.95  1.32  247      up
 8    hdd  0.01949   1.00000  20 GiB  493 MiB  16 MiB    5 KiB  477 MiB  20 GiB  2.41  1.07   64      up
                       TOTAL  180 GiB 4.0 GiB  274 MiB  113 KiB 3.8 GiB  176 GiB 2.24
MIN/MAX VAR: 0.74/1.32  STDDEV: 0.40
9.4 crush map management
The exported crush map is in binary format and cannot be opened directly in a text editor. It must first be converted to text format with the crushtool utility before it can be opened and edited in vim or another text editor.
9.4.1 export the crush map
root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73
9.4.2 convert the crush map to text
root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt
/data/ceph/crushmap.txt: ASCII text
9.4.3 example crush map
root@ceph-deploy:~# cat /data/ceph/crushmap.txt
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# Current device list
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
# types # currently supported bucket types
type 0 osd # an OSD daemon, corresponding to one disk device
type 1 host # a host
type 2 chassis # a blade-server chassis
type 3 rack # a cabinet/rack containing several servers
type 4 row # a row containing several cabinets
type 5 pdu # the power distribution unit (power sockets) of a cabinet
type 6 pod # a small room within a machine room
type 7 room # a room containing several cabinets; a data center consists of many such rooms
type 8 datacenter # a data center or IDC
type 9 zone # an availability zone
type 10 region # a region, such as an AWS region
type 11 root # the top of the bucket hierarchy
# buckets
host ceph-node-01 {
id -3 # do not change unnecessarily # ID generated by ceph; change only when necessary
id -4 class hdd # do not change unnecessarily
# weight 0.058
alg straw2 # the crush bucket algorithm used to manage the OSDs in this bucket
hash 0 # rjenkins1 # which hash algorithm to use; 0 means rjenkins1
item osd.0 weight 0.019 # weight of osd.0; crush calculates it automatically from the disk size, so disks of different sizes have different weights
item osd.1 weight 0.019
item osd.2 weight 0.019
}
host ceph-node-02 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
# weight 0.058
alg straw2
hash 0 # rjenkins1
item osd.3 weight 0.019
item osd.4 weight 0.019
item osd.5 weight 0.019
}
host ceph-node-03 {
id -7 # do not change unnecessarily
id -8 class hdd # do not change unnecessarily
# weight 0.109
alg straw2
hash 0 # rjenkins1
item osd.6 weight 0.019
item osd.7 weight 0.070
item osd.8 weight 0.019
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
# weight 0.226
alg straw2
hash 0 # rjenkins1
item ceph-node-01 weight 0.058
item ceph-node-02 weight 0.058
item ceph-node-03 weight 0.109
}
# rules
rule replicated_rule { # Default configuration for replica pool
id 0
type replicated
min_size 1
max_size 10 # the default maximum number of replicas is 10
step take default # allocate OSDs starting from the hosts defined under the default root
step chooseleaf firstn 0 type host # select leaf nodes by host; the failure domain type is host
step emit # emit the result and return it to the client
}
rule erasure-code { # default rule for erasure-coded pools
id 1
type erasure
min_size 3
max_size 4
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step chooseleaf indep 0 type host
step emit
}
# end crush map
9.4.4 edit the crush map
Change max_size 10 to max_size 8
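The change can be made in vim, or with a one-line edit such as the following sketch (adjust the path if yours differs):

sed -i 's/max_size 10/max_size 8/' /data/ceph/crushmap.txt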
9.4.5 convert the text to crush binary format
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap
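Before importing, the compiled map can be sanity-checked offline with crushtool's test mode; a minimal sketch (rule id 0 is the replicated_rule shown above):

crushtool --test -i /data/ceph/newcrushmap --rule 0 --num-rep 3 --show-mappings    # print the OSDs each simulated object would map to
crushtool --test -i /data/ceph/newcrushmap --rule 0 --num-rep 3 --show-utilization # summarize how evenly the rule spreads data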
9.4.6 import the new crush map
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap
9.4.7 verify that the new crush map has taken effect
root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
9.5 crush data classification management
When the ceph crush algorithm allocates PGs, it can place them on OSDs belonging to different hosts to achieve host-based high availability, and this is the default mechanism. However, it does not by itself guarantee that the copies of a PG land in different cabinets or machine rooms, nor can it ensure that, for example, project A's data is stored on SSDs while project B's data is stored on mechanical disks. To achieve cabinet-level or higher IDC-level high availability, or SSD/HDD data separation, export the crush map, edit it manually, and then import it to overwrite the original crush map.
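As an aside (not the approach used in the rest of this section), recent Ceph releases can also achieve SSD/HDD separation through device classes without hand-editing the map; a minimal sketch, where the rule name ssd_byclass is arbitrary:

ceph osd crush class ls                                             # list the device classes detected on the OSDs
ceph osd crush rule create-replicated ssd_byclass default host ssd  # replicated rule restricted to the ssd device class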
9.5.1 export the crush map
root@ceph-deploy:~# mkdir -pv /data/ceph
mkdir: created directory '/data/ceph'
root@ceph-deploy:~# ceph osd getcrushmap -o /data/ceph/crushmap
73
9.5.2 convert the crush map to text
root@ceph-deploy:~# apt -y install ceph-base
root@ceph-deploy:~# crushtool -d /data/ceph/crushmap > /data/ceph/crushmap.txt
root@ceph-deploy:~# file /data/ceph/crushmap.txt
/data/ceph/crushmap.txt: ASCII text
9.5.3 add custom configuration
Note:
- Host names must not be duplicated
- buckets must be defined before rules
# ssd node
host ceph-sshnode-01 {
id -103 # do not change unnecessarily
id -104 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.0 weight 0.019
}
host ceph-sshnode-02 {
id -105 # do not change unnecessarily
id -106 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.5 weight 0.019
}
host ceph-sshnode-03 {
id -107 # do not change unnecessarily
id -108 class hdd # do not change unnecessarily
# weight 0.098
alg straw2
hash 0 # rjenkins1
item osd.8 weight 0.019
}
# bucket
root ssd {
id -127 # do not change unnecessarily
id -11 class hdd # do not change unnecessarily
# weight 1.952
alg straw
hash 0 # rjenkins1
item ceph-sshnode-01 weight 0.088
item ceph-sshnode-02 weight 0.088
item ceph-sshnode-03 weight 0.088
}
#ssd rules
rule ssd_rule {
id 20
type replicated
min_size 1
max_size 5
step take ssd
step chooseleaf firstn 0 type host
step emit
}
9.5.4 convert to crush binary format
root@ceph-deploy:~# crushtool -c /data/ceph/crushmap.txt -o /data/ceph/newcrushmap-01
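Before importing, the new ssd_rule (id 20) can be tested offline to confirm it only ever selects osd.0, osd.5 and osd.8; a minimal sketch:

crushtool --test -i /data/ceph/newcrushmap-01 --rule 20 --num-rep 3 --show-mappings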
9.5.5 import the new crush map
root@ceph-deploy:~# ceph osd setcrushmap -i /data/ceph/newcrushmap-01
76
9.5.6 verify that the new crush map has taken effect
root@ceph-deploy:~# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 8,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 20,
        "rule_name": "ssd_rule",
        "ruleset": 20,
        "type": 1,
        "min_size": 1,
        "max_size": 5,
        "steps": [
            {
                "op": "take",
                "item": -127,
                "item_name": "ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
9.5.7 test creating a storage pool
root@ceph-deploy:~# ceph osd pool create ssdpool 32 32 ssd_rule
pool 'ssdpool' created
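An existing pool can also be switched to the new rule afterwards; a sketch, with <poolname> standing in for a real pool name:

ceph osd pool set <poolname> crush_rule ssd_rule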
9.5.8 verify PG status
root@ceph-deploy:~# ceph pg ls-by-pool ssdpool | awk '{print $1,$2,$15}'
PG OBJECTS ACTING
28.0 0 [8,0,5]p8
28.1 0 [5,8,0]p5
28.2 0 [8,0,5]p8
28.3 0 [8,5,0]p8
28.4 0 [0,5,8]p0
28.5 0 [5,8,0]p5
28.6 0 [5,8,0]p5
28.7 0 [8,0,5]p8
28.8 0 [0,5,8]p0
28.9 0 [8,5,0]p8
28.a 0 [5,0,8]p5
28.b 0 [0,5,8]p0
28.c 0 [8,5,0]p8
28.d 0 [8,5,0]p8
28.e 0 [0,5,8]p0
28.f 0 [5,0,8]p5
28.10 0 [5,0,8]p5
28.11 0 [0,5,8]p0
28.12 0 [5,0,8]p5
28.13 0 [0,8,5]p0
28.14 0 [0,5,8]p0
28.15 0 [0,8,5]p0
28.16 0 [8,0,5]p8
28.17 0 [5,0,8]p5
28.18 0 [5,8,0]p5
28.19 0 [5,0,8]p5
28.1a 0 [5,8,0]p5
28.1b 0 [5,0,8]p5
28.1c 0 [8,5,0]p8
28.1d 0 [5,0,8]p5
28.1e 0 [0,8,5]p0
28.1f 0 [5,0,8]p5
- NOTE: As can be seen above, the PGs of the newly created ssdpool are distributed only across osd.0, osd.5 and osd.8, which complies with the added ssd_rule.
9.6 correspondence between nodes and OSDs
root@ceph-deploy:~# ceph osd tree
ID    CLASS  WEIGHT   TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
-127         0.26399  root ssd
-103         0.08800      host ceph-sshnode-01
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
-105         0.08800      host ceph-sshnode-02
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
-107         0.08800      host ceph-sshnode-03
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000
  -1         0.22499  root default
  -3         0.05800      host ceph-node-01
   0    hdd  0.01900          osd.0                 up   1.00000  1.00000
   1    hdd  0.01900          osd.1                 up   1.00000  1.00000
   2    hdd  0.01900          osd.2                 up   1.00000  1.00000
  -5         0.05800      host ceph-node-02
   3    hdd  0.01900          osd.3                 up   1.00000  1.00000
   4    hdd  0.01900          osd.4                 up   1.00000  1.00000
   5    hdd  0.01900          osd.5                 up   1.00000  1.00000
  -7         0.10899      host ceph-node-03
   6    hdd  0.01900          osd.6                 up   0.59999  1.00000
   7    hdd  0.06999          osd.7                 up   1.00000  1.00000
   8    hdd  0.01900          osd.8                 up   1.00000  1.00000