CRUSH stands for Controlled Replication Under Scalable Hashing. It is the distributed data-placement algorithm of Ceph and the core of the Ceph storage engine.
When a Ceph client reads or writes data, it computes the storage location of the data on the fly using CRUSH. Because of this, Ceph does not need to maintain a central lookup table of object locations, which removes a bottleneck and improves performance.
Ceph distributed storage is built around three Rs: replication, recovery and rebalancing. When a component fails, Ceph waits 300 seconds by default, then marks the OSD as down and out and starts the recovery operation. This wait time is controlled by the mon_osd_down_out_interval parameter in the cluster configuration.
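For example, the interval can be set in ceph.conf on the monitor nodes, or injected into running monitors; this is only a sketch (the value 600 is an illustration, and the exact runtime mechanism varies by Ceph release):

[mon]
mon_osd_down_out_interval = 600

[root@node3 ~]# ceph tell mon.* injectargs '--mon_osd_down_out_interval 600'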
When a new host or disk is added to the cluster, CRUSH starts rebalancing, migrating data from the existing hosts or disks to the new one. Rebalancing makes full use of all disks and improves cluster performance. If the Ceph cluster is heavily used, the recommended practice is to give newly added disks a CRUSH weight of 0 and then raise the weight gradually, so that data migration happens slowly and does not hurt performance (see the sketch below). This is recommended for any distributed storage system during capacity expansion.
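A minimal sketch of this practice, assuming a newly added OSD with id 6 (osd.6 is hypothetical, and the weight steps are only examples sized for the small disks in this lab cluster):

# start the new OSD with a CRUSH weight of 0, so no data migrates to it yet
[root@node3 ~]# ceph osd crush reweight osd.6 0
# then raise the weight in small steps, letting the cluster settle between steps
[root@node3 ~]# ceph osd crush reweight osd.6 0.005
[root@node3 ~]# ceph osd crush reweight osd.6 0.00980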
In practice, you may often need to adjust the layout of the cluster. The default CRUSH layout is very simple: when you execute the ceph osd tree command, you will see only two bucket types under root, host and osd. This default layout is poor for failure isolation, since there are no concepts such as rack, row and room. Next, we add buckets of type rack and place all hosts under racks.
(1) Execute ceph osd tree to get the current cluster layout:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 0   hdd 0.00980         osd.0      up  1.00000 1.00000
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
-5       0.01959     host node2
 1   hdd 0.00980         osd.1      up  1.00000 1.00000
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
-7       0.01959     host node3
 2   hdd 0.00980         osd.2      up  1.00000 1.00000
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
(2) Add racks:
[root@node3 ~]# ceph osd crush add-bucket rack03 rack
added bucket rack03 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
(3) Move the hosts under the racks:
[root@node3 ~]# ceph osd crush move node1 rack=rack01
moved item id -3 name 'node1' to location {rack=rack01} in crush map
[root@node3 ~]# ceph osd crush move node2 rack=rack02
moved item id -5 name 'node2' to location {rack=rack02} in crush map
[root@node3 ~]# ceph osd crush move node3 rack=rack03
moved item id -7 name 'node3' to location {rack=rack03} in crush map
(4) Move the racks to the default root:
[root@node3 ~]# ceph osd crush move rack01 root=default
moved item id -9 name 'rack01' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack02 root=default
moved item id -10 name 'rack02' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack03 root=default
moved item id -11 name 'rack03' to location {root=default} in crush map
(5) Run the ceph osd tree command again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -1       0.05878 root default
 -9       0.01959     rack rack01
 -3       0.01959         host node1
  0   hdd 0.00980             osd.0      up  1.00000 1.00000
  3   hdd 0.00980             osd.3      up  1.00000 1.00000
-10       0.01959     rack rack02
 -5       0.01959         host node2
  1   hdd 0.00980             osd.1      up  1.00000 1.00000
  4   hdd 0.00980             osd.4      up  1.00000 1.00000
-11       0.01959     rack rack03
 -7       0.01959         host node3
  2   hdd 0.00980             osd.2      up  1.00000 1.00000
  5   hdd 0.00980             osd.5      up  1.00000 1.00000
You will see that a new layout has been generated and every host is now located under a specific rack. With these operations the adjustment of the CRUSH layout is complete.
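Note that the default replicated_rule still spreads replicas across hosts, not racks. To make the new racks act as the failure domain, you would also need a rack-level rule and point a pool at it; a minimal sketch for Luminous or later (the rule name replicated_rack and the <pool> placeholder are only examples):

[root@node3 ~]# ceph osd crush rule create-replicated replicated_rack default rack
[root@node3 ~]# ceph osd pool set <pool> crush_rule replicated_rack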
For a known object, you can find out where it is stored according to the CRUSH algorithm. For example, store a file test.txt in the data pool:
[root@node3 ~]# echo "this is test! " >> test.txt
[root@node3 ~]# rados -p data ls
[root@node3 ~]# rados -p data put test.txt test.txt
[root@node3 ~]# rados -p data ls
test.txt
Display its placement (object → PG → OSDs):
[root@node3 ~]# ceph osd map data test.txt
osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) -> up ([3,4,2], p3) acting ([3,4,2], p3)
The CRUSH map describes the storage architecture of the Ceph cluster and may need to be adjusted fairly often in practice. As shown below, dump it first and then decompile it into plain text for viewing.
[root@node3 ~]# ceph osd getcrushmap -o crushmap
22
[root@node3 ~]# crushtool -d crushmap -o crushmap
[root@node3 ~]# cat crushmap
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node1 {
    id -3           # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.010
    item osd.3 weight 0.010
}
rack rack01 {
    id -9           # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node1 weight 0.020
}
host node2 {
    id -5           # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.1 weight 0.010
    item osd.4 weight 0.010
}
rack rack02 {
    id -10          # do not change unnecessarily
    id -13 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node2 weight 0.020
}
host node3 {
    id -7           # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.2 weight 0.010
    item osd.5 weight 0.010
}
rack rack03 {
    id -11          # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node3 weight 0.020
}
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
    item rack02 weight 0.020
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
The decompiled CRUSH map contains several sections, roughly:
Devices: the content after # devices above. This is the list of OSDs in the cluster. It is updated automatically when OSDs are added or removed; you normally do not need to edit it, Ceph maintains it by itself.
Bucket types: the content after # types above. This defines the bucket types, including root, datacenter, room, row, rack, host, osd and so on. The default bucket types are enough for most Ceph clusters, but you can add your own.
Bucket definitions: the content after # buckets above. This is where the bucket hierarchy is defined, along with the algorithm each bucket uses.
Rules: the content after # rules above. A rule defines which buckets are selected for the data stored in a pool. Larger clusters have multiple pools, and each pool can have its own placement rule.
As a practical application of the CRUSH map, we can define a pool named ssd that uses only SSD disks for better performance, and a pool named sata that uses only SATA disks for better economy. Suppose there are three Ceph storage nodes, each running its own OSD services.
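As an aside, on Luminous and later the same goal is often reached with CRUSH device classes instead of separate roots, for example a rule restricted to the ssd class; this is only a sketch and assumes OSDs that actually report an ssd device class, which is not the case in the lab cluster above:

[root@node3 ~]# ceph osd crush rule create-replicated ssd-rule default host ssd

The walkthrough below keeps the manual approach of editing the decompiled CRUSH map.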
First, modify the root default in the crushmap file to:
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
}
The change is to its item entries: the lines item rack02 weight 0.020 and item rack03 weight 0.020 are removed.
Then add the following content:
root ssd {
    id -15
    alg straw
    hash 0
    item rack02 weight 0.020
}
root sata {
    id -16
    alg straw
    hash 0
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-pool {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type osd
    step emit
}
rule sata-pool {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take sata
    step chooseleaf firstn 0 type osd
    step emit
}
In rule sata-pool (ruleset 2), step take sata means placement starts from the sata bucket.
In rule ssd-pool (ruleset 1), step take ssd means placement starts from the ssd bucket.
Note that bucket ids must not be repeated.
Compile the file and upload it to the cluster:
[root@node3 ~]# crushtool -c crushmap -o crushmap.new
[root@node3 ~]# ceph osd setcrushmap -i crushmap.new
23
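Before touching the pools, you can confirm that the new rules made it into the cluster; a quick check (output omitted here):

[root@node3 ~]# ceph osd crush rule ls
[root@node3 ~]# ceph osd crush rule dump ssd-pool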
View the cluster layout again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-16       0.01999 root sata
-11       0.01999     rack rack03
 -7       0.01999         host node3
  2   hdd 0.00999             osd.2      up  1.00000 1.00000
  5   hdd 0.00999             osd.5      up  1.00000 1.00000
-15       0.01999 root ssd
-10       0.01999     rack rack02
 -5       0.01999         host node2
  1   hdd 0.00999             osd.1      up  1.00000 1.00000
  4   hdd 0.00999             osd.4      up  1.00000 1.00000
 -1       0.01999 root default
 -9       0.01999     rack rack01
 -3       0.01999         host node1
  0   hdd 0.00999             osd.0      up  1.00000 1.00000
  3   hdd 0.00999             osd.3      up  1.00000 1.00000
Next, check that ceph -s reports a healthy status. If the cluster is healthy, create two pools:
[root@node3 ~]# ceph osd pool create sata 64 64
pool 'sata' created
[root@node3 ~]# ceph osd pool create ssd 64 64
pool 'ssd' created
Assign CRUSH rules to the two newly created pools:
[root@node3 ~]# ceph osd pool set sata crush_rule sata-pool
set pool 2 crush_rule to sata-pool
[root@node3 ~]# ceph osd pool set ssd crush_rule ssd-pool
set pool 3 crush_rule to ssd-pool
Check whether the rules have taken effect:
[root@node3 ~]# ceph osd dump | egrep -i "ssd|sata"
pool 2 'sata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 55 flags hashpspool stripe_width 0
pool 3 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 60 flags hashpspool stripe_width 0
Now objects written to the sata pool are preferentially placed on the SATA devices, and objects written to the ssd pool are preferentially placed on the SSD devices.
Test with the rados command:
[root@node3 ~]# touch file.ssd
[root@node3 ~]# touch file.sata
[root@node3 ~]# rados -p ssd put filename file.ssd
[root@node3 ~]# rados -p sata put filename file.sata
Finally, use the ceph osd map command to check their storage locations:
[root@node3 ~]# ceph osd map ssd file.ssd
osdmap e69 pool 'ssd' (3) object 'file.ssd' -> pg 3.46b33220 (3.20) -> up ([4,1], p4) acting ([4,1,0], p4)
[root@node3 ~]# ceph osd map sata file.sata
osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)
You can see that objects written to each pool are preferentially stored on devices of the corresponding type.
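One caveat with hand-edited hierarchies like this: by default an OSD updates its own CRUSH location when it starts and may move back under its host in the default root. If you rely on custom roots, it is common to disable that behaviour; a hedged ceph.conf sketch (check the option name and default for your release):

[osd]
osd crush update on start = false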
Transferred from: https://www.cnblogs.com/sisimi/p/7799980.html