CRUSH stands for Controlled Replication Under Scalable Hashing. It is the distributed data-placement algorithm of Ceph and the core of the Ceph storage engine.
When a Ceph client reads or writes data, it computes the storage location of the data on the fly using CRUSH. Because of this, Ceph does not need to maintain a central lookup table of object locations, which removes a bottleneck and improves performance.
Ceph distributed storage is built around three Rs: replication, recovery and rebalancing. When a component fails, Ceph waits 300 seconds by default, then marks the OSD as down and out and starts the recovery operation. This wait time is controlled by the mon_osd_down_out_interval parameter in the cluster configuration.
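For example, the interval can be set in ceph.conf on the monitor nodes, or injected into running monitors; this is only a sketch (the value 600 is an illustration, and the exact runtime mechanism varies by Ceph release):

[mon]
mon_osd_down_out_interval = 600

[root@node3 ~]# ceph tell mon.* injectargs '--mon_osd_down_out_interval 600'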
When a new host or disk is added to the cluster, CRUSH starts rebalancing, migrating data from the existing hosts or disks to the new one. Rebalancing makes full use of all disks and improves cluster performance. If the Ceph cluster is heavily used, the recommended practice is to give newly added disks a CRUSH weight of 0 and then raise the weight gradually, so that data migration happens slowly and does not hurt performance (see the sketch below). This is recommended for any distributed storage system during capacity expansion.
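A minimal sketch of this practice, assuming a newly added OSD with id 6 (osd.6 is hypothetical, and the weight steps are only examples sized for the small disks in this lab cluster):

# start the new OSD with a CRUSH weight of 0, so no data migrates to it yet
[root@node3 ~]# ceph osd crush reweight osd.6 0
# then raise the weight in small steps, letting the cluster settle between steps
[root@node3 ~]# ceph osd crush reweight osd.6 0.005
[root@node3 ~]# ceph osd crush reweight osd.6 0.00980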
In practice, you may often need to adjust the layout of the cluster. The default CRUSH layout is very simple: when you execute the ceph osd tree command, you will see only two bucket types under root, host and osd. This default layout is poor for failure isolation, since there are no concepts such as rack, row and room. Next, we add buckets of type rack and place all hosts under racks.
(1) Execute ceph osd tree to get the current cluster layout:
[root@node3 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       0.05878 root default
-3       0.01959     host node1
 0   hdd 0.00980         osd.0      up  1.00000 1.00000
 3   hdd 0.00980         osd.3      up  1.00000 1.00000
-5       0.01959     host node2
 1   hdd 0.00980         osd.1      up  1.00000 1.00000
 4   hdd 0.00980         osd.4      up  1.00000 1.00000
-7       0.01959     host node3
 2   hdd 0.00980         osd.2      up  1.00000 1.00000
 5   hdd 0.00980         osd.5      up  1.00000 1.00000
(2) Add racks:
[root@node3 ~]# ceph osd crush add-bucket rack03 rack
added bucket rack03 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack01 rack
added bucket rack01 type rack to crush map
[root@node3 ~]# ceph osd crush add-bucket rack02 rack
added bucket rack02 type rack to crush map
(3) Move the hosts under the racks:
[root@node3 ~]# ceph osd crush move node1 rack=rack01
moved item id -3 name 'node1' to location {rack=rack01} in crush map
[root@node3 ~]# ceph osd crush move node2 rack=rack02
moved item id -5 name 'node2' to location {rack=rack02} in crush map
[root@node3 ~]# ceph osd crush move node3 rack=rack03
moved item id -7 name 'node3' to location {rack=rack03} in crush map
(4) Move the racks to the default root:
[root@node3 ~]# ceph osd crush move rack01 root=default
moved item id -9 name 'rack01' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack02 root=default
moved item id -10 name 'rack02' to location {root=default} in crush map
[root@node3 ~]# ceph osd crush move rack03 root=default
moved item id -11 name 'rack03' to location {root=default} in crush map
(5) Run the ceph osd tree command again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
 -1       0.05878 root default
 -9       0.01959     rack rack01
 -3       0.01959         host node1
  0   hdd 0.00980             osd.0      up  1.00000 1.00000
  3   hdd 0.00980             osd.3      up  1.00000 1.00000
-10       0.01959     rack rack02
 -5       0.01959         host node2
  1   hdd 0.00980             osd.1      up  1.00000 1.00000
  4   hdd 0.00980             osd.4      up  1.00000 1.00000
-11       0.01959     rack rack03
 -7       0.01959         host node3
  2   hdd 0.00980             osd.2      up  1.00000 1.00000
  5   hdd 0.00980             osd.5      up  1.00000 1.00000
You will see that a new layout has been generated and every host is now located under a specific rack. With these operations the adjustment of the CRUSH layout is complete.
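Note that the default replicated_rule still spreads replicas across hosts, not racks. To make the new racks act as the failure domain, you would also need a rack-level rule and point a pool at it; a minimal sketch for Luminous or later (the rule name replicated_rack and the <pool> placeholder are only examples):

[root@node3 ~]# ceph osd crush rule create-replicated replicated_rack default rack
[root@node3 ~]# ceph osd pool set <pool> crush_rule replicated_rack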
For a known object, you can find out where it is stored according to the CRUSH algorithm. For example, store a file test.txt in the data pool:
[root@node3 ~]# echo "this is test! " >> test.txt
[root@node3 ~]# rados -p data ls
[root@node3 ~]# rados -p data put test.txt test.txt
[root@node3 ~]# rados -p data ls
test.txt
Display its placement (object → PG → OSDs):
[root@node3 ~]# ceph osd map data test.txt
osdmap e42 pool 'data' (1) object 'test.txt' -> pg 1.8b0b6108 (1.8) -> up ([3,4,2], p3) acting ([3,4,2], p3)
The CRUSH map describes the storage architecture of the Ceph cluster and may need to be adjusted fairly often in practice. As shown below, dump it first and then decompile it into plain text for viewing.
[root@node3 ~]# ceph osd getcrushmap -o crushmap
22
[root@node3 ~]# crushtool -d crushmap -o crushmap
[root@node3 ~]# cat crushmap
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host node1 {
    id -3           # do not change unnecessarily
    id -4 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.0 weight 0.010
    item osd.3 weight 0.010
}
rack rack01 {
    id -9           # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node1 weight 0.020
}
host node2 {
    id -5           # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.1 weight 0.010
    item osd.4 weight 0.010
}
rack rack02 {
    id -10          # do not change unnecessarily
    id -13 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node2 weight 0.020
}
host node3 {
    id -7           # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item osd.2 weight 0.010
    item osd.5 weight 0.010
}
rack rack03 {
    id -11          # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    # weight 0.020
    alg straw2
    hash 0          # rjenkins1
    item node3 weight 0.020
}
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
    item rack02 weight 0.020
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
The decompiled CRUSH map contains several sections, roughly:
Devices: the content after # devices above. This is the list of OSDs in the cluster. It is updated automatically when OSDs are added or removed; you normally do not need to edit it, Ceph maintains it by itself.
Bucket types: the content after # types above. This defines the bucket types, including root, datacenter, room, row, rack, host, osd and so on. The default bucket types are enough for most Ceph clusters, but you can add your own.
Bucket definitions: the content after # buckets above. This is where the bucket hierarchy is defined, along with the algorithm each bucket uses.
Rules: the content after # rules above. A rule defines which buckets are selected for the data stored in a pool. Larger clusters have multiple pools, and each pool can have its own placement rule.
As a practical application of the CRUSH map, we can define a pool named ssd that uses only SSD disks for better performance, and a pool named sata that uses only SATA disks for better economy. Suppose there are three Ceph storage nodes, each running its own OSD services.
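As an aside, on Luminous and later the same goal is often reached with CRUSH device classes instead of separate roots, for example a rule restricted to the ssd class; this is only a sketch and assumes OSDs that actually report an ssd device class, which is not the case in the lab cluster above:

[root@node3 ~]# ceph osd crush rule create-replicated ssd-rule default host ssd

The walkthrough below keeps the manual approach of editing the decompiled CRUSH map.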
First, modify the root default in the crushmap file to:
root default {
    id -1           # do not change unnecessarily
    id -2 class hdd # do not change unnecessarily
    # weight 0.059
    alg straw2
    hash 0          # rjenkins1
    item rack01 weight 0.020
}
The change is to its item entries: the lines item rack02 weight 0.020 and item rack03 weight 0.020 are removed.
Then add the following content:
root ssd {
    id -15
    alg straw
    hash 0
    item rack02 weight 0.020
}
root sata {
    id -16
    alg straw
    hash 0
    item rack03 weight 0.020
}

# rules
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule ssd-pool {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step chooseleaf firstn 0 type osd
    step emit
}
rule sata-pool {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take sata
    step chooseleaf firstn 0 type osd
    step emit
}
In rule sata-pool (ruleset 2), step take sata means placement starts from the sata bucket.
In rule ssd-pool (ruleset 1), step take ssd means placement starts from the ssd bucket.
Note that bucket ids must not be repeated.
Compile the file and upload it to the cluster:
[root@node3 ~]# crushtool -c crushmap -o crushmap.new
[root@node3 ~]# ceph osd setcrushmap -i crushmap.new
23
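Before touching the pools, you can confirm that the new rules made it into the cluster; a quick check (output omitted here):

[root@node3 ~]# ceph osd crush rule ls
[root@node3 ~]# ceph osd crush rule dump ssd-pool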
View the cluster layout again:
[root@node3 ~]# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-16       0.01999 root sata
-11       0.01999     rack rack03
 -7       0.01999         host node3
  2   hdd 0.00999             osd.2      up  1.00000 1.00000
  5   hdd 0.00999             osd.5      up  1.00000 1.00000
-15       0.01999 root ssd
-10       0.01999     rack rack02
 -5       0.01999         host node2
  1   hdd 0.00999             osd.1      up  1.00000 1.00000
  4   hdd 0.00999             osd.4      up  1.00000 1.00000
 -1       0.01999 root default
 -9       0.01999     rack rack01
 -3       0.01999         host node1
  0   hdd 0.00999             osd.0      up  1.00000 1.00000
  3   hdd 0.00999             osd.3      up  1.00000 1.00000
Next, check that ceph -s reports a healthy status. If the cluster is healthy, create two pools:
[root@node3 ~]# ceph osd pool create sata 64 64
pool 'sata' created
[root@node3 ~]# ceph osd pool create ssd 64 64
pool 'ssd' created
Assign CRUSH rules to the two newly created pools:
[root@node3 ~]# ceph osd pool set sata crush_rule sata-pool
set pool 2 crush_rule to sata-pool
[root@node3 ~]# ceph osd pool set ssd crush_rule ssd-pool
set pool 3 crush_rule to ssd-pool
Check whether the rules have taken effect:
[root@node3 ~]# ceph osd dump | egrep -i "ssd|sata"
pool 2 'sata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 55 flags hashpspool stripe_width 0
pool 3 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 60 flags hashpspool stripe_width 0
Now objects written to the sata pool are preferentially placed on the SATA devices, and objects written to the ssd pool are preferentially placed on the SSD devices.
Test with the rados command:
[root@node3 ~]# touch file.ssd
[root@node3 ~]# touch file.sata
[root@node3 ~]# rados -p ssd put filename file.ssd
[root@node3 ~]# rados -p sata put filename file.sata
Finally, use the ceph osd map command to check their storage locations:
[root@node3 ~]# ceph osd map ssd file.ssd
osdmap e69 pool 'ssd' (3) object 'file.ssd' -> pg 3.46b33220 (3.20) -> up ([4,1], p4) acting ([4,1,0], p4)
[root@node3 ~]# ceph osd map sata file.sata
osdmap e69 pool 'sata' (2) object 'file.sata' -> pg 2.df856dd1 (2.11) -> up ([5,2], p5) acting ([5,2,0], p5)
You can see that objects written to each pool are preferentially stored on devices of the corresponding type.
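One caveat with hand-edited hierarchies like this: by default an OSD updates its own CRUSH location when it starts and may move back under its host in the default root. If you rely on custom roots, it is common to disable that behaviour; a hedged ceph.conf sketch (check the option name and default for your release):

[osd]
osd crush update on start = false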
Transferred from: https://www.cnblogs.com/sisimi/p/7799980.html