When OpenStack uses Ceph storage, what exactly does Ceph do?

Posted by kroll on Mon, 03 Jan 2022 08:28:11 +0100

 

 

1 background knowledge

 

1.1 Ceph introduction

Ceph is a very popular open source distributed storage system. It offers high scalability, high performance and high reliability, and provides a block storage service (rbd), an object storage service (rgw) and a file system storage service (cephfs). It is currently the mainstream back-end storage for OpenStack, integrates closely with it, and can provide a unified shared storage service for OpenStack. Using Ceph as the OpenStack back-end storage has the following advantages:

 

  All compute nodes share the storage. During migration there is no need to copy the root disk, and even if a compute node dies the virtual machine can be started immediately on another compute node (evacuate).
  Thanks to the COW (Copy-On-Write) feature, creating a virtual machine only requires cloning the image rather than downloading the whole thing. The clone operation has essentially zero overhead, which makes second-level virtual machine creation possible.
  Ceph RBD supports thin provisioning, i.e. space is allocated on demand, somewhat like sparse files in a Linux file system. A newly created 20 GB virtual disk occupies no physical storage space at first; space is allocated only when data is actually written.

 

For more information about Ceph, please refer to the official documentation. Here we only focus on RBD. The core object managed by RBD is the block device, usually called a volume, although in Ceph it is customarily called an image (note the difference from an OpenStack image). Ceph also has the concept of a pool, which is similar to a namespace; different pools can define different replica counts, pg numbers, placement strategies, and so on. Every image belongs to a pool. The naming convention for an image is pool_name/image_name@snapshot; for example, openstack/test-volume@test-snap refers to the test-snap snapshot of the test-volume image in the openstack pool. Therefore, the following two commands have the same effect:

 

rbd snap create --pool openstack --image test-image --snap test-snap
rbd snap create openstack/test-image@test-snap

 

 

To create a 1 GB image in the openstack pool, the command is:

 

rbd -p openstack create --size 1024 int32bit-test-1

 

Images support snapshots. Creating a snapshot saves the current state of the image, which is equivalent to a git commit, and the image can be rolled back to any snapshot point at any time (git reset). The command to create a snapshot is:

rbd -p openstack snap create int32bit-test-1@snap-1
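For completeness, the rollback (the git reset analogue mentioned above) is done with rbd snap rollback, or via the rbd Python bindings that the OpenStack code quoted later also uses. A minimal sketch, assuming the default /etc/ceph/ceph.conf and the image and snapshot names above:

import rados
import rbd

# Roll int32bit-test-1 back to snap-1, equivalent to:
#   rbd snap rollback openstack/int32bit-test-1@snap-1
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('openstack')
try:
    with rbd.Image(ioctx, 'int32bit-test-1') as image:
        image.rollback_to_snap('snap-1')
finally:
    ioctx.close()
    cluster.shutdown()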

 

To view the rbd list:

 

$ rbd -p openstack ls -l | grep int32bit-test
int32bit-test-1        1024M 2
int32bit-test-1@snap-1 1024M 2

 

 

Based on a snapshot you can create a new image, called a clone. A clone does not copy the original image immediately; it uses the COW (copy-on-write) policy, so an object is copied from the parent to the clone only when it is written. Clone operations therefore complete essentially in seconds. Note also that all images cloned from the same snapshot share the image data that existed before the snapshot, so the snapshot must be protected before cloning, and a protected snapshot cannot be deleted. The clone operation is similar to git branch. Cloning an image looks like this:

rbd -p openstack snap protect int32bit-test-1@snap-1
rbd -p openstack clone int32bit-test-1@snap-1 int32bit-test-2

 

 

We can view the child images (children) of a snapshot and the parent image a clone is based on:

$ rbd -p openstack children int32bit-test-1@snap-1
openstack/int32bit-test-2
$ rbd -p openstack info int32bit-test-2 | grep parent
parent: openstack/int32bit-test-1@snap-1

 

From the above, we can find that int32bit-test-2 is the children of int32bit-test-1, and int32bit-test-1 is the parent of int32bit-test-2.

 

Constantly creating snapshots and cloned images forms a long image chain. When the chain grows long it not only hurts read/write performance but also becomes very troublesome to manage. Fortunately, Ceph supports merging all the images on the chain into a single independent image. This operation is called flatten and is similar to git merge. Flatten has to copy, layer by layer, all data that does not yet exist at the top level, so it is usually very time-consuming.

 

$ rbd -p openstack flatten int32bit-test-2
Image flatten: 31% complete...

 

 

At this point, let's view the parent children relationship again:

rbd -p openstack children int32bit-test-1@snap-1

 

 

At this time, int32bit-test-1 has no children, and int32bit-test-2 is completely independent.

Of course, Ceph also supports full copy, called copy:

 

rbd -p openstack cp int32bit-test-1 int32bit-test-3

 

Copy will completely copy an image, so it will be very time-consuming, but note that copy will not copy the original snapshot information.

Ceph supports exporting an RBD image:

 

rbd -p openstack export int32bit-test-1 int32bit-1.raw

 

Export dumps the whole image. Ceph also supports export-diff, which exports only the data written since a given snapshot point:

rbd -p openstack export-diff \
int32bit-test-1 --from-snap snap-1 \
--snap snap-2 int32bit-test-1-diff.raw

 

 

The command above exports the data written between snapshot snap-1 and snapshot snap-2.

The reverse operations are import and import-diff. export/import provide full image backups, while export-diff/import-diff enable differential (incremental) backups.
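To make the differential backup idea concrete, the export can be piped straight into the import. The sketch below simply shells out to the rbd CLI (the same pattern Cinder's Ceph backup driver uses in section 4.4); the image and pool names are illustrative:

import subprocess

def rbd_diff_transfer(src_spec, dest_spec, from_snap=None):
    """Pipe `rbd export-diff` of src_spec into `rbd import-diff` on dest_spec.

    src_spec / dest_spec are pool/image[@snap] strings; from_snap is the
    snapshot to diff from (None exports everything up to src_spec).
    """
    export_cmd = ['rbd', 'export-diff']
    if from_snap:
        export_cmd += ['--from-snap', from_snap]
    export_cmd += [src_spec, '-']                        # write the diff to stdout
    import_cmd = ['rbd', 'import-diff', '-', dest_spec]  # read the diff from stdin

    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=exporter.stdout)
    exporter.stdout.close()  # let the exporter see a broken pipe if the importer dies
    return importer.wait()

# e.g. replay the changes between snap-1 and snap-2 onto a backup image:
# rbd_diff_transfer('openstack/int32bit-test-1@snap-2',
#                   'backup/int32bit-test-1', from_snap='snap-1')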

An RBD image allocates storage space dynamically; the du subcommand shows the physical space an image actually occupies:

 

$ rbd du int32bit-test-1
NAME            PROVISIONED   USED
int32bit-test-1       1024M 12288k

 

 

The allocated size of the above image is 1024M, and the actual occupied space is 12288KB.

To delete an image, you must first delete all its snapshots and ensure that there are no dependent children:

rbd -p openstack snap unprotect int32bit-test-1@snap-1
rbd -p openstack snap rm int32bit-test-1@snap-1
rbd -p openstack rm int32bit-test-1

 

1.2 OpenStack introduction

OpenStack is an open source implementation of an IaaS cloud computing platform. For more about OpenStack itself, please visit my personal blog; here we only explore, step by step and through source code analysis, what Ceph does when OpenStack is connected to a Ceph storage system. This article will not walk through the whole OpenStack workflow in detail, only the parts related to Ceph. If you are unfamiliar with the OpenStack source tree, you can refer to my earlier article How to read OpenStack source code.

After reading this article, you can understand the following questions:

  1. Why does the uploaded image have to be converted to raw format?
  2. How to efficiently upload a large image file?
  3. Why can we create virtual machines in seconds?
  4. Why does it take several minutes to create a virtual machine snapshot, while creating a volume snapshot can be completed in seconds?
  5. Why can't an image be deleted while virtual machines created from it still exist?
  6. Why must a backup be restored to an empty volume rather than overwriting an existing volume?
  7. After creating a volume from an image, can the image be deleted?

Note that this article assumes Ceph is used throughout, i.e. Glance, Nova and Cinder all use Ceph; in other configurations the conclusions may not hold.

In addition, this article quotes a lot of source code first, which is long and tedious; you can jump straight to the summary section to see the Ceph operations behind each OpenStack action.

 

 

2 Glance

 

 

2.1 Glance introduction

The core entity managed by Glance is the image. Glance is one of the core OpenStack components and provides Image as a Service, being mainly responsible for the life cycle management, retrieval and download of OpenStack images and image metadata. Glance supports saving images to a variety of storage systems; a back-end storage system is called a store, and the address used to access an image is called its location. A location can be an HTTP address or an rbd protocol address. As long as a store driver is implemented, the corresponding storage system can serve as a Glance backend. The main interfaces of a driver are listed below (a bare skeleton follows the list):

  get: get the location of an image.
  get_size: get the size of an image.
  get_schemes: get the URL prefixes (the scheme part) used to access images, such as rbd, swift+https, http, etc.
  add: upload an image to the back-end storage.
  delete: delete an image.
  set_acls: set read and write access permissions on the back-end storage.
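A bare-bones, purely hypothetical skeleton of such a driver (signatures simplified; real drivers typically subclass glance_store.driver.Store and carry many more options):

class MyStore(object):
    """Illustrative outline of the store driver interface listed above."""

    def get_schemes(self):
        # URL prefixes this driver claims, e.g. ('rbd',) or ('http', 'https')
        return ('mystore',)

    def get(self, location):
        # return an iterator over the image bytes plus the image size
        raise NotImplementedError

    def get_size(self, location):
        # return the size of the image at this location
        raise NotImplementedError

    def add(self, image_id, image_file, image_size):
        # write the data to the backend and return the new location plus checksum info
        raise NotImplementedError

    def delete(self, location):
        # remove the image data from the backend
        raise NotImplementedError

    def set_acls(self, location, public=False, read_tenants=None):
        # adjust read/write permissions on the backend (optional for many stores)
        raise NotImplementedError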

 

 

To ease maintenance, the glance store code has been separated from the Glance code base into an independent library maintained as the glance_store project. The stores currently supported by the community are:

  filesystem: save to the local file system, by default under the /var/lib/glance/images directory.
  cinder: save to Cinder.
  rbd: save to Ceph.
  sheepdog: save to sheepdog.
  swift: save to the Swift object store.
  vmware datastore: save to a VMware datastore.
  http: all of the stores above hold the actual image data; the http store is special in that it stores no image data at all and therefore does not implement the add method. It only records the image's URL address, and when a virtual machine is started the compute node downloads the image from that HTTP address.

 

This article focuses on the rbd store, whose source code is here; the driver code of this store is mainly maintained by Fei Long Wang. Implementation details of the other stores can be found in the glance_store drivers source code.

2.2 image upload

It can be seen from the previous introduction that image upload is mainly realized by the add method of the store:

 

@capabilities.check
def add(self, image_id, image_file, image_size, context=None,
        verifier=None):
    checksum = hashlib.md5()
    image_name = str(image_id)
    with self.get_connection(conffile=self.conf_file,
                             rados_id=self.user) as conn:
        fsid = None
        if hasattr(conn, 'get_fsid'):
            fsid = conn.get_fsid()
        with conn.open_ioctx(self.pool) as ioctx:
            order = int(math.log(self.WRITE_CHUNKSIZE, 2))
            try:
                loc = self._create_image(fsid, conn, ioctx, image_name,
                                         image_size, order)
            except rbd.ImageExists:
                msg = _('RBD image %s already exists') % image_id
                raise exceptions.Duplicate(message=msg)
                ...

 

 

 

Note that image_file is not a file but an instance of LimitingReader, which wraps all of the image data and exposes it through a read(bytes) method.
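For intuition, a stripped-down sketch of such a reader follows; the real LimitingReader in Glance differs in details (exception type, checksum handling), so treat it as illustrative only:

class LimitingReader(object):
    """Wrap a file-like object and refuse to read past a byte limit."""

    def __init__(self, data, limit):
        self.data = data          # underlying file-like object
        self.limit = limit        # maximum number of bytes allowed
        self.bytes_read = 0

    def read(self, length=None):
        chunk = self.data.read() if length is None else self.data.read(length)
        self.bytes_read += len(chunk)
        if self.bytes_read > self.limit:
            raise IOError('image data exceeds limit of %d bytes' % self.limit)
        return chunk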

From the source code above, Glance first obtains a connection session to Ceph and then calls the _create_image method to create an rbd image of the same size as the Glance image:

 

def _create_image(self, fsid, conn, ioctx, image_name,
                  size, order, context=None):
    librbd = rbd.RBD()
    features = conn.conf_get('rbd_default_features')
    librbd.create(ioctx, image_name, size, order, old_format=False,
                  features=int(features))
    return StoreLocation({
        'fsid': fsid,
        'pool': self.pool,
        'image': image_name,
        'snapshot': DEFAULT_SNAPNAME,
    }, self.conf)

 

Therefore, the above steps are roughly expressed by rbd command:

 

rbd -p ${rbd_store_pool} create \
--size ${image_size} ${image_id}

 

 

After the rbd image has been created in Ceph, the next step is:

with rbd.Image(ioctx, image_name) as image:
    bytes_written = 0
    offset = 0
    chunks = utils.chunkreadable(image_file,
                                 self.WRITE_CHUNKSIZE)
    for chunk in chunks:
        offset += image.write(chunk, offset)
        checksum.update(chunk)

 

As the code shows, Glance reads data from image_file in chunks, writes it into the rbd image just created and updates the checksum. The chunk size is determined by the rbd_store_chunk_size configuration option, 8 MB by default.
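The chunking helper is conceptually just a generator over read() calls. A minimal sketch (illustrative, not Glance's exact utils.chunkreadable):

def chunkreadable(fileobj, chunk_size=8 * 1024 * 1024):
    """Yield successive chunks read from a file-like object until exhausted.

    8 MB matches the default rbd_store_chunk_size mentioned above.
    """
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk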

Let's move on to the final step:

 

if loc.snapshot:
    image.create_snap(loc.snapshot)
    image.protect_snap(loc.snapshot)

 

 

As you can see from the code, the last step is to create an image snapshot (the snapshot name is snap) and protect it.

Assuming the uploaded image is cirros, the image size is 39 MB, the image uuid is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, and Glance is configured to store images in the openstack pool of Ceph, the Ceph operations are roughly as follows:

 

rbd -p openstack create \
--size 39 d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
rbd -p openstack snap create \
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack snap protect \
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap

 

We can verify through the rbd command:

$ rbd ls -l | grep d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6      40162k  2
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap 40162k  2 yes

 

 

Takeaways

We described how the image is uploaded to Ceph and skipped how it reaches Glance in the first place, but there is no doubt that it goes through the Glance API. When the image is very large, uploading over HTTP through the Glance API is time-consuming and eats the bandwidth of the API management network. We can greatly improve upload efficiency by importing the image directly with rbd import.

First, create an empty image with Glance and note its uuid:

glance image-create | awk '/\sid\s/{print $4}'

 

Assuming that the uuid is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, use the rbd command to directly import the image and create a snapshot:

rbd -p openstack import cirros.raw \
--image=d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
rbd -p openstack snap create \
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack snap protect \
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap

 

Set the Glance image location URL:

 

FS_ID=`ceph -s | grep cluster | awk '{print $2}'`
glance location-add \
--url rbd://${FS_ID}/openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6/snap \
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6 

 

 

Set the other properties of the Glance image:

 

glance image-update --name="cirros" \
--disk-format=raw \
--container-format=bare d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6

 

2.3 image deletion

Deleting an image is the reverse process, i.e. snap unprotect -> snap rm -> rm, as follows:

try:
    self._unprotect_snapshot(image, snapshot_name)
    image.remove_snap(snapshot_name)
except rbd.ImageBusy as exc:
    raise exceptions.InUseByStore()
rbd.RBD().remove(ioctx, image_name) 

 

To delete an image, you must ensure that the current rbd image does not have child images, otherwise the deletion will fail.

3 Nova

 

3.1 Nova introduction

The core entity managed by Nova is the server. Nova provides the compute service for OpenStack and is its most central component. Note that a server in Nova does not refer only to virtual machines; it is an abstraction of any computing resource and, besides virtual machines, may also be a baremetal machine, a container, etc.

However, we assume here that:

 

  the server is a virtual machine;
  the image type is rbd;
  the compute driver is libvirt.

 

Before a virtual machine can be started, its root disk must be prepared; Nova calls this the image. Like Glance, Nova images can be stored on local disk, in Ceph, or in Cinder (boot from volume). Note that where the image is stored is determined by the image type: data kept on local disk can be in raw, qcow2, ploop and other formats, while an image type of rbd means the image is stored in Ceph. Each image type is handled by its own image backend; the rbd backend is implemented by the Rbd class in the nova/virt/libvirt/imagebackend.py module.

3.2 creating virtual machines

The overall process of creating a virtual machine is not analyzed again here; if you are not familiar with it, see my earlier blog posts. We go straight to how Nova's libvirt driver prepares the root disk image for the virtual machine. The code is the spawn method of nova/virt/libvirt/driver.py, and the image is created by the _create_image method it calls.

 

def spawn(self, context, instance, image_meta, injected_files,
          admin_password, network_info=None, block_device_info=None):
    ...
    self._create_image(context, instance, disk_info['mapping'],
                       injection_info=injection_info,
                       block_device_info=block_device_info)
    ... 

 

 

The code of the _create_image method is as follows:

 

def _create_image(self, context, instance,
                  disk_mapping, injection_info=None, suffix='',
                  disk_images=None, block_device_info=None,
                  fallback_from_host=None,
                  ignore_bdi_for_swap=False):
    booted_from_volume = self._is_booted_from_volume(block_device_info)
    ...
    # ensure directories exist and are writable
    fileutils.ensure_tree(libvirt_utils.get_instance_path(instance))
    ...
    self._create_and_inject_local_root(context, instance,
                                       booted_from_volume, suffix,
                                       disk_images, injection_info,
                                       fallback_from_host)
    ...

 

This method first creates the virtual machine's local data directory /var/lib/nova/instances/${uuid}/, and then calls the _create_and_inject_local_root method to create the root disk.

 

 

def _create_and_inject_local_root(self, context, instance,
                                  booted_from_volume, suffix, disk_images,
                                  injection_info, fallback_from_host):
    ...
    if not booted_from_volume:
        root_fname = imagecache.get_cache_fname(disk_images['image_id'])
        size = instance.flavor.root_gb * units.Gi
        backend = self.image_backend.by_name(instance, 'disk' + suffix,
                                             CONF.libvirt.images_type)
        if backend.SUPPORTS_CLONE:
            def clone_fallback_to_fetch(*args, **kwargs):
                try:
                    backend.clone(context, disk_images['image_id'])
                except exception.ImageUnacceptable:
                    libvirt_utils.fetch_image(*args, **kwargs)
            fetch_func = clone_fallback_to_fetch
        else:
            fetch_func = libvirt_utils.fetch_image
        self._try_fetch_image_cache(backend, fetch_func, context,
                                    root_fname, disk_images['image_id'],
                                    instance, size, fallback_from_host)
        ...

 

The image_backend.by_name() method returns the image backend instance for the given image type name, in our case Rbd. As the code shows, if the backend supports the clone operation (SUPPORTS_CLONE), its clone() method is called; otherwise the image is downloaded with fetch_image(). Ceph rbd obviously supports clone, so let's look at Rbd's clone() method, located in the nova/virt/libvirt/imagebackend.py module:

 

def clone(self, context, image_id_or_uri):
    ...
    for location in locations:
        if self.driver.is_cloneable(location, image_meta):
            LOG.debug('Selected location: %(loc)s', {'loc': location})
            return self.driver.clone(location, self.rbd_name)
    ...

 

This method iterates over all the locations of the Glance image and uses the driver's is_cloneable() method to decide whether clone is possible; if so, driver.clone() is called. Here driver is Nova's storage driver, whose code lives in nova/virt/libvirt/storage; the rbd driver is in the rbd_utils.py module. Let's first look at the is_cloneable() method:

 def is_cloneable(self, image_location, image_meta):
        url = image_location['url']
        try:
            fsid, pool, image, snapshot = self.parse_url(url)
        except exception.ImageUnacceptable as e:
            return False
        if self.get_fsid() != fsid:
            return False
        if image_meta.get('disk_format') != 'raw':
            return False
        # check that we can read the image
        try:
            return self.exists(image, pool=pool, snapshot=snapshot)
        except rbd.Error as e:
            LOG.debug('Unable to open image %(loc)s: %(err)s',
                      dict(loc=url, err=e))
            return False

 

 

It can be seen that clone is not supported in the following cases:

 

  The rbd image location in Glance is invalid: an rbd location must contain the four fields fsid, pool, image id and snapshot, separated by /.
  Glance and Nova are connected to different Ceph clusters.
  The Glance image is not in raw format.
  The rbd image behind the Glance image has no snapshot named snap.

 

Pay special attention to the third point: if the image is not in raw format, Nova cannot clone it when creating a virtual machine and has to download it from Glance instead. This is why, when Glance uses Ceph storage, images must be converted to raw format before uploading (for example with qemu-img convert -f qcow2 -O raw cirros.qcow2 cirros.raw).

Finally, let's look at the clone method:

 

def clone(self, image_location, dest_name, dest_pool=None):
    _fsid, pool, image, snapshot = self.parse_url(
            image_location['url'])
    with RADOSClient(self, str(pool)) as src_client:
        with RADOSClient(self, dest_pool) as dest_client:
            try:
                RbdProxy().clone(src_client.ioctx,
                                 image,
                                 snapshot,
                                 dest_client.ioctx,
                                 str(dest_name),
                                 features=src_client.features)
            except rbd.PermissionError:
                raise exception.Forbidden(_('no write permission on '
                                            'storage pool %s') % dest_pool) 

 

This method simply calls Ceph's clone method. Some may wonder why two ioctx are needed when everything lives in the same Ceph cluster: it is because Glance and Nova may use different Ceph pools, and each pool needs its own ioctx.
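To make the two-ioctx point concrete, here is a minimal standalone sketch of a cross-pool clone using the rados/rbd Python bindings; the pool and image names are hypothetical and the default /etc/ceph/ceph.conf is assumed:

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    src_ioctx = cluster.open_ioctx('glance-pool')    # pool holding the image
    dest_ioctx = cluster.open_ioctx('nova-pool')     # pool for the VM disk
    try:
        # clone glance-pool/image-uuid@snap into nova-pool/instance-uuid_disk
        rbd.RBD().clone(src_ioctx, 'image-uuid', 'snap',
                        dest_ioctx, 'instance-uuid_disk',
                        features=rbd.RBD_FEATURE_LAYERING)
    finally:
        src_ioctx.close()
        dest_ioctx.close()
finally:
    cluster.shutdown()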

The above operations are roughly equivalent to the following rbd commands:

 

rbd clone \
${glance_pool}/${image uuid}@snap \
${nova_pool}/${virtual machine uuid}.disk

 

 

Assuming that the pool used by Nova and Glance is openstack, the uuid of Glance image is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, and the uuid of Nova virtual machine is cbf44290-f142-41f8-86e1-d63c902b38ed, the corresponding rbd command is roughly:

rbd clone \
openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap \
openstack/cbf44290-f142-41f8-86e1-d63c902b38ed_disk

 

We further verify that:

int32bit $ rbd -p openstack ls | grep cbf44290-f142-41f8-86e1-d63c902b38ed
cbf44290-f142-41f8-86e1-d63c902b38ed_disk
int32bit $ rbd -p openstack info cbf44290-f142-41f8-86e1-d63c902b38ed_disk
rbd image 'cbf44290-f142-41f8-86e1-d63c902b38ed_disk':
    size 2048 MB in 256 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.9f756763845e
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Wed Nov 22 05:11:17 2017
    parent: openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
    overlap: 40162 kB

 

 

As the output shows, Nova did create an rbd image named cbf44290-f142-41f8-86e1-d63c902b38ed_disk, and its parent is openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap.

Takeaways

 

  Creating a virtual machine requires neither copying nor downloading the image, only a simple clone operation, so virtual machine creation completes essentially in seconds.
  If any virtual machine still depends on an image, the image cannot be deleted. In other words, all virtual machines created from an image must be deleted before the image itself can be deleted.

 

3.3 create virtual machine snapshot

First of all, I think Nova conflates create image and create snapshot. As I understand it, the difference is:

 

  create image: upload the virtual machine's root disk to Glance.
  create snapshot: snapshot the virtual machine according to its image format; qcow2 and rbd both obviously support snapshots. Such a snapshot should not be stored in Glance but managed by Nova, or by Cinder when booting from a Cinder volume.

 

In fact, Nova's subcommand for creating a snapshot is image-create, the API method is called _action_create_image(), and that in turn calls a method named snapshot(). For most image types, unless you boot from a volume, what actually happens is create image, i.e. uploading an image to Glance, rather than a real snapshot.

Of course, this is only a naming issue; in what follows I will not distinguish between create image and create snapshot.

The virtual machine snapshot is implemented by the snapshot() method of the libvirt driver, located in nova/virt/libvirt/driver.py. The core code is as follows:

 

 

def snapshot(self, context, instance, image_id, update_task_state):
    ...
    root_disk = self.image_backend.by_libvirt_path(
        instance, disk_path, image_type=source_type)
    try:
        update_task_state(task_state=task_states.IMAGE_UPLOADING,
                          expected_state=task_states.IMAGE_PENDING_UPLOAD)
        metadata['location'] = root_disk.direct_snapshot(
            context, snapshot_name, image_format, image_id,
            instance.image_ref)
        self._snapshot_domain(context, live_snapshot, virt_dom, state,
                              instance)
        self._image_api.update(context, image_id, metadata,
                               purge_props=False)
    except (NotImplementedError, exception.ImageUnacceptable) as e:
        ...

 

 

Nova first obtains the corresponding image backend from disk_path, which here returns imagebackend.Rbd, and then invokes the backend's direct_snapshot() method, which is as follows:

def direct_snapshot(self, context, snapshot_name, image_format,
                    image_id, base_image_id):
    fsid = self.driver.get_fsid()
    parent_pool = self._get_parent_pool(context, base_image_id, fsid)

    self.driver.create_snap(self.rbd_name, snapshot_name, protect=True)
    location = {'url': 'rbd://%(fsid)s/%(pool)s/%(image)s/%(snap)s' %
                       dict(fsid=fsid,
                            pool=self.pool,
                            image=self.rbd_name,
                            snap=snapshot_name)}
    try:
        self.driver.clone(location, image_id, dest_pool=parent_pool)
        self.driver.flatten(image_id, pool=parent_pool)
    finally:
        self.cleanup_direct_snapshot(location)
    self.driver.create_snap(image_id, 'snap', pool=parent_pool,
                            protect=True)
    return ('rbd://%(fsid)s/%(pool)s/%(image)s/snap' %
            dict(fsid=fsid, pool=parent_pool, image=image_id))

 

From the code analysis, it can be roughly divided into the following steps:

  1. Get the fsid of the Ceph cluster.
  2. Create a temporary snapshot of the rbd image backing the virtual machine's root disk; the snapshot name is a randomly generated uuid, and the snapshot is protected (protect) right away.
  3. Clone a new rbd image from that snapshot, named after the snapshot uuid.
  4. Perform a flatten operation on the cloned image.
  5. Delete the temporary snapshot.
  6. Create a snapshot named snap on the cloned rbd image and protect it.

 

 

Expressed as rbd commands: assuming the virtual machine uuid is cbf44290-f142-41f8-86e1-d63c902b38ed and the snapshot uuid is db2b6552-394a-42d2-9de8-2295fe2b3180, the corresponding commands are:

# Snapshot the disk and clone it into Glance's storage pool
rbd -p openstack snap create \
cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack snap protect \
cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack clone \
cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70 \
db2b6552-394a-42d2-9de8-2295fe2b3180
# Flatten the image, which detaches it from the source snapshot
rbd -p openstack flatten \
db2b6552-394a-42d2-9de8-2295fe2b3180
# all done with the source snapshot, clean it up
rbd -p openstack snap unprotect \
cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack snap rm \
cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
# Makes a protected snapshot called 'snap' on uploaded images 
# and hands it out
rbd -p openstack snap create \
db2b6552-394a-42d2-9de8-2295fe2b3180@snap
rbd -p openstack snap protect \
db2b6552-394a-42d2-9de8-2295fe2b3180@snap

 

Where 3437a9bbba5842629cc76e78aa613c70 is the name of the generated temporary snapshot, which is a randomly generated uuid.

Takeaways

 

With other storage backends most of the time is spent uploading the image, whereas with Ceph storage most of the time is spent in the rbd flatten step, so creating a virtual machine snapshot usually still takes several minutes. Some may wonder why the flatten operation has to be performed at all; wouldn't a plain clone be enough? The community had its reasons:

 

  Without flatten, the virtual machine snapshot depends on the virtual machine; in other words, the virtual machine could not be deleted as long as the snapshot exists, which is clearly unreasonable.
  Extending the previous point: suppose a virtual machine is created from the snapshot and then snapshotted again, over and over; the dependency chain of rbd images would become so complex that it could not be managed at all.
  As the rbd image chain grows longer, the corresponding IO read/write performance gets worse and worse.

 

3.4 deleting virtual machines

In the libvirt driver, the code for deleting a virtual machine is the destroy method of nova/virt/libvirt/driver.py:

 

def destroy(self, context, instance, network_info, block_device_info=None,
                destroy_disks=True):
    self._destroy(instance)
    self.cleanup(context, instance, network_info, block_device_info,
                 destroy_disks)

 

Note that the leading _destroy method is really the virtual machine shutdown operation, i.e. Nova first shuts the virtual machine down and then deletes it. That is followed by a call to the cleanup() method, which releases resources. Here we only look at the disk cleanup part:

 

if destroy_disks:
    # NOTE(haomai): destroy volumes if needed
    if CONF.libvirt.images_type == 'lvm':
        self._cleanup_lvm(instance, block_device_info)
    if CONF.libvirt.images_type == 'rbd':
        self._cleanup_rbd(instance)
...

 

 

Since our image type is rbd, the _cleanup_rbd() method is called:

 

def _cleanup_rbd(self, instance):
    if instance.task_state == task_states.RESIZE_REVERTING:
        filter_fn = lambda disk: (disk.startswith(instance.uuid) and
                                  disk.endswith('disk.local'))
    else:
        filter_fn = lambda disk: disk.startswith(instance.uuid)
    LibvirtDriver._get_rbd_driver().cleanup_volumes(filter_fn)

 

 

If we only consider the normal delete operation and ignore the resize revert case, filter_fn is lambda disk: disk.startswith(instance.uuid), i.e. it matches all disks (rbd images) whose names start with the virtual machine uuid. Note that this does not go through the Rbd class of imagebackend; the storage driver is called directly. The code is in nova/virt/libvirt/storage/rbd_utils.py:

def cleanup_volumes(self, filter_fn):
    with RADOSClient(self, self.pool) as client:
        volumes = RbdProxy().list(client.ioctx)
        for volume in filter(filter_fn, volumes):
            self._destroy_volume(client, volume)

 

This method first obtains the list of all rbd images, filters it with filter_fn for images starting with the virtual machine uuid, and calls the _destroy_volume method on each:

 

def _destroy_volume(self, client, volume, pool=None):
    """Destroy an RBD volume, retrying as needed.
    """
    def _cleanup_vol(ioctx, volume, retryctx):
        try:
            RbdProxy().remove(ioctx, volume)
            raise loopingcall.LoopingCallDone(retvalue=False)
        except rbd.ImageHasSnapshots:
            self.remove_snap(volume, libvirt_utils.RESIZE_SNAPSHOT_NAME,
                             ignore_errors=True)
        except (rbd.ImageBusy, rbd.ImageHasSnapshots):
            LOG.warning('rbd remove %(volume)s in pool %(pool)s failed',
                        {'volume': volume, 'pool': self.pool})
        retryctx['retries'] -= 1
        if retryctx['retries'] <= 0:
            raise loopingcall.LoopingCallDone()

    # NOTE(danms): We let it go for ten seconds
    retryctx = {'retries': 10}
    timer = loopingcall.FixedIntervalLoopingCall(
        _cleanup_vol, client.ioctx, volume, retryctx)
    timed_out = timer.start(interval=1).wait()
    if timed_out:
        # NOTE(danms): Run this again to propagate the error, but
        # if it succeeds, don't raise the loopingcall exception
        try:
            _cleanup_vol(client.ioctx, volume, retryctx)
        except loopingcall.LoopingCallDone:
            pass

 

 

This method retries the _cleanup_vol() call up to 10 + 1 times to delete the rbd image; if the image still has a (resize) snapshot, that snapshot is removed first.

Assuming that the uuid of the virtual machine is cbf44290-f142-41f8-86e1-d63c902b38ed, the corresponding rbd command is roughly:

 

for image in $(rbd -p openstack ls | grep '^cbf44290-f142-41f8-86e1-d63c902b38ed');
do 
    rbd -p openstack rm "$image";
done

 

4 Cinder

4.1 Cinder introduction

Cinder is OpenStack's block storage service, similar to AWS EBS; the entity it manages is the volume. Cinder does not actually provide the storage itself; it manages volumes on various storage systems such as Ceph, Fujitsu, NetApp, etc., and supports creating, snapshotting and backing up volumes. A connected storage system is called a backend. As long as the interface defined by the VolumeDriver class in cinder/volume/driver.py is implemented, Cinder can be connected to that storage system.

Cinder not only supports the management of local volume, but also backs up the local volume to the remote storage system, such as another Ceph cluster or Swift object storage system. This paper will only consider backing up from the source Ceph cluster to the remote Ceph cluster.

4.2 create volume

Volume creation is handled by the cinder-volume service; the entry point is the create_volume() method of cinder/volume/manager.py:

 

def create_volume(self, context, volume, request_spec=None,
                  filter_properties=None, allow_reschedule=True):
    ...              
    try:
        # NOTE(flaper87): Driver initialization is
        # verified by the task itself.
        flow_engine = create_volume.get_flow(
            context_elevated,
            self,
            self.db,
            self.driver,
            self.scheduler_rpcapi,
            self.host,
            volume,
            allow_reschedule,
            context,
            request_spec,
            filter_properties,
            image_volume_cache=self.image_volume_cache,
        )
    except Exception:
        msg = _("Create manager volume flow failed.")
        LOG.exception(msg, resource={'type': 'volume', 'id': volume.id})
        raise exception.CinderException(msg)
...        

 

Cinder's volume creation uses the taskflow framework; the flow itself is implemented in cinder/volume/flows/manager/create_volume.py. We focus on its execute() method:

 

def execute(self, context, volume, volume_spec):
    ...
    if create_type == 'raw':
        model_update = self._create_raw_volume(volume, **volume_spec)
    elif create_type == 'snap':
        model_update = self._create_from_snapshot(context, volume,
                                                  **volume_spec)
    elif create_type == 'source_vol':
        model_update = self._create_from_source_volume(
            context, volume, **volume_spec)
    elif create_type == 'image':
        model_update = self._create_from_image(context,
                                               volume,
                                               **volume_spec)
    else:
        raise exception.VolumeTypeNotFound(volume_type_id=create_type)
    ...     

 

 

From the code, we can see that volume creation can be divided into four types:

  raw: create a blank volume.
  create from snapshot: create a volume from a snapshot.
  create from volume: effectively copy an existing volume.
  create from image: create a volume from a Glance image.

 

raw

Creating a blank volume is the easiest way. The code is as follows:

def _create_raw_volume(self, volume, **kwargs):
    ret = self.driver.create_volume(volume)
    ...

 

It directly calls the driver's create_volume() method; here the driver is RBDDriver, whose code is in cinder/volume/drivers/rbd.py:

def create_volume(self, volume):
    with RADOSClient(self) as client:
        self.RBDProxy().create(client.ioctx,
                               vol_name,
                               size,
                               order,
                               old_format=False,
                               features=client.features)

        try:
            volume_update = self._enable_replication_if_needed(volume)
        except Exception:
            self.RBDProxy().remove(client.ioctx, vol_name)
            err_msg = (_('Failed to enable image replication'))
            raise exception.ReplicationError(reason=err_msg,
                                             volume_id=volume.id)

 

 

Here size is in MB and vol_name is volume-${volume_uuid}.

Assuming that the uuid of the volume is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the Ceph pool is openstack, and the created volume size is 1GB, the corresponding rbd command is equivalent to:

rbd -p openstack create \
--new-format --size 1024 \
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db

 

We can verify through the rbd command:

 

int32bit $ rbd -p openstack ls | grep bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db

 

create from snapshot

Creating a volume from a snapshot is also a method of directly calling the driver, as follows:

def _create_from_snapshot(self, context, volume, snapshot_id,
                          **kwargs):
    snapshot = objects.Snapshot.get_by_id(context, snapshot_id)
    model_update = self.driver.create_volume_from_snapshot(volume,
                                                           snapshot)

 

Let's look at RBDDriver's create_volume_from_snapshot() method:

def create_volume_from_snapshot(self, volume, snapshot):
    """Creates a volume from a snapshot."""
    volume_update = self._clone(volume, self.configuration.rbd_pool,
                                snapshot.volume_name, snapshot.name)
    if self.configuration.rbd_flatten_volume_from_snapshot:
        self._flatten(self.configuration.rbd_pool, volume.name)
    if int(volume.size):
        self._resize(volume)
    return volume_update

 

From the code, creating a volume from a snapshot takes three steps:

 

  Perform a clone operation from the rbd snapshot.
  If rbd_flatten_volume_from_snapshot is configured as True, perform a flatten operation.
  If a size was specified at creation time, perform a resize operation.

 

Assume the uuid of the newly created volume is e6bc8618-879b-4655-aac0-05e5a1ce0e06, the snapshot is snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86, the snapshot's source volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the specified size is 2, and rbd_flatten_volume_from_snapshot is False (the default); then the corresponding rbd commands are:

 

rbd clone \
openstack/volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86 \
openstack/volume-e6bc8618-879b-4655-aac0-05e5a1ce0e06
rbd resize --size 2048 \
openstack/volume-e6bc8618-879b-4655-aac0-05e5a1ce0e06

 

From the source code analysis, when Cinder creates a volume from a snapshot, the user can configure whether to execute the flatten operation:

 

  If flatten is executed, creating a volume from a snapshot may take several minutes, but the snapshot can be deleted at any time afterwards.
  If flatten is not executed, note that neither the snapshot nor the snapshot's source volume can be deleted until all volumes created from that snapshot have been deleted.

 

The second point can get complicated. For example, if a volume is created from a snapshot, a snapshot is then created from that volume, and another volume is created from that new snapshot, the user can delete neither the source volume nor the snapshots.

create from volume

To create a volume from another volume, you need to specify the source volume id (source_volid):

def _create_from_source_volume(self, context, volume, source_volid,
                               **kwargs):
    srcvol_ref = objects.Volume.get_by_id(context, source_volid)
    model_update = self.driver.create_cloned_volume(volume, srcvol_ref)

 

Let's look directly at the driver's create_cloned_volume() method. It involves a very important configuration option, rbd_max_clone_depth, the maximum clone depth allowed for an rbd image; rbd_max_clone_depth <= 0 means cloning is not allowed:

 

# Do full copy if requested
if self.configuration.rbd_max_clone_depth <= 0:
    with RBDVolumeProxy(self, src_name, read_only=True) as vol:
        vol.copy(vol.ioctx, dest_name)
        self._extend_if_required(volume, src_vref)
    return 

 

This is equivalent to the copy command of rbd.

If rbd_max_clone_depth > 0:

 

# Otherwise do COW clone.
with RADOSClient(self) as client:
    src_volume = self.rbd.Image(client.ioctx, src_name)
    LOG.debug("creating snapshot='%s'", clone_snap)
    try:
        # Create new snapshot of source volume
        src_volume.create_snap(clone_snap)
        src_volume.protect_snap(clone_snap)
        # Now clone source volume snapshot
        LOG.debug("cloning '%(src_vol)s@%(src_snap)s' to "
                  "'%(dest)s'",
                  {'src_vol': src_name, 'src_snap': clone_snap,
                   'dest': dest_name})
        self.RBDProxy().clone(client.ioctx, src_name, clone_snap,
                              client.ioctx, dest_name,
                              features=client.features)

 

 

This process is very similar to creating a virtual machine snapshot: both first create a snapshot of the source image and then clone from that snapshot. The difference lies in whether a flatten operation is performed: a virtual machine snapshot always flattens, whereas here flatten depends on the clone depth:

 

depth = self._get_clone_depth(client, src_name)
if depth >= self.configuration.rbd_max_clone_depth:
        dest_volume = self.rbd.Image(client.ioctx, dest_name)
        try:
            dest_volume.flatten()
        except Exception as e:
            ...

        try:
            src_volume.unprotect_snap(clone_snap)
            src_volume.remove_snap(clone_snap)
        except Exception as e:
            ...          

 

If the current depth exceeds the maximum allowed depth rbd_max_clone_depth, a flatten operation is performed and the snapshot just created is deleted.
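For reference, the clone depth can be computed by walking the parent chain. The sketch below is a simplified stand-in for Cinder's internal helper and assumes the whole chain lives in a single pool:

import rbd

def get_clone_depth(ioctx, image_name, depth=0):
    """Count how many clone levels sit beneath an rbd image."""
    with rbd.Image(ioctx, image_name) as image:
        try:
            _pool, parent, _snap = image.parent_info()
        except rbd.ImageNotFound:
            return depth                 # no parent: the chain ends here
    return get_clone_depth(ioctx, parent, depth + 1)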

Assuming that the created volume uuid is 3b8b15a4-3020-41a0-80be-afaa35ed5eef and the source volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the corresponding rbd command is:

 

VOLID=3b8b15a4-3020-41a0-80be-afaa35ed5eef
SOURCE_VOLID=bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
CINDER_POOL=openstack
# Do full copy if rbd_max_clone_depth <= 0.
if [[ "$rbd_max_clone_depth" -le 0 ]]; then
    rbd copy ${CINDER_POOL}/volume-${SOURCE_VOLID} openstack/volume-${VOLID}
    exit 0
fi
# Otherwise do COW clone.
# Create new snapshot of source volume
rbd snap create \
${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
rbd snap protect \
${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
# Now clone source volume snapshot
rbd clone \
${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap \
${CINDER_POOL}/volume-${VOLID}
# If dest volume is a clone and rbd_max_clone_depth reached,
# flatten the dest after cloning.
depth=$(get_clone_depth ${CINDER_POOL}/volume-${VOLID})
if [[ "$depth" -ge "$rbd_max_clone_depth" ]]; then
    # Flatten destination volume 
    rbd flatten ${CINDER_POOL}/volume-${VOLID}
    # remove temporary snap
    rbd snap unprotect \
    ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
    rbd snap rm \
    ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
fi

 

When rbd_max_clone_depth > 0 and depth < rbd_max_clone_depth, we can verify with the rbd command:

int32bit $ rbd info volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef
rbd image 'volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef':
        size 1024 MB in 256 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.ae2e437c177a
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:
        create_timestamp: Wed Nov 22 12:32:09 2017
        parent: openstack/volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef.clone_snap
        overlap: 1024 MB

 

 

It can be seen that the parent of volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef is:

 

volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef.clone_snap.

 

 

create from image

When creating a volume from an image, and assuming Glance and Cinder use the same Ceph cluster, Cinder can clone directly from Glance without downloading the image:

 

def _create_from_image(self, context, volume,
                       image_location, image_id, image_meta,
                       image_service, **kwargs):
    ...
    model_update, cloned = self.driver.clone_image(
        context,
        volume,
        image_location,
        image_meta,
        image_service)
   ...

 

Let's look at the driver's clone_image() method:

 

def clone_image(self, context, volume,
                image_location, image_meta,
                image_service):
    # iterate all locations to look for a cloneable one.
    for url_location in url_locations:
        if url_location and self._is_cloneable(
                url_location, image_meta):
            _prefix, pool, image, snapshot = \
                self._parse_location(url_location)
            volume_update = self._clone(volume, pool, image, snapshot)
            volume_update['provider_location'] = None
            self._resize(volume)
            return volume_update, True
    return ({}, False)

 

It is a direct rbd clone, basically the same as when creating a virtual machine. If a new size is specified when creating the volume, rbd resize is called to expand it.

Assuming the newly created volume uuid is 87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4 and the Glance image uuid is db2b6552-394a-42d2-9de8-2295fe2b3180, the rbd commands are:

rbd clone \
openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap \
openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4

if [[ -n "$size" ]]; then
    rbd resize --size $size \
    openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4
fi

 

Verify the following with the rbd command:

int32bit $ rbd info openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4
rbd image 'volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4':
        size 3072 MB in 768 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.affc488ac1a
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        flags:
        create_timestamp: Wed Nov 22 13:07:50 2017
        parent: openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap
        overlap: 2048 MB

 

It can be seen that the parent of the newly created rbd image is openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap .

 

Note: personally I think this method should also perform a flatten operation; otherwise, as long as such volumes exist, Glance cannot delete the image, which effectively makes the Glance service depend on the state of the Cinder service and is a bit unreasonable.

4.3 creating snapshots

The entry point for creating a snapshot is the create_snapshot() method of cinder/volume/manager.py. It does not use the taskflow framework but calls the driver's create_snapshot() method directly, as follows:

...
try:
    utils.require_driver_initialized(self.driver)
    snapshot.context = context
    model_update = self.driver.create_snapshot(snapshot)
    ...
except Exception:
    ...

 

RBDDriver's create_snapshot() method is very simple:

 

def create_snapshot(self, snapshot):
    """Creates an rbd snapshot."""
    with RBDVolumeProxy(self, snapshot.volume_name) as volume:
        snap = utils.convert_str(snapshot.name)
        volume.create_snap(snap)
        volume.protect_snap(snap) 

 

Therefore, the volume snapshot is actually the corresponding Ceph rbd image snapshot. Assuming that the snapshot uuid is e4e534fc-420b-45c6-8e9f-b23dcfcb7f86 and the volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the corresponding rbd commands are roughly as follows:

 

rbd -p openstack snap create \
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86
rbd -p openstack snap protect \
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86

 

Here we can see the difference between a virtual machine snapshot and a volume snapshot: a virtual machine snapshot must clone and then flatten from a snapshot of the root disk's rbd image, whereas a volume snapshot only needs to create an rbd image snapshot. That is why a virtual machine snapshot usually takes several minutes while a volume snapshot completes in seconds.

4.4 create volume backup

Before looking at volume backup, we need to clarify the difference between a snapshot and a backup. Using the git analogy again: a snapshot is like a git commit, it only records that the data has been committed and is mainly used for backtracking and rollback; when the cluster crashes and data is lost, the data usually cannot be fully recovered from snapshots. A backup is like a git push: it pushes the data safely to a remote storage system and is mainly there to guarantee data safety, so that even if the local data is lost it can be restored from the backup. Cinder's volume backup also supports a variety of storage backends; here we only consider the case where both the volume driver and the backup driver are Ceph. For other details, see Principle and practice of Cinder data volume backup. In production, volumes and backups must use different Ceph clusters, so that when the volume Ceph cluster goes down the data can be quickly recovered from the other cluster. This article only tests the functionality, so the same Ceph cluster is used, distinguished by pool: volumes use the openstack pool and backups use the cinder_backup pool.

In addition, Cinder supports incremental backup; the --incremental parameter decides whether to do a full or an incremental backup. However, for the Ceph backend Cinder always tries an incremental backup first and only falls back to a full backup when the incremental backup fails, regardless of whether the user passed --incremental. Nevertheless, we still distinguish full and incremental backups below; note that only the first backup can be a full backup, all subsequent ones are incremental.

Full backup (first backup)

Let's look directly at CephBackupDriver's backup() method; the code is in cinder/backup/drivers/ceph.py.

 

 

if self._file_is_rbd(volume_file):
    # If volume an RBD, attempt incremental backup.
    LOG.debug("Volume file is RBD: attempting incremental backup.")
    try:
        updates = self._backup_rbd(backup, volume_file,
                                   volume.name, length)
    except exception.BackupRBDOperationFailed:
        LOG.debug("Forcing full backup of volume %s.", volume.id)
        do_full_backup = True 

 

 

Here the driver mainly checks whether the source volume is an rbd volume, i.e. whether it uses the Ceph backend; an incremental backup can only be performed when the volume itself is also backed by Ceph.
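The check itself is lightweight: the file object handed to the backup driver exposes an rbd_image attribute only when the volume was attached through the RBD connector, so something along these lines suffices (a sketch, not necessarily the driver's exact helper):

def file_is_rbd(volume_file):
    """True if the attached volume file is backed by an RBD image handle."""
    return hasattr(volume_file, 'rbd_image')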

Let's look at the _backup_rbd() method:

from_snap = self._get_most_recent_snap(source_rbd_image)
base_name = self._get_backup_base_name(volume_id, diff_format=True)
image_created = False
with rbd_driver.RADOSClient(self, backup.container) as client:
    if base_name not in self.rbd.RBD().list(ioctx=client.ioctx):
        ...
        # Create new base image
        self._create_base_image(base_name, length, client)
        image_created = True
    else:
        ...

 

from_snap is the snapshot point of the last backup; since this is the first backup, from_snap is None. base_name has the format volume-%s.backup.base. What is this base for? The _create_base_image() method tells us:

 

def _create_base_image(self, name, size, rados_client):
    old_format, features = self._get_rbd_support()
    self.rbd.RBD().create(ioctx=rados_client.ioctx,
                          name=name,
                          size=size,
                          old_format=old_format,
                          features=features,
                          stripe_unit=self.rbd_stripe_unit,
                          stripe_count=self.rbd_stripe_count)

 

It can be seen that the base is simply an empty volume of the same size as the source volume.

In other words, on the first backup an empty volume of the same size as the source volume is first created in the backup Ceph cluster.

Let's continue to look at the source code:

def _backup_rbd(self, backup, volume_file, volume_name, length):
    ...
    new_snap = self._get_new_snap_name(backup.id)
    LOG.debug("Creating backup snapshot='%s'", new_snap)
    source_rbd_image.create_snap(new_snap)

    try:
        self._rbd_diff_transfer(volume_name, rbd_pool, base_name,
                                backup.container,
                                src_user=rbd_user,
                                src_conf=rbd_conf,
                                dest_user=self._ceph_backup_user,
                                dest_conf=self._ceph_backup_conf,
                                src_snap=new_snap,
                                from_snap=from_snap)
                            
def _get_new_snap_name(self, backup_id):
    return utils.convert_str("backup.%s.snap.%s"
                             % (backup_id, time.time()))

 

First a new snapshot is created on the source volume, named backup.${backup_id}.snap.${timestamp}, and then the _rbd_diff_transfer() method is called:

def _rbd_diff_transfer(self, src_name, src_pool, dest_name, dest_pool,
                       src_user, src_conf, dest_user, dest_conf,
                       src_snap=None, from_snap=None):
    src_ceph_args = self._ceph_args(src_user, src_conf, pool=src_pool)
    dest_ceph_args = self._ceph_args(dest_user, dest_conf, pool=dest_pool)

    cmd1 = ['rbd', 'export-diff'] + src_ceph_args
    if from_snap is not None:
        cmd1.extend(['--from-snap', from_snap])
    if src_snap:
        path = utils.convert_str("%s/%s@%s"
                                 % (src_pool, src_name, src_snap))
    else:
        path = utils.convert_str("%s/%s" % (src_pool, src_name))
    cmd1.extend([path, '-'])

    cmd2 = ['rbd', 'import-diff'] + dest_ceph_args
    rbd_path = utils.convert_str("%s/%s" % (dest_pool, dest_name))
    cmd2.extend(['-', rbd_path])

    ret, stderr = self._piped_execute(cmd1, cmd2)
    if ret:
        msg = (_("RBD diff op failed - (ret=%(ret)s stderr=%(stderr)s)") %
               {'ret': ret, 'stderr': stderr})
        LOG.info(msg)
        raise exception.BackupRBDOperationFailed(msg) 

 

 

This method shells out to the rbd command: it first exports the diff of the source rbd image with the export-diff subcommand and then imports it into the backup image with import-diff.

Assuming that the uuid of the source volume is 075c06ed-37e2-407d-b998-e270c4edc53c, the size is 1GB, and the backup uuid is db563496-0c15-4349-95f3-fc5194bfb11a, the corresponding rbd commands are roughly as follows:

VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
BACKUP_ID=db563496-0c15-4349-95f3-fc5194bfb11a
rbd -p cinder_backup create \
--size 1024 \
volume-${VOLUME_ID}.backup.base
new_snap=volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511344566.67
rbd -p openstack snap create ${new_snap}
rbd export-diff --pool openstack ${new_snap} - \
| rbd import-diff --pool cinder_backup - volume-${VOLUME_ID}.backup.base

 

We can verify the following with the rbd command:

 

# volume ceph cluster
int32bit $ rbd -p openstack snap ls volume-075c06ed-37e2-407d-b998-e270c4edc53c
SNAPID NAME                                                              SIZE TIMESTAMP
    52 backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024 MB Wed Nov 22 17:56:15 2017
# backup ceph cluster
int32bit $ rbd -p cinder_backup ls -l
NAME                                                                                                                   SIZE PARENT FMT PROT LOCK
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base                                                                1024M 2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024M  2

 

From the output, a snapshot with snap ID 52 was created on the source volume, and an empty image volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base was created in the backup Ceph cluster, containing one snapshot backup.xxx.snap.1511344566.67 that was created by import-diff.

Incremental backup

The early part of the flow is identical to a full backup, so let's jump straight to the _backup_rbd() method:

from_snap = self._get_most_recent_snap(source_rbd_image)
with rbd_driver.RADOSClient(self, backup.container) as client:
    if base_name not in self.rbd.RBD().list(ioctx=client.ioctx):
        ...
    else:
        if not self._snap_exists(base_name, from_snap, client):
            errmsg = (_("Snapshot='%(snap)s' does not exist in base "
                        "image='%(base)s' - aborting incremental "
                        "backup") %
                      {'snap': from_snap, 'base': base_name})
            LOG.info(errmsg)
            raise exception.BackupRBDOperationFailed(errmsg)

 

First, the most recent snapshot of the rbd image corresponding to the source volume is obtained and used as from_snap, and then we check whether that snapshot also exists on the base image in the backup Ceph cluster (given the earlier full backup, the matching snapshot must exist there).
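As an illustration, the "latest backup snapshot" can be picked out with the rbd Python bindings roughly as follows. This is only a sketch, not the exact Cinder helpers:

# Sketch only: select the newest backup snapshot of an rbd.Image by the
# timestamp suffix in its name (backup.<backup_id>.snap.<timestamp>).
def most_recent_backup_snap(rbd_image):
    names = [s['name'] for s in rbd_image.list_snaps()
             if s['name'].startswith('backup.') and '.snap.' in s['name']]
    if not names:
        return None
    return max(names, key=lambda n: float(n.rsplit('.snap.', 1)[-1]))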

Let's continue to look at the following parts:

 

new_snap = self._get_new_snap_name(backup.id)
source_rbd_image.create_snap(new_snap)

try:
    before = time.time()
    self._rbd_diff_transfer(volume_name, rbd_pool, base_name,
                            backup.container,
                            src_user=rbd_user,
                            src_conf=rbd_conf,
                            dest_user=self._ceph_backup_user,
                            dest_conf=self._ceph_backup_conf,
                            src_snap=new_snap,
                            from_snap=from_snap)
    if from_snap:
        source_rbd_image.remove_snap(from_snap) 

 

This is basically the same as the full backup path. The only differences are that from_snap is not None and that from_snap is removed from the source volume afterwards. The _rbd_diff_transfer() method is the one shown earlier.

Assuming that the source volume uuid is 075c06ed-37e2-407d-b998-e270c4edc53c, the backup uuid is e3db9e85-d352-47e2-bced-5bad68da853b, and the parent backup uuid is db563496-0c15-4349-95f3-fc5194bfb11a, the corresponding rbd commands are roughly as follows:

VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
BACKUP_ID=e3db9e85-d352-47e2-bced-5bad68da853b
PARENT_ID=db563496-0c15-4349-95f3-fc5194bfb11a
rbd -p openstack snap create \
volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511348180.27 
rbd export-diff  --pool openstack \
--from-snap backup.${PARENT_ID}.snap.1511344566.67 \
openstack/volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511348180.27 - \
| rbd import-diff --pool cinder_backup - \
cinder_backup/volume-${VOLUME_ID}.backup.base
rbd -p openstack snap rm \
volume-${VOLUME_ID}@backup.${PARENT_ID}.snap.1511344566.67

Again, we can verify this with the rbd command:

 

int32bit $ rbd -p openstack snap ls volume-075c06ed-37e2-407d-b998-e270c4edc53c
SNAPID NAME                                                              SIZE TIMESTAMP
    53 backup.e3db9e85-d352-47e2-bced-5bad68da853b.snap.1511348180.27 1024 MB Wed Nov 22 18:56:20 2017
int32bit $ rbd -p cinder_backup ls -l
NAME                                                                                                                    SIZE PARENT FMT PROT LOCK
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base                                                                1024M          2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024M          2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.e3db9e85-d352-47e2-bced-5bad68da853b.snap.1511348180.27 1024M          2

 

This matches our analysis: on the source volume the old backup snapshot is deleted so that only the latest one is kept, while the backup cluster keeps all snapshots.

4.5 Backup restore

Restore is the reverse of backup: data is copied from the remote backup storage back to the local storage. The restore code is the restore() method in cinder/backup/drivers/ceph.py, which simply delegates to _restore_volume(), so let's look at _restore_volume() directly:

def _restore_volume(self, backup, volume, volume_file):
    length = int(volume.size) * units.Gi

    base_name = self._get_backup_base_name(backup.volume_id,
                                           diff_format=True)
    with rbd_driver.RADOSClient(self, backup.container) as client:
        diff_allowed, restore_point = \
            self._diff_restore_allowed(base_name, backup, volume,
                                       volume_file, client) 

 

_diff_restore_allowed() is the key method here: it determines whether the backup can be restored by a direct differential import. Let's look at its implementation:

 

def _diff_restore_allowed(self, base_name, backup, volume, volume_file,
                          rados_client):
    rbd_exists, base_name = self._rbd_image_exists(base_name,
                                                   backup.volume_id,
                                                   rados_client)
    if not rbd_exists:
        return False, None
    restore_point = self._get_restore_point(base_name, backup.id)
    if restore_point:
        if self._file_is_rbd(volume_file):
            if volume.id == backup.volume_id:
                return False, restore_point
            if self._rbd_has_extents(volume_file.rbd_image):
                return False, restore_point
            return True, restore_point

From this method we can see that all of the following conditions must be met for a differential-import restore:

 

The rbd base image corresponding to the volume must exist in the backup cluster.
The restore point must exist, i.e. the corresponding snapshot of the backup base image must exist.
The restore target volume must be an RBD volume, i.e. the volume's storage backend must also be Ceph.
The restore target volume must be empty; restoring over an image that already contains data is not supported (a sketch of this check follows the list).
The uuid of the restore target volume must differ from the uuid of the backup's source volume, i.e. the original volume cannot be overwritten.
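The "target volume must be empty" condition boils down to checking whether the image has any allocated extents. A rough sketch of such a check with the rbd Python bindings (illustrative only; the driver's own helper is _rbd_has_extents()):

# Sketch only: report whether an open rbd.Image already contains data,
# using diff_iterate() over the whole image.
def rbd_has_extents(rbd_image):
    found = []

    def _cb(offset, length, exists):
        if exists:
            found.append((offset, length))

    rbd_image.diff_iterate(0, rbd_image.size(), None, _cb)
    return bool(found)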

 

In other words, although Cinder allows restoring a backup to an existing volume (including the source volume), with the Ceph backend such a restore cannot use the differential path and falls back to a full copy, which is very inefficient.

For this reason, when the Ceph backend is used, the official documentation recommends restoring the backup to a new, empty volume (i.e. not specifying a volume) rather than to an existing one:

Note that Cinder supports restoring to a new volume or the original volume the backup was taken from. 
For the latter case, a full copy is enforced since this was deemed the safest action to take.
It is therefore recommended to always restore to a new volume (default).

 

Here, suppose we restore to a new empty volume. The command is as follows:

 

cinder backup-restore \
--name int32bit-restore-1 \
e3db9e85-d352-47e2-bced-5bad68da853b

Note that we did not specify the --volume parameter, so a differential (incremental) restore is performed. The code is as follows:

def _diff_restore_rbd(self, backup, restore_file, restore_name,
                      restore_point, restore_length):
    rbd_user = restore_file.rbd_user
    rbd_pool = restore_file.rbd_pool
    rbd_conf = restore_file.rbd_conf
    base_name = self._get_backup_base_name(backup.volume_id,
                                           diff_format=True)
    before = time.time()
    try:
        self._rbd_diff_transfer(base_name, backup.container,
                                restore_name, rbd_pool,
                                src_user=self._ceph_backup_user,
                                src_conf=self._ceph_backup_conf,
                                dest_user=rbd_user, dest_conf=rbd_conf,
                                src_snap=restore_point)
    except exception.BackupRBDOperationFailed:
        raise
    self._check_restore_vol_size(backup, restore_name, restore_length,
                                 rbd_pool)

 

As you can see, the differential restore is very simple: it just uses the _rbd_diff_transfer() method described earlier to export-diff the restore-point snapshot of the base image in the backup Ceph cluster and import-diff it into the volume's Ceph cluster, and then adjusts the size.
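The size adjustment at the end (_check_restore_vol_size) amounts to growing the restored image if it is smaller than the requested volume size. A hedged sketch with the rados/rbd bindings; the connection details and function name are assumptions, not the driver's actual code:

# Sketch only: grow the restored image to the requested size if needed.
# The conffile path is an illustrative assumption.
import rados
import rbd

def ensure_image_size(pool, name, required_bytes,
                      conffile='/etc/ceph/ceph.conf'):
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            image = rbd.Image(ioctx, name)
            try:
                if image.size() < required_bytes:
                    image.resize(required_bytes)
            finally:
                image.close()
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()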

Assuming that the backup uuid is e3db9e85-d352-47e2-bced-5bad68da853b, the source volume uuid is 075c06ed-37e2-407d-b998-e270c4edc53c, and the target volume uuid is f65cf534-5266-44bb-ad57-ddba21d9e5f9, the corresponding rbd command is:

 

BACKUP_ID=e3db9e85-d352-47e2-bced-5bad68da853b
SOURCE_VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
DEST_VOLUME_ID=f65cf534-5266-44bb-ad57-ddba21d9e5f9
rbd export-diff --pool cinder_backup \
cinder_backup/volume-${SOURCE_VOLUME_ID}.backup.base@backup.${BACKUP_ID}.snap.1511348180.27 - \
| rbd import-diff --pool openstack - \
openstack/volume-${DEST_VOLUME_ID}
rbd -p openstack resize \
--size ${new_size} volume-${DEST_VOLUME_ID} 

 

If any of the five conditions above is not met, Cinder falls back to a full restore, copying the data chunk by chunk:

def _transfer_data(self, src, src_name, dest, dest_name, length):
    chunks = int(length / self.chunk_size)
    for chunk in range(0, chunks):
        before = time.time()
        data = src.read(self.chunk_size)
        dest.write(data)
        dest.flush()
        delta = (time.time() - before)
        rate = (self.chunk_size / delta) / 1024
        # yield to any other pending backups
        eventlet.sleep(0)
    # transfer any remaining bytes that do not fill a whole chunk
    rem = int(length % self.chunk_size)
    if rem:
        data = src.read(rem)
        dest.write(data)
        dest.flush()
        # yield to any other pending backups
        eventlet.sleep(0)

 

In this case, it is inefficient and time-consuming, and is not recommended.

5 Summary

5.1 Glance

1. Upload image

rbd -p ${GLANCE_POOL} create --size ${SIZE} ${IMAGE_ID}
rbd -p ${GLANCE_POOL} snap create ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap protect ${IMAGE_ID}@snap

 

2. Delete image

 

rbd -p ${GLANCE_POOL} snap unprotect ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap rm ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} rm ${IMAGE_ID} 

 

5.2 Nova

1. Create virtual machine

rbd clone \
${GLANCE_POOL}/${IMAGE_ID}@snap \
${NOVA_POOL}/${SERVER_ID}_disk

2. Create virtual machine snapshot

 

# Snapshot the disk and clone 
# it into Glance's storage pool
rbd -p ${NOVA_POOL} snap create \
${SERVER_ID}_disk@${RANDOM_UUID}
rbd -p ${NOVA_POOL} snap protect \
${SERVER_ID}_disk@${RANDOM_UUID}
rbd clone \
${NOVA_POOL}/${SERVER_ID}_disk@${RANDOM_UUID} \
${GLANCE_POOL}/${IMAGE_ID} 
# Flatten the image, which detaches it from the 
# source snapshot
rbd -p ${GLANCE_POOL} flatten ${IMAGE_ID} 
# all done with the source snapshot, clean it up
rbd -p ${NOVA_POOL} snap unprotect \
${SERVER_ID}_disk@${RANDOM_UUID}
rbd -p ${NOVA_POOL} snap rm \
${SERVER_ID}_disk@${RANDOM_UUID} 
# Makes a protected snapshot called 'snap' on 
# uploaded images and hands it out
rbd -p ${GLANCE_POOL} snap create ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap protect ${IMAGE_ID}@snap

3. Delete virtual machine

 

for image in $(rbd -p ${NOVA_POOL} ls | grep "^${SERVER_ID}");
do 
    rbd -p ${NOVA_POOL} rm "$image"; 
done 

5.3 Cinder

1. Create volume

(1) Create a blank volume

rbd -p ${CINDER_POOL} create \
--new-format --size ${SIZE} \
volume-${VOLUME_ID} 

 

(2) Create from snapshot

 

rbd clone \
${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@snapshot-${SNAPSHOT_ID} \
${CINDER_POOL}/volume-${VOLUME_ID}
rbd resize --size ${SIZE} \
${CINDER_POOL}/volume-${VOLUME_ID} 

 

(3) Create from volume

 

# Do full copy if rbd_max_clone_depth <= 0.
if [[ "$rbd_max_clone_depth" -le 0 ]]; then
    rbd copy \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID} \
    ${CINDER_POOL}/volume-${VOLUME_ID}
    exit 0
fi
# Otherwise do COW clone.
# Create new snapshot of source volume
rbd snap create \
${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
rbd snap protect \
${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
# Now clone source volume snapshot
rbd clone \
${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap \
${CINDER_POOL}/volume-${VOLUME_ID}
# If dest volume is a clone and rbd_max_clone_depth reached,
# flatten the dest after cloning.
depth=$(get_clone_depth ${CINDER_POOL}/volume-${VOLUME_ID})
if [[ "$depth" -ge "$rbd_max_clone_depth" ]]; then
    # Flatten destination volume 
    rbd flatten ${CINDER_POOL}/volume-${VOLUME_ID}
    # remove temporary snap
    rbd snap unprotect \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
    rbd snap rm \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
fi

(4) Create from image

 

rbd clone \
${GLANCE_POOL}/${IMAGE_ID}@snap \
${CINDER_POOL}/volume-${VOLUME_ID}
if [[ -n "${SIZE}" ]]; then
    rbd resize --size ${SIZE} ${CINDER_POOL}/volume-${VOLUME_ID}
fi

 

2. Create snapshot

 

rbd -p ${CINDER_POOL} snap create \
volume-${VOLUME_ID}@snapshot-${SNAPSHOT_ID}
rbd -p ${CINDER_POOL} snap protect \
volume-${VOLUME_ID}@snapshot-${SNAPSHOT_ID} 

 

3. Create backup

(1) First backup

 

rbd -p ${BACKUP_POOL} create \
--size ${VOLUME_SIZE} \
volume-${VOLUME_ID}.backup.base
NEW_SNAP=volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP}
rbd -p ${CINDER_POOL} snap create ${NEW_SNAP}
rbd export-diff ${CINDER_POOL}/${NEW_SNAP} - \
| rbd import-diff --pool ${BACKUP_POOL} - \
volume-${VOLUME_ID}.backup.base

 

(2) Incremental backup

 

rbd -p ${CINDER_POOL} snap create \
volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP} 
rbd export-diff  --pool ${CINDER_POOL} \
--from-snap backup.${PARENT_ID}.snap.${LAST_TIMESTAMP} \
${CINDER_POOL}/volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP} - \
| rbd import-diff --pool ${BACKUP_POOL} - \
${BACKUP_POOL}/volume-${VOLUME_ID}.backup.base
rbd -p ${CINDER_POOL} snap rm \
volume-${VOLUME_ID}@backup.${PARENT_ID}.snap.${LAST_TIMESTAMP} 

 

4. Restore backup

rbd export-diff --pool ${BACKUP_POOL} \
volume-${SOURCE_VOLUME_ID}.backup.base@backup.${BACKUP_ID}.snap.${TIMESTAMP} - \
| rbd import-diff --pool ${CINDER_POOL} - \
volume-${DEST_VOLUME_ID}
rbd -p ${CINDER_POOL} resize \
--size ${NEW_SIZE} volume-${DEST_VOLUME_ID} 

 

Reference: https://zhuanlan.zhihu.com/p/31581145