1 background knowledge
1.1 Ceph introduction
Ceph is a very popular open-source distributed storage system. It has the advantages of high scalability, high performance and high reliability, and it provides block storage (rbd), object storage (rgw) and file system storage (cephfs). It is currently the mainstream backend storage for OpenStack, with which it integrates tightly to provide unified shared storage services. Using Ceph as OpenStack backend storage has the following advantages:
- All compute nodes share the storage. During migration there is no need to copy the root disk, and even if a compute node goes down, the virtual machine can immediately be started on another compute node (evacuate).
- Thanks to the COW (Copy On Write) feature, creating a virtual machine only requires a clone based on the image; there is no need to download the whole image. The clone operation has essentially zero overhead, which makes second-level virtual machine creation possible.
- Ceph RBD supports thin provisioning, i.e. space is allocated on demand, somewhat like a sparse file in a Linux file system. A newly created 20 GB virtual disk initially occupies no physical storage space; storage is allocated only when data is actually written, as the quick check below shows.
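A quick way to observe the thin provisioning behaviour (assuming a pool named openstack and an illustrative image name) is to create a large image and immediately check its physical usage with rbd du:

rbd -p openstack create --size 20480 thin-test   # a 20 GB image, no data written yet
rbd -p openstack du thin-test                    # PROVISIONED shows 20 GB, USED stays near zero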
For more information about Ceph, please refer to the official documentation; here we only focus on RBD. The core object managed by RBD is the block device, usually called a volume, but in Ceph it is customarily called an image (note the difference from an OpenStack image). Ceph also has the concept of a pool, which is similar to a namespace; different pools can define different replica counts, pg numbers, placement strategies, and so on. Every image must belong to a pool. The naming convention for an image is pool_name/image_name@snapshot; for example, openstack/test-image@test-snap denotes the snapshot test-snap of the image test-image in the pool openstack. Therefore, the following two commands have exactly the same effect:
rbd snap create --pool openstack --image test-image --snap test-snap
rbd snap create openstack/test-image@test-snap
Create a 1G image on openstack pool. The command is:
rbd -p openstack create --size 1024 int32bit-test-1
An image supports snapshots. Creating a snapshot saves the current state of the image, which is similar to a git commit; the user can roll the image back to any snapshot point (git reset) at any time. The command to create a snapshot is as follows:
rbd -p openstack snap create int32bit-test-1@snap-1
To view the rbd list:
$ rbd -p openstack ls -l | grep int32bit-test
int32bit-test-1        1024M 2
int32bit-test-1@snap-1 1024M 2
Based on a snapshot you can create a new image, called a clone. A clone does not copy the original image immediately; instead it uses the COW (copy-on-write) strategy: only when an object needs to be written is it copied from the parent to the clone. Therefore the clone operation completes in seconds. Note that all images created from the same snapshot share the image data from before the snapshot, so the snapshot must be protected before cloning, and a protected snapshot cannot be deleted. The clone operation is similar to git branch. Cloning an image looks like this:
rbd -p openstack snap protect int32bit-test-1@snap-1
rbd -p openstack clone int32bit-test-1@snap-1 int32bit-test-2
We can view the children of an image (the clones created from it) and the parent image on which a clone is based:
$ rbd -p openstack children int32bit-test-1@snap-1
openstack/int32bit-test-2
$ rbd -p openstack info int32bit-test-2 | grep parent
parent: openstack/int32bit-test-1@snap-1
From the above, we can find that int32bit-test-2 is the children of int32bit-test-1, and int32bit-test-1 is the parent of int32bit-test-2.
Repeatedly creating snapshots and clones forms a long image chain. When the chain gets long, it not only hurts read/write performance but also makes management very troublesome. Fortunately, Ceph can merge all the images on the chain into a single independent image; this operation is called flatten, similar to git merge. Flatten copies, layer by layer, all the data that does not yet exist at the top level, so it is usually very time-consuming.
$ rbd -p openstack flatten int32bit-test-2
Image flatten: 31% complete...
At this point, let's view the parent children relationship again:
rbd -p openstack children int32bit-test-1@snap-1
At this time, int32bit-test-1 has no children, and int32bit-test-2 is completely independent.
Of course, Ceph also supports full copy, called copy:
rbd -p openstack cp int32bit-test-1 int32bit-test-3
Copy completely duplicates an image, so it is very time-consuming; note that copy does not copy the original snapshots.
Ceph supports exporting an RBD image:
rbd -p openstack export int32bit-test-1 int32bit-1.raw
Export dumps the whole image. Ceph also supports export-diff, i.e. exporting only the data written starting from a specified snapshot point:
rbd -p openstack export-diff \
    int32bit-test-1 --from-snap snap-1 \
    --snap snap-2 int32bit-test-1-diff.raw
The command above exports the data written between snapshot snap-1 and snapshot snap-2.
The reverse operations are, of course, import and import-diff. export/import supports full image backup, while export-diff/import-diff enables differential image backup; a small sketch of this pairing follows.
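To make the pairing concrete, here is a hedged sketch using the image from above and a hypothetical backup_pool; it is essentially the same export-diff/import-diff pattern that Cinder's Ceph backup driver uses later in this article:

# Full backup: export everything up to snap-1 and replay it into a fresh image.
rbd -p openstack export-diff int32bit-test-1 --snap snap-1 int32bit-test-1-full.diff
rbd -p backup_pool create --size 1024 int32bit-test-1-bak
rbd -p backup_pool import-diff int32bit-test-1-full.diff int32bit-test-1-bak

# Incremental backup: only the changes written between snap-1 and snap-2.
rbd -p openstack export-diff int32bit-test-1 --from-snap snap-1 --snap snap-2 int32bit-test-1-inc.diff
rbd -p backup_pool import-diff int32bit-test-1-inc.diff int32bit-test-1-bak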
An rbd image allocates storage space dynamically. You can check the physical storage space actually occupied by an image with the du command:
$ rbd du int32bit-test-1
NAME            PROVISIONED USED
int32bit-test-1       1024M 12288k
The allocated size of the above image is 1024M, and the actual occupied space is 12288KB.
To delete an image, you must first delete all its snapshots and ensure that there are no dependent children:
rbd -p openstack snap unprotect int32bit-test-1@snap-1
rbd -p openstack snap rm int32bit-test-1@snap-1
rbd -p openstack rm int32bit-test-1
1.2 OpenStack introduction
OpenStack is an open-source implementation of an IaaS cloud computing platform. For more information about OpenStack, please visit my personal blog. Here we only explore, step by step through source code analysis, what Ceph does when OpenStack is connected to a Ceph storage system. This article does not cover the whole OpenStack workflow in detail, only the parts related to Ceph. If you are not familiar with the OpenStack source code structure, you can refer to my earlier article, How to read OpenStack source code.
After reading this article, you can understand the following questions:
- Why does the uploaded image have to be converted to raw format?
- How to efficiently upload a large image file?
- Why can we create virtual machines in seconds?
- Why does it take several minutes to create a virtual machine snapshot, while creating a volume snapshot can be completed in seconds?
- Why can't an image be deleted while a virtual machine created from it still exists?
- Why must the backup be restored to an empty volume instead of overwriting the existing volume?
- After creating a volume from an image, can the image be deleted?
Note that this article assumes Ceph storage is used throughout, that is, Glance, Nova and Cinder all use Ceph. In other configurations the conclusions may not hold.
In addition, this article quotes a lot of source code, which is long and tedious; you can jump straight to the Summary part to see the Ceph operations corresponding to each OpenStack operation.
2 Glance
2.1 Glance introduction
The core entity managed by Glance is the image. Glance is one of the core components of OpenStack and provides the image service (Image as a Service), mainly responsible for the life cycle management, retrieval and download of OpenStack images and image metadata. Glance supports saving images to a variety of storage systems; the backend storage system is called a store, and the address used to access an image is called its location. A location can be an http address or an rbd protocol address. As long as a store driver is implemented, the system can be used as Glance's storage backend. The main interfaces of a driver are as follows (an illustrative sketch follows the list):
- get: gets the location of an image.
- get_size: gets the size of an image.
- get_schemes: gets the URL prefixes (the scheme part) handled by this store, such as rbd, swift+https, http, etc.
- add: uploads an image to the backend storage.
- delete: deletes an image.
- set_acls: sets read/write access permissions on the backend storage.
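As a point of reference, here is a heavily simplified, purely illustrative sketch of what such a driver looks like; the class name and exact signatures are assumptions for illustration (the real base class lives in glance_store and carries extra parameters and configuration handling):

class ExampleStore(object):
    """Illustrative only: the rough shape of a glance_store backend driver."""

    def get_schemes(self):
        # URL prefixes (scheme part) this store claims, e.g. ('rbd',).
        return ('example',)

    def get(self, location, context=None):
        # Return an iterator over the image bits stored at this location.
        raise NotImplementedError

    def get_size(self, location, context=None):
        # Return the size (in bytes) of the image at this location.
        raise NotImplementedError

    def add(self, image_id, image_file, image_size, context=None):
        # Write image_file to the backend and return the new location.
        raise NotImplementedError

    def delete(self, location, context=None):
        # Remove the image data referenced by location from the backend.
        raise NotImplementedError

    def set_acls(self, location, public=False,
                 read_tenants=None, write_tenants=None, context=None):
        # Optionally adjust read/write permissions on the backend object.
        pass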
For ease of maintenance, the Glance store code has been split out of the Glance code base into an independent library, maintained by the glance_store project. The stores currently supported by the community are:
- filesystem: saves to the local file system; by default images go to the /var/lib/glance/images directory.
- cinder: saves to Cinder.
- rbd: saves to Ceph.
- sheepdog: saves to sheepdog.
- swift: saves to the Swift object store.
- vmware datastore: saves to a VMware datastore.
- http: all of the stores above save the image data; the http store is special in that it does not save any image data and therefore does not implement the add method. It only records the image's URL address, and when a virtual machine is started the compute node downloads the image from that http address.
This article focuses on the rbd store. Its driver code is mainly maintained by Fei Long Wang; for implementation details of the other stores, refer to the glance_store driver source code.
2.2 image upload
As the previous introduction shows, image upload is mainly implemented by the store's add() method:
@capabilities.check
def add(self, image_id, image_file, image_size, context=None,
        verifier=None):
    checksum = hashlib.md5()
    image_name = str(image_id)
    with self.get_connection(conffile=self.conf_file,
                             rados_id=self.user) as conn:
        fsid = None
        if hasattr(conn, 'get_fsid'):
            fsid = conn.get_fsid()
        with conn.open_ioctx(self.pool) as ioctx:
            order = int(math.log(self.WRITE_CHUNKSIZE, 2))
            try:
                loc = self._create_image(fsid, conn, ioctx, image_name,
                                         image_size, order)
            except rbd.ImageExists:
                msg = _('RBD image %s already exists') % image_id
                raise exceptions.Duplicate(message=msg)
            ...
Note image_file is not a file, but an instance of LimitingReader, which saves all the data of the image and reads the image content through the read(bytes) method.
From the source code above, Glance first obtains a connection session to Ceph and then calls the _create_image() method to create an rbd image with the same size as the image:
def _create_image(self, fsid, conn, ioctx, image_name,
                  size, order, context=None):
    librbd = rbd.RBD()
    features = conn.conf_get('rbd_default_features')
    librbd.create(ioctx, image_name, size, order, old_format=False,
                  features=int(features))
    return StoreLocation({
        'fsid': fsid,
        'pool': self.pool,
        'image': image_name,
        'snapshot': DEFAULT_SNAPNAME,
    }, self.conf)
Therefore, the above steps are roughly expressed by rbd command:
rbd -p ${rbd_store_pool} create \
    --size ${image_size} ${image_id}
After the rbd image has been created in Ceph, the next step is:
with rbd.Image(ioctx, image_name) as image:
    bytes_written = 0
    offset = 0
    chunks = utils.chunkreadable(image_file,
                                 self.WRITE_CHUNKSIZE)
    for chunk in chunks:
        offset += image.write(chunk, offset)
        checksum.update(chunk)
As you can see, Glance reads data from image_file in chunks, writes it into the rbd image just created, and updates the checksum along the way. The chunk size is determined by the rbd_store_chunk_size option and defaults to 8MB.
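As a side note on the order value computed in add() above: rbd expresses the object size as a power of two, so the default 8MB chunk corresponds to order 23, which matches the "order 23 (8192 kB objects)" seen in the rbd info output later in this article. A quick check (plain Python arithmetic, not Glance code):

import math

WRITE_CHUNKSIZE = 8 * 1024 * 1024          # 8 MB, the default rbd_store_chunk_size
print(int(math.log(WRITE_CHUNKSIZE, 2)))   # 23, i.e. objects of 2**23 = 8388608 bytes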
Let's move on to the final step:
if loc.snapshot:
    image.create_snap(loc.snapshot)
    image.protect_snap(loc.snapshot)
As you can see from the code, the last step is to create an image snapshot (the snapshot name is snap) and protect it.
Assuming the uploaded image is cirros, the image size is 39MB, the image uuid is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, and Glance is configured to save images in the openstack pool of Ceph, the Ceph operations are roughly as follows:
rbd -p openstack create \
    --size 39 d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
rbd -p openstack snap create \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack snap protect \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
We can verify through the rbd command:
$ rbd ls -l | grep d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6      40162k 2
d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap 40162k 2 yes
Takeaways
We have described how the image is written to Ceph and omitted how the image reaches Glance in the first place, but it undoubtedly has to go through the Glance API. When the image is very large, uploading it over HTTP through the Glance API is time-consuming and eats the bandwidth of the API management network. We can greatly improve upload efficiency by importing the image directly with rbd import.
First, create an empty image with Glance and note its uuid:
glance image-create | awk '/\sid\s/{print $4}'
Assuming that the uuid is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, use the rbd command to directly import the image and create a snapshot:
rbd -p openstack import cirros.raw \
    --image=d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
rbd -p openstack snap create \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack snap protect \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
Set glance image location url:
FS_ID=`ceph -s | grep cluster | awk '{print $2}'`
glance location-add \
    --url rbd://${FS_ID}/openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6/snap \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
Set the other properties of the Glance image:
glance image-update --name="cirros" \
    --disk-format=raw \
    --container-format=bare \
    d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6
2.3 image deletion
Deleting an image is the reverse process: first snap unprotect, then snap rm, then rm, as follows:
try:
    self._unprotect_snapshot(image, snapshot_name)
    image.remove_snap(snapshot_name)
except rbd.ImageBusy as exc:
    raise exceptions.InUseByStore()
rbd.RBD().remove(ioctx, image_name)
To delete an image, you must ensure that the current rbd image does not have child images, otherwise the deletion will fail.
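Taking the image uuid from section 2.2 as an example, the deletion roughly corresponds to the same unprotect -> snap rm -> rm sequence shown in section 1.1:

rbd -p openstack snap unprotect d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack snap rm d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
rbd -p openstack rm d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6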
3 Nova
3.1 Nova introduction
The core entity managed by Nova is the server, which provides computing services for OpenStack; it is the most central component of OpenStack. Note that a server in Nova does not only mean a virtual machine: it can be an abstraction of any computing resource, such as a bare metal machine or a container.
However, we assume here that:
- the server is a virtual machine;
- the image type is rbd;
- the compute driver is libvirt.
Before starting a virtual machine you need to prepare its root disk, which Nova calls the image. Like Glance, Nova's image supports being stored on the local disk, in Ceph, or in Cinder (boot from volume). Note that where the image is saved is determined by the image type: data stored on the local disk can be in raw, qcow2, ploop and other formats, and if the image type is rbd, the image is stored in Ceph. Each image type has its own image backend; the rbd backend is implemented by the Rbd class in the nova/virt/libvirt/imagebackend module.
3.2 creating virtual machines
The process of creating a virtual machine is not analyzed in detail here; if you are not familiar with it, you can check my earlier blog posts. We go directly to how Nova's libvirt driver prepares the root disk image for the virtual machine. The code is in the spawn() method of nova/virt/libvirt/driver.py, where image creation calls the _create_image() method.
def spawn(self, context, instance, image_meta, injected_files,
          admin_password, network_info=None, block_device_info=None):
    ...
    self._create_image(context, instance, disk_info['mapping'],
                       injection_info=injection_info,
                       block_device_info=block_device_info)
    ...
The code of the _create_image() method is as follows:
def _create_image(self, context, instance, disk_mapping, injection_info=None,
                  suffix='', disk_images=None, block_device_info=None,
                  fallback_from_host=None, ignore_bdi_for_swap=False):
    booted_from_volume = self._is_booted_from_volume(block_device_info)
    ...
    # ensure directories exist and are writable
    fileutils.ensure_tree(libvirt_utils.get_instance_path(instance))
    ...
    self._create_and_inject_local_root(context, instance,
                                       booted_from_volume, suffix,
                                       disk_images, injection_info,
                                       fallback_from_host)
    ...
This method first creates the virtual machine's data directory /var/lib/nova/instances/${uuid}/ locally, and then calls the _create_and_inject_local_root() method to create the root disk.
def _create_and_inject_local_root(self, context, instance,
                                  booted_from_volume, suffix, disk_images,
                                  injection_info, fallback_from_host):
    ...
    if not booted_from_volume:
        root_fname = imagecache.get_cache_fname(disk_images['image_id'])
        size = instance.flavor.root_gb * units.Gi

        backend = self.image_backend.by_name(instance, 'disk' + suffix,
                                             CONF.libvirt.images_type)
        if backend.SUPPORTS_CLONE:
            def clone_fallback_to_fetch(*args, **kwargs):
                try:
                    backend.clone(context, disk_images['image_id'])
                except exception.ImageUnacceptable:
                    libvirt_utils.fetch_image(*args, **kwargs)
            fetch_func = clone_fallback_to_fetch
        else:
            fetch_func = libvirt_utils.fetch_image

        self._try_fetch_image_cache(backend, fetch_func, context,
                                    root_fname, disk_images['image_id'],
                                    instance, size, fallback_from_host)
    ...
The image_backend.by_name() method returns an image backend instance according to the image type name, in this case Rbd. As the code shows, if the backend supports clone (SUPPORTS_CLONE), its clone() method is called; otherwise the image is downloaded via fetch_image(). Obviously Ceph rbd supports clone. Let's look at Rbd's clone() method; the code is in the nova/virt/libvirt/imagebackend.py module:
def clone(self, context, image_id_or_uri):
    ...
    for location in locations:
        if self.driver.is_cloneable(location, image_meta):
            LOG.debug('Selected location: %(loc)s', {'loc': location})
            return self.driver.clone(location, self.rbd_name)
    ...
This method iterates over all locations of the Glance image and uses the driver's is_cloneable() method to determine whether clone is supported; if so, it calls driver.clone(). The driver here is Nova's storage driver, located in nova/virt/libvirt/storage; the rbd driver lives in the rbd_utils.py module. Let's first check the is_cloneable() method:
def is_cloneable(self, image_location, image_meta):
    url = image_location['url']
    try:
        fsid, pool, image, snapshot = self.parse_url(url)
    except exception.ImageUnacceptable as e:
        return False

    if self.get_fsid() != fsid:
        return False

    if image_meta.get('disk_format') != 'raw':
        return False

    # check that we can read the image
    try:
        return self.exists(image, pool=pool, snapshot=snapshot)
    except rbd.Error as e:
        LOG.debug('Unable to open image %(loc)s: %(err)s',
                  dict(loc=url, err=e))
        return False
It can be seen that clone is not supported in the following cases:
- The rbd location of the Glance image is invalid; a valid rbd location must contain the four fields fsid, pool, image id and snapshot, separated by /.
- Glance and Nova are connected to different Ceph clusters.
- The Glance image is not in raw format.
- The Glance rbd image has no snapshot named snap.
Pay special attention to the third case: if the image is not in raw format, Nova cannot clone it when creating a virtual machine and must download the image from Glance. That is why, when Glance uses Ceph storage, images must be converted to raw format.
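In practice that means converting non-raw images to raw before uploading them to Glance, for example with qemu-img (file names here are illustrative):

qemu-img convert -f qcow2 -O raw cirros.qcow2 cirros.raw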
Finally, let's look at the clone method:
def clone(self, image_location, dest_name, dest_pool=None):
    _fsid, pool, image, snapshot = self.parse_url(
        image_location['url'])
    with RADOSClient(self, str(pool)) as src_client:
        with RADOSClient(self, dest_pool) as dest_client:
            try:
                RbdProxy().clone(src_client.ioctx,
                                 image,
                                 snapshot,
                                 dest_client.ioctx,
                                 str(dest_name),
                                 features=src_client.features)
            except rbd.PermissionError:
                raise exception.Forbidden(_('no write permission on '
                                            'storage pool %s') % dest_pool)
This method simply calls Ceph's clone method. Some people may wonder why two ioctx are needed when both sides use the same Ceph cluster: this is because Glance and Nova may use different Ceph pools, and each pool corresponds to one ioctx.
The above operations are roughly equivalent to the following rbd commands:
rbd clone \
    ${glance_pool}/${image uuid}@snap \
    ${nova_pool}/${virtual machine uuid}_disk
Assuming that the pool used by Nova and Glance is openstack, the uuid of Glance image is d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6, and the uuid of Nova virtual machine is cbf44290-f142-41f8-86e1-d63c902b38ed, the corresponding rbd command is roughly:
rbd clone \
    openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap \
    openstack/cbf44290-f142-41f8-86e1-d63c902b38ed_disk
We further verify that:
int32bit $ rbd -p openstack ls | grep cbf44290-f142-41f8-86e1-d63c902b38ed
cbf44290-f142-41f8-86e1-d63c902b38ed_disk
int32bit $ rbd -p openstack info cbf44290-f142-41f8-86e1-d63c902b38ed_disk
rbd image 'cbf44290-f142-41f8-86e1-d63c902b38ed_disk':
    size 2048 MB in 256 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.9f756763845e
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Wed Nov 22 05:11:17 2017
    parent: openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap
    overlap: 40162 kB
As the output shows, Nova indeed created an rbd image named cbf44290-f142-41f8-86e1-d63c902b38ed_disk, and its parent is openstack/d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap.
Takeaways
- Creating a virtual machine requires neither copying nor downloading the image, only a simple clone, so virtual machine creation can essentially complete in seconds.
- As long as virtual machines still depend on an image, the image cannot be deleted. In other words, all virtual machines created from an image must be deleted before the image can be deleted (see the check below).
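A quick way to check whether an image still has dependent virtual machine disks is to list the children of its snap snapshot (using the uuid from the example above); any output means the Glance image cannot be deleted yet:

rbd -p openstack children d1a06da9-8ccd-4d3e-9b63-6dcd3ead29e6@snap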
3.3 create virtual machine snapshot
First of all, I think Nova conflates create image and create snapshot. As I understand it, the difference is:
- create image: upload the virtual machine's root disk to Glance.
- create snapshot: take a snapshot of the virtual machine according to its image format; both qcow2 and rbd obviously support snapshots. A snapshot should not be saved to Glance but managed by Nova, or by Cinder when booting from a Cinder volume.
In fact, Nova's subcommand for creating a snapshot is image-create, the API method is called _action_create_image(), and that method then calls snapshot(). For most image types, if the instance was not booted from a volume, what actually happens is creating an image, i.e. uploading the image to Glance, rather than a real snapshot.
Of course, it's just the difference in naming. There is no difference between create image and create snapshot.
The virtual machine snapshot is implemented by the LibvirtDriver's snapshot() method; the code is in nova/virt/libvirt/driver.py, and the core part is as follows:
def snapshot(self, context, instance, image_id, update_task_state):
    ...
    root_disk = self.image_backend.by_libvirt_path(
        instance, disk_path, image_type=source_type)
    try:
        update_task_state(task_state=task_states.IMAGE_UPLOADING,
                          expected_state=task_states.IMAGE_PENDING_UPLOAD)
        metadata['location'] = root_disk.direct_snapshot(
            context, snapshot_name, image_format, image_id,
            instance.image_ref)
        self._snapshot_domain(context, live_snapshot, virt_dom, state,
                              instance)
        self._image_api.update(context, image_id, metadata,
                               purge_props=False)
    except (NotImplementedError, exception.ImageUnacceptable) as e:
        ...
Nova first obtains the corresponding image backend from disk_path, in our case imagebackend.Rbd, and then invokes the backend's direct_snapshot() method, which is as follows:
def direct_snapshot(self, context, snapshot_name, image_format,
                    image_id, base_image_id):
    fsid = self.driver.get_fsid()
    parent_pool = self._get_parent_pool(context, base_image_id, fsid)

    self.driver.create_snap(self.rbd_name, snapshot_name, protect=True)
    location = {'url': 'rbd://%(fsid)s/%(pool)s/%(image)s/%(snap)s' %
                       dict(fsid=fsid,
                            pool=self.pool,
                            image=self.rbd_name,
                            snap=snapshot_name)}
    try:
        self.driver.clone(location, image_id, dest_pool=parent_pool)
        self.driver.flatten(image_id, pool=parent_pool)
    finally:
        self.cleanup_direct_snapshot(location)
    self.driver.create_snap(image_id, 'snap', pool=parent_pool,
                            protect=True)
    return ('rbd://%(fsid)s/%(pool)s/%(image)s/snap' %
            dict(fsid=fsid, pool=parent_pool, image=image_id))
From the code analysis, it can be roughly divided into the following steps:
- Get the fsid of the Ceph cluster.
- Create a temporary snapshot of the rbd image backing the virtual machine's root disk; its name is a randomly generated uuid, and the snapshot is protected.
- Clone a new rbd image from that temporary snapshot, named after the new snapshot image's uuid.
- Perform a flatten operation on the cloned image.
- Delete the temporary snapshot that was created.
- Create a snapshot named snap on the cloned rbd image and protect it.
Expressed as rbd commands: assuming the virtual machine's uuid is cbf44290-f142-41f8-86e1-d63c902b38ed and the snapshot image's uuid is db2b6552-394a-42d2-9de8-2295fe2b3180, they are roughly:
# Snapshot the disk and clone it into Glance's storage pool
rbd -p openstack snap create \
    cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack snap protect \
    cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack clone \
    cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70 \
    db2b6552-394a-42d2-9de8-2295fe2b3180

# Flatten the image, which detaches it from the source snapshot
rbd -p openstack flatten \
    db2b6552-394a-42d2-9de8-2295fe2b3180

# all done with the source snapshot, clean it up
rbd -p openstack snap unprotect \
    cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70
rbd -p openstack snap rm \
    cbf44290-f142-41f8-86e1-d63c902b38ed_disk@3437a9bbba5842629cc76e78aa613c70

# Makes a protected snapshot called 'snap' on uploaded images and hands it out
rbd -p openstack snap create \
    db2b6552-394a-42d2-9de8-2295fe2b3180@snap
rbd -p openstack snap protect \
    db2b6552-394a-42d2-9de8-2295fe2b3180@snap
Where 3437a9bbba5842629cc76e78aa613c70 is the name of the generated temporary snapshot, which is a randomly generated uuid.
Takeaways
With other storage backends most of the time is spent uploading the image, while with Ceph storage most of the time is spent on the rbd flatten, so creating a virtual machine snapshot usually takes several minutes. Some people may wonder why the flatten operation has to be performed at all; wouldn't a plain clone be enough? The community had its reasons:
- Without flatten, the virtual machine snapshot would depend on the virtual machine; in other words, the virtual machine could not be deleted as long as it had snapshots, which is obviously unreasonable.
- Extending the previous point: suppose a virtual machine is created from a snapshot and then a snapshot is taken of that virtual machine, and so on. The dependencies between rbd images would become extremely complex and impossible to manage.
- As the rbd image chain grows longer, the corresponding IO read/write performance gets worse and worse.
3.4 deleting virtual machines
The libvirt driver code for deleting a virtual machine is in the destroy() method of nova/virt/libvirt/driver.py:
def destroy(self, context, instance, network_info, block_device_info=None,
            destroy_disks=True):
    self._destroy(instance)
    self.cleanup(context, instance, network_info, block_device_info,
                 destroy_disks)
Note that the preceding _destroy() method is actually the virtual machine shutdown operation, i.e. Nova first shuts the virtual machine down and then deletes it. Then the cleanup() method is called to clean up resources. Here we only focus on the disk cleanup:
if destroy_disks:
    # NOTE(haomai): destroy volumes if needed
    if CONF.libvirt.images_type == 'lvm':
        self._cleanup_lvm(instance, block_device_info)
    if CONF.libvirt.images_type == 'rbd':
        self._cleanup_rbd(instance)
...
Since our image type is rbd, the _cleanup_rbd() method is called:
def _cleanup_rbd(self, instance):
    if instance.task_state == task_states.RESIZE_REVERTING:
        filter_fn = lambda disk: (disk.startswith(instance.uuid) and
                                  disk.endswith('disk.local'))
    else:
        filter_fn = lambda disk: disk.startswith(instance.uuid)
    LibvirtDriver._get_rbd_driver().cleanup_volumes(filter_fn)
If we only consider the normal delete operation and ignore the resize revert case, filter_fn is lambda disk: disk.startswith(instance.uuid), i.e. all disks (rbd images) whose names start with the virtual machine's uuid. Note that this does not go through the imagebackend Rbd driver but calls the storage driver directly; the code is in nova/virt/libvirt/storage/rbd_utils.py:
def cleanup_volumes(self, filter_fn):
    with RADOSClient(self, self.pool) as client:
        volumes = RbdProxy().list(client.ioctx)
        for volume in filter(filter_fn, volumes):
            self._destroy_volume(client, volume)
This method first gets the list of all rbd images, filters out the images starting with the virtual machine uuid using filter_fn, and then calls the _destroy_volume() method:
def _destroy_volume(self, client, volume, pool=None):
    """Destroy an RBD volume, retrying as needed.
    """
    def _cleanup_vol(ioctx, volume, retryctx):
        try:
            RbdProxy().remove(ioctx, volume)
            raise loopingcall.LoopingCallDone(retvalue=False)
        except rbd.ImageHasSnapshots:
            self.remove_snap(volume, libvirt_utils.RESIZE_SNAPSHOT_NAME,
                             ignore_errors=True)
        except (rbd.ImageBusy, rbd.ImageHasSnapshots):
            LOG.warning('rbd remove %(volume)s in pool %(pool)s failed',
                        {'volume': volume, 'pool': self.pool})
        retryctx['retries'] -= 1
        if retryctx['retries'] <= 0:
            raise loopingcall.LoopingCallDone()

    # NOTE(danms): We let it go for ten seconds
    retryctx = {'retries': 10}
    timer = loopingcall.FixedIntervalLoopingCall(
        _cleanup_vol, client.ioctx, volume, retryctx)
    timed_out = timer.start(interval=1).wait()
    if timed_out:
        # NOTE(danms): Run this again to propagate the error, but
        # if it succeeds, don't raise the loopingcall exception
        try:
            _cleanup_vol(client.ioctx, volume, retryctx)
        except loopingcall.LoopingCallDone:
            pass
This method retries _cleanup_vol() up to 10 + 1 times to delete the rbd image; if the image has snapshots, they are deleted first.
Assuming that the uuid of the virtual machine is cbf44290-f142-41f8-86e1-d63c902b38ed, the corresponding rbd command is roughly:
for image in $(rbd -p openstack ls | grep '^cbf44290-f142-41f8-86e1-d63c902b38ed'); do
    rbd -p openstack rm "$image"
done
4 Cinder
4.1 Cinder introduction
Cinder is the block storage service of OpenStack, similar to AWS EBS; the entity it manages is the volume. Cinder does not provide volume storage itself; it is responsible for managing volumes on various storage systems, such as Ceph, Fujitsu, NetApp, etc., and supports volume creation, snapshots, backups and other features. A connected storage system is called a backend. As long as you implement the interfaces defined by the VolumeDriver class in cinder/volume/driver.py, Cinder can use that storage system.
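As a purely illustrative sketch (not the full interface; the real VolumeDriver base class defines many more methods, and the bodies here are placeholders), a backend driver has roughly this shape:

class ExampleVolumeDriver(object):
    """Illustrative only: minimal shape of a Cinder volume driver."""

    def create_volume(self, volume):
        # Allocate a volume of volume.size GB on the backend storage.
        raise NotImplementedError

    def delete_volume(self, volume):
        raise NotImplementedError

    def create_snapshot(self, snapshot):
        raise NotImplementedError

    def delete_snapshot(self, snapshot):
        raise NotImplementedError

    def create_volume_from_snapshot(self, volume, snapshot):
        raise NotImplementedError

    def initialize_connection(self, volume, connector):
        # Return connection info (for rbd: the pool/image to attach).
        raise NotImplementedError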
Cinder not only supports the management of local volume, but also backs up the local volume to the remote storage system, such as another Ceph cluster or Swift object storage system. This paper will only consider backing up from the source Ceph cluster to the remote Ceph cluster.
4.2 create volume
Volume creation is performed by the cinder-volume service; the entry point is the create_volume() method of cinder/volume/manager.py:
def create_volume(self, context, volume, request_spec=None,
                  filter_properties=None, allow_reschedule=True):
    ...
    try:
        # NOTE(flaper87): Driver initialization is
        # verified by the task itself.
        flow_engine = create_volume.get_flow(context_elevated,
                                             self,
                                             self.db,
                                             self.driver,
                                             self.scheduler_rpcapi,
                                             self.host,
                                             volume,
                                             allow_reschedule,
                                             context,
                                             request_spec,
                                             filter_properties,
                                             image_volume_cache=self.image_volume_cache,
                                             )
    except Exception:
        msg = _("Create manager volume flow failed.")
        LOG.exception(msg, resource={'type': 'volume', 'id': volume.id})
        raise exception.CinderException(msg)
    ...
Cinder's volume creation uses the taskflow framework. The concrete flow is implemented in cinder/volume/flows/manager/create_volume.py; we focus on its execute() method:
def execute(self, context, volume, volume_spec):
    ...
    if create_type == 'raw':
        model_update = self._create_raw_volume(volume, **volume_spec)
    elif create_type == 'snap':
        model_update = self._create_from_snapshot(context, volume,
                                                  **volume_spec)
    elif create_type == 'source_vol':
        model_update = self._create_from_source_volume(
            context, volume, **volume_spec)
    elif create_type == 'image':
        model_update = self._create_from_image(context,
                                               volume, **volume_spec)
    else:
        raise exception.VolumeTypeNotFound(volume_type_id=create_type)
    ...
From the code, we can see that volume creation can be divided into four types:
- raw: create a blank volume.
- create from snapshot: create a volume from a snapshot.
- create from volume: equivalent to copying an existing volume.
- create from image: create a volume from a Glance image.
raw
Creating a blank volume is the easiest way. The code is as follows:
def _create_raw_volume(self, volume, **kwargs):
    ret = self.driver.create_volume(volume)
    ...
It directly calls the driver's create_volume() method; the driver here is RBDDriver, and the code is in cinder/volume/drivers/rbd.py:
def create_volume(self, volume):
    with RADOSClient(self) as client:
        self.RBDProxy().create(client.ioctx,
                               vol_name,
                               size,
                               order,
                               old_format=False,
                               features=client.features)
        try:
            volume_update = self._enable_replication_if_needed(volume)
        except Exception:
            self.RBDProxy().remove(client.ioctx, vol_name)
            err_msg = (_('Failed to enable image replication'))
            raise exception.ReplicationError(reason=err_msg,
                                             volume_id=volume.id)
The unit of size is MB, and vol_name is volume-${volume_uuid}.
Assuming that the uuid of the volume is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the Ceph pool is openstack, and the created volume size is 1GB, the corresponding rbd command is equivalent to:
rbd -p openstack create \
    --new-format --size 1024 \
    volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
We can verify through the rbd command:
int32bit $ rbd -p openstack ls | grep bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
create from snapshot
Creating a volume from a snapshot is also a method of directly calling the driver, as follows:
def _create_from_snapshot(self, context, volume, snapshot_id,
                          **kwargs):
    snapshot = objects.Snapshot.get_by_id(context, snapshot_id)
    model_update = self.driver.create_volume_from_snapshot(volume,
                                                           snapshot)
Let's look at RBDDriver's create_volume_from_snapshot() method:
def create_volume_from_snapshot(self, volume, snapshot):
    """Creates a volume from a snapshot."""
    volume_update = self._clone(volume, self.configuration.rbd_pool,
                                snapshot.volume_name, snapshot.name)
    if self.configuration.rbd_flatten_volume_from_snapshot:
        self._flatten(self.configuration.rbd_pool, volume.name)
    if int(volume.size):
        self._resize(volume)
    return volume_update
From the code, creating a volume from a snapshot takes three steps:
- Perform a clone operation from the rbd snapshot.
- If rbd_flatten_volume_from_snapshot is configured as True, perform a flatten operation.
- If a size was specified at creation time, perform a resize operation.
Assume the uuid of the newly created volume is e6bc8618-879b-4655-aac0-05e5a1ce0e06, the snapshot is named snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86, the snapshot's source volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the specified size is 2, and rbd_flatten_volume_from_snapshot is False (the default); the corresponding rbd commands are:
rbd clone \
    openstack/volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86 \
    openstack/volume-e6bc8618-879b-4655-aac0-05e5a1ce0e06
rbd resize --size 2048 \
    openstack/volume-e6bc8618-879b-4655-aac0-05e5a1ce0e06
From the source code analysis, when Cinder creates a volume from a snapshot, the user can configure whether to execute the flatten operation:
- If the flatten operation is performed, creating a volume from a snapshot may take several minutes, but the snapshot can be deleted at any time after the volume is created.
- If the flatten operation is not performed, you cannot delete the snapshot, or the snapshot's source volume, until every volume created from that snapshot has been deleted.
The second point can become quite involved: for example, if a volume is created from a snapshot, a snapshot is then taken of that volume, and another volume is created from that new snapshot, the user can delete neither the source volume nor the snapshots.
create from volume
To create a volume from another volume, you need to specify the source volume id (source_volid):
def _create_from_source_volume(self, context, volume, source_volid,
                               **kwargs):
    srcvol_ref = objects.Volume.get_by_id(context, source_volid)
    model_update = self.driver.create_cloned_volume(volume, srcvol_ref)
Let's look directly at the driver's create_cloned_volume() method, which involves a very important configuration option, rbd_max_clone_depth, the maximum clone chain depth allowed for an rbd image; rbd_max_clone_depth <= 0 means cloning is not allowed:
# Do full copy if requested
if self.configuration.rbd_max_clone_depth <= 0:
    with RBDVolumeProxy(self, src_name, read_only=True) as vol:
        vol.copy(vol.ioctx, dest_name)
        self._extend_if_required(volume, src_vref)
    return
This is equivalent to the copy command of rbd.
If rbd_max_clone_depth > 0:
# Otherwise do COW clone.
with RADOSClient(self) as client:
    src_volume = self.rbd.Image(client.ioctx, src_name)
    LOG.debug("creating snapshot='%s'", clone_snap)
    try:
        # Create new snapshot of source volume
        src_volume.create_snap(clone_snap)
        src_volume.protect_snap(clone_snap)
        # Now clone source volume snapshot
        LOG.debug("cloning '%(src_vol)s@%(src_snap)s' to "
                  "'%(dest)s'",
                  {'src_vol': src_name, 'src_snap': clone_snap,
                   'dest': dest_name})
        self.RBDProxy().clone(client.ioctx, src_name, clone_snap,
                              client.ioctx, dest_name,
                              features=client.features)
This process is very similar to creating a virtual machine snapshot: both first create a snapshot of the source image and then clone from that snapshot. The difference is whether flatten is performed: when creating a virtual machine snapshot, flatten is always performed, while here it depends on the clone depth:
depth = self._get_clone_depth(client, src_name)
if depth >= self.configuration.rbd_max_clone_depth:
    dest_volume = self.rbd.Image(client.ioctx, dest_name)
    try:
        dest_volume.flatten()
    except Exception as e:
        ...
    try:
        src_volume.unprotect_snap(clone_snap)
        src_volume.remove_snap(clone_snap)
    except Exception as e:
        ...
If the current depth exceeds the maximum allowed depth rbd_max_clone_depth, the flatten operation is performed and the temporary snapshot is deleted.
Assuming that the created volume uuid is 3b8b15a4-3020-41a0-80be-afaa35ed5eef and the source volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the corresponding rbd command is:
VOLID=3b8b15a4-3020-41a0-80be-afaa35ed5eef
SOURCE_VOLID=bf2d1c54-6c98-4a78-9c20-3e8ea033c3db
CINDER_POOL=openstack

# Do full copy if rbd_max_clone_depth <= 0.
if [[ "$rbd_max_clone_depth" -le 0 ]]; then
    rbd copy \
        ${CINDER_POOL}/volume-${SOURCE_VOLID} \
        ${CINDER_POOL}/volume-${VOLID}
    exit 0
fi

# Otherwise do COW clone.
# Create new snapshot of source volume
rbd snap create \
    ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
rbd snap protect \
    ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap

# Now clone source volume snapshot
rbd clone \
    ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap \
    ${CINDER_POOL}/volume-${VOLID}

# If dest volume is a clone and rbd_max_clone_depth reached,
# flatten the dest after cloning.
depth=$(get_clone_depth ${CINDER_POOL}/volume-${VOLID})
if [[ "$depth" -ge "$rbd_max_clone_depth" ]]; then
    # Flatten destination volume
    rbd flatten ${CINDER_POOL}/volume-${VOLID}
    # remove temporary snap
    rbd snap unprotect \
        ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
    rbd snap rm \
        ${CINDER_POOL}/volume-${SOURCE_VOLID}@volume-${VOLID}.clone_snap
fi
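Note that get_clone_depth in the script above is not a real rbd subcommand; it stands in for RBDDriver._get_clone_depth(), which walks the parent chain of the source image. A rough, illustrative shell approximation could walk the chain via rbd info:

# Illustrative approximation: count how many ancestors an image has.
get_clone_depth() {
    local image=$1 depth=0 parent
    while parent=$(rbd info "$image" 2>/dev/null | awk '/parent:/{print $2}'); [[ -n "$parent" ]]; do
        depth=$((depth + 1))
        image=${parent%@*}   # drop the @snapshot suffix to follow the parent image
    done
    echo "$depth"
}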
When rbd_max_clone_depth > 0 and depth < rbd_max_clone_depth, we can verify with the rbd command:
int32bit $ rbd info volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef
rbd image 'volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef':
    size 1024 MB in 256 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.ae2e437c177a
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Wed Nov 22 12:32:09 2017
    parent: openstack/volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef.clone_snap
    overlap: 1024 MB
It can be seen that the parent of volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef is:
volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@volume-3b8b15a4-3020-41a0-80be-afaa35ed5eef.clone_snap.
create from image
When creating a volume from an image, assuming Glance and Cinder use the same Ceph cluster, Cinder can clone directly from Glance without downloading the image:
def _create_from_image(self, context, volume,
                       image_location, image_id, image_meta,
                       image_service, **kwargs):
    ...
    model_update, cloned = self.driver.clone_image(
        context,
        volume,
        image_location,
        image_meta,
        image_service)
    ...
Let's look at the driver's clone_image() method:
def clone_image(self, context, volume,
                image_location, image_meta,
                image_service):
    # iterate all locations to look for a cloneable one.
    for url_location in url_locations:
        if url_location and self._is_cloneable(
                url_location, image_meta):
            _prefix, pool, image, snapshot = \
                self._parse_location(url_location)
            volume_update = self._clone(volume, pool, image, snapshot)
            volume_update['provider_location'] = None
            self._resize(volume)
            return volume_update, True
    return ({}, False)
It performs an rbd clone directly, basically the same as when creating a virtual machine; if a new size was specified when creating the volume, rbd resize is called afterwards to expand it.
Assuming the newly created volume's uuid is 87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4 and the Glance image uuid is db2b6552-394a-42d2-9de8-2295fe2b3180, the rbd commands are:
rbd clone \
    openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap \
    openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4

if [[ -n "$size" ]]; then
    rbd resize --size $size \
        openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4
fi
Verify the following with the rbd command:
int32bit $ rbd info openstack/volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4
rbd image 'volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4':
    size 3072 MB in 768 objects
    order 22 (4096 kB objects)
    block_name_prefix: rbd_data.affc488ac1a
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Wed Nov 22 13:07:50 2017
    parent: openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap
    overlap: 2048 MB
It can be seen that the parent of the newly created rbd image is openstack/db2b6552-394a-42d2-9de8-2295fe2b3180@snap .
Note: personally I think this path should also perform a flatten operation. Otherwise, as long as such volumes exist, Glance cannot delete the image, which effectively makes the Glance service depend on the state of the Cinder service, and that seems a little unreasonable.
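If you want to break that dependency manually for a particular volume, the flatten operation from section 1.1 can be applied to the cloned volume (uuid from the example above), after which the Glance image no longer lists it as a child:

rbd -p openstack flatten volume-87ee1ec6-3fe4-413b-a4c0-8ec7756bf1b4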
4.3 creating snapshots
The entry point for creating a snapshot is the create_snapshot() method of cinder/volume/manager.py. It does not use the taskflow framework but calls the driver's create_snapshot() method directly, as follows:
...
try:
    utils.require_driver_initialized(self.driver)
    snapshot.context = context
    model_update = self.driver.create_snapshot(snapshot)
    ...
except Exception:
    ...
RBDDriver's create_snapshot() method is very simple:
def create_snapshot(self, snapshot):
    """Creates an rbd snapshot."""
    with RBDVolumeProxy(self, snapshot.volume_name) as volume:
        snap = utils.convert_str(snapshot.name)
        volume.create_snap(snap)
        volume.protect_snap(snap)
Therefore, the volume snapshot is actually the corresponding Ceph rbd image snapshot. Assuming that the snapshot uuid is e4e534fc-420b-45c6-8e9f-b23dcfcb7f86 and the volume uuid is bf2d1c54-6c98-4a78-9c20-3e8ea033c3db, the corresponding rbd commands are roughly as follows:
rbd -p openstack snap create \
    volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86
rbd -p openstack snap protect \
    volume-bf2d1c54-6c98-4a78-9c20-3e8ea033c3db@snapshot-e4e534fc-420b-45c6-8e9f-b23dcfcb7f86
Here we can see the difference between a virtual machine snapshot and a volume snapshot: a virtual machine snapshot needs to clone from the root disk's rbd snapshot and then flatten, while a volume snapshot only needs to create an rbd image snapshot. That is why a virtual machine snapshot usually takes several minutes, whereas a volume snapshot completes in seconds.
4.4 create volume backup
Before looking at volume backup, we need to clarify the difference between a snapshot and a backup. Using the git analogy: a snapshot is like git commit, it only records that the data was committed and is mainly used for backtracking and rollback; when the cluster crashes and data is lost, it is usually impossible to fully recover from snapshots. A backup is like git push, it pushes the data safely to a remote storage system and is mainly used to guarantee data safety; even if the local data is lost, it can be recovered from the backup. Cinder's volume backup also supports a variety of storage backends; here we only consider the case where both the volume and backup drivers use Ceph. For other details, refer to Principle and practice of Cinder data volume backup. In production, volumes and backups must use different Ceph clusters so that when the volume Ceph cluster goes down, data can be quickly recovered from the other cluster. This article only tests the functionality, so the same Ceph cluster is used, distinguished by pool: volumes use the openstack pool and backups use the cinder_backup pool.
In addition, Cinder supports incremental backup; the --incremental parameter can be used to choose full or incremental backup. However, for the Ceph backend, Cinder always attempts an incremental backup first and falls back to a full backup only if the incremental backup fails, regardless of whether the user specified --incremental. Nevertheless, we still divide backups into full and incremental; note that only the first backup can be a full backup, and all subsequent ones are incremental.
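For reference, backups are requested through the Cinder CLI roughly as follows (the volume uuid is the one used in the examples below); as explained above, with the Ceph driver the presence or absence of --incremental makes no practical difference after the first backup:

# First backup of the volume (necessarily a full backup).
cinder backup-create --name backup-1 075c06ed-37e2-407d-b998-e270c4edc53c
# Later backups; with the Ceph backend these are incremental either way.
cinder backup-create --name backup-2 --incremental 075c06ed-37e2-407d-b998-e270c4edc53c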
Full backup (first backup)
Let's look directly at CephBackupDriver's backup() method; the code is in cinder/backup/drivers/ceph.py.
if self._file_is_rbd(volume_file):
    # If volume an RBD, attempt incremental backup.
    LOG.debug("Volume file is RBD: attempting incremental backup.")
    try:
        updates = self._backup_rbd(backup, volume_file,
                                   volume.name, length)
    except exception.BackupRBDOperationFailed:
        LOG.debug("Forcing full backup of volume %s.", volume.id)
        do_full_backup = True
Here, we mainly judge whether the source volume is rbd, that is, whether the Ceph backend is used. Incremental backup can be performed only when the volume also uses the Ceph storage backend.
Let's look at the _backup_rbd() method:
from_snap = self._get_most_recent_snap(source_rbd_image)

base_name = self._get_backup_base_name(volume_id, diff_format=True)
image_created = False
with rbd_driver.RADOSClient(self, backup.container) as client:
    if base_name not in self.rbd.RBD().list(ioctx=client.ioctx):
        ...
        # Create new base image
        self._create_base_image(base_name, length, client)
        image_created = True
    else:
        ...
from_snap is the snapshot point of the last backup. Since this is the first backup, from_snap is None. base_name has the form volume-%s.backup.base; what is this base for? The _create_base_image() method makes it clear:
def _create_base_image(self, name, size, rados_client):
    old_format, features = self._get_rbd_support()
    self.rbd.RBD().create(ioctx=rados_client.ioctx,
                          name=name,
                          size=size,
                          old_format=old_format,
                          features=features,
                          stripe_unit=self.rbd_stripe_unit,
                          stripe_count=self.rbd_stripe_count)
As you can see, the base is simply an empty volume of the same size as the source volume.
In other words, for the first backup, an empty volume of the same size as the source volume is first created in the backup Ceph cluster.
Let's continue to look at the source code:
def _backup_rbd(self, backup, volume_file, volume_name, length):
    ...
    new_snap = self._get_new_snap_name(backup.id)
    LOG.debug("Creating backup snapshot='%s'", new_snap)
    source_rbd_image.create_snap(new_snap)
    try:
        self._rbd_diff_transfer(volume_name, rbd_pool, base_name,
                                backup.container,
                                src_user=rbd_user,
                                src_conf=rbd_conf,
                                dest_user=self._ceph_backup_user,
                                dest_conf=self._ceph_backup_conf,
                                src_snap=new_snap,
                                from_snap=from_snap)

def _get_new_snap_name(self, backup_id):
    return utils.convert_str("backup.%s.snap.%s"
                             % (backup_id, time.time()))
First, a new snapshot is created on the source volume, named backup.${backup_id}.snap.${timestamp}, and then the _rbd_diff_transfer() method is called:
def _rbd_diff_transfer(self, src_name, src_pool, dest_name, dest_pool,
                       src_user, src_conf, dest_user, dest_conf,
                       src_snap=None, from_snap=None):
    src_ceph_args = self._ceph_args(src_user, src_conf, pool=src_pool)
    dest_ceph_args = self._ceph_args(dest_user, dest_conf, pool=dest_pool)

    cmd1 = ['rbd', 'export-diff'] + src_ceph_args
    if from_snap is not None:
        cmd1.extend(['--from-snap', from_snap])
    if src_snap:
        path = utils.convert_str("%s/%s@%s"
                                 % (src_pool, src_name, src_snap))
    else:
        path = utils.convert_str("%s/%s" % (src_pool, src_name))
    cmd1.extend([path, '-'])

    cmd2 = ['rbd', 'import-diff'] + dest_ceph_args
    rbd_path = utils.convert_str("%s/%s" % (dest_pool, dest_name))
    cmd2.extend(['-', rbd_path])

    ret, stderr = self._piped_execute(cmd1, cmd2)
    if ret:
        msg = (_("RBD diff op failed - (ret=%(ret)s stderr=%(stderr)s)") %
               {'ret': ret, 'stderr': stderr})
        LOG.info(msg)
        raise exception.BackupRBDOperationFailed(msg)
This method shells out to the rbd command: first the changes of the source rbd image are exported with the export-diff subcommand, and then imported into the backup image with import-diff.
Assuming that the uuid of the source volume is 075c06ed-37e2-407d-b998-e270c4edc53c, the size is 1GB, and the backup uuid is db563496-0c15-4349-95f3-fc5194bfb11a, the corresponding rbd commands are roughly as follows:
VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
BACKUP_ID=db563496-0c15-4349-95f3-fc5194bfb11a

rbd -p cinder_backup create \
    --size 1024 \
    volume-${VOLUME_ID}.backup.base

new_snap=volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511344566.67
rbd -p openstack snap create ${new_snap}

rbd export-diff --pool openstack ${new_snap} - \
    | rbd import-diff --pool cinder_backup - volume-${VOLUME_ID}.backup.base
We can verify the following with the rbd command:
# volume ceph cluster
int32bit $ rbd -p openstack snap ls volume-075c06ed-37e2-407d-b998-e270c4edc53c
SNAPID NAME                                                            SIZE    TIMESTAMP
    52 backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024 MB Wed Nov 22 17:56:15 2017

# backup ceph cluster
int32bit $ rbd -p cinder_backup ls -l
NAME                                                                                                                   SIZE  PARENT FMT PROT LOCK
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base                                                                1024M        2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024M        2
From the output, the source volume got a snapshot with ID 52, and in the backup Ceph cluster an empty volume volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base was created, containing a snapshot backup.xxx.snap.1511344566.67 that was created by import-diff.
Incremental backup
The earlier part of the flow is the same as for a full backup; let's skip to the _backup_rbd() method:
from_snap = self._get_most_recent_snap(source_rbd_image)
with rbd_driver.RADOSClient(self, backup.container) as client:
    if base_name not in self.rbd.RBD().list(ioctx=client.ioctx):
        ...
    else:
        if not self._snap_exists(base_name, from_snap, client):
            errmsg = (_("Snapshot='%(snap)s' does not exist in base "
                        "image='%(base)s' - aborting incremental "
                        "backup") %
                      {'snap': from_snap, 'base': base_name})
            LOG.info(errmsg)
            raise exception.BackupRBDOperationFailed(errmsg)
First, the latest snapshot of the source volume's rbd image is obtained as the parent snapshot, and then we check whether the same snapshot exists on the base image in the backup Ceph cluster (given the previous full backup, a snapshot with the same name as the source volume's must exist).
Let's continue to look at the following parts:
new_snap = self._get_new_snap_name(backup.id)
source_rbd_image.create_snap(new_snap)
try:
    before = time.time()
    self._rbd_diff_transfer(volume_name, rbd_pool, base_name,
                            backup.container,
                            src_user=rbd_user,
                            src_conf=rbd_conf,
                            dest_user=self._ceph_backup_user,
                            dest_conf=self._ceph_backup_conf,
                            src_snap=new_snap,
                            from_snap=from_snap)
    if from_snap:
        source_rbd_image.remove_snap(from_snap)
This is basically the same as a full backup; the only difference is that from_snap is not None, and from_snap is deleted afterwards. For the _rbd_diff_transfer() method, see the code above.
Assuming that the source volume uuid is 075c06ed-37e2-407d-b998-e270c4edc53c, the backup uuid is e3db9e85-d352-47e2-bced-5bad68da853b, and the parent backup uuid is db563496-0c15-4349-95f3-fc5194bfb11a, the corresponding rbd commands are roughly as follows:
VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
BACKUP_ID=e3db9e85-d352-47e2-bced-5bad68da853b
PARENT_ID=db563496-0c15-4349-95f3-fc5194bfb11a

rbd -p openstack snap create \
    volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511348180.27

rbd export-diff --pool openstack \
    --from-snap backup.${PARENT_ID}.snap.1511344566.67 \
    openstack/volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.1511348180.27 - \
    | rbd import-diff --pool cinder_backup - \
        cinder_backup/volume-${VOLUME_ID}.backup.base

rbd -p openstack snap rm \
    volume-${VOLUME_ID}@backup.${PARENT_ID}.snap.1511344566.67
We verify the following with the rbd command:
int32bit $ rbd -p openstack snap ls volume-075c06ed-37e2-407d-b998-e270c4edc53c
SNAPID NAME                                                            SIZE    TIMESTAMP
    53 backup.e3db9e85-d352-47e2-bced-5bad68da853b.snap.1511348180.27 1024 MB Wed Nov 22 18:56:20 2017

int32bit $ rbd -p cinder_backup ls -l
NAME                                                                                                                   SIZE  PARENT FMT PROT LOCK
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base                                                                1024M        2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.db563496-0c15-4349-95f3-fc5194bfb11a.snap.1511344566.67 1024M        2
volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base@backup.e3db9e85-d352-47e2-bced-5bad68da853b.snap.1511348180.27 1024M        2
This matches our analysis: the source volume deletes the old snapshot and keeps only the latest one, while the backup keeps all snapshots.
4.5 backup and recovery
Backup restore is the reverse of backup, i.e. restoring data from the remote storage back to the local storage. The restore source code is in the restore() method of cinder/backup/drivers/ceph.py, which directly calls the _restore_volume() method, so let's look at _restore_volume() directly:
def _restore_volume(self, backup, volume, volume_file):
    length = int(volume.size) * units.Gi

    base_name = self._get_backup_base_name(backup.volume_id,
                                           diff_format=True)
    with rbd_driver.RADOSClient(self, backup.container) as client:
        diff_allowed, restore_point = \
            self._diff_restore_allowed(base_name, backup, volume,
                                       volume_file, client)
Among these, _diff_restore_allowed() is a very important method that decides whether restoring via differential import is supported. Let's look at its implementation:
def _diff_restore_allowed(self, base_name, backup, volume, volume_file,
                          rados_client):
    rbd_exists, base_name = self._rbd_image_exists(base_name,
                                                   backup.volume_id,
                                                   rados_client)
    if not rbd_exists:
        return False, None

    restore_point = self._get_restore_point(base_name, backup.id)
    if restore_point:
        if self._file_is_rbd(volume_file):
            if volume.id == backup.volume_id:
                return False, restore_point
            if self._rbd_has_extents(volume_file.rbd_image):
                return False, restore_point
            return True, restore_point
From this method we can see that all of the following conditions must be met for the restore to use differential import (the first two can also be checked by hand with rbd, as sketched after the list):
The rbd base image corresponding to the volume must exist in the backup cluster.
The restore point must exist, i.e. the snapshot of the backup base image corresponding to this backup must exist.
The restore target volume must be RBD, i.e. the volume's storage backend must also be Ceph.
The restore target volume must be empty; restoring over an image with existing content is not supported.
The restore target volume uuid must differ from the backup's source volume uuid, i.e. the original volume cannot be overwritten.
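For reference, the first two conditions can be verified manually with ordinary rbd commands. A minimal sketch, reusing the backup pool name and uuids from the earlier example (adjust them to your own environment):
# Condition 1: the backup base image must exist in the backup cluster.
rbd -p cinder_backup info \
    volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base
# Condition 2: the restore point, i.e. the snapshot created for this backup,
# must exist on the backup base image.
rbd -p cinder_backup snap ls \
    volume-075c06ed-37e2-407d-b998-e270c4edc53c.backup.base \
    | grep backup.e3db9e85-d352-47e2-bced-5bad68da853b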
In other words, although Cinder supports restoring data to an existing volume (including the source volume), the Ceph backend cannot use incremental restore in that case, which makes it very inefficient.
Therefore, when the Ceph storage backend is used, the official documentation recommends restoring the backup to a new empty volume (no volume specified) rather than to an existing volume:
Note that Cinder supports restoring to a new volume or the original volume the backup was taken from.
For the latter case, a full copy is enforced since this was deemed the safest action to take.
It is therefore recommended to always restore to a new volume (default).
Here we restore to a new empty volume; the command is as follows:
cinder backup-restore \
    --name int32bit-restore-1 \
    e3db9e85-d352-47e2-bced-5bad68da853b
Note that we did not specify the --volume parameter, so an incremental restore is performed. The code is as follows:
def _diff_restore_rbd(self, backup, restore_file, restore_name,
                      restore_point, restore_length):
    rbd_user = restore_file.rbd_user
    rbd_pool = restore_file.rbd_pool
    rbd_conf = restore_file.rbd_conf
    base_name = self._get_backup_base_name(backup.volume_id,
                                           diff_format=True)

    before = time.time()
    try:
        self._rbd_diff_transfer(base_name, backup.container, restore_name,
                                rbd_pool,
                                src_user=self._ceph_backup_user,
                                src_conf=self._ceph_backup_conf,
                                dest_user=rbd_user, dest_conf=rbd_conf,
                                src_snap=restore_point)
    except exception.BackupRBDOperationFailed:
        raise

    self._check_restore_vol_size(backup, restore_name, restore_length,
                                 rbd_pool)
As you can see, incremental restore is very simple: it just calls the previously described _rbd_diff_transfer() method to export-diff the snapshot of the base image in the backup Ceph cluster and import-diff it into the volume's Ceph cluster, and then adjusts the volume size.
Assuming that the backup uuid is e3db9e85-d352-47e2-bced-5bad68da853b, the source volume uuid is 075c06ed-37e2-407d-b998-e270c4edc53c, and the target volume uuid is f65cf534-5266-44bb-ad57-ddba21d9e5f9, the corresponding rbd command is:
BACKUP_ID=e3db9e85-d352-47e2-bced-5bad68da853b
SOURCE_VOLUME_ID=075c06ed-37e2-407d-b998-e270c4edc53c
DEST_VOLUME_ID=f65cf534-5266-44bb-ad57-ddba21d9e5f9
rbd export-diff --pool cinder_backup \
    cinder_backup/volume-${SOURCE_VOLUME_ID}.backup.base@backup.${BACKUP_ID}.snap.1511348180.27 - \
    | rbd import-diff --pool openstack - \
    openstack/volume-${DEST_VOLUME_ID}
rbd -p openstack resize \
    --size ${new_size} volume-${DEST_VOLUME_ID}
If any of the five conditions above is not met, Cinder falls back to a full restore, copying the data chunk by chunk:
def _transfer_data(self, src, src_name, dest, dest_name, length):
    chunks = int(length / self.chunk_size)
    for chunk in range(0, chunks):
        before = time.time()
        data = src.read(self.chunk_size)
        dest.write(data)
        dest.flush()
        delta = (time.time() - before)
        rate = (self.chunk_size / delta) / 1024
        # yield to any other pending backups
        eventlet.sleep(0)

    rem = int(length % self.chunk_size)
    if rem:
        data = src.read(rem)
        dest.write(data)
        dest.flush()
        # yield to any other pending backups
        eventlet.sleep(0)
In this case, it is inefficient and time-consuming, and is not recommended.
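For completeness, the typical way to hit this full-copy path is restoring a backup back into the volume it was taken from. A minimal sketch using the uuids from the earlier example (the --volume option of cinder backup-restore; exact client syntax may vary between releases):
# Restoring into the source volume: the checks in _diff_restore_allowed()
# fail (same uuid, image already contains data), so the driver falls back
# to the chunk-by-chunk full copy shown above.
cinder backup-restore \
    --volume 075c06ed-37e2-407d-b998-e270c4edc53c \
    e3db9e85-d352-47e2-bced-5bad68da853b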
5 Summary
5.1 Glance
1. Upload image
rbd -p ${GLANCE_POOL} create --size ${SIZE} ${IMAGE_ID}
rbd -p ${GLANCE_POOL} snap create ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap protect ${IMAGE_ID}@snap
2. Delete image
rbd -p ${GLANCE_POOL} snap unprotect ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap rm ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} rm ${IMAGE_ID}
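Note that the unprotect step fails if the image snapshot still has clones (for example Nova disks or Cinder volumes created from this image). A minimal sketch of how to check for remaining clones before deleting, using the same placeholders as above:
# List the clones that still depend on the image snapshot; they must be
# flattened or deleted before the snapshot can be unprotected and removed.
rbd -p ${GLANCE_POOL} children ${IMAGE_ID}@snap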
5.2 Nova
1 create virtual machine
rbd clone \
    ${GLANCE_POOL}/${IMAGE_ID}@snap \
    ${NOVA_POOL}/${SERVER_ID}_disk
2 create virtual machine snapshot
# Snapshot the disk and clone it into Glance's storage pool
rbd -p ${NOVA_POOL} snap create \
    ${SERVER_ID}_disk@${RANDOM_UUID}
rbd -p ${NOVA_POOL} snap protect \
    ${SERVER_ID}_disk@${RANDOM_UUID}
rbd clone \
    ${NOVA_POOL}/${SERVER_ID}_disk@${RANDOM_UUID} \
    ${GLANCE_POOL}/${IMAGE_ID}
# Flatten the image, which detaches it from the source snapshot
rbd -p ${GLANCE_POOL} flatten ${IMAGE_ID}
# All done with the source snapshot, clean it up
rbd -p ${NOVA_POOL} snap unprotect \
    ${SERVER_ID}_disk@${RANDOM_UUID}
rbd -p ${NOVA_POOL} snap rm \
    ${SERVER_ID}_disk@${RANDOM_UUID}
# Make a protected snapshot called 'snap' on the uploaded image and hand it out
rbd -p ${GLANCE_POOL} snap create ${IMAGE_ID}@snap
rbd -p ${GLANCE_POOL} snap protect ${IMAGE_ID}@snap
3 delete virtual machine
for image in $(rbd -p ${NOVA_POOL} ls | grep "^${SERVER_ID}"); do
    rbd -p ${NOVA_POOL} rm "$image"
done
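Note that rbd rm fails if a disk image still has snapshots, so any leftover snapshots have to be removed first. A minimal sketch (assuming no protected snapshots remain; those would need to be unprotected first):
# Purge all remaining (unprotected) snapshots of the root disk so that
# the rm loop above can succeed.
rbd -p ${NOVA_POOL} snap purge ${SERVER_ID}_disk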
5.3 Cinder
1 create volume
(1) Create a blank volume
rbd -p ${CINDER_POOL} create \
    --new-format --size ${SIZE} \
    volume-${VOLUME_ID}
(2) Create from snapshot
rbd clone \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@snapshot-${SNAPSHOT_ID} \
    ${CINDER_POOL}/volume-${VOLUME_ID}
rbd resize --size ${SIZE} \
    ${CINDER_POOL}/volume-${VOLUME_ID}
(3) Create from volume
# Do full copy if rbd_max_clone_depth <= 0.
if [[ "$rbd_max_clone_depth" -le 0 ]]; then
    rbd copy \
        ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID} \
        ${CINDER_POOL}/volume-${VOLUME_ID}
    exit 0
fi
# Otherwise do COW clone.
# Create new snapshot of source volume
rbd snap create \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
rbd snap protect \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
# Now clone source volume snapshot
rbd clone \
    ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap \
    ${CINDER_POOL}/volume-${VOLUME_ID}
# If dest volume is a clone and rbd_max_clone_depth is reached,
# flatten the dest after cloning.
depth=$(get_clone_depth ${CINDER_POOL}/volume-${VOLUME_ID})
if [[ "$depth" -ge "$rbd_max_clone_depth" ]]; then
    # Flatten destination volume
    rbd flatten ${CINDER_POOL}/volume-${VOLUME_ID}
    # Remove temporary snap
    rbd snap unprotect \
        ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
    rbd snap rm \
        ${CINDER_POOL}/volume-${SOURCE_VOLUME_ID}@volume-${VOLUME_ID}.clone_snap
fi
(4) Create from image
rbd clone \
    ${GLANCE_POOL}/${IMAGE_ID}@snap \
    ${CINDER_POOL}/volume-${VOLUME_ID}
if [[ -n "${SIZE}" ]]; then
    rbd resize --size ${SIZE} ${CINDER_POOL}/volume-${VOLUME_ID}
fi
2 create a snapshot
rbd -p ${CINDER_POOL} snap create \
    volume-${VOLUME_ID}@snapshot-${SNAPSHOT_ID}
rbd -p ${CINDER_POOL} snap protect \
    volume-${VOLUME_ID}@snapshot-${SNAPSHOT_ID}
3 create backup
(1) First backup
rbd -p ${BACKUP_POOL} create \
    --size ${VOLUME_SIZE} \
    volume-${VOLUME_ID}.backup.base
NEW_SNAP=volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP}
rbd -p ${CINDER_POOL} snap create ${NEW_SNAP}
rbd export-diff ${CINDER_POOL}/${NEW_SNAP} - \
    | rbd import-diff --pool ${BACKUP_POOL} - \
    volume-${VOLUME_ID}.backup.base
(2) Incremental backup
rbd -p ${CINDER_POOL} snap create \
    volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP}
rbd export-diff --pool ${CINDER_POOL} \
    --from-snap backup.${PARENT_ID}.snap.${LAST_TIMESTAMP} \
    ${CINDER_POOL}/volume-${VOLUME_ID}@backup.${BACKUP_ID}.snap.${TIMESTAMP} - \
    | rbd import-diff --pool ${BACKUP_POOL} - \
    ${BACKUP_POOL}/volume-${VOLUME_ID}.backup.base
rbd -p ${CINDER_POOL} snap rm \
    volume-${VOLUME_ID}@backup.${PARENT_ID}.snap.${LAST_TIMESTAMP}
4 restore backup
rbd export-diff --pool ${BACKUP_POOL} \
    volume-${SOURCE_VOLUME_ID}.backup.base@backup.${BACKUP_ID}.snap.${TIMESTAMP} - \
    | rbd import-diff --pool ${CINDER_POOL} - \
    volume-${DEST_VOLUME_ID}
rbd -p ${CINDER_POOL} resize \
    --size ${new_size} volume-${DEST_VOLUME_ID}
Reference: https://zhuanlan.zhihu.com/p/31581145