=========================
Design of Pool Migration
=========================

Background
==========

The objective of pool migration is to be able to migrate all RADOS objects from one pool to another within the same cluster non-disruptively. This functionality is planned for the Umbrella release.

Use cases for pool migration:

* Changing the erasure code profile (and in particular the choice of K and M) non-disruptively. Implementing this as a non-disruptive migration between pools is simpler and no less efficient than trying to perform this type of transformation in place.
* Converting between replica and erasure coded pools. Changes being made to add OMAP and class support to EC pools will remove the need to have a separate replica pool for metadata when using RBD, CephFS or RGW; this should make these migrations viable in conjunction with this work.
* The general use case of wanting to migrate data between two pools.

By non-disruptive we mean that there will be no time when I/O or applications need to switch from using the old pool to the new pool, not even a very short outage at the start or end of the migration (as is required, for example, by RBD live migration). The migration will however need to read and write every object in the pool, so there will be a performance impact during the migration - similar to that when PGs are backfilling. The same techniques and controls that are used when splitting/merging PGs and backfilling individual PGs will be used by pool migration to manage this impact.

For the first release we will require that the target pool is empty to begin with (this means we don't need to worry about objects with the same name). See the section on avoiding name collisions for more details. Supporting merging of pools (either where constraints prevent object name collisions or where these collisions are resolved automatically during the migration) is a possible future enhancement, but it's not clear what use cases this solves.

During a pool migration restrictions are placed on the source pool: it is not permitted to modify the number of PGs (to cause splits or merges). See the sections on stopping changes to the number of PGs during migration and on the CLI and UI for more details. Deletion of the source pool will not be permitted during a migration. Other actions such as rebalancing are permitted but perhaps should be discouraged as the data is being moved anyway.

During a pool migration restrictions are also placed on the target pool: it is not permitted to migrate this pool to another (i.e. no daisy-chained or cyclical migrations). Splits, merges and rebalancing are permitted. Deletion of the target pool will not be permitted during a migration. Once a migration has finished it is permitted to start a new migration using the target pool of the previous migration as the new source.

For the first release there is no option to cancel, suspend or reverse a pool migration once it has started.

For the first release there is no plan to have the clients update their references to the pool once the migration has completed; they will continue to reference the old pool and the Objecter (``librados``) will reroute requests to the new pool. The ``OSDMap`` will retain stub information for the old pool redirecting to the new pool.

The feature requires changes to client code; all clients and daemons will need to be upgraded before a pool migration is permitted. The two main clients, the Objecter (in ``librados``) and the kernel client, will be updated.
Updates to the kernel client are likely to lag the Umbrella release. Where the clients are integrated into other products (e.g. ``ODF``) these products will need to incorporate the new clients before the feature can be used.

For the first release there is no plan to support pool migration between Ceph clusters. Theoretically this could be added later, building upon the first release code, but it would require substantial extra effort. It would require clients to be able to update references to the cluster and pool once the migration had completed and to be able to redirect I/O to a different cluster. There would also be extra authentication challenges, as all OSDs and clients in the source cluster would need to be able to submit requests to the target cluster.

Design
======

Reuse of Existing Design
------------------------

Let's start by looking at existing code or features that we can copy / reuse / refactor / take inspiration from; we don't want to reinvent the wheel or repeat past mistakes.

Backfill
~~~~~~~~

Backfill is a process run by a PG to recover objects on an OSD that has either just been added to the PG (starts with no objects) or has been absent from the PG for a while (has some objects that are up to date, some that are stale, is probably missing new objects and may have objects that are no longer needed because they were deleted while it was absent). Backfill takes a long time, so I/O must be permitted to continue while the backfill happens. It uses the fact that all objects have a hash and that it is possible to list objects in hash order. This means that backfill can recover objects in hash order and can simply keep a watermark hash value to track what progress has been made. I/Os to objects with a hash below the watermark are to an object that has been recovered and need to update all OSDs including the backfilling OSD. I/Os to objects with a hash above the watermark can ignore the backfilling OSDs as the backfill process will recover the object later. The object(s) currently being recovered by the backfill process are locked to prevent I/O for the short time it takes to backfill an object. Another property of backfill is that the process is idempotent: while there are performance benefits to preserving the watermark, there are no correctness issues if the watermark is reset to the start and the backfill process starts again, as repeating the process will determine that objects have already been recovered. This simplifies the design because the watermark doesn't have to be rigorously replicated and checkpointed, although for backfill it is part of ``pg_info_t`` so progress is checkpointed fairly frequently.

Relevance to pool migration:

* Pool migration can list objects in hash order and migrate them to the new pool.
* A watermark can be used to keep track of which objects have been migrated. There is no need for the watermark to be persistent.
* Clients can cache a copy of the watermark to help direct I/Os to the correct pool and PG.
* The client's cached copy can become stale; if I/Os are misdirected they will be failed with an up-to-date watermark so the I/O can be retried.
* Backfill recovers all parts of a RADOS object - attributes, data and OMAP. Large objects are recovered in phases (something like 2MB at a time) and use a temporary object which is renamed at the end of the recovery for atomicity. If a peering cycle interrupts the process, then the temporary object is discarded. If pool migration uses this technique it needs to be aware that a peering cycle might disrupt the target pool but not the source pool, and may therefore need to restart the migration of the object if the target discards it.
* Backfill recovers a RADOS head object and its associated snapshots at the same time and uses locking (e.g. ``PrimaryLogPG::is_degraded_or_backfilling_object`` and ``PrimaryLogPG::is_unreadable_object``) to ensure that none of these can be accessed while they are being recovered, because of dependencies between them. Pool migration needs to migrate the head object and the snapshots at the same time and needs to ensure we don't process I/O to the object halfway through this process.
* Backfill is meant to preserve the space-efficiency of snapshots when recovering them, using the information in the snapset attribute to work out which regions of the snapshots are clones - see ``calc_clone_subsets`` / ``calc_head_subsets``. This hasn't been implemented for EC pools yet and `tracker 72753 `_ shows it currently isn't working for replica pools either. We will want to use this (and the way a ``PushOp`` re-establishes cloned regions) for migration.

Unlike backfill, for migration we want the clients to know the watermark so they can route I/Os to the old/new pool. We don't care if clients have a stale watermark - this will just cause a few I/Os to be incorrectly routed to the old pool, which can fail them back to the client and communicate a new watermark so the I/O can be resubmitted to the new pool. We deliberately make updating the client's copy of the watermark lazy - there could be hundreds or thousands of clients, so updating them all the time would be expensive. Putting the watermark into the ``OSDMap`` and issuing new epochs to distribute it to all the clients would be even more expensive. In contrast we are thinking about recording which PGs are migrating/have finished migrating in the ``OSDMap`` - a rule of thumb would be to try and only update the ``OSDMap`` once a second during a migration.

For migration to be able to support direct reads we do need all the OSDs in the PG to know where the watermark is and for this to be updated as each object is migrated. Migrating an object involves reading it from the source pool, writing it to the target pool and then deleting it from the source pool. Other OSDs can update migration progress as they process the delete request. There will be some complexity regarding direct reads and migrating an object + its snapshots. There is already some code that fails direct reads with ``EAGAIN`` (to redirect these to the primary) when an object + its snapshots have not all been recovered; we may need to use this when midway through migrating an object + snapshots and then have the primary stall the I/O until the object + snapshots have all been migrated before failing the I/O again for redirection to the new pool.

The watermark doesn't necessarily need to be checkpointed to disk: it is cheap to find the object with the lowest hash in a PG, so we could do this to recalculate the watermark whenever peering starts migration.

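To make the client-side use of the watermark concrete, here is a minimal, self-contained sketch of the routing decision described above. This is not the Objecter implementation; all of the type, field and function names (``pool_migration_t``, ``pg_migration_state``, ``choose_pool``) are illustrative assumptions only.

.. code-block:: c++

   // Minimal sketch, not the actual Objecter code; names are hypothetical.
   #include <cstdint>
   #include <map>

   enum class pg_migration_state { not_started, migrating, done };

   struct pool_migration_t {
     int64_t source_pool;   // pool being migrated
     int64_t target_pool;   // pool being migrated to
     // Per-PG state derived from the OSDMap (which only records which PGs
     // are migrating, updated perhaps once a second).
     std::map<uint32_t, pg_migration_state> pg_state;
   };

   // Watermarks learned lazily from OSD replies; they may be stale.
   using watermark_cache_t = std::map<uint32_t, uint32_t>;  // pg -> hash watermark

   // Decide which pool to route an op to. 'pg' is the source-pool PG that the
   // object's hash maps to.
   int64_t choose_pool(const pool_migration_t& mig, const watermark_cache_t& cache,
                       uint32_t pg, uint32_t hash)
   {
     switch (mig.pg_state.at(pg)) {
     case pg_migration_state::done:
       return mig.target_pool;
     case pg_migration_state::not_started:
       return mig.source_pool;
     case pg_migration_state::migrating: {
       auto it = cache.find(pg);
       if (it == cache.end())
         return mig.source_pool;          // no watermark cached: guess the old pool
       // Objects with a hash below the watermark have already been migrated.
       return hash < it->second ? mig.target_pool : mig.source_pool;
     }
     }
     return mig.source_pool;
   }
   // If the guess is wrong the OSD fails the op and returns an up-to-date
   // watermark; the client updates its cache and resubmits to the other pool.
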
Scheduling Backfill / Recovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deciding how to prioritize backfill/recovery, and how fast to run this process versus processing I/O from clients, is a complex problem. Firstly, a decision is made as to which PGs should be backfilling/recovering and which should wait. This involves messages between OSDs and considers whether I/O is blocked and how much redundancy the PG has left (for example a replica-3 pool with 2 failures is prioritized over a replica-3 pool with 1 failure). Secondly, once a PG has been selected to backfill/recover, the scheduler has to decide how frequently to perform backfill/recovery versus process client I/O. This happens within the primary OSD using weighted costs.

Relevance to pool migration:

* Pool migration is less critical than backfill or recovery. It needs to fit into the same process to determine when a PG should start migrating.
* Once a PG is permitted to start migration the OSD scheduler needs to pace the work. The overheads of migrating an object are similar to the overheads of backfilling an object, so hopefully we can just copy the backfill scheduling for migration.

The objective is to reuse as much of the scheduler (e.g. ``mclock``) as possible, just teaching it that migration has a lower priority than backfill or async recovery but a higher priority than deep scrub. ``Mclock`` works by assigning a weighting to each backfill / recovery op and each client I/O request; it also benchmarks OSDs at startup to get some idea of the maximum performance of the OSD. This information is then used to work out when to schedule background work. The same concepts should work for migration requests. We will need to assign a weighting to migration work; this should be similar or identical to the weighting for backfills. We will take a similar approach for supporting clusters running with ``WeightedPriorityQueue`` scheduling. The expectation is that there should be no need for new tuneable settings for migration; the existing tuneable settings for backfill/recovery should be sufficient, and we don't want to further complicate this part of the UI.

Statistics
~~~~~~~~~~

I believe there are a few statistics collected about the performance of backfill/recovery. We should supplement these with similar statistics about the migration process. We need to consider OSD stats that are gathered by ``Prometheus`` and any progress summary that is presented via ``HealthCheck`` and/or the UI.

CopyFrom
~~~~~~~~

``CopyFrom`` is a RADOS op that can copy the contents of an object into a new object. It is sent to the OSD and PG that will store the new object. That OSD is responsible for reading the source object, which involves sending messages to another OSD and PG, and then writing the data it reads to the new object. If the object being copied is large, then the copy operation is broken up into multiple stages and is made atomic by using a temporary object to store the new data until the last data has been copied, at which point the temporary object can be renamed to become the new object.

Relevance to pool migration:

* Pool migration needs to copy objects from the old pool to a new pool - this will involve one OSD and PG reading the object and another OSD and PG writing the object.
* Pool migration will want to drive the copy operation from the source side, so we probably need a ``CopyTo`` type operation.
* The way messages are sent between OSDs, the way a large object copy is staged and the use of a temporary object name when staging are all concepts that can be reused.

Alternatively, pool migration might want to copy the recovery implementation in ``ECBackend`` that is used to recover or backfill an object.
This also stages the recovery of large objects using a temporary object and uses ``PushOp`` messages to send data to the OSDs being backfilled. It might be possible to use most of the recovery process without changes, just changing the ``PushOp`` messages to be sent to a different PG and sending the messages for all shards, as the entire object is being migrated.

Let's consider the differences between the backend recovery op and ``CopyFrom``:

* ``CopyFrom`` is a process that runs in ``PrimaryLogPG``, above either the replica backend or ``ECBackend``, that copies an object from the primary OSD for one PG to the primary OSD for another PG. In the case of EC the primary OSD may need to issue ``SubOp`` commands to other OSDs to read/write the data.
* ``run_recovery_op``, implemented by replica and EC pools, runs on the primary OSD and reads data (in the case of EC issuing ``SubOp`` commands to other OSDs) but then issues ``PushOp`` commands to write the recovered data to the destination OSDs.
* ``CopyFrom``, working at the ``PrimaryLogPG`` level, ensures that the copied object is included in the PG stats and gets its own PG log entry, so the update can be rolled forward/backwards and can be recovered by async recovery.
* ``run_recovery_op`` is implemented at the ``PGBackend`` level and assumes the PG already has stats and a PG log entry for the object; it is just responsible for bringing other shards in the PG up to date.
* ``CopyFrom`` ends up issuing read and write ops to the ``PGBackend``; it doesn't provide techniques for copying a snapshot and preserving its space-efficiency.
* ``run_recovery_op`` is meant to preserve the space-efficiency of clones (not implemented yet for EC pools, and replica pools have bugs) - the ``PushOp`` message includes a way of describing which parts of an object should be clones.

For pool migration we probably want a hybrid implementation. We can probably re-use a lot of the ``run_recovery_op`` code to read the object that we want to migrate, and ideally handle the space-efficiency of snapshots. Instead of issuing ``PushOp`` messages we probably want to issue a new ``COPY_PUT`` type op to the primary PG of the target pool, but passing the same kind of information as a ``PushOp`` so we can keep track of what needs to be cloned. The target pool can then submit a mixture of write and clone ops to the ``PGBackend`` layer to create the object, as well as updating the PG stats and creating a PG log entry.

Splitting PGs
~~~~~~~~~~~~~

Normally a pool has a number of PGs that is a power of 2. This is because we want each PG to hold roughly the same number of objects, and we use N bits of the object hash to select which PG to use. However, doubling the number of PGs that a pool has causes approximately half the objects in the pool to need to be moved to a new PG. We don't want all this movement to happen at once; we want it to be paced over time to have less impact. To deal with this the MGR controls the increase in the number of PGs: it has a target for how many PGs the pool should have and slowly increases the number of PGs, waiting for PGs to finish recovery before doing further splits. When a pool has a non-power-of-2 number of PGs, not all PGs are the same size. For example, if there are 5 PGs then PGs 0 and 4 will be half the size of PGs 1 to 3, because the choice between PG 0 and 4 is based on one extra bit of the object hash. While this is not desirable as a long-term state it is fine during the splitting process.

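To illustrate where this imbalance comes from, the mapping from an object hash to a PG is approximately the ``ceph_stable_mod`` function, sketched here as a standalone program. The function body mirrors the one in the Ceph sources; the surrounding program and its example numbers are illustrative only.

.. code-block:: c++

   #include <cstdint>
   #include <cstdio>

   // Approximation of ceph_stable_mod: map the low-order bits of an object's
   // hash onto pg_num PGs. b = pg_num, bmask = (next power of 2 >= pg_num) - 1.
   static uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask)
   {
     if ((x & bmask) < b)
       return x & bmask;
     else
       return x & (bmask >> 1);
   }

   int main()
   {
     // With pg_num = 5 (bmask = 7) the eight possible values of the low three
     // hash bits map to PGs 0 1 2 3 4 1 2 3, so PGs 0 and 4 receive half as
     // many objects as PGs 1-3 - the imbalance described above.
     for (uint32_t h = 0; h < 8; ++h)
       std::printf("low hash bits %u -> pg %u\n", h, stable_mod(h, 5, 7));
     return 0;
   }
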
Relevance to pool migration:

* Pool migration needs to migrate all the objects in all the PGs in the old pool to the new pool. Just like splitting, we don't want to overwhelm the system while performing the migration.
* Pool migration should therefore migrate one (or a small number) of PGs at a time.
* A process needs to monitor the progress of migrations, notice when PGs finish migrating and start the next PG. This could either be in the MON (in which case it would need to be event driven, with OSDs telling the MON when a PG has finished migrating - somewhat similar to how PG merges work) or it could be implemented in the MGR (in which case the MGR can poll the state of the PGs and then tell the MON via a CLI command to start the next PG migration).

Direct I/O / Balanced Reads
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EC direct I/O feature is making changes to the client to decide which OSD to send client I/O requests to; it builds on top of the balanced reads flag for replica pools, which tells the client to distribute read I/Os evenly across all the OSDs in a replica PG rather than sending them all to the primary.

Relevance to pool migration:

* It is changing code in the client at a similar place to where we want the client to implement pool migration, deciding which pool (and hence PG and OSD) to send I/O to.
* Direct I/O / balanced reads are permitted to be failed by the OSD that receives the request with ``EAGAIN``, to deal with corner cases where the OSD is unable to process the I/O. In this case the client retries the I/O but sends it to the primary OSD. A similar retry mechanism is going to be required when a client issues an I/O to the wrong pool because an object has recently been migrated. When I/Os are retried we need to worry about ordering, as this generates opportunities for I/Os to overtake or be reordered. See the section on read/write ordering below.
* Direct I/O is adding extra information to the ``pg_pool_t`` structure that is part of the ``OSDMap`` that gets sent to every Ceph daemon and client by the monitor. This extra information is being used to determine that direct I/O is supported and to help work out where to route the I/O request. Pool migration will similarly need to add details to the ``pg_pool_t`` structure so that clients are aware that a migration is happening (a sketch of the kind of information involved follows this list).

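As a rough illustration only - none of the field names below are a settled design, and the real change would also need encode/decode support guarded by a feature bit - the extra per-pool migration information might look something like this:

.. code-block:: c++

   #include <cstdint>
   #include <set>

   // Hypothetical additions to pg_pool_t for pool migration; field names are
   // placeholders, not a proposal for the actual encoding.
   struct pg_pool_migration_t {
     int64_t migrate_to = -1;          // pool id objects are migrating to,
                                       // or -1 if this pool is not migrating
     int64_t migrate_from = -1;        // reverse link recorded on the target pool
     std::set<uint32_t> migrating_pgs; // source PGs currently migrating; the
                                       // walkthrough below describes how the
                                       // set is interpreted
     bool migration_complete = false;  // the pool is now just a stub redirect
   };
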
Read/Write Ordering
-------------------

Ceph has some fairly strict read / write ordering rules. Once a write has completed to the client, any read must return the new data. Prior to the write completing, a read is expected to return all old data or all new data (a mixture is not permitted). If writes A and B are issued concurrently one after another to the same object, then write A is expected to be applied before write B - ordering of the writes is expected to be preserved through the client, messenger and the OSD. If write A and read B are issued concurrently then there is scope for read B to overtake write A; there is a flag ``RWORDERED`` that can be set to prevent this overtaking from happening. There are no ordering guarantees when reads or writes are issued to different objects - these objects are almost certainly stored on different OSDs, and even if they are on the same OSD they will be processed by different threads with different locks so can easily be reordered.

There do not appear to be many uses of the ``RWORDERED`` flag: RBD and RGW do not use it; CephFS uses it in the MDS ``RecoveryQueue`` (which calls ``filer.probe``, implemented in ``osdc/Filer.cc``), and I think this is only used in some recovery scenarios.

These rules make it tricky to implement the watermark in the client and use it to decide which pool to route I/O requests to, without using something equivalent to a new epoch to advance the watermark. The problem is that if the watermark is advanced without quiescing I/O, it is possible for requests to be reordered. For example:

* Write A is issued to the old pool.
* Write B is issued to the old pool.
* Write A fails with an updated watermark and is retried to the new pool.
* Read B with ``RWORDERED`` is issued to the new pool.
* Write B fails and needs to be retried to the new pool.

In this example read B has overtaken write B. Perhaps more concerning is that the rules would also be broken if, instead of read B, we issued another write to B.

The simplest way to prevent reordering violations is to not advance the watermark while there are outstanding writes (or reads with the ``RWORDERED`` flag set) in flight. This isn't ideal as it may result in quite a number of I/Os being failed for retry before the watermark can be updated. A more sophisticated implementation stalls issuing new writes to objects with a hash between the old and new watermark while there are other writes in flight to objects with a hash between the old and new watermark.

Other Pool Migration Issues
---------------------------

Other topics that we need to think about for pool migration.

Avoiding Name Collisions
~~~~~~~~~~~~~~~~~~~~~~~~

For the first release we will require that the target pool is empty when the migration starts (by having a user interface that only starts a migration while a new pool is being created). We can also protect against objects being written to the target pool during the migration by adding client code to reject attempts to initiate requests to the target pool (the client code itself is still permitted to redirect requests from the source pool to the target pool). Because we will require a minimum client version to use pool migration, this will ensure that all clients include this extra policing. OSDs cannot themselves implement the policing, so there is no protection against a rogue client - we should probably have migration halt rather than crash if a name collision is found. Post first release, if there is a use case for merging pools, then it is theoretically possible to deal with name collisions by additionally using the pool through which the client is accessing the object to uniquify the name. This would require extra information in the request from the client to the OSDs.

Stopping Changes to the Number of PGs During Migration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During a migration we don't really want to be changing the number of PGs in the source pool. There are three reasons why:

#. We don't really want to be moving objects around in the source pool when we are about to migrate them - we are probably better off getting on with the migration than trying to fix any imbalance in the source pool.
#. Splitting/merging PGs in the source pool makes it harder to schedule the migration. Scheduling is done at two levels - we say how many source PGs are migrating at a time and then control the rate of migration within a source PG. If we split/merge the source pool this makes selecting which PGs to migrate more difficult.
#. If we block splits and merges and migrate the PGs in reverse order (starting with the highest numbered PG in the pool), then we can reduce the number of PGs in the source pool as PGs finish migrating. This helps keep the overall number of PGs more manageable.

In contrast, we don't care so much about the target pool - we can easily cope with splits/merges while the migration is in progress. From a performance perspective we do however want to avoid migrating objects to the target pool and then having splits/merges occur that copy the objects a second time. That means that normally we would want to set the number of target pool PGs to be the same as the source pool at the start of the migration. We might also want to default to disabling the auto-scaler for the target pool during the migration, as we don't want it seeing a nearly empty target pool with loads of PGs and thinking that it should reduce the number of PGs.

CLI and UI
~~~~~~~~~~

Pool migration will need a new CLI command to start the migration; there will also need to be a way of monitoring which PGs are migrating and the progress of the migration. The CLI to start a migration will need to be implemented by the MON (``OSDMonitor.cc`` already implements most of the pool CLI commands) because the migration will need to update the ``pg_pool_t`` structures in the ``OSDMap`` to record details of the migration. The new map will then be distributed to clients and OSDs so that they know that the migration has started. PGs that have been scheduled to start migration will need to determine at the end of the peering process that they don't need to recover or backfill and that they should attempt to schedule a migration (this will need new PG states ``MIGRATION_WAIT`` and ``MIGRATING``). We will need to work with the dashboard team to add support for pool migration to the dashboard and to provide a REST API for starting a migration.

We will want to block some CLIs while a pool migration is taking place:

* We don't want to be able to split/merge PGs in the source pool while it is being migrated (see above).
* We don't want the target pool to become the source of another migration (no chaining migrations).

Some of these CLIs are issued by the MGR; in this case we will probably need to change the MGR code to either cope with the failures and/or detect that the pool is migrating and avoid issuing the CLIs. We will probably need both: although checking whether the pool is migrating before issuing a CLI is more efficient, it is exposed to a race hazard where the migration may start between the check and the CLI being issued.

We need to look at how the progress of things like backfill and recovery is reported in the UI (possibly by ``HealthCheck``?) and think about how to report the progress of a pool migration. We need to think about what the right units are for reporting progress (e.g. number of objects out of total objects, number of PGs out of total PGs, or just a percentage).

Backwards Compatibility / Software Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pool migration requires code changes in the Ceph daemons (MON, OSD and possibly MGR) and in the Ceph clients that issue I/O. We can't allow a pool migration to happen while any of these are running old code, because the old code won't understand that a pool migration is happening. Old clients won't have any way of directing I/O to the correct pool, PG and OSD, and having OSDs forward all these requests to the correct OSD would be far too expensive.

Ceph daemons and clients have a set of feature bits indicating which features they support, and there are mechanisms for setting a minimum set of feature bits that are required by daemons and, separately, by clients. Once set, this prevents down-level daemons and clients from connecting to the cluster. There are also mechanisms to ensure that once a minimum level has been set it cannot be reversed. Pool migration will need to define a new feature bit and use the existing mechanisms for setting minimum required levels for daemons and clients. The new pool migration CLIs will need to fail an attempt to start a migration unless the minimum levels have been set.

End of Migration
~~~~~~~~~~~~~~~~

When a migration completes, we will have moved all objects from pool A to pool B; however, clients (e.g. RBD, CephFS, RGW, ...) will still have pool A embedded in their own data structures. We don't want to force all the clients to update their data structures to point at the new pool, so instead we will retain stub information about pool A saying that it has been migrated and that all I/O should now be submitted to pool B. Retaining a stub ``pg_pool_t`` structure in the ``OSDMap`` is cheap - there won't be thousands of pools and there isn't that much data stored for the pool. We will want to ensure that the old pool has no PGs associated with it; we can do this by reducing the number of PGs it has to 0 and letting the same code that runs when PGs are merged clean up and delete the old PGs.

We need to think about the consequences of this for the UI. While in the code we start with pool A and create and migrate objects to pool B, from the perspective of the UI we probably want to show this as a transformation of pool A and hide the existence of pool B from the user. An alternative implementation would just show the pool redirection in the UI, so users would see that an RBD image uses pool A but would then find that pool A has been migrated to pool B. This alternative implementation might be better if we plan to support merging of pools (migration to a non-empty target pool) in the future.

Walkthrough of how Pool Migration Might Work
--------------------------------------------

Initiating the Pool Migration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. The user creates a new pool, perhaps using a new flag ``--migratefrom`` to say they want to start a pool migration.
#. Starting the migration as part of pool creation means we know the pool is initially empty.
#. Unless the user specifies a number of PGs, we can ensure that the newly created pool has the same number of PGs as the source pool. There is no requirement that the number of PGs is the same; it just avoids having to perform a migration and then perform a second copy of data as the number of PGs is adjusted to cater for the eventual number of objects in the pool.
#. The CLI command sets up the ``pg_pool_t`` structures in the ``OSDMap`` to indicate that a pool migration is starting. We record that pool A is being migrated to pool B, and record which PG(s) we are going to start migrating. If we are going to migrate more than one PG at a time, we probably want to specify a set of PGs (e.g. 0,1,2,3) that are being migrated. Any PG in the set is migrating. Any PG not in the set that is higher than the lowest value in the set is assumed to have completed migration; any PG not in the set that is lower than the lowest value in the set is assumed to have not started migration.
#. We migrate PGs in reverse order - so, for example, if a pool has PGs 0-15 then we will start by migrating PG 15.
#. The MON publishes the new ``OSDMap`` as a new epoch.

Client
~~~~~~

#. Clients use the ``pg_pool_t`` structure in the ``OSDMap`` to work out that a migration is in progress.
#. From the range of PGs being migrated they can work out which PGs have been migrated, which have not started migrating and which are in the process of migrating.

   a. If an I/O is submitted to a PG that has been migrated, the object hash and the new pool are used to determine which PG and OSD to route the I/O request to.
   b. If an I/O is submitted to a PG that has not started migration, the object hash and the old pool are used to determine which PG and OSD to route the I/O request to.
   c. If an I/O is submitted to a PG that is marked as being migrated, the client checks whether it has a cached watermark for this PG. If it does, it uses this to decide whether to route the request to the old or new pool. If it has no cached watermark, it guesses and sends the I/O to the old pool.

#. If an I/O is misrouted to the wrong pool the OSD will fail the request, providing an update to the watermark. The client needs to update its cached copy of the watermark and resubmit the I/O.

OSD
~~~

#. OSDs use the ``pg_pool_t`` structure in the ``OSDMap`` to work out whether a PG needs migrating.
#. At the end of peering, if the PG needs migrating and is not performing backfill or recovery, it sets the PG state to ``MIGRATION_WAIT`` and checks with other OSDs whether they have the resources and free capacity to start the migration.
#. If everything is good the PG state changes to ``MIGRATING``, the watermark is set to 0 and the scheduler is instructed to start scheduling migration work.
#. Migration starts by scanning the next range of objects to be migrated, creating a list of object OIDs.
#. Each object is then migrated, with the watermark being updated after the object has been migrated.

   a. The primary reads the object and sends it to the primary of the target PG, which then writes the object.
   b. If the object is large this is done in stages, with the target using a temporary object name which is renamed when the last data is written.
   c. Once an object has been migrated it is deleted from the source pool.

#. Client I/O is checked by comparing the object hash with the watermark. If the hash is below the watermark the I/O is failed for retry to the new pool, providing the current watermark for the client to cache (see the sketch after this list).
#. If a PG completes a migration, it sends a message to the MON telling it that the migration has completed.

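A minimal sketch of that per-op check for a PG in ``MIGRATING`` state is shown below. It is illustrative only: the structure, function and parameter names are assumptions, and the real check would sit in the OSD's existing op processing path with its existing error handling.

.. code-block:: c++

   #include <cstdint>
   #include <optional>

   // Hypothetical reply data telling the client where to retry and giving it
   // a fresh watermark to cache.
   struct migration_redirect_t {
     int64_t target_pool;   // pool the client should resubmit the op to
     uint32_t watermark;    // current watermark for the client to cache
   };

   // Sketch of the check an OSD might make for each client op on a PG in
   // MIGRATING state. Objects with a hash below the watermark have already
   // been migrated (and deleted from the source pool), so the op is failed
   // back to the client for retry against the target pool. An object that is
   // in the middle of being migrated (head + snapshots) would first be
   // blocked until the move finishes, then redirected in the same way.
   std::optional<migration_redirect_t>
   check_migrating_op(uint32_t object_hash,
                      uint32_t watermark,
                      int64_t target_pool,
                      bool object_mid_migration)
   {
     if (object_mid_migration || object_hash < watermark)
       return migration_redirect_t{target_pool, watermark};
     return std::nullopt;  // not migrated yet: process the op in the source pool
   }
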
MON
~~~

#. When the MON gets a message from an OSD saying that a migration has completed, it updates the set in the ``pg_pool_t`` to record that the PG has finished migrating and that the next PG is starting migration. A new ``OSDMap`` is published as a new epoch.
#. Because migrations are scheduled in reverse order and objects are deleted as the migration happens, as PG migrations complete we should have empty PGs that can be deleted by simply reducing the number of PGs that the source pool has. PG migrations might not complete in the order in which they were started, so we might have a few empty PGs hanging around that cannot be deleted until another PG migration completes.
#. At the end of the migration there are no more PGs to start migrating, so the set of migrating PGs diminishes. When the set becomes empty we should also have reduced the number of PGs for the source pool to zero, and at this point the migration is complete. The MON can make final updates to the ``pg_pool_t`` state to indicate the migration has finished. The ``pg_pool_t`` structure needs to be kept so that clients know to direct all I/O requests for this pool to the new pool instead.
#. Pools can be migrated more than once; this can result in multiple stub ``pg_pool_t`` structures being kept. We do not want to have to recurse through these stubs when I/Os are submitted, so at the end of a migration the MON should attempt to reduce these redirects to a single level.

Testing and Test Tools
----------------------

The objectives of testing pool migration are:

#. Validate that all the objects in the source pool are migrated to the target pool and that their contents (data, attributes and OMAP) are retained.
#. Validate that during a migration objects can be read (data, attributes, OMAP), for objects that haven't yet been migrated, objects that have been migrated and objects in the middle of being migrated.
#. Validate that during a migration objects can be updated (create, delete, write, update attributes, update OMAPs), for objects that haven't yet been migrated, objects that have been migrated and objects in the middle of being migrated.
#. Validate pool migration under error scenarios, including resetting and failing OSDs.
#. Validate that snapshots and clones are migrated and can be used during a pool migration.
#. Validate the UI for pool migration, including restrictions placed on the UI during the migration.
#. Validate migrating multiple different pools in parallel, and migrating a single pool multiple times in series.
#. Validate pool migration with unreadable objects (excessive medium errors plus possibly other failures that defeat the redundancy of the replica/EC pool without taking it offline).
#. Validate software upgrade / compatibility for both daemons (OSD, MON, MGR) and clients.
#. Validate the performance impact during a migration.

Pool migration makes changes to client code, so all modified clients will need testing.

Existing tools such as ``ceph_test_rados`` are good for creating and exercising a set of objects and performing some consistency checking of objects. A simple script is probably better for creating a large number of objects and then validating their contents, and for being able to test attributes and OMAPs as well. If the script has two phases (create objects and validate objects) then these phases can be run at different times (before, during, after pool migration) to test different aspects. The script could use a command line tool such as ``rados`` to create and validate objects, using pseudo-random numbers to generate data patterns, attributes and OMAP data that could then be validated. The script would need to run many ``rados`` commands in parallel to generate a decent I/O workload. There may be existing scripts that can do this; it may also be possible to adapt ``ceph_test_rados`` to do this.

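As a rough illustration of the two-phase idea, here is a minimal create/validate sketch written directly against the ``librados`` C++ API instead of wrapping the ``rados`` CLI. The pool handling, object naming scheme and payload generator are arbitrary choices, and error handling, xattr/OMAP checks and parallelism are omitted.

.. code-block:: c++

   // Usage: pool_check <pool> create|validate
   // Build with something like: g++ pool_check.cc -lrados
   #include <rados/librados.hpp>
   #include <cstdio>
   #include <cstring>
   #include <string>

   // Deterministic payload for object i so the validate phase can regenerate it.
   static std::string payload(unsigned i)
   {
     std::string s;
     unsigned x = i * 2654435761u;
     for (int n = 0; n < 4096; ++n) {
       x = x * 1103515245u + 12345u;     // simple pseudo-random expansion
       s.push_back('A' + (x >> 24) % 26);
     }
     return s;
   }

   int main(int argc, char** argv)
   {
     if (argc < 3)
       return 1;
     const bool create = std::strcmp(argv[2], "create") == 0;

     librados::Rados cluster;
     cluster.init(nullptr);              // default client.admin identity
     cluster.conf_read_file(nullptr);    // default ceph.conf search path
     if (cluster.connect() < 0)
       return 1;

     librados::IoCtx ioctx;
     if (cluster.ioctx_create(argv[1], ioctx) < 0)
       return 1;

     int bad = 0;
     for (unsigned i = 0; i < 10000; ++i) {
       const std::string oid = "obj." + std::to_string(i);
       librados::bufferlist bl;
       if (create) {
         bl.append(payload(i));
         if (ioctx.write_full(oid, bl) < 0)   // attributes/OMAP could be set too
           ++bad;
       } else {
         ioctx.read(oid, bl, 4096, 0);
         if (bl.to_str() != payload(i)) {
           std::printf("mismatch on %s\n", oid.c_str());
           ++bad;
         }
       }
     }
     cluster.shutdown();
     return bad ? 2 : 0;
   }
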
Tools such as ``VDBench`` can test the data integrity of block volumes, either by creating a data set and then validating it, or by continuously creating and updating data while keeping a journal so it can be validated at any point. However, block volume tools can only test object data, not attributes or OMAPs. A tool such as ``FIO`` is best suited for performance measurements.

The I/O sequence tool ``ceph_test_rados_io_sequence`` is probably not useful for testing pool migration - it specializes in testing a very small number of objects and focuses on boundary conditions within an object (e.g. EC chunk size, stripe size) and data integrity.

The objective should be to use teuthology to perform most of the testing for pool migration (at a minimum, items 1 to 5 in the list above). It should be possible to add pool migration as an option to existing tests in the RADOS suite, extending the ``thrashOSD`` class to include the option of starting a migration.