# Dynamic Resharding for Multisite

## Requirements

* Each zone manages bucket resharding decisions independently
  - With per-bucket replication policies, some zones may replicate only a subset of objects, and so require fewer shards.
  - Avoids having to coordinate reshards across zones.
* Resharding a bucket does not require a full sync of its objects
  - Existing bilogs must be preserved and processed before new bilog shards.
* Backward compatibility
  - No zone can reshard until all peer zones upgrade to a supported release.
  - Requires a manual zonegroup change to enable resharding.
## Layout

A layout describes a set of rados objects, along with a strategy for distributing data across them. A bucket index layout distributes object names across some number of shards via `ceph_str_hash_linux()`. Resharding a bucket enacts a transition from one such layout to another. Each layout may represent its data differently: a bucket index layout is used with cls_rgw to write and delete keys, whereas a datalog layout may be used with cls_log to append and trim log entries, then later transition to a layout based on some other primitive like cls_queue or cls_fifo.
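
As a rough sketch, the name-to-shard mapping could look like the following. This uses Python's `zlib.crc32` as a stand-in for `ceph_str_hash_linux()` (an assumption for illustration; the real hash function differs, but the distribution strategy is the same):

```python
import zlib

def shard_for_object(obj_name: str, num_shards: int) -> int:
    # zlib.crc32 stands in for ceph_str_hash_linux(); the real hash
    # differs, but the strategy is identical: hash the object name,
    # then reduce modulo the shard count.
    return zlib.crc32(obj_name.encode()) % num_shards

# Resharding changes num_shards, so most names land on different shards,
# which is why a reshard must rewrite the index entries.
before = shard_for_object("photos/cat.jpg", 11)
after = shard_for_object("photos/cat.jpg", 31)
```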

## Bucket Index Resharding

To reshard a bucket, we currently create a new bucket instance with the desired sharding layout, and switch to that instance when resharding completes. In multisite, though, the metadata master zone is authoritative for all bucket metadata, including the sharding layout and reshard status. Any changes to metadata must take place on the metadata master zone and replicate from there to the other zones.

If we want to allow each zone to manage its bucket sharding independently, we can't allow each of them to create a new bucket instance, because data sync relies on the consistency of instance ids between zones. We also can't allow metadata sync to overwrite our local sharding information with the metadata master's copy.

That means that the bucket's sharding information needs to be kept private to the local zone's bucket instance, and that information also needs to track all of the reshard state that's currently spread between the old and new bucket instance metadata: the old shard layout, the new shard layout, and current reshard progress. To make this information private, we can simply prevent metadata sync from overwriting these fields.

This change also affects the rados object names of the bucket index shards, currently of the form `.dir.<instance-id>.<shard-id>`. Since we need to represent multiple sharding layouts for a single instance-id, we need to add a unique identifier to the object names. This comes in the form of a generation number, incremented with each reshard, as in `.dir.<instance-id>.<generation>.<shard-id>`. The first generation number, 0, is omitted from the object names for backward compatibility.
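
The naming scheme above can be sketched directly (a minimal illustration of the rule, not the actual RGW code):

```python
def index_shard_oid(instance_id: str, generation: int, shard_id: int) -> str:
    # Generation 0 is omitted so that pre-reshard object names are
    # unchanged, preserving backward compatibility.
    if generation == 0:
        return f".dir.{instance_id}.{shard_id}"
    return f".dir.{instance_id}.{generation}.{shard_id}"
```

So a bucket that has never resharded keeps its legacy names, while each reshard introduces a new generation segment.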

## Bucket Index Log Resharding

The bucket replication logs for multisite are stored in the same bucket index shards as the keys that they modify. However, we can't reshard these log entries like we do with normal keys, because other zones need to track their position in the logs. If we shuffle the log entries around between shards, other zones no longer have a way to associate their old shard marker positions with the new shards, and their only recourse would be to restart a full sync. So when resharding buckets, we need to preserve the old bucket index logs so that other zones can finish processing their log entries, while any new events are recorded in the new bucket index logs.

An additional goal is to move replication logs out of omap (so out of the bucket index) into separate rados objects. To enable this, the bucket instance metadata should be able to describe a bucket whose *index layout* is different from its *log layout*. For existing buckets, the two layouts would be identical and share the bucket index objects. Alternate log layouts are otherwise out of scope for this design.

To support peer zones that are still processing old logs, the local bucket instance metadata must track the history of all log layouts that haven't been fully trimmed yet. Once bilog trimming advances past an old generation, it can delete the associated rados objects and remove that layout from the bucket instance metadata. To prevent this history from growing too large, we can refuse to reshard bucket index logs until trimming catches up.
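
A toy model of this layout history and its bound on resharding might look like the following (all names are illustrative, and the deletion of rados objects is elided):

```python
from dataclasses import dataclass, field

@dataclass
class LogLayout:
    generation: int
    num_shards: int
    fully_trimmed: bool = False

@dataclass
class BucketInstanceMeta:
    # Oldest-first history of log layouts that aren't fully trimmed.
    log_layouts: list = field(default_factory=list)

    def trim_history(self) -> None:
        # Once bilog trimming passes a generation, its rados objects
        # are deleted (elided here) and the layout leaves the history.
        while self.log_layouts and self.log_layouts[0].fully_trimmed:
            self.log_layouts.pop(0)

    def can_reshard_logs(self, max_history: int = 4) -> bool:
        # Refuse to add another log generation until trimming catches up,
        # bounding the size of the history.
        return len(self.log_layouts) < max_history
```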

The distinction between *index layout* and *log layout* is important, because incremental sync only cares about changes to the *log layout*. Changes to the *index layout* would only affect full sync, which uses a custom RGWListBucket extension to list the objects of each index shard separately. But by changing the scope of full sync from per-bucket-shard to per-bucket and using a normal bucket listing to get all objects, we can make full sync independent of the *index layout*. And once the replication logs are moved out of the bucket index, dynamic resharding is free to change the *index layout* as much as it wants with no effect on multisite replication.

## Tasks

### Bucket Reshard

* Modify the existing state machine for bucket reshard to mutate its existing bucket instance instead of creating a new one.

* Add fields for the log layout. When resharding a bucket whose logs are in the index:
  - Add a new log layout generation to the bucket instance
  - Copy the bucket index entries into their new index layout
  - Commit the log generation change so new entries will be written there
  - Create a datalog entry with the new log generation
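
The four steps above could be sketched as follows. This is a toy model with illustrative names; in RGW the copy in step 2 is performed shard-by-shard via cls_rgw and is elided here:

```python
class BucketInstance:
    """Toy model of bucket instance metadata (names are illustrative)."""
    def __init__(self):
        self.latest_log_gen = 0
        self.committed_log_gen = 0
        self.log_layouts = {0: 1}      # generation -> shard count
        self.datalog = []

def reshard_inindex_logs(instance: BucketInstance, new_num_shards: int) -> None:
    # 1. Add a new log layout generation to the bucket instance.
    new_gen = instance.latest_log_gen + 1
    instance.log_layouts[new_gen] = new_num_shards
    instance.latest_log_gen = new_gen
    # 2. Copy the bucket index entries into the new index layout
    #    (elided: done via cls_rgw in the real implementation).
    # 3. Commit the log generation change so new entries are written
    #    to the new log shards.
    instance.committed_log_gen = new_gen
    # 4. Create a datalog entry carrying the new log generation so
    #    peer zones notice the transition.
    instance.datalog.append(("reshard", new_gen))
```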

### Metadata Sync

* When sync fetches a bucket instance from the master zone, preserve any private fields in the local instance. Use cls_version to guarantee that we write back the most recent version of those private fields.
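
The field-preserving merge might look like this sketch (the field names are hypothetical, and the cls_version guard on the write-back is elided):

```python
# Illustrative set of fields that stay private to the local zone.
PRIVATE_FIELDS = ("index_layouts", "log_layouts", "reshard_status")

def apply_remote_instance(local: dict, remote: dict) -> dict:
    # Merge a bucket instance fetched from the metadata master,
    # keeping the local zone's private sharding fields. In RGW the
    # write-back would also be guarded with cls_version so a newer
    # local copy of these fields is never clobbered.
    merged = dict(remote)
    for f in PRIVATE_FIELDS:
        if f in local:
            merged[f] = local[f]
    return merged
```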

### Data Sync

* Datalog entries currently include a bucket shard number. We need to add the log generation number to these entries so we can tell which sharding layout each one refers to. If we see a new generation number, that entry also implies an obligation to finish syncing all shards of prior generations.
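
A minimal sketch of the extended entry and the obligation it implies (names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatalogEntry:
    bucket: str
    shard: int
    gen: int   # new field: the log generation this entry refers to

def needs_prior_gens(entry: DatalogEntry, synced_gen: int) -> bool:
    # Seeing a newer generation obliges the consumer to finish
    # syncing all shards of the generations before it first.
    return entry.gen > synced_gen
```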

### Bucket Sync Status

* Add a per-bucket sync status object that tracks:
  - full sync progress,
  - the current generation of incremental sync, and
  - the set of shards that have completed incremental sync of that generation
* Existing per-bucket-shard sync status objects continue to track incremental sync.
  - Their object names should include the generation number, except for generation 0.
* For backward compatibility, add special handling when we get ENOENT trying to read this per-bucket sync status:
  - If the remote's oldest log layout has generation=0, read any existing per-shard sync status objects. If any are found, resume incremental sync from there.
  - Otherwise, initialize for full sync.
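
The ENOENT fallback can be sketched as follows. The two callables and the returned status dicts are hypothetical stand-ins for the rados reads and the real status encoding:

```python
def init_bucket_sync_status(read_bucket_status, read_shard_status,
                            remote_oldest_gen: int, num_shards: int) -> dict:
    # read_bucket_status() returns the per-bucket status or None (ENOENT);
    # read_shard_status(shard) does likewise for legacy per-shard objects.
    status = read_bucket_status()
    if status is not None:
        return status
    if remote_oldest_gen == 0:
        # Legacy peer: look for pre-existing per-shard status objects
        # and resume incremental sync from them if any are found.
        shards = {s: read_shard_status(s) for s in range(num_shards)}
        if any(v is not None for v in shards.values()):
            return {"state": "incremental", "gen": 0}
    # Otherwise, start from scratch with full sync.
    return {"state": "full"}
```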

### Bucket Sync

* Full sync uses a single bucket-wide listing to fetch all objects.
  - Use a cls_lock to prevent different shards from duplicating this work.
* When incremental sync gets to the end of a log shard (i.e. listing the log returns truncated=false):
  - If the remote has a newer log generation, flag that shard as 'done' in the bucket sync status.
  - Once all shards in the current generation reach that 'done' state, incremental bucket sync can advance to the next generation.
  - Use cls_version on the bucket sync status object to detect racing writes from other shards.
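
The generation-advance logic above can be sketched like so (the cls_version guard on the status write is elided, and the status encoding is illustrative):

```python
def on_log_shard_end(status: dict, shard: int,
                     remote_latest_gen: int, num_shards: int) -> None:
    # Called when listing a bilog shard returns truncated=false.
    # status holds the current generation and the set of shards that
    # have finished it.
    if remote_latest_gen > status["gen"]:
        status["done"].add(shard)
        if len(status["done"]) == num_shards:
            # The whole generation is finished; advance incremental
            # sync to the next log generation.
            status["gen"] += 1
            status["done"] = set()
```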

### Bucket Sync Disable/Enable

Reframe in terms of log generations, instead of handling SYNCSTOP events with a special Stopped state:

* radosgw-admin bucket sync enable: create a new log generation in the bucket instance metadata
  - Detect races with reshard: fail if a reshard is in progress, and write with cls_version to detect a race with the start of a reshard.
  - If the current log generation is shared with the bucket index layout (BucketLogType::InIndex), the new log generation will point at the same index layout/generation. So the log generation increments, but the index objects keep the same generation.
* SYNCSTOP in incremental sync: flag the shard as 'done' and ignore datalog events on that bucket until we see a new generation
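
The sync-enable path might be sketched as follows (field names are illustrative, and the cls_version race check against a concurrent reshard is elided):

```python
def sync_enable(instance: dict) -> None:
    # Sketch of `radosgw-admin bucket sync enable` adding a new log
    # generation to the bucket instance metadata.
    if instance["reshard_in_progress"]:
        raise RuntimeError("bucket reshard in progress")
    cur = instance["log_layouts"][-1]
    new = {"gen": cur["gen"] + 1, "type": cur["type"]}
    if cur["type"] == "InIndex":
        # The log generation increments, but the index objects keep
        # the same generation.
        new["index_gen"] = cur["index_gen"]
    instance["log_layouts"].append(new)
```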

### Log Trimming

* Use the generation number from the sync status to trim the right logs.
* Once all shards of a log generation are trimmed:
  - Remove their rados objects.
  - Remove the associated incremental sync status objects.
  - Remove the log generation from its bucket instance metadata.
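
The retirement of a fully-trimmed generation can be sketched as (the two callables are hypothetical stand-ins for the rados deletions):

```python
def retire_log_generation(instance: dict, gen: int,
                          remove_log_obj, remove_status_obj) -> None:
    # remove_log_obj(gen, shard) deletes a log shard's rados object;
    # remove_status_obj(gen, shard) deletes its incremental sync status.
    num_shards = instance["log_layouts"][gen]
    for shard in range(num_shards):
        remove_log_obj(gen, shard)
        remove_status_obj(gen, shard)
    # Finally, drop the generation from the bucket instance metadata.
    del instance["log_layouts"][gen]
```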

### Admin APIs

* RGWOp_BILog_List response should include the bucket's highest log generation
  - Allows incremental sync to determine whether truncated=false means that it's caught up, or that it needs to transition to the next generation.
* RGWOp_BILog_Info response should include the bucket's lowest and highest log generations
  - Allows bucket sync status initialization to decide whether it needs to scan for existing shard status, and where it should resume incremental sync after full sync completes.
* RGWOp_BILog_Status response should include per-bucket status information
  - For log trimming of old generations.