=======================
Partial Object Recovery
=======================

Partial Object Recovery improves the efficiency of log-based recovery (as
opposed to backfill). The original log-based recovery calculates missing_set
from pg_log differences.

With the original approach, the whole object must be recovered from one OSD
to another if pg_log indicates that the object was modified, regardless of
how much of its content actually changed. That means a 4M object in which
only 4k was modified still has the whole 4M recovered rather than just the
modified 4k. In addition, the object map is also recovered even if it was
not modified at all.

Partial Object Recovery is designed to solve the problem described above.
To achieve this, two things must be logged:

1. where the object is modified
2. whether the object_map of the object is modified

The class ObjectCleanRegion is introduced for this purpose.
clean_offsets is an interval_set<uint64_t>
that records the unmodified content of an object.
clean_omap is a bool that records whether object_map is still clean (unmodified).
new_object indicates that the object does not yet exist on the OSD.
max_num_intervals is an upper bound on the number of intervals in clean_offsets,
so that the memory cost of clean_offsets is always bounded.

If the number of intervals in clean_offsets exceeds this bound,
the shortest clean interval is trimmed.

    e.g. max_num_intervals=2, clean_offsets: {[5~10], [20~5]}

    then a new interval [30~10] evicts the shortest one, [20~5]

    finally, clean_offsets becomes {[5~10], [30~10]}
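
The trimming rule above can be pictured with a small model. This is only a sketch, not Ceph's ObjectCleanRegion code: the function name ``add_clean_interval`` is hypothetical, intervals are plain ``(offset, length)`` tuples mirroring the ``offset~length`` notation, and the intervals are assumed not to overlap.

```python
# Hypothetical model of clean_offsets trimming; not Ceph's implementation.
def add_clean_interval(clean_offsets, new_interval, max_num_intervals):
    """Add an (offset, length) interval, then evict the shortest intervals
    until at most max_num_intervals remain."""
    intervals = sorted(clean_offsets + [new_interval])
    while len(intervals) > max_num_intervals:
        shortest = min(intervals, key=lambda iv: iv[1])  # smallest length
        intervals.remove(shortest)
    return intervals

clean_offsets = [(5, 10), (20, 5)]
result = add_clean_interval(clean_offsets, (30, 10), max_num_intervals=2)
print(result)  # [(5, 10), (30, 10)]
```

Dropping a clean interval in this model only loses precision: the dropped range is treated as modified, so it would be recovered rather than skipped.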

Procedures for Partial Object Recovery
======================================

First, OpContext and pg_log_entry_t must each contain an ObjectCleanRegion.
In do_osd_ops(), finish_copyfrom(), and finish_promote(), the corresponding
content in ObjectCleanRegion is marked dirty in order to track modifications
to the object. The ObjectCleanRegion in OpContext is then copied to its
pg_log_entry_t.

Second, pg_missing_set must be built and rebuilt correctly.
When pg_missing_set is calculated during peering,
the ObjectCleanRegion of each pg_log_entry_t is merged as well.

    e.g. object aa has pg_log:

    26'101 {[0~4096, 8192~MAX], false}

    26'104 {[0~8192, 12288~MAX], false}

    28'108 {[0~12288, 16384~MAX], true}

    missing_set for object aa: merging the pg_log entries above gives {[0~4096, 16384~MAX], true},
    which means that the range from offset 4096 up to offset 16384 is modified and that object_map is also modified, as of version 28'108
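
The merge in this example amounts to intersecting the clean ranges of every entry and OR-ing the flag that records an object_map modification. The sketch below is a model under stated assumptions, not Ceph's code: ``intersect`` and ``merge_regions`` are hypothetical names, ranges are written as half-open ``(start, end)`` pairs rather than ``offset~length``, and ``MAX`` is an arbitrary stand-in for the open end.

```python
# Hypothetical model of merging clean regions; not Ceph's implementation.
MAX = 2**64 - 1  # assumed stand-in for the "MAX" marker in the example

def intersect(a, b):
    """Intersect two lists of half-open (start, end) byte ranges."""
    pairs = ((max(s1, s2), min(e1, e2)) for s1, e1 in a for s2, e2 in b)
    return sorted((s, e) for s, e in pairs if s < e)

def merge_regions(entries):
    """Merge (clean_ranges, omap_modified) pairs from several log entries."""
    clean, omap_modified = [(0, MAX)], False
    for ranges, modified in entries:
        clean = intersect(clean, ranges)           # clean only if clean everywhere
        omap_modified = omap_modified or modified  # modified if any entry says so
    return clean, omap_modified

log_aa = [
    ([(0, 4096), (8192, MAX)], False),    # 26'101
    ([(0, 8192), (12288, MAX)], False),   # 26'104
    ([(0, 12288), (16384, MAX)], True),   # 28'108
]
merged = merge_regions(log_aa)
print(merged)
```

With these inputs the clean ranges reduce to ``(0, 4096)`` and ``(16384, MAX)`` and the flag becomes true, which matches the missing_set shown in the example.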

In addition, the OSD may crash after merging the log.
In that case, we need to read_log and rebuild pg_missing_set. For example, suppose pg_log is:

    object aa: 26'101 {[0~4096, 8192~MAX], false}

    object bb: 26'102 {[0~4096, 8192~MAX], false}

    object cc: 26'103 {[0~4096, 8192~MAX], false}

    object aa: 26'104 {[0~8192, 12288~MAX], false}

    object dd: 26'105 {[0~4096, 8192~MAX], false}

    object aa: 28'108 {[0~12288, 16384~MAX], true}

Suppose bb, cc, and dd have been recovered, but aa has not.
We therefore need to rebuild pg_missing_set for object aa,
and we find that aa was last modified in version 28'108.
Originally, if the version in object_info is 26'96 < 28'108,
we do not need to consider 26'104 and 26'101, because the whole object will be recovered.
Partial Object Recovery, however, also requires us to rebuild ObjectCleanRegion:
knowing whether the object is modified is not enough.

Therefore, we also need to traverse the earlier pg_log entries.
Here, 26'104 and 26'101 are also newer than object_info (26'96),
so pg_missing_set for object aa is rebuilt from all three log entries: 28'108, 26'104, and 26'101.
The logs are merged in the same way as described above.
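
The rebuild described above can be modelled as a filter followed by a merge: keep only the log entries for the object that are newer than the version in object_info, then combine their clean regions. This is a sketch under assumptions, not Ceph's read_log code: ``rebuild_clean_region`` is a hypothetical name, versions are modelled as ``(epoch, version)`` tuples so that ``26'96`` becomes ``(26, 96)``, and ranges are half-open ``(start, end)`` pairs.

```python
# Hypothetical model of rebuilding the clean region from pg_log; not Ceph's code.
MAX = 2**64 - 1  # assumed stand-in for the "MAX" marker in the example

def intersect(a, b):
    """Intersect two lists of half-open (start, end) byte ranges."""
    pairs = ((max(s1, s2), min(e1, e2)) for s1, e1 in a for s2, e2 in b)
    return sorted((s, e) for s, e in pairs if s < e)

def rebuild_clean_region(pg_log, oid, have_version):
    """Merge every log entry for oid that is newer than have_version."""
    clean, omap_modified = [(0, MAX)], False
    for entry_oid, version, ranges, modified in pg_log:
        if entry_oid != oid or version <= have_version:
            continue  # other object, or already reflected in the local copy
        clean = intersect(clean, ranges)
        omap_modified = omap_modified or modified
    return clean, omap_modified

pg_log = [
    ("aa", (26, 101), [(0, 4096), (8192, MAX)], False),
    ("bb", (26, 102), [(0, 4096), (8192, MAX)], False),
    ("cc", (26, 103), [(0, 4096), (8192, MAX)], False),
    ("aa", (26, 104), [(0, 8192), (12288, MAX)], False),
    ("dd", (26, 105), [(0, 4096), (8192, MAX)], False),
    ("aa", (28, 108), [(0, 12288), (16384, MAX)], True),
]
region = rebuild_clean_region(pg_log, "aa", have_version=(26, 96))
print(region)
```

If the local copy were already at 26'104 in this model, only the 28'108 entry would be merged and the clean region would be correspondingly larger.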

Finally, the push and pull process is completed based on pg_missing_set.
copy_subset in recovery_info is updated from the ObjectCleanRegion in pg_missing_set.
copy_subset indicates the intervals of content that need to be pulled and pushed.
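
One way to picture the relationship between the clean region and copy_subset is as a complement: whatever part of the object is not clean has to be transferred. The sketch below assumes exactly that and nothing more. ``dirty_ranges`` is a hypothetical helper, ranges are half-open ``(start, end)`` pairs, and the 4M object size is taken from the earlier example; how Ceph actually derives copy_subset is not shown here.

```python
# Hypothetical model: ranges to copy = object extent minus clean ranges.
def dirty_ranges(clean, size):
    """Return the half-open ranges of [0, size) not covered by clean ranges."""
    out, pos = [], 0
    for start, end in sorted(clean):
        if start > pos and pos < size:
            out.append((pos, min(start, size)))  # gap before this clean range
        pos = max(pos, end)
        if pos >= size:
            break
    if pos < size:
        out.append((pos, size))  # tail after the last clean range
    return out

MAX = 2**64 - 1  # assumed stand-in for the "MAX" marker
copy_subset = dirty_ranges([(0, 4096), (16384, MAX)], size=4 * 1024 * 1024)
print(copy_subset)  # [(4096, 16384)]
```

For the merged region of object aa, only 12288 bytes of the 4M object would need to be pushed in this model.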

The complicated part here is submit_push_data,
where several cases must be considered separately.
What we need to consider is how to deal with the object data,
which is made up of omap_header, xattrs, omap, and data:

case 1: first && complete: since recovery of the object finishes in a single PushOp,
we would like to preserve the original object and overwrite it directly.
The object is not removed, and no new object is created.

    issue 1: As the object is not removed, the old xattrs remain in the old object
    but may have been updated in the new object. Overwriting the same key or adding new keys is correct,
    but keys that were removed would wrongly survive.
    To solve this issue, we need to remove all of the original xattrs from the object and then set the new xattrs.

    issue 2: As the object is not removed,
    object_map may be recovered depending on clean_omap.
    Therefore, if the omap needs to be recovered, we need to remove the old omap of the object for the same reason,
    since an omap update may also be a deletion.

    Thus, in this case, we should do:

    1) clear xattrs of the object
    2) clear omap of the object if omap recovery is needed
    3) truncate the object to recovery_info.size
    4) recover omap_header
    5) recover xattrs, and recover omap if needed
    6) punch zeros in the original object where fiemap reports no data
    7) overwrite the object content that was modified
    8) finish recovery
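
Issue 1 is easy to reproduce with plain dictionaries standing in for the xattrs of an object. This is only an illustration: the key names and both helper functions are made up, and no ObjectStore call is modelled.

```python
# Illustration only: dicts stand in for xattrs; key names are hypothetical.
def overwrite_only(old_xattrs, new_xattrs):
    """Apply the new xattrs on top of the old ones without clearing first."""
    merged = dict(old_xattrs)
    merged.update(new_xattrs)
    return merged

def clear_then_set(old_xattrs, new_xattrs):
    """Drop every old xattr, then set the new ones (steps 1 and 5 above)."""
    return dict(new_xattrs)

old_xattrs = {"k1": b"v1", "k2": b"v2", "k3": b"stale"}
new_xattrs = {"k1": b"v1-updated", "k2": b"v2"}  # k3 was removed at the source

print(sorted(overwrite_only(old_xattrs, new_xattrs)))  # ['k1', 'k2', 'k3']
print(sorted(clear_then_set(old_xattrs, new_xattrs)))  # ['k1', 'k2']
```

The same reasoning applies to omap keys in issue 2, which is why the old omap is cleared whenever the omap is recovered.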

case 2: first && !complete: recovery of the object takes multiple PushOps.
Here, target_oid refers to a new temp object in pgid_TEMP,
so the issues are a bit different.

    issue 1: As the object is newly created, there is no need to deal with xattrs.

    issue 2: As the object is newly created,
    object_map may not be transmitted, depending on clean_omap.
    Therefore, if clean_omap is true, we need to clone object_map from the original object.

    issue 3: As the object is newly created, unmodified data will not be transmitted.
    Therefore, we need to clone the unmodified data from the original object.

    Thus, in this case, we should do:

    1) remove the temp object
    2) create a new temp object
    3) set alloc_hint for the new temp object
    4) truncate the new temp object to recovery_info.size
    5) recover omap_header
    6) clone object_map from the original object if omap is clean
    7) clone unmodified object_data from the original object
    8) punch zeros in the new temp object
    9) recover xattrs, and recover omap if needed
    10) overwrite the object content that was modified
    11) remove the original object
    12) move and rename the new temp object to replace the original object
    13) finish recovery
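
The two step lists can be put side by side as a small planner that shows where the ``complete`` and omap-clean conditions change the sequence. This is a reading aid, not submit_push_data itself: ``plan_push_steps`` is a hypothetical name, the steps are plain strings copied from the lists above, and the sketch does not model which PushOp each step of case 2 runs in.

```python
# Hypothetical planner mirroring the documented step lists; not Ceph's code.
def plan_push_steps(complete, omap_clean):
    """Return the documented steps for an object whose first PushOp arrives."""
    if complete:  # case 1: overwrite the original object in place
        steps = ["clear xattrs"]
        if not omap_clean:
            steps.append("clear omap")
        steps += ["truncate to recovery_info.size", "recover omap_header",
                  "recover xattrs"]
        if not omap_clean:
            steps.append("recover omap")
        steps += ["punch zeros where fiemap reports no data",
                  "overwrite modified content", "finish recovery"]
        return steps
    # case 2: build a temp object, then swap it in for the original
    steps = ["remove temp object", "create temp object", "set alloc_hint",
             "truncate to recovery_info.size", "recover omap_header"]
    if omap_clean:
        steps.append("clone object_map from original")
    steps += ["clone unmodified data from original", "punch zeros",
              "recover xattrs"]
    if not omap_clean:
        steps.append("recover omap")
    steps += ["overwrite modified content", "remove original object",
              "rename temp object over original", "finish recovery"]
    return steps

for step in plan_push_steps(complete=False, omap_clean=True):
    print(step)
```

In both cases the omap is either reused (kept in place or cloned) or recovered, never both, which is the choice that clean_omap records.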