Commit | Line | Data |
---|---|---|
9f95a23c TL |
1 | ======================= |
2 | Partial Object Recovery | |
3 | ======================= | |
4 | ||
f67539c2 TL |
5 | Partial Object Recovery improves the efficiency of log-based recovery (vs |
6 | backfill). Original log-based recovery calculates missing_set based on pg_log | |
7 | differences. | |
9f95a23c TL |
8 | |
9 | The whole object is recovered from one OSD to another | |
10 | if pg_log indicates that the object was modified, regardless of how much | |
11 | of its content actually changed. That means a 4M object | |
12 | with only 4k modified has the whole 4M recovered | |
13 | rather than just the modified 4k. In addition, the object map is | |
14 | also recovered even if it was not modified at all. | |
15 | ||
16 | Partial Object Recovery is designed to solve the problem mentioned above. | |
17 | To achieve this, two things must be logged: | |
18 | ||
19 | 1. where the object is modified | |
20 | 2. whether the object_map of the object is modified | |
21 | ||
22 | The class ObjectCleanRegion is introduced for this purpose. | |
23 | clean_offsets is an interval_set<uint64_t> | |
24 | that records the unmodified content of an object. | |
25 | clean_omap is a bool indicating whether object_map is modified. | |
26 | new_object means that the object does not exist on the OSD. | |
27 | max_num_intervals is an upper bound on the number of intervals in clean_offsets | |
28 | so that the memory cost of clean_offsets is always bounded. | |
29 | ||
30 | The shortest clean interval is trimmed if the number of intervals | |
31 | in clean_offsets exceeds this bound. | |
32 | ||
33 | e.g. max_num_intervals=2, clean_offsets: {[5~10], [20~5]} | |
34 | ||
35 | then a new interval [30~10] evicts the shortest one, [20~5] | |
36 | ||
37 | finally, clean_offsets becomes {[5~10], [30~10]} | |
38 | ||
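The eviction rule above can be sketched in a few lines of Python. This is only an illustration: `CleanRegionSketch`, `add_clean`, and `trim` are hypothetical names, intervals are stored as plain `(offset, length)` tuples rather than an `interval_set<uint64_t>`, and overlapping intervals are not merged. Dropping a clean interval is conservative, since that range is then simply treated as modified.

```python
from dataclasses import dataclass, field


@dataclass
class CleanRegionSketch:
    """Hypothetical stand-in for ObjectCleanRegion; not the Ceph class."""
    max_num_intervals: int
    # Each interval is (offset, length), mirroring the offset~length notation.
    clean_offsets: list = field(default_factory=list)

    def add_clean(self, offset, length):
        """Record a clean interval, then enforce the bound."""
        self.clean_offsets.append((offset, length))
        self.clean_offsets.sort()
        self.trim()

    def trim(self):
        """Drop the shortest intervals until the bound holds."""
        while len(self.clean_offsets) > self.max_num_intervals:
            shortest = min(self.clean_offsets, key=lambda iv: iv[1])
            self.clean_offsets.remove(shortest)


region = CleanRegionSketch(max_num_intervals=2,
                           clean_offsets=[(5, 10), (20, 5)])
region.add_clean(30, 10)
print(region.clean_offsets)  # [(5, 10), (30, 10)]
```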
39 | Procedures for Partial Object Recovery | |
40 | ====================================== | |
41 | ||
42 | Firstly, OpContext and pg_log_entry_t should each contain an ObjectCleanRegion. | |
43 | In do_osd_ops(), finish_copyfrom(), and finish_promote(), the corresponding content | |
44 | in ObjectCleanRegion should be marked dirty in order to trace the modification of an object. | |
45 | The ObjectCleanRegion in OpContext is then copied to its pg_log_entry_t. | |
46 | ||
47 | Secondly, pg_missing_set must be built and rebuilt correctly. | |
48 | When calculating pg_missing_set during the peering process, | |
49 | also merge the ObjectCleanRegion in each pg_log_entry_t. | |
50 | ||
51 | e.g. object aa has pg_log: | |
52 | 26'101 {[0~4096, 8192~MAX], false} | |
53 | ||
54 | 26'104 {[0~8192, 12288~MAX], false} | |
55 | ||
56 | 28'108 {[0~12288, 16384~MAX], true} | |
57 | ||
58 | missing_set for object aa: merge pg_log above --> {[0~4096, 16384~MAX], true}, | |
59 | which means the range from 4096 to 16384 is modified and object_map is also modified as of version 28'108. | |
60 | ||
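As a rough sketch of the merge in this example: the merged clean region is the intersection of the clean intervals of every log entry, and the omap flag is the logical OR of the per-entry flags (following the example, where `true` marks a modified object_map). The helper names `intersect` and `merge_entries` are hypothetical, intervals are written as half-open `(start, end)` pairs instead of offset~length, and `MAX` is a stand-in for the end marker used above.

```python
MAX = 2**64 - 1  # stand-in for the "MAX" end marker used in the example


def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def merge_entries(entries):
    """Merge (clean_intervals, omap_modified) pairs from several log entries."""
    clean = [(0, MAX)]       # start from "everything is clean"
    omap_modified = False
    for entry_clean, entry_omap in entries:
        clean = intersect(clean, entry_clean)
        omap_modified = omap_modified or entry_omap
    return clean, omap_modified


log_aa = [
    ([(0, 4096), (8192, MAX)], False),    # 26'101
    ([(0, 8192), (12288, MAX)], False),   # 26'104
    ([(0, 12288), (16384, MAX)], True),   # 28'108
]
merged_clean, merged_omap = merge_entries(log_aa)
print(merged_clean, merged_omap)
```

With the three entries above, the merged clean intervals are `[0, 4096)` and `[16384, MAX)`, matching the missing_set shown for object aa.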
61 | Also, the OSD may crash after merging the log. | |
62 | Therefore, we need to run read_log and rebuild pg_missing_set. For example, pg_log is: | |
63 | ||
64 | object aa: 26'101 {[0~4096, 8192~MAX], false} | |
65 | ||
66 | object bb: 26'102 {[0~4096, 8192~MAX], false} | |
67 | ||
68 | object cc: 26'103 {[0~4096, 8192~MAX], false} | |
69 | ||
70 | object aa: 26'104 {[0~8192, 12288~MAX], false} | |
71 | ||
72 | object dd: 26'105 {[0~4096, 8192~MAX], false} | |
73 | ||
74 | object aa: 28'108 {[0~12288, 16384~MAX], true} | |
75 | ||
f67539c2 | 76 | Suppose bb, cc, and dd have been recovered, but aa has not.
9f95a23c TL |
77 | Then we need to rebuild pg_missing_set for object aa,
78 | and we find that aa was modified at version 28'108. | |
79 | In the original recovery, if the version in object_info is 26'96 < 28'108, | |
80 | we don't need to consider 26'104 and 26'101 because the whole object will be recovered. | |
81 | However, Partial Object Recovery also requires us to rebuild ObjectCleanRegion. | |
82 | ||
83 | Knowing whether the object is modified is not enough. | |
84 | ||
85 | Therefore, we also need to traverse the earlier pg_log entries, | |
86 | that is, 26'104 and 26'101, which are also > object_info (26'96), | |
87 | and rebuild pg_missing_set for object aa based on those three logs: 28'108, 26'104, 26'101. | |
88 | The logs are merged in the same way as described above. | |
89 | ||
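The selection step of the rebuild can be illustrated as follows. This is a sketch under assumptions: versions are modelled as `(epoch, version)` tuples compared lexicographically, the log is a plain list, and `entries_to_merge` is a hypothetical helper rather than a Ceph function. Every entry for the object that is newer than the version recorded in object_info is kept, so that their clean regions can then be merged as described earlier.

```python
def entries_to_merge(pg_log, oid, have_version):
    """Return versions of all log entries for oid newer than have_version."""
    return [version for (entry_oid, version) in pg_log
            if entry_oid == oid and version > have_version]


# The example log, with 26'101 written as (26, 101).
pg_log = [
    ("aa", (26, 101)),
    ("bb", (26, 102)),
    ("cc", (26, 103)),
    ("aa", (26, 104)),
    ("dd", (26, 105)),
    ("aa", (28, 108)),
]

# object_info for aa is at 26'96, so all three aa entries must be merged.
needed = entries_to_merge(pg_log, "aa", (26, 96))
print(needed)
```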
90 | Finally, finish the push and pull process based on pg_missing_set. | |
91 | Update copy_subset in recovery_info based on the ObjectCleanRegion in pg_missing_set. | |
92 | copy_subset indicates the intervals of content that need to be pulled and pushed. | |
93 | ||
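One way to picture the relationship between the clean region and copy_subset is as a complement: the ranges to transfer are the parts of the object, from offset 0 up to its size, that are not covered by a clean interval. The sketch below assumes half-open `(start, end)` intervals and a hypothetical helper `dirty_subset`; the real computation in the OSD may involve further details.

```python
def dirty_subset(clean, size):
    """Return the half-open ranges of [0, size) not covered by clean."""
    subset = []
    cursor = 0  # everything before cursor has already been classified
    for start, end in sorted(clean):
        if start >= size:
            break
        if start > cursor:
            subset.append((cursor, start))
        cursor = max(cursor, min(end, size))
    if cursor < size:
        subset.append((cursor, size))
    return subset


MAX = 2**64 - 1
size = 4 * 1024 * 1024  # a 4M object
copy_subset = dirty_subset([(0, 4096), (16384, MAX)], size)
print(copy_subset)  # [(4096, 16384)]
```

For the merged missing_set of object aa, only 12k of the 4M object would need to be transferred in this model.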
94 | The complicated part here is submit_push_data, | |
95 | where several cases should be considered separately. | |
96 | What we need to consider is how to deal with the object data, | |
97 | which is made up of omap_header, xattrs, omap, and data: | |
98 | ||
99 | case 1: first && complete: since object recovery finishes in a single PushOp, | |
100 | we would like to preserve the original object and overwrite it directly. | |
101 | The object will not be removed and recreated. | |
102 | ||
103 | issue 1: As the object is not removed, old xattrs remain in the old object | |
104 | but may have been updated in the new object. Overwriting the same key or adding new keys is handled correctly, | |
105 | but removed keys would wrongly survive. | |
106 | To solve this issue, we need to remove all original xattrs from the object and then set the new xattrs. | |
107 | ||
108 | issue 2: As the object is not removed, | |
109 | object_map may be recovered depending on clean_omap. | |
110 | Therefore, if the omap is being recovered, we need to remove the old omap of the object for the same reason, | |
111 | since an omap update may also be a deletion. | |
112 | Thus, in this case, we should do: | |
113 | ||
114 | 1) clear xattrs of the object | |
115 | 2) clear omap of the object if omap recovery is needed | |
116 | 3) truncate the object to recovery_info.size | |
117 | 4) recover omap_header | |
118 | 5) recover xattrs, and recover omap if needed | |
119 | 6) punch zeros in the original object where fiemap reports no data | |
120 | 7) overwrite object content which is modified | |
121 | 8) finish recovery | |
122 | ||
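The in-place handling of case 1 can be modelled with ordinary Python containers. In this sketch an object is a dict with `xattrs`, `omap`, and `data` fields, and a push carries the new xattrs, an optional omap (`None` when omap recovery is not needed), the target size, and the modified extents. All of these names are hypothetical, and the omap_header and zero-punching steps are left out for brevity.

```python
def apply_single_push(obj, push):
    """Overwrite obj in place from a single, complete push (case 1 model)."""
    # Clear the old xattrs before setting the new ones, so that keys
    # deleted on the source do not survive on the target.
    obj["xattrs"] = dict(push["xattrs"])
    # Replace the omap only when omap recovery is needed.
    if push["omap"] is not None:
        obj["omap"] = dict(push["omap"])
    # Truncate (or zero-extend) the data to the target size.
    data = bytearray(obj["data"][:push["size"]])
    data.extend(bytes(push["size"] - len(data)))
    # Overwrite only the modified extents; clean data stays untouched.
    for offset, chunk in push["extents"]:
        data[offset:offset + len(chunk)] = chunk
    obj["data"] = bytes(data)
    return obj


stale = {"xattrs": {"a": b"1", "removed": b"x"},
         "omap": {"k": b"v"},
         "data": b"AAAAAAAA"}
push = {"xattrs": {"a": b"2"}, "omap": None, "size": 8,
        "extents": [(2, b"BB")]}
recovered = apply_single_push(stale, push)
print(recovered)
```

Using a plain `dict.update` for the xattrs instead of replacing them would keep the `removed` key, which is the failure described in issue 1.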
123 | case 2: first && !complete: object recovery is done over multiple PushOps. | |
124 | Here, target_oid will indicate a new temp_object in pgid_TEMP, | |
125 | so the issues are a bit different. | |
126 | ||
127 | issue 1: As the object is newly created, there is no need to deal with xattrs. | |
128 | ||
129 | issue 2: As the object is newly created, | |
130 | object_map may not be transmitted, depending on clean_omap. | |
131 | Therefore, if clean_omap is true, we need to clone object_map from the original object. | |
132 | issue 3: As the object is newly created, unmodified data will not be transmitted. | |
133 | Therefore, we need to clone unmodified data from the original object. | |
134 | Thus, in this case, we should do: | |
135 | ||
136 | 1) remove the temp object | |
137 | 2) create a new temp object | |
138 | 3) set alloc_hint for the new temp object | |
139 | 4) truncate new temp object to recovery_info.size | |
140 | 5) recover omap_header | |
141 | 6) clone object_map from original object if omap is clean | |
142 | 7) clone unmodified object_data from original object | |
143 | 8) punch zeros for the new temp object | |
144 | 9) recover xattrs, and recover omap if needed | |
145 | 10) overwrite object content which is modified | |
146 | 11) remove the original object | |
147 | 12) move and rename the new temp object to replace the original object | |
148 | 13) finish recovery |
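
The temp-object flow of case 2 can be modelled in the same spirit. The sketch below assumes an object is a dict with `xattrs`, `omap`, and `data` fields and that clean ranges are half-open `(start, end)` pairs; `start_temp_object`, `apply_chunk`, and `finish` are hypothetical helpers, and alloc hints, omap_header, and zero punching are omitted. It shows why the clean omap and clean data have to be cloned from the original object: they are never sent over the wire.

```python
def start_temp_object(original, size, clean, omap_clean, xattrs):
    """Create the temp object and clone everything that will not be pushed."""
    temp = {"xattrs": dict(xattrs), "omap": {}, "data": bytearray(size)}
    if omap_clean:
        temp["omap"] = dict(original["omap"])
    for start, end in clean:
        end = min(end, size, len(original["data"]))
        if start < end:
            temp["data"][start:end] = original["data"][start:end]
    return temp


def apply_chunk(temp, offset, chunk):
    """Write one pushed extent of modified data into the temp object."""
    temp["data"][offset:offset + len(chunk)] = chunk


def finish(temp):
    """Return the object that replaces the original once all chunks arrived."""
    return {"xattrs": temp["xattrs"], "omap": temp["omap"],
            "data": bytes(temp["data"])}


original = {"xattrs": {"old": b"1"}, "omap": {"k": b"v"}, "data": b"AAAAAAAA"}
temp = start_temp_object(original, size=8, clean=[(0, 2), (6, 2**64 - 1)],
                         omap_clean=True, xattrs={"new": b"2"})
apply_chunk(temp, 2, b"BB")  # first PushOp
apply_chunk(temp, 4, b"CC")  # second PushOp
replacement = finish(temp)
print(replacement)
```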