===========================
FULL OSDMAP VERSION PRUNING
===========================

For each incremental osdmap epoch, the monitor will keep a full osdmap
epoch in the store.

While this is great when serving osdmap requests from clients, allowing
us to fulfill their requests without having to recompute the full osdmap
from a myriad of incrementals, it can also become a burden once we start
keeping an unbounded number of osdmaps.

The monitors will attempt to keep a bounded number of osdmaps in the store.
This number is defined (and configurable) via ``mon_min_osdmap_epochs``, and
defaults to 500 epochs. Generally speaking, we will remove older osdmap
epochs once we go over this limit.

However, there are a few constraints to removing osdmaps. These are all
defined in ``OSDMonitor::get_trim_to()``.

In the event one of these conditions is not met, we may go over the bounds
defined by ``mon_min_osdmap_epochs``. If the cluster does not meet the
trim criteria for some time (e.g., unclean pgs), the monitor may start
keeping a lot of osdmaps. This can start putting pressure on the underlying
key/value store, as well as on the available disk space.

One way to mitigate this problem would be to stop keeping full osdmap
epochs on disk. We would have to rebuild osdmaps on-demand, or grab them
from cache if they had been recently served. We would still have to keep
at least one osdmap, and apply all incrementals on top of either this
oldest map epoch kept in the store or a more recent map grabbed from cache.
While this would be feasible, it seems like a lot of CPU (and potentially
IO) would be going into rebuilding osdmaps.

Additionally, while this would prevent the problem from growing further, it
would do nothing for stores that are already in a state that would truly
benefit from not keeping so many full osdmaps.

This brings us to full osdmap pruning.

Instead of not keeping full osdmap epochs, we are going to prune some of
them when we have too many.

Deciding whether we have too many will be dictated by a configurable option
``mon_osdmap_full_prune_min`` (default: 10000). The pruning algorithm will be
engaged once we go over this threshold.

We will not remove all ``mon_osdmap_full_prune_min`` full osdmap epochs
though. Instead, we are going to poke some holes in the sequence of full
maps. By default, we will keep one full osdmap per 10 maps since the last
map kept; i.e., if we keep epoch 1, we will also keep epoch 10 and remove
full map epochs 2 to 9. The size of this interval is configurable with
``mon_osdmap_full_prune_interval``.

Essentially, we are proposing to keep ~10% of the full maps, but we will
always honour the minimum number of osdmap epochs, as defined by
``mon_min_osdmap_epochs``, and these won't be used for the count of the
minimum versions to prune. For instance, if we have on-disk versions
[1..50000], we would allow the pruning algorithm to operate only over
osdmap epochs [1..49500); but, if we have on-disk versions [1..10200], we
won't be pruning because the algorithm would only operate on versions
[1..9700), and this interval contains fewer versions than the minimum
required by ``mon_osdmap_full_prune_min``.

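To make these numbers concrete, here is a minimal sketch (in Python, not the
monitor's actual code) of the threshold check just described, using the
default values mentioned in this document::

  MON_MIN_OSDMAP_EPOCHS = 500        # trailing epochs we never consider
  MON_OSDMAP_FULL_PRUNE_MIN = 10000  # minimum prunable epochs to engage

  def should_prune(first, last):
      """Check whether the prunable range [first, prune_to) is large enough."""
      prune_to = last - MON_MIN_OSDMAP_EPOCHS
      return prune_to - first >= MON_OSDMAP_FULL_PRUNE_MIN

  print(should_prune(1, 50000))  # True:  [1..49500) holds 49499 versions
  print(should_prune(1, 10200))  # False: [1..9700) holds only 9699 versions
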
ALGORITHM
=========

Say we have 50,000 osdmap epochs in the store, and we're using the
defaults for all configurable options.

::

  -----------------------------------------------------------
  |1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
  -----------------------------------------------------------
   ^ first                                             last ^

We will prune when all the following constraints are met:

1. the number of versions is greater than ``mon_min_osdmap_epochs``;

2. the number of versions between ``first`` and ``prune_to`` is greater than
   (or equal to) ``mon_osdmap_full_prune_min``, with ``prune_to`` being equal
   to ``last`` minus ``mon_min_osdmap_epochs``.

If any of these conditions fails, we will *not* prune any maps.

Furthermore, if we have been pruning but no longer satisfy at least one of
the above constraints, we will not continue to prune. In essence, we only
prune full osdmaps if the number of epochs in the store warrants it.

As pruning will create gaps in the sequence of full maps, we need to keep
track of the intervals of missing maps. We do this by keeping a manifest of
pinned maps -- i.e., a list of maps that, by being pinned, are not to be
pruned.

While pinned maps are not removed from the store, maps between two consecutive
pinned maps will be; the number of maps to be removed is dictated by the
configurable option ``mon_osdmap_full_prune_interval``. The algorithm makes an
effort to keep pinned maps apart by as many maps as defined by this option,
but in corner cases it may allow smaller intervals. Additionally, because this
option is read every time a prune iteration occurs, the interval may change if
the user changes the config option.

Pinning maps is performed lazily: we pin maps as we remove maps. This grants
us more flexibility to change the prune interval while pruning is happening,
and it also considerably simplifies the algorithm, as well as the information
we need to keep in the manifest. Below we show a simplified version of the
algorithm::

  manifest.pin(first)
  last_to_prune = last - mon_min_osdmap_epochs

  while manifest.get_last_pinned() + prune_interval < last_to_prune AND
        last_to_prune - first > mon_min_osdmap_epochs AND
        last_to_prune - first > mon_osdmap_full_prune_min AND
        num_pruned < mon_osdmap_full_prune_txsize:

      last_pinned = manifest.get_last_pinned()
      new_pinned = last_pinned + prune_interval
      manifest.pin(new_pinned)
      for e in (last_pinned .. new_pinned):
          store.erase(e)
          ++num_pruned

In essence, the algorithm ensures that the first version in the store is
*always* pinned. After all, we need a starting point when rebuilding maps, and
we can't simply remove the earliest map we have; otherwise we would be unable
to rebuild maps for the very first pruned interval.

Once we have at least one pinned map, each iteration of the algorithm can
simply base itself on the manifest's last pinned map (which we can obtain by
reading the element at the tail of the manifest's pinned maps list).

We'll next need to determine the interval of maps to be removed: all the maps
from ``last_pinned`` up to ``new_pinned``, which in turn is nothing more than
``last_pinned`` plus ``mon_osdmap_full_prune_interval``. We know that all maps
between these two values, ``last_pinned`` and ``new_pinned``, can be removed,
considering ``new_pinned`` has been pinned.

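For illustration, following the pseudocode above, a single removal step works
out like this (a sketch; the exact epochs kept depend on the first pinned
epoch)::

  prune_interval = 10                                  # mon_osdmap_full_prune_interval
  last_pinned = 1
  new_pinned = last_pinned + prune_interval
  to_erase = list(range(last_pinned + 1, new_pinned))  # both pins survive
  print(new_pinned, to_erase)                          # 11 [2, 3, 4, 5, 6, 7, 8, 9, 10]
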
The algorithm ceases to execute as soon as one of the two initial
preconditions is not met, or if we do not meet two additional conditions that
have no weight on the algorithm's correctness:

1. We will stop if we are not able to create a new pruning interval properly
   aligned with ``mon_osdmap_full_prune_interval`` that is lower than
   ``last_to_prune``. There is no particular technical reason why we enforce
   this requirement, besides allowing us to keep the intervals of the
   expected size, and preventing small, irregular intervals that would be
   bound to happen eventually (e.g., pruning continues over the course of
   several iterations, removing one or two or three maps each time).

2. We will stop once we know that we have pruned more than a certain number of
   maps. This value is defined by ``mon_osdmap_full_prune_txsize``, and
   ensures we don't spend an unbounded number of cycles pruning maps. We don't
   enforce this value religiously (deletes do not cost much), but we make an
   effort to honor it.

We could do the removal in one go, but we have no idea how long that would
take. Therefore, we will perform several iterations, removing at most
``mon_osdmap_full_prune_txsize`` osdmaps per iteration.

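The following is a minimal, runnable sketch (in Python, not the actual C++ in
``OSDMonitor``) of the loop above being driven over several iterations. Here
``full_maps`` stands in for the full-map keys in the store, ``pinned`` for the
manifest's pinned set, and the txsize value is assumed purely for
illustration::

  MON_MIN_OSDMAP_EPOCHS = 500
  MON_OSDMAP_FULL_PRUNE_MIN = 10000
  MON_OSDMAP_FULL_PRUNE_INTERVAL = 10
  MON_OSDMAP_FULL_PRUNE_TXSIZE = 100      # assumed value for this sketch

  def prune_iteration(full_maps, pinned, first, last):
      """Prune roughly txsize full maps; return how many were erased."""
      if not pinned:
          pinned.append(first)            # the first version is always pinned
      last_to_prune = last - MON_MIN_OSDMAP_EPOCHS
      num_pruned = 0
      while (pinned[-1] + MON_OSDMAP_FULL_PRUNE_INTERVAL < last_to_prune and
             last_to_prune - first > MON_MIN_OSDMAP_EPOCHS and
             last_to_prune - first > MON_OSDMAP_FULL_PRUNE_MIN and
             num_pruned < MON_OSDMAP_FULL_PRUNE_TXSIZE):
          last_pinned = pinned[-1]
          new_pinned = last_pinned + MON_OSDMAP_FULL_PRUNE_INTERVAL
          pinned.append(new_pinned)
          for e in range(last_pinned + 1, new_pinned):   # both pins survive
              full_maps.discard(e)
              num_pruned += 1
      return num_pruned

  # 50,000 epochs on disk; keep iterating until nothing more is pruned.
  full_maps, pinned = set(range(1, 50001)), []
  while prune_iteration(full_maps, pinned, first=1, last=50000):
      pass
  print(len(full_maps), pinned[0], pinned[-1])  # only pins and the tail remain
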
In the end, our on-disk map sequence will look similar to::

  ------------------------------------------
  |1|10|20|30|..|49500|49501|..|49999|50000|
  ------------------------------------------
   ^ first                            last ^

Because we are not pruning all versions in one go, we need to keep state
about how far along in our pruning we are. With that in mind, we have
created a data structure, ``osdmap_manifest_t``, that holds the set of pinned
maps::

  struct osdmap_manifest_t:
      set<version_t> pinned;

Given we are only pinning maps while we are pruning, we don't need to keep
track of additional state about the last pruned version. We know as a matter
of fact that we have pruned all the intermediate maps between any two
consecutive pinned maps.

The question one could ask, though, is how we can be sure we pruned all the
intermediate maps if, for instance, the monitor crashes. To ensure we are
protected against such an event, we always write the osdmap manifest to disk
in the same transaction that deletes the maps. This way we have the
guarantee that, if the monitor crashes, we will read the latest version of the
manifest: either it contains the newly pinned maps, meaning we also pruned the
in-between maps; or we will find the previous version of the osdmap manifest,
which will not contain the maps we were pinning at the time we crashed, given
that the transaction in which we would have written the updated osdmap
manifest was not applied (along with the map removals).

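The crash-safety argument can be pictured with a small sketch (Python; the
``Transaction`` class below only illustrates staging operations and applying
them atomically, it is not MonitorDBStore's real API)::

  import json

  class Transaction:
      """Stage erase/put operations and apply them all at once."""
      def __init__(self):
          self.ops = []
      def erase(self, key):
          self.ops.append(("erase", key, None))
      def put(self, key, value):
          self.ops.append(("put", key, value))
      def apply(self, store):
          # All staged operations take effect together in this toy model.
          for op, key, value in self.ops:
              if op == "erase":
                  store.pop(key, None)
              else:
                  store[key] = value

  store = {"full_%d" % e: "..." for e in range(1, 21)}
  pinned = [1, 11]

  t = Transaction()
  for e in range(2, 11):                        # prune the maps between the pins
      t.erase("full_%d" % e)
  t.put("osdmap_manifest", json.dumps(pinned))  # written with the erases
  t.apply(store)                                # crash before this: neither applied
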
The osdmap manifest will be written to the store each time we prune, with an
updated list of pinned maps. It is written in the transaction effectively
pruning the maps, so we guarantee the manifest is always up to date. As a
consequence of this criterion, the first time we write the osdmap manifest
is the first time we prune. If an osdmap manifest does not exist, we can be
certain we do not hold pruned map intervals.

We will rely on the manifest to ascertain whether we have pruned map
intervals. In theory, this will always be the on-disk osdmap manifest, but we
make sure to read the on-disk osdmap manifest each time we update from paxos;
this way we always ensure having an up-to-date in-memory osdmap manifest.

Once we finish pruning maps, we will keep the manifest in the store, to
allow us to easily find which maps have been pinned (instead of checking
the store until we find a map). This has the added benefit of allowing us to
quickly figure out the next interval we need to prune (i.e., the last
pinned map plus the prune interval). This doesn't mean we will keep the
osdmap manifest forever, however: the osdmap manifest will no longer be
required once the monitor trims osdmaps and the earliest available epoch in
the store is greater than the last map we pruned.

The same conditions from ``OSDMonitor::get_trim_to()`` that force the monitor
to keep a lot of osdmaps, thus requiring us to prune, may eventually change
and allow the monitor to remove some of its oldest maps.

MAP TRIMMING
------------

If the monitor trims maps, we must then adjust the osdmap manifest to
reflect our pruning status, or remove the manifest entirely if it no longer
makes sense to keep it. For instance, take the map sequence from before, but
let us assume we did not finish pruning all the maps::

  -------------------------------------------------------------
  |1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
  -------------------------------------------------------------
   ^ first           ^ pinned.last()                    last ^

  pinned = {1, 10, 20, ..., 490, 500}

Now let us assume that the monitor will trim up to epoch 501. This means
removing all maps prior to epoch 501, and updating the ``first_committed``
pointer to ``501``. Given that removing all those maps would invalidate our
existing pruning efforts, we can consider our pruning finished and drop
our osdmap manifest. Doing so also simplifies starting a new prune, if all
the starting conditions are met once we have refreshed our state from the
store.

We would then have the following map sequence::

  ---------------------------------------
  |501|502|..|49500|49501|..|49999|50000|
  ---------------------------------------
   ^ first                         last ^

However, imagine a slightly more convoluted scenario: the monitor will trim
up to epoch 491. In this case, epoch 491 has been previously pruned from the
store.

Given we will always need to have the oldest known map in the store, before
we trim we will have to check whether that map is in the prune interval
(i.e., whether said map epoch belongs to ``[ pinned.first()..pinned.last() )``).
If so, we need to check whether it is a pinned map, in which case we don't
have much to be concerned about aside from removing the lower epochs from the
manifest's pinned list. On the other hand, if the map being trimmed to is not
a pinned map, we will need to rebuild said map and pin it, and only then will
we remove the pinned maps prior to that map's epoch.

In this case, we would end up with the following sequence::

  -----------------------------------------------
  |491|500|501|502|..|49500|49501|..|49999|50000|
  -----------------------------------------------
   ^   ^- pinned.last()                   last ^
   `- first

There is still an edge case that we should mention. Consider that we are
going to trim up to epoch 499, which is the very last pruned epoch.

Much like the scenario above, we would end up writing osdmap epoch 499 to
the store; but what should we do about pinned maps and pruning?

The simplest solution is to drop the osdmap manifest. After all, given we
are trimming to the last pruned map, and we are rebuilding this map, we can
guarantee that all maps greater than epoch 499 are sequential (because we
have not pruned any of them). Dropping the osdmap manifest in this case is
essentially the same as if we were trimming past the last pruned epoch: we
can prune again later if we meet the required conditions.

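Putting the trimming cases together, here is a small sketch (Python;
``rebuild_map`` is a hypothetical stand-in for rebuilding a full map from an
older pinned map plus incrementals) of how the pinned list could be adjusted
on trim, assuming the regular lazy-pinning layout described above::

  def rebuild_map(epoch):
      """Stand-in: rebuild and store the full map for `epoch`."""
      pass

  def on_trim(pinned, trim_to):
      """Return the pinned epochs that remain after trimming up to trim_to."""
      if not pinned or trim_to + 1 >= pinned[-1]:
          # Trimming at or past the last pruned epoch: no gaps remain above
          # trim_to, so the manifest can simply be dropped.
          return []
      if trim_to not in pinned:
          rebuild_map(trim_to)          # trim_to was pruned: rebuild and pin it
          pinned = sorted(pinned + [trim_to])
      # keep trim_to and every pinned epoch above it
      return [e for e in pinned if e >= trim_to]

  pinned = [1, 10, 20, 30, 490, 500]    # abbreviated version of the example
  print(on_trim(pinned, 501))           # [] -> drop the manifest
  print(on_trim(pinned, 491))           # [491, 500] -> 491 rebuilt and pinned
  print(on_trim(pinned, 499))           # [] -> the epoch-499 edge case above
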
And, with this, we have covered full osdmap pruning. Later in this
document one can find detailed `REQUIREMENTS, CONDITIONS & INVARIANTS` for the
whole algorithm, from pruning to trimming. Additionally, the next section
details several additional checks to guarantee the sanity of our configuration
options. Enjoy.

CONFIGURATION OPTIONS SANITY CHECKS
-----------------------------------

We perform additional checks before pruning to ensure all configuration
options involved are sane (a short sketch of these checks follows the list):

1. If ``mon_osdmap_full_prune_interval`` is zero we will not prune; we
   require an actual positive number, greater than one, to be able to prune
   maps. If the interval is one, we would not actually be pruning any maps, as
   the interval between pinned maps would essentially be a single epoch. This
   means we would have zero maps in-between pinned maps, hence no maps would
   ever be pruned.

2. If ``mon_osdmap_full_prune_min`` is zero we will not prune; we require a
   positive value, greater than zero, so we know the threshold over which we
   should prune. We don't want to guess.

3. If ``mon_osdmap_full_prune_interval`` is greater than
   ``mon_osdmap_full_prune_min`` we will not prune, as it is impossible to
   ascertain a proper prune interval.

4. If ``mon_osdmap_full_prune_txsize`` is lower than
   ``mon_osdmap_full_prune_interval`` we will not prune; we require a
   ``txsize`` with a value at least equal to ``interval``, and (depending on
   the value of the latter) ideally higher.

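As promised above, a minimal sketch of these checks (Python; option names as
in the text, with values passed in explicitly rather than read from the
monitor's configuration)::

  def prune_options_sane(prune_min, prune_interval, txsize):
      """Mirror checks 1-4: return True only if pruning may be attempted."""
      if prune_interval <= 1:          # check 1: need maps in between pins
          return False
      if prune_min == 0:               # check 2: need an explicit threshold
          return False
      if prune_interval > prune_min:   # check 3: interval must fit threshold
          return False
      if txsize < prune_interval:      # check 4: txsize must cover an interval
          return False
      return True

  print(prune_options_sane(10000, 10, 100))  # True (txsize chosen for the example)
  print(prune_options_sane(10000, 1, 100))   # False: interval of 1 prunes nothing
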
REQUIREMENTS, CONDITIONS & INVARIANTS
-------------------------------------

REQUIREMENTS
~~~~~~~~~~~~

* All monitors in the quorum need to support pruning.

* Once pruning has been enabled, monitors not supporting pruning will not be
  allowed in the quorum, nor will they be allowed to synchronize.

* Removing the osdmap manifest results in disabling the pruning feature quorum
  requirement. This means that monitors not supporting pruning will be allowed
  to synchronize and join the quorum, provided they support any other required
  features.

CONDITIONS & INVARIANTS
~~~~~~~~~~~~~~~~~~~~~~~

* Pruning has never happened, or we have trimmed past its previous
  intervals::

    invariant: first_committed > 1

    condition: pinned.empty() AND !store.exists(manifest)

* Pruning has happened at least once::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

    precond:   pinned.last() < prune_to AND
               pinned.last() + prune_interval < prune_to

    postcond:  pinned.size() > old_pinned.size() AND
               (for each v in [pinned.first()..pinned.last()]:
                 if pinned.count(v) > 0: store.exists_full(v)
                 else: !store.exists_full(v)
               )

* Pruning has finished::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

    condition: pinned.last() == prune_to OR
               pinned.last() + prune_interval < prune_to

* Pruning intervals can be trimmed::

    precond:   OSDMonitor::get_trim_to() > 0

    condition: !pinned.empty()

    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()

* Trim pruned intervals::

    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()

    postcond:  pinned.empty() OR
               (pinned.first() == OSDMonitor::get_trim_to() AND
                pinned.last() > pinned.first() AND
                (for each v in [0..pinned.first()]:
                  !store.exists(v) AND
                  !store.exists_full(v)
                ) AND
                (for each m in [pinned.first()..pinned.last()]:
                  if pinned.count(m) > 0: store.exists_full(m)
                  else: !store.exists_full(m) AND store.exists(m)
                )
               )

    postcond:  !pinned.empty() OR
               (!store.exists(manifest) AND
                (for each v in [pinned.first()..pinned.last()]:
                  !store.exists(v) AND
                  !store.exists_full(v)
                )
               )

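To make the intended reading of the blocks above concrete, a small sketch
(Python; an in-memory model of the store, not a real monitor check) of
verifying the "pruning has happened at least once" block::

  def check_pruned_at_least_once(pinned, first_committed, last_committed,
                                 store_full):
      """Assert the invariants and the per-version postcondition."""
      assert first_committed > 0
      assert pinned
      assert min(pinned) == first_committed
      assert max(pinned) < last_committed
      for v in range(min(pinned), max(pinned) + 1):
          if v in pinned:
              assert v in store_full        # pinned maps stay on disk
          else:
              assert v not in store_full    # the maps in between were pruned

  # Example: pins at 1 and 11, epochs 2..10 pruned, 11..20 still present.
  pins = {1, 11}
  full = {1, 11} | set(range(12, 21))
  check_pruned_at_least_once(pins, first_committed=1, last_committed=20,
                             store_full=full)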