.. Source snapshot: Ceph Pacific 16.2.2, ceph/doc/dev/osd_internals/backfill_reservation.rst

====================
Backfill Reservation
====================

When a new OSD joins a cluster, all PGs with it in their acting sets must
eventually backfill. If all of these backfills happen simultaneously,
they will present excessive load on the OSD: the "thundering herd"
effect.

The ``osd_max_backfills`` tunable limits the number of outgoing or
incoming backfills that are active on a given OSD. Note that this limit is
applied separately to incoming and to outgoing backfill operations.
Thus there can be as many as ``osd_max_backfills * 2`` backfill operations
in flight on each OSD. This subtlety is often missed, and Ceph
operators can be puzzled as to why more ops are observed than expected.

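The arithmetic can be spelled out as a minimal sketch. The function name is
hypothetical; it only restates the rule above that the same limit applies
once to outgoing and once to incoming backfills.

```python
def max_backfills_in_flight(osd_max_backfills: int) -> int:
    """Upper bound on concurrent backfills that involve a single OSD."""
    outgoing = osd_max_backfills  # backfills going from this OSD
    incoming = osd_max_backfills  # backfills going to this OSD
    return outgoing + incoming


# With osd_max_backfills set to 1, up to 2 backfills can involve one OSD.
in_flight = max_backfills_in_flight(1)
```
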
Each ``OSDService`` has two ``AsyncReserver`` instances: one for backfills going
from the OSD (``local_reserver``) and one for backfills going to the OSD
(``remote_reserver``). An ``AsyncReserver`` (``common/AsyncReserver.h``)
manages a priority-ordered queue of waiting items and a set of current
reservation holders. When a slot frees up, the ``AsyncReserver`` takes the next
item from the highest-priority queue and queues its associated ``Context*`` in
the finisher provided to the constructor.

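The queueing behaviour described above can be modelled with a small Python
sketch. This is a toy model, not the C++ class: ``ReserverSketch`` and its
method names are illustrative assumptions, and the callback is invoked
directly, whereas the real class hands the ``Context*`` to a finisher.

```python
from collections import deque


class ReserverSketch:
    """Toy model of a priority-ordered reservation queue (not Ceph's API)."""

    def __init__(self, max_allowed):
        self.max_allowed = max_allowed
        self.queues = {}          # priority -> deque of (item, callback)
        self.in_progress = set()  # items that currently hold a slot

    def request_reservation(self, item, on_reserved, prio):
        self.queues.setdefault(prio, deque()).append((item, on_reserved))
        self._do_queues()

    def cancel_reservation(self, item):
        # Drop the item whether it is still waiting or already holds a slot.
        for queue in self.queues.values():
            for entry in list(queue):
                if entry[0] == item:
                    queue.remove(entry)
        self.in_progress.discard(item)
        self._do_queues()

    def _do_queues(self):
        # Grant free slots to the head of the highest non-empty priority queue.
        while len(self.in_progress) < self.max_allowed:
            live = [prio for prio, queue in self.queues.items() if queue]
            if not live:
                return
            item, on_reserved = self.queues[max(live)].popleft()
            self.in_progress.add(item)
            on_reserved()  # simplification: run inline instead of on a finisher


granted = []
reserver = ReserverSketch(max_allowed=1)
reserver.request_reservation("pg1", lambda: granted.append("pg1"), prio=100)
reserver.request_reservation("pg2", lambda: granted.append("pg2"), prio=140)
reserver.request_reservation("pg3", lambda: granted.append("pg3"), prio=180)
reserver.cancel_reservation("pg1")  # frees the slot; pg3 outranks pg2
```

In this run ``pg1`` is granted immediately because the slot is free. When it
is released, ``pg3`` is granted ahead of ``pg2`` even though it was queued
later, because it waits on a higher-priority queue.
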
For a primary to initiate a backfill, it must first obtain a reservation from
its own ``local_reserver``. Then it must obtain a reservation from the backfill
target's ``remote_reserver`` via an ``MBackfillReserve`` message. This process is
managed by sub-states of ``Active`` and ``ReplicaActive`` (see the sub-states
of ``Active`` in ``PG.h``). The reservations are dropped either on the
``Backfilled`` event (which is sent on the primary before calling
``recovery_complete`` and on the replica on receipt of the ``BackfillComplete``
progress message), or upon leaving ``Active`` or ``ReplicaActive``.

It's important to always grab the local reservation before the remote
reservation in order to prevent a circular dependency.

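The ordering rule can be illustrated with a deliberately simplified sketch.
``Slots`` and ``try_start_backfill`` are hypothetical names, and the calls are
synchronous and non-blocking, whereas the real exchange is asynchronous and
message-driven. The point is only the order: the primary's local slot is taken
first, the target's remote slot second, and the local slot is given back if
the remote one cannot be obtained.

```python
class Slots:
    """Hypothetical fixed-size slot pool standing in for one reserver."""

    def __init__(self, limit):
        self.limit = limit
        self.held = set()

    def try_reserve(self, pg):
        if len(self.held) >= self.limit:
            return False
        self.held.add(pg)
        return True

    def release(self, pg):
        self.held.discard(pg)


def try_start_backfill(pg, primary_local, target_remote):
    """Reserve locally first, then remotely; undo the local slot on failure."""
    if not primary_local.try_reserve(pg):
        return False
    if not target_remote.try_reserve(pg):
        primary_local.release(pg)  # do not sit on a slot we cannot use
        return False
    return True


local = Slots(limit=2)   # stands in for the primary's local_reserver
remote = Slots(limit=1)  # stands in for the target's remote_reserver
first = try_start_backfill("1.a", local, remote)   # obtains both slots
second = try_start_backfill("1.b", local, remote)  # remote full, local returned
```
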
We minimize the risk of data loss by prioritizing the order in
which PGs are recovered. Admins can override the default order by using
``force-recovery`` or ``force-backfill``. A ``force-recovery`` op at
priority ``255`` will start before a ``force-backfill`` op at priority ``254``.

If recovery is needed because a PG is below ``min_size``, a base priority of
``220`` is used. This is incremented by the number of OSDs short of the pool's
``min_size`` as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``253`` so that it does not collide with
the priorities of forced ops as described above. Under ordinary circumstances
a recovery op is prioritized at ``180`` plus a value relative to the pool's
``recovery_priority``. The resultant priority is capped at ``219``.

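The recovery rules above can be written out as a short sketch. The function
and parameter names are hypothetical, and ``pool_adjustment`` stands in for
the "value relative to the pool's ``recovery_priority``", whose exact
derivation is not given here. Only the upper caps stated in the text are
applied.

```python
def recovery_priority(pool_adjustment, osds_short_of_min_size=0, forced=False):
    """Sketch of the recovery op priority rules described in the text."""
    if forced:
        return 255  # force-recovery
    if osds_short_of_min_size > 0:
        # PG is below min_size: base 220, capped below the forced priorities.
        return min(220 + osds_short_of_min_size + pool_adjustment, 253)
    # Ordinary recovery: base 180, capped at 219.
    return min(180 + pool_adjustment, 219)


ordinary = recovery_priority(5)                            # 180 + 5
capped = recovery_priority(60)                             # 240 capped to 219
inactive = recovery_priority(3, osds_short_of_min_size=2)  # 220 + 2 + 3
```
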
If backfill is needed because the number of acting OSDs is less than
the pool's ``min_size``, a base priority of ``220`` is used. The number of OSDs
short of the pool's ``min_size`` is added, as well as a value relative to
the pool's ``recovery_priority``. The total priority is capped at ``253``.

If backfill is needed because a PG is undersized,
a base priority of ``140`` is used. The number of OSDs below the size of the
pool is added, as well as a value relative to the pool's ``recovery_priority``.
The resultant priority is capped at ``179``. If a backfill op is
needed because a PG is degraded, a base priority of ``140`` is used. A value
relative to the pool's ``recovery_priority`` is added. The resultant priority
is capped at ``179``. Under ordinary circumstances a
backfill op base priority of ``100`` is used. A value relative to the pool's
``recovery_priority`` is added. The total priority is capped at ``139``.

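The backfill cases can be sketched the same way. The names are hypothetical,
``pool_adjustment`` again stands in for the value derived from the pool's
``recovery_priority``, and the order in which the conditions are checked
(most severe first) is an assumption of this sketch rather than something the
text specifies.

```python
def backfill_priority(pool_adjustment, osds_short_of_min_size=0,
                      osds_short_of_size=0, degraded=False, forced=False):
    """Sketch of the backfill op priority rules described in the text."""
    if forced:
        return 254  # force-backfill
    if osds_short_of_min_size > 0:
        # Fewer acting OSDs than min_size: base 220, capped at 253.
        return min(220 + osds_short_of_min_size + pool_adjustment, 253)
    if osds_short_of_size > 0:
        # Undersized PG: base 140 plus the number of missing OSDs.
        return min(140 + osds_short_of_size + pool_adjustment, 179)
    if degraded:
        # Degraded PG: base 140, capped at 179.
        return min(140 + pool_adjustment, 179)
    # Ordinary backfill: base 100, capped at 139.
    return min(100 + pool_adjustment, 139)


ordinary = backfill_priority(0)                          # 100
degraded = backfill_priority(10, degraded=True)          # 140 + 10
undersized = backfill_priority(5, osds_short_of_size=1)  # 140 + 1 + 5
```
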
.. list-table:: Backfill and Recovery op priorities
   :widths: 20 20 20
   :header-rows: 1

   * - Description
     - Base priority
     - Maximum priority
   * - Backfill
     - 100
     - 139
   * - Degraded Backfill
     - 140
     - 179
   * - Recovery
     - 180
     - 219
   * - Inactive Recovery
     - 220
     - 253
   * - Inactive Backfill
     - 220
     - 253
   * - force-backfill
     - 254
     -
   * - force-recovery
     - 255
     -