=====================
Asynchronous Recovery
=====================

Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.

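As a minimal illustration of this idea, the sketch below derives a peer's
missing set by comparing the authoritative log against the peer's last
applied version. The types and the ``build_missing`` helper are
hypothetical stand-ins, not Ceph's actual ``pg_log_t`` or ``pg_missing_t``
structures, and versions are flattened to plain integers.

.. code-block:: cpp

   // Hypothetical sketch: a log of (version, object) entries lets us
   // compute exactly which objects a lagging peer must recover, without
   // scanning all objects.
   #include <cstdint>
   #include <iostream>
   #include <map>
   #include <string>
   #include <vector>

   struct LogEntry {
     uint64_t version;    // monotonically increasing per-PG version
     std::string object;  // object modified by this write
   };

   // Objects the peer is missing are those touched by entries newer than
   // the peer's last applied version. Later entries overwrite earlier
   // ones, so the map keeps only the newest version needed per object.
   std::map<std::string, uint64_t> build_missing(
       const std::vector<LogEntry>& authoritative_log,
       uint64_t peer_last_update) {
     std::map<std::string, uint64_t> missing;
     for (const auto& e : authoritative_log) {
       if (e.version > peer_last_update)
         missing[e.object] = e.version;
     }
     return missing;
   }

   int main() {
     std::vector<LogEntry> log = {
         {10, "obj_a"}, {11, "obj_b"}, {12, "obj_a"}, {13, "obj_c"}};
     // Peer applied everything up to version 11: only obj_a (v12) and
     // obj_c (v13) need recovery.
     for (const auto& [obj, v] : build_missing(log, 11))
       std::cout << obj << " needs version " << v << "\n";
     return 0;
   }
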
Prior to the Nautilus release, this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this is significantly better for availability, e.g. if the
PG log contains 3000 writes to disjoint objects. When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
trade log-based recovery for backfill by deleting and redeploying the
containing OSD than to iterate through the PG log. Recovering several
megabytes of RADOS object data (or even worse, several megabytes of omap
keys, notably RGW bucket indexes) can drastically increase latency for a
small update; combined with requests spread across many degraded objects,
it is a recipe for slow requests.

To avoid this we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as *asynchronous
recovery*.

The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. A few criteria all need to be met for
asynchronous recovery:

* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
  logs, combined with historical missing objects, to estimate the cost of
  recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
  when asynchronous recovery is appropriate, as sketched below

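The sketch below combines these criteria into a single decision: the
estimated cost must clear the threshold, and enough acting-set members
must remain to serve writes. The function name and parameters are
hypothetical, not Ceph's actual peering API; how the cost estimate itself
is derived is sketched after the next paragraph.

.. code-block:: cpp

   // Hypothetical sketch of the decision rule above, not Ceph's peering
   // code. A peer becomes an async recovery target only if its estimated
   // recovery cost exceeds osd_async_recovery_min_cost *and* the acting
   // set left after removing the peer still has at least min_size
   // members, so writes remain available.
   #include <cstddef>
   #include <cstdint>
   #include <iostream>

   bool should_recover_async(uint64_t estimated_cost,
                             uint64_t osd_async_recovery_min_cost,
                             std::size_t acting_size_without_peer,
                             std::size_t min_size) {
     return estimated_cost > osd_async_recovery_min_cost &&
            acting_size_without_peer >= min_size;
   }

   int main() {
     // A peer lagging by 150 log entries in a 3-replica pool with
     // min_size=2, using 100 as an example threshold: the cost clears the
     // threshold and two up-to-date members remain, so the peer can be
     // recovered asynchronously.
     std::cout << std::boolalpha
               << should_recover_async(150, 100, 2, 2) << "\n";  // true
     return 0;
   }
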
With the existing peering process, when we choose the acting set we
have not yet fetched the PG log from each peer; we have only its bounds
and other metadata from each peer's ``pg_info_t``. It would be more
expensive to fetch and examine every log at this point, so we only
consider an approximate check on log length for now. In Nautilus, we
improved the accounting of missing objects, so post-Nautilus this
information is also used to determine the cost of recovery.

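The sketch below illustrates such an approximate estimate. ``PeerInfo``
is a hypothetical stand-in for the relevant ``pg_info_t`` fields, and
versions are treated as plain integers for simplicity; the real field
names and version types differ.

.. code-block:: cpp

   // Simplified sketch of the approximate cost estimate; hypothetical
   // types, with versions flattened to integers.
   #include <cstdint>
   #include <iostream>

   struct PeerInfo {
     uint64_t last_update;  // newest log entry the peer claims (log head)
     uint64_t num_missing;  // missing-object count, tracked since Nautilus
   };

   // At peering time no peer log has been fetched, only its bounds, so
   // the cost is version arithmetic rather than an entry-by-entry diff:
   // how far the peer's head lags the authoritative head, plus the
   // objects the peer is already known to be missing.
   uint64_t approx_recovery_cost(const PeerInfo& peer,
                                 uint64_t auth_last_update) {
     uint64_t log_lag = auth_last_update > peer.last_update
                            ? auth_last_update - peer.last_update
                            : 0;
     return log_lag + peer.num_missing;
   }

   int main() {
     PeerInfo peer{/*last_update=*/1200, /*num_missing=*/40};
     // Authoritative head at 1500: a lag of 300 entries plus 40
     // known-missing objects gives an estimated cost of 340.
     std::cout << approx_recovery_cost(peer, 1500) << "\n";  // 340
     return 0;
   }
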
While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send the corresponding log entries to the
async recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.
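
A minimal sketch of that write path follows, with hypothetical types in
place of Ceph's actual replication machinery: acting-set members receive
the full operation, while async recovery targets receive only the log
entry and recover the object data in the background later.

.. code-block:: cpp

   // Hypothetical sketch, not Ceph's replication code: full ops go to
   // the acting set; async recovery targets get only the log entry so
   // their logs stay contiguous and they can catch up later.
   #include <cstdint>
   #include <iostream>
   #include <string>
   #include <vector>

   struct LogEntry {
     uint64_t version;    // per-PG version of this write
     std::string object;  // object the write touches
   };

   struct WriteOp {
     LogEntry entry;
     std::string payload;  // object data being written
   };

   void replicate_write(const WriteOp& op,
                        const std::vector<int>& acting,
                        const std::vector<int>& async_targets) {
     for (int osd : acting)  // data + log entry
       std::cout << "osd." << osd << ": apply data + log v"
                 << op.entry.version << "\n";
     for (int osd : async_targets)  // log entry only; data comes later
       std::cout << "osd." << osd << ": append log v" << op.entry.version
                 << " (" << op.entry.object << " still degraded)\n";
   }

   int main() {
     replicate_write({{42, "obj_a"}, "new-bytes"}, /*acting=*/{0, 1},
                     /*async_targets=*/{2});
     return 0;
   }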