=====================
Asynchronous Recovery
=====================

Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.

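As a minimal sketch (not Ceph's actual implementation), the idea of using a
per-PG write log to find divergent objects without a full scan can be
illustrated as follows; the function name and log representation here are
hypothetical:

```python
def missing_objects(authoritative_log, peer_last_version):
    """Return objects whose latest write is newer than the peer's log head.

    authoritative_log: list of (version, object_name) entries, one per write.
    peer_last_version: the newest log version the peer has applied.
    """
    missing = {}
    for version, obj in authoritative_log:
        if version > peer_last_version:
            # Keep only the newest version needed per object.
            missing[obj] = max(version, missing.get(obj, 0))
    return missing

log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
print(missing_objects(log, peer_last_version=2))  # {'a': 3, 'c': 4}
```

Only objects written after the peer fell behind need to be recovered, which
is why log-based recovery avoids scanning every RADOS object.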
Prior to the Nautilus release this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
PG log contains 3000 writes to disjoint objects. When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
trade backfill for recovery by deleting and redeploying the containing
OSD than to iterate through the PG log. Recovering several megabytes
of RADOS object data (or even worse, several megabytes of omap keys,
notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.

To avoid this we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as *asynchronous
recovery*.

The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:

* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
  logs combined with historical missing objects to estimate the cost of
  recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
  when asynchronous recovery is appropriate

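The criteria above can be sketched as a simple heuristic. This is not the
real decision logic (which lives in the OSD peering code); it is a hedged
model of the idea: estimate cost from the log-length gap plus the peer's
missing objects, and go asynchronous when the cost crosses the threshold.
The default of 100 for ``osd_async_recovery_min_cost`` is an assumption here:

```python
OSD_ASYNC_RECOVERY_MIN_COST = 100  # assumed default for illustration

def prefer_async_recovery(auth_log_len, peer_log_len, peer_missing,
                          min_cost=OSD_ASYNC_RECOVERY_MIN_COST):
    """Estimate recovery cost for one peer; pick async recovery when high."""
    # Approximate cost: log entries the peer lacks plus objects already
    # known to be missing on that peer.
    approx_cost = max(auth_log_len - peer_log_len, 0) + peer_missing
    return approx_cost >= min_cost

print(prefer_async_recovery(3000, 10, 0))   # True: huge log divergence
print(prefer_async_recovery(120, 100, 5))   # False: cheap to recover in-line
```

A peer with a large estimated cost is left out of the acting set (subject to
keeping ``min_size`` replicas) and recovered in the background instead.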
With the existing peering process, when we choose the acting set we
have not fetched the PG log from each peer; we have only the bounds of
it and other metadata from their ``pg_info_t``. It would be more expensive
to fetch and examine every log at this point, so we only consider an
approximate check for log length for now. In Nautilus, we improved
the accounting of missing objects, so post-Nautilus this information
is also used to determine the cost of recovery.

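The reason the check is only approximate can be sketched as follows: at
acting-set selection time we hold each peer's summary (log bounds and, post-
Nautilus, a missing-object count), not the full log. The field names below
are illustrative and do not match Ceph's exact ``pg_info_t`` layout:

```python
from dataclasses import dataclass

@dataclass
class PGInfo:
    """Simplified stand-in for the per-peer metadata available at peering."""
    log_tail: int      # oldest log entry version this peer holds
    log_head: int      # newest log entry version this peer holds
    num_missing: int   # post-Nautilus: tracked count of missing objects

def approx_recovery_cost(auth: PGInfo, peer: PGInfo) -> int:
    """Approximate cost from log bounds plus the missing count only."""
    log_gap = max(auth.log_head - peer.log_head, 0)
    return log_gap + peer.num_missing

auth = PGInfo(log_tail=0, log_head=500, num_missing=0)
peer = PGInfo(log_tail=0, log_head=200, num_missing=30)
print(approx_recovery_cost(auth, peer))  # 330
```

Because only the bounds are known, the true per-object work cannot be
computed here, which is why the cost estimate stays coarse.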
While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.
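The catch-up mechanism can be sketched as below; this hedged model (not
Ceph's replication path) only shows the idea that each new write's log entry
is appended on the async recovery targets as well as the acting set, so the
targets' logs never fall further behind while data recovery runs in the
background:

```python
def replicate_write(entry, acting_set, async_recovery_targets, logs):
    """Append a new log entry on the acting set AND the async targets."""
    for osd in acting_set + async_recovery_targets:
        logs.setdefault(osd, []).append(entry)

logs = {}
replicate_write((5, "obj1"), acting_set=[0, 1], async_recovery_targets=[2],
                logs=logs)
replicate_write((6, "obj2"), acting_set=[0, 1], async_recovery_targets=[2],
                logs=logs)
print(logs[2])  # [(5, 'obj1'), (6, 'obj2')]: the async target's log keeps up
```

Once the background data recovery finishes, the target's log already matches
the acting set's, so it can rejoin without a second round of recovery.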