=====================
Asynchronous Recovery
=====================

Ceph Placement Groups (PGs) maintain a log of write transactions to
facilitate speedy recovery of data. During recovery, each of these PG logs
is used to determine which content in each OSD is missing or outdated.
This obviates the need to scan all RADOS objects.
See :ref:`Log Based PG <log-based-pg>` for more details on this process.

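To make this concrete, below is a minimal sketch of the idea. The types
are simplified stand-ins for Ceph's ``pg_log_entry_t`` and ``eversion_t``,
not the actual implementation: any log entry newer than the version a peer
has caught up to names an object that peer must recover.

.. code-block:: cpp

   #include <cstdint>
   #include <iostream>
   #include <set>
   #include <string>
   #include <vector>

   // Simplified stand-in for a PG log entry (cf. pg_log_entry_t).
   struct LogEntry {
     uint64_t version;  // monotonically increasing PG log version
     std::string oid;   // object written by this entry
   };

   // Every authoritative log entry past the version a peer has caught
   // up to names an object that peer is missing or holds outdated.
   std::set<std::string> missing_objects(
       const std::vector<LogEntry>& auth_log, uint64_t peer_last_update) {
     std::set<std::string> missing;
     for (const auto& e : auth_log) {
       if (e.version > peer_last_update)
         missing.insert(e.oid);  // only these objects need recovery
     }
     return missing;
   }

   int main() {
     std::vector<LogEntry> auth_log = {
         {1, "obj_a"}, {2, "obj_b"}, {3, "obj_a"}, {4, "obj_c"},
     };
     // A peer that last saw version 2 must recover obj_a (rewritten at
     // v3) and obj_c (created at v4) -- no full object scan required.
     for (const auto& oid : missing_objects(auth_log, 2))
       std::cout << "recover: " << oid << "\n";
   }
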
Prior to the Nautilus release this recovery process was synchronous: it
blocked writes to a RADOS object until it was recovered. In contrast,
backfill could allow writes to proceed (assuming enough up-to-date replicas
were available) by temporarily assigning a different acting set, and
backfilling an OSD outside of the acting set. In some circumstances
this ends up being significantly better for availability, e.g. if the
PG log contains 3000 writes to disjoint objects. When the PG log contains
thousands of entries, it could actually be faster (though not as safe) to
trade recovery for backfill by deleting and redeploying the containing
OSD than to iterate through the PG log. Recovering several megabytes
of RADOS object data (or even worse, several megabytes of omap keys,
notably RGW bucket indexes) can drastically increase latency for a small
update, and combined with requests spread across many degraded objects
it is a recipe for slow requests.

To avoid this we can perform recovery in the background on an OSD
out-of-band of the live acting set, similar to backfill, but still using
the PG log to determine what needs to be done. This is known as
*asynchronous recovery*.

The threshold for performing asynchronous recovery instead of synchronous
recovery is not clear-cut. There are a few criteria which
need to be met for asynchronous recovery:

* Try to keep ``min_size`` replicas available
* Use the approximate magnitude of the difference in length of
  logs, combined with historical missing objects, to estimate the cost of
  recovery
* Use the parameter ``osd_async_recovery_min_cost`` to determine
  when asynchronous recovery is appropriate (a sketch of this check
  follows the list)

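Below is a simplified sketch of such a cost check. The names
(``PeerInfo``, ``recovery_cost``, ``should_recover_async``) and the exact
formula are illustrative assumptions, not Ceph's actual peering code, and
the real decision also enforces the ``min_size`` criterion above.

.. code-block:: cpp

   #include <cstdint>
   #include <iostream>

   // Metadata we already hold for a peer at decision time (sketch).
   struct PeerInfo {
     uint64_t last_update;  // newest version in the peer's PG log
     uint64_t num_missing;  // historical missing-object count
   };

   // Estimated recovery cost: how far the peer's log lags behind the
   // authoritative head, plus objects already known to be missing.
   uint64_t recovery_cost(uint64_t auth_last_update, const PeerInfo& peer) {
     uint64_t log_lag = auth_last_update > peer.last_update
                            ? auth_last_update - peer.last_update
                            : 0;
     return log_lag + peer.num_missing;
   }

   bool should_recover_async(uint64_t auth_last_update, const PeerInfo& peer,
                             uint64_t osd_async_recovery_min_cost) {
     // Cheap peers recover synchronously; expensive ones go async so
     // that client writes are not blocked behind their recovery.
     return recovery_cost(auth_last_update, peer) >=
            osd_async_recovery_min_cost;
   }

   int main() {
     PeerInfo peer{/*last_update=*/1000, /*num_missing=*/1500};
     std::cout << std::boolalpha
               << should_recover_async(/*auth_last_update=*/3000, peer,
                                       /*min_cost=*/100)
               << "\n";  // true: ~3500 entries is too costly to block on
   }
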
With the existing peering process, when we choose the acting set we
have not yet fetched the PG log from each peer; we have only its bounds
and other metadata from each peer's ``pg_info_t``. It would be more
expensive to fetch and examine every log at this point, so we only
consider an approximate check of log length for now. In Nautilus, we
improved the accounting of missing objects, so post-Nautilus this
information is also used to determine the cost of recovery.

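As an illustration, here is a small sketch of how the log bounds alone
yield an approximate log length and lag, with a simplified stand-in for
the relevant ``pg_info_t`` fields and no log fetch at all:

.. code-block:: cpp

   #include <cstdint>
   #include <iostream>

   // Sketch of the log bounds carried in pg_info_t: during peering we
   // know where each peer's log begins and ends, but not its contents.
   struct PgInfoBounds {
     uint64_t log_tail;     // version of the oldest retained log entry
     uint64_t last_update;  // version of the newest log entry
   };

   int main() {
     PgInfoBounds auth{200, 5200};     // authoritative peer
     PgInfoBounds lagging{200, 2200};  // peer that fell behind

     // Approximations from metadata alone; fetching and diffing the
     // full logs at peering time would be far more expensive.
     std::cout << "auth log holds ~" << (auth.last_update - auth.log_tail)
               << " entries\n";
     std::cout << "peer lags by ~" << (auth.last_update - lagging.last_update)
               << " entries\n";
   }
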
While async recovery is occurring, writes to members of the acting set
may proceed, but we need to send their log entries to the async
recovery targets (just like we do for backfill OSDs) so that they
can completely catch up.

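A hypothetical sketch of that write path follows; the names and structure
are illustrative, not the OSD's actual code. The acting set applies the
write, while async recovery targets receive only the corresponding log
entry, as backfill peers do, so their logs stay contiguous and they can
catch up fully once background recovery completes.

.. code-block:: cpp

   #include <cstdint>
   #include <iostream>
   #include <set>
   #include <string>
   #include <vector>

   using OsdId = int32_t;  // simplified stand-in for pg_shard_t

   struct LogEntry {
     uint64_t version;
     std::string oid;
   };

   struct Pg {
     std::set<OsdId> acting;                  // serves live client I/O
     std::set<OsdId> async_recovery_targets;  // catching up out-of-band
     std::vector<LogEntry> log;

     void submit_write(const std::string& oid) {
       LogEntry e{log.empty() ? 1 : log.back().version + 1, oid};
       log.push_back(e);
       // The write itself is applied by the acting set, so client I/O
       // proceeds without waiting on recovering peers...
       for (OsdId osd : acting)
         std::cout << "apply v" << e.version << " on osd." << osd << "\n";
       // ...while async recovery targets receive only the log entry,
       // keeping their logs complete for the eventual catch-up.
       for (OsdId osd : async_recovery_targets)
         std::cout << "log-only v" << e.version << " to osd." << osd << "\n";
     }
   };

   int main() {
     Pg pg{{0, 1}, {2}, {}};
     pg.submit_write("obj_a");
   }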