This document describes the requirements and high-level design of the primary
balancer for Ceph.

Introduction
============

In a distributed storage system such as Ceph, several balancing requirements must be met in order for the system to perform well:

#. Balance the capacity - This is a functional requirement: a system like Ceph is "as full as its fullest device". When one device is full, the system cannot serve write requests anymore. To avoid this, we want to balance the capacity across the devices fairly, so that each device gets capacity proportional to its size and all devices have the same fullness level. From a performance perspective, capacity balancing also creates a fair share of the *write* workload across the OSDs.

#. Balance the workload - This is a performance requirement: we want to make sure that all the devices receive a workload according to their performance. Assuming all the devices in a pool use the same technology and have the same bandwidth (a strong recommendation for a well-configured system), and all devices in a pool have the same capacity, this means that for each pool, each device gets its fair share of primary PGs so that the *read* requests are distributed evenly across the OSDs in the cluster (see the sketch after this list). Managing workload balancing for devices with different capacities is discussed in the future enhancements section.
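
The notion of "fair share" used above can be made concrete with a small calculation. The following sketch is illustrative only and is not part of Ceph's code (the function and variable names, such as ``fair_share_targets`` and ``osd_sizes``, are hypothetical): for a single replicated pool it computes each OSD's capacity target (PG mappings proportional to device size) and its workload target (the PGs mapped to the OSD divided by the replica size).

.. code-block:: python

   # Illustrative only: compute per-OSD fair-share targets for a single
   # replicated pool. The names are hypothetical and not Ceph code.

   def fair_share_targets(osd_sizes, pg_num, replica_size):
       """Return {osd_id: (capacity_target, primary_target)}.

       osd_sizes    : dict osd_id -> device size (e.g. in TiB)
       pg_num       : number of PGs in the pool
       replica_size : pool replica count
       """
       total_size = sum(osd_sizes.values())
       targets = {}
       for osd, size in osd_sizes.items():
           # Capacity target: PG mappings proportional to device size.
           capacity_target = pg_num * replica_size * size / total_size
           # Workload target: of the PGs mapped to this OSD, a
           # 1/replica_size fraction should have it as the primary.
           primary_target = capacity_target / replica_size
           targets[osd] = (capacity_target, primary_target)
       return targets

   # Four equally sized OSDs, pg_num=32, replica size 3: each OSD should
   # hold 24 PG mappings and be the primary for 8 of them.
   print(fair_share_targets({0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}, 32, 3))

With identically sized devices this reduces to an even split of the primaries across the OSDs, which is exactly the per-pool target stated in the Requirements section below.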

Requirements
============

- For each pool, each OSD should have its fair share of PGs in which it is primary. For replicated pools, this would be the number of PGs mapped to this OSD divided by the replica size.

  - This may be improved in future releases (see below)

- Refactor the existing capacity balancer code to improve its maintainability
- Primary balancing is performed without data movement (data is moved only when balancing the capacity)
- Fix the global +/-1 balancing issue that arises because the current balancer works on a single pool at a time (this is a stretch goal for the first version)

  - Problem description: In a perfectly balanced system, for each pool, each OSD has a number of PGs that would ideally map to it to create perfect capacity balancing. This number is usually not an integer, so some OSDs get slightly more PGs mapped and some slightly fewer. If you have many pools and you balance on a pool-by-pool basis, it is possible that some OSDs always end up on the "slightly more" side. When this happens, even to a single OSD, the result is an unbalanced system where one OSD is fuller than the others (see the sketch after this list). This may happen with the current capacity balancer.
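
To see how the +/-1 issue accumulates, consider the following sketch (illustrative only; the naive per-pool policy and all names are hypothetical, not the current balancer's code). Ten identical pools are balanced one at a time across three OSDs, and the leftover PGs of every pool land on the same OSDs:

.. code-block:: python

   # Illustrative only: how per-pool rounding can accumulate into a global
   # imbalance when every pool is balanced independently.

   def balance_pool_naively(pg_num, osds):
       """Spread pg_num PGs across the OSDs, always giving the leftover
       PGs to the first OSDs in the list (a deliberately naive policy)."""
       base, leftover = divmod(pg_num, len(osds))
       return {osd: base + (1 if i < leftover else 0)
               for i, osd in enumerate(osds)}

   osds = [0, 1, 2]
   totals = {osd: 0 for osd in osds}

   # Ten identical pools with 32 PGs each: 32 = 3 * 10 + 2, so every pool
   # has two "+1" PGs, and they always land on OSDs 0 and 1.
   for _ in range(10):
       for osd, count in balance_pool_naively(32, osds).items():
           totals[osd] += count

   print(totals)  # {0: 110, 1: 110, 2: 100} - OSD 2 holds ~9% fewer PGs

A global, cross-pool balancer would instead rotate which OSDs receive the extra PGs, keeping each OSD's total within +/-1 of the others.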

First release (Quincy) assumptions
----------------------------------

- Optional - In the first version the feature will be optional and disabled by default
- CLI only - In the first version we will probably give access to the primary balancer only via the ``osdmaptool`` CLI and will not enable it in the online balancer; this way, the use of the feature is more controlled for early adopters
- No data movement

Future possible enhancements
----------------------------

- Improve the behavior for non-identical OSDs in a pool
- Improve the capacity balancing behavior in extreme cases
- Add workload balancing to the online balancer
- A longer-term enhancement could be to base workload balancing on real load statistics of the OSDs.

High Level Design
=================

- The capacity balancing code will remain in one function, ``OSDMap::calc_pg_upmaps`` (the signature might be changed)
- The workload (a.k.a. primary) balancer will be implemented in a different function (see the sketch after this list)
- The workload balancer will do its best based on the current status of the system

  - When called on a balanced system (capacity-wise) with pools with identical devices, it will create a near-optimal workload split among the OSDs
  - Calling the workload balancer on an unbalanced system (capacity-wise) may yield non-optimal results, and in some cases may give worse performance than before the call
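
The reason primary balancing requires no data movement is that it only changes which member of a PG's existing acting set serves as the primary; the set of OSDs holding the data stays the same. The sketch below is a simplified illustration of that idea, not Ceph's actual implementation (``rebalance_primaries`` and its inputs are hypothetical):

.. code-block:: python

   # Illustrative only: rebalance primaries by choosing a different OSD
   # from each PG's existing acting set, so no data has to move. The
   # names are hypothetical; this is not Ceph's actual implementation.

   from collections import Counter

   def rebalance_primaries(acting_sets):
       """acting_sets: dict pg_id -> list of OSD ids, first entry being
       the current primary. Returns dict pg_id -> chosen primary."""
       primaries = {pg: osds[0] for pg, osds in acting_sets.items()}
       load = Counter(primaries.values())
       for pg, osds in acting_sets.items():
           current = primaries[pg]
           # Candidate: the least-loaded OSD already holding this PG.
           best = min(osds, key=lambda o: load[o])
           if load[current] - load[best] >= 2:  # only if imbalance shrinks
               load[current] -= 1
               load[best] += 1
               primaries[pg] = best
       return primaries

   # OSD 0 is the primary for every PG even though the data is spread out.
   acting = {1: [0, 1, 2], 2: [0, 2, 1], 3: [0, 1, 2]}
   print(rebalance_primaries(acting))  # e.g. {1: 1, 2: 2, 3: 0}

Because the acting sets are untouched, only the primary role moves between OSDs, and with it the *read* traffic.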

Helper functionality
--------------------

- Set a seed for random generation in ``osdmaptool`` (for regression tests; see the sketch below)
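
The seed matters for reproducibility: a balancer may break ties randomly, and a regression test can only compare its output against a stored expectation if those random choices repeat from run to run. A minimal sketch of the idea (illustrative only, not ``osdmaptool`` code):

.. code-block:: python

   # Illustrative only: with a fixed seed, random tie-breaking becomes
   # deterministic, so a regression test can compare the balancer's
   # output against a stored expectation.

   import random

   def break_ties(candidates, seed=None):
       rng = random.Random(seed)  # seeded -> reproducible choices
       return [rng.choice(group) for group in candidates]

   ties = [[0, 1], [1, 2], [0, 2]]
   assert break_ties(ties, seed=42) == break_ties(ties, seed=42)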