Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | |
2 | =================================== | |
3 | Configuring directory fragmentation | |
4 | =================================== | |
5 | ||
6 | In CephFS, directories are *fragmented* when they become very large | |
7 | or very busy. This splits up the metadata so that it can be shared | |
8 | between multiple MDS daemons, and between multiple objects in the | |
9 | metadata pool. | |
10 | ||
11 | In normal operation, directory fragmentation is invisible to | |
12 | users and administrators, and all the configuration settings mentioned | |
13 | here should be left at their default values. | |
14 | ||
15 | While directory fragmentation enables CephFS to handle very large | |
16 | numbers of entries in a single directory, application programmers should | |
17 | remain conservative about creating very large directories, as they still | |
18 | have a resource cost in situations such as a CephFS client listing | |
19 | the directory, where all the fragments must be loaded at once. | |
20 | ||
21 | All directories are initially created as a single fragment. This fragment | |
22 | may be *split* to divide up the directory into more fragments, and these | |
23 | fragments may be *merged* to reduce the number of fragments in the directory. | |
24 | ||
25 | Splitting and merging | |
26 | ===================== | |
27 | ||
28 | An MDS will only consider doing splits and merges if the ``mds_bal_frag`` | |
29 | setting is true in the MDS's configuration file, and the ``allow_dirfrags`` | |
30 | setting is true in the filesystem map (set on the mons). These settings | |
31 | are both true by default since the *Luminous* (12.2.x) release of Ceph. | |
32 | ||
33 | When an MDS identifies a directory fragment to be split, it does not | |
34 | do the split immediately. Because splitting interrupts metadata IO, | |
35 | a short delay is used to allow short bursts of client IO to complete | |
36 | before the split begins. This delay is configured with | |
37 | ``mds_bal_fragment_interval``, which defaults to 5 seconds. | |
38 | ||
39 | When the split is done, the directory fragment is broken up into | |
40 | a power-of-two number of new fragments. The number of new | |
41 | fragments is two to the power of ``mds_bal_split_bits``; for example, | |
42 | if ``mds_bal_split_bits`` is 2, then four new fragments will be | |
43 | created. The default setting is 3, so each split creates 8 new fragments. | |
44 | ||
45 | The criteria for initiating a split or a merge are described in the | |
46 | following sections. | |
47 | ||
48 | Size thresholds | |
49 | =============== | |
50 | ||
51 | A directory fragment is eligible for splitting when its size exceeds | |
52 | ``mds_bal_split_size`` (default 10000). Ordinarily this split is | |
53 | delayed by ``mds_bal_fragment_interval``, but if the fragment size | |
54 | exceeds the split size by a factor of ``mds_bal_fragment_fast_factor``, | |
55 | the split will happen immediately (holding up any client metadata | |
56 | IO on the directory). | |
57 | ||
58 | ``mds_bal_fragment_size_max`` is the hard limit on the size of | |
59 | directory fragments. If it is reached, clients will receive | |
60 | ENOSPC errors if they try to create files in the fragment. On | |
61 | a properly configured system, this limit should never be reached on | |
62 | ordinary directories, as they will have split long before. By default, | |
63 | this is set to 10 times the split size, giving a dirfrag size limit of | |
64 | 100000. Increasing this limit may lead to oversized directory fragment | |
65 | objects in the metadata pool, which the OSDs may not be able to handle. | |
66 | ||
67 | A directory fragment is eligible for merging when its size is less | |
68 | than ``mds_bal_merge_size``. There is no merge equivalent of the | |
69 | "fast splitting" explained above: fast splitting exists to avoid | |
70 | creating oversized directory fragments, and there is no equivalent issue | |
71 | to avoid when merging. The default merge size is 50. | |
72 | ||
73 | Activity thresholds | |
74 | =================== | |
75 | ||
76 | In addition to splitting fragments based | |
77 | on their size, the MDS may split directory fragments if their | |
78 | activity exceeds a threshold. | |
79 | ||
80 | The MDS maintains separate time-decaying load counters for read and write | |
81 | operations on directory fragments. The decaying load counters have an | |
82 | exponential decay based on the ``mds_decay_halflife`` setting. | |
83 | ||
84 | On writes, the write counter is | |
85 | incremented, and compared with ``mds_bal_split_wr``, triggering a | |
86 | split if the threshold is exceeded. Write operations include metadata IO | |
87 | such as renames, unlinks and creations. | |
88 | ||
89 | The ``mds_bal_split_rd`` threshold is applied based on the read operation | |
90 | load counter, which tracks readdir operations. | |
91 | ||
92 | By default, the read threshold is 25000 and the write threshold is | |
93 | 10000, i.e. 2.5x as many reads as writes would be required to trigger | |
94 | a split. | |
95 | ||
96 | After fragments are split due to the activity thresholds, they are only | |
97 | merged based on the size threshold (``mds_bal_merge_size``), so | |
98 | a spike in activity may cause a directory to stay fragmented | |
99 | forever unless some entries are unlinked. | |
100 |
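
The size thresholds and the decaying load counters described above can be sketched in a few lines of Python. This is an illustration of the rules as stated in this document, not the MDS implementation. The names ``FragConfig``, ``new_fragment_count``, ``classify`` and ``DecayCounter`` and the returned labels are hypothetical. The defaults for split size, split bits, merge size and maximum fragment size are the ones quoted above. The fast factor of 1.5 and the half-life passed in the usage example are assumed values, since this document does not state them. The decay formula is one common form of exponential decay and may differ from the arithmetic the MDS actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragConfig:
    # Defaults quoted in this document.
    split_size: int = 10000          # mds_bal_split_size
    split_bits: int = 3              # mds_bal_split_bits
    merge_size: int = 50             # mds_bal_merge_size
    fragment_size_max: int = 100000  # mds_bal_fragment_size_max
    # ASSUMPTION: this document does not give the default fast factor.
    fast_factor: float = 1.5         # mds_bal_fragment_fast_factor


def new_fragment_count(cfg: FragConfig) -> int:
    """Number of fragments produced by one split: 2 ** split_bits."""
    return 2 ** cfg.split_bits


def classify(size: int, cfg: FragConfig) -> str:
    """Return a hypothetical label for a fragment of the given size."""
    if size >= cfg.fragment_size_max:
        return "full"            # creates in this fragment would get ENOSPC
    if size > cfg.split_size * cfg.fast_factor:
        return "split_now"       # fast split, no delay
    if size > cfg.split_size:
        return "split_delayed"   # split after mds_bal_fragment_interval
    if size < cfg.merge_size:
        return "merge"           # eligible for merging
    return "ok"


class DecayCounter:
    """Load counter whose value halves every `halflife` seconds."""

    def __init__(self, halflife: float) -> None:
        self.halflife = halflife
        self.value = 0.0
        self.last = 0.0

    def hit(self, now: float, amount: float = 1.0) -> float:
        # Decay the stored value for the elapsed time, then add the new load.
        elapsed = now - self.last
        self.value *= 0.5 ** (elapsed / self.halflife)
        self.last = now
        self.value += amount
        return self.value


if __name__ == "__main__":
    cfg = FragConfig()
    print(new_fragment_count(cfg))
    for size in (10, 5000, 12000, 20000, 100000):
        print(size, classify(size, cfg))
    writes = DecayCounter(halflife=5.0)  # ASSUMPTION: illustrative half-life
    writes.hit(0.0, 8.0)
    print(writes.hit(5.0, 0.0))
```

In this sketch a caller would compare the value returned by ``DecayCounter.hit`` with ``mds_bal_split_wr`` or ``mds_bal_split_rd`` to decide whether an activity-based split is due.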