Commit | Line | Data |
---|---|---|
7c673cae FG |
1 | |
2 | =================================== | |
3 | Configuring directory fragmentation | |
4 | =================================== | |
5 | ||
6 | In CephFS, directories are *fragmented* when they become very large | |
7 | or very busy. This splits up the metadata so that it can be shared | |
8 | between multiple MDS daemons, and between multiple objects in the | |
9 | metadata pool. | |
10 | ||
11 | In normal operation, directory fragmentation is invisible to | |
12 | users and administrators, and all the configuration settings mentioned | |
13 | here should be left at their default values. | |
14 | ||
15 | While directory fragmentation enables CephFS to handle very large | |
16 | numbers of entries in a single directory, application programmers should | |
17 | remain conservative about creating very large directories, as they still | |
18 | have a resource cost in situations such as a CephFS client listing | |
19 | the directory, where all the fragments must be loaded at once. | |
20 | ||
21 | All directories are initially created as a single fragment. This fragment | |
22 | may be *split* to divide up the directory into more fragments, and these | |
23 | fragments may be *merged* to reduce the number of fragments in the directory. | |
24 | ||
25 | Splitting and merging | |
26 | ===================== | |
27 | ||
f64942e4 AA |
28 | An MDS will only consider doing splits if the ``allow_dirfrags`` setting is true in |
29 | the file system map (set on the mons). This setting is true by default since | |
30 | the *Luminous* release (12.2.X). | |
7c673cae FG |
31 | |
32 | When an MDS identifies a directory fragment to be split, it does not | |
33 | do the split immediately. Because splitting interrupts metadata IO, | |
34 | a short delay is used to allow short bursts of client IO to complete | |
35 | before the split begins. This delay is configured with | |
36 | ``mds_bal_fragment_interval``, which defaults to 5 seconds. | |
37 | ||
38 | When the split is done, the directory fragment is broken up into | |
39 | a power-of-two number of new fragments. The number of new | |
40 | fragments is two to the power of ``mds_bal_split_bits``; for example, | |
41 | if ``mds_bal_split_bits`` is 2, then four new fragments will be | |
42 | created. The default setting is 3, so each split creates 8 new fragments. | |
43 | ||
44 | The criteria for initiating a split or a merge are described in the | |
45 | following sections. | |
46 | ||
47 | Size thresholds | |
48 | =============== | |
49 | ||
50 | A directory fragment is eligible for splitting when its size exceeds | |
51 | ``mds_bal_split_size`` (default 10000). Ordinarily this split is | |
52 | delayed by ``mds_bal_fragment_interval``, but if the fragment size | |
53 | exceeds the split size multiplied by ``mds_bal_fragment_fast_factor``, | |
54 | the split will happen immediately (holding up any client metadata | |
55 | IO on the directory). | |
56 | ||
57 | ``mds_bal_fragment_size_max`` is the hard limit on the size of | |
58 | directory fragments. If it is reached, clients will receive | |
59 | ENOSPC errors if they try to create files in the fragment. On | |
60 | a properly configured system, this limit should never be reached on | |
61 | ordinary directories, as they will have split long before. By default, | |
62 | this is set to 10 times the split size, giving a dirfrag size limit of | |
63 | 100000. Increasing this limit may lead to oversized directory fragment | |
64 | objects in the metadata pool, which the OSDs may not be able to handle. | |
65 | ||
66 | A directory fragment is eligible for merging when its size is less | |
67 | than ``mds_bal_merge_size``. There is no merge equivalent of the | |
68 | "fast splitting" explained above: fast splitting exists to avoid | |
69 | creating oversized directory fragments, and there is no equivalent issue | |
70 | to avoid when merging. The default merge size is 50. | |
71 | ||
72 | Activity thresholds | |
73 | =================== | |
74 | ||
75 | In addition to splitting fragments based | |
76 | on their size, the MDS may split directory fragments if their | |
77 | activity exceeds a threshold. | |
78 | ||
79 | The MDS maintains separate time-decaying load counters for read and write | |
80 | operations on directory fragments. The decaying load counters have an | |
81 | exponential decay based on the ``mds_decay_halflife`` setting. | |
82 | ||
83 | On writes, the write counter is | |
84 | incremented, and compared with ``mds_bal_split_wr``, triggering a | |
85 | split if the threshold is exceeded. Write operations include metadata IO | |
86 | such as renames, unlinks and creations. | |
87 | ||
88 | The ``mds_bal_split_rd`` threshold is applied based on the read operation | |
89 | load counter, which tracks readdir operations. | |
90 | ||
91 | By default, the read threshold is 25000 and the write threshold is | |
92 | 10000, i.e. 2.5 times as many reads as writes would be required to trigger | |
93 | a split. | |
94 | ||
95 | After fragments are split due to the activity thresholds, they are only | |
96 | merged based on the size threshold (``mds_bal_merge_size``), so | |
97 | a spike in activity may cause a directory to stay fragmented | |
98 | forever unless some entries are unlinked. | |
99 |
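
The size and activity rules described in this document can be summarised in a short Python sketch. This is only an illustrative model, not code from the MDS: the names ``FragConfig``, ``fragments_per_split``, ``classify_by_size``, ``create_allowed`` and ``DecayCounter`` are hypothetical, and the ``fast_factor`` default of 1.5 is an assumption because the text above does not state it. The other defaults are the ones quoted in the text. The decay counter takes its half-life as an explicit argument, standing in for ``mds_decay_halflife``.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragConfig:
    """Hypothetical container for the settings discussed above (not a Ceph API)."""
    split_size: int = 10000    # mds_bal_split_size
    split_bits: int = 3        # mds_bal_split_bits
    merge_size: int = 50       # mds_bal_merge_size
    size_max: int = 100000     # mds_bal_fragment_size_max
    fast_factor: float = 1.5   # mds_bal_fragment_fast_factor (ASSUMED default)


def fragments_per_split(cfg):
    """A split produces 2 ** split_bits new fragments."""
    return 2 ** cfg.split_bits


def classify_by_size(entries, cfg):
    """Return the size-based action for a fragment holding `entries` entries."""
    if entries > cfg.split_size * cfg.fast_factor:
        return "split-immediately"      # fast split, no delay
    if entries > cfg.split_size:
        return "split-after-delay"      # waits mds_bal_fragment_interval
    if entries < cfg.merge_size:
        return "merge-candidate"
    return "no-action"


def create_allowed(entries, cfg):
    """Creates fail (ENOSPC) once the fragment has reached the hard limit."""
    return entries < cfg.size_max


class DecayCounter:
    """Simplified load counter whose value halves every `halflife` seconds."""

    def __init__(self, halflife):
        self.halflife = halflife
        self.value = 0.0
        self.last = 0.0  # time of the last update, in seconds

    def _decay(self, now):
        elapsed = now - self.last
        if elapsed > 0:
            self.value *= 0.5 ** (elapsed / self.halflife)
            self.last = now

    def hit(self, now, amount=1.0):
        """Decay to `now`, add `amount`, and return the new value."""
        self._decay(now)
        self.value += amount
        return self.value

    def get(self, now):
        """Return the decayed value at time `now`."""
        self._decay(now)
        return self.value
```

In this model, an activity-based split would correspond to comparing the value returned by ``hit`` against ``mds_bal_split_wr`` or ``mds_bal_split_rd``, while merging depends only on ``classify_by_size``. That is why a fragment split by a burst of activity stays split until its entry count drops below the merge size.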