===================================
Configuring Directory fragmentation
===================================

In CephFS, directories are *fragmented* when they become very large
or very busy. This splits up the metadata so that it can be shared
between multiple MDS daemons, and between multiple objects in the
metadata pool.

In normal operation, directory fragmentation is invisible to
users and administrators, and all the configuration settings mentioned
here should be left at their default values.

While directory fragmentation enables CephFS to handle very large
numbers of entries in a single directory, application programmers should
remain conservative about creating very large directories, as they still
have a resource cost in situations such as a CephFS client listing
the directory, where all the fragments must be loaded at once.

All directories are initially created as a single fragment. This fragment
may be *split* to divide up the directory into more fragments, and these
fragments may be *merged* to reduce the number of fragments in the directory.

Splitting and merging
=====================

An MDS will only consider doing splits if the ``allow_dirfrags`` setting is
true in the file system map (set on the mons). This setting has been true by
default since the *Luminous* release (12.2.X).

When an MDS identifies a directory fragment to be split, it does not
do the split immediately. Because splitting interrupts metadata IO,
a short delay is used to allow short bursts of client IO to complete
before the split begins. This delay is configured with
``mds_bal_fragment_interval``, which defaults to 5 seconds.

When the split is done, the directory fragment is broken up into
a power-of-two number of new fragments. The number of new
fragments is two to the power of ``mds_bal_split_bits``: for example,
if ``mds_bal_split_bits`` is 2, four new fragments are created.
The default setting is 3, so each split creates 8 new fragments.

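The arithmetic above can be illustrated with a short Python sketch. The
function name ``fragments_per_split`` is hypothetical and not part of Ceph;
it only reproduces the power-of-two relationship described in the text.

```python
def fragments_per_split(split_bits: int) -> int:
    """Return how many new fragments one split creates: 2 ** split_bits."""
    # Assumption for this sketch: fewer than 1 bit would not be a split.
    if split_bits < 1:
        raise ValueError("split_bits must be at least 1")
    return 1 << split_bits


# Values taken from the text above.
print(fragments_per_split(2))  # 4
print(fragments_per_split(3))  # 8, the default mds_bal_split_bits
```
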
The criteria for initiating a split or a merge are described in the
following sections.

Size thresholds
===============

A directory fragment is eligible for splitting when its size exceeds
``mds_bal_split_size`` (default 10000). Ordinarily this split is
delayed by ``mds_bal_fragment_interval``, but if the fragment size
exceeds the split size by a factor of ``mds_bal_fragment_fast_factor``,
the split will happen immediately (holding up any client metadata
IO on the directory).

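As a rough illustration of the rule above, the following Python sketch
classifies a fragment by size. The function ``split_decision`` and its return
strings are hypothetical and do not mirror the MDS source code. The value 1.5
passed as ``fast_factor`` is an illustrative assumption, not a value taken from
this document, and "exceeds" is read here as a strict comparison.

```python
def split_decision(frag_size: int, split_size: int, fast_factor: float) -> str:
    """Classify a fragment as 'none', 'delayed' or 'immediate' (sketch only)."""
    if frag_size > split_size * fast_factor:
        # Far over the split size: split now, holding up client metadata IO.
        return "immediate"
    if frag_size > split_size:
        # Over the split size: split after mds_bal_fragment_interval.
        return "delayed"
    return "none"


# split_size=10000 is the documented default; fast_factor=1.5 is illustrative.
for size in (5000, 12000, 20000):
    print(size, split_decision(size, split_size=10000, fast_factor=1.5))
```
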
``mds_bal_fragment_size_max`` is the hard limit on the size of
directory fragments. If it is reached, clients will receive
ENOSPC errors if they try to create files in the fragment. On
a properly configured system, this limit should never be reached on
ordinary directories, as they will have split long before. By default,
this is set to 10 times the split size, giving a dirfrag size limit of
100000. Increasing this limit may lead to oversized directory fragment
objects in the metadata pool, which the OSDs may not be able to handle.

A directory fragment is eligible for merging when its size is less
than ``mds_bal_merge_size``. There is no merge equivalent of the
"fast splitting" explained above: fast splitting exists to avoid
creating oversized directory fragments, and there is no equivalent issue
to avoid when merging. The default merge size is 50.

Activity thresholds
===================

In addition to splitting fragments based
on their size, the MDS may split directory fragments if their
activity exceeds a threshold.

The MDS maintains separate time-decaying load counters for read and write
operations on directory fragments. The decaying load counters have an
exponential decay based on the ``mds_decay_halflife`` setting.

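To make the idea of an exponentially decaying load counter concrete, here is
a simplified Python model. It is a sketch of the general technique, not the
MDS implementation: the class ``DecayingCounter``, its methods, and the
explicit ``now`` timestamps are all hypothetical, and the half-life of 5
seconds used in the example is an illustrative assumption.

```python
class DecayingCounter:
    """Toy counter whose value halves every `halflife` seconds."""

    def __init__(self, halflife: float) -> None:
        self.halflife = halflife
        self.value = 0.0
        self.last_update = 0.0

    def _decay(self, now: float) -> None:
        # Apply exponential decay for the time elapsed since the last update.
        elapsed = now - self.last_update
        if elapsed > 0:
            self.value *= 0.5 ** (elapsed / self.halflife)
            self.last_update = now

    def hit(self, now: float, amount: float = 1.0) -> float:
        """Record `amount` operations at time `now` and return the new load."""
        self._decay(now)
        self.value += amount
        return self.value

    def get(self, now: float) -> float:
        """Return the decayed load at time `now`."""
        self._decay(now)
        return self.value


counter = DecayingCounter(halflife=5.0)  # illustrative half-life
counter.hit(now=0.0, amount=100.0)
print(counter.get(now=5.0))   # one half-life later: 50.0
print(counter.get(now=10.0))  # two half-lives later: 25.0
```

Passing timestamps explicitly keeps the sketch deterministic; a real counter
would read a clock instead.
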
On writes, the write counter is
incremented, and compared with ``mds_bal_split_wr``, triggering a
split if the threshold is exceeded. Write operations include metadata IO
such as renames, unlinks and creations.

The ``mds_bal_split_rd`` threshold is applied based on the read operation
load counter, which tracks readdir operations.

By default, the read threshold is 25000 and the write threshold is
10000, i.e. 2.5x as many reads as writes would be required to trigger
a split.

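The threshold comparison can be sketched in Python as follows. The function
``activity_split_needed`` is hypothetical; the two constants are the default
thresholds quoted above, and the load values passed in stand for the current
readings of the decaying read and write counters.

```python
SPLIT_RD_DEFAULT = 25000  # default mds_bal_split_rd, from the text above
SPLIT_WR_DEFAULT = 10000  # default mds_bal_split_wr, from the text above


def activity_split_needed(read_load: float, write_load: float,
                          split_rd: float = SPLIT_RD_DEFAULT,
                          split_wr: float = SPLIT_WR_DEFAULT) -> bool:
    """Return True if either load counter exceeds its threshold (sketch)."""
    return read_load > split_rd or write_load > split_wr


print(SPLIT_RD_DEFAULT / SPLIT_WR_DEFAULT)         # 2.5
print(activity_split_needed(20000.0, 5000.0))      # False
print(activity_split_needed(20000.0, 12000.0))     # True
```
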
After fragments are split due to the activity thresholds, they are only
merged based on the size threshold (``mds_bal_merge_size``), so
a spike in activity may cause a directory to stay fragmented
indefinitely unless some entries are unlinked.