===================================
Configuring directory fragmentation
===================================

In CephFS, directories are *fragmented* when they become very large
or very busy.  This splits up the metadata so that it can be shared
between multiple MDS daemons, and between multiple objects in the
metadata pool.

In normal operation, directory fragmentation is invisible to
users and administrators, and all the configuration settings mentioned
here should be left at their default values.

While directory fragmentation enables CephFS to handle very large
numbers of entries in a single directory, application programmers should
remain conservative about creating very large directories, as they still
have a resource cost in situations such as a CephFS client listing
the directory, where all the fragments must be loaded at once.

All directories are initially created as a single fragment.  This fragment
may be *split* to divide up the directory into more fragments, and these
fragments may be *merged* to reduce the number of fragments in the directory.

Splitting and merging
=====================

An MDS will only consider doing splits if the ``allow_dirfrags`` setting is
true in the file system map (set on the mons).  This setting has been true by
default since the *Luminous* release (12.2.X).

When an MDS identifies a directory fragment to be split, it does not
do the split immediately.  Because splitting interrupts metadata IO,
a short delay is used to allow short bursts of client IO to complete
before the split begins.  This delay is configured with
``mds_bal_fragment_interval``, which defaults to 5 seconds.

When the split is done, the directory fragment is broken up into
a power of two number of new fragments.  The number of new
fragments is given by two to the power ``mds_bal_split_bits``, i.e.
if ``mds_bal_split_bits`` is 2, then four new fragments will be
created.  The default setting is 3, i.e. splits create 8 new fragments.
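
The relationship between ``mds_bal_split_bits`` and the number of new
fragments can be sketched in a few lines of Python.  This only illustrates
the arithmetic described above; ``fragments_after_split`` is a hypothetical
helper and not part of any Ceph API.

.. code-block:: python

    def fragments_after_split(split_bits: int) -> int:
        """Return the number of new fragments one split creates: 2 ** split_bits."""
        return 1 << split_bits

    print(fragments_after_split(2))  # 4
    print(fragments_after_split(3))  # 8 (the default setting)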

The criteria for initiating a split or a merge are described in the
following sections.

Size thresholds
===============

A directory fragment is eligible for splitting when its size exceeds
``mds_bal_split_size`` (default 10000).  Ordinarily this split is
delayed by ``mds_bal_fragment_interval``, but if the fragment size
exceeds the split size multiplied by ``mds_bal_fragment_fast_factor``,
the split happens immediately (holding up any client metadata
IO on the directory).
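
The size-based decision can be sketched as follows.  This is a simplified
model of the rule described above, not the MDS implementation:
``split_action`` is a hypothetical helper, and the ``fast_factor`` value of
1.5 is an assumption chosen for illustration, since its default is not stated
in this document.

.. code-block:: python

    def split_action(frag_size: int,
                     split_size: int = 10000,    # mds_bal_split_size default
                     fast_factor: float = 1.5):  # assumed value, see lead-in
        """Classify a fragment by size: 'none', 'delayed' or 'immediate'."""
        if frag_size > split_size * fast_factor:
            # Far over the split size: split without waiting.
            return "immediate"
        if frag_size > split_size:
            # Over the split size: split after mds_bal_fragment_interval.
            return "delayed"
        return "none"

    print(split_action(9000))   # none
    print(split_action(12000))  # delayed
    print(split_action(20000))  # immediate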

``mds_bal_fragment_size_max`` is the hard limit on the size of
directory fragments.  Once it is reached, clients that try to create
files in the fragment receive ``ENOSPC`` errors.  On
a properly configured system, this limit should never be reached on
ordinary directories, as they will have split long before.  By default,
this is set to 10 times the split size, giving a dirfrag size limit of
100000.  Increasing this limit may lead to oversized directory fragment
objects in the metadata pool, which the OSDs may not be able to handle.

A directory fragment is eligible for merging when its size is less
than ``mds_bal_merge_size`` (default 50).  There is no merge equivalent
of the "fast splitting" explained above: fast splitting exists to avoid
creating oversized directory fragments, and there is no equivalent issue
to avoid when merging.
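
As a worked example of how the default split and merge sizes relate, the
sketch below assumes that entries are spread evenly across the new fragments
after a split.  That even spread is an assumption made for illustration, and
``eligible_for_merge`` is a hypothetical helper rather than a Ceph API.

.. code-block:: python

    SPLIT_SIZE = 10000  # mds_bal_split_size default
    SPLIT_BITS = 3      # mds_bal_split_bits default
    MERGE_SIZE = 50     # mds_bal_merge_size default

    def eligible_for_merge(frag_size: int, merge_size: int = MERGE_SIZE) -> bool:
        """A fragment is eligible for merging when it is below the merge size."""
        return frag_size < merge_size

    # Entries per fragment right after a split at the threshold,
    # assuming an even spread across the 2 ** SPLIT_BITS new fragments.
    per_fragment = SPLIT_SIZE // (1 << SPLIT_BITS)

    print(per_fragment)                      # 1250
    print(eligible_for_merge(per_fragment))  # False
    print(eligible_for_merge(49))            # True

Under these assumptions a freshly split fragment is far above the merge size,
so a split is not immediately followed by a merge.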

Activity thresholds
===================

In addition to splitting fragments based
on their size, the MDS may split directory fragments if their
activity exceeds a threshold.

The MDS maintains separate time-decaying load counters for read and write
operations on directory fragments.  The decaying load counters have an
exponential decay based on the ``mds_decay_halflife`` setting.
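
A time-decaying counter with a half-life can be sketched as below.  This is a
minimal model of exponential decay in general, not the counter used by the
MDS: the ``DecayCounter`` class here is hypothetical, timestamps are passed
in explicitly to keep the example deterministic, and the half-life of 5
seconds is an illustrative value.

.. code-block:: python

    class DecayCounter:
        """Counter whose value halves every `halflife` seconds."""

        def __init__(self, halflife: float):
            self.halflife = halflife
            self.value = 0.0
            self.last = 0.0  # time of the last update, in seconds

        def _decay(self, now: float) -> None:
            elapsed = now - self.last
            if elapsed > 0:
                self.value *= 0.5 ** (elapsed / self.halflife)
                self.last = now

        def hit(self, now: float, amount: float = 1.0) -> float:
            """Decay to `now`, then add `amount` for a new operation."""
            self._decay(now)
            self.value += amount
            return self.value

        def get(self, now: float) -> float:
            """Return the decayed value at time `now`."""
            self._decay(now)
            return self.value

    counter = DecayCounter(halflife=5.0)
    counter.hit(0.0, 100.0)
    print(counter.get(5.0))   # 50.0
    print(counter.get(10.0))  # 25.0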

On writes, the write counter is
incremented, and compared with ``mds_bal_split_wr``, triggering a 
split if the threshold is exceeded.  Write operations include metadata IO
such as renames, unlinks and creations. 

The ``mds_bal_split_rd`` threshold is applied based on the read operation
load counter, which tracks readdir operations.

By default, the read threshold is 25000 and the write threshold is
10000, i.e. 2.5x as many reads as writes would be required to trigger
a split.
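
The activity check can be sketched as a comparison of the two load counters
against their thresholds.  This is a simplified illustration, and
``should_split_for_activity`` is a hypothetical helper, not an MDS function.

.. code-block:: python

    def should_split_for_activity(read_load: float,
                                  write_load: float,
                                  split_rd: float = 25000,   # mds_bal_split_rd default
                                  split_wr: float = 10000):  # mds_bal_split_wr default
        """Return True if either decayed load counter exceeds its threshold."""
        return read_load > split_rd or write_load > split_wr

    print(should_split_for_activity(read_load=20000, write_load=500))    # False
    print(should_split_for_activity(read_load=26000, write_load=500))    # True
    print(should_split_for_activity(read_load=1000, write_load=12000))   # True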

After fragments are split due to the activity thresholds, they are only
merged based on the size threshold (``mds_bal_merge_size``), so 
a spike in activity may cause a directory to stay fragmented
forever unless some entries are unlinked.