==================================
CephFS Dynamic Metadata Management
==================================
Metadata operations usually make up more than 50 percent of all
file system operations. Metadata also scales in a more complex
fashion than storage capacity (which scales I/O throughput linearly),
because file system metadata is hierarchical and interdependent.
CephFS therefore decouples the metadata workload from the data
workload to avoid placing unnecessary strain on the RADOS cluster:
metadata is handled by a cluster of Metadata Servers (MDSs), and
CephFS distributes it across those MDSs via
`Dynamic Subtree Partitioning <https://ceph.com/wp-content/uploads/2016/08/weil-mds-sc04.pdf>`__.

Dynamic Subtree Partitioning
----------------------------
In traditional subtree partitioning, subtrees of the file system
hierarchy are statically assigned to individual MDSs. This metadata
distribution strategy provides good hierarchical locality, linear
growth of the cache, horizontal scaling across MDSs, and a fairly
even distribution of metadata across MDSs.

.. image:: subtree-partitioning.svg

The problem with traditional subtree partitioning is that workload
growth by depth (within a single MDS) leads to a hotspot of activity.
This results in a lack of vertical scaling and leaves non-busy MDSs
and their resources underused.

This led to the adoption of a more dynamic way of handling metadata:
Dynamic Subtree Partitioning, in which load-intensive portions of the
directory hierarchy are migrated from busy MDSs to non-busy MDSs.

This strategy ensures that activity hotspots are relieved as they
appear, and so provides vertical scaling of the metadata workload in
addition to horizontal scaling.

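The migration decision itself can be thought of as a load-balancing
problem. The following is a minimal, illustrative sketch (not the
actual MDS balancer code; every name and number in it is invented) of
the kind of decision involved: an MDS that finds itself busier than
the cluster average picks its hottest subtree and a less loaded rank
to export it to.

.. code-block:: cpp

    // Illustrative only: a toy version of the dynamic subtree
    // partitioning decision, not the Ceph MDS balancer.
    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Subtree {
      std::string root_path;  // e.g. "/home/alice"
      double load;            // recent metadata-op popularity of this subtree
    };

    int main() {
      std::map<int, double> mds_load = {{0, 9.0}, {1, 2.0}, {2, 3.0}};  // rank -> load
      int my_rank = 0;
      std::vector<Subtree> my_subtrees = {{"/home", 6.0}, {"/build", 3.0}};

      // Compare our load against the cluster mean.
      double mean = 0;
      for (const auto& kv : mds_load) mean += kv.second;
      mean /= mds_load.size();

      if (mds_load[my_rank] > mean) {
        // Pick the hottest local subtree and the least loaded peer rank.
        auto hot = std::max_element(
            my_subtrees.begin(), my_subtrees.end(),
            [](const Subtree& a, const Subtree& b) { return a.load < b.load; });
        auto target = std::min_element(
            mds_load.begin(), mds_load.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        std::cout << "export " << hot->root_path << " from mds." << my_rank
                  << " to mds." << target->first << "\n";
      }
    }
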
Export Process During Subtree Migration
---------------------------------------

Once the exporter verifies that the subtree may be exported (the
cluster is not degraded and the subtree root is not frozen), the
subtree root directory is temporarily auth pinned, the subtree freeze
is initiated, and the exporter is committed to the subtree migration,
barring an intervening failure of the importer or of itself.

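A minimal sketch of this precondition check and freeze initiation,
using invented names (``Cluster``, ``SubtreeRoot``, ``start_export``)
rather than the actual Migrator interface:

.. code-block:: cpp

    #include <iostream>

    struct SubtreeRoot {
      bool frozen = false;       // already frozen (e.g. by another migration)?
      bool auth_pinned = false;  // pinned so it cannot be trimmed from cache
    };

    struct Cluster {
      bool degraded = false;     // is any MDS rank failed or recovering?
    };

    // Verify the subtree may be exported, then auth pin the root and
    // start freezing the subtree; from here on the exporter is committed.
    bool start_export(const Cluster& cluster, SubtreeRoot& root) {
      if (cluster.degraded || root.frozen)
        return false;            // not permissible; abort before committing
      root.auth_pinned = true;   // temporary auth pin on the subtree root
      root.frozen = true;        // begin the subtree freeze
      return true;
    }

    int main() {
      Cluster cluster;
      SubtreeRoot root;
      std::cout << (start_export(cluster, root) ? "export started"
                                                : "export aborted") << "\n";
    }
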
The MExportDiscover message is exchanged to ensure that the inode for
the base directory being exported is open on the destination node. It
is auth pinned by the importer to prevent it from being trimmed. This
occurs before the exporter completes the freeze of the subtree, to
ensure that the importer is able to replicate the necessary metadata.
When the exporter receives the MExportDiscoverAck, it allows the
freeze to proceed by removing its temporary auth pin.

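The discover round trip might be sketched as follows; the structures
and handler names below are illustrative placeholders, not the real
message classes:

.. code-block:: cpp

    #include <iostream>

    struct Importer {
      bool base_inode_open = false;
      bool base_inode_pinned = false;

      // Handle MExportDiscover: open the base inode and auth pin it so it
      // cannot be trimmed while the migration is in flight, then ack.
      bool handle_discover() {
        base_inode_open = true;
        base_inode_pinned = true;
        return true;                   // reply with MExportDiscoverAck
      }
    };

    struct Exporter {
      bool temp_auth_pin = true;       // taken when the export started

      // On MExportDiscoverAck, drop the temporary pin so the freeze of the
      // subtree can complete.
      void handle_discover_ack() { temp_auth_pin = false; }
    };

    int main() {
      Importer importer;
      Exporter exporter;
      if (importer.handle_discover())  // discover -> ack round trip
        exporter.handle_discover_ack();
      std::cout << "freeze may proceed: " << std::boolalpha
                << !exporter.temp_auth_pin << "\n";
    }
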
A warning stage occurs only if the base subtree directory is open by
nodes other than the importer and exporter. If it is not, then this
implies that no metadata within or nested beneath the subtree is
replicated by any node other than the importer and exporter. If it is,
then an MExportWarning message informs any bystanders that the
authority for the region is temporarily ambiguous, and lists both the
exporter and importer as authoritative MDS nodes. In particular,
bystanders who are trimming items from their cache must send
MCacheExpire messages to both the old and new authorities. This is
necessary to ensure that the surviving authority reliably receives all
expirations even if the importer or exporter fails. While the subtree
is frozen (on both the importer and exporter), expirations will not be
immediately processed; instead, they will be queued until the region
is unfrozen and it can be determined that the node is or is not
authoritative.

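A small sketch of the bystander's routing rule for expirations while
authority is ambiguous (the types and fields are invented for
illustration):

.. code-block:: cpp

    #include <iostream>
    #include <vector>

    struct Bystander {
      bool authority_ambiguous = false;  // set on MExportWarning, cleared later
      int exporter_rank = 0;
      int importer_rank = 1;

      // When trimming an item under the migrating subtree: while authority
      // is ambiguous, expirations must reach both possible authorities so
      // that the survivor sees them even if the other MDS fails.
      std::vector<int> expire_targets() const {
        if (authority_ambiguous)
          return {exporter_rank, importer_rank};
        return {importer_rank};          // after MExportNotify: new authority only
      }
    };

    int main() {
      Bystander b;
      b.authority_ambiguous = true;      // MExportWarning received
      for (int rank : b.expire_targets())
        std::cout << "send MCacheExpire to mds." << rank << "\n";
    }
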
The exporter then packages an MExport message containing all the
metadata of the subtree and flags those objects as non-authoritative;
this message carries the actual subtree metadata to the importer. Upon
receipt, the importer inserts the data into its cache, marks all
objects as authoritative, and logs a copy of all metadata in an
EImportStart journal message. Once that has safely flushed, it replies
with an MExportAck. The exporter can now log an EExport journal entry,
which ultimately specifies that the export was a success. In the
presence of failures, it is the existence of the EExport entry alone
that disambiguates authority during recovery.

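The payload and journaling step could be sketched roughly as below;
``Journal``, ``handle_export`` and the rest are invented stand-ins,
and the journal write is assumed to be durable before the ack is
sent:

.. code-block:: cpp

    #include <iostream>
    #include <string>
    #include <vector>

    struct Journal {
      std::vector<std::string> entries;
      // For the sketch, pretend every entry is flushed durably at once.
      void log_and_flush(const std::string& entry) { entries.push_back(entry); }
    };

    struct Importer {
      Journal journal;
      bool metadata_authoritative = false;

      // Handle MExport: cache the subtree, mark it authoritative, and
      // journal EImportStart before replying with MExportAck.
      bool handle_export() {
        metadata_authoritative = true;
        journal.log_and_flush("EImportStart");
        return true;                     // reply with MExportAck
      }
    };

    struct Exporter {
      Journal journal;
      // On MExportAck, journal EExport; during recovery this entry alone
      // decides that authority passed to the importer.
      void handle_export_ack() { journal.log_and_flush("EExport"); }
    };

    int main() {
      Importer importer;
      Exporter exporter;
      if (importer.handle_export())
        exporter.handle_export_ack();
      std::cout << "exporter journal: " << exporter.journal.entries.back() << "\n";
    }

Journaling EImportStart before the ack is what makes EExport a
reliable tie-breaker: by the time the exporter can log it, the
importer already holds a durable copy of the metadata.
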
Once logged, the exporter will send an MExportNotify to any
bystanders, informing them that the authority is no longer ambiguous
and cache expirations should be sent only to the new authority (the
importer). Once these are acknowledged back to the exporter,
implicitly flushing the bystander-to-exporter message streams of any
stray expiration notices, the exporter unfreezes the subtree, cleans
up its migration-related state, and sends a final MExportFinish to the
importer. Upon receipt, the importer logs an EImportFinish(true)
(noting locally that the export was indeed a success), unfreezes its
subtree, processes any queued cache expirations, and cleans up its
state.

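Putting the final stage together, a rough sketch of the
unfreeze-and-finish sequence (again with invented names, and with the
bystander acknowledgements elided):

.. code-block:: cpp

    #include <iostream>
    #include <queue>
    #include <string>

    struct Exporter {
      bool subtree_frozen = true;
      // After EExport is journaled and all bystanders have acked the
      // MExportNotify, unfreeze, clean up, and send MExportFinish.
      void finish_export() {
        subtree_frozen = false;
        std::cout << "exporter: unfrozen, sending MExportFinish\n";
      }
    };

    struct Importer {
      bool subtree_frozen = true;
      std::queue<std::string> deferred_expirations;  // queued while frozen

      // On MExportFinish: journal EImportFinish(true), unfreeze, and drain
      // the cache expirations that were queued while authority was ambiguous.
      void handle_export_finish() {
        std::cout << "importer: journal EImportFinish(true)\n";
        subtree_frozen = false;
        while (!deferred_expirations.empty()) {
          std::cout << "importer: process " << deferred_expirations.front() << "\n";
          deferred_expirations.pop();
        }
      }
    };

    int main() {
      Exporter exporter;
      Importer importer;
      importer.deferred_expirations.push("MCacheExpire from mds.2");
      exporter.finish_export();
      importer.handle_export_finish();
    }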