==================================
CephFS Dynamic Metadata Management
==================================
Metadata operations usually take up more than 50 percent of all
file system operations. Metadata also scales in a more complex
fashion than storage: adding storage capacity scales I/O
throughput roughly linearly, whereas metadata performance is
constrained by the hierarchical and interdependent nature of the
file system namespace. In CephFS, the metadata workload is
therefore decoupled from the data workload to avoid placing
unnecessary strain on the RADOS cluster. Metadata is instead
handled by a dedicated cluster of Metadata Servers (MDSs).
CephFS distributes metadata across MDSs via `Dynamic Subtree Partitioning <https://ceph.com/wp-content/uploads/2016/08/weil-mds-sc04.pdf>`__.

Dynamic Subtree Partitioning
----------------------------
In traditional subtree partitioning, subtrees of the file system
hierarchy are statically assigned to individual MDSs. This
distribution strategy provides good hierarchical locality, linear
growth of cache, horizontal scaling across MDSs, and a fairly good
distribution of metadata across MDSs.

.. image:: subtree-partitioning.svg

The problem with traditional subtree partitioning is that workload
growth deep within a single MDS's subtree produces a hotspot of
activity on that MDS. The result is a lack of vertical scaling and
a waste of the resources of non-busy MDSs.

This led to the adoption of a more dynamic way of handling
metadata: Dynamic Subtree Partitioning, where load-intensive portions
of the directory hierarchy are migrated from busy MDSs to non-busy MDSs.

This strategy ensures that activity hotspots are relieved as they
appear and so leads to vertical scaling of the metadata workload in
addition to horizontal scaling.

Export Process During Subtree Migration
---------------------------------------

Once the exporter verifies that the subtree is permissible to be
exported (non-degraded cluster, non-frozen subtree root), the subtree
root directory is temporarily auth pinned, the subtree freeze is
initiated, and the exporter is committed to the subtree migration,
barring an intervening failure of the importer or itself.

The MExportDiscover message is exchanged to ensure that the inode for the
base directory being exported is open on the destination node. It is
auth pinned by the importer to prevent it from being trimmed. This occurs
before the exporter completes the freeze of the subtree to ensure that
the importer is able to replicate the necessary metadata. When the
exporter receives the MExportDiscoverAck, it allows the freeze to proceed by
removing its temporary auth pin.

A warning stage occurs only if the base subtree directory is open by
nodes other than the importer and exporter. If it is not, then this
implies that no metadata within or nested beneath the subtree is
replicated by any node other than the importer and exporter. If it is,
then an MExportWarning message informs any bystanders that the
authority for the region is temporarily ambiguous, and lists both the
exporter and importer as authoritative MDS nodes. In particular,
bystanders who are trimming items from their cache must send
MCacheExpire messages to both the old and new authorities. This is
necessary to ensure that the surviving authority reliably receives all
expirations even if the importer or exporter fails. While the subtree
is frozen (on both the importer and exporter), expirations will not be
immediately processed; instead, they will be queued until the region
is unfrozen and it can be determined that the node is or is not
authoritative.

The exporter then packages the subtree's metadata into an MExport
message, flagging its own copies of the objects as non-authoritative,
and sends it to the importer. Upon receipt, the
importer inserts the data into its cache, marks all objects as
authoritative, and logs a copy of all metadata in an EImportStart
journal message. Once that has safely flushed, it replies with an
MExportAck. The exporter can now log an EExport journal entry, which
ultimately specifies that the export was a success. In the presence
of failures, it is the existence of the EExport entry alone that
disambiguates authority during recovery.

Once logged, the exporter will send an MExportNotify to any
bystanders, informing them that the authority is no longer ambiguous
and cache expirations should be sent only to the new authority (the
importer). Once these are acknowledged back to the exporter,
implicitly flushing the bystander-to-exporter message streams of any
stray expiration notices, the exporter unfreezes the subtree, cleans
up its migration-related state, and sends a final MExportFinish to the
importer. Upon receipt, the importer logs an EImportFinish(true)
(noting locally that the export was indeed a success), unfreezes its
subtree, processes any queued cache expirations, and cleans up its
state.
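
A final sketch of the finish sequence under the same invented naming:
once all bystanders acknowledge the MExportNotify, the exporter
unfreezes and sends MExportFinish, and the importer journals
EImportFinish(true), unfreezes, and drains its queued expirations.

.. code-block:: python

    def finish_export(bystanders, exporter, importer, root):
        """Drive the final handshake of a subtree migration (sketch)."""
        # 1. Tell every bystander that authority is unambiguous again.
        acks = [b.handle_export_notify(root, new_auth=importer.rank)
                for b in bystanders]
        assert all(acks)  # the acks implicitly flush stray expirations
        # 2. Exporter unfreezes, cleans up, and signals the importer.
        exporter.frozen.discard(root)
        importer.handle_export_finish(root)

    class Bystander:
        def handle_export_notify(self, root, new_auth):
            self.expire_target = new_auth  # expire only to the importer now
            return True                    # acknowledged back to the exporter

    class Importer:
        def __init__(self):
            self.rank = 2
            self.frozen = {"/home/busy"}
            self.journal = []
            self.queued = [("mds.1", "MCacheExpire", "/home/busy/file")]

        def handle_export_finish(self, root):
            self.journal.append(("EImportFinish", True))  # export succeeded
            self.frozen.discard(root)
            while self.queued:        # drain the expirations queued while
                self.queued.pop()     # the subtree was frozen

    class Exporter:
        def __init__(self):
            self.frozen = {"/home/busy"}

    importer, exporter = Importer(), Exporter()
    finish_export([Bystander()], exporter, importer, "/home/busy")
    print(importer.journal, importer.frozen, exporter.frozen)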