5 - name: mds_alternate_name_max
8 desc: set the maximum length of alternate names for dentries
14 - name: mds_valgrind_exit
22 - name: mds_standby_replay_damaged
31 desc: set mds's cpu affinity to a numa node (-1 for none)
40 desc: path to MDS data and keyring
41 default: /var/lib/ceph/mds/$cluster-$id
50 desc: file system MDS prefers to join
51 long_desc: This setting indicates which file system name the MDS should prefer to
52 join (affinity). The monitors will try to have the MDS cluster safely reach a
53 state where all MDS have strong affinity, even via failovers to a standby.
58 # max xattr kv pairs size for each dir/file
59 - name: mds_max_xattr_pairs_size
62 desc: maximum aggregate size of extended attributes on a file
67 - name: mds_cache_trim_interval
70 desc: interval in seconds between cache trimming
76 - name: mds_cache_release_free_interval
79 desc: interval in seconds between heap releases
85 - name: mds_cache_memory_limit
88 desc: target maximum memory usage of MDS cache
89 long_desc: This sets a target maximum memory usage of the MDS cache and is the primary
90 tunable to limit the MDS memory usage. The MDS will try to stay under a reservation
91 of this limit (by default 95%; 1 - mds_cache_reservation) by trimming unused metadata
92 in its cache and recalling cached items in the client caches. It is possible for
93 the MDS to exceed this limit due to slow recall from clients. The mds_health_cache_threshold
94 (150%) sets a cache full threshold for when the MDS signals a cluster health warning.
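# Illustrative arithmetic for the description above. The 4 GiB limit is an
# assumed example value, not a setting taken from this file; the 95% and 150%
# figures are the defaults quoted in the text.
#   mds_cache_memory_limit          = 4 GiB
#   trim/recall target (95%)        = 4 GiB * (1 - 0.05) = 3.8 GiB
#   health warning threshold (150%) = 4 GiB * 1.5        = 6 GiB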
100 - name: mds_cache_reservation
103 desc: amount of memory to reserve for future cached objects
104 fmt_desc: The cache reservation (memory or inodes) for the MDS cache to maintain.
105 Once the MDS begins dipping into its reservation, it will recall
106 client state until its cache size shrinks to restore the
113 - name: mds_health_cache_threshold
116 desc: threshold for cache size to generate health warning
120 - name: mds_cache_mid
123 desc: midpoint for MDS cache LRU
124 fmt_desc: The insertion point for new items in the cache LRU
129 - name: mds_cache_trim_decay_rate
132 desc: decay rate for trimming MDS cache throttle
138 - name: mds_cache_trim_threshold
141 desc: threshold for number of dentries that can be trimmed
147 - name: mds_max_file_recover
150 desc: maximum number of files to recover file sizes in parallel
155 - name: mds_dir_max_commit_size
158 desc: maximum size in megabytes for a RADOS write to a directory
159 fmt_desc: The maximum size of a directory update before Ceph breaks it into
160 smaller transactions (MB).
165 - name: mds_dir_keys_per_op
168 desc: number of directory entries to read in one RADOS operation
173 - name: mds_decay_halflife
176 desc: rate of decay for temperature counters on each directory for balancing
181 - name: mds_beacon_interval
184 desc: interval in seconds between MDS beacon messages sent to monitors
189 - name: mds_beacon_grace
192 desc: tolerance in seconds for missed MDS beacons to monitors
193 fmt_desc: The interval without beacons before Ceph declares an MDS laggy
194 (and possibly replaces it).
199 - name: mds_heartbeat_reset_grace
202 desc: the basic unit of tolerance, counted in iterations of a loop that keeps
203 running while holding the mds_lock, after which the heartbeat must be reset
207 - name: mds_heartbeat_grace
210 desc: tolerance in seconds for MDS internal heartbeat
214 - name: mds_enforce_unique_name
217 desc: require that the MDS name be unique in the cluster
222 # whether to blocklist clients whose sessions are dropped due to timeout
223 - name: mds_session_blocklist_on_timeout
226 desc: blocklist clients whose sessions have become stale
231 # whether to blocklist clients whose sessions are dropped via admin commands
232 - name: mds_session_blocklist_on_evict
235 desc: blocklist clients that have been evicted
240 # how many sessions should I try to load/store in a single OMAP operation?
241 - name: mds_sessionmap_keys_per_op
244 desc: number of omap keys to read from the SessionMap in one operation
249 - name: mds_recall_max_caps
252 desc: maximum number of caps to recall from client session in single recall
258 - name: mds_recall_max_decay_rate
261 desc: decay rate for throttle on recalled caps on a session
267 - name: mds_recall_max_decay_threshold
270 desc: decay threshold for throttle on recalled caps on a session
276 - name: mds_recall_global_max_decay_threshold
279 desc: decay threshold for throttle on recalled caps globally
285 - name: mds_recall_warning_threshold
288 desc: decay threshold for warning on slow session cap recall
294 - name: mds_recall_warning_decay_rate
297 desc: decay rate for warning on slow session cap recall
303 - name: mds_session_cache_liveness_decay_rate
306 desc: decay rate for session liveness leading to preemptive cap recall
307 long_desc: This determines how long a session needs to be quiescent before the MDS
308 begins preemptively recalling capabilities. The default of 5 minutes will cause
309 10 halvings of the decay counter after 1 hour, or 1/1024. The default magnitude
310 of 10 (2^10 or 1024) is chosen so that the MDS considers a previously chatty session
311 (approximately) to be quiescent after 1 hour.
316 - mds_session_cache_liveness_magnitude
319 - name: mds_session_cache_liveness_magnitude
322 desc: decay magnitude for preemptively recalling caps on quiet client
323 long_desc: This is the order of magnitude difference (in base 2) of the internal
324 liveness decay counter and the number of capabilities the session holds. When
325 this difference occurs, the MDS treats the session as quiescent and begins recalling
331 - mds_session_cache_liveness_decay_rate
334 - name: mds_session_cap_acquisition_decay_rate
337 desc: decay rate for session readdir caps leading to readdir throttle
338 long_desc: The half-life for the session cap acquisition counter of caps acquired
339 by readdir. This is used for throttling readdir requests from clients slow to
346 - name: mds_session_cap_acquisition_throttle
349 desc: throttle point for cap acquisition decay counter
353 - name: mds_session_max_caps_throttle_ratio
356 desc: ratio of mds_max_caps_per_client that client must exceed before readdir may
357 be throttled by cap acquisition throttle
361 - name: mds_cap_acquisition_throttle_retry_request_timeout
364 desc: timeout in seconds after which a client request is retried due to cap acquisition
369 # detecting freeze tree deadlock
370 - name: mds_freeze_tree_timeout
377 # collapse N-client health metrics to a single 'many'
378 - name: mds_health_summarize_threshold
381 desc: threshold of number of clients to summarize late client recall
386 # seconds to wait for clients during mds restart
387 # make it (mdsmap.session_timeout - mds_beacon_grace)
388 - name: mds_reconnect_timeout
391 desc: timeout in seconds to wait for clients to reconnect during MDS reconnect recovery
397 - name: mds_deny_all_reconnect
400 desc: flag to deny all client reconnects during failover
406 - name: mds_tick_interval
409 desc: time in seconds between upkeep tasks
410 fmt_desc: How frequently the MDS performs internal periodic tasks.
415 # try to avoid propagating more often than this
416 - name: mds_dirstat_min_interval
422 fmt_desc: The minimum interval (in seconds) to try to avoid propagating
423 recursive stats up the tree.
425 # how quickly dirstat changes propagate up the hierarchy
426 - name: mds_scatter_nudge_interval
429 desc: minimum interval between scatter lock updates
430 fmt_desc: How quickly dirstat changes propagate up.
435 - name: mds_client_prealloc_inos
438 desc: number of unused inodes to pre-allocate to clients for file creation
439 fmt_desc: The number of inode numbers to preallocate per client session.
444 - name: mds_client_delegate_inos_pct
447 desc: percentage of preallocated inos to delegate to client
453 - name: mds_early_reply
456 desc: additional reply to clients that metadata requests are complete but not yet
458 fmt_desc: Determines whether the MDS should allow clients to see request
459 results before they commit to the journal.
464 - name: mds_replay_unsafe_with_closed_session
467 desc: complete all replay requests when the MDS is restarted, no matter the session
474 - name: mds_default_dir_hash
477 desc: hash function to select directory fragment for dentry name
478 fmt_desc: The function to use for hashing files across directory fragments.
479 # CEPH_STR_HASH_RJENKINS
484 - name: mds_log_pause
491 - name: mds_log_skip_corrupt_events
497 fmt_desc: Determines whether the MDS should try to skip corrupt journal
498 events during journal replay.
500 - name: mds_log_max_events
503 desc: maximum number of events in the MDS journal (-1 is unlimited)
504 fmt_desc: The maximum events in the journal before we initiate trimming.
505 Set to ``-1`` to disable limits.
510 - name: mds_log_events_per_segment
513 desc: maximum number of events in an MDS journal segment
518 # segment size for mds log, default to default file_layout_t
519 - name: mds_log_segment_size
522 desc: size in bytes of each MDS log segment
527 - name: mds_log_max_segments
530 desc: maximum number of segments which may be untrimmed
531 fmt_desc: The maximum number of segments (objects) in the journal before
532 we initiate trimming. Set to ``-1`` to disable limits.
537 - name: mds_log_warn_factor
540 desc: trigger MDS_HEALTH_TRIM warning when the mds log is longer than mds_log_max_segments
541 * mds_log_warn_factor
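# Illustrative arithmetic, using assumed example values rather than the
# defaults defined in this file:
#   mds_log_max_segments = 128, mds_log_warn_factor = 2.0
#   MDS_HEALTH_TRIM is raised once the log is longer than 128 * 2.0 = 256 segments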
548 - name: mds_bal_export_pin
551 desc: allow setting directory export pins to particular ranks
556 - name: mds_export_ephemeral_random
559 desc: allow ephemeral random pinning of the loaded subtrees
560 long_desc: probabilistically pin the loaded directory inode and the subtree beneath
561 it to an MDS based on the consistent hash of the inode number. The higher this
562 value the more likely the loaded subtrees get pinned
568 - name: mds_export_ephemeral_random_max
571 desc: the maximum percent permitted for random ephemeral pin policy
576 - mds_export_ephemeral_random
581 - name: mds_export_ephemeral_distributed
584 desc: allow ephemeral distributed pinning of the loaded subtrees
585 long_desc: 'pin the immediate child directories of the loaded directory inode based
586 on the consistent hash of the child''s inode number. '
592 - name: mds_export_ephemeral_distributed_factor
595 desc: multiple of max_mds for splitting and distributing directory
603 - name: mds_bal_sample_interval
606 desc: interval in seconds between balancer ticks
607 fmt_desc: Determines how frequently to sample directory temperature
608 (for fragmentation decisions).
613 - name: mds_bal_replicate_threshold
616 desc: hot popularity threshold to replicate a subtree
617 fmt_desc: The maximum temperature before Ceph attempts to replicate
618 metadata to other nodes.
623 - name: mds_bal_unreplicate_threshold
626 desc: cold popularity threshold to merge subtrees
627 fmt_desc: The minimum temperature before Ceph stops replicating
628 metadata to other nodes.
633 - name: mds_bal_split_size
636 desc: minimum size of directory fragment before splitting
637 fmt_desc: The maximum directory size before the MDS will split a directory
638 fragment into smaller bits.
643 - name: mds_bal_split_rd
646 desc: hot read popularity threshold for splitting a directory fragment
647 fmt_desc: The maximum directory read temperature before Ceph splits
648 a directory fragment.
653 - name: mds_bal_split_wr
656 desc: hot write popularity threshold for splitting a directory fragment
657 fmt_desc: The maximum directory write temperature before Ceph splits
658 a directory fragment.
663 - name: mds_bal_split_bits
666 desc: power of two child fragments for a fragment on split
667 fmt_desc: The number of bits by which to split a directory fragment.
674 - name: mds_bal_merge_size
677 desc: size of fragments where merging should occur
678 fmt_desc: The minimum directory size before Ceph tries to merge
679 adjacent directory fragments.
684 - name: mds_bal_interval
687 desc: interval between MDS balancer cycles
688 fmt_desc: The frequency (in seconds) of workload exchanges between MDSs.
692 - name: mds_bal_fragment_interval
695 desc: delay in seconds before interrupting client IO to perform splits
696 fmt_desc: The delay (in seconds) between a fragment being eligible for split
697 or merge and executing the fragmentation change.
701 # order of magnitude higher than split size
702 - name: mds_bal_fragment_size_max
705 desc: maximum size of a directory fragment before new creat/links fail
706 fmt_desc: The maximum size of a fragment before any new entries
707 are rejected with ENOSPC.
712 # multiple of split size that triggers immediate split
713 - name: mds_bal_fragment_fast_factor
716 desc: ratio of mds_bal_split_size at which fast fragment splitting occurs
717 fmt_desc: The ratio by which frags may exceed the split size before
718 a split is executed immediately (skipping the fragment interval)
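# Illustrative arithmetic, using assumed example values and assuming fragment
# size is counted in directory entries:
#   mds_bal_split_size = 10000, mds_bal_fragment_fast_factor = 1.5
#   a fragment that grows past 10000 * 1.5 = 15000 entries is split
#   immediately instead of waiting for mds_bal_fragment_interval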
723 - name: mds_bal_fragment_dirs
726 desc: enable directory fragmentation
727 long_desc: Directory fragmentation is a standard feature of CephFS that allows sharding
728 directories across multiple objects for performance and stability. Additionally,
729 this allows fragments to be distributed across multiple active MDSs to increase
730 throughput. Disabling (new) fragmentation should only be done in exceptional circumstances
731 and may lead to performance issues.
735 - name: mds_bal_idle_threshold
738 desc: idle metadata popularity threshold before rebalancing
739 fmt_desc: The minimum temperature before Ceph migrates a subtree
751 fmt_desc: The number of iterations to run balancer before Ceph stops.
752 (used for testing purposes only)
754 - name: mds_bal_max_until
760 fmt_desc: The number of seconds to run balancer before Ceph stops.
761 (used for testing purposes only)
770 The method for calculating MDS load.
773 - ``1`` = Request rate and latency.
776 # must be this much above average before we export anything
777 - name: mds_bal_min_rebalance
780 desc: amount overloaded over internal target before balancer begins offloading
781 fmt_desc: The minimum subtree temperature before Ceph migrates.
786 # if we need less than this, we don't do anything
787 - name: mds_bal_min_start
793 fmt_desc: The minimum subtree temperature before Ceph searches a subtree.
795 # take within this range of what we need
796 - name: mds_bal_need_min
802 fmt_desc: The minimum fraction of target subtree size to accept.
804 - name: mds_bal_need_max
810 fmt_desc: The maximum fraction of target subtree size to accept.
812 # any sub bigger than this taken in full
813 - name: mds_bal_midchunk
819 fmt_desc: Ceph will migrate any subtree that is larger than this fraction
820 of the target subtree size.
822 # never take anything smaller than this
823 - name: mds_bal_minchunk
829 fmt_desc: Ceph will ignore any subtree that is smaller than this fraction
830 of the target subtree size.
832 # target decay half-life in MDSMap (2x larger is approx. 2x slower)
833 - name: mds_bal_target_decay
836 desc: rate of decay for export targets communicated to clients
841 - name: mds_oft_prefetch_dirfrags
844 desc: prefetch dirfrags recorded in open file table on startup
850 # time to wait before starting replay again
851 - name: mds_replay_interval
854 desc: time in seconds between replay of updates to journal by standby replay MDS
855 fmt_desc: The journal poll interval when in standby-replay mode.
861 - name: mds_shutdown_check
867 fmt_desc: The interval for polling the cache during MDS shutdown.
869 - name: mds_thrash_exports
875 fmt_desc: Ceph will randomly export subtrees between nodes (testing only).
877 - name: mds_thrash_fragments
883 fmt_desc: Ceph will randomly fragment or merge directories.
885 - name: mds_dump_cache_on_map
891 fmt_desc: Ceph will dump the MDS cache contents to a file on each MDSMap.
893 - name: mds_dump_cache_after_rejoin
899 fmt_desc: Ceph will dump MDS cache contents to a file after
900 rejoining the cache (during recovery).
902 - name: mds_verify_scatter
908 fmt_desc: Ceph will assert that various scatter/gather invariants
909 are ``true`` (developers only).
911 - name: mds_debug_scatterstat
917 fmt_desc: Ceph will assert that various recursive stat invariants
918 are ``true`` (for developers only).
920 - name: mds_debug_frag
926 fmt_desc: Ceph will verify directory fragmentation invariants
927 when convenient (developers only).
929 - name: mds_debug_auth_pins
935 fmt_desc: The debug auth pin invariants (for developers only).
937 - name: mds_debug_subtrees
943 fmt_desc: The debug subtree invariants (for developers only).
945 - name: mds_kill_mdstable_at
951 fmt_desc: Ceph will inject MDS failure in MDSTable code
952 (for developers only).
954 - name: mds_max_export_size
960 - name: mds_kill_export_at
966 fmt_desc: Ceph will inject MDS failure in the subtree export code
967 (for developers only).
969 - name: mds_kill_import_at
975 fmt_desc: Ceph will inject MDS failure in the subtree import code
976 (for developers only).
978 - name: mds_kill_link_at
984 fmt_desc: Ceph will inject MDS failure in hard link code
985 (for developers only).
987 - name: mds_kill_rename_at
993 fmt_desc: Ceph will inject MDS failure in the rename code
994 (for developers only).
996 - name: mds_kill_openc_at
1004 - name: mds_kill_journal_at
1010 - name: mds_kill_journal_expire_at
1017 - name: mds_kill_journal_replay_at
1024 - name: mds_journal_format
1031 - name: mds_kill_create_at
1038 - name: mds_inject_health_dummy
1044 # percentage of MDS modify replies to skip sending the client a trace on [0-1]
1045 - name: mds_inject_traceless_reply_probability
1052 - name: mds_wipe_sessions
1058 fmt_desc: Ceph will delete all client sessions on startup
1061 - name: mds_wipe_ino_prealloc
1067 fmt_desc: Ceph will delete ino preallocation metadata on startup
1070 - name: mds_skip_ino
1076 fmt_desc: The number of inode numbers to skip on startup
1079 - name: mds_enable_op_tracker
1082 desc: track remote operation progression and statistics
1087 # Max number of completed ops to track
1088 - name: mds_op_history_size
1091 desc: maximum size for list of historical operations
1096 # Oldest completed op to track
1097 - name: mds_op_history_duration
1100 desc: expiration time in seconds of historical operations
1105 # how many seconds old makes an op complaint-worthy
1106 - name: mds_op_complaint_time
1109 desc: time in seconds to consider an operation blocked after no updates
1114 # how many op log messages to show in one go
1115 - name: mds_op_log_threshold
1122 - name: mds_snap_min_uid
1125 desc: minimum uid of client to perform snapshots
1130 - name: mds_snap_max_uid
1133 desc: maximum uid of client to perform snapshots
1138 - name: mds_snap_rstat
1141 desc: enable nested rstat for snapshots
1146 - name: mds_verify_backtrace
1153 # detect clients which aren't trimming completed requests
1154 - name: mds_max_completed_flushes
1161 - name: mds_max_completed_requests
1168 - name: mds_action_on_write_error
1171 desc: action to take when MDS cannot write to RADOS (0:ignore, 1:read-only, 2:suicide)
1176 - name: mds_mon_shutdown_timeout
1179 desc: time to wait for mon to receive damaged MDS rank notification
1184 # Maximum number of concurrent stray files to purge
1185 - name: mds_max_purge_files
1188 desc: maximum number of deleted files to purge in parallel
1193 # Maximum number of concurrent RADOS ops to issue in purging
1194 - name: mds_max_purge_ops
1197 desc: maximum number of purge operations performed in parallel
1202 # Maximum number of concurrent RADOS ops to issue in purging, scaled by PG count
1203 - name: mds_max_purge_ops_per_pg
1206 desc: number of parallel purge operations performed per PG
1211 - name: mds_purge_queue_busy_flush_period
1218 - name: mds_root_ino_uid
1221 desc: default uid for new root directory
1226 - name: mds_root_ino_gid
1229 desc: default gid for new root directory
1234 - name: mds_max_scrub_ops_in_progress
1237 desc: maximum number of scrub operations performed in parallel
1242 - name: mds_forward_all_requests_to_auth
1245 desc: always process op on auth mds
1251 # Maximum number of damaged frags/dentries before whole MDS rank goes damaged
1252 - name: mds_damage_table_max_entries
1255 desc: maximum number of damage table entries
1260 # Maximum increment for client writable range, counted by number of objects
1261 - name: mds_client_writeable_range_max_inc_objs
1264 desc: maximum number of objects in writeable range of a file for a client
1269 - name: mds_min_caps_per_client
1272 desc: minimum number of capabilities a client may hold
1276 - name: mds_min_caps_working_set
1279 desc: number of capabilities a client may hold without cache pressure warnings generated
1285 - name: mds_max_caps_per_client
1288 desc: maximum number of capabilities a client may hold
1292 - name: mds_hack_allow_loading_invalid_metadata
1295 desc: INTENTIONALLY CAUSE DATA LOSS by bypassing checks for invalid metadata on disk.
1296 Allows testing repair tools.
1300 - name: mds_defer_session_stale
1306 - name: mds_inject_migrator_session_race
1312 - name: mds_request_load_average_decay_rate
1315 desc: rate of decay in seconds for calculating request load average
1319 - name: mds_cap_revoke_eviction_timeout
1322 desc: number of seconds after which clients which have not responded to cap revoke
1323 messages by the MDS are evicted.
1327 - name: mds_max_retries_on_remount_failure
1330 desc: number of consecutive failed remount attempts for invalidating kernel dcache
1331 after which the client aborts.
1335 - name: mds_dump_cache_threshold_formatter
1338 desc: threshold for cache usage to disallow "dump cache" operation to formatter
1339 long_desc: Disallow MDS from dumping caches to formatter via "dump cache" command
1340 if cache usage exceeds this threshold.
1344 - name: mds_dump_cache_threshold_file
1347 desc: threshold for cache usage to disallow "dump cache" operation to file
1348 long_desc: Disallow MDS from dumping caches to file via "dump cache" command if
1349 cache usage exceeds this threshold.
1353 - name: mds_task_status_update_interval
1356 desc: task status update interval to manager
1357 long_desc: interval (in seconds) for sending mds task status to ceph manager
1361 - name: mds_max_snaps_per_dir
1364 desc: max snapshots per directory
1365 long_desc: maximum number of snapshots that can be created per directory
1373 - name: mds_asio_thread_count
1376 desc: Size of thread pool for ASIO completions
1383 - name: mds_ping_grace
1386 desc: timeout after which an MDS is considered laggy by rank 0 MDS.
1387 long_desc: timeout for replying to a ping message sent by rank 0 after which an
1388 active MDS is considered laggy (delayed metrics) by rank 0.
1394 - name: mds_ping_interval
1397 desc: interval in seconds for sending ping messages to active MDSs.
1398 long_desc: interval in seconds for rank 0 to send ping messages to all active MDSs.
1404 - name: mds_metrics_update_interval
1407 desc: interval in seconds for metrics data update.
1408 long_desc: interval in seconds after which active MDSs send client metrics data
1415 - name: mds_dir_max_entries
1418 desc: maximum number of entries per directory before new creat/links fail
1419 long_desc: The maximum number of entries before any new entries
1420 are rejected with ENOSPC.
1426 - name: mds_sleep_rank_change
1432 - name: mds_connect_bootstrapping
1438 - name: mds_symlink_recovery
1441 desc: Store the symlink target in the first data object of the symlink file.
1442 Allows recovery of symlinks using recovery tools.