5 - name: mds_alternate_name_max
8 desc: set the maximum length of alternate names for dentries
14 - name: mds_fscrypt_last_block_max_size
17 desc: maximum size of the last block (without the header) sent along with a truncate
18 request when fscrypt is enabled.
24 - name: mds_valgrind_exit
32 - name: mds_standby_replay_damaged
41 desc: set the MDS's CPU affinity to a NUMA node (-1 for none)
50 desc: path to MDS data and keyring
51 default: /var/lib/ceph/mds/$cluster-$id
60 desc: file system MDS prefers to join
61 long_desc: This setting indicates which file system name the MDS should prefer to
62 join (affinity). The monitors will try to have the MDS cluster safely reach a
63 state where all MDS have strong affinity, even via failovers to a standby.
68 # max xattr kv pairs size for each dir/file
69 - name: mds_max_xattr_pairs_size
72 desc: maximum aggregate size of extended attributes on a file
77 - name: mds_cache_trim_interval
80 desc: interval in seconds between cache trimming
86 - name: mds_cache_release_free_interval
89 desc: interval in seconds between heap releases
95 - name: mds_cache_memory_limit
98 desc: target maximum memory usage of MDS cache
99 long_desc: This sets a target maximum memory usage of the MDS cache and is the primary
100 tunable to limit the MDS memory usage. The MDS will try to stay under a reservation
101 of this limit (by default 95%; 1 - mds_cache_reservation) by trimming unused metadata
102 in its cache and recalling cached items in the client caches. It is possible for
103 the MDS to exceed this limit due to slow recall from clients. The mds_health_cache_threshold
104 (150%) sets a cache full threshold for when the MDS signals a cluster health warning.
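# Worked example for mds_cache_memory_limit (illustrative values, not defaults
# taken from this listing): assuming a limit of 4 GiB and the 5% reservation
# described above, the MDS aims to stay under 0.95 * 4 GiB = 3.8 GiB, and the
# 150% health threshold corresponds to 1.5 * 4 GiB = 6 GiB.
# A hedged sketch of raising the limit at runtime, assuming the centralized
# config CLI is in use and an 8 GiB target is wanted:
#   ceph config set mds mds_cache_memory_limit 8589934592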
110 - name: mds_cache_reservation
113 desc: amount of memory to reserve for future cached objects
114 fmt_desc: The cache reservation (memory or inodes) for the MDS cache to maintain.
115 Once the MDS begins dipping into its reservation, it will recall
116 client state until its cache size shrinks to restore the
123 - name: mds_health_cache_threshold
126 desc: threshold for cache size to generate health warning
130 - name: mds_cache_mid
133 desc: midpoint for MDS cache LRU
134 fmt_desc: The insertion point for new items in the cache LRU
139 - name: mds_cache_trim_decay_rate
142 desc: decay rate for trimming MDS cache throttle
148 - name: mds_cache_trim_threshold
151 desc: threshold for number of dentries that can be trimmed
157 - name: mds_max_file_recover
160 desc: maximum number of files to recover file sizes in parallel
165 - name: mds_dir_max_commit_size
168 desc: maximum size in megabytes for a RADOS write to a directory
169 fmt_desc: The maximum size of a directory update before Ceph breaks it into
170 smaller transactions (MB).
175 - name: mds_dir_keys_per_op
178 desc: number of directory entries to read in one RADOS operation
183 - name: mds_decay_halflife
186 desc: rate of decay for temperature counters on each directory for balancing
191 - name: mds_beacon_interval
194 desc: interval in seconds between MDS beacon messages sent to monitors
199 - name: mds_beacon_grace
202 desc: tolerance in seconds for missed MDS beacons to monitors
203 fmt_desc: The interval without beacons before Ceph declares an MDS laggy
204 (and possibly replaces it).
209 - name: mds_heartbeat_reset_grace
212 desc: number of iterations a loop may run while holding the mds_lock before it
213 must reset the heartbeat
217 - name: mds_heartbeat_grace
220 desc: tolerance in seconds for MDS internal heartbeat
224 - name: mds_enforce_unique_name
227 desc: require that the MDS name be unique in the cluster
232 # whether to blocklist clients whose sessions are dropped due to timeout
233 - name: mds_session_blocklist_on_timeout
236 desc: blocklist clients whose sessions have become stale
241 # whether to blocklist clients whose sessions are dropped via admin commands
242 - name: mds_session_blocklist_on_evict
245 desc: blocklist clients that have been evicted
250 # how many sessions should I try to load/store in a single OMAP operation?
251 - name: mds_sessionmap_keys_per_op
254 desc: number of omap keys to read from the SessionMap in one operation
259 - name: mds_recall_max_caps
262 desc: maximum number of caps to recall from client session in single recall
268 - name: mds_recall_max_decay_rate
271 desc: decay rate for throttle on recalled caps on a session
277 - name: mds_recall_max_decay_threshold
280 desc: decay threshold for throttle on recalled caps on a session
286 - name: mds_recall_global_max_decay_threshold
289 desc: decay threshold for throttle on recalled caps globally
295 - name: mds_recall_warning_threshold
298 desc: decay threshold for warning on slow session cap recall
304 - name: mds_recall_warning_decay_rate
307 desc: decay rate for warning on slow session cap recall
313 - name: mds_session_cache_liveness_decay_rate
316 desc: decay rate for session liveness leading to preemptive cap recall
317 long_desc: This determines how long a session needs to be quiescent before the MDS
318 begins preemptively recalling capabilities. The default of 5 minutes will cause
319 10 halvings of the decay counter after 1 hour, or 1/1024. The default magnitude
320 of 10 (2^10 or 1024) is chosen so that the MDS considers a previously chatty session
321 (approximately) to be quiescent after 1 hour.
326 - mds_session_cache_liveness_magnitude
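# Worked arithmetic for mds_session_cache_liveness_decay_rate, assuming the
# decay rate acts as a half-life (as the long_desc implies): after t minutes
# of silence the liveness counter is scaled by 2^(-t/5).
#   t = 50 min -> 2^-10 = 1/1024  (ten halvings, matching a magnitude of 10)
#   t = 60 min -> 2^-12 = 1/4096
# So "quiescent after 1 hour" is an approximation; under this assumption the
# 1/1024 ratio is already reached after about 50 minutes.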
329 - name: mds_session_cache_liveness_magnitude
332 desc: decay magnitude for preemptively recalling caps on quiet client
333 long_desc: This is the order of magnitude difference (in base 2) of the internal
334 liveness decay counter and the number of capabilities the session holds. When
335 this difference occurs, the MDS treats the session as quiescent and begins recalling
341 - mds_session_cache_liveness_decay_rate
344 - name: mds_session_cap_acquisition_decay_rate
347 desc: decay rate for session readdir caps leading to readdir throttle
348 long_desc: The half-life for the session cap acquisition counter of caps acquired
349 by readdir. This is used for throttling readdir requests from clients slow to
356 - name: mds_session_cap_acquisition_throttle
359 desc: throttle point for cap acquisition decay counter
363 - name: mds_session_max_caps_throttle_ratio
366 desc: ratio of mds_max_caps_per_client that client must exceed before readdir may
367 be throttled by cap acquisition throttle
371 - name: mds_cap_acquisition_throttle_retry_request_timeout
374 desc: timeout in seconds after which a client request is retried due to cap acquisition
379 # detecting freeze tree deadlock
380 - name: mds_freeze_tree_timeout
387 # collapse N-client health metrics to a single 'many'
388 - name: mds_health_summarize_threshold
391 desc: threshold of number of clients to summarize late client recall
396 # seconds to wait for clients during mds restart
397 # make it (mdsmap.session_timeout - mds_beacon_grace)
398 - name: mds_reconnect_timeout
401 desc: timeout in seconds to wait for clients to reconnect during MDS reconnect recovery
407 - name: mds_deny_all_reconnect
410 desc: flag to deny all client reconnects during failover
416 - name: mds_dir_prefetch
419 desc: flag to prefetch entire dir
425 - name: mds_tick_interval
428 desc: time in seconds between upkeep tasks
429 fmt_desc: How frequently the MDS performs internal periodic tasks.
434 # try to avoid propagating more often than this
435 - name: mds_dirstat_min_interval
441 fmt_desc: The minimum interval (in seconds) to try to avoid propagating
442 recursive stats up the tree.
444 # how quickly dirstat changes propagate up the hierarchy
445 - name: mds_scatter_nudge_interval
448 desc: minimum interval between scatter lock updates
449 fmt_desc: How quickly dirstat changes propagate up.
454 - name: mds_client_prealloc_inos
457 desc: number of unused inodes to pre-allocate to clients for file creation
458 fmt_desc: The number of inode numbers to preallocate per client session.
463 - name: mds_client_delegate_inos_pct
466 desc: percentage of preallocated inos to delegate to client
472 - name: mds_early_reply
475 desc: additional reply to clients that metadata requests are complete but not yet
477 fmt_desc: Determines whether the MDS should allow clients to see request
478 results before they commit to the journal.
483 - name: mds_replay_unsafe_with_closed_session
486 desc: complete all the replay requests when the MDS is restarted, no matter the session
493 - name: mds_default_dir_hash
496 desc: hash function to select directory fragment for dentry name
497 fmt_desc: The function to use for hashing files across directory fragments.
498 # CEPH_STR_HASH_RJENKINS
503 - name: mds_log_pause
510 - name: mds_log_skip_corrupt_events
516 fmt_desc: Determines whether the MDS should try to skip corrupt journal
517 events during journal replay.
519 - name: mds_log_max_events
522 desc: maximum number of events in the MDS journal (-1 is unlimited)
523 fmt_desc: The maximum events in the journal before we initiate trimming.
524 Set to ``-1`` to disable limits.
529 - name: mds_log_events_per_segment
532 desc: maximum number of events in an MDS journal segment
537 # segment size for mds log, default to default file_layout_t
538 - name: mds_log_segment_size
541 desc: size in bytes of each MDS log segment
546 - name: mds_log_max_segments
549 desc: maximum number of segments which may be untrimmed
550 fmt_desc: The maximum number of segments (objects) in the journal before
551 we initiate trimming. Set to ``-1`` to disable limits.
556 - name: mds_log_warn_factor
559 desc: trigger MDS_HEALTH_TRIM warning when the mds log is longer than mds_log_max_segments
560 * mds_log_warn_factor
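# Worked example for mds_log_warn_factor (values assumed for illustration):
# with mds_log_max_segments = 128 and mds_log_warn_factor = 2.0, the
# MDS_HEALTH_TRIM warning would be raised once the journal grows beyond
# 128 * 2.0 = 256 segments.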
567 - name: mds_bal_export_pin
570 desc: allow setting directory export pins to particular ranks
575 - name: mds_export_ephemeral_random
578 desc: allow ephemeral random pinning of the loaded subtrees
579 long_desc: probabilistically pin the loaded directory inode and the subtree beneath
580 it to an MDS based on the consistent hash of the inode number. The higher this
581 value, the more likely the loaded subtrees are to be pinned.
587 - name: mds_export_ephemeral_random_max
590 desc: the maximum percent permitted for random ephemeral pin policy
595 - mds_export_ephemeral_random
600 - name: mds_export_ephemeral_distributed
603 desc: allow ephemeral distributed pinning of the loaded subtrees
604 long_desc: 'pin the immediate child directories of the loaded directory inode based
605 on the consistent hash of the child''s inode number. '
611 - name: mds_export_ephemeral_distributed_factor
614 desc: multiple of max_mds for splitting and distributing directory
622 - name: mds_bal_sample_interval
625 desc: interval in seconds between balancer ticks
626 fmt_desc: Determines how frequently to sample directory temperature
627 (for fragmentation decisions).
632 - name: mds_bal_replicate_threshold
635 desc: hot popularity threshold to replicate a subtree
636 fmt_desc: The minimum temperature before Ceph attempts to replicate
637 metadata to other nodes.
642 - name: mds_bal_unreplicate_threshold
645 desc: cold popularity threshold to merge subtrees
646 fmt_desc: The minimum temperature before Ceph stops replicating
647 metadata to other nodes.
652 - name: mds_bal_split_size
655 desc: minimum size of directory fragment before splitting
656 fmt_desc: The maximum directory size before the MDS will split a directory
657 fragment into smaller bits.
662 - name: mds_bal_split_rd
665 desc: hot read popularity threshold for splitting a directory fragment
666 fmt_desc: The maximum directory read temperature before Ceph splits
667 a directory fragment.
672 - name: mds_bal_split_wr
675 desc: hot write popularity threshold for splitting a directory fragment
676 fmt_desc: The maximum directory write temperature before Ceph splits
677 a directory fragment.
682 - name: mds_bal_split_bits
685 desc: power of two child fragments for a fragment on split
686 fmt_desc: The number of bits by which to split a directory fragment.
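# Worked example for mds_bal_split_bits (value assumed for illustration):
# a setting of 3 splits one fragment into 2^3 = 8 child fragments, while a
# setting of 1 splits it into 2^1 = 2.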
693 - name: mds_bal_merge_size
696 desc: size of fragments where merging should occur
697 fmt_desc: The minimum directory size before Ceph tries to merge
698 adjacent directory fragments.
703 - name: mds_bal_interval
706 desc: interval between MDS balancer cycles
707 fmt_desc: The frequency (in seconds) of workload exchanges between MDSs.
711 - name: mds_bal_fragment_interval
714 desc: delay in seconds before interrupting client IO to perform splits
715 fmt_desc: The delay (in seconds) between a fragment being eligible for split
716 or merge and executing the fragmentation change.
720 # order of magnitude higher than split size
721 - name: mds_bal_fragment_size_max
724 desc: maximum size of a directory fragment before new creat/links fail
725 fmt_desc: The maximum size of a fragment before any new entries
726 are rejected with ENOSPC.
731 # multiple of split size that triggers immediate split
732 - name: mds_bal_fragment_fast_factor
735 desc: ratio of mds_bal_split_size at which fast fragment splitting occurs
736 fmt_desc: The ratio by which frags may exceed the split size before
737 a split is executed immediately (skipping the fragment interval).
742 - name: mds_bal_fragment_dirs
745 desc: enable directory fragmentation
746 long_desc: Directory fragmentation is a standard feature of CephFS that allows sharding
747 directories across multiple objects for performance and stability. Additionally,
748 this allows fragments to be distributed across multiple active MDSs to increase
749 throughput. Disabling (new) fragmentation should only be done in exceptional circumstances
750 and may lead to performance issues.
754 - name: mds_bal_idle_threshold
757 desc: idle metadata popularity threshold before rebalancing
758 fmt_desc: The minimum temperature before Ceph migrates a subtree
770 fmt_desc: The number of iterations to run balancer before Ceph stops.
771 (used for testing purposes only)
773 - name: mds_bal_max_until
779 fmt_desc: The number of seconds to run balancer before Ceph stops.
780 (used for testing purposes only)
789 The method for calculating MDS load.
792 - ``1`` = Request rate and latency.
795 # must be this much above average before we export anything
796 - name: mds_bal_min_rebalance
799 desc: amount overloaded over internal target before balancer begins offloading
800 fmt_desc: The minimum subtree temperature before Ceph migrates.
805 # if we need less than this, we don't do anything
806 - name: mds_bal_min_start
812 fmt_desc: The minimum subtree temperature before Ceph searches a subtree.
814 # take within this range of what we need
815 - name: mds_bal_need_min
821 fmt_desc: The minimum fraction of target subtree size to accept.
823 - name: mds_bal_need_max
829 fmt_desc: The maximum fraction of target subtree size to accept.
831 # any sub bigger than this taken in full
832 - name: mds_bal_midchunk
838 fmt_desc: Ceph will migrate any subtree that is larger than this fraction
839 of the target subtree size.
841 # never take anything smaller than this
842 - name: mds_bal_minchunk
848 fmt_desc: Ceph will ignore any subtree that is smaller than this fraction
849 of the target subtree size.
851 # target decay half-life in MDSMap (2x larger is approx. 2x slower)
852 - name: mds_bal_target_decay
855 desc: rate of decay for export targets communicated to clients
860 - name: mds_oft_prefetch_dirfrags
863 desc: prefetch dirfrags recorded in open file table on startup
869 # time to wait before starting replay again
870 - name: mds_replay_interval
873 desc: time in seconds between replay of updates to journal by standby replay MDS
874 fmt_desc: The journal poll interval when in standby-replay mode.
880 - name: mds_shutdown_check
886 fmt_desc: The interval for polling the cache during MDS shutdown.
888 - name: mds_thrash_exports
894 fmt_desc: Ceph will randomly export subtrees between nodes (testing only).
896 - name: mds_thrash_fragments
902 fmt_desc: Ceph will randomly fragment or merge directories.
904 - name: mds_dump_cache_on_map
910 fmt_desc: Ceph will dump the MDS cache contents to a file on each MDSMap.
912 - name: mds_dump_cache_after_rejoin
918 fmt_desc: Ceph will dump MDS cache contents to a file after
919 rejoining the cache (during recovery).
921 - name: mds_verify_scatter
927 fmt_desc: Ceph will assert that various scatter/gather invariants
928 are ``true`` (developers only).
930 - name: mds_debug_scatterstat
936 fmt_desc: Ceph will assert that various recursive stat invariants
937 are ``true`` (for developers only).
939 - name: mds_debug_frag
945 fmt_desc: Ceph will verify directory fragmentation invariants
946 when convenient (developers only).
948 - name: mds_debug_auth_pins
954 fmt_desc: The debug auth pin invariants (for developers only).
956 - name: mds_debug_subtrees
962 fmt_desc: The debug subtree invariants (for developers only).
964 - name: mds_abort_on_newly_corrupt_dentry
970 fmt_desc: MDS will abort if a dentry is detected as newly corrupted.
971 - name: mds_go_bad_corrupt_dentry
977 fmt_desc: MDS will mark a corrupt dentry as bad and isolate
980 - name: mds_inject_rename_corrupt_dentry_first
986 fmt_desc: probabilistically inject corrupt CDentry::first at rename
989 - name: mds_inject_journal_corrupt_dentry_first
995 fmt_desc: probabilistically inject corrupt CDentry::first at journal load
998 - name: mds_kill_mdstable_at
1004 fmt_desc: Ceph will inject MDS failure in MDSTable code
1005 (for developers only).
1007 - name: mds_max_export_size
1013 - name: mds_kill_export_at
1019 fmt_desc: Ceph will inject MDS failure in the subtree export code
1020 (for developers only).
1022 - name: mds_kill_import_at
1028 fmt_desc: Ceph will inject MDS failure in the subtree import code
1029 (for developers only).
1031 - name: mds_kill_link_at
1037 fmt_desc: Ceph will inject MDS failure in hard link code
1038 (for developers only).
1040 - name: mds_kill_rename_at
1046 fmt_desc: Ceph will inject MDS failure in the rename code
1047 (for developers only).
1049 - name: mds_kill_openc_at
1057 - name: mds_kill_journal_at
1063 - name: mds_kill_journal_expire_at
1070 - name: mds_kill_journal_replay_at
1077 - name: mds_journal_format
1084 - name: mds_kill_create_at
1091 - name: mds_inject_health_dummy
1097 # fraction of MDS modify replies for which to skip sending the client a trace, in [0, 1]
1098 - name: mds_inject_traceless_reply_probability
1105 - name: mds_wipe_sessions
1111 fmt_desc: Ceph will delete all client sessions on startup
1114 - name: mds_wipe_ino_prealloc
1120 fmt_desc: Ceph will delete ino preallocation metadata on startup
1123 - name: mds_skip_ino
1129 fmt_desc: The number of inode numbers to skip on startup
1132 - name: mds_enable_op_tracker
1135 desc: track remote operation progression and statistics
1140 # Max number of completed ops to track
1141 - name: mds_op_history_size
1144 desc: maximum size for list of historical operations
1149 # Oldest completed op to track
1150 - name: mds_op_history_duration
1153 desc: expiration time in seconds of historical operations
1158 # how many seconds old makes an op complaint-worthy
1159 - name: mds_op_complaint_time
1162 desc: time in seconds to consider an operation blocked after no updates
1167 # how many op log messages to show in one go
1168 - name: mds_op_log_threshold
1175 - name: mds_snap_min_uid
1178 desc: minimum uid of client to perform snapshots
1183 - name: mds_snap_max_uid
1186 desc: maximum uid of client to perform snapshots
1191 - name: mds_snap_rstat
1194 desc: enable nested rstat for snapshots
1199 - name: mds_verify_backtrace
1206 # detect clients which aren't trimming completed requests
1207 - name: mds_max_completed_flushes
1214 - name: mds_max_completed_requests
1221 - name: mds_action_on_write_error
1224 desc: action to take when MDS cannot write to RADOS (0:ignore, 1:read-only, 2:suicide)
1229 - name: mds_mon_shutdown_timeout
1232 desc: time to wait for mon to receive damaged MDS rank notification
1237 # Maximum number of concurrent stray files to purge
1238 - name: mds_max_purge_files
1241 desc: maximum number of deleted files to purge in parallel
1246 # Maximum number of concurrent RADOS ops to issue in purging
1247 - name: mds_max_purge_ops
1250 desc: maximum number of purge operations performed in parallel
1255 # Maximum number of concurrent RADOS ops to issue in purging, scaled by PG count
1256 - name: mds_max_purge_ops_per_pg
1259 desc: number of parallel purge operations performed per PG
1264 - name: mds_purge_queue_busy_flush_period
1271 - name: mds_root_ino_uid
1274 desc: default uid for new root directory
1279 - name: mds_root_ino_gid
1282 desc: default gid for new root directory
1287 - name: mds_max_scrub_ops_in_progress
1290 desc: maximum number of scrub operations performed in parallel
1295 - name: mds_forward_all_requests_to_auth
1298 desc: always process op on auth mds
1304 # Maximum number of damaged frags/dentries before whole MDS rank goes damaged
1305 - name: mds_damage_table_max_entries
1308 desc: maximum number of damage table entries
1313 # Maximum increment for client writable range, counted by number of objects
1314 - name: mds_client_writeable_range_max_inc_objs
1317 desc: maximum number of objects in writeable range of a file for a client
1322 - name: mds_min_caps_per_client
1325 desc: minimum number of capabilities a client may hold
1329 - name: mds_min_caps_working_set
1332 desc: number of capabilities a client may hold without cache pressure warnings generated
1338 - name: mds_max_caps_per_client
1341 desc: maximum number of capabilities a client may hold
1345 - name: mds_hack_allow_loading_invalid_metadata
1348 desc: INTENTIONALLY CAUSE DATA LOSS by bypassing checks for invalid metadata on disk.
1349 Allows testing repair tools.
1353 - name: mds_defer_session_stale
1359 - name: mds_inject_migrator_session_race
1365 - name: mds_request_load_average_decay_rate
1368 desc: rate of decay in seconds for calculating request load average
1372 - name: mds_cap_revoke_eviction_timeout
1375 desc: number of seconds after which clients which have not responded to cap revoke
1376 messages by the MDS are evicted.
1380 - name: mds_dump_cache_threshold_formatter
1383 desc: threshold for cache usage to disallow "dump cache" operation to formatter
1384 long_desc: Disallow MDS from dumping caches to formatter via "dump cache" command
1385 if cache usage exceeds this threshold.
1389 - name: mds_dump_cache_threshold_file
1392 desc: threshold for cache usage to disallow "dump cache" operation to file
1393 long_desc: Disallow MDS from dumping caches to file via "dump cache" command if
1394 cache usage exceeds this threshold.
1398 - name: mds_task_status_update_interval
1401 desc: task status update interval to manager
1402 long_desc: interval (in seconds) for sending mds task status to ceph manager
1406 - name: mds_max_snaps_per_dir
1409 desc: max snapshots per directory
1410 long_desc: maximum number of snapshots that can be created per directory
1418 - name: mds_asio_thread_count
1421 desc: Size of thread pool for ASIO completions
1428 - name: mds_ping_grace
1431 desc: timeout after which an MDS is considered laggy by rank 0 MDS.
1432 long_desc: timeout for replying to a ping message sent by rank 0, after which an
1433 active MDS is considered laggy (delayed metrics) by rank 0.
1439 - name: mds_ping_interval
1442 desc: interval in seconds for sending ping messages to active MDSs.
1443 long_desc: interval in seconds for rank 0 to send ping messages to all active MDSs.
1449 - name: mds_metrics_update_interval
1452 desc: interval in seconds for metrics data update.
1453 long_desc: interval in seconds after which active MDSs send client metrics data
1460 - name: mds_dir_max_entries
1463 desc: maximum number of entries per directory before new creat/links fail
1464 long_desc: The maximum number of entries before any new entries
1465 are rejected with ENOSPC.
1471 - name: mds_sleep_rank_change
1477 - name: mds_connect_bootstrapping
1483 - name: mds_symlink_recovery
1486 desc: Stores the symlink target in the first data object of the symlink file.
1487 Allows recovery of symlinks using recovery tools.
1493 - name: mds_extraordinary_events_dump_interval
1496 desc: Interval in seconds for dumping the recent in-memory logs when there is an extraordinary event.
1497 long_desc: Interval in seconds for dumping the recent in-memory logs when there is an extraordinary
1498 event. The default is ``0`` (disabled). The log level should be ``< 10`` and the gather level
1499 should be ``>=10`` in debug_mds for enabling this option.