1 '\" te
2 .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
3 .\" Copyright (c) 2019 by Delphix. All rights reserved.
4 .\" Copyright (c) 2019 Datto Inc.
5 .\" The contents of this file are subject to the terms of the Common Development
6 .\" and Distribution License (the "License"). You may not use this file except
7 .\" in compliance with the License. You can obtain a copy of the license at
8 .\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing.
9 .\"
10 .\" See the License for the specific language governing permissions and
11 .\" limitations under the License. When distributing Covered Code, include this
12 .\" CDDL HEADER in each file and include the License file at
13 .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
14 .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
15 .\" own identifying information:
16 .\" Portions Copyright [yyyy] [name of copyright owner]
17 .TH ZFS-MODULE-PARAMETERS 5 "Feb 15, 2019"
18 .SH NAME
19 zfs\-module\-parameters \- ZFS module parameters
20 .SH DESCRIPTION
21 .sp
22 .LP
23 Description of the different parameters to the ZFS module.
24
25 .SS "Module parameters"
26 .sp
27 .LP
28
29 .sp
30 .ne 2
31 .na
32 \fBdbuf_cache_max_bytes\fR (ulong)
33 .ad
34 .RS 12n
35 Maximum size in bytes of the dbuf cache. When \fB0\fR this value will default
36 to \fB1/2^dbuf_cache_shift\fR (1/32) of the target ARC size, otherwise the
37 provided value in bytes will be used. The behavior of the dbuf cache and its
38 associated settings can be observed via the \fB/proc/spl/kstat/zfs/dbufstats\fR
39 kstat.
40 .sp
41 Default value: \fB0\fR.
42 .RE
43
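.sp
.LP
Many of the parameters described in this section can be inspected and, where
writable, changed at runtime through sysfs, or set at module load time from a
modprobe configuration file. The following sketch uses the dbuf cache limit
above as an example; paths reflect a typical Linux installation and are
illustrative only.
.sp
.nf
# Read the current value (assumes the zfs module is loaded)
cat /sys/module/zfs/parameters/dbuf_cache_max_bytes

# Change it at runtime (as root); 268435456 bytes = 256MB
echo 268435456 > /sys/module/zfs/parameters/dbuf_cache_max_bytes

# Apply the same setting at every module load
echo "options zfs dbuf_cache_max_bytes=268435456" >> /etc/modprobe.d/zfs.conf

# Observe the dbuf cache via the kstat mentioned above
cat /proc/spl/kstat/zfs/dbufstats
.fi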
44 .sp
45 .ne 2
46 .na
47 \fBdbuf_metadata_cache_max_bytes\fR (ulong)
48 .ad
49 .RS 12n
50 Maximum size in bytes of the metadata dbuf cache. When \fB0\fR this value will
51 default to \fB1/2^dbuf_metadata_cache_shift\fR (1/64) of the target ARC size, otherwise
52 the provided value in bytes will be used. The behavior of the metadata dbuf
53 cache and its associated settings can be observed via the
54 \fB/proc/spl/kstat/zfs/dbufstats\fR kstat.
55 .sp
56 Default value: \fB0\fR.
57 .RE
58
59 .sp
60 .ne 2
61 .na
62 \fBdbuf_cache_hiwater_pct\fR (uint)
63 .ad
64 .RS 12n
65 The percentage over \fBdbuf_cache_max_bytes\fR when dbufs must be evicted
66 directly.
67 .sp
68 Default value: \fB10\fR%.
69 .RE
70
71 .sp
72 .ne 2
73 .na
74 \fBdbuf_cache_lowater_pct\fR (uint)
75 .ad
76 .RS 12n
77 The percentage below \fBdbuf_cache_max_bytes\fR when the evict thread stops
78 evicting dbufs.
79 .sp
80 Default value: \fB10\fR%.
81 .RE
82
83 .sp
84 .ne 2
85 .na
86 \fBdbuf_cache_shift\fR (int)
87 .ad
88 .RS 12n
89 Set the size of the dbuf cache, \fBdbuf_cache_max_bytes\fR, to a log2 fraction
90 of the target arc size.
91 .sp
92 Default value: \fB5\fR.
93 .RE
94
95 .sp
96 .ne 2
97 .na
98 \fBdbuf_metadata_cache_shift\fR (int)
99 .ad
100 .RS 12n
101 Set the size of the dbuf metadata cache, \fBdbuf_metadata_cache_max_bytes\fR,
102 to a log2 fraction of the target arc size.
103 .sp
104 Default value: \fB6\fR.
105 .RE
106
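.sp
.LP
As an illustrative sketch of the two shift parameters above, the default cache
targets can be derived from the ARC target size; the \fBc\fR field of the
arcstats kstat is assumed to hold the ARC target size in bytes, and the shifts
shown are the documented defaults.
.sp
.nf
c=$(awk '$1 == "c" {print $3}' /proc/spl/kstat/zfs/arcstats)
echo "dbuf cache target:          $((c >> 5)) bytes (1/2^5 = 1/32 of ARC target)"
echo "dbuf metadata cache target: $((c >> 6)) bytes (1/2^6 = 1/64 of ARC target)"
.fi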
107 .sp
108 .ne 2
109 .na
110 \fBdmu_prefetch_max\fR (int)
111 .ad
112 .RS 12n
113 Limit the amount of data that can be prefetched with one call to this value (in bytes).
114 This helps to limit the amount of memory that can be used by prefetching.
115 .sp
116 Default value: \fB134,217,728\fR (128MB).
117 .RE
118
119 .sp
120 .ne 2
121 .na
122 \fBignore_hole_birth\fR (int)
123 .ad
124 .RS 12n
125 This is an alias for \fBsend_holes_without_birth_time\fR.
126 .RE
127
128 .sp
129 .ne 2
130 .na
131 \fBl2arc_feed_again\fR (int)
132 .ad
133 .RS 12n
134 Turbo L2ARC warm-up. When the L2ARC is cold the fill interval will be set as
135 fast as possible.
136 .sp
137 Use \fB1\fR for yes (default) and \fB0\fR to disable.
138 .RE
139
140 .sp
141 .ne 2
142 .na
143 \fBl2arc_feed_min_ms\fR (ulong)
144 .ad
145 .RS 12n
146 Min feed interval in milliseconds. Only applicable when
147 \fBl2arc_feed_again=1\fR.
148 .sp
149 Default value: \fB200\fR.
150 .RE
151
152 .sp
153 .ne 2
154 .na
155 \fBl2arc_feed_secs\fR (ulong)
156 .ad
157 .RS 12n
158 Seconds between L2ARC writing
159 .sp
160 Default value: \fB1\fR.
161 .RE
162
163 .sp
164 .ne 2
165 .na
166 \fBl2arc_headroom\fR (ulong)
167 .ad
168 .RS 12n
169 How far through the ARC lists to search for L2ARC cacheable content, expressed
170 as a multiplier of \fBl2arc_write_max\fR
171 .sp
172 Default value: \fB2\fR.
173 .RE
174
175 .sp
176 .ne 2
177 .na
178 \fBl2arc_headroom_boost\fR (ulong)
179 .ad
180 .RS 12n
181 Scales \fBl2arc_headroom\fR by this percentage when L2ARC contents are being
182 successfully compressed before writing. A value of 100 disables this feature.
183 .sp
184 Default value: \fB200\fR%.
185 .RE
186
187 .sp
188 .ne 2
189 .na
190 \fBl2arc_noprefetch\fR (int)
191 .ad
192 .RS 12n
193 Do not write buffers to L2ARC if they were prefetched but not used by
194 applications
195 .sp
196 Use \fB1\fR for yes (default) and \fB0\fR to disable.
197 .RE
198
199 .sp
200 .ne 2
201 .na
202 \fBl2arc_norw\fR (int)
203 .ad
204 .RS 12n
205 No reads during writes
206 .sp
207 Use \fB1\fR for yes and \fB0\fR for no (default).
208 .RE
209
210 .sp
211 .ne 2
212 .na
213 \fBl2arc_write_boost\fR (ulong)
214 .ad
215 .RS 12n
216 Cold L2ARC devices will have \fBl2arc_write_max\fR increased by this amount
217 while they remain cold.
218 .sp
219 Default value: \fB8,388,608\fR.
220 .RE
221
222 .sp
223 .ne 2
224 .na
225 \fBl2arc_write_max\fR (ulong)
226 .ad
227 .RS 12n
228 Max write bytes per interval
229 .sp
230 Default value: \fB8,388,608\fR.
231 .RE
232
233 .sp
234 .ne 2
235 .na
236 \fBmetaslab_aliquot\fR (ulong)
237 .ad
238 .RS 12n
239 Metaslab granularity, in bytes. This is roughly similar to what would be
240 referred to as the "stripe size" in traditional RAID arrays. In normal
241 operation, ZFS will try to write this amount of data to a top-level vdev
242 before moving on to the next one.
243 .sp
244 Default value: \fB524,288\fR.
245 .RE
246
247 .sp
248 .ne 2
249 .na
250 \fBmetaslab_bias_enabled\fR (int)
251 .ad
252 .RS 12n
253 Enable metaslab group biasing based on its vdev's over- or under-utilization
254 relative to the pool.
255 .sp
256 Use \fB1\fR for yes (default) and \fB0\fR for no.
257 .RE
258
259 .sp
260 .ne 2
261 .na
262 \fBmetaslab_force_ganging\fR (ulong)
263 .ad
264 .RS 12n
265 Make some blocks above this size (in bytes) be gang blocks. This option is used
266 by the test suite to facilitate testing.
267 .sp
268 Default value: \fB16,777,217\fR.
269 .RE
270
271 .sp
272 .ne 2
273 .na
274 \fBzfs_keep_log_spacemaps_at_export\fR (int)
275 .ad
276 .RS 12n
277 Prevent log spacemaps from being destroyed during pool exports and destroys.
278 .sp
279 Use \fB1\fR for yes and \fB0\fR for no (default).
280 .RE
281
282 .sp
283 .ne 2
284 .na
285 \fBzfs_metaslab_segment_weight_enabled\fR (int)
286 .ad
287 .RS 12n
288 Enable/disable segment-based metaslab selection.
289 .sp
290 Use \fB1\fR for yes (default) and \fB0\fR for no.
291 .RE
292
293 .sp
294 .ne 2
295 .na
296 \fBzfs_metaslab_switch_threshold\fR (int)
297 .ad
298 .RS 12n
299 When using segment-based metaslab selection, continue allocating
300 from the active metaslab until \fBzfs_metaslab_switch_threshold\fR
301 worth of buckets have been exhausted.
302 .sp
303 Default value: \fB2\fR.
304 .RE
305
306 .sp
307 .ne 2
308 .na
309 \fBmetaslab_debug_load\fR (int)
310 .ad
311 .RS 12n
312 Load all metaslabs during pool import.
313 .sp
314 Use \fB1\fR for yes and \fB0\fR for no (default).
315 .RE
316
317 .sp
318 .ne 2
319 .na
320 \fBmetaslab_debug_unload\fR (int)
321 .ad
322 .RS 12n
323 Prevent metaslabs from being unloaded.
324 .sp
325 Use \fB1\fR for yes and \fB0\fR for no (default).
326 .RE
327
328 .sp
329 .ne 2
330 .na
331 \fBmetaslab_fragmentation_factor_enabled\fR (int)
332 .ad
333 .RS 12n
334 Enable use of the fragmentation metric in computing metaslab weights.
335 .sp
336 Use \fB1\fR for yes (default) and \fB0\fR for no.
337 .RE
338
339 .sp
340 .ne 2
341 .na
342 \fBmetaslab_df_max_search\fR (int)
343 .ad
344 .RS 12n
345 Maximum distance to search forward from the last offset. Without this limit,
346 fragmented pools can see >100,000 iterations and metaslab_block_picker()
347 becomes the performance limiting factor on high-performance storage.
348
349 With the default setting of 16MB, we typically see less than 500 iterations,
350 even with very fragmented, ashift=9 pools. The maximum number of iterations
351 possible is: \fBmetaslab_df_max_search / (2 * (1<<ashift))\fR.
352 With the default setting of 16MB this is 16*1024 (with ashift=9) or 2048
353 (with ashift=12).
354 .sp
355 Default value: \fB16,777,216\fR (16MB)
356 .RE
357
358 .sp
359 .ne 2
360 .na
361 \fBmetaslab_df_use_largest_segment\fR (int)
362 .ad
363 .RS 12n
364 If we are not searching forward (due to metaslab_df_max_search,
365 metaslab_df_free_pct, or metaslab_df_alloc_threshold), this tunable controls
366 what segment is used. If it is set, we will use the largest free segment.
367 If it is not set, we will use a segment of exactly the requested size (or
368 larger).
369 .sp
370 Use \fB1\fR for yes and \fB0\fR for no (default).
371 .RE
372
373 .sp
374 .ne 2
375 .na
376 \fBzfs_metaslab_max_size_cache_sec\fR (ulong)
377 .ad
378 .RS 12n
379 When we unload a metaslab, we cache the size of the largest free chunk. We use
380 that cached size to determine whether or not to load a metaslab for a given
381 allocation. As more frees accumulate in that metaslab while it's unloaded, the
382 cached max size becomes less and less accurate. After a number of seconds
383 controlled by this tunable, we stop considering the cached max size and start
384 considering only the histogram instead.
385 .sp
386 Default value: \fB3600 seconds\fR (one hour)
387 .RE
388
389 .sp
390 .ne 2
391 .na
392 \fBzfs_metaslab_mem_limit\fR (int)
393 .ad
394 .RS 12n
395 When we are loading a new metaslab, we check the amount of memory being used
396 to store metaslab range trees. If it is over a threshold, we attempt to unload
397 the least recently used metaslab to prevent the system from clogging all of
398 its memory with range trees. This tunable sets the percentage of total system
399 memory that is the threshold.
400 .sp
401 Default value: \fB25 percent\fR
402 .RE
403
404 .sp
405 .ne 2
406 .na
407 \fBzfs_vdev_default_ms_count\fR (int)
408 .ad
409 .RS 12n
410 When a vdev is added target this number of metaslabs per top-level vdev.
411 .sp
412 Default value: \fB200\fR.
413 .RE
414
415 .sp
416 .ne 2
417 .na
418 \fBzfs_vdev_default_ms_shift\fR (int)
419 .ad
420 .RS 12n
421 Default limit for metaslab size.
422 .sp
423 Default value: \fB29\fR [meaning (1 << 29) = 512MB].
424 .RE
425
426 .sp
427 .ne 2
428 .na
429 \fBzfs_vdev_min_ms_count\fR (int)
430 .ad
431 .RS 12n
432 Minimum number of metaslabs to create in a top-level vdev.
433 .sp
434 Default value: \fB16\fR.
435 .RE
436
437 .sp
438 .ne 2
439 .na
440 \fBvdev_ms_count_limit\fR (int)
441 .ad
442 .RS 12n
443 Practical upper limit of total metaslabs per top-level vdev.
444 .sp
445 Default value: \fB131,072\fR.
446 .RE
447
448 .sp
449 .ne 2
450 .na
451 \fBmetaslab_preload_enabled\fR (int)
452 .ad
453 .RS 12n
454 Enable metaslab group preloading.
455 .sp
456 Use \fB1\fR for yes (default) and \fB0\fR for no.
457 .RE
458
459 .sp
460 .ne 2
461 .na
462 \fBmetaslab_lba_weighting_enabled\fR (int)
463 .ad
464 .RS 12n
465 Give more weight to metaslabs with lower LBAs, assuming they have
466 greater bandwidth as is typically the case on a modern constant
467 angular velocity disk drive.
468 .sp
469 Use \fB1\fR for yes (default) and \fB0\fR for no.
470 .RE
471
472 .sp
473 .ne 2
474 .na
475 \fBmetaslab_unload_delay\fR (int)
476 .ad
477 .RS 12n
478 After a metaslab is used, we keep it loaded for this many txgs, to attempt to
479 reduce unnecessary reloading. Note that both this many txgs and
480 \fBmetaslab_unload_delay_ms\fR milliseconds must pass before unloading will
481 occur.
482 .sp
483 Default value: \fB32\fR.
484 .RE
485
486 .sp
487 .ne 2
488 .na
489 \fBmetaslab_unload_delay_ms\fR (int)
490 .ad
491 .RS 12n
492 After a metaslab is used, we keep it loaded for this many milliseconds, to
493 attempt to reduce unnecessary reloading. Note that both this many
494 milliseconds and \fBmetaslab_unload_delay\fR txgs must pass before unloading
495 will occur.
496 .sp
497 Default value: \fB600000\fR (ten minutes).
498 .RE
499
500 .sp
501 .ne 2
502 .na
503 \fBsend_holes_without_birth_time\fR (int)
504 .ad
505 .RS 12n
506 When set, the hole_birth optimization will not be used, and all holes will
507 always be sent on zfs send. This is useful if you suspect your datasets are
508 affected by a bug in hole_birth.
509 .sp
510 Use \fB1\fR for on (default) and \fB0\fR for off.
511 .RE
512
513 .sp
514 .ne 2
515 .na
516 \fBspa_config_path\fR (charp)
517 .ad
518 .RS 12n
519 SPA config file
520 .sp
521 Default value: \fB/etc/zfs/zpool.cache\fR.
522 .RE
523
524 .sp
525 .ne 2
526 .na
527 \fBspa_asize_inflation\fR (int)
528 .ad
529 .RS 12n
530 Multiplication factor used to estimate actual disk consumption from the
531 size of data being written. The default value is a worst case estimate,
532 but lower values may be valid for a given pool depending on its
533 configuration. Pool administrators who understand the factors involved
534 may wish to specify a more realistic inflation factor, particularly if
535 they operate close to quota or capacity limits.
536 .sp
537 Default value: \fB24\fR.
538 .RE
539
540 .sp
541 .ne 2
542 .na
543 \fBspa_load_print_vdev_tree\fR (int)
544 .ad
545 .RS 12n
546 Whether to print the vdev tree in the debugging message buffer during pool import.
547 Use 0 to disable and 1 to enable.
548 .sp
549 Default value: \fB0\fR.
550 .RE
551
552 .sp
553 .ne 2
554 .na
555 \fBspa_load_verify_data\fR (int)
556 .ad
557 .RS 12n
558 Whether to traverse data blocks during an "extreme rewind" (\fB-X\fR)
559 import. Use 0 to disable and 1 to enable.
560
561 An extreme rewind import normally performs a full traversal of all
562 blocks in the pool for verification. If this parameter is set to 0,
563 the traversal skips non-metadata blocks. It can be toggled once the
564 import has started to stop or start the traversal of non-metadata blocks.
565 .sp
566 Default value: \fB1\fR.
567 .RE
568
569 .sp
570 .ne 2
571 .na
572 \fBspa_load_verify_metadata\fR (int)
573 .ad
574 .RS 12n
575 Whether to traverse blocks during an "extreme rewind" (\fB-X\fR)
576 pool import. Use 0 to disable and 1 to enable.
577
578 An extreme rewind import normally performs a full traversal of all
579 blocks in the pool for verification. If this parameter is set to 0,
580 the traversal is not performed. It can be toggled once the import has
581 started to stop or start the traversal.
582 .sp
583 Default value: \fB1\fR.
584 .RE
585
586 .sp
587 .ne 2
588 .na
589 \fBspa_load_verify_shift\fR (int)
590 .ad
591 .RS 12n
592 Sets the maximum number of bytes to consume during pool import to a log2
593 fraction of the target arc size.
594 .sp
595 Default value: \fB4\fR.
596 .RE
597
598 .sp
599 .ne 2
600 .na
601 \fBspa_slop_shift\fR (int)
602 .ad
603 .RS 12n
604 Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space
605 in the pool to be consumed. This ensures that we don't run the pool
606 completely out of space, due to unaccounted changes (e.g. to the MOS).
607 It also limits the worst-case time to allocate space. If we have
608 less than this amount of free space, most ZPL operations (e.g. write,
609 create) will return ENOSPC.
610 .sp
611 Default value: \fB5\fR.
612 .RE
613
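.sp
.LP
A worked sketch of the fraction above; the pool size is illustrative only.
.sp
.nf
#   spa_slop_shift = 5  ->  1/2^5 = 1/32, roughly 3.2% of pool space reserved
#   e.g. a 10 TiB pool holds back about 10 TiB / 32 = 320 GiB as slop space
.fi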
614 .sp
615 .ne 2
616 .na
617 \fBvdev_removal_max_span\fR (int)
618 .ad
619 .RS 12n
620 During top-level vdev removal, chunks of data are copied from the vdev
621 which may include free space in order to trade bandwidth for IOPS.
622 This parameter determines the maximum span of free space (in bytes)
623 which will be included as "unnecessary" data in a chunk of copied data.
624
625 The default value here was chosen to align with
626 \fBzfs_vdev_read_gap_limit\fR, which is a similar concept when doing
627 regular reads (but there's no reason it has to be the same).
628 .sp
629 Default value: \fB32,768\fR.
630 .RE
631
632 .sp
633 .ne 2
634 .na
635 \fBzap_iterate_prefetch\fR (int)
636 .ad
637 .RS 12n
638 If this is set, when we start iterating over a ZAP object, zfs will prefetch
639 the entire object (all leaf blocks). However, this is limited by
640 \fBdmu_prefetch_max\fR.
641 .sp
642 Use \fB1\fR for on (default) and \fB0\fR for off.
643 .RE
644
645 .sp
646 .ne 2
647 .na
648 \fBzfetch_array_rd_sz\fR (ulong)
649 .ad
650 .RS 12n
651 If prefetching is enabled, disable prefetching for reads larger than this size.
652 .sp
653 Default value: \fB1,048,576\fR.
654 .RE
655
656 .sp
657 .ne 2
658 .na
659 \fBzfetch_max_distance\fR (uint)
660 .ad
661 .RS 12n
662 Max bytes to prefetch per stream (default 8MB).
663 .sp
664 Default value: \fB8,388,608\fR.
665 .RE
666
667 .sp
668 .ne 2
669 .na
670 \fBzfetch_max_streams\fR (uint)
671 .ad
672 .RS 12n
673 Max number of streams per zfetch (prefetch streams per file).
674 .sp
675 Default value: \fB8\fR.
676 .RE
677
678 .sp
679 .ne 2
680 .na
681 \fBzfetch_min_sec_reap\fR (uint)
682 .ad
683 .RS 12n
684 Min time before an active prefetch stream can be reclaimed
685 .sp
686 Default value: \fB2\fR.
687 .RE
688
689 .sp
690 .ne 2
691 .na
692 \fBzfs_abd_scatter_min_size\fR (uint)
693 .ad
694 .RS 12n
695 This is the minimum allocation size that will use scatter (page-based)
696 ABD's. Smaller allocations will use linear ABD's.
697 .sp
698 Default value: \fB1536\fR (512B and 1KB allocations will be linear).
699 .RE
700
701 .sp
702 .ne 2
703 .na
704 \fBzfs_arc_dnode_limit\fR (ulong)
705 .ad
706 .RS 12n
707 When the number of bytes consumed by dnodes in the ARC exceeds this number of
708 bytes, try to unpin some of it in response to demand for non-metadata. This
709 value acts as a ceiling on the amount of dnode metadata, and defaults to 0,
710 which indicates that a percentage based on \fBzfs_arc_dnode_limit_percent\fR of
711 the ARC meta buffers may be used for dnodes.
712
713 See also \fBzfs_arc_meta_prune\fR which serves a similar purpose but is used
714 when the amount of metadata in the ARC exceeds \fBzfs_arc_meta_limit\fR rather
715 than in response to overall demand for non-metadata.
716
717 .sp
718 Default value: \fB0\fR.
719 .RE
720
721 .sp
722 .ne 2
723 .na
724 \fBzfs_arc_dnode_limit_percent\fR (ulong)
725 .ad
726 .RS 12n
727 Percentage that can be consumed by dnodes of ARC meta buffers.
728 .sp
729 See also \fBzfs_arc_dnode_limit\fR which serves a similar purpose but has a
730 higher priority if set to nonzero value.
731 .sp
732 Default value: \fB10\fR%.
733 .RE
734
735 .sp
736 .ne 2
737 .na
738 \fBzfs_arc_dnode_reduce_percent\fR (ulong)
739 .ad
740 .RS 12n
741 Percentage of ARC dnodes to try to scan in response to demand for non-metadata
742 when the number of bytes consumed by dnodes exceeds \fBzfs_arc_dnode_limit\fR.
743
744 .sp
745 Default value: \fB10\fR% of the number of dnodes in the ARC.
746 .RE
747
748 .sp
749 .ne 2
750 .na
751 \fBzfs_arc_average_blocksize\fR (int)
752 .ad
753 .RS 12n
754 The ARC's buffer hash table is sized based on the assumption of an average
755 block size of \fBzfs_arc_average_blocksize\fR (default 8K). This works out
756 to roughly 1MB of hash table per 1GB of physical memory with 8-byte pointers.
757 For configurations with a known larger average block size this value can be
758 increased to reduce the memory footprint.
759
760 .sp
761 Default value: \fB8192\fR.
762 .RE
763
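.sp
.LP
The sizing claim above works out as follows; this is a sketch of the
arithmetic, not an exact accounting of the hash table implementation.
.sp
.nf
#   1 GiB of physical memory / 8 KiB average block size = 131,072 buffers
#   131,072 buffers * 8 bytes per pointer = 1 MiB of hash table
.fi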
764 .sp
765 .ne 2
766 .na
767 \fBzfs_arc_evict_batch_limit\fR (int)
768 .ad
769 .RS 12n
770 Number of ARC headers to evict per sub-list before proceeding to another sub-list.
771 This batch-style operation prevents entire sub-lists from being evicted at once
772 but comes at a cost of additional unlocking and locking.
773 .sp
774 Default value: \fB10\fR.
775 .RE
776
777 .sp
778 .ne 2
779 .na
780 \fBzfs_arc_grow_retry\fR (int)
781 .ad
782 .RS 12n
783 If set to a non-zero value, it will replace the arc_grow_retry value with this value.
784 The arc_grow_retry value (default 5) is the number of seconds the ARC will wait before
785 trying to resume growth after a memory pressure event.
786 .sp
787 Default value: \fB0\fR.
788 .RE
789
790 .sp
791 .ne 2
792 .na
793 \fBzfs_arc_lotsfree_percent\fR (int)
794 .ad
795 .RS 12n
796 Throttle I/O when free system memory drops below this percentage of total
797 system memory. Setting this value to 0 will disable the throttle.
798 .sp
799 Default value: \fB10\fR%.
800 .RE
801
802 .sp
803 .ne 2
804 .na
805 \fBzfs_arc_max\fR (ulong)
806 .ad
807 .RS 12n
808 Maximum size of the ARC in bytes. If set to 0 then it will consume 1/2 of system
809 RAM. This value must be at least 67108864 (64 megabytes).
810 .sp
811 This value can be changed dynamically with some caveats. It cannot be set back
812 to 0 while running and reducing it below the current ARC size will not cause
813 the ARC to shrink without memory pressure to induce shrinking.
814 .sp
815 Default value: \fB0\fR.
816 .RE
817
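.sp
.LP
For example, the ARC can be capped at runtime or at module load time; this is
a sketch assuming a typical Linux installation, and the 8 GiB figure is
arbitrary. Note the caveats above about reducing the limit below the current
ARC size.
.sp
.nf
# Cap the ARC at 8 GiB (8589934592 bytes) at runtime
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Make the cap persistent across module loads
echo "options zfs zfs_arc_max=8589934592" >> /etc/modprobe.d/zfs.conf
.fi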
818 .sp
819 .ne 2
820 .na
821 \fBzfs_arc_meta_adjust_restarts\fR (ulong)
822 .ad
823 .RS 12n
824 The number of restart passes to make while scanning the ARC, attempting to
825 free buffers in order to stay below the \fBzfs_arc_meta_limit\fR.
826 This value should not need to be tuned but is available to facilitate
827 performance analysis.
828 .sp
829 Default value: \fB4096\fR.
830 .RE
831
832 .sp
833 .ne 2
834 .na
835 \fBzfs_arc_meta_limit\fR (ulong)
836 .ad
837 .RS 12n
838 The maximum allowed size in bytes that meta data buffers are allowed to
839 consume in the ARC. When this limit is reached meta data buffers will
840 be reclaimed even if the overall arc_c_max has not been reached. This
841 value defaults to 0, which indicates that a percentage based on
842 \fBzfs_arc_meta_limit_percent\fR of the ARC may be used for meta data.
843 .sp
844 This value may be changed dynamically except that it cannot be set back to 0
845 for a specific percent of the ARC; it must be set to an explicit value.
846 .sp
847 Default value: \fB0\fR.
848 .RE
849
850 .sp
851 .ne 2
852 .na
853 \fBzfs_arc_meta_limit_percent\fR (ulong)
854 .ad
855 .RS 12n
856 Percentage of ARC buffers that can be used for meta data.
857
858 See also \fBzfs_arc_meta_limit\fR which serves a similar purpose but has a
859 higher priority if set to nonzero value.
860
861 .sp
862 Default value: \fB75\fR%.
863 .RE
864
865 .sp
866 .ne 2
867 .na
868 \fBzfs_arc_meta_min\fR (ulong)
869 .ad
870 .RS 12n
871 The minimum allowed size in bytes that meta data buffers may consume in
872 the ARC. This value defaults to 0 which disables a floor on the amount
873 of the ARC devoted to meta data.
874 .sp
875 Default value: \fB0\fR.
876 .RE
877
878 .sp
879 .ne 2
880 .na
881 \fBzfs_arc_meta_prune\fR (int)
882 .ad
883 .RS 12n
884 The number of dentries and inodes to be scanned looking for entries
885 which can be dropped. This may be required when the ARC reaches the
886 \fBzfs_arc_meta_limit\fR because dentries and inodes can pin buffers
887 in the ARC. Increasing this value will cause the dentry and inode caches
888 to be pruned more aggressively. Setting this value to 0 will disable
889 pruning the inode and dentry caches.
890 .sp
891 Default value: \fB10,000\fR.
892 .RE
893
894 .sp
895 .ne 2
896 .na
897 \fBzfs_arc_meta_strategy\fR (int)
898 .ad
899 .RS 12n
900 Define the strategy for ARC meta data buffer eviction (meta reclaim strategy).
901 A value of 0 (META_ONLY) will evict only the ARC meta data buffers.
902 A value of 1 (BALANCED) indicates that additional data buffers may be evicted if
903 that is required in order to evict the required number of meta data buffers.
904 .sp
905 Default value: \fB1\fR.
906 .RE
907
908 .sp
909 .ne 2
910 .na
911 \fBzfs_arc_min\fR (ulong)
912 .ad
913 .RS 12n
914 Minimum size of the ARC in bytes. If set to 0 then arc_c_min will default to
915 consuming the larger of 32M or 1/32 of total system memory.
916 .sp
917 Default value: \fB0\fR.
918 .RE
919
920 .sp
921 .ne 2
922 .na
923 \fBzfs_arc_min_prefetch_ms\fR (int)
924 .ad
925 .RS 12n
926 Minimum time prefetched blocks are locked in the ARC, specified in ms.
927 A value of \fB0\fR will default to 1000 ms.
928 .sp
929 Default value: \fB0\fR.
930 .RE
931
932 .sp
933 .ne 2
934 .na
935 \fBzfs_arc_min_prescient_prefetch_ms\fR (int)
936 .ad
937 .RS 12n
938 Minimum time "prescient prefetched" blocks are locked in the ARC, specified
939 in ms. These blocks are meant to be prefetched fairly aggressively ahead of
940 the code that may use them. A value of \fB0\fR will default to 6000 ms.
941 .sp
942 Default value: \fB0\fR.
943 .RE
944
945 .sp
946 .ne 2
947 .na
948 \fBzfs_max_missing_tvds\fR (int)
949 .ad
950 .RS 12n
951 Number of missing top-level vdevs which will be allowed during
952 pool import (only in read-only mode).
953 .sp
954 Default value: \fB0\fR
955 .RE
956
957 .sp
958 .ne 2
959 .na
960 \fBzfs_multilist_num_sublists\fR (int)
961 .ad
962 .RS 12n
963 To allow more fine-grained locking, each ARC state contains a series
964 of lists for both data and meta data objects. Locking is performed at
965 the level of these "sub-lists". This parameter controls the number of
966 sub-lists per ARC state, and also applies to other uses of the
967 multilist data structure.
968 .sp
969 Default value: \fB4\fR or the number of online CPUs, whichever is greater
970 .RE
971
972 .sp
973 .ne 2
974 .na
975 \fBzfs_arc_overflow_shift\fR (int)
976 .ad
977 .RS 12n
978 The ARC size is considered to be overflowing if it exceeds the current
979 ARC target size (arc_c) by a threshold determined by this parameter.
980 The threshold is calculated as a fraction of arc_c using the formula
981 "arc_c >> \fBzfs_arc_overflow_shift\fR".
982
983 The default value of 8 causes the ARC to be considered to be overflowing
984 if it exceeds the target size by 1/256th (approximately 0.4%) of the target size.
985
986 When the ARC is overflowing, new buffer allocations are stalled until
987 the reclaim thread catches up and the overflow condition no longer exists.
988 .sp
989 Default value: \fB8\fR.
990 .RE
991
992 .sp
993 .ne 2
994 .na
996 \fBzfs_arc_p_min_shift\fR (int)
997 .ad
998 .RS 12n
999 If set to a non-zero value, this will update arc_p_min_shift (default 4)
1000 with the new value.
1001 arc_p_min_shift is used as a shift of arc_c when calculating both the
1002 minimum and maximum arc_p.
1003 .sp
1004 Default value: \fB0\fR.
1005 .RE
1006
1007 .sp
1008 .ne 2
1009 .na
1010 \fBzfs_arc_p_dampener_disable\fR (int)
1011 .ad
1012 .RS 12n
1013 Disable arc_p adapt dampener
1014 .sp
1015 Use \fB1\fR for yes (default) and \fB0\fR to disable.
1016 .RE
1017
1018 .sp
1019 .ne 2
1020 .na
1021 \fBzfs_arc_shrink_shift\fR (int)
1022 .ad
1023 .RS 12n
1024 If set to a non-zero value, this will update arc_shrink_shift (default 7)
1025 with the new value.
1026 .sp
1027 Default value: \fB0\fR.
1028 .RE
1029
1030 .sp
1031 .ne 2
1032 .na
1033 \fBzfs_arc_pc_percent\fR (uint)
1034 .ad
1035 .RS 12n
1036 Percent of pagecache to reclaim arc to
1037
1038 This tunable allows ZFS arc to play more nicely with the kernel's LRU
1039 pagecache. It can guarantee that the arc size won't collapse under scanning
1040 pressure on the pagecache, yet still allows arc to be reclaimed down to
1041 zfs_arc_min if necessary. This value is specified as percent of pagecache
1042 size (as measured by NR_FILE_PAGES) where that percent may exceed 100. This
1043 only operates during memory pressure/reclaim.
1044 .sp
1045 Default value: \fB0\fR% (disabled).
1046 .RE
1047
1048 .sp
1049 .ne 2
1050 .na
1051 \fBzfs_arc_sys_free\fR (ulong)
1052 .ad
1053 .RS 12n
1054 The target number of bytes the ARC should leave as free memory on the system.
1055 Defaults to the larger of 1/64 of physical memory or 512K. Setting this
1056 option to a non-zero value will override the default.
1057 .sp
1058 Default value: \fB0\fR.
1059 .RE
1060
1061 .sp
1062 .ne 2
1063 .na
1064 \fBzfs_autoimport_disable\fR (int)
1065 .ad
1066 .RS 12n
1067 Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR).
1068 .sp
1069 Use \fB1\fR for yes (default) and \fB0\fR for no.
1070 .RE
1071
1072 .sp
1073 .ne 2
1074 .na
1075 \fBzfs_checksums_per_second\fR (int)
1076 .ad
1077 .RS 12n
1078 Rate limit checksum events to this many per second. Note that this should
1079 not be set below the zed thresholds (currently 10 checksums over 10 sec)
1080 or else zed may not trigger any action.
1081 .sp
1082 Default value: 20
1083 .RE
1084
1085 .sp
1086 .ne 2
1087 .na
1088 \fBzfs_commit_timeout_pct\fR (int)
1089 .ad
1090 .RS 12n
1091 This controls the amount of time that a ZIL block (lwb) will remain "open"
1092 when it isn't "full", and it has a thread waiting for it to be committed to
1093 stable storage. The timeout is scaled based on a percentage of the last lwb
1094 latency to avoid significantly impacting the latency of each individual
1095 transaction record (itx).
1096 .sp
1097 Default value: \fB5\fR%.
1098 .RE
1099
1100 .sp
1101 .ne 2
1102 .na
1103 \fBzfs_condense_indirect_vdevs_enable\fR (int)
1104 .ad
1105 .RS 12n
1106 Enable condensing indirect vdev mappings. When set to a non-zero value,
1107 attempt to condense indirect vdev mappings if the mapping uses more than
1108 \fBzfs_condense_min_mapping_bytes\fR bytes of memory and if the obsolete
1109 space map object uses more than \fBzfs_condense_max_obsolete_bytes\fR
1110 bytes on-disk. The condensing process is an attempt to save memory by
1111 removing obsolete mappings.
1112 .sp
1113 Default value: \fB1\fR.
1114 .RE
1115
1116 .sp
1117 .ne 2
1118 .na
1119 \fBzfs_condense_max_obsolete_bytes\fR (ulong)
1120 .ad
1121 .RS 12n
1122 Only attempt to condense indirect vdev mappings if the on-disk size
1123 of the obsolete space map object is greater than this number of bytes
1124 (see \fBzfs_condense_indirect_vdevs_enable\fR).
1125 .sp
1126 Default value: \fB1,073,741,824\fR.
1127 .RE
1128
1129 .sp
1130 .ne 2
1131 .na
1132 \fBzfs_condense_min_mapping_bytes\fR (ulong)
1133 .ad
1134 .RS 12n
1135 Minimum size vdev mapping to attempt to condense (see
1136 \fBzfs_condense_indirect_vdevs_enable\fR).
1137 .sp
1138 Default value: \fB131,072\fR.
1139 .RE
1140
1141 .sp
1142 .ne 2
1143 .na
1144 \fBzfs_dbgmsg_enable\fR (int)
1145 .ad
1146 .RS 12n
1147 Internally ZFS keeps a small log to facilitate debugging. By default the log
1148 is disabled, to enable it set this option to 1. The contents of the log can
1149 be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to
1150 this proc file clears the log.
1151 .sp
1152 Default value: \fB0\fR.
1153 .RE
1154
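.sp
.LP
For example, using the paths documented above:
.sp
.nf
# Enable the internal debug log, read it, then clear it
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg
echo 0 > /proc/spl/kstat/zfs/dbgmsg
.fi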
1155 .sp
1156 .ne 2
1157 .na
1158 \fBzfs_dbgmsg_maxsize\fR (int)
1159 .ad
1160 .RS 12n
1161 The maximum size in bytes of the internal ZFS debug log.
1162 .sp
1163 Default value: \fB4M\fR.
1164 .RE
1165
1166 .sp
1167 .ne 2
1168 .na
1169 \fBzfs_dbuf_state_index\fR (int)
1170 .ad
1171 .RS 12n
1172 This feature is currently unused. It is normally used for controlling what
1173 reporting is available under /proc/spl/kstat/zfs.
1174 .sp
1175 Default value: \fB0\fR.
1176 .RE
1177
1178 .sp
1179 .ne 2
1180 .na
1181 \fBzfs_deadman_enabled\fR (int)
1182 .ad
1183 .RS 12n
1184 When a pool sync operation takes longer than \fBzfs_deadman_synctime_ms\fR
1185 milliseconds, or when an individual I/O takes longer than
1186 \fBzfs_deadman_ziotime_ms\fR milliseconds, then the operation is considered to
1187 be "hung". If \fBzfs_deadman_enabled\fR is set then the deadman behavior is
1188 invoked as described by the \fBzfs_deadman_failmode\fR module option.
1189 By default the deadman is enabled and configured to \fBwait\fR which results
1190 in "hung" I/Os only being logged. The deadman is automatically disabled
1191 when a pool gets suspended.
1192 .sp
1193 Default value: \fB1\fR.
1194 .RE
1195
1196 .sp
1197 .ne 2
1198 .na
1199 \fBzfs_deadman_failmode\fR (charp)
1200 .ad
1201 .RS 12n
1202 Controls the failure behavior when the deadman detects a "hung" I/O. Valid
1203 values are \fBwait\fR, \fBcontinue\fR, and \fBpanic\fR.
1204 .sp
1205 \fBwait\fR - Wait for a "hung" I/O to complete. For each "hung" I/O a
1206 "deadman" event will be posted describing that I/O.
1207 .sp
1208 \fBcontinue\fR - Attempt to recover from a "hung" I/O by re-dispatching it
1209 to the I/O pipeline if possible.
1210 .sp
1211 \fBpanic\fR - Panic the system. This can be used to facilitate an automatic
1212 fail-over to a properly configured fail-over partner.
1213 .sp
1214 Default value: \fBwait\fR.
1215 .RE
1216
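.sp
.LP
For example, a fail-over configuration might select the \fBpanic\fR behavior;
this is a sketch only, and whether panicking is appropriate depends entirely
on the cluster design.
.sp
.nf
# Have the deadman panic the node so a configured partner can take over
echo panic > /sys/module/zfs/parameters/zfs_deadman_failmode

# Or set it at module load time
echo "options zfs zfs_deadman_failmode=panic" >> /etc/modprobe.d/zfs.conf
.fi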
1217 .sp
1218 .ne 2
1219 .na
1220 \fBzfs_deadman_checktime_ms\fR (int)
1221 .ad
1222 .RS 12n
1223 Check time in milliseconds. This defines the frequency at which we check
1224 for hung I/O and potentially invoke the \fBzfs_deadman_failmode\fR behavior.
1225 .sp
1226 Default value: \fB60,000\fR.
1227 .RE
1228
1229 .sp
1230 .ne 2
1231 .na
1232 \fBzfs_deadman_synctime_ms\fR (ulong)
1233 .ad
1234 .RS 12n
1235 Interval in milliseconds after which the deadman is triggered and also
1236 the interval after which a pool sync operation is considered to be "hung".
1237 Once this limit is exceeded the deadman will be invoked every
1238 \fBzfs_deadman_checktime_ms\fR milliseconds until the pool sync completes.
1239 .sp
1240 Default value: \fB600,000\fR.
1241 .RE
1242
1243 .sp
1244 .ne 2
1245 .na
1246 \fBzfs_deadman_ziotime_ms\fR (ulong)
1247 .ad
1248 .RS 12n
1249 Interval in milliseconds after which the deadman is triggered and an
1250 individual I/O operation is considered to be "hung". As long as the I/O
1251 remains "hung" the deadman will be invoked every \fBzfs_deadman_checktime_ms\fR
1252 milliseconds until the I/O completes.
1253 .sp
1254 Default value: \fB300,000\fR.
1255 .RE
1256
1257 .sp
1258 .ne 2
1259 .na
1260 \fBzfs_dedup_prefetch\fR (int)
1261 .ad
1262 .RS 12n
1263 Enable prefetching dedup-ed blks
1264 .sp
1265 Use \fB1\fR for yes and \fB0\fR to disable (default).
1266 .RE
1267
1268 .sp
1269 .ne 2
1270 .na
1271 \fBzfs_delay_min_dirty_percent\fR (int)
1272 .ad
1273 .RS 12n
1274 Start to delay each transaction once there is this amount of dirty data,
1275 expressed as a percentage of \fBzfs_dirty_data_max\fR.
1276 This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
1277 See the section "ZFS TRANSACTION DELAY".
1278 .sp
1279 Default value: \fB60\fR%.
1280 .RE
1281
1282 .sp
1283 .ne 2
1284 .na
1285 \fBzfs_delay_scale\fR (int)
1286 .ad
1287 .RS 12n
1288 This controls how quickly the transaction delay approaches infinity.
1289 Larger values cause longer delays for a given amount of dirty data.
1290 .sp
1291 For the smoothest delay, this value should be about 1 billion divided
1292 by the maximum number of operations per second. This will smoothly
1293 handle between 10x and 1/10th this number.
1294 .sp
1295 See the section "ZFS TRANSACTION DELAY".
1296 .sp
1297 Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64.
1298 .sp
1299 Default value: \fB500,000\fR.
1300 .RE
1301
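.sp
.LP
A worked sketch of the guideline above; the operations-per-second figure is
illustrative only.
.sp
.nf
#   For a pool sustaining roughly 2,000 operations per second:
#   zfs_delay_scale ~ 1,000,000,000 / 2,000 = 500,000   (the default)
#   Sanity check: 500,000 * zfs_dirty_data_max must remain below 2^64.
.fi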
1302 .sp
1303 .ne 2
1304 .na
1305 \fBzfs_slow_io_events_per_second\fR (int)
1306 .ad
1307 .RS 12n
1308 Rate limit delay zevents (which report slow I/Os) to this many per second.
1309 .sp
1310 Default value: 20
1311 .RE
1312
1313 .sp
1314 .ne 2
1315 .na
1316 \fBzfs_unflushed_max_mem_amt\fR (ulong)
1317 .ad
1318 .RS 12n
1319 Upper-bound limit for unflushed metadata changes to be held by the
1320 log spacemap in memory (in bytes).
1321 .sp
1322 Default value: \fB1,073,741,824\fR (1GB).
1323 .RE
1324
1325 .sp
1326 .ne 2
1327 .na
1328 \fBzfs_unflushed_max_mem_ppm\fR (ulong)
1329 .ad
1330 .RS 12n
1331 Fraction of the overall system memory that ZFS allows to be used
1332 for unflushed metadata changes by the log spacemap, expressed in
1333 parts per million for finer granularity.
1334 .sp
1335 Default value: \fB1000\fR (i.e. 1000/1000000, limiting unflushed metadata
1336 changes to \fB0.1\fR% of memory)
1337 .RE
1338
1339 .sp
1340 .ne 2
1341 .na
1342 \fBzfs_unflushed_log_block_max\fR (ulong)
1343 .ad
1344 .RS 12n
1345 Describes the maximum number of log spacemap blocks allowed for each pool.
1346 The default value of 262144 means that the space in all the log spacemaps
1347 can add up to no more than 262144 blocks (which means 32GB of logical
1348 space before compression and ditto blocks, assuming that blocksize is
1349 128k).
1350 .sp
1351 This tunable is important because it involves a trade-off between import
1352 time after an unclean export and the frequency of flushing metaslabs.
1353 The higher this number is, the more log blocks we allow when the pool is
1354 active which means that we flush metaslabs less often and thus decrease
1355 the number of I/Os for spacemap updates per TXG.
1356 At the same time though, that means that in the event of an unclean export,
1357 there will be more log spacemap blocks for us to read, inducing overhead
1358 in the import time of the pool.
1359 The lower the number, the more flushing occurs and the quicker log blocks
1360 are destroyed as they become obsolete, which leaves fewer blocks to be
1361 read during import after a crash.
1362 .sp
1363 Each log spacemap block existing during pool import leads to approximately
1364 one extra logical I/O issued.
1365 This is the reason why this tunable is exposed in terms of blocks rather
1366 than space used.
1367 .sp
1368 Default value: \fB262144\fR (256K).
1369 .RE
1370
1371 .sp
1372 .ne 2
1373 .na
1374 \fBzfs_unflushed_log_block_min\fR (ulong)
1375 .ad
1376 .RS 12n
1377 If the number of metaslabs is small and our incoming rate is high, we
1378 could get into a situation that we are flushing all our metaslabs every
1379 TXG.
1380 Thus we always allow at least this many log blocks.
1381 .sp
1382 Default value: \fB1000\fR.
1383 .RE
1384
1385 .sp
1386 .ne 2
1387 .na
1388 \fBzfs_unflushed_log_block_pct\fR (ulong)
1389 .ad
1390 .RS 12n
1391 Tunable used to determine the number of blocks that can be used for
1392 the spacemap log, expressed as a percentage of the total number of
1393 metaslabs in the pool.
1394 .sp
1395 Default value: \fB400\fR (read as \fB400\fR% - meaning that the number
1396 of log spacemap blocks is capped at 4 times the number of
1397 metaslabs in the pool).
1398 .RE
1399
1400 .sp
1401 .ne 2
1402 .na
1403 \fBzfs_unlink_suspend_progress\fR (uint)
1404 .ad
1405 .RS 12n
1406 When enabled, files will not be asynchronously removed from the list of pending
1407 unlinks and the space they consume will be leaked. Once this option has been
1408 disabled and the dataset is remounted, the pending unlinks will be processed
1409 and the freed space returned to the pool.
1410 This option is used by the test suite to facilitate testing.
1411 .sp
1412 Uses \fB0\fR (default) to allow progress and \fB1\fR to pause progress.
1413 .RE
1414
1415 .sp
1416 .ne 2
1417 .na
1418 \fBzfs_delete_blocks\fR (ulong)
1419 .ad
1420 .RS 12n
1421 This is used to define a large file for the purposes of delete. Files
1422 containing more than \fBzfs_delete_blocks\fR blocks will be deleted asynchronously
1423 while smaller files are deleted synchronously. Decreasing this value will
1424 reduce the time spent in an unlink(2) system call at the expense of a longer
1425 delay before the freed space is available.
1426 .sp
1427 Default value: \fB20,480\fR.
1428 .RE
1429
1430 .sp
1431 .ne 2
1432 .na
1433 \fBzfs_dirty_data_max\fR (int)
1434 .ad
1435 .RS 12n
1436 Determines the dirty space limit in bytes. Once this limit is exceeded, new
1437 writes are halted until space frees up. This parameter takes precedence
1438 over \fBzfs_dirty_data_max_percent\fR.
1439 See the section "ZFS TRANSACTION DELAY".
1440 .sp
1441 Default value: \fB10\fR% of physical RAM, capped at \fBzfs_dirty_data_max_max\fR.
1442 .RE
1443
1444 .sp
1445 .ne 2
1446 .na
1447 \fBzfs_dirty_data_max_max\fR (int)
1448 .ad
1449 .RS 12n
1450 Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes.
1451 This limit is only enforced at module load time, and will be ignored if
1452 \fBzfs_dirty_data_max\fR is later changed. This parameter takes
1453 precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section
1454 "ZFS TRANSACTION DELAY".
1455 .sp
1456 Default value: \fB25\fR% of physical RAM.
1457 .RE
1458
1459 .sp
1460 .ne 2
1461 .na
1462 \fBzfs_dirty_data_max_max_percent\fR (int)
1463 .ad
1464 .RS 12n
1465 Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a
1466 percentage of physical RAM. This limit is only enforced at module load
1467 time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed.
1468 The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this
1469 one. See the section "ZFS TRANSACTION DELAY".
1470 .sp
1471 Default value: \fB25\fR%.
1472 .RE
1473
1474 .sp
1475 .ne 2
1476 .na
1477 \fBzfs_dirty_data_max_percent\fR (int)
1478 .ad
1479 .RS 12n
1480 Determines the dirty space limit, expressed as a percentage of all
1481 memory. Once this limit is exceeded, new writes are halted until space frees
1482 up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this
1483 one. See the section "ZFS TRANSACTION DELAY".
1484 .sp
1485 Default value: \fB10\fR%, subject to \fBzfs_dirty_data_max_max\fR.
1486 .RE
1487
1488 .sp
1489 .ne 2
1490 .na
1491 \fBzfs_dirty_data_sync_percent\fR (int)
1492 .ad
1493 .RS 12n
1494 Start syncing out a transaction group if there's at least this much dirty data
1495 as a percentage of \fBzfs_dirty_data_max\fR. This should be less than
1496 \fBzfs_vdev_async_write_active_min_dirty_percent\fR.
1497 .sp
1498 Default value: \fB20\fR% of \fBzfs_dirty_data_max\fR.
1499 .RE
1500
1501 .sp
1502 .ne 2
1503 .na
1504 \fBzfs_fletcher_4_impl\fR (string)
1505 .ad
1506 .RS 12n
1507 Select a fletcher 4 implementation.
1508 .sp
1509 Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR,
1510 \fBavx2\fR, \fBavx512f\fR, \fBavx512bw\fR, and \fBaarch64_neon\fR.
1511 All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction
1512 set extensions to be available and will only appear if ZFS detects that they are
1513 present at runtime. If multiple implementations of fletcher 4 are available,
1514 the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR
1515 results in the original CPU-based calculation being used. Selecting any option
1516 other than \fBfastest\fR and \fBscalar\fR results in vector instructions from
1517 the respective CPU instruction set being used.
1518 .sp
1519 Default value: \fBfastest\fR.
1520 .RE
1521
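.sp
.LP
For example, the implementation can be forced at runtime; reading the
parameter back typically lists the selectors detected on the running system.
.sp
.nf
# Force the portable scalar implementation, then confirm the selection
echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl
.fi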
1522 .sp
1523 .ne 2
1524 .na
1525 \fBzfs_free_bpobj_enabled\fR (int)
1526 .ad
1527 .RS 12n
1528 Enable/disable the processing of the free_bpobj object.
1529 .sp
1530 Default value: \fB1\fR.
1531 .RE
1532
1533 .sp
1534 .ne 2
1535 .na
1536 \fBzfs_async_block_max_blocks\fR (ulong)
1537 .ad
1538 .RS 12n
1539 Maximum number of blocks freed in a single txg.
1540 .sp
1541 Default value: \fB100,000\fR.
1542 .RE
1543
1544 .sp
1545 .ne 2
1546 .na
1547 \fBzfs_override_estimate_recordsize\fR (ulong)
1548 .ad
1549 .RS 12n
1550 Record size calculation override for zfs send estimates.
1551 .sp
1552 Default value: \fB0\fR.
1553 .RE
1554
1555 .sp
1556 .ne 2
1557 .na
1558 \fBzfs_vdev_async_read_max_active\fR (int)
1559 .ad
1560 .RS 12n
1561 Maximum asynchronous read I/Os active to each device.
1562 See the section "ZFS I/O SCHEDULER".
1563 .sp
1564 Default value: \fB3\fR.
1565 .RE
1566
1567 .sp
1568 .ne 2
1569 .na
1570 \fBzfs_vdev_async_read_min_active\fR (int)
1571 .ad
1572 .RS 12n
1573 Minimum asynchronous read I/Os active to each device.
1574 See the section "ZFS I/O SCHEDULER".
1575 .sp
1576 Default value: \fB1\fR.
1577 .RE
1578
1579 .sp
1580 .ne 2
1581 .na
1582 \fBzfs_vdev_async_write_active_max_dirty_percent\fR (int)
1583 .ad
1584 .RS 12n
1585 When the pool has more than
1586 \fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use
1587 \fBzfs_vdev_async_write_max_active\fR to limit active async writes. If
1588 the dirty data is between min and max, the active I/O limit is linearly
1589 interpolated. See the section "ZFS I/O SCHEDULER".
1590 .sp
1591 Default value: \fB60\fR%.
1592 .RE
1593
1594 .sp
1595 .ne 2
1596 .na
1597 \fBzfs_vdev_async_write_active_min_dirty_percent\fR (int)
1598 .ad
1599 .RS 12n
1600 When the pool has less than
1601 \fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use
1602 \fBzfs_vdev_async_write_min_active\fR to limit active async writes. If
1603 the dirty data is between min and max, the active I/O limit is linearly
1604 interpolated. See the section "ZFS I/O SCHEDULER".
1605 .sp
1606 Default value: \fB30\fR%.
1607 .RE
1608
1609 .sp
1610 .ne 2
1611 .na
1612 \fBzfs_vdev_async_write_max_active\fR (int)
1613 .ad
1614 .RS 12n
1615 Maximum asynchronous write I/Os active to each device.
1616 See the section "ZFS I/O SCHEDULER".
1617 .sp
1618 Default value: \fB10\fR.
1619 .RE
1620
1621 .sp
1622 .ne 2
1623 .na
1624 \fBzfs_vdev_async_write_min_active\fR (int)
1625 .ad
1626 .RS 12n
1627 Minimum asynchronous write I/Os active to each device.
1628 See the section "ZFS I/O SCHEDULER".
1629 .sp
1630 Lower values are associated with better latency on rotational media but poorer
1631 resilver performance. The default value of 2 was chosen as a compromise. A
1632 value of 3 has been shown to improve resilver performance further at a cost of
1633 further increasing latency.
1634 .sp
1635 Default value: \fB2\fR.
1636 .RE
1637
1638 .sp
1639 .ne 2
1640 .na
1641 \fBzfs_vdev_initializing_max_active\fR (int)
1642 .ad
1643 .RS 12n
1644 Maximum initializing I/Os active to each device.
1645 See the section "ZFS I/O SCHEDULER".
1646 .sp
1647 Default value: \fB1\fR.
1648 .RE
1649
1650 .sp
1651 .ne 2
1652 .na
1653 \fBzfs_vdev_initializing_min_active\fR (int)
1654 .ad
1655 .RS 12n
1656 Minimum initializing I/Os active to each device.
1657 See the section "ZFS I/O SCHEDULER".
1658 .sp
1659 Default value: \fB1\fR.
1660 .RE
1661
1662 .sp
1663 .ne 2
1664 .na
1665 \fBzfs_vdev_max_active\fR (int)
1666 .ad
1667 .RS 12n
1668 The maximum number of I/Os active to each device. Ideally, this will be >=
1669 the sum of each queue's max_active. It must be at least the sum of each
1670 queue's min_active. See the section "ZFS I/O SCHEDULER".
1671 .sp
1672 Default value: \fB1,000\fR.
1673 .RE
1674
1675 .sp
1676 .ne 2
1677 .na
1678 \fBzfs_vdev_removal_max_active\fR (int)
1679 .ad
1680 .RS 12n
1681 Maximum removal I/Os active to each device.
1682 See the section "ZFS I/O SCHEDULER".
1683 .sp
1684 Default value: \fB2\fR.
1685 .RE
1686
1687 .sp
1688 .ne 2
1689 .na
1690 \fBzfs_vdev_removal_min_active\fR (int)
1691 .ad
1692 .RS 12n
1693 Minimum removal I/Os active to each device.
1694 See the section "ZFS I/O SCHEDULER".
1695 .sp
1696 Default value: \fB1\fR.
1697 .RE
1698
1699 .sp
1700 .ne 2
1701 .na
1702 \fBzfs_vdev_scrub_max_active\fR (int)
1703 .ad
1704 .RS 12n
1705 Maximum scrub I/Os active to each device.
1706 See the section "ZFS I/O SCHEDULER".
1707 .sp
1708 Default value: \fB2\fR.
1709 .RE
1710
1711 .sp
1712 .ne 2
1713 .na
1714 \fBzfs_vdev_scrub_min_active\fR (int)
1715 .ad
1716 .RS 12n
1717 Minimum scrub I/Os active to each device.
1718 See the section "ZFS I/O SCHEDULER".
1719 .sp
1720 Default value: \fB1\fR.
1721 .RE
1722
1723 .sp
1724 .ne 2
1725 .na
1726 \fBzfs_vdev_sync_read_max_active\fR (int)
1727 .ad
1728 .RS 12n
1729 Maximum synchronous read I/Os active to each device.
1730 See the section "ZFS I/O SCHEDULER".
1731 .sp
1732 Default value: \fB10\fR.
1733 .RE
1734
1735 .sp
1736 .ne 2
1737 .na
1738 \fBzfs_vdev_sync_read_min_active\fR (int)
1739 .ad
1740 .RS 12n
1741 Minimum synchronous read I/Os active to each device.
1742 See the section "ZFS I/O SCHEDULER".
1743 .sp
1744 Default value: \fB10\fR.
1745 .RE
1746
1747 .sp
1748 .ne 2
1749 .na
1750 \fBzfs_vdev_sync_write_max_active\fR (int)
1751 .ad
1752 .RS 12n
1753 Maximum synchronous write I/Os active to each device.
1754 See the section "ZFS I/O SCHEDULER".
1755 .sp
1756 Default value: \fB10\fR.
1757 .RE
1758
1759 .sp
1760 .ne 2
1761 .na
1762 \fBzfs_vdev_sync_write_min_active\fR (int)
1763 .ad
1764 .RS 12n
1765 Minimum synchronous write I/Os active to each device.
1766 See the section "ZFS I/O SCHEDULER".
1767 .sp
1768 Default value: \fB10\fR.
1769 .RE
1770
1771 .sp
1772 .ne 2
1773 .na
1774 \fBzfs_vdev_trim_max_active\fR (int)
1775 .ad
1776 .RS 12n
1777 Maximum trim/discard I/Os active to each device.
1778 See the section "ZFS I/O SCHEDULER".
1779 .sp
1780 Default value: \fB2\fR.
1781 .RE
1782
1783 .sp
1784 .ne 2
1785 .na
1786 \fBzfs_vdev_trim_min_active\fR (int)
1787 .ad
1788 .RS 12n
1789 Minimum trim/discard I/Os active to each device.
1790 See the section "ZFS I/O SCHEDULER".
1791 .sp
1792 Default value: \fB1\fR.
1793 .RE
1794
1795 .sp
1796 .ne 2
1797 .na
1798 \fBzfs_vdev_queue_depth_pct\fR (int)
1799 .ad
1800 .RS 12n
1801 Maximum number of queued allocations per top-level vdev expressed as
1802 a percentage of \fBzfs_vdev_async_write_max_active\fR, which allows the
1803 system to detect devices that are more capable of handling allocations
1804 and to allocate more blocks to those devices. It allows for dynamic
1805 allocation distribution when devices are imbalanced as fuller devices
1806 will tend to be slower than empty devices.
1807
1808 See also \fBzio_dva_throttle_enabled\fR.
1809 .sp
1810 Default value: \fB1000\fR%.
1811 .RE
1812
1813 .sp
1814 .ne 2
1815 .na
1816 \fBzfs_expire_snapshot\fR (int)
1817 .ad
1818 .RS 12n
1819 Seconds to expire .zfs/snapshot
1820 .sp
1821 Default value: \fB300\fR.
1822 .RE
1823
1824 .sp
1825 .ne 2
1826 .na
1827 \fBzfs_admin_snapshot\fR (int)
1828 .ad
1829 .RS 12n
1830 Allow the creation, removal, or renaming of entries in the .zfs/snapshot
1831 directory to cause the creation, destruction, or renaming of snapshots.
1832 When enabled this functionality works both locally and over NFS exports
1833 which have the 'no_root_squash' option set. This functionality is disabled
1834 by default.
1835 .sp
1836 Use \fB1\fR for yes and \fB0\fR for no (default).
1837 .RE
1838
1839 .sp
1840 .ne 2
1841 .na
1842 \fBzfs_flags\fR (int)
1843 .ad
1844 .RS 12n
1845 Set additional debugging flags. The following flags may be bitwise-or'd
1846 together.
1847 .sp
1848 .TS
1849 box;
1850 rB lB
1851 lB lB
1852 r l.
1853 Value Symbolic Name
1854 Description
1855 _
1856 1 ZFS_DEBUG_DPRINTF
1857 Enable dprintf entries in the debug log.
1858 _
1859 2 ZFS_DEBUG_DBUF_VERIFY *
1860 Enable extra dbuf verifications.
1861 _
1862 4 ZFS_DEBUG_DNODE_VERIFY *
1863 Enable extra dnode verifications.
1864 _
1865 8 ZFS_DEBUG_SNAPNAMES
1866 Enable snapshot name verification.
1867 _
1868 16 ZFS_DEBUG_MODIFY
1869 Check for illegally modified ARC buffers.
1870 _
1871 64 ZFS_DEBUG_ZIO_FREE
1872 Enable verification of block frees.
1873 _
1874 128 ZFS_DEBUG_HISTOGRAM_VERIFY
1875 Enable extra spacemap histogram verifications.
1876 _
1877 256 ZFS_DEBUG_METASLAB_VERIFY
1878 Verify space accounting on disk matches in-core range_trees.
1879 _
1880 512 ZFS_DEBUG_SET_ERROR
1881 Enable SET_ERROR and dprintf entries in the debug log.
1882 _
1883 1024 ZFS_DEBUG_INDIRECT_REMAP
1884 Verify split blocks created by device removal.
1885 _
1886 2048 ZFS_DEBUG_TRIM
1887 Verify TRIM ranges are always within the allocatable range tree.
1888 _
1889 4096 ZFS_DEBUG_LOG_SPACEMAP
1890 Verify that the log summary is consistent with the spacemap log
1891 and enable zfs_dbgmsgs for metaslab loading and flushing.
1892 .TE
1893 .sp
1894 * Requires debug build.
1895 .sp
1896 Default value: \fB0\fR.
1897 .RE
1898
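.sp
.LP
For example, flags are combined by bitwise-or'ing their values from the table
above; the combination shown is illustrative only.
.sp
.nf
# Enable dprintf entries (1) and verification of block frees (64): 1 | 64 = 65
echo 65 > /sys/module/zfs/parameters/zfs_flags

# dprintf entries land in the debug log (see zfs_dbgmsg_enable)
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg
.fi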
1899 .sp
1900 .ne 2
1901 .na
1902 \fBzfs_free_leak_on_eio\fR (int)
1903 .ad
1904 .RS 12n
1905 If destroy encounters an EIO while reading metadata (e.g. indirect
1906 blocks), space referenced by the missing metadata can not be freed.
1907 Normally this causes the background destroy to become "stalled", as
1908 it is unable to make forward progress. While in this stalled state,
1909 all remaining space to free from the error-encountering filesystem is
1910 "temporarily leaked". Set this flag to cause it to ignore the EIO,
1911 permanently leak the space from indirect blocks that can not be read,
1912 and continue to free everything else that it can.
1913
1914 The default, "stalling" behavior is useful if the storage partially
1915 fails (i.e. some but not all i/os fail), and then later recovers. In
1916 this case, we will be able to continue pool operations while it is
1917 partially failed, and when it recovers, we can continue to free the
1918 space, with no leaks. However, note that this case is actually
1919 fairly rare.
1920
1921 Typically pools either (a) fail completely (but perhaps temporarily,
1922 e.g. a top-level vdev going offline), or (b) have localized,
1923 permanent errors (e.g. disk returns the wrong data due to bit flip or
1924 firmware bug). In case (a), this setting does not matter because the
1925 pool will be suspended and the sync thread will not be able to make
1926 forward progress regardless. In case (b), because the error is
1927 permanent, the best we can do is leak the minimum amount of space,
1928 which is what setting this flag will do. Therefore, it is reasonable
1929 for this flag to normally be set, but we chose the more conservative
1930 approach of not setting it, so that there is no possibility of
1931 leaking space in the "partial temporary" failure case.
1932 .sp
1933 Default value: \fB0\fR.
1934 .RE
1935
1936 .sp
1937 .ne 2
1938 .na
1939 \fBzfs_free_min_time_ms\fR (int)
1940 .ad
1941 .RS 12n
1942 During a \fBzfs destroy\fR operation using \fBfeature@async_destroy\fR a minimum
1943 of this much time will be spent working on freeing blocks per txg.
1944 .sp
1945 Default value: \fB1,000\fR.
1946 .RE
1947
1948 .sp
1949 .ne 2
1950 .na
1951 \fBzfs_immediate_write_sz\fR (long)
1952 .ad
1953 .RS 12n
1954 Largest data block to write to zil. Larger blocks will be treated as if the
1955 dataset being written to had the property setting \fBlogbias=throughput\fR.
1956 .sp
1957 Default value: \fB32,768\fR.
1958 .RE
1959
1960 .sp
1961 .ne 2
1962 .na
1963 \fBzfs_initialize_value\fR (ulong)
1964 .ad
1965 .RS 12n
1966 Pattern written to vdev free space by \fBzpool initialize\fR.
1967 .sp
1968 Default value: \fB16,045,690,984,833,335,022\fR (0xdeadbeefdeadbeee).
1969 .RE
1970
1971 .sp
1972 .ne 2
1973 .na
1974 \fBzfs_initialize_chunk_size\fR (ulong)
1975 .ad
1976 .RS 12n
1977 Size of writes used by \fBzpool initialize\fR.
1978 This option is used by the test suite to facilitate testing.
1979 .sp
1980 Default value: \fB1,048,576\fR
1981 .RE
1982
1983 .sp
1984 .ne 2
1985 .na
1986 \fBzfs_livelist_max_entries\fR (ulong)
1987 .ad
1988 .RS 12n
1989 The threshold size (in block pointers) at which we create a new sub-livelist.
1990 Larger sublists are more costly from a memory perspective but the fewer
1991 sublists there are, the lower the cost of insertion.
1992 .sp
1993 Default value: \fB500,000\fR.
1994 .RE
1995
1996 .sp
1997 .ne 2
1998 .na
1999 \fBzfs_livelist_min_percent_shared\fR (int)
2000 .ad
2001 .RS 12n
2002 If the amount of shared space between a snapshot and its clone drops below
2003 this threshold, the clone turns off the livelist and reverts to the old deletion
2004 method. This is in place because once a clone has been overwritten enough,
2005 livelists no longer give us a benefit.
2006 .sp
2007 Default value: \fB75\fR.
2008 .RE
2009
2010 .sp
2011 .ne 2
2012 .na
2013 \fBzfs_livelist_condense_new_alloc\fR (int)
2014 .ad
2015 .RS 12n
2016 Incremented each time an extra ALLOC blkptr is added to a livelist entry while
2017 it is being condensed.
2018 This option is used by the test suite to track race conditions.
2019 .sp
2020 Default value: \fB0\fR.
2021 .RE
2022
2023 .sp
2024 .ne 2
2025 .na
2026 \fBzfs_livelist_condense_sync_cancel\fR (int)
2027 .ad
2028 .RS 12n
2029 Incremented each time livelist condensing is canceled while in
2030 spa_livelist_condense_sync.
2031 This option is used by the test suite to track race conditions.
2032 .sp
2033 Default value: \fB0\fR.
2034 .RE
2035
2036 .sp
2037 .ne 2
2038 .na
2039 \fBzfs_livelist_condense_sync_pause\fR (int)
2040 .ad
2041 .RS 12n
2042 When set, the livelist condense process pauses indefinitely before
2043 executing the synctask - spa_livelist_condense_sync.
2044 This option is used by the test suite to trigger race conditions.
2045 .sp
2046 Default value: \fB0\fR.
2047 .RE
2048
2049 .sp
2050 .ne 2
2051 .na
2052 \fBzfs_livelist_condense_zthr_cancel\fR (int)
2053 .ad
2054 .RS 12n
2055 Incremented each time livelist condensing is canceled while in
2056 spa_livelist_condense_cb.
2057 This option is used by the test suite to track race conditions.
2058 .sp
2059 Default value: \fB0\fR.
2060 .RE
2061
2062 .sp
2063 .ne 2
2064 .na
2065 \fBzfs_livelist_condense_zthr_pause\fR (int)
2066 .ad
2067 .RS 12n
2068 When set, the livelist condense process pauses indefinitely before
2069 executing the open context condensing work in spa_livelist_condense_cb.
2070 This option is used by the test suite to trigger race conditions.
2071 .sp
2072 Default value: \fB0\fR.
2073 .RE
2074
2075 .sp
2076 .ne 2
2077 .na
2078 \fBzfs_lua_max_instrlimit\fR (ulong)
2079 .ad
2080 .RS 12n
2081 The maximum execution time limit that can be set for a ZFS channel program,
2082 specified as a number of Lua instructions.
2083 .sp
2084 Default value: \fB100,000,000\fR.
2085 .RE
2086
2087 .sp
2088 .ne 2
2089 .na
2090 \fBzfs_lua_max_memlimit\fR (ulong)
2091 .ad
2092 .RS 12n
2093 The maximum memory limit that can be set for a ZFS channel program, specified
2094 in bytes.
2095 .sp
2096 Default value: \fB104,857,600\fR.
2097 .RE
2098
2099 .sp
2100 .ne 2
2101 .na
2102 \fBzfs_max_dataset_nesting\fR (int)
2103 .ad
2104 .RS 12n
2105 The maximum depth of nested datasets. This value can be tuned temporarily to
2106 fix existing datasets that exceed the predefined limit.
2107 .sp
2108 Default value: \fB50\fR.
2109 .RE
2110
2111 .sp
2112 .ne 2
2113 .na
2114 \fBzfs_max_log_walking\fR (ulong)
2115 .ad
2116 .RS 12n
2117 The number of past TXGs that the flushing algorithm of the log spacemap
2118 feature uses to estimate incoming log blocks.
2119 .sp
2120 Default value: \fB5\fR.
2121 .RE
2122
2123 .sp
2124 .ne 2
2125 .na
2126 \fBzfs_max_logsm_summary_length\fR (ulong)
2127 .ad
2128 .RS 12n
2129 Maximum number of rows allowed in the summary of the spacemap log.
2130 .sp
2131 Default value: \fB10\fR.
2132 .RE
2133
2134 .sp
2135 .ne 2
2136 .na
2137 \fBzfs_max_recordsize\fR (int)
2138 .ad
2139 .RS 12n
2140 We currently support block sizes from 512 bytes to 16MB. The benefits of
2141 larger blocks, and thus larger I/O, need to be weighed against the cost of
2142 COWing a giant block to modify one byte. Additionally, very large blocks
2143 can have an impact on I/O latency, and also potentially on the memory
2144 allocator. Therefore, we do not allow the recordsize to be set larger than
2145 zfs_max_recordsize (default 1MB). Larger blocks can be created by changing
2146 this tunable, and pools with larger blocks can always be imported and used,
2147 regardless of this setting.
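.sp
For example, to allow 4MB records (a minimal sketch; the pool and dataset
names are placeholders, and records larger than 128KB also require the
\fBlarge_blocks\fR pool feature to be enabled):
.sp
.nf
# raise the limit for the running module
echo 4194304 > /sys/module/zfs/parameters/zfs_max_recordsize

# then larger recordsize values may be set on a dataset
zfs set recordsize=4M tank/data

# optionally make the limit persistent across module loads
echo "options zfs zfs_max_recordsize=4194304" >> /etc/modprobe.d/zfs.conf
.fi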
2148 .sp
2149 Default value: \fB1,048,576\fR.
2150 .RE
2151
2152 .sp
2153 .ne 2
2154 .na
2155 \fBzfs_allow_redacted_dataset_mount\fR (int)
2156 .ad
2157 .RS 12n
2158 Allow datasets received with redacted send/receive to be mounted. Normally
2159 disabled because these datasets may be missing key data.
2160 .sp
2161 Default value: \fB0\fR.
2162 .RE
2163
2164 .sp
2165 .ne 2
2166 .na
2167 \fBzfs_min_metaslabs_to_flush\fR (ulong)
2168 .ad
2169 .RS 12n
2170 Minimum number of metaslabs to flush per dirty TXG
2171 .sp
2172 Default value: \fB1\fR.
2173 .RE
2174
2175 .sp
2176 .ne 2
2177 .na
2178 \fBzfs_metaslab_fragmentation_threshold\fR (int)
2179 .ad
2180 .RS 12n
2181 Allow metaslabs to keep their active state as long as their fragmentation
2182 percentage is less than or equal to this value. An active metaslab that
2183 exceeds this threshold will no longer keep its active status allowing
2184 better metaslabs to be selected.
2185 .sp
2186 Default value: \fB70\fR.
2187 .RE
2188
2189 .sp
2190 .ne 2
2191 .na
2192 \fBzfs_mg_fragmentation_threshold\fR (int)
2193 .ad
2194 .RS 12n
2195 Metaslab groups are considered eligible for allocations if their
2196 fragmentation metric (measured as a percentage) is less than or equal to
2197 this value. If a metaslab group exceeds this threshold then it will be
2198 skipped unless all metaslab groups within the metaslab class have also
2199 crossed this threshold.
2200 .sp
2201 Default value: \fB95\fR.
2202 .RE
2203
2204 .sp
2205 .ne 2
2206 .na
2207 \fBzfs_mg_noalloc_threshold\fR (int)
2208 .ad
2209 .RS 12n
2210 Defines a threshold at which metaslab groups should be eligible for
2211 allocations. The value is expressed as a percentage of free space
2212 beyond which a metaslab group is always eligible for allocations.
2213 If a metaslab group's free space is less than or equal to the
2214 threshold, the allocator will avoid allocating to that group
2215 unless all groups in the pool have reached the threshold. Once all
2216 groups have reached the threshold, all groups are allowed to accept
2217 allocations. The default value of 0 disables the feature and causes
2218 all metaslab groups to be eligible for allocations.
2219
2220 This parameter allows one to deal with pools having heavily imbalanced
2221 vdevs such as would be the case when a new vdev has been added.
2222 Setting the threshold to a non-zero percentage will stop allocations
2223 from being made to vdevs whose free space is at or below the specified
2224 percentage and allow less filled vdevs to acquire more allocations than they
2225 otherwise would under the old \fBzfs_mg_alloc_failures\fR facility.
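.sp
For example, after attaching a new top-level vdev to an otherwise full pool,
the threshold might be raised temporarily so that the older, fuller vdevs
(whose free space has fallen to or below the threshold) are skipped and new
writes land on the emptier vdev. The value 30 below is only illustrative:
.sp
.nf
echo 30 > /sys/module/zfs/parameters/zfs_mg_noalloc_threshold
.fi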
2226 .sp
2227 Default value: \fB0\fR.
2228 .RE
2229
2230 .sp
2231 .ne 2
2232 .na
2233 \fBzfs_ddt_data_is_special\fR (int)
2234 .ad
2235 .RS 12n
2236 If enabled, ZFS will place DDT data into the special allocation class.
2237 .sp
2238 Default value: \fB1\fR.
2239 .RE
2240
2241 .sp
2242 .ne 2
2243 .na
2244 \fBzfs_user_indirect_is_special\fR (int)
2245 .ad
2246 .RS 12n
2247 If enabled, ZFS will place user data (both file and zvol) indirect blocks
2248 into the special allocation class.
2249 .sp
2250 Default value: \fB1\fR.
2251 .RE
2252
2253 .sp
2254 .ne 2
2255 .na
2256 \fBzfs_multihost_history\fR (int)
2257 .ad
2258 .RS 12n
2259 Historical statistics for the last N multihost updates will be available in
2260 \fB/proc/spl/kstat/zfs/<pool>/multihost\fR
2261 .sp
2262 Default value: \fB0\fR.
2263 .RE
2264
2265 .sp
2266 .ne 2
2267 .na
2268 \fBzfs_multihost_interval\fR (ulong)
2269 .ad
2270 .RS 12n
2271 Used to control the frequency of multihost writes which are performed when the
2272 \fBmultihost\fR pool property is on. This is one factor used to determine the
2273 length of the activity check during import.
2274 .sp
2275 The multihost write period is \fBzfs_multihost_interval / leaf-vdevs\fR
2276 milliseconds. On average a multihost write will be issued for each leaf vdev
2277 every \fBzfs_multihost_interval\fR milliseconds. In practice, the observed
2278 period can vary with the I/O load and this observed value is the delay which is
2279 stored in the uberblock.
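.sp
As a worked example of the write period (the leaf vdev count is illustrative):
.sp
.nf
  zfs_multihost_interval = 1000ms (default), leaf vdevs = 8
  write period = 1000ms / 8 = 125ms between multihost writes,
  so each individual leaf vdev is still written about once per second.
.fi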
2280 .sp
2281 Default value: \fB1000\fR.
2282 .RE
2283
2284 .sp
2285 .ne 2
2286 .na
2287 \fBzfs_multihost_import_intervals\fR (uint)
2288 .ad
2289 .RS 12n
2290 Used to control the duration of the activity test on import. Smaller values of
2291 \fBzfs_multihost_import_intervals\fR will reduce the import time but increase
2292 the risk of failing to detect an active pool. The total activity check time is
2293 never allowed to drop below one second.
2294 .sp
2295 On import the activity check waits a minimum amount of time determined by
2296 \fBzfs_multihost_interval * zfs_multihost_import_intervals\fR, or the same
2297 product computed on the host which last had the pool imported (whichever is
2298 greater). The activity check time may be further extended if the value of mmp
2299 delay found in the best uberblock indicates actual multihost updates happened
2300 at longer intervals than \fBzfs_multihost_interval\fR. A minimum value of
2301 \fB100ms\fR is enforced.
2302 .sp
2303 A value of 0 is ignored and treated as if it was set to 1.
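.sp
As a worked example with the defaults shown here, and assuming the host that
last imported the pool used the same settings:
.sp
.nf
  minimum activity check = zfs_multihost_interval * zfs_multihost_import_intervals
                         = 1000ms * 20 = 20 seconds
.fi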
2304 .sp
2305 Default value: \fB20\fR.
2306 .RE
2307
2308 .sp
2309 .ne 2
2310 .na
2311 \fBzfs_multihost_fail_intervals\fR (uint)
2312 .ad
2313 .RS 12n
2314 Controls the behavior of the pool when multihost write failures or delays are
2315 detected.
2316 .sp
2317 When \fBzfs_multihost_fail_intervals = 0\fR, multihost write failures or delays
2318 are ignored. The failures will still be reported to the ZED which, depending on
2319 its configuration, may take action such as suspending the pool or offlining a
2320 device.
2321
2322 .sp
2323 When \fBzfs_multihost_fail_intervals > 0\fR, the pool will be suspended if
2324 \fBzfs_multihost_fail_intervals * zfs_multihost_interval\fR milliseconds pass
2325 without a successful mmp write. This guarantees the activity test will see
2326 mmp writes if the pool is imported. A value of 1 is ignored and treated as
2327 if it was set to 2. This is necessary to prevent the pool from being suspended
2328 due to normal, small I/O latency variations.
2329
2330 .sp
2331 Default value: \fB10\fR.
2332 .RE
2333
2334 .sp
2335 .ne 2
2336 .na
2337 \fBzfs_no_scrub_io\fR (int)
2338 .ad
2339 .RS 12n
2340 Set for no scrub I/O. This results in scrubs not actually scrubbing data and
2341 simply doing a metadata crawl of the pool instead.
2342 .sp
2343 Use \fB1\fR for yes and \fB0\fR for no (default).
2344 .RE
2345
2346 .sp
2347 .ne 2
2348 .na
2349 \fBzfs_no_scrub_prefetch\fR (int)
2350 .ad
2351 .RS 12n
2352 Set to disable block prefetching for scrubs.
2353 .sp
2354 Use \fB1\fR for yes and \fB0\fR for no (default).
2355 .RE
2356
2357 .sp
2358 .ne 2
2359 .na
2360 \fBzfs_nocacheflush\fR (int)
2361 .ad
2362 .RS 12n
2363 Disable cache flush operations on disks when writing. Setting this will
2364 cause pool corruption on power loss if a volatile out-of-order write cache
2365 is enabled.
2366 .sp
2367 Use \fB1\fR for yes and \fB0\fR for no (default).
2368 .RE
2369
2370 .sp
2371 .ne 2
2372 .na
2373 \fBzfs_nopwrite_enabled\fR (int)
2374 .ad
2375 .RS 12n
2376 Enable NOP writes
2377 .sp
2378 Use \fB1\fR for yes (default) and \fB0\fR to disable.
2379 .RE
2380
2381 .sp
2382 .ne 2
2383 .na
2384 \fBzfs_dmu_offset_next_sync\fR (int)
2385 .ad
2386 .RS 12n
2387 Enable forcing txg sync to find holes. When enabled, this forces ZFS to act
2388 like prior versions when the SEEK_HOLE or SEEK_DATA flags are used: if a
2389 dnode is dirty, txgs are synced so that the holes or data can be
2390 found.
2391 .sp
2392 Use \fB1\fR for yes and \fB0\fR to disable (default).
2393 .RE
2394
2395 .sp
2396 .ne 2
2397 .na
2398 \fBzfs_pd_bytes_max\fR (int)
2399 .ad
2400 .RS 12n
2401 The number of bytes which should be prefetched during a pool traversal
2402 (e.g. \fBzfs send\fR or other data crawling operations).
2403 .sp
2404 Default value: \fB52,428,800\fR.
2405 .RE
2406
2407 .sp
2408 .ne 2
2409 .na
2410 \fBzfs_per_txg_dirty_frees_percent \fR (ulong)
2411 .ad
2412 .RS 12n
2413 Tunable to control percentage of dirtied indirect blocks from frees allowed
2414 into one TXG. After this threshold is crossed, additional frees will wait until
2415 the next TXG.
2416 A value of zero will disable this throttle.
2417 .sp
2418 Default value: \fB5\fR, set to \fB0\fR to disable.
2419 .RE
2420
2421 .sp
2422 .ne 2
2423 .na
2424 \fBzfs_prefetch_disable\fR (int)
2425 .ad
2426 .RS 12n
2427 This tunable disables predictive prefetch. Note that it leaves "prescient"
2428 prefetch (e.g. prefetch for zfs send) intact. Unlike predictive prefetch,
2429 prescient prefetch never issues i/os that end up not being needed, so it
2430 can't hurt performance.
2431 .sp
2432 Use \fB1\fR for yes and \fB0\fR for no (default).
2433 .RE
2434
2435 .sp
2436 .ne 2
2437 .na
2438 \fBzfs_qat_checksum_disable\fR (int)
2439 .ad
2440 .RS 12n
2441 This tunable disables qat hardware acceleration for sha256 checksums. It
2442 may be set after the zfs modules have been loaded to initialize the qat
2443 hardware as long as support is compiled in and the qat driver is present.
2444 .sp
2445 Use \fB1\fR for yes and \fB0\fR for no (default).
2446 .RE
2447
2448 .sp
2449 .ne 2
2450 .na
2451 \fBzfs_qat_compress_disable\fR (int)
2452 .ad
2453 .RS 12n
2454 This tunable disables qat hardware acceleration for gzip compression. It
2455 may be set after the zfs modules have been loaded to initialize the qat
2456 hardware as long as support is compiled in and the qat driver is present.
2457 .sp
2458 Use \fB1\fR for yes and \fB0\fR for no (default).
2459 .RE
2460
2461 .sp
2462 .ne 2
2463 .na
2464 \fBzfs_qat_encrypt_disable\fR (int)
2465 .ad
2466 .RS 12n
2467 This tunable disables qat hardware acceleration for AES-GCM encryption. It
2468 may be set after the zfs modules have been loaded to initialize the qat
2469 hardware as long as support is compiled in and the qat driver is present.
2470 .sp
2471 Use \fB1\fR for yes and \fB0\fR for no (default).
2472 .RE
2473
2474 .sp
2475 .ne 2
2476 .na
2477 \fBzfs_read_chunk_size\fR (long)
2478 .ad
2479 .RS 12n
2480 Bytes to read per chunk
2481 .sp
2482 Default value: \fB1,048,576\fR.
2483 .RE
2484
2485 .sp
2486 .ne 2
2487 .na
2488 \fBzfs_read_history\fR (int)
2489 .ad
2490 .RS 12n
2491 Historical statistics for the last N reads will be available in
2492 \fB/proc/spl/kstat/zfs/<pool>/reads\fR
2493 .sp
2494 Default value: \fB0\fR (no data is kept).
2495 .RE
2496
2497 .sp
2498 .ne 2
2499 .na
2500 \fBzfs_read_history_hits\fR (int)
2501 .ad
2502 .RS 12n
2503 Include cache hits in read history
2504 .sp
2505 Use \fB1\fR for yes and \fB0\fR for no (default).
2506 .RE
2507
2508 .sp
2509 .ne 2
2510 .na
2511 \fBzfs_reconstruct_indirect_combinations_max\fR (int)
2512 .ad
2513 .RS 12n
2514 If an indirect split block contains more than this many possible unique
2515 combinations when being reconstructed, consider it too computationally
2516 expensive to check them all. Instead, try at most
2517 \fBzfs_reconstruct_indirect_combinations_max\fR randomly-selected
2518 combinations each time the block is accessed. This allows all segment
2519 copies to participate fairly in the reconstruction when all combinations
2520 cannot be checked and prevents repeated use of one bad copy.
2521 .sp
2522 Default value: \fB4096\fR.
2523 .RE
2524
2525 .sp
2526 .ne 2
2527 .na
2528 \fBzfs_recover\fR (int)
2529 .ad
2530 .RS 12n
2531 Set to attempt to recover from fatal errors. This should only be used as a
2532 last resort, as it typically results in leaked space, or worse.
2533 .sp
2534 Use \fB1\fR for yes and \fB0\fR for no (default).
2535 .RE
2536
2537 .sp
2538 .ne 2
2539 .na
2540 \fBzfs_removal_ignore_errors\fR (int)
2541 .ad
2542 .RS 12n
2543 .sp
2544 Ignore hard IO errors during device removal. When set, if a device encounters
2545 a hard IO error during the removal process, the removal will not be canceled.
2546 This can result in a normally recoverable block becoming permanently damaged
2547 and is not recommended. This should only be used as a last resort when the
2548 pool cannot be returned to a healthy state prior to removing the device.
2549 .sp
2550 Default value: \fB0\fR.
2551 .RE
2552
2553 .sp
2554 .ne 2
2555 .na
2556 \fBzfs_removal_suspend_progress\fR (int)
2557 .ad
2558 .RS 12n
2559 .sp
2560 This is used by the test suite so that it can ensure that certain actions
2561 happen while in the middle of a removal.
2562 .sp
2563 Default value: \fB0\fR.
2564 .RE
2565
2566 .sp
2567 .ne 2
2568 .na
2569 \fBzfs_remove_max_segment\fR (int)
2570 .ad
2571 .RS 12n
2572 .sp
2573 The largest contiguous segment that we will attempt to allocate when removing
2574 a device. This can be no larger than 16MB. If there is a performance
2575 problem with attempting to allocate large blocks, consider decreasing this.
2576 .sp
2577 Default value: \fB16,777,216\fR (16MB).
2578 .RE
2579
2580 .sp
2581 .ne 2
2582 .na
2583 \fBzfs_resilver_min_time_ms\fR (int)
2584 .ad
2585 .RS 12n
2586 Resilvers are processed by the sync thread. While resilvering, it will spend
2587 at least this much time working on a resilver between txg flushes.
2588 .sp
2589 Default value: \fB3,000\fR.
2590 .RE
2591
2592 .sp
2593 .ne 2
2594 .na
2595 \fBzfs_scan_ignore_errors\fR (int)
2596 .ad
2597 .RS 12n
2598 If set to a nonzero value, remove the DTL (dirty time list) upon
2599 completion of a pool scan (scrub) even if there were unrepairable
2600 errors. It is intended to be used during pool repair or recovery to
2601 stop resilvering when the pool is next imported.
2602 .sp
2603 Default value: \fB0\fR.
2604 .RE
2605
2606 .sp
2607 .ne 2
2608 .na
2609 \fBzfs_scrub_min_time_ms\fR (int)
2610 .ad
2611 .RS 12n
2612 Scrubs are processed by the sync thread. While scrubbing, it will spend
2613 at least this much time working on a scrub between txg flushes.
2614 .sp
2615 Default value: \fB1,000\fR.
2616 .RE
2617
2618 .sp
2619 .ne 2
2620 .na
2621 \fBzfs_scan_checkpoint_intval\fR (int)
2622 .ad
2623 .RS 12n
2624 To preserve progress across reboots the sequential scan algorithm periodically
2625 needs to stop metadata scanning and issue all the verification I/Os to disk.
2626 The frequency of this flushing is determined by the
2627 \fBzfs_scan_checkpoint_intval\fR tunable.
2628 .sp
2629 Default value: \fB7200\fR seconds (every 2 hours).
2630 .RE
2631
2632 .sp
2633 .ne 2
2634 .na
2635 \fBzfs_scan_fill_weight\fR (int)
2636 .ad
2637 .RS 12n
2638 This tunable affects how scrub and resilver I/O segments are ordered. A higher
2639 number indicates that we care more about how filled in a segment is, while a
2640 lower number indicates we care more about the size of the extent without
2641 considering the gaps within a segment. This value is only tunable upon module
2642 insertion. Changing the value afterwards will have no effect on scrub or
2643 resilver performance.
2644 .sp
2645 Default value: \fB3\fR.
2646 .RE
2647
2648 .sp
2649 .ne 2
2650 .na
2651 \fBzfs_scan_issue_strategy\fR (int)
2652 .ad
2653 .RS 12n
2654 Determines the order that data will be verified while scrubbing or resilvering.
2655 If set to \fB1\fR, data will be verified as sequentially as possible, given the
2656 amount of memory reserved for scrubbing (see \fBzfs_scan_mem_lim_fact\fR). This
2657 may improve scrub performance if the pool's data is very fragmented. If set to
2658 \fB2\fR, the largest mostly-contiguous chunk of found data will be verified
2659 first. By deferring scrubbing of small segments, we may later find adjacent data
2660 to coalesce and increase the segment size. If set to \fB0\fR, zfs will use
2661 strategy \fB1\fR during normal verification and strategy \fB2\fR while taking a
2662 checkpoint.
2663 .sp
2664 Default value: \fB0\fR.
2665 .RE
2666
2667 .sp
2668 .ne 2
2669 .na
2670 \fBzfs_scan_legacy\fR (int)
2671 .ad
2672 .RS 12n
2673 A value of 0 indicates that scrubs and resilvers will gather metadata in
2674 memory before issuing sequential I/O. A value of 1 indicates that the legacy
2675 algorithm will be used where I/O is initiated as soon as it is discovered.
2676 Changing this value to 0 will not affect scrubs or resilvers that are already
2677 in progress.
2678 .sp
2679 Default value: \fB0\fR.
2680 .RE
2681
2682 .sp
2683 .ne 2
2684 .na
2685 \fBzfs_scan_max_ext_gap\fR (int)
2686 .ad
2687 .RS 12n
2688 Indicates the largest gap in bytes between scrub / resilver I/Os that will still
2689 be considered sequential for sorting purposes. Changing this value will not
2690 affect scrubs or resilvers that are already in progress.
2691 .sp
2692 Default value: \fB2097152 (2 MB)\fR.
2693 .RE
2694
2695 .sp
2696 .ne 2
2697 .na
2698 \fBzfs_scan_mem_lim_fact\fR (int)
2699 .ad
2700 .RS 12n
2701 Maximum fraction of RAM used for I/O sorting by sequential scan algorithm.
2702 This tunable determines the hard limit for I/O sorting memory usage.
2703 When the hard limit is reached we stop scanning metadata and start issuing
2704 data verification I/O. This is done until we get below the soft limit.
2705 .sp
2706 Default value: \fB20\fR which is 5% of RAM (1/20).
2707 .RE
2708
2709 .sp
2710 .ne 2
2711 .na
2712 \fBzfs_scan_mem_lim_soft_fact\fR (int)
2713 .ad
2714 .RS 12n
2715 The fraction of the hard limit used to determine the soft limit for I/O sorting
2716 by the sequential scan algorithm. When we cross this limit from below no action
2717 is taken. When we cross this limit from above it is because we are issuing
2718 verification I/O. In this case (unless the metadata scan is done) we stop
2719 issuing verification I/O and start scanning metadata again until we get to the
2720 hard limit.
2721 .sp
2722 Default value: \fB20\fR which is 5% of the hard limit (1/20).
2723 .RE
2724
2725 .sp
2726 .ne 2
2727 .na
2728 \fBzfs_scan_vdev_limit\fR (int)
2729 .ad
2730 .RS 12n
2731 Maximum amount of data that can be issued at once for scrubs and
2732 resilvers per leaf device, given in bytes.
2733 .sp
2734 Default value: \fB41943040\fR.
2735 .RE
2736
2737 .sp
2738 .ne 2
2739 .na
2740 \fBzfs_send_corrupt_data\fR (int)
2741 .ad
2742 .RS 12n
2743 Allow sending of corrupt data (ignore read/checksum errors when sending data)
2744 .sp
2745 Use \fB1\fR for yes and \fB0\fR for no (default).
2746 .RE
2747
2748 .sp
2749 .ne 2
2750 .na
2751 \fBzfs_send_unmodified_spill_blocks\fR (int)
2752 .ad
2753 .RS 12n
2754 Include unmodified spill blocks in the send stream. Under certain circumstances
2755 previous versions of ZFS could incorrectly remove the spill block from an
2756 existing object. Including unmodified copies of the spill blocks creates a
2757 backwards compatible stream which will recreate a spill block if it was
2758 incorrectly removed.
2759 .sp
2760 Use \fB1\fR for yes (default) and \fB0\fR for no.
2761 .RE
2762
2763 .sp
2764 .ne 2
2765 .na
2766 \fBzfs_send_no_prefetch_queue_ff\fR (int)
2767 .ad
2768 .RS 12n
2769 The fill fraction of the \fBzfs send\fR internal queues. The fill fraction
2770 controls the timing with which internal threads are woken up.
2771 .sp
2772 Default value: \fB20\fR.
2773 .RE
2774
2775 .sp
2776 .ne 2
2777 .na
2778 \fBzfs_send_no_prefetch_queue_length\fR (int)
2779 .ad
2780 .RS 12n
2781 The maximum number of bytes allowed in \fBzfs send\fR's internal queues.
2782 .sp
2783 Default value: \fB1,048,576\fR.
2784 .RE
2785
2786 .sp
2787 .ne 2
2788 .na
2789 \fBzfs_send_queue_ff\fR (int)
2790 .ad
2791 .RS 12n
2792 The fill fraction of the \fBzfs send\fR prefetch queue. The fill fraction
2793 controls the timing with which internal threads are woken up.
2794 .sp
2795 Default value: \fB20\fR.
2796 .RE
2797
2798 .sp
2799 .ne 2
2800 .na
2801 \fBzfs_send_queue_length\fR (int)
2802 .ad
2803 .RS 12n
2804 The maximum number of bytes that will be prefetched by \fBzfs send\fR.
2805 This value must be at least twice the maximum block size in use.
2806 .sp
2807 Default value: \fB16,777,216\fR.
2808 .RE
2809
2810 .sp
2811 .ne 2
2812 .na
2813 \fBzfs_recv_queue_ff\fR (int)
2814 .ad
2815 .RS 12n
2816 The fill fraction of the \fBzfs receive\fR queue. The fill fraction
2817 controls the timing with which internal threads are woken up.
2818 .sp
2819 Default value: \fB20\fR.
2820 .RE
2821
2822 .sp
2823 .ne 2
2824 .na
2825 \fBzfs_recv_queue_length\fR (int)
2826 .ad
2827 .RS 12n
2828 The maximum number of bytes allowed in the \fBzfs receive\fR queue. This value
2829 must be at least twice the maximum block size in use.
2830 .sp
2831 Default value: \fB16,777,216\fR.
2832 .RE
2833
2834 .sp
2835 .ne 2
2836 .na
2837 \fBzfs_override_estimate_recordsize\fR (ulong)
2838 .ad
2839 .RS 12n
2840 Setting this variable overrides the default logic for estimating block
2841 sizes when doing a zfs send. The default heuristic is that the average
2842 block size will be the current recordsize. Override this value if most data
2843 in your dataset is not of that size and you require accurate zfs send size
2844 estimates.
2845 .sp
2846 Default value: \fB0\fR.
2847 .RE
2848
2849 .sp
2850 .ne 2
2851 .na
2852 \fBzfs_sync_pass_deferred_free\fR (int)
2853 .ad
2854 .RS 12n
2855 Flushing of data to disk is done in passes. Defer frees starting in this pass
2856 .sp
2857 Default value: \fB2\fR.
2858 .RE
2859
2860 .sp
2861 .ne 2
2862 .na
2863 \fBzfs_spa_discard_memory_limit\fR (int)
2864 .ad
2865 .RS 12n
2866 Maximum memory used for prefetching a checkpoint's space map on each
2867 vdev while discarding the checkpoint.
2868 .sp
2869 Default value: \fB16,777,216\fR.
2870 .RE
2871
2872 .sp
2873 .ne 2
2874 .na
2875 \fBzfs_special_class_metadata_reserve_pct\fR (int)
2876 .ad
2877 .RS 12n
2878 Only allow small data blocks to be allocated on the special and dedup vdev
2879 types when the available free space percentage on these vdevs exceeds this
2880 value. This ensures reserved space is available for pool metadata as the
2881 special vdevs approach capacity.
2882 .sp
2883 Default value: \fB25\fR.
2884 .RE
2885
2886 .sp
2887 .ne 2
2888 .na
2889 \fBzfs_sync_pass_dont_compress\fR (int)
2890 .ad
2891 .RS 12n
2892 Starting in this sync pass, we disable compression (including of metadata).
2893 With the default setting, in practice, we don't have this many sync passes,
2894 so this has no effect.
2895 .sp
2896 The original intent was that disabling compression would help the sync passes
2897 to converge. However, in practice disabling compression increases the average
2898 number of sync passes, because when we turn compression off, the size of many
2899 blocks changes and thus we have to re-allocate (not overwrite) them. It
2900 also increases the number of 128KB allocations (e.g. for indirect blocks and
2901 spacemaps) because these will not be compressed. The 128K allocations are
2902 especially detrimental to performance on highly fragmented systems, which may
2903 have very few free segments of this size, and may need to load new metaslabs
2904 to satisfy 128K allocations.
2905 .sp
2906 Default value: \fB8\fR.
2907 .RE
2908
2909 .sp
2910 .ne 2
2911 .na
2912 \fBzfs_sync_pass_rewrite\fR (int)
2913 .ad
2914 .RS 12n
2915 Rewrite new block pointers starting in this pass
2916 .sp
2917 Default value: \fB2\fR.
2918 .RE
2919
2920 .sp
2921 .ne 2
2922 .na
2923 \fBzfs_sync_taskq_batch_pct\fR (int)
2924 .ad
2925 .RS 12n
2926 This controls the number of threads used by the dp_sync_taskq. The default
2927 value of 75% will create a maximum of one thread per cpu.
2928 .sp
2929 Default value: \fB75\fR%.
2930 .RE
2931
2932 .sp
2933 .ne 2
2934 .na
2935 \fBzfs_trim_extent_bytes_max\fR (unsigned int)
2936 .ad
2937 .RS 12n
2938 Maximum size of TRIM command. Ranges larger than this will be split into
2939 chunks no larger than \fBzfs_trim_extent_bytes_max\fR bytes before being
2940 issued to the device.
2941 .sp
2942 Default value: \fB134,217,728\fR.
2943 .RE
2944
2945 .sp
2946 .ne 2
2947 .na
2948 \fBzfs_trim_extent_bytes_min\fR (unsigned int)
2949 .ad
2950 .RS 12n
2951 Minimum size of TRIM commands. TRIM ranges smaller than this will be skipped
2952 unless they're part of a larger range which was broken into chunks. This is
2953 done because it's common for these small TRIMs to negatively impact overall
2954 performance. This value can be set to 0 to TRIM all unallocated space.
2955 .sp
2956 Default value: \fB32,768\fR.
2957 .RE
2958
2959 .sp
2960 .ne 2
2961 .na
2962 \fBzfs_trim_metaslab_skip\fR (unsigned int)
2963 .ad
2964 .RS 12n
2965 Skip uninitialized metaslabs during the TRIM process. This option is useful
2966 for pools constructed from large thinly-provisioned devices where TRIM
2967 operations are slow. As a pool ages, an increasing fraction of the pool's
2968 metaslabs will be initialized, progressively degrading the usefulness of
2969 this option. This setting is stored when starting a manual TRIM and will
2970 persist for the duration of the requested TRIM.
2971 .sp
2972 Default value: \fB0\fR.
2973 .RE
2974
2975 .sp
2976 .ne 2
2977 .na
2978 \fBzfs_trim_queue_limit\fR (unsigned int)
2979 .ad
2980 .RS 12n
2981 Maximum number of queued TRIMs outstanding per leaf vdev. The number of
2982 concurrent TRIM commands issued to the device is controlled by the
2983 \fBzfs_vdev_trim_min_active\fR and \fBzfs_vdev_trim_max_active\fR module
2984 options.
2985 .sp
2986 Default value: \fB10\fR.
2987 .RE
2988
2989 .sp
2990 .ne 2
2991 .na
2992 \fBzfs_trim_txg_batch\fR (unsigned int)
2993 .ad
2994 .RS 12n
2995 The number of transaction groups worth of frees which should be aggregated
2996 before TRIM operations are issued to the device. This setting represents a
2997 trade-off between issuing larger, more efficient TRIM operations and the
2998 delay before the recently trimmed space is available for use by the device.
2999 .sp
3000 Increasing this value will allow frees to be aggregated for a longer time.
3001 This will result in larger TRIM operations and potentially increased memory
3002 usage. Decreasing this value will have the opposite effect. The default
3003 value of 32 was determined to be a reasonable compromise.
3004 .sp
3005 Default value: \fB32\fR.
3006 .RE
3007
3008 .sp
3009 .ne 2
3010 .na
3011 \fBzfs_txg_history\fR (int)
3012 .ad
3013 .RS 12n
3014 Historical statistics for the last N txgs will be available in
3015 \fB/proc/spl/kstat/zfs/<pool>/txgs\fR
3016 .sp
3017 Default value: \fB0\fR.
3018 .RE
3019
3020 .sp
3021 .ne 2
3022 .na
3023 \fBzfs_txg_timeout\fR (int)
3024 .ad
3025 .RS 12n
3026 Flush dirty data to disk at least every N seconds (maximum txg duration)
3027 .sp
3028 Default value: \fB5\fR.
3029 .RE
3030
3031 .sp
3032 .ne 2
3033 .na
3034 \fBzfs_vdev_aggregate_trim\fR (int)
3035 .ad
3036 .RS 12n
3037 Allow TRIM I/Os to be aggregated. This is normally not helpful because
3038 the extents to be trimmed will already have been aggregated by the
3039 metaslab. This option is provided for debugging and performance analysis.
3040 .sp
3041 Default value: \fB0\fR.
3042 .RE
3043
3044 .sp
3045 .ne 2
3046 .na
3047 \fBzfs_vdev_aggregation_limit\fR (int)
3048 .ad
3049 .RS 12n
3050 Max vdev I/O aggregation size
3051 .sp
3052 Default value: \fB1,048,576\fR.
3053 .RE
3054
3055 .sp
3056 .ne 2
3057 .na
3058 \fBzfs_vdev_aggregation_limit_non_rotating\fR (int)
3059 .ad
3060 .RS 12n
3061 Max vdev I/O aggregation size for non-rotating media
3062 .sp
3063 Default value: \fB131,072\fR.
3064 .RE
3065
3066 .sp
3067 .ne 2
3068 .na
3069 \fBzfs_vdev_cache_bshift\fR (int)
3070 .ad
3071 .RS 12n
3072 Shift size to inflate reads to.
3073 .sp
3074 Default value: \fB16\fR (effectively 65536).
3075 .RE
3076
3077 .sp
3078 .ne 2
3079 .na
3080 \fBzfs_vdev_cache_max\fR (int)
3081 .ad
3082 .RS 12n
3083 Inflate reads smaller than this value to meet the \fBzfs_vdev_cache_bshift\fR
3084 size (default 64k).
3085 .sp
3086 Default value: \fB16384\fR.
3087 .RE
3088
3089 .sp
3090 .ne 2
3091 .na
3092 \fBzfs_vdev_cache_size\fR (int)
3093 .ad
3094 .RS 12n
3095 Total size of the per-disk cache in bytes.
3096 .sp
3097 Currently this feature is disabled as it has been found to not be helpful
3098 for performance and in some cases harmful.
3099 .sp
3100 Default value: \fB0\fR.
3101 .RE
3102
3103 .sp
3104 .ne 2
3105 .na
3106 \fBzfs_vdev_mirror_rotating_inc\fR (int)
3107 .ad
3108 .RS 12n
3109 A number by which the balancing algorithm increments the load calculation
3110 when an I/O immediately follows its predecessor on rotational vdevs, for
3111 the purpose of selecting the least busy mirror member when making
3112 decisions based on load.
3113 .sp
3114 Default value: \fB0\fR.
3115 .RE
3116
3117 .sp
3118 .ne 2
3119 .na
3120 \fBzfs_vdev_mirror_rotating_seek_inc\fR (int)
3121 .ad
3122 .RS 12n
3123 A number by which the balancing algorithm increments the load calculation for
3124 the purpose of selecting the least busy mirror member when an I/O lacks
3125 locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. I/Os within
3126 this offset that do not immediately follow the previous I/O are incremented
3127 by half of this value.
3128 .sp
3129 Default value: \fB5\fR.
3130 .RE
3131
3132 .sp
3133 .ne 2
3134 .na
3135 \fBzfs_vdev_mirror_rotating_seek_offset\fR (int)
3136 .ad
3137 .RS 12n
3138 The maximum distance for the last queued I/O in which the balancing algorithm
3139 considers an I/O to have locality.
3140 See the section "ZFS I/O SCHEDULER".
3141 .sp
3142 Default value: \fB1048576\fR.
3143 .RE
3144
3145 .sp
3146 .ne 2
3147 .na
3148 \fBzfs_vdev_mirror_non_rotating_inc\fR (int)
3149 .ad
3150 .RS 12n
3151 A number by which the balancing algorithm increments the load calculation for
3152 the purpose of selecting the least busy mirror member on non-rotational vdevs
3153 when I/Os do not immediately follow one another.
3154 .sp
3155 Default value: \fB0\fR.
3156 .RE
3157
3158 .sp
3159 .ne 2
3160 .na
3161 \fBzfs_vdev_mirror_non_rotating_seek_inc\fR (int)
3162 .ad
3163 .RS 12n
3164 A number by which the balancing algorithm increments the load calculation for
3165 the purpose of selecting the least busy mirror member when an I/O lacks
3166 locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. I/Os within
3167 this offset that do not immediately follow the previous I/O are incremented
3168 by half of this value.
3169 .sp
3170 Default value: \fB1\fR.
3171 .RE
3172
3173 .sp
3174 .ne 2
3175 .na
3176 \fBzfs_vdev_read_gap_limit\fR (int)
3177 .ad
3178 .RS 12n
3179 Aggregate read I/O operations if the gap on-disk between them is within this
3180 threshold.
3181 .sp
3182 Default value: \fB32,768\fR.
3183 .RE
3184
3185 .sp
3186 .ne 2
3187 .na
3188 \fBzfs_vdev_write_gap_limit\fR (int)
3189 .ad
3190 .RS 12n
3191 Aggregate write I/O operations if the gap on-disk between them is within this threshold.
3192 .sp
3193 Default value: \fB4,096\fR.
3194 .RE
3195
3196 .sp
3197 .ne 2
3198 .na
3199 \fBzfs_vdev_raidz_impl\fR (string)
3200 .ad
3201 .RS 12n
3202 Parameter for selecting raidz parity implementation to use.
3203
3204 Options marked (always) below may be selected on module load as they are
3205 supported on all systems.
3206 The remaining options may only be set after the module is loaded, as they
3207 are available only if the implementations are compiled in and supported
3208 on the running system.
3209
3210 Once the module is loaded, the content of
3211 /sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options
3212 with the currently selected one enclosed in [].
3213 Possible options are:
3214 fastest - (always) implementation selected using built-in benchmark
3215 original - (always) original raidz implementation
3216 scalar - (always) scalar raidz implementation
3217 sse2 - implementation using SSE2 instruction set (64bit x86 only)
3218 ssse3 - implementation using SSSE3 instruction set (64bit x86 only)
3219 avx2 - implementation using AVX2 instruction set (64bit x86 only)
3220 avx512f - implementation using AVX512F instruction set (64bit x86 only)
3221 avx512bw - implementation using AVX512F & AVX512BW instruction sets (64bit x86 only)
3222 aarch64_neon - implementation using NEON (Aarch64/64 bit ARMv8 only)
3223 aarch64_neonx2 - implementation using NEON with more unrolling (Aarch64/64 bit ARMv8 only)
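.sp
For example, the current and available implementations can be inspected and
one selected explicitly; the output shown is only indicative and depends on
the running system:
.sp
.nf
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl
[fastest] original scalar sse2 ssse3 avx2

echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
.fi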
3224 .sp
3225 Default value: \fBfastest\fR.
3226 .RE
3227
3228 .sp
3229 .ne 2
3230 .na
3231 \fBzfs_zevent_cols\fR (int)
3232 .ad
3233 .RS 12n
3234 When zevents are logged to the console, use this as the word wrap width.
3235 .sp
3236 Default value: \fB80\fR.
3237 .RE
3238
3239 .sp
3240 .ne 2
3241 .na
3242 \fBzfs_zevent_console\fR (int)
3243 .ad
3244 .RS 12n
3245 Log events to the console
3246 .sp
3247 Use \fB1\fR for yes and \fB0\fR for no (default).
3248 .RE
3249
3250 .sp
3251 .ne 2
3252 .na
3253 \fBzfs_zevent_len_max\fR (int)
3254 .ad
3255 .RS 12n
3256 Max event queue length. A value of 0 will result in a calculated value which
3257 increases with the number of CPUs in the system (minimum 64 events). Events
3258 in the queue can be viewed with the \fBzpool events\fR command.
3259 .sp
3260 Default value: \fB0\fR.
3261 .RE
3262
3263 .sp
3264 .ne 2
3265 .na
3266 \fBzfs_zil_clean_taskq_maxalloc\fR (int)
3267 .ad
3268 .RS 12n
3269 The maximum number of taskq entries that are allowed to be cached. When this
3270 limit is exceeded, transaction records (itxs) will be cleaned synchronously.
3271 .sp
3272 Default value: \fB1048576\fR.
3273 .RE
3274
3275 .sp
3276 .ne 2
3277 .na
3278 \fBzfs_zil_clean_taskq_minalloc\fR (int)
3279 .ad
3280 .RS 12n
3281 The number of taskq entries that are pre-populated when the taskq is first
3282 created and are immediately available for use.
3283 .sp
3284 Default value: \fB1024\fR.
3285 .RE
3286
3287 .sp
3288 .ne 2
3289 .na
3290 \fBzfs_zil_clean_taskq_nthr_pct\fR (int)
3291 .ad
3292 .RS 12n
3293 This controls the number of threads used by the dp_zil_clean_taskq. The default
3294 value of 100% will create a maximum of one thread per cpu.
3295 .sp
3296 Default value: \fB100\fR%.
3297 .RE
3298
3299 .sp
3300 .ne 2
3301 .na
3302 \fBzil_maxblocksize\fR (int)
3303 .ad
3304 .RS 12n
3305 This sets the maximum block size used by the ZIL. On very fragmented pools,
3306 lowering this (typically to 36KB) can improve performance.
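.sp
For example, to lower the limit on a badly fragmented pool (36KB is 36864
bytes; this is only a sketch and the value should be tuned to the workload):
.sp
.nf
echo 36864 > /sys/module/zfs/parameters/zil_maxblocksize
.fi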
3307 .sp
3308 Default value: \fB131072\fR (128KB).
3309 .RE
3310
3311 .sp
3312 .ne 2
3313 .na
3314 \fBzil_nocacheflush\fR (int)
3315 .ad
3316 .RS 12n
3317 Disable the cache flush commands that are normally sent to the disk(s) by
3318 the ZIL after an LWB write has completed. Setting this will cause ZIL
3319 corruption on power loss if a volatile out-of-order write cache is enabled.
3320 .sp
3321 Use \fB1\fR for yes and \fB0\fR for no (default).
3322 .RE
3323
3324 .sp
3325 .ne 2
3326 .na
3327 \fBzil_replay_disable\fR (int)
3328 .ad
3329 .RS 12n
3330 Disable intent logging replay. Replay can be disabled to recover from a
3331 corrupted ZIL.
3332 .sp
3333 Use \fB1\fR for yes and \fB0\fR for no (default).
3334 .RE
3335
3336 .sp
3337 .ne 2
3338 .na
3339 \fBzil_slog_bulk\fR (ulong)
3340 .ad
3341 .RS 12n
3342 Limit SLOG write size per commit executed with synchronous priority.
3343 Any writes above that will be executed with lower (asynchronous) priority
3344 to limit potential SLOG device abuse by a single active ZIL writer.
3345 .sp
3346 Default value: \fB786,432\fR.
3347 .RE
3348
3349 .sp
3350 .ne 2
3351 .na
3352 \fBzio_deadman_log_all\fR (int)
3353 .ad
3354 .RS 12n
3355 If non-zero, the zio deadman will produce debugging messages (see
3356 \fBzfs_dbgmsg_enable\fR) for all zios, rather than only for leaf
3357 zios possessing a vdev. This is meant to be used by developers to gain
3358 diagnostic information for hang conditions which don't involve a mutex
3359 or other locking primitive; typically conditions in which a thread in
3360 the zio pipeline is looping indefinitely.
3361 .sp
3362 Default value: \fB0\fR.
3363 .RE
3364
3365 .sp
3366 .ne 2
3367 .na
3368 \fBzio_decompress_fail_fraction\fR (int)
3369 .ad
3370 .RS 12n
3371 If non-zero, this value represents the denominator of the probability that zfs
3372 should induce a decompression failure. For instance, for a 5% decompression
3373 failure rate, this value should be set to 20.
3374 .sp
3375 Default value: \fB0\fR.
3376 .RE
3377
3378 .sp
3379 .ne 2
3380 .na
3381 \fBzio_slow_io_ms\fR (int)
3382 .ad
3383 .RS 12n
3384 An I/O operation that takes more than \fBzio_slow_io_ms\fR milliseconds to
3385 complete is marked as a slow I/O. Each slow I/O causes a delay zevent. Slow
3386 I/O counters can be seen with "zpool status -s".
3387
3388 .sp
3389 Default value: \fB30,000\fR.
3390 .RE
3391
3392 .sp
3393 .ne 2
3394 .na
3395 \fBzio_dva_throttle_enabled\fR (int)
3396 .ad
3397 .RS 12n
3398 Throttle block allocations in the I/O pipeline. This allows for
3399 dynamic allocation distribution when devices are imbalanced.
3400 When enabled, the maximum number of pending allocations per top-level vdev
3401 is limited by \fBzfs_vdev_queue_depth_pct\fR.
3402 .sp
3403 Default value: \fB1\fR.
3404 .RE
3405
3406 .sp
3407 .ne 2
3408 .na
3409 \fBzio_requeue_io_start_cut_in_line\fR (int)
3410 .ad
3411 .RS 12n
3412 Prioritize requeued I/O
3413 .sp
3414 Default value: \fB0\fR.
3415 .RE
3416
3417 .sp
3418 .ne 2
3419 .na
3420 \fBzio_taskq_batch_pct\fR (uint)
3421 .ad
3422 .RS 12n
3423 Percentage of online CPUs (or CPU cores, etc.) which will run a worker thread
3424 for I/O. These workers are responsible for I/O work such as compression and
3425 checksum calculations. A fractional number of CPUs will be rounded down.
3426 .sp
3427 The default value of 75 was chosen to avoid using all CPUs which can result in
3428 latency issues and inconsistent application performance, especially when high
3429 compression is enabled.
3430 .sp
3431 Default value: \fB75\fR.
3432 .RE
3433
3434 .sp
3435 .ne 2
3436 .na
3437 \fBzvol_inhibit_dev\fR (uint)
3438 .ad
3439 .RS 12n
3440 Do not create zvol device nodes. This may slightly improve startup time on
3441 systems with a very large number of zvols.
3442 .sp
3443 Use \fB1\fR for yes and \fB0\fR for no (default).
3444 .RE
3445
3446 .sp
3447 .ne 2
3448 .na
3449 \fBzvol_major\fR (uint)
3450 .ad
3451 .RS 12n
3452 Major number for zvol block devices
3453 .sp
3454 Default value: \fB230\fR.
3455 .RE
3456
3457 .sp
3458 .ne 2
3459 .na
3460 \fBzvol_max_discard_blocks\fR (ulong)
3461 .ad
3462 .RS 12n
3463 Discard (aka TRIM) operations done on zvols will be done in batches of this
3464 many blocks, where block size is determined by the \fBvolblocksize\fR property
3465 of a zvol.
3466 .sp
3467 Default value: \fB16,384\fR.
3468 .RE
3469
3470 .sp
3471 .ne 2
3472 .na
3473 \fBzvol_prefetch_bytes\fR (uint)
3474 .ad
3475 .RS 12n
3476 When adding a zvol to the system prefetch \fBzvol_prefetch_bytes\fR
3477 from the start and end of the volume. Prefetching these regions
3478 of the volume is desirable because they are likely to be accessed
3479 immediately by \fBblkid(8)\fR or by the kernel scanning for a partition
3480 table.
3481 .sp
3482 Default value: \fB131,072\fR.
3483 .RE
3484
3485 .sp
3486 .ne 2
3487 .na
3488 \fBzvol_request_sync\fR (uint)
3489 .ad
3490 .RS 12n
3491 When processing I/O requests for a zvol, submit them synchronously. This
3492 effectively limits the queue depth to 1 for each I/O submitter. When set
3493 to 0, requests are handled asynchronously by a thread pool. The number of
3494 requests which can be handled concurrently is controlled by \fBzvol_threads\fR.
3495 .sp
3496 Default value: \fB0\fR.
3497 .RE
3498
3499 .sp
3500 .ne 2
3501 .na
3502 \fBzvol_threads\fR (uint)
3503 .ad
3504 .RS 12n
3505 Max number of threads which can handle zvol I/O requests concurrently.
3506 .sp
3507 Default value: \fB32\fR.
3508 .RE
3509
3510 .sp
3511 .ne 2
3512 .na
3513 \fBzvol_volmode\fR (uint)
3514 .ad
3515 .RS 12n
3516 Defines the behavior of zvol block devices when \fBvolmode\fR is set to \fBdefault\fR.
3517 Valid values are \fB1\fR (full), \fB2\fR (dev) and \fB3\fR (none).
3518 .sp
3519 Default value: \fB1\fR.
3520 .RE
3521
3522 .SH ZFS I/O SCHEDULER
3523 ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os.
3524 The I/O scheduler determines when and in what order those operations are
3525 issued. The I/O scheduler divides operations into five I/O classes
3526 prioritized in the following order: sync read, sync write, async read,
3527 async write, and scrub/resilver. Each queue defines the minimum and
3528 maximum number of concurrent operations that may be issued to the
3529 device. In addition, the device has an aggregate maximum,
3530 \fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums
3531 must not exceed the aggregate maximum. If the sum of the per-queue
3532 maximums exceeds the aggregate maximum, then the number of active I/Os
3533 may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will
3534 be issued regardless of whether all per-queue minimums have been met.
3535 .sp
3536 For many physical devices, throughput increases with the number of
3537 concurrent operations, but latency typically suffers. Further, physical
3538 devices typically have a limit at which more concurrent operations have no
3539 effect on throughput or can actually cause it to decrease.
3540 .sp
3541 The scheduler selects the next operation to issue by first looking for an
3542 I/O class whose minimum has not been satisfied. Once all are satisfied and
3543 the aggregate maximum has not been hit, the scheduler looks for classes
3544 whose maximum has not been satisfied. Iteration through the I/O classes is
3545 done in the order specified above. No further operations are issued if the
3546 aggregate maximum number of concurrent operations has been hit or if there
3547 are no operations queued for an I/O class that has not hit its maximum.
3548 Every time an I/O is queued or an operation completes, the I/O scheduler
3549 looks for new operations to issue.
3550 .sp
3551 In general, smaller max_active's will lead to lower latency of synchronous
3552 operations. Larger max_active's may lead to higher overall throughput,
3553 depending on underlying storage.
3554 .sp
3555 The ratio of the queues' max_actives determines the balance of performance
3556 between reads, writes, and scrubs. E.g., increasing
3557 \fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete
3558 more quickly, but reads and writes will have higher latency and lower throughput.
3559 .sp
3560 All I/O classes have a fixed maximum number of outstanding operations
3561 except for the async write class. Asynchronous writes represent the data
3562 that is committed to stable storage during the syncing stage for
3563 transaction groups. Transaction groups enter the syncing state
3564 periodically so the number of queued async writes will quickly burst up
3565 and then bleed down to zero. Rather than servicing them as quickly as
3566 possible, the I/O scheduler changes the maximum number of active async
3567 write I/Os according to the amount of dirty data in the pool. Since
3568 both throughput and latency typically increase with the number of
3569 concurrent operations issued to physical devices, reducing the
3570 burstiness in the number of concurrent operations also stabilizes the
3571 response time of operations from other -- and in particular synchronous
3572 -- queues. In broad strokes, the I/O scheduler will issue more
3573 concurrent operations from the async write queue as there's more dirty
3574 data in the pool.
3575 .sp
3576 Async Writes
3577 .sp
3578 The number of concurrent operations issued for the async write I/O class
3579 follows a piece-wise linear function defined by a few adjustable points.
3580 .nf
3581
3582 | o---------| <-- zfs_vdev_async_write_max_active
3583 ^ | /^ |
3584 | | / | |
3585 active | / | |
3586 I/O | / | |
3587 count | / | |
3588 | / | |
3589 |-------o | | <-- zfs_vdev_async_write_min_active
3590 0|_______^______|_________|
3591 0% | | 100% of zfs_dirty_data_max
3592 | |
3593 | `-- zfs_vdev_async_write_active_max_dirty_percent
3594 `--------- zfs_vdev_async_write_active_min_dirty_percent
3595
3596 .fi
3597 Until the amount of dirty data exceeds a minimum percentage of the dirty
3598 data allowed in the pool, the I/O scheduler will limit the number of
3599 concurrent operations to the minimum. As that threshold is crossed, the
3600 number of concurrent operations issued increases linearly to the maximum at
3601 the specified maximum percentage of the dirty data allowed in the pool.
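.sp
As an illustration with made-up values for the related tunables (these are not
necessarily the defaults): if the minimum active count were 1, the maximum 10,
and the two dirty-data percentages 30% and 60%, then at 45% of
\fBzfs_dirty_data_max\fR the pool would sit halfway up the slope:
.sp
.nf
  active async writes = 1 + ((45 - 30) / (60 - 30)) * (10 - 1) = 5.5,
  so roughly 5 or 6 concurrent async writes would be allowed.
.fi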
3602 .sp
3603 Ideally, the amount of dirty data on a busy pool will stay in the sloped
3604 part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR
3605 and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the
3606 maximum percentage, this indicates that the rate of incoming data is
3607 greater than the rate that the backend storage can handle. In this case, we
3608 must further throttle incoming writes, as described in the next section.
3609
3610 .SH ZFS TRANSACTION DELAY
3611 We delay transactions when we've determined that the backend storage
3612 isn't able to accommodate the rate of incoming writes.
3613 .sp
3614 If there is already a transaction waiting, we delay relative to when
3615 that transaction will finish waiting. This way the calculated delay time
3616 is independent of the number of threads concurrently executing
3617 transactions.
3618 .sp
3619 If we are the only waiter, wait relative to when the transaction
3620 started, rather than the current time. This credits the transaction for
3621 "time already served", e.g. reading indirect blocks.
3622 .sp
3623 The minimum time for a transaction to take is calculated as:
3624 .nf
3625 min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
3626 min_time is then capped at 100 milliseconds.
3627 .fi
3628 .sp
3629 The delay has two degrees of freedom that can be adjusted via tunables. The
3630 percentage of dirty data at which we start to delay is defined by
3631 \fBzfs_delay_min_dirty_percent\fR. This should typically be at or above
3632 \fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to
3633 delay after writing at full speed has failed to keep up with the incoming write
3634 rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking,
3635 this variable determines the amount of delay at the midpoint of the curve.
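.sp
As a worked example of the formula above: when the amount of dirty data sits
exactly halfway between the delay threshold and \fBzfs_dirty_data_max\fR,
(dirty - min) equals (max - dirty), so the formula collapses to
\fBzfs_delay_scale\fR itself. Assuming the default \fBzfs_delay_scale\fR of
500,000ns, the midpoint delay is about 500us per transaction, or roughly
2000 IOPS, which matches the midpoint marked on the curves below.
.sp
.nf
  min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
           = zfs_delay_scale          (when dirty - min = max - dirty)
           = 500us by default  ->  about 2000 IOPS
.fi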
3636 .sp
3637 .nf
3638 delay
3639 10ms +-------------------------------------------------------------*+
3640 | *|
3641 9ms + *+
3642 | *|
3643 8ms + *+
3644 | * |
3645 7ms + * +
3646 | * |
3647 6ms + * +
3648 | * |
3649 5ms + * +
3650 | * |
3651 4ms + * +
3652 | * |
3653 3ms + * +
3654 | * |
3655 2ms + (midpoint) * +
3656 | | ** |
3657 1ms + v *** +
3658 | zfs_delay_scale ----------> ******** |
3659 0 +-------------------------------------*********----------------+
3660 0% <- zfs_dirty_data_max -> 100%
3661 .fi
3662 .sp
3663 Note that since the delay is added to the outstanding time remaining on the
3664 most recent transaction, the delay is effectively the inverse of IOPS.
3665 Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
3666 was chosen such that small changes in the amount of accumulated dirty data
3667 in the first 3/4 of the curve yield relatively small differences in the
3668 amount of delay.
3669 .sp
3670 The effects can be easier to understand when the amount of delay is
3671 represented on a log scale:
3672 .sp
3673 .nf
3674 delay
3675 100ms +-------------------------------------------------------------++
3676 + +
3677 | |
3678 + *+
3679 10ms + *+
3680 + ** +
3681 | (midpoint) ** |
3682 + | ** +
3683 1ms + v **** +
3684 + zfs_delay_scale ----------> ***** +
3685 | **** |
3686 + **** +
3687 100us + ** +
3688 + * +
3689 | * |
3690 + * +
3691 10us + * +
3692 + +
3693 | |
3694 + +
3695 +--------------------------------------------------------------+
3696 0% <- zfs_dirty_data_max -> 100%
3697 .fi
3698 .sp
3699 Note here that only as the amount of dirty data approaches its limit does
3700 the delay start to increase rapidly. The goal of a properly tuned system
3701 should be to keep the amount of dirty data out of that range by first
3702 ensuring that the appropriate limits are set for the I/O scheduler to reach
3703 optimal throughput on the backend storage, and then by changing the value
3704 of \fBzfs_delay_scale\fR to increase the steepness of the curve.