2 .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
3 .\" The contents of this file are subject to the terms of the Common Development
4 .\" and Distribution License (the "License"). You may not use this file except
5 .\" in compliance with the License. You can obtain a copy of the license at
6 .\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing.
8 .\" See the License for the specific language governing permissions and
9 .\" limitations under the License. When distributing Covered Code, include this
10 .\" CDDL HEADER in each file and include the License file at
11 .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
12 .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
13 .\" own identifying information:
14 .\" Portions Copyright [yyyy] [name of copyright owner]
.TH ZFS-MODULE-PARAMETERS 5 "Nov 16, 2013"
.SH NAME
zfs\-module\-parameters \- ZFS module parameters
.SH DESCRIPTION
Description of the different parameters to the ZFS module.
.SS "Module parameters"
.TP
\fBl2arc_feed_again\fR (int)
Turbo L2ARC warmup.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBl2arc_feed_min_ms\fR (ulong)
Min feed interval in milliseconds.
Default value: \fB200\fR.
.TP
\fBl2arc_feed_secs\fR (ulong)
Seconds between L2ARC writing.
Default value: \fB1\fR.
.TP
\fBl2arc_headroom\fR (ulong)
Number of max device writes to precache.
Default value: \fB2\fR.
.TP
\fBl2arc_headroom_boost\fR (ulong)
Compressed l2arc_headroom multiplier.
Default value: \fB200\fR.
.TP
\fBl2arc_nocompress\fR (int)
Skip compressing L2ARC buffers.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBl2arc_noprefetch\fR (int)
Skip caching prefetched buffers.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBl2arc_norw\fR (int)
No reads during writes.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBl2arc_write_boost\fR (ulong)
Extra write bytes during device warmup.
Default value: \fB8,388,608\fR.
.TP
\fBl2arc_write_max\fR (ulong)
Max write bytes per interval.
Default value: \fB8,388,608\fR.
.TP
\fBmetaslab_debug_load\fR (int)
Load all metaslabs during pool import.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBmetaslab_debug_unload\fR (int)
Prevent metaslabs from being unloaded.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBspa_config_path\fR (charp)
SPA config file.
Default value: \fB/etc/zfs/zpool.cache\fR.
.TP
\fBspa_asize_inflation\fR (int)
Multiplication factor used to estimate actual disk consumption from the
size of data being written. The default value is a worst-case estimate,
but lower values may be valid for a given pool depending on its
configuration. Pool administrators who understand the factors involved
may wish to specify a more realistic inflation factor, particularly if
they operate close to quota or capacity limits.
Default value: \fB24\fR.
.TP
\fBzfetch_array_rd_sz\fR (ulong)
If prefetching is enabled, disable prefetching for reads larger than this size.
Default value: \fB1,048,576\fR.
.TP
\fBzfetch_block_cap\fR (uint)
Max number of blocks to prefetch at a time.
Default value: \fB256\fR.
.TP
\fBzfetch_max_streams\fR (uint)
Max number of streams per zfetch (prefetch streams per file).
Default value: \fB8\fR.
.TP
\fBzfetch_min_sec_reap\fR (uint)
Min time before an active prefetch stream can be reclaimed.
Default value: \fB2\fR.
.TP
\fBzfs_arc_grow_retry\fR (int)
Seconds before growing arc size.
Default value: \fB5\fR.
.TP
\fBzfs_arc_max\fR (ulong)
Max arc size.
Default value: \fB0\fR.
.TP
\fBzfs_arc_memory_throttle_disable\fR (int)
Disable memory throttle.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_arc_meta_limit\fR (ulong)
Meta limit for arc size.
Default value: \fB0\fR.
.TP
\fBzfs_arc_meta_prune\fR (int)
Bytes of meta data to prune.
Default value: \fB1,048,576\fR.
.TP
\fBzfs_arc_min\fR (ulong)
Min arc size.
Default value: \fB100\fR.
.TP
\fBzfs_arc_min_prefetch_lifespan\fR (int)
Min life of prefetch block.
Default value: \fB100\fR.
.TP
\fBzfs_arc_p_aggressive_disable\fR (int)
Disable aggressive arc_p growth.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_arc_p_dampener_disable\fR (int)
Disable arc_p adapt dampener.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_arc_shrink_shift\fR (int)
log2(fraction of arc to reclaim).
Default value: \fB5\fR.
.TP
\fBzfs_autoimport_disable\fR (int)
Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR).
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_dbuf_state_index\fR (int)
Calculate arc header index.
Default value: \fB0\fR.
.TP
\fBzfs_deadman_enabled\fR (int)
Enable deadman timer.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_deadman_synctime_ms\fR (ulong)
Expiration time in milliseconds. This value has two meanings. First, it
determines when the spa_deadman() logic should fire: by default,
spa_deadman() fires if spa_sync() has not completed in 1,000 seconds.
Second, it determines whether an I/O is considered "hung": any I/O that
has not completed within \fBzfs_deadman_synctime_ms\fR is considered
"hung", and a zevent is logged for it.
Default value: \fB1,000,000\fR.
.TP
\fBzfs_dedup_prefetch\fR (int)
Enable prefetching of dedup-ed blocks.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_delay_min_dirty_percent\fR (int)
Start to delay each transaction once there is this amount of dirty data,
expressed as a percentage of \fBzfs_dirty_data_max\fR.
This value should be >= \fBzfs_vdev_async_write_active_max_dirty_percent\fR.
See the section "ZFS TRANSACTION DELAY".
Default value: \fB60\fR.
.TP
\fBzfs_delay_scale\fR (int)
This controls how quickly the transaction delay approaches infinity.
Larger values cause longer delays for a given amount of dirty data.
For the smoothest delay, this value should be about 1 billion divided
by the maximum number of operations per second. This will smoothly
handle between 10x and 1/10th this number.
See the section "ZFS TRANSACTION DELAY".
Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64.
Default value: \fB500,000\fR.
.TP
\fBzfs_dirty_data_max\fR (int)
Determines the dirty space limit in bytes. Once this limit is exceeded, new
writes are halted until space frees up. This parameter takes precedence
over \fBzfs_dirty_data_max_percent\fR.
See the section "ZFS TRANSACTION DELAY".
Default value: 10 percent of all memory, capped at \fBzfs_dirty_data_max_max\fR.
.TP
\fBzfs_dirty_data_max_max\fR (int)
Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes.
This limit is only enforced at module load time, and will be ignored if
\fBzfs_dirty_data_max\fR is later changed. This parameter takes
precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section
"ZFS TRANSACTION DELAY".
Default value: 25% of physical RAM.
.TP
\fBzfs_dirty_data_max_max_percent\fR (int)
Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a
percentage of physical RAM. This limit is only enforced at module load
time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed.
The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this
one. See the section "ZFS TRANSACTION DELAY".
Default value: \fB25\fR.
.TP
\fBzfs_dirty_data_max_percent\fR (int)
Determines the dirty space limit, expressed as a percentage of all
memory. Once this limit is exceeded, new writes are halted until space frees
up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this
one. See the section "ZFS TRANSACTION DELAY".
Default value: 10%, subject to \fBzfs_dirty_data_max_max\fR.
.TP
\fBzfs_dirty_data_sync\fR (int)
Start syncing out a transaction group if there is at least this much dirty data.
Default value: \fB67,108,864\fR.
.TP
\fBzfs_vdev_async_read_max_active\fR (int)
Maximum asynchronous read I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB3\fR.
.TP
\fBzfs_vdev_async_read_min_active\fR (int)
Minimum asynchronous read I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB1\fR.
.TP
\fBzfs_vdev_async_write_active_max_dirty_percent\fR (int)
When the pool has more than
\fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use
\fBzfs_vdev_async_write_max_active\fR to limit active async writes. If
the dirty data is between min and max, the active I/O limit is linearly
interpolated. See the section "ZFS I/O SCHEDULER".
Default value: \fB60\fR.
.TP
\fBzfs_vdev_async_write_active_min_dirty_percent\fR (int)
When the pool has less than
\fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use
\fBzfs_vdev_async_write_min_active\fR to limit active async writes. If
the dirty data is between min and max, the active I/O limit is linearly
interpolated. See the section "ZFS I/O SCHEDULER".
Default value: \fB30\fR.
.TP
\fBzfs_vdev_async_write_max_active\fR (int)
Maximum asynchronous write I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.TP
\fBzfs_vdev_async_write_min_active\fR (int)
Minimum asynchronous write I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB1\fR.
.TP
\fBzfs_vdev_max_active\fR (int)
The maximum number of I/Os active to each device. Ideally, this will be >=
the sum of each queue's max_active. It must be at least the sum of each
queue's min_active. See the section "ZFS I/O SCHEDULER".
Default value: \fB1,000\fR.
.TP
\fBzfs_vdev_scrub_max_active\fR (int)
Maximum scrub I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB2\fR.
.TP
\fBzfs_vdev_scrub_min_active\fR (int)
Minimum scrub I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB1\fR.
.TP
\fBzfs_vdev_sync_read_max_active\fR (int)
Maximum synchronous read I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.TP
\fBzfs_vdev_sync_read_min_active\fR (int)
Minimum synchronous read I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.TP
\fBzfs_vdev_sync_write_max_active\fR (int)
Maximum synchronous write I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.TP
\fBzfs_vdev_sync_write_min_active\fR (int)
Minimum synchronous write I/Os active to each device.
See the section "ZFS I/O SCHEDULER".
Default value: \fB10\fR.
.TP
\fBzfs_disable_dup_eviction\fR (int)
Disable duplicate buffer eviction.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_expire_snapshot\fR (int)
Seconds to expire .zfs/snapshot.
Default value: \fB300\fR.
.TP
\fBzfs_flags\fR (int)
Set additional debugging flags.
Default value: \fB1\fR.
.TP
\fBzfs_free_min_time_ms\fR (int)
Min milliseconds to free per txg.
Default value: \fB1,000\fR.
.TP
\fBzfs_immediate_write_sz\fR (long)
Largest data block to write to the ZIL.
Default value: \fB32,768\fR.
.TP
\fBzfs_mdcomp_disable\fR (int)
Disable metadata compression.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_mg_noalloc_threshold\fR (int)
Defines a threshold at which metaslab groups should be eligible for
allocations. The value is expressed as a percentage of free space
beyond which a metaslab group is always eligible for allocations.
If a metaslab group's free space is less than or equal to the
threshold, the allocator will avoid allocating to that group
unless all groups in the pool have reached the threshold. Once all
groups have reached the threshold, all groups are allowed to accept
allocations. The default value of 0 disables the feature and causes
all metaslab groups to be eligible for allocations.
.sp
This parameter makes it possible to deal with pools having heavily imbalanced
vdevs such as would be the case when a new vdev has been added.
Setting the threshold to a non-zero percentage will stop allocations
from being made to vdevs that aren't filled to the specified percentage
and allow lesser filled vdevs to acquire more allocations than they
otherwise would under the old \fBzfs_mg_alloc_failures\fR facility.
Default value: \fB0\fR.
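The eligibility rule can be sketched in a few lines of Python. This is an
illustrative model only, not the in-kernel allocator; the function and group
names are hypothetical:

```python
def eligible_groups(free_pct_by_group, threshold):
    """Return the metaslab groups an allocation may target under a
    zfs_mg_noalloc_threshold-style rule (simplified model).

    free_pct_by_group maps a group name to its free space in percent.
    """
    # A threshold of 0 disables the feature: every group is eligible.
    if threshold == 0:
        return list(free_pct_by_group)
    # Groups whose free space exceeds the threshold are preferred.
    above = [g for g, free in free_pct_by_group.items() if free > threshold]
    # Only when every group is at or below the threshold do all
    # groups become eligible again.
    return above if above else list(free_pct_by_group)

# A newly added vdev (90% free) absorbs allocations; the nearly full
# one is skipped until both drop to the threshold.
groups = {"old-vdev": 5, "new-vdev": 90}
print(eligible_groups(groups, 30))   # ['new-vdev']
print(eligible_groups(groups, 0))    # both eligible (feature disabled)
```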
.TP
\fBzfs_no_scrub_io\fR (int)
Set for no scrub I/O.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_no_scrub_prefetch\fR (int)
Set for no scrub prefetching.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_nocacheflush\fR (int)
Disable cache flushes.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_nopwrite_enabled\fR (int)
Enable NOP writes.
Use \fB1\fR for yes (default) and \fB0\fR to disable.
.TP
\fBzfs_pd_blks_max\fR (int)
Max number of blocks to prefetch.
Default value: \fB100\fR.
.TP
\fBzfs_prefetch_disable\fR (int)
Disable all ZFS prefetching.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_read_chunk_size\fR (long)
Bytes to read per chunk.
Default value: \fB1,048,576\fR.
.TP
\fBzfs_read_history\fR (int)
Historic statistics for the last N reads.
Default value: \fB0\fR.
.TP
\fBzfs_read_history_hits\fR (int)
Include cache hits in read history.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_recover\fR (int)
Set to attempt to recover from fatal errors. This should only be used as a
last resort, as it typically results in leaked space, or worse.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_resilver_delay\fR (int)
Number of ticks to delay prior to issuing a resilver I/O operation when
a non-resilver or non-scrub I/O operation has occurred within the past
\fBzfs_scan_idle\fR ticks.
Default value: \fB2\fR.
.TP
\fBzfs_resilver_min_time_ms\fR (int)
Min milliseconds to resilver per txg.
Default value: \fB3,000\fR.
.TP
\fBzfs_scan_idle\fR (int)
Idle window in clock ticks. During a scrub or a resilver, if
a non-scrub or non-resilver I/O operation has occurred during this
window, the next scrub or resilver operation is delayed by
\fBzfs_scrub_delay\fR or \fBzfs_resilver_delay\fR ticks, respectively.
Default value: \fB50\fR.
.TP
\fBzfs_scan_min_time_ms\fR (int)
Min milliseconds to scrub per txg.
Default value: \fB1,000\fR.
.TP
\fBzfs_scrub_delay\fR (int)
Number of ticks to delay prior to issuing a scrub I/O operation when
a non-scrub or non-resilver I/O operation has occurred within the past
\fBzfs_scan_idle\fR ticks.
Default value: \fB4\fR.
.TP
\fBzfs_send_corrupt_data\fR (int)
Allow sending of corrupt data (ignore read/checksum errors when sending data).
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_sync_pass_deferred_free\fR (int)
Defer frees starting in this pass.
Default value: \fB2\fR.
.TP
\fBzfs_sync_pass_dont_compress\fR (int)
Don't compress starting in this pass.
Default value: \fB5\fR.
.TP
\fBzfs_sync_pass_rewrite\fR (int)
Rewrite new bps starting in this pass.
Default value: \fB2\fR.
.TP
\fBzfs_top_maxinflight\fR (int)
Max I/Os per top-level vdev during scrub or resilver operations.
Default value: \fB32\fR.
.TP
\fBzfs_txg_history\fR (int)
Historic statistics for the last N txgs.
Default value: \fB0\fR.
.TP
\fBzfs_txg_timeout\fR (int)
Max seconds worth of delta per txg.
Default value: \fB5\fR.
.TP
\fBzfs_vdev_aggregation_limit\fR (int)
Max vdev I/O aggregation size.
Default value: \fB131,072\fR.
.TP
\fBzfs_vdev_cache_bshift\fR (int)
Shift size to inflate reads to.
Default value: \fB16\fR.
.TP
\fBzfs_vdev_cache_max\fR (int)
Inflate reads smaller than this value.
Default value: \fB16,384\fR.
.TP
\fBzfs_vdev_cache_size\fR (int)
Total size of the per-disk cache.
Default value: \fB0\fR.
.TP
\fBzfs_vdev_mirror_switch_us\fR (int)
Switch mirrors every N usecs.
Default value: \fB10,000\fR.
.TP
\fBzfs_vdev_read_gap_limit\fR (int)
Aggregate read I/O over gap.
Default value: \fB32,768\fR.
.TP
\fBzfs_vdev_scheduler\fR (charp)
I/O scheduler.
Default value: \fBnoop\fR.
.TP
\fBzfs_vdev_write_gap_limit\fR (int)
Aggregate write I/O over gap.
Default value: \fB4,096\fR.
.TP
\fBzfs_zevent_cols\fR (int)
Max event column width.
Default value: \fB80\fR.
.TP
\fBzfs_zevent_console\fR (int)
Log events to the console.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzfs_zevent_len_max\fR (int)
Max event queue length.
Default value: \fB0\fR.
.TP
\fBzil_replay_disable\fR (int)
Disable intent logging replay.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzil_slog_limit\fR (ulong)
Max commit bytes to separate log device.
Default value: \fB1,048,576\fR.
.TP
\fBzio_bulk_flags\fR (int)
Additional flags to pass to bulk buffers.
Default value: \fB0\fR.
.TP
\fBzio_delay_max\fR (int)
Max zio millisecond delay before posting an event.
Default value: \fB30,000\fR.
.TP
\fBzio_injection_enabled\fR (int)
Enable fault injection.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzio_requeue_io_start_cut_in_line\fR (int)
Prioritize requeued I/O.
Default value: \fB0\fR.
.TP
\fBzvol_inhibit_dev\fR (uint)
Do not create zvol device nodes.
Use \fB1\fR for yes and \fB0\fR for no (default).
.TP
\fBzvol_major\fR (uint)
Major number for zvol device.
Default value: \fB230\fR.
.TP
\fBzvol_max_discard_blocks\fR (ulong)
Max number of blocks to discard at once.
Default value: \fB16,384\fR.
.TP
\fBzvol_threads\fR (uint)
Number of threads for zvol device.
Default value: \fB32\fR.
.SH ZFS I/O SCHEDULER
ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os.
The I/O scheduler determines when and in what order those operations are
issued. The I/O scheduler divides operations into five I/O classes
prioritized in the following order: sync read, sync write, async read,
async write, and scrub/resilver. Each queue defines the minimum and
maximum number of concurrent operations that may be issued to the
device. In addition, the device has an aggregate maximum,
\fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums
must not exceed the aggregate maximum. If the sum of the per-queue
maximums exceeds the aggregate maximum, then the number of active I/Os
may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will
be issued regardless of whether all per-queue minimums have been met.
.sp
For many physical devices, throughput increases with the number of
concurrent operations, but latency typically suffers. Further, physical
devices typically have a limit at which more concurrent operations have no
effect on throughput or can actually cause it to decrease.
.sp
The scheduler selects the next operation to issue by first looking for an
I/O class whose minimum has not been satisfied. Once all are satisfied and
the aggregate maximum has not been hit, the scheduler looks for classes
whose maximum has not been satisfied. Iteration through the I/O classes is
done in the order specified above. No further operations are issued if the
aggregate maximum number of concurrent operations has been hit or if there
are no operations queued for an I/O class that has not hit its maximum.
Every time an I/O is queued or an operation completes, the I/O scheduler
looks for new operations to issue.
.sp
In general, smaller max_active's will lead to lower latency of synchronous
operations. Larger max_active's may lead to higher overall throughput,
depending on underlying storage.
.sp
The ratio of the queues' max_actives determines the balance of performance
between reads, writes, and scrubs. E.g., increasing
\fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete
more quickly, but reads and writes to have higher latency and lower throughput.
.sp
All I/O classes have a fixed maximum number of outstanding operations
except for the async write class. Asynchronous writes represent the data
that is committed to stable storage during the syncing stage for
transaction groups. Transaction groups enter the syncing state
periodically so the number of queued async writes will quickly burst up
and then bleed down to zero. Rather than servicing them as quickly as
possible, the I/O scheduler changes the maximum number of active async
write I/Os according to the amount of dirty data in the pool. Since
both throughput and latency typically increase with the number of
concurrent operations issued to physical devices, reducing the
burstiness in the number of concurrent operations also stabilizes the
response time of operations from other -- and in particular synchronous
-- queues. In broad strokes, the I/O scheduler will issue more
concurrent operations from the async write queue as there's more dirty
data in the pool.
.sp
The number of concurrent operations issued for the async write I/O class
follows a piece-wise linear function defined by a few adjustable points.
.sp
.nf
       |              o---------| <-- zfs_vdev_async_write_max_active
  ^    |             /^         |
  |    |            / |         |
active |           /  |         |
 I/O   |          /   |         |
count  |         /    |         |
       |        /     |         |
       |-------o      |         | <-- zfs_vdev_async_write_min_active
      0|_______^______|_________|
       0%      |      |        100% of zfs_dirty_data_max
               |      |
               |      `-- zfs_vdev_async_write_active_max_dirty_percent
               `--------- zfs_vdev_async_write_active_min_dirty_percent
.fi
.sp
Until the amount of dirty data exceeds a minimum percentage of the dirty
data allowed in the pool, the I/O scheduler will limit the number of
concurrent operations to the minimum. As that threshold is crossed, the
number of concurrent operations issued increases linearly to the maximum at
the specified maximum percentage of the dirty data allowed in the pool.
.sp
Ideally, the amount of dirty data on a busy pool will stay in the sloped
part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR
and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the
maximum percentage, this indicates that the rate of incoming data is
greater than the rate that the backend storage can handle. In this case, we
must further throttle incoming writes, as described in the next section.
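The scaling described above can be modeled as a simple piece-wise linear
interpolation. This is an illustrative sketch using the default tunable
values, not the in-kernel implementation:

```python
def async_write_max_active(dirty_pct,
                           min_active=1,    # zfs_vdev_async_write_min_active
                           max_active=10,   # zfs_vdev_async_write_max_active
                           min_dirty=30,    # ..._active_min_dirty_percent
                           max_dirty=60):   # ..._active_max_dirty_percent
    """Active async write I/Os allowed for a given amount of dirty data,
    expressed as a percent of zfs_dirty_data_max (simplified model)."""
    if dirty_pct <= min_dirty:
        return min_active
    if dirty_pct >= max_dirty:
        return max_active
    # Linear interpolation between the two breakpoints.
    span = (dirty_pct - min_dirty) / (max_dirty - min_dirty)
    return min_active + int(span * (max_active - min_active))

print(async_write_max_active(10))   # 1  (below min_dirty: use min_active)
print(async_write_max_active(45))   # 5  (midway up the slope)
print(async_write_max_active(80))   # 10 (above max_dirty: use max_active)
```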
.SH ZFS TRANSACTION DELAY
We delay transactions when we've determined that the backend storage
isn't able to accommodate the rate of incoming writes.
.sp
If there is already a transaction waiting, we delay relative to when
that transaction will finish waiting. This way the calculated delay time
is independent of the number of threads concurrently executing
transactions.
.sp
If we are the only waiter, wait relative to when the transaction
started, rather than the current time. This credits the transaction for
"time already served", e.g. reading indirect blocks.
.sp
The minimum time for a transaction to take is calculated as:
.sp
.nf
    min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
    min_time is then capped at 100 milliseconds.
.fi
.sp
The delay has two degrees of freedom that can be adjusted via tunables. The
percentage of dirty data at which we start to delay is defined by
\fBzfs_delay_min_dirty_percent\fR. This should typically be at or above
\fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to
delay after writing at full speed has failed to keep up with the incoming write
rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking,
this variable determines the amount of delay at the midpoint of the curve.
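As a sketch, the formula above can be evaluated numerically. This is an
illustrative model, not the in-kernel code: dirty_min stands for
\fBzfs_delay_min_dirty_percent\fR of \fBzfs_dirty_data_max\fR and dirty_max
for \fBzfs_dirty_data_max\fR itself, here both expressed in percent, with
\fBzfs_delay_scale\fR in nanoseconds:

```python
def tx_delay_ns(dirty, dirty_min, dirty_max, delay_scale=500_000):
    """Minimum transaction time in nanoseconds:
    min_time = zfs_delay_scale * (dirty - min) / (max - dirty),
    capped at 100 milliseconds (simplified model)."""
    if dirty <= dirty_min:
        return 0                    # below the delay threshold: no delay
    if dirty >= dirty_max:
        return 100_000_000          # cap: 100 ms in nanoseconds
    delay = delay_scale * (dirty - dirty_min) // (dirty_max - dirty)
    return min(delay, 100_000_000)

# With zfs_delay_min_dirty_percent=60, the midpoint of the curve
# (dirty = 80%) yields zfs_delay_scale itself:
print(tx_delay_ns(80, 60, 100))     # 500000 ns = 500us, i.e. ~2000 IOPS
```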
.sp
.nf
10ms +-------------------------------------------------------------*+
     |                                                             * |
 2ms +                                              (midpoint)    *  +
     |       zfs_delay_scale ---------->           ********          |
   0 +-------------------------------------*********----------------+
     0%                  <- zfs_dirty_data_max ->                100%
.fi
.sp
Note that since the delay is added to the outstanding time remaining on the
most recent transaction, the delay is effectively the inverse of IOPS.
Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
was chosen such that small changes in the amount of accumulated dirty data
in the first 3/4 of the curve yield relatively small differences in the
amount of delay.
.sp
The effects can be easier to understand when the amount of delay is
represented on a log scale:
.sp
.nf
100ms +------------------------------------------------------------*+
      |                                                            * |
      |       zfs_delay_scale ---------->               *****        |
      +--------------------------------------------------------------+
      0%                  <- zfs_dirty_data_max ->                100%
.fi
.sp
Note here that only as the amount of dirty data approaches its limit does
the delay start to increase rapidly. The goal of a properly tuned system
should be to keep the amount of dirty data out of that range by first
ensuring that the appropriate limits are set for the I/O scheduler to reach
optimal throughput on the backend storage, and then by changing the value
of \fBzfs_delay_scale\fR to increase the steepness of the curve.