1'\" te
2.\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
3.\" The contents of this file are subject to the terms of the Common Development
4.\" and Distribution License (the "License"). You may not use this file except
5.\" in compliance with the License. You can obtain a copy of the license at
6.\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing.
7.\"
8.\" See the License for the specific language governing permissions and
9.\" limitations under the License. When distributing Covered Code, include this
10.\" CDDL HEADER in each file and include the License file at
11.\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
12.\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
13.\" own identifying information:
14.\" Portions Copyright [yyyy] [name of copyright owner]
15.TH ZFS-MODULE-PARAMETERS 5 "Nov 16, 2013"
16.SH NAME
17zfs\-module\-parameters \- ZFS module parameters
18.SH DESCRIPTION
19.sp
20.LP
21Description of the different parameters to the ZFS module.
22
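.sp
.LP
For example (a minimal sketch; the sysfs layout below is the usual Linux
location for these tunables and the modprobe file name is only illustrative),
a parameter can be read and, when it is run-time tunable, changed through
/sys/module/zfs/parameters, or set persistently for the next module load:
.sp
.nf
# read the current value of a tunable
cat /sys/module/zfs/parameters/zfs_txg_timeout

# change a run-time tunable immediately
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout

# make the setting persistent across module loads
echo "options zfs zfs_txg_timeout=10" >> /etc/modprobe.d/zfs.conf
.fi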
.SS "Module parameters"
.sp
.LP

27.sp
28.ne 2
29.na
30\fBignore_hole_birth\fR (int)
31.ad
32.RS 12n
33When set, the hole_birth optimization will not be used, and all holes will
34always be sent on zfs send. Useful if you suspect your datasets are affected
35by a bug in hole_birth.
36.sp
Use \fB1\fR for on (default) and \fB0\fR for off.
38.RE
39
40.sp
41.ne 2
42.na
43\fBl2arc_feed_again\fR (int)
44.ad
45.RS 12n
46Turbo L2ARC warm-up. When the L2ARC is cold the fill interval will be set as
47fast as possible.
48.sp
49Use \fB1\fR for yes (default) and \fB0\fR to disable.
50.RE
51
52.sp
53.ne 2
54.na
55\fBl2arc_feed_min_ms\fR (ulong)
56.ad
57.RS 12n
58Min feed interval in milliseconds. Requires \fBl2arc_feed_again=1\fR and only
59applicable in related situations.
60.sp
61Default value: \fB200\fR.
62.RE
63
64.sp
65.ne 2
66.na
67\fBl2arc_feed_secs\fR (ulong)
68.ad
69.RS 12n
70Seconds between L2ARC writing
71.sp
72Default value: \fB1\fR.
73.RE
74
75.sp
76.ne 2
77.na
78\fBl2arc_headroom\fR (ulong)
79.ad
80.RS 12n
81How far through the ARC lists to search for L2ARC cacheable content, expressed
82as a multiplier of \fBl2arc_write_max\fR
83.sp
84Default value: \fB2\fR.
85.RE
86
87.sp
88.ne 2
89.na
90\fBl2arc_headroom_boost\fR (ulong)
91.ad
92.RS 12n
93Scales \fBl2arc_headroom\fR by this percentage when L2ARC contents are being
94successfully compressed before writing. A value of 100 disables this feature.
95.sp
96Default value: \fB200\fR.
97.RE
98
99.sp
100.ne 2
101.na
102\fBl2arc_nocompress\fR (int)
103.ad
104.RS 12n
105Skip compressing L2ARC buffers
106.sp
107Use \fB1\fR for yes and \fB0\fR for no (default).
108.RE
109
110.sp
111.ne 2
112.na
113\fBl2arc_noprefetch\fR (int)
114.ad
115.RS 12n
116Do not write buffers to L2ARC if they were prefetched but not used by
117applications
118.sp
119Use \fB1\fR for yes (default) and \fB0\fR to disable.
120.RE
121
122.sp
123.ne 2
124.na
125\fBl2arc_norw\fR (int)
126.ad
127.RS 12n
128No reads during writes
129.sp
130Use \fB1\fR for yes and \fB0\fR for no (default).
131.RE
132
133.sp
134.ne 2
135.na
136\fBl2arc_write_boost\fR (ulong)
137.ad
138.RS 12n
Cold L2ARC devices will have \fBl2arc_write_max\fR increased by this amount
while they remain cold.
141.sp
142Default value: \fB8,388,608\fR.
143.RE
144
145.sp
146.ne 2
147.na
148\fBl2arc_write_max\fR (ulong)
149.ad
150.RS 12n
151Max write bytes per interval
152.sp
153Default value: \fB8,388,608\fR.
154.RE
155
156.sp
157.ne 2
158.na
159\fBmetaslab_aliquot\fR (ulong)
160.ad
161.RS 12n
162Metaslab granularity, in bytes. This is roughly similar to what would be
163referred to as the "stripe size" in traditional RAID arrays. In normal
164operation, ZFS will try to write this amount of data to a top-level vdev
165before moving on to the next one.
166.sp
167Default value: \fB524,288\fR.
168.RE
169
170.sp
171.ne 2
172.na
173\fBmetaslab_bias_enabled\fR (int)
174.ad
175.RS 12n
176Enable metaslab group biasing based on its vdev's over- or under-utilization
177relative to the pool.
178.sp
179Use \fB1\fR for yes (default) and \fB0\fR for no.
180.RE
181
182.sp
183.ne 2
184.na
185\fBzfs_metaslab_segment_weight_enabled\fR (int)
186.ad
187.RS 12n
188Enable/disable segment-based metaslab selection.
189.sp
190Use \fB1\fR for yes (default) and \fB0\fR for no.
191.RE
192
193.sp
194.ne 2
195.na
196\fBzfs_metaslab_switch_threshold\fR (int)
197.ad
198.RS 12n
199When using segment-based metaslab selection, continue allocating
from the active metaslab until \fBzfs_metaslab_switch_threshold\fR
201worth of buckets have been exhausted.
202.sp
203Default value: \fB2\fR.
204.RE
205
206.sp
207.ne 2
208.na
\fBmetaslab_debug_load\fR (int)
210.ad
211.RS 12n
212Load all metaslabs during pool import.
213.sp
214Use \fB1\fR for yes and \fB0\fR for no (default).
215.RE
216
217.sp
218.ne 2
219.na
220\fBmetaslab_debug_unload\fR (int)
221.ad
222.RS 12n
223Prevent metaslabs from being unloaded.
224.sp
225Use \fB1\fR for yes and \fB0\fR for no (default).
226.RE
227
228.sp
229.ne 2
230.na
231\fBmetaslab_fragmentation_factor_enabled\fR (int)
232.ad
233.RS 12n
234Enable use of the fragmentation metric in computing metaslab weights.
235.sp
236Use \fB1\fR for yes (default) and \fB0\fR for no.
237.RE
238
239.sp
240.ne 2
241.na
242\fBmetaslabs_per_vdev\fR (int)
243.ad
244.RS 12n
245When a vdev is added, it will be divided into approximately (but no more than) this number of metaslabs.
246.sp
247Default value: \fB200\fR.
248.RE
249
250.sp
251.ne 2
252.na
253\fBmetaslab_preload_enabled\fR (int)
254.ad
255.RS 12n
256Enable metaslab group preloading.
257.sp
258Use \fB1\fR for yes (default) and \fB0\fR for no.
259.RE
260
261.sp
262.ne 2
263.na
264\fBmetaslab_lba_weighting_enabled\fR (int)
265.ad
266.RS 12n
267Give more weight to metaslabs with lower LBAs, assuming they have
268greater bandwidth as is typically the case on a modern constant
269angular velocity disk drive.
270.sp
271Use \fB1\fR for yes (default) and \fB0\fR for no.
272.RE
273
274.sp
275.ne 2
276.na
277\fBspa_config_path\fR (charp)
278.ad
279.RS 12n
280SPA config file
281.sp
282Default value: \fB/etc/zfs/zpool.cache\fR.
283.RE
284
285.sp
286.ne 2
287.na
288\fBspa_asize_inflation\fR (int)
289.ad
290.RS 12n
291Multiplication factor used to estimate actual disk consumption from the
292size of data being written. The default value is a worst case estimate,
293but lower values may be valid for a given pool depending on its
294configuration. Pool administrators who understand the factors involved
295may wish to specify a more realistic inflation factor, particularly if
296they operate close to quota or capacity limits.
297.sp
Default value: \fB24\fR.
299.RE
300
301.sp
302.ne 2
303.na
304\fBspa_load_verify_data\fR (int)
305.ad
306.RS 12n
307Whether to traverse data blocks during an "extreme rewind" (\fB-X\fR)
308import. Use 0 to disable and 1 to enable.
309
310An extreme rewind import normally performs a full traversal of all
311blocks in the pool for verification. If this parameter is set to 0,
312the traversal skips non-metadata blocks. It can be toggled once the
313import has started to stop or start the traversal of non-metadata blocks.
314.sp
Default value: \fB1\fR.
316.RE
317
318.sp
319.ne 2
320.na
321\fBspa_load_verify_metadata\fR (int)
322.ad
323.RS 12n
324Whether to traverse blocks during an "extreme rewind" (\fB-X\fR)
325pool import. Use 0 to disable and 1 to enable.
326
327An extreme rewind import normally performs a full traversal of all
blocks in the pool for verification. If this parameter is set to 0,
329the traversal is not performed. It can be toggled once the import has
330started to stop or start the traversal.
331.sp
Default value: \fB1\fR.
333.RE
334
335.sp
336.ne 2
337.na
338\fBspa_load_verify_maxinflight\fR (int)
339.ad
340.RS 12n
341Maximum concurrent I/Os during the traversal performed during an "extreme
342rewind" (\fB-X\fR) pool import.
343.sp
Default value: \fB10000\fR.
345.RE
346
347.sp
348.ne 2
349.na
350\fBspa_slop_shift\fR (int)
351.ad
352.RS 12n
353Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space
354in the pool to be consumed. This ensures that we don't run the pool
355completely out of space, due to unaccounted changes (e.g. to the MOS).
356It also limits the worst-case time to allocate space. If we have
357less than this amount of free space, most ZPL operations (e.g. write,
358create) will return ENOSPC.
359.sp
Default value: \fB5\fR.
361.RE
362
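.sp
.LP
As a worked example (numbers follow directly from the formula above): with the
default \fBspa_slop_shift\fR of 5 the reserved slop space is 1/(2^5) = 1/32,
or about 3.2% of the pool; raising the shift to 6 halves the reservation to
1/64, about 1.6%.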
363.sp
364.ne 2
365.na
366\fBzfetch_array_rd_sz\fR (ulong)
367.ad
368.RS 12n
If prefetching is enabled, disable prefetching for reads larger than this size.
370.sp
371Default value: \fB1,048,576\fR.
372.RE
373
374.sp
375.ne 2
376.na
\fBzfetch_max_distance\fR (uint)
378.ad
379.RS 12n
Max bytes to prefetch per stream (default 8MB).
.sp
Default value: \fB8,388,608\fR.
383.RE
384
385.sp
386.ne 2
387.na
388\fBzfetch_max_streams\fR (uint)
389.ad
390.RS 12n
Max number of streams per zfetch (prefetch streams per file).
392.sp
393Default value: \fB8\fR.
394.RE
395
396.sp
397.ne 2
398.na
399\fBzfetch_min_sec_reap\fR (uint)
400.ad
401.RS 12n
Min time before an active prefetch stream can be reclaimed
403.sp
404Default value: \fB2\fR.
405.RE
406
407.sp
408.ne 2
409.na
410\fBzfs_arc_dnode_limit\fR (ulong)
411.ad
412.RS 12n
When the number of bytes consumed by dnodes in the ARC exceeds this number of
bytes, try to unpin some of it in response to demand for non-metadata. This
value acts as a floor to the amount of dnode metadata, and defaults to 0, which
indicates that a percentage based on \fBzfs_arc_dnode_limit_percent\fR of the
ARC meta buffers may be used for dnodes.
418
419See also \fBzfs_arc_meta_prune\fR which serves a similar purpose but is used
420when the amount of metadata in the ARC exceeds \fBzfs_arc_meta_limit\fR rather
421than in response to overall demand for non-metadata.
422
423.sp
424Default value: \fB0\fR.
425.RE
426
427.sp
428.ne 2
429.na
430\fBzfs_arc_dnode_limit_percent\fR (ulong)
431.ad
432.RS 12n
433Percentage that can be consumed by dnodes of ARC meta buffers.
434.sp
435See also \fBzfs_arc_dnode_limit\fR which serves a similar purpose but has a
436higher priority if set to nonzero value.
437.sp
438Default value: \fB10\fR.
439.RE
440
441.sp
442.ne 2
443.na
444\fBzfs_arc_dnode_reduce_percent\fR (ulong)
445.ad
446.RS 12n
447Percentage of ARC dnodes to try to scan in response to demand for non-metadata
when the number of bytes consumed by dnodes exceeds \fBzfs_arc_dnode_limit\fR.
449
450.sp
451Default value: \fB10% of the number of dnodes in the ARC\fR.
452.RE
453
454.sp
455.ne 2
456.na
457\fBzfs_arc_average_blocksize\fR (int)
458.ad
459.RS 12n
460The ARC's buffer hash table is sized based on the assumption of an average
461block size of \fBzfs_arc_average_blocksize\fR (default 8K). This works out
462to roughly 1MB of hash table per 1GB of physical memory with 8-byte pointers.
463For configurations with a known larger average block size this value can be
464increased to reduce the memory footprint.
465
466.sp
467Default value: \fB8192\fR.
468.RE
469
470.sp
471.ne 2
472.na
473\fBzfs_arc_evict_batch_limit\fR (int)
474.ad
475.RS 12n
Number of ARC headers to evict per sub-list before proceeding to another sub-list.
477This batch-style operation prevents entire sub-lists from being evicted at once
478but comes at a cost of additional unlocking and locking.
479.sp
480Default value: \fB10\fR.
481.RE
482
483.sp
484.ne 2
485.na
486\fBzfs_arc_grow_retry\fR (int)
487.ad
488.RS 12n
489After a memory pressure event the ARC will wait this many seconds before trying
490to resume growth
491.sp
492Default value: \fB5\fR.
493.RE
494
495.sp
496.ne 2
497.na
\fBzfs_arc_lotsfree_percent\fR (int)
499.ad
500.RS 12n
501Throttle I/O when free system memory drops below this percentage of total
502system memory. Setting this value to 0 will disable the throttle.
.sp
Default value: \fB10\fR.
505.RE
506
507.sp
508.ne 2
509.na
\fBzfs_arc_max\fR (ulong)
511.ad
512.RS 12n
Max size of the ARC in bytes. If set to 0 then it will consume 1/2 of system
RAM. This value must be at least 67108864 (64 megabytes).
515.sp
516This value can be changed dynamically with some caveats. It cannot be set back
517to 0 while running and reducing it below the current ARC size will not cause
518the ARC to shrink without memory pressure to induce shrinking.
.sp
Default value: \fB0\fR.
521.RE
522
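.sp
.LP
For example (an illustrative sketch only; the 4 GiB figure is arbitrary), the
limit can be changed at runtime through the module parameter, keeping in mind
the caveats above about shrinking a populated ARC:
.sp
.nf
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max
.fi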
523.sp
524.ne 2
525.na
526\fBzfs_arc_meta_limit\fR (ulong)
527.ad
528.RS 12n
The maximum allowed size in bytes that meta data buffers may consume in the
ARC. When this limit is reached meta data buffers will be reclaimed even if
the overall arc_c_max has not been reached. This value defaults to 0, which
indicates that a percentage based on \fBzfs_arc_meta_limit_percent\fR of the
ARC may be used for meta data.
.sp
This value may be changed dynamically except that it cannot be set back to 0
for a specific percent of the ARC; it must be set to an explicit value.
.sp
538Default value: \fB0\fR.
539.RE
540
541.sp
542.ne 2
543.na
544\fBzfs_arc_meta_limit_percent\fR (ulong)
545.ad
546.RS 12n
547Percentage of ARC buffers that can be used for meta data.
548
549See also \fBzfs_arc_meta_limit\fR which serves a similar purpose but has a
550higher priority if set to nonzero value.
551
552.sp
553Default value: \fB75\fR.
554.RE
555
556.sp
557.ne 2
558.na
559\fBzfs_arc_meta_min\fR (ulong)
560.ad
561.RS 12n
562The minimum allowed size in bytes that meta data buffers may consume in
the ARC. This value defaults to 0 which disables a floor on the amount
of the ARC devoted to meta data.
565.sp
566Default value: \fB0\fR.
567.RE
568
569.sp
570.ne 2
571.na
572\fBzfs_arc_meta_prune\fR (int)
573.ad
574.RS 12n
575The number of dentries and inodes to be scanned looking for entries
576which can be dropped. This may be required when the ARC reaches the
577\fBzfs_arc_meta_limit\fR because dentries and inodes can pin buffers
in the ARC. Increasing this value will cause the dentry and inode caches
579to be pruned more aggressively. Setting this value to 0 will disable
580pruning the inode and dentry caches.
.sp
Default value: \fB10,000\fR.
583.RE
584
585.sp
586.ne 2
587.na
588\fBzfs_arc_meta_adjust_restarts\fR (ulong)
589.ad
590.RS 12n
The number of restart passes to make while scanning the ARC, attempting
to free buffers in order to stay below the \fBzfs_arc_meta_limit\fR.
593This value should not need to be tuned but is available to facilitate
594performance analysis.
595.sp
596Default value: \fB4096\fR.
597.RE
598
599.sp
600.ne 2
601.na
602\fBzfs_arc_min\fR (ulong)
603.ad
604.RS 12n
605Min arc size
606.sp
607Default value: \fB100\fR.
608.RE
609
610.sp
611.ne 2
612.na
613\fBzfs_arc_min_prefetch_lifespan\fR (int)
614.ad
615.RS 12n
616Minimum time prefetched blocks are locked in the ARC, specified in jiffies.
617A value of 0 will default to 1 second.
.sp
Default value: \fB0\fR.
620.RE
621
622.sp
623.ne 2
624.na
\fBzfs_multilist_num_sublists\fR (int)
626.ad
627.RS 12n
628To allow more fine-grained locking, each ARC state contains a series
629of lists for both data and meta data objects. Locking is performed at
the level of these "sub-lists". This parameter controls the number of
631sub-lists per ARC state, and also applies to other uses of the
632multilist data structure.
.sp
Default value: \fB4\fR or the number of online CPUs, whichever is greater
635.RE
636
637.sp
638.ne 2
639.na
640\fBzfs_arc_overflow_shift\fR (int)
641.ad
642.RS 12n
643The ARC size is considered to be overflowing if it exceeds the current
644ARC target size (arc_c) by a threshold determined by this parameter.
645The threshold is calculated as a fraction of arc_c using the formula
646"arc_c >> \fBzfs_arc_overflow_shift\fR".
647
648The default value of 8 causes the ARC to be considered to be overflowing
if it exceeds the target size by 1/256th (about 0.4%) of the target size.
650
651When the ARC is overflowing, new buffer allocations are stalled until
652the reclaim thread catches up and the overflow condition no longer exists.
653.sp
654Default value: \fB8\fR.
655.RE
656
657.sp
658.ne 2
659.na
660
661\fBzfs_arc_p_min_shift\fR (int)
662.ad
663.RS 12n
664arc_c shift to calc min/max arc_p
665.sp
666Default value: \fB4\fR.
667.RE
668
669.sp
670.ne 2
671.na
672\fBzfs_arc_p_aggressive_disable\fR (int)
673.ad
674.RS 12n
675Disable aggressive arc_p growth
676.sp
677Use \fB1\fR for yes (default) and \fB0\fR to disable.
678.RE
679
680.sp
681.ne 2
682.na
683\fBzfs_arc_p_dampener_disable\fR (int)
684.ad
685.RS 12n
686Disable arc_p adapt dampener
687.sp
688Use \fB1\fR for yes (default) and \fB0\fR to disable.
689.RE
690
691.sp
692.ne 2
693.na
694\fBzfs_arc_shrink_shift\fR (int)
695.ad
696.RS 12n
697log2(fraction of arc to reclaim)
698.sp
699Default value: \fB5\fR.
700.RE
701
702.sp
703.ne 2
704.na
705\fBzfs_arc_pc_percent\fR (uint)
706.ad
707.RS 12n
708Percent of pagecache to reclaim arc to
709
710This tunable allows ZFS arc to play more nicely with the kernel's LRU
711pagecache. It can guarantee that the arc size won't collapse under scanning
712pressure on the pagecache, yet still allows arc to be reclaimed down to
713zfs_arc_min if necessary. This value is specified as percent of pagecache
714size (as measured by NR_FILE_PAGES) where that percent may exceed 100. This
715only operates during memory pressure/reclaim.
716.sp
717Default value: \fB0\fR (disabled).
718.RE
719
720.sp
721.ne 2
722.na
723\fBzfs_arc_sys_free\fR (ulong)
724.ad
725.RS 12n
726The target number of bytes the ARC should leave as free memory on the system.
727Defaults to the larger of 1/64 of physical memory or 512K. Setting this
728option to a non-zero value will override the default.
729.sp
730Default value: \fB0\fR.
731.RE
732
733.sp
734.ne 2
735.na
736\fBzfs_autoimport_disable\fR (int)
737.ad
738.RS 12n
Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR).
.sp
Use \fB1\fR for yes (default) and \fB0\fR for no.
742.RE
743
744.sp
745.ne 2
746.na
747\fBzfs_dbgmsg_enable\fR (int)
748.ad
749.RS 12n
750Internally ZFS keeps a small log to facilitate debugging. By default the log
751is disabled, to enable it set this option to 1. The contents of the log can
752be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file. Writing 0 to
753this proc file clears the log.
754.sp
755Default value: \fB0\fR.
756.RE
757
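.sp
.LP
For example (a minimal sketch using the proc file named above):
.sp
.nf
# enable the internal debug log, read it, then clear it
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable
cat /proc/spl/kstat/zfs/dbgmsg
echo 0 > /proc/spl/kstat/zfs/dbgmsg
.fi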
758.sp
759.ne 2
760.na
761\fBzfs_dbgmsg_maxsize\fR (int)
762.ad
763.RS 12n
764The maximum size in bytes of the internal ZFS debug log.
765.sp
766Default value: \fB4M\fR.
767.RE
768
769.sp
770.ne 2
771.na
772\fBzfs_dbuf_state_index\fR (int)
773.ad
774.RS 12n
775This feature is currently unused. It is normally used for controlling what
776reporting is available under /proc/spl/kstat/zfs.
777.sp
778Default value: \fB0\fR.
779.RE
780
781.sp
782.ne 2
783.na
784\fBzfs_deadman_enabled\fR (int)
785.ad
786.RS 12n
787When a pool sync operation takes longer than \fBzfs_deadman_synctime_ms\fR
788milliseconds, a "slow spa_sync" message is logged to the debug log
789(see \fBzfs_dbgmsg_enable\fR). If \fBzfs_deadman_enabled\fR is set,
790all pending IO operations are also checked and if any haven't completed
791within \fBzfs_deadman_synctime_ms\fR milliseconds, a "SLOW IO" message
792is logged to the debug log and a "delay" system event with the details of
793the hung IO is posted.
.sp
795Use \fB1\fR (default) to enable the slow IO check and \fB0\fR to disable.
796.RE
797
798.sp
799.ne 2
800.na
801\fBzfs_deadman_checktime_ms\fR (int)
802.ad
803.RS 12n
804Once a pool sync operation has taken longer than
805\fBzfs_deadman_synctime_ms\fR milliseconds, continue to check for slow
806operations every \fBzfs_deadman_checktime_ms\fR milliseconds.
807.sp
808Default value: \fB5,000\fR.
809.RE
810
811.sp
812.ne 2
813.na
\fBzfs_deadman_synctime_ms\fR (ulong)
815.ad
816.RS 12n
817Interval in milliseconds after which the deadman is triggered and also
818the interval after which an IO operation is considered to be "hung"
819if \fBzfs_deadman_enabled\fR is set.
820
821See \fBzfs_deadman_enabled\fR.
.sp
Default value: \fB1,000,000\fR.
824.RE
825
826.sp
827.ne 2
828.na
829\fBzfs_dedup_prefetch\fR (int)
830.ad
831.RS 12n
832Enable prefetching dedup-ed blks
833.sp
Use \fB1\fR for yes and \fB0\fR to disable (default).
835.RE
836
837.sp
838.ne 2
839.na
840\fBzfs_delay_min_dirty_percent\fR (int)
841.ad
842.RS 12n
843Start to delay each transaction once there is this amount of dirty data,
844expressed as a percentage of \fBzfs_dirty_data_max\fR.
845This value should be >= zfs_vdev_async_write_active_max_dirty_percent.
846See the section "ZFS TRANSACTION DELAY".
847.sp
848Default value: \fB60\fR.
849.RE
850
851.sp
852.ne 2
853.na
854\fBzfs_delay_scale\fR (int)
855.ad
856.RS 12n
857This controls how quickly the transaction delay approaches infinity.
858Larger values cause longer delays for a given amount of dirty data.
859.sp
860For the smoothest delay, this value should be about 1 billion divided
861by the maximum number of operations per second. This will smoothly
862handle between 10x and 1/10th this number.
863.sp
864See the section "ZFS TRANSACTION DELAY".
865.sp
866Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64.
867.sp
868Default value: \fB500,000\fR.
869.RE
870
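.sp
.LP
As a worked example (the operations-per-second figure is illustrative, not a
recommendation): for a pool that can sustain roughly 20,000 write operations
per second, the guideline above gives \fBzfs_delay_scale\fR = 1,000,000,000 /
20,000 = 50,000, which then handles workloads between roughly 2,000 and
200,000 operations per second smoothly.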
871.sp
872.ne 2
873.na
874\fBzfs_delete_blocks\fR (ulong)
875.ad
876.RS 12n
This is used to define a large file for the purposes of delete. Files
containing more than \fBzfs_delete_blocks\fR blocks will be deleted asynchronously
879while smaller files are deleted synchronously. Decreasing this value will
880reduce the time spent in an unlink(2) system call at the expense of a longer
881delay before the freed space is available.
882.sp
883Default value: \fB20,480\fR.
884.RE
885
886.sp
887.ne 2
888.na
889\fBzfs_dirty_data_max\fR (int)
890.ad
891.RS 12n
892Determines the dirty space limit in bytes. Once this limit is exceeded, new
893writes are halted until space frees up. This parameter takes precedence
894over \fBzfs_dirty_data_max_percent\fR.
895See the section "ZFS TRANSACTION DELAY".
896.sp
897Default value: 10 percent of all memory, capped at \fBzfs_dirty_data_max_max\fR.
898.RE
899
900.sp
901.ne 2
902.na
903\fBzfs_dirty_data_max_max\fR (int)
904.ad
905.RS 12n
906Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes.
907This limit is only enforced at module load time, and will be ignored if
908\fBzfs_dirty_data_max\fR is later changed. This parameter takes
909precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section
910"ZFS TRANSACTION DELAY".
911.sp
912Default value: 25% of physical RAM.
913.RE
914
915.sp
916.ne 2
917.na
918\fBzfs_dirty_data_max_max_percent\fR (int)
919.ad
920.RS 12n
921Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a
922percentage of physical RAM. This limit is only enforced at module load
923time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed.
924The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this
925one. See the section "ZFS TRANSACTION DELAY".
926.sp
Default value: \fB25\fR.
928.RE
929
930.sp
931.ne 2
932.na
933\fBzfs_dirty_data_max_percent\fR (int)
934.ad
935.RS 12n
936Determines the dirty space limit, expressed as a percentage of all
937memory. Once this limit is exceeded, new writes are halted until space frees
938up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this
939one. See the section "ZFS TRANSACTION DELAY".
940.sp
941Default value: 10%, subject to \fBzfs_dirty_data_max_max\fR.
942.RE
943
944.sp
945.ne 2
946.na
947\fBzfs_dirty_data_sync\fR (int)
948.ad
949.RS 12n
950Start syncing out a transaction group if there is at least this much dirty data.
951.sp
952Default value: \fB67,108,864\fR.
953.RE
954
955.sp
956.ne 2
957.na
958\fBzfs_fletcher_4_impl\fR (string)
959.ad
960.RS 12n
961Select a fletcher 4 implementation.
962.sp
Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR,
\fBavx2\fR, \fBavx512f\fR, and \fBaarch64_neon\fR.
965All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction
966set extensions to be available and will only appear if ZFS detects that they are
967present at runtime. If multiple implementations of fletcher 4 are available,
968the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR
969results in the original, CPU based calculation, being used. Selecting any option
970other than \fBfastest\fR and \fBscalar\fR results in vector instructions from
971the respective CPU instruction set being used.
972.sp
973Default value: \fBfastest\fR.
974.RE
975
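.sp
.LP
For example (a minimal sketch; the parameter lives in the usual
/sys/module/zfs/parameters location):
.sp
.nf
# show the available fletcher 4 implementations
cat /sys/module/zfs/parameters/zfs_fletcher_4_impl

# force the scalar implementation
echo scalar > /sys/module/zfs/parameters/zfs_fletcher_4_impl
.fi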
976.sp
977.ne 2
978.na
979\fBzfs_free_bpobj_enabled\fR (int)
980.ad
981.RS 12n
982Enable/disable the processing of the free_bpobj object.
983.sp
984Default value: \fB1\fR.
985.RE
986
987.sp
988.ne 2
989.na
990\fBzfs_free_max_blocks\fR (ulong)
991.ad
992.RS 12n
993Maximum number of blocks freed in a single txg.
994.sp
995Default value: \fB100,000\fR.
996.RE
997
998.sp
999.ne 2
1000.na
1001\fBzfs_vdev_async_read_max_active\fR (int)
1002.ad
1003.RS 12n
Maximum asynchronous read I/Os active to each device.
1005See the section "ZFS I/O SCHEDULER".
1006.sp
1007Default value: \fB3\fR.
1008.RE
1009
1010.sp
1011.ne 2
1012.na
1013\fBzfs_vdev_async_read_min_active\fR (int)
1014.ad
1015.RS 12n
1016Minimum asynchronous read I/Os active to each device.
1017See the section "ZFS I/O SCHEDULER".
1018.sp
1019Default value: \fB1\fR.
1020.RE
1021
1022.sp
1023.ne 2
1024.na
1025\fBzfs_vdev_async_write_active_max_dirty_percent\fR (int)
1026.ad
1027.RS 12n
1028When the pool has more than
1029\fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use
1030\fBzfs_vdev_async_write_max_active\fR to limit active async writes. If
1031the dirty data is between min and max, the active I/O limit is linearly
1032interpolated. See the section "ZFS I/O SCHEDULER".
1033.sp
1034Default value: \fB60\fR.
1035.RE
1036
1037.sp
1038.ne 2
1039.na
1040\fBzfs_vdev_async_write_active_min_dirty_percent\fR (int)
1041.ad
1042.RS 12n
1043When the pool has less than
1044\fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use
1045\fBzfs_vdev_async_write_min_active\fR to limit active async writes. If
1046the dirty data is between min and max, the active I/O limit is linearly
1047interpolated. See the section "ZFS I/O SCHEDULER".
1048.sp
1049Default value: \fB30\fR.
1050.RE
1051
1052.sp
1053.ne 2
1054.na
1055\fBzfs_vdev_async_write_max_active\fR (int)
1056.ad
1057.RS 12n
Maximum asynchronous write I/Os active to each device.
1059See the section "ZFS I/O SCHEDULER".
1060.sp
1061Default value: \fB10\fR.
1062.RE
1063
1064.sp
1065.ne 2
1066.na
1067\fBzfs_vdev_async_write_min_active\fR (int)
1068.ad
1069.RS 12n
1070Minimum asynchronous write I/Os active to each device.
1071See the section "ZFS I/O SCHEDULER".
1072.sp
1073Lower values are associated with better latency on rotational media but poorer
1074resilver performance. The default value of 2 was chosen as a compromise. A
1075value of 3 has been shown to improve resilver performance further at a cost of
1076further increasing latency.
1077.sp
1078Default value: \fB2\fR.
1079.RE
1080
1081.sp
1082.ne 2
1083.na
1084\fBzfs_vdev_max_active\fR (int)
1085.ad
1086.RS 12n
1087The maximum number of I/Os active to each device. Ideally, this will be >=
1088the sum of each queue's max_active. It must be at least the sum of each
1089queue's min_active. See the section "ZFS I/O SCHEDULER".
1090.sp
1091Default value: \fB1,000\fR.
1092.RE
1093
1094.sp
1095.ne 2
1096.na
1097\fBzfs_vdev_scrub_max_active\fR (int)
1098.ad
1099.RS 12n
Maximum scrub I/Os active to each device.
1101See the section "ZFS I/O SCHEDULER".
1102.sp
1103Default value: \fB2\fR.
1104.RE
1105
1106.sp
1107.ne 2
1108.na
1109\fBzfs_vdev_scrub_min_active\fR (int)
1110.ad
1111.RS 12n
1112Minimum scrub I/Os active to each device.
1113See the section "ZFS I/O SCHEDULER".
1114.sp
1115Default value: \fB1\fR.
1116.RE
1117
1118.sp
1119.ne 2
1120.na
1121\fBzfs_vdev_sync_read_max_active\fR (int)
1122.ad
1123.RS 12n
Maximum synchronous read I/Os active to each device.
1125See the section "ZFS I/O SCHEDULER".
1126.sp
1127Default value: \fB10\fR.
1128.RE
1129
1130.sp
1131.ne 2
1132.na
1133\fBzfs_vdev_sync_read_min_active\fR (int)
1134.ad
1135.RS 12n
1136Minimum synchronous read I/Os active to each device.
1137See the section "ZFS I/O SCHEDULER".
1138.sp
1139Default value: \fB10\fR.
1140.RE
1141
1142.sp
1143.ne 2
1144.na
1145\fBzfs_vdev_sync_write_max_active\fR (int)
1146.ad
1147.RS 12n
Maximum synchronous write I/Os active to each device.
1149See the section "ZFS I/O SCHEDULER".
1150.sp
1151Default value: \fB10\fR.
1152.RE
1153
1154.sp
1155.ne 2
1156.na
1157\fBzfs_vdev_sync_write_min_active\fR (int)
1158.ad
1159.RS 12n
1160Minimum synchronous write I/Os active to each device.
1161See the section "ZFS I/O SCHEDULER".
1162.sp
1163Default value: \fB10\fR.
1164.RE
1165
1166.sp
1167.ne 2
1168.na
1169\fBzfs_vdev_queue_depth_pct\fR (int)
1170.ad
1171.RS 12n
1172Maximum number of queued allocations per top-level vdev expressed as
1173a percentage of \fBzfs_vdev_async_write_max_active\fR which allows the
1174system to detect devices that are more capable of handling allocations
1175and to allocate more blocks to those devices. It allows for dynamic
1176allocation distribution when devices are imbalanced as fuller devices
1177will tend to be slower than empty devices.
1178
1179See also \fBzio_dva_throttle_enabled\fR.
1180.sp
1181Default value: \fB1000\fR.
1182.RE
1183
1184.sp
1185.ne 2
1186.na
1187\fBzfs_disable_dup_eviction\fR (int)
1188.ad
1189.RS 12n
1190Disable duplicate buffer eviction
1191.sp
1192Use \fB1\fR for yes and \fB0\fR for no (default).
1193.RE
1194
1195.sp
1196.ne 2
1197.na
1198\fBzfs_expire_snapshot\fR (int)
1199.ad
1200.RS 12n
1201Seconds to expire .zfs/snapshot
1202.sp
1203Default value: \fB300\fR.
1204.RE
1205
1206.sp
1207.ne 2
1208.na
1209\fBzfs_admin_snapshot\fR (int)
1210.ad
1211.RS 12n
1212Allow the creation, removal, or renaming of entries in the .zfs/snapshot
1213directory to cause the creation, destruction, or renaming of snapshots.
1214When enabled this functionality works both locally and over NFS exports
1215which have the 'no_root_squash' option set. This functionality is disabled
1216by default.
1217.sp
1218Use \fB1\fR for yes and \fB0\fR for no (default).
1219.RE
1220
1221.sp
1222.ne 2
1223.na
1224\fBzfs_flags\fR (int)
1225.ad
1226.RS 12n
1227Set additional debugging flags. The following flags may be bitwise-or'd
1228together.
1229.sp
1230.TS
1231box;
1232rB lB
1233lB lB
1234r l.
1235Value Symbolic Name
1236 Description
1237_
12381 ZFS_DEBUG_DPRINTF
1239 Enable dprintf entries in the debug log.
1240_
12412 ZFS_DEBUG_DBUF_VERIFY *
1242 Enable extra dbuf verifications.
1243_
12444 ZFS_DEBUG_DNODE_VERIFY *
1245 Enable extra dnode verifications.
1246_
12478 ZFS_DEBUG_SNAPNAMES
1248 Enable snapshot name verification.
1249_
125016 ZFS_DEBUG_MODIFY
1251 Check for illegally modified ARC buffers.
1252_
125332 ZFS_DEBUG_SPA
1254 Enable spa_dbgmsg entries in the debug log.
1255_
125664 ZFS_DEBUG_ZIO_FREE
1257 Enable verification of block frees.
1258_
1259128 ZFS_DEBUG_HISTOGRAM_VERIFY
1260 Enable extra spacemap histogram verifications.
1261.TE
1262.sp
1263* Requires debug build.
.sp
Default value: \fB0\fR.
1266.RE
1267
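.sp
.LP
For example (a minimal sketch of combining the bit values from the table
above): enabling snapshot name verification (8) together with checks for
illegally modified ARC buffers (16) means setting \fBzfs_flags\fR to 8 + 16 = 24:
.sp
.nf
echo 24 > /sys/module/zfs/parameters/zfs_flags
.fi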
1268.sp
1269.ne 2
1270.na
1271\fBzfs_free_leak_on_eio\fR (int)
1272.ad
1273.RS 12n
1274If destroy encounters an EIO while reading metadata (e.g. indirect
1275blocks), space referenced by the missing metadata can not be freed.
1276Normally this causes the background destroy to become "stalled", as
1277it is unable to make forward progress. While in this stalled state,
1278all remaining space to free from the error-encountering filesystem is
1279"temporarily leaked". Set this flag to cause it to ignore the EIO,
1280permanently leak the space from indirect blocks that can not be read,
1281and continue to free everything else that it can.
1282
1283The default, "stalling" behavior is useful if the storage partially
1284fails (i.e. some but not all i/os fail), and then later recovers. In
1285this case, we will be able to continue pool operations while it is
1286partially failed, and when it recovers, we can continue to free the
1287space, with no leaks. However, note that this case is actually
1288fairly rare.
1289
1290Typically pools either (a) fail completely (but perhaps temporarily,
1291e.g. a top-level vdev going offline), or (b) have localized,
1292permanent errors (e.g. disk returns the wrong data due to bit flip or
1293firmware bug). In case (a), this setting does not matter because the
1294pool will be suspended and the sync thread will not be able to make
1295forward progress regardless. In case (b), because the error is
1296permanent, the best we can do is leak the minimum amount of space,
1297which is what setting this flag will do. Therefore, it is reasonable
1298for this flag to normally be set, but we chose the more conservative
1299approach of not setting it, so that there is no possibility of
1300leaking space in the "partial temporary" failure case.
1301.sp
1302Default value: \fB0\fR.
1303.RE
1304
1305.sp
1306.ne 2
1307.na
1308\fBzfs_free_min_time_ms\fR (int)
1309.ad
1310.RS 12n
During a \fBzfs destroy\fR operation using \fBfeature@async_destroy\fR a minimum
of this much time will be spent working on freeing blocks per txg.
1313.sp
1314Default value: \fB1,000\fR.
1315.RE
1316
1317.sp
1318.ne 2
1319.na
1320\fBzfs_immediate_write_sz\fR (long)
1321.ad
1322.RS 12n
Largest data block to write to zil. Larger blocks will be treated as if the
dataset being written to had the property setting \fBlogbias=throughput\fR.
1325.sp
1326Default value: \fB32,768\fR.
1327.RE
1328
1329.sp
1330.ne 2
1331.na
1332\fBzfs_max_recordsize\fR (int)
1333.ad
1334.RS 12n
1335We currently support block sizes from 512 bytes to 16MB. The benefits of
1336larger blocks, and thus larger IO, need to be weighed against the cost of
1337COWing a giant block to modify one byte. Additionally, very large blocks
1338can have an impact on i/o latency, and also potentially on the memory
1339allocator. Therefore, we do not allow the recordsize to be set larger than
1340zfs_max_recordsize (default 1MB). Larger blocks can be created by changing
1341this tunable, and pools with larger blocks can always be imported and used,
1342regardless of this setting.
1343.sp
1344Default value: \fB1,048,576\fR.
1345.RE
1346
1347.sp
1348.ne 2
1349.na
1350\fBzfs_mdcomp_disable\fR (int)
1351.ad
1352.RS 12n
1353Disable meta data compression
1354.sp
1355Use \fB1\fR for yes and \fB0\fR for no (default).
1356.RE
1357
1358.sp
1359.ne 2
1360.na
1361\fBzfs_metaslab_fragmentation_threshold\fR (int)
1362.ad
1363.RS 12n
1364Allow metaslabs to keep their active state as long as their fragmentation
1365percentage is less than or equal to this value. An active metaslab that
1366exceeds this threshold will no longer keep its active status allowing
1367better metaslabs to be selected.
1368.sp
1369Default value: \fB70\fR.
1370.RE
1371
1372.sp
1373.ne 2
1374.na
1375\fBzfs_mg_fragmentation_threshold\fR (int)
1376.ad
1377.RS 12n
1378Metaslab groups are considered eligible for allocations if their
fragmentation metric (measured as a percentage) is less than or equal to
1380this value. If a metaslab group exceeds this threshold then it will be
1381skipped unless all metaslab groups within the metaslab class have also
1382crossed this threshold.
1383.sp
1384Default value: \fB85\fR.
1385.RE
1386
1387.sp
1388.ne 2
1389.na
1390\fBzfs_mg_noalloc_threshold\fR (int)
1391.ad
1392.RS 12n
1393Defines a threshold at which metaslab groups should be eligible for
1394allocations. The value is expressed as a percentage of free space
1395beyond which a metaslab group is always eligible for allocations.
1396If a metaslab group's free space is less than or equal to the
threshold, the allocator will avoid allocating to that group
1398unless all groups in the pool have reached the threshold. Once all
1399groups have reached the threshold, all groups are allowed to accept
1400allocations. The default value of 0 disables the feature and causes
1401all metaslab groups to be eligible for allocations.
1402
This parameter allows one to deal with pools having heavily imbalanced
1404vdevs such as would be the case when a new vdev has been added.
1405Setting the threshold to a non-zero percentage will stop allocations
1406from being made to vdevs that aren't filled to the specified percentage
1407and allow lesser filled vdevs to acquire more allocations than they
1408otherwise would under the old \fBzfs_mg_alloc_failures\fR facility.
1409.sp
1410Default value: \fB0\fR.
1411.RE
1412
1413.sp
1414.ne 2
1415.na
1416\fBzfs_no_scrub_io\fR (int)
1417.ad
1418.RS 12n
1419Set for no scrub I/O. This results in scrubs not actually scrubbing data and
1420simply doing a metadata crawl of the pool instead.
1421.sp
1422Use \fB1\fR for yes and \fB0\fR for no (default).
1423.RE
1424
1425.sp
1426.ne 2
1427.na
1428\fBzfs_no_scrub_prefetch\fR (int)
1429.ad
1430.RS 12n
Set to disable block prefetching for scrubs.
1432.sp
1433Use \fB1\fR for yes and \fB0\fR for no (default).
1434.RE
1435
1436.sp
1437.ne 2
1438.na
1439\fBzfs_nocacheflush\fR (int)
1440.ad
1441.RS 12n
1442Disable cache flush operations on disks when writing. Beware, this may cause
1443corruption if disks re-order writes.
1444.sp
1445Use \fB1\fR for yes and \fB0\fR for no (default).
1446.RE
1447
1448.sp
1449.ne 2
1450.na
1451\fBzfs_nopwrite_enabled\fR (int)
1452.ad
1453.RS 12n
1454Enable NOP writes
1455.sp
1456Use \fB1\fR for yes (default) and \fB0\fR to disable.
1457.RE
1458
1459.sp
1460.ne 2
1461.na
1462\fBzfs_dmu_offset_next_sync\fR (int)
1463.ad
1464.RS 12n
Enable forcing txg sync to find holes. When enabled, ZFS behaves as in prior
versions when SEEK_HOLE or SEEK_DATA flags are used: if a dnode is dirty, the
txg is synced so that the holes or data can be found.
1469.sp
1470Use \fB1\fR for yes and \fB0\fR to disable (default).
1471.RE
1472
1473.sp
1474.ne 2
1475.na
\fBzfs_pd_bytes_max\fR (int)
1477.ad
1478.RS 12n
The number of bytes which should be prefetched during a pool traversal
(eg: \fBzfs send\fR or other data crawling operations)
.sp
Default value: \fB52,428,800\fR.
1483.RE
1484
1485.sp
1486.ne 2
1487.na
1488\fBzfs_per_txg_dirty_frees_percent \fR (ulong)
1489.ad
1490.RS 12n
1491Tunable to control percentage of dirtied blocks from frees in one TXG.
1492After this threshold is crossed, additional dirty blocks from frees
1493wait until the next TXG.
1494A value of zero will disable this throttle.
1495.sp
Default value: \fB30\fR (\fB0\fR disables the throttle).
1497.RE
1498
1499
1500
1501.sp
1502.ne 2
1503.na
1504\fBzfs_prefetch_disable\fR (int)
1505.ad
1506.RS 12n
1507This tunable disables predictive prefetch. Note that it leaves "prescient"
1508prefetch (e.g. prefetch for zfs send) intact. Unlike predictive prefetch,
1509prescient prefetch never issues i/os that end up not being needed, so it
1510can't hurt performance.
1511.sp
1512Use \fB1\fR for yes and \fB0\fR for no (default).
1513.RE
1514
1515.sp
1516.ne 2
1517.na
1518\fBzfs_read_chunk_size\fR (long)
1519.ad
1520.RS 12n
1521Bytes to read per chunk
1522.sp
1523Default value: \fB1,048,576\fR.
1524.RE
1525
1526.sp
1527.ne 2
1528.na
1529\fBzfs_read_history\fR (int)
1530.ad
1531.RS 12n
1532Historic statistics for the last N reads will be available in
1533\fR/proc/spl/kstat/zfs/POOLNAME/reads\fB
.sp
Default value: \fB0\fR (no data is kept).
1536.RE
1537
1538.sp
1539.ne 2
1540.na
1541\fBzfs_read_history_hits\fR (int)
1542.ad
1543.RS 12n
1544Include cache hits in read history
1545.sp
1546Use \fB1\fR for yes and \fB0\fR for no (default).
1547.RE
1548
1549.sp
1550.ne 2
1551.na
1552\fBzfs_recover\fR (int)
1553.ad
1554.RS 12n
1555Set to attempt to recover from fatal errors. This should only be used as a
1556last resort, as it typically results in leaked space, or worse.
1557.sp
1558Use \fB1\fR for yes and \fB0\fR for no (default).
1559.RE
1560
1561.sp
1562.ne 2
1563.na
1564\fBzfs_resilver_delay\fR (int)
1565.ad
1566.RS 12n
1567Number of ticks to delay prior to issuing a resilver I/O operation when
1568a non-resilver or non-scrub I/O operation has occurred within the past
1569\fBzfs_scan_idle\fR ticks.
1570.sp
1571Default value: \fB2\fR.
1572.RE
1573
1574.sp
1575.ne 2
1576.na
1577\fBzfs_resilver_min_time_ms\fR (int)
1578.ad
1579.RS 12n
1580Resilvers are processed by the sync thread. While resilvering it will spend
1581at least this much time working on a resilver between txg flushes.
1582.sp
1583Default value: \fB3,000\fR.
1584.RE
1585
1586.sp
1587.ne 2
1588.na
1589\fBzfs_scan_idle\fR (int)
1590.ad
1591.RS 12n
1592Idle window in clock ticks. During a scrub or a resilver, if
1593a non-scrub or non-resilver I/O operation has occurred during this
1594window, the next scrub or resilver operation is delayed by, respectively
1595\fBzfs_scrub_delay\fR or \fBzfs_resilver_delay\fR ticks.
1596.sp
1597Default value: \fB50\fR.
1598.RE
1599
1600.sp
1601.ne 2
1602.na
1603\fBzfs_scan_min_time_ms\fR (int)
1604.ad
1605.RS 12n
1606Scrubs are processed by the sync thread. While scrubbing it will spend
1607at least this much time working on a scrub between txg flushes.
1608.sp
1609Default value: \fB1,000\fR.
1610.RE
1611
1612.sp
1613.ne 2
1614.na
1615\fBzfs_scrub_delay\fR (int)
1616.ad
1617.RS 12n
1618Number of ticks to delay prior to issuing a scrub I/O operation when
1619a non-scrub or non-resilver I/O operation has occurred within the past
1620\fBzfs_scan_idle\fR ticks.
1621.sp
1622Default value: \fB4\fR.
1623.RE
1624
1625.sp
1626.ne 2
1627.na
1628\fBzfs_send_corrupt_data\fR (int)
1629.ad
1630.RS 12n
Allow sending of corrupt data (ignore read/checksum errors when sending data)
1632.sp
1633Use \fB1\fR for yes and \fB0\fR for no (default).
1634.RE
1635
1636.sp
1637.ne 2
1638.na
1639\fBzfs_sync_pass_deferred_free\fR (int)
1640.ad
1641.RS 12n
Flushing of data to disk is done in passes. Defer frees starting in this pass
1643.sp
1644Default value: \fB2\fR.
1645.RE
1646
1647.sp
1648.ne 2
1649.na
1650\fBzfs_sync_pass_dont_compress\fR (int)
1651.ad
1652.RS 12n
1653Don't compress starting in this pass
1654.sp
1655Default value: \fB5\fR.
1656.RE
1657
1658.sp
1659.ne 2
1660.na
1661\fBzfs_sync_pass_rewrite\fR (int)
1662.ad
1663.RS 12n
Rewrite new block pointers starting in this pass
1665.sp
1666Default value: \fB2\fR.
1667.RE
1668
1669.sp
1670.ne 2
1671.na
1672\fBzfs_top_maxinflight\fR (int)
1673.ad
1674.RS 12n
1675Max concurrent I/Os per top-level vdev (mirrors or raidz arrays) allowed during
1676scrub or resilver operations.
1677.sp
1678Default value: \fB32\fR.
1679.RE
1680
1681.sp
1682.ne 2
1683.na
1684\fBzfs_txg_history\fR (int)
1685.ad
1686.RS 12n
1687Historic statistics for the last N txgs will be available in
1688\fR/proc/spl/kstat/zfs/POOLNAME/txgs\fB
1689.sp
1690Default value: \fB0\fR.
1691.RE
1692
1693.sp
1694.ne 2
1695.na
1696\fBzfs_txg_timeout\fR (int)
1697.ad
1698.RS 12n
Flush dirty data to disk at least every N seconds (maximum txg duration)
1700.sp
1701Default value: \fB5\fR.
1702.RE
1703
1704.sp
1705.ne 2
1706.na
1707\fBzfs_vdev_aggregation_limit\fR (int)
1708.ad
1709.RS 12n
1710Max vdev I/O aggregation size
1711.sp
1712Default value: \fB131,072\fR.
1713.RE
1714
1715.sp
1716.ne 2
1717.na
1718\fBzfs_vdev_cache_bshift\fR (int)
1719.ad
1720.RS 12n
Shift size to inflate reads to
1722.sp
Default value: \fB16\fR (effectively 65536).
1724.RE
1725
1726.sp
1727.ne 2
1728.na
1729\fBzfs_vdev_cache_max\fR (int)
1730.ad
1731.RS 12n
Inflate reads smaller than this value to meet the \fBzfs_vdev_cache_bshift\fR
1733size.
1734.sp
1735Default value: \fB16384\fR.
1736.RE
1737
1738.sp
1739.ne 2
1740.na
1741\fBzfs_vdev_cache_size\fR (int)
1742.ad
1743.RS 12n
1744Total size of the per-disk cache in bytes.
1745.sp
1746Currently this feature is disabled as it has been found to not be helpful
1747for performance and in some cases harmful.
1748.sp
1749Default value: \fB0\fR.
1750.RE
1751
1752.sp
1753.ne 2
1754.na
\fBzfs_vdev_mirror_rotating_inc\fR (int)
1756.ad
1757.RS 12n
A number by which the balancing algorithm increments the load calculation for
the purpose of selecting the least busy mirror member when an I/O immediately
follows its predecessor on rotational vdevs.
.sp
Default value: \fB0\fR.
1764.RE
1765
1766.sp
1767.ne 2
1768.na
1769\fBzfs_vdev_mirror_rotating_seek_inc\fR (int)
1770.ad
1771.RS 12n
1772A number by which the balancing algorithm increments the load calculation for
1773the purpose of selecting the least busy mirror member when an I/O lacks
locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. For I/Os
within this window that do not immediately follow the previous I/O, the load
is incremented by half of this value.
1777.sp
1778Default value: \fB5\fR.
1779.RE
1780
1781.sp
1782.ne 2
1783.na
1784\fBzfs_vdev_mirror_rotating_seek_offset\fR (int)
1785.ad
1786.RS 12n
1787The maximum distance for the last queued I/O in which the balancing algorithm
1788considers an I/O to have locality.
1789See the section "ZFS I/O SCHEDULER".
1790.sp
1791Default value: \fB1048576\fR.
1792.RE
1793
1794.sp
1795.ne 2
1796.na
1797\fBzfs_vdev_mirror_non_rotating_inc\fR (int)
1798.ad
1799.RS 12n
1800A number by which the balancing algorithm increments the load calculation for
1801the purpose of selecting the least busy mirror member on non-rotational vdevs
1802when I/Os do not immediately follow one another.
1803.sp
1804Default value: \fB0\fR.
1805.RE
1806
1807.sp
1808.ne 2
1809.na
1810\fBzfs_vdev_mirror_non_rotating_seek_inc\fR (int)
1811.ad
1812.RS 12n
1813A number by which the balancing algorithm increments the load calculation for
1814the purpose of selecting the least busy mirror member when an I/O lacks
locality as defined by \fBzfs_vdev_mirror_rotating_seek_offset\fR. For I/Os
within this window that do not immediately follow the previous I/O, the load
is incremented by half of this value.
1818.sp
1819Default value: \fB1\fR.
1820.RE
1821
1822.sp
1823.ne 2
1824.na
1825\fBzfs_vdev_read_gap_limit\fR (int)
1826.ad
1827.RS 12n
1828Aggregate read I/O operations if the gap on-disk between them is within this
1829threshold.
1830.sp
1831Default value: \fB32,768\fR.
1832.RE
1833
1834.sp
1835.ne 2
1836.na
1837\fBzfs_vdev_scheduler\fR (charp)
1838.ad
1839.RS 12n
Set the Linux I/O scheduler on whole disk vdevs to this scheduler
1841.sp
1842Default value: \fBnoop\fR.
1843.RE
1844
1845.sp
1846.ne 2
1847.na
1848\fBzfs_vdev_write_gap_limit\fR (int)
1849.ad
1850.RS 12n
1851Aggregate write I/O over gap
1852.sp
1853Default value: \fB4,096\fR.
1854.RE
1855
1856.sp
1857.ne 2
1858.na
1859\fBzfs_vdev_raidz_impl\fR (string)
1860.ad
1861.RS 12n
Parameter for selecting raidz parity implementation to use.
1863
1864Options marked (always) below may be selected on module load as they are
1865supported on all systems.
1866The remaining options may only be set after the module is loaded, as they
1867are available only if the implementations are compiled in and supported
1868on the running system.
1869
1870Once the module is loaded, the content of
1871/sys/module/zfs/parameters/zfs_vdev_raidz_impl will show available options
1872with the currently selected one enclosed in [].
1873Possible options are:
1874 fastest - (always) implementation selected using built-in benchmark
1875 original - (always) original raidz implementation
1876 scalar - (always) scalar raidz implementation
1877 sse2 - implementation using SSE2 instruction set (64bit x86 only)
1878 ssse3 - implementation using SSSE3 instruction set (64bit x86 only)
 avx2 - implementation using AVX2 instruction set (64bit x86 only)
1880 avx512f - implementation using AVX512F instruction set (64bit x86 only)
1881 avx512bw - implementation using AVX512F & AVX512BW instruction sets (64bit x86 only)
1882 aarch64_neon - implementation using NEON (Aarch64/64 bit ARMv8 only)
1883 aarch64_neonx2 - implementation using NEON with more unrolling (Aarch64/64 bit ARMv8 only)
1884.sp
1885Default value: \fBfastest\fR.
1886.RE
1887
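.sp
.LP
For example (a minimal sketch using the sysfs file described above):
.sp
.nf
# list available implementations; the active one is shown in []
cat /sys/module/zfs/parameters/zfs_vdev_raidz_impl

# select the avx2 implementation if it is offered
echo avx2 > /sys/module/zfs/parameters/zfs_vdev_raidz_impl
.fi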
1888.sp
1889.ne 2
1890.na
1891\fBzfs_zevent_cols\fR (int)
1892.ad
1893.RS 12n
When zevents are logged to the console use this as the word wrap width.
1895.sp
1896Default value: \fB80\fR.
1897.RE
1898
1899.sp
1900.ne 2
1901.na
1902\fBzfs_zevent_console\fR (int)
1903.ad
1904.RS 12n
1905Log events to the console
1906.sp
1907Use \fB1\fR for yes and \fB0\fR for no (default).
1908.RE
1909
1910.sp
1911.ne 2
1912.na
1913\fBzfs_zevent_len_max\fR (int)
1914.ad
1915.RS 12n
1916Max event queue length. A value of 0 will result in a calculated value which
1917increases with the number of CPUs in the system (minimum 64 events). Events
1918in the queue can be viewed with the \fBzpool events\fR command.
1919.sp
1920Default value: \fB0\fR.
1921.RE
1922
1923.sp
1924.ne 2
1925.na
1926\fBzil_replay_disable\fR (int)
1927.ad
1928.RS 12n
Disable intent logging replay. Replay can be disabled to recover from a
corrupted ZIL.
1931.sp
1932Use \fB1\fR for yes and \fB0\fR for no (default).
1933.RE
1934
1935.sp
1936.ne 2
1937.na
\fBzil_slog_bulk\fR (ulong)
1939.ad
1940.RS 12n
1941Limit SLOG write size per commit executed with synchronous priority.
1942Any writes above that will be executed with lower (asynchronous) priority
to limit potential SLOG device abuse by a single active ZIL writer.
.sp
Default value: \fB786,432\fR.
1946.RE
1947
1948.sp
1949.ne 2
1950.na
1951\fBzio_delay_max\fR (int)
1952.ad
1953.RS 12n
A zevent will be logged if a ZIO operation takes more than N milliseconds to
complete. Note that this is only a logging facility, not a timeout on
operations.
1957.sp
1958Default value: \fB30,000\fR.
1959.RE
1960
1961.sp
1962.ne 2
1963.na
1964\fBzio_dva_throttle_enabled\fR (int)
1965.ad
1966.RS 12n
1967Throttle block allocations in the ZIO pipeline. This allows for
1968dynamic allocation distribution when devices are imbalanced.
1969When enabled, the maximum number of pending allocations per top-level vdev
1970is limited by \fBzfs_vdev_queue_depth_pct\fR.
.sp
Default value: \fB1\fR.
1973.RE
1974
1975.sp
1976.ne 2
1977.na
1978\fBzio_requeue_io_start_cut_in_line\fR (int)
1979.ad
1980.RS 12n
1981Prioritize requeued I/O
1982.sp
1983Default value: \fB0\fR.
1984.RE
1985
1986.sp
1987.ne 2
1988.na
1989\fBzio_taskq_batch_pct\fR (uint)
1990.ad
1991.RS 12n
1992Percentage of online CPUs (or CPU cores, etc) which will run a worker thread
1993for IO. These workers are responsible for IO work such as compression and
1994checksum calculations. Fractional number of CPUs will be rounded down.
1995.sp
1996The default value of 75 was chosen to avoid using all CPUs which can result in
1997latency issues and inconsistent application performance, especially when high
1998compression is enabled.
1999.sp
2000Default value: \fB75\fR.
2001.RE
2002
2003.sp
2004.ne 2
2005.na
2006\fBzvol_inhibit_dev\fR (uint)
2007.ad
2008.RS 12n
2009Do not create zvol device nodes. This may slightly improve startup time on
2010systems with a very large number of zvols.
2011.sp
2012Use \fB1\fR for yes and \fB0\fR for no (default).
2013.RE
2014
2015.sp
2016.ne 2
2017.na
2018\fBzvol_major\fR (uint)
2019.ad
2020.RS 12n
Major number for zvol block devices
2022.sp
2023Default value: \fB230\fR.
2024.RE
2025
2026.sp
2027.ne 2
2028.na
2029\fBzvol_max_discard_blocks\fR (ulong)
2030.ad
2031.RS 12n
2032Discard (aka TRIM) operations done on zvols will be done in batches of this
2033many blocks, where block size is determined by the \fBvolblocksize\fR property
2034of a zvol.
2035.sp
2036Default value: \fB16,384\fR.
2037.RE
2038
2039.sp
2040.ne 2
2041.na
2042\fBzvol_prefetch_bytes\fR (uint)
2043.ad
2044.RS 12n
2045When adding a zvol to the system prefetch \fBzvol_prefetch_bytes\fR
2046from the start and end of the volume. Prefetching these regions
2047of the volume is desirable because they are likely to be accessed
2048immediately by \fBblkid(8)\fR or by the kernel scanning for a partition
2049table.
2050.sp
2051Default value: \fB131,072\fR.
2052.RE
2053
2054.sp
2055.ne 2
2056.na
2057\fBzvol_request_sync\fR (uint)
2058.ad
2059.RS 12n
2060When processing I/O requests for a zvol submit them synchronously. This
2061effectively limits the queue depth to 1 for each I/O submitter. When set
2062to 0 requests are handled asynchronously by a thread pool. The number of
requests which can be handled concurrently is controlled by \fBzvol_threads\fR.
2064.sp
Default value: \fB0\fR.
2066.RE
2067
2068.sp
2069.ne 2
2070.na
2071\fBzvol_threads\fR (uint)
2072.ad
2073.RS 12n
2074Max number of threads which can handle zvol I/O requests concurrently.
2075.sp
2076Default value: \fB32\fR.
2077.RE
2078
.sp
2080.ne 2
2081.na
2082\fBzfs_qat_disable\fR (int)
2083.ad
2084.RS 12n
2085This tunable disables qat hardware acceleration for gzip compression.
2086It is available only if qat acceleration is compiled in and qat driver
2087is present.
2088.sp
2089Use \fB1\fR for yes and \fB0\fR for no (default).
2090.RE
2091
.SH ZFS I/O SCHEDULER
ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os.
The I/O scheduler determines when and in what order those operations are
issued. The I/O scheduler divides operations into five I/O classes
prioritized in the following order: sync read, sync write, async read,
async write, and scrub/resilver. Each queue defines the minimum and
maximum number of concurrent operations that may be issued to the
device. In addition, the device has an aggregate maximum,
\fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums
must not exceed the aggregate maximum. If the sum of the per-queue
maximums exceeds the aggregate maximum, then the number of active I/Os
may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will
be issued regardless of whether all per-queue minimums have been met.
.sp
For many physical devices, throughput increases with the number of
concurrent operations, but latency typically suffers. Further, physical
devices typically have a limit at which more concurrent operations have no
effect on throughput or can actually cause it to decrease.
.sp
The scheduler selects the next operation to issue by first looking for an
I/O class whose minimum has not been satisfied. Once all are satisfied and
the aggregate maximum has not been hit, the scheduler looks for classes
whose maximum has not been satisfied. Iteration through the I/O classes is
done in the order specified above. No further operations are issued if the
aggregate maximum number of concurrent operations has been hit or if there
are no operations queued for an I/O class that has not hit its maximum.
Every time an I/O is queued or an operation completes, the I/O scheduler
looks for new operations to issue.
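.sp
This selection logic can be pictured with the following simplified sketch.
It is an illustrative C fragment, not the actual kernel code; the structure,
array, and values are hypothetical stand-ins for the per-class queues and
their min/max tunables:
.sp
.nf
#define ZIO_CLASSES 5   /* five classes, in the priority order above */

typedef struct {
        int queued;      /* I/Os waiting in this class */
        int active;      /* I/Os issued and not yet completed */
        int min_active;  /* e.g. zfs_vdev_sync_read_min_active */
        int max_active;  /* e.g. zfs_vdev_sync_read_max_active */
} ioclass_t;

/* Returns the class to issue from next, or -1 to issue nothing. */
static int
pick_class(const ioclass_t c[ZIO_CLASSES], int total_active, int max_active)
{
        int i;

        /* The aggregate limit (zfs_vdev_max_active) always applies. */
        if (total_active >= max_active)
                return (-1);

        /* First pass: any class still below its minimum. */
        for (i = 0; i < ZIO_CLASSES; i++)
                if (c[i].queued > 0 && c[i].active < c[i].min_active)
                        return (i);

        /* Second pass: highest priority class below its maximum. */
        for (i = 0; i < ZIO_CLASSES; i++)
                if (c[i].queued > 0 && c[i].active < c[i].max_active)
                        return (i);

        return (-1);
}

int
main(void)
{
        /* Example: queued sync reads below their minimum win first. */
        ioclass_t c[ZIO_CLASSES] = {
                { 4, 0, 10, 10 },       /* sync read */
                { 2, 0, 10, 10 },       /* sync write */
                { 8, 3, 1, 3 },         /* async read (at its maximum) */
                { 9, 2, 1, 10 },        /* async write */
                { 5, 0, 1, 2 },         /* scrub/resilver */
        };
        return (pick_class(c, 5, 1000) == 0 ? 0 : 1);
}
.fi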
.sp
In general, smaller values of max_active will lead to lower latency of
synchronous operations. Larger values of max_active may lead to higher
overall throughput, depending on underlying storage.
.sp
The ratio of the queues' max_active values determines the balance of
performance between reads, writes, and scrubs. E.g., increasing
\fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete
more quickly, but will cause reads and writes to have higher latency and
lower throughput.
.sp
All I/O classes have a fixed maximum number of outstanding operations
except for the async write class. Asynchronous writes represent the data
that is committed to stable storage during the syncing stage for
transaction groups. Transaction groups enter the syncing state
periodically so the number of queued async writes will quickly burst up
and then bleed down to zero. Rather than servicing them as quickly as
possible, the I/O scheduler changes the maximum number of active async
write I/Os according to the amount of dirty data in the pool. Since
both throughput and latency typically increase with the number of
concurrent operations issued to physical devices, reducing the
burstiness in the number of concurrent operations also stabilizes the
response time of operations from other -- and in particular synchronous
-- queues. In broad strokes, the I/O scheduler will issue more
concurrent operations from the async write queue as there's more dirty
data in the pool.
.sp
Async Writes
.sp
The number of concurrent operations issued for the async write I/O class
follows a piece-wise linear function defined by a few adjustable points.
.nf

       |              o---------| <-- zfs_vdev_async_write_max_active
  ^    |             /^         |
  |    |            / |         |
active |           /  |         |
 I/O   |          /   |         |
count  |         /    |         |
       |        /     |         |
       |-------o      |         | <-- zfs_vdev_async_write_min_active
      0|_______^______|_________|
       0%      |      |  100% of zfs_dirty_data_max
               |      |
               |      `-- zfs_vdev_async_write_active_max_dirty_percent
               `--------- zfs_vdev_async_write_active_min_dirty_percent

.fi
Until the amount of dirty data exceeds a minimum percentage of the dirty
data allowed in the pool, the I/O scheduler will limit the number of
concurrent operations to the minimum. As that threshold is crossed, the
number of concurrent operations issued increases linearly to the maximum at
the specified maximum percentage of the dirty data allowed in the pool.
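.sp
The same piece-wise linear function can be written as the following sketch
(illustrative only; the function and its arguments are hypothetical, and the
values used assume, for illustration, a minimum of 1, a maximum of 10, and
thresholds of 30% and 60% of \fBzfs_dirty_data_max\fR):
.sp
.nf
/* Sketch: active async write I/O count as a function of dirty data. */
static int
async_write_active(unsigned long dirty, unsigned long dirty_max,
    int min_active, int max_active, int min_dirty_pct, int max_dirty_pct)
{
        unsigned long lo = dirty_max * min_dirty_pct / 100;
        unsigned long hi = dirty_max * max_dirty_pct / 100;

        if (dirty <= lo)
                return (min_active);
        if (dirty >= hi)
                return (max_active);
        /* Linear ramp between the two dirty data thresholds. */
        return (min_active +
            (int)((dirty - lo) * (max_active - min_active) / (hi - lo)));
}

int
main(void)
{
        /*
         * Assumed for illustration: min_active=1, max_active=10 and
         * thresholds of 30% and 60% of zfs_dirty_data_max.  Halfway up
         * the ramp (45% dirty) roughly 5 async writes are issued
         * concurrently.
         */
        return (async_write_active(45, 100, 1, 10, 30, 60) == 5 ? 0 : 1);
}
.fi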
.sp
Ideally, the amount of dirty data on a busy pool will stay in the sloped
part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR
and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the
maximum percentage, this indicates that the rate of incoming data is
greater than the rate that the backend storage can handle. In this case, we
must further throttle incoming writes, as described in the next section.

.SH ZFS TRANSACTION DELAY
We delay transactions when we've determined that the backend storage
isn't able to accommodate the rate of incoming writes.
.sp
If there is already a transaction waiting, we delay relative to when
that transaction will finish waiting. This way the calculated delay time
is independent of the number of threads concurrently executing
transactions.
.sp
If we are the only waiter, wait relative to when the transaction
started, rather than the current time. This credits the transaction for
"time already served", e.g. reading indirect blocks.
.sp
The minimum time for a transaction to take is calculated as:
.nf
    min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
    min_time is then capped at 100 milliseconds.
.fi
.sp
The delay has two degrees of freedom that can be adjusted via tunables. The
percentage of dirty data at which we start to delay is defined by
\fBzfs_delay_min_dirty_percent\fR. This should typically be at or above
\fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to
delay after writing at full speed has failed to keep up with the incoming write
rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking,
this variable determines the amount of delay at the midpoint of the curve.
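.sp
The formula above can be sketched in C as follows (illustrative only; the
function and its arguments are hypothetical and the kernel uses its own
fixed-point arithmetic). With \fBzfs_delay_scale\fR assumed to be 500,000ns,
a transaction arriving when dirty data sits halfway between the delay
threshold and the limit is held for about 500us, i.e. roughly 2000 delayed
transactions per second:
.sp
.nf
#define DELAY_CAP_NS    100000000ULL    /* cap at 100 milliseconds */

/* Sketch: per-transaction delay in nanoseconds. */
static unsigned long long
tx_delay_ns(unsigned long long dirty, unsigned long long dirty_max,
    unsigned int min_dirty_pct, unsigned long long delay_scale)
{
        unsigned long long min = dirty_max * min_dirty_pct / 100;
        unsigned long long ns;

        if (dirty <= min)
                return (0);
        if (dirty >= dirty_max)
                return (DELAY_CAP_NS);
        ns = delay_scale * (dirty - min) / (dirty_max - dirty);
        return (ns > DELAY_CAP_NS ? DELAY_CAP_NS : ns);
}

int
main(void)
{
        /*
         * Assumed values: zfs_delay_scale=500,000 and a delay threshold
         * of 60%.  At 80% of zfs_dirty_data_max (halfway between 60%
         * and 100%) the delay is 500,000ns = 500us, the midpoint shown
         * in the graphs below.
         */
        return (tx_delay_ns(80, 100, 60, 500000) == 500000 ? 0 : 1);
}
.fi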
.sp
.nf
delay
 10ms +-------------------------------------------------------------*+
      |                                                             *|
  9ms +                                                             *+
      |                                                             *|
  8ms +                                                             *+
      |                                                            * |
  7ms +                                                            * +
      |                                                            * |
  6ms +                                                            * +
      |                                                            * |
  5ms +                                                            * +
      |                                                           *  |
  4ms +                                                           *  +
      |                                                           *  |
  3ms +                                                          *   +
      |                                                          *   |
  2ms +                                              (midpoint) *    +
      |                                              |       **      |
  1ms +                                              v   ***         +
      |             zfs_delay_scale ---------->   ********           |
    0 +-------------------------------------*********----------------+
      0%                  <- zfs_dirty_data_max ->                100%
.fi
.sp
Note that since the delay is added to the outstanding time remaining on the
most recent transaction, the delay is effectively the inverse of IOPS.
Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
was chosen such that small changes in the amount of accumulated dirty data
in the first 3/4 of the curve yield relatively small differences in the
amount of delay.
.sp
The effects can be easier to understand when the amount of delay is
represented on a log scale:
.sp
.nf
delay
100ms +-------------------------------------------------------------++
      +                                                              +
      |                                                              |
      +                                                             *+
 10ms +                                                             *+
      +                                                          **  +
      |                                             (midpoint)  **   |
      +                                             |         **     +
  1ms +                                             v     ****       +
      +             zfs_delay_scale ---------->      *****           +
      |                                          ****                |
      +                                      ****                    +
100us +                                    **                        +
      +                                   *                          +
      |                                  *                           |
      +                                 *                            +
 10us +                                 *                            +
      +                                                              +
      |                                                              |
      +                                                              +
      +--------------------------------------------------------------+
      0%                  <- zfs_dirty_data_max ->                100%
.fi
.sp
Note here that only as the amount of dirty data approaches its limit does
the delay start to increase rapidly. The goal of a properly tuned system
should be to keep the amount of dirty data out of that range by first
ensuring that the appropriate limits are set for the I/O scheduler to reach
optimal throughput on the backend storage, and then by changing the value
of \fBzfs_delay_scale\fR to increase the steepness of the curve.