1'\" te
2.\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
3.\" The contents of this file are subject to the terms of the Common Development
4.\" and Distribution License (the "License"). You may not use this file except
5.\" in compliance with the License. You can obtain a copy of the license at
6.\" usr/src/OPENSOLARIS.LICENSE or http://www.opensolaris.org/os/licensing.
7.\"
8.\" See the License for the specific language governing permissions and
9.\" limitations under the License. When distributing Covered Code, include this
10.\" CDDL HEADER in each file and include the License file at
11.\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this
12.\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your
13.\" own identifying information:
14.\" Portions Copyright [yyyy] [name of copyright owner]
15.TH ZFS-MODULE-PARAMETERS 5 "Nov 16, 2013"
16.SH NAME
17zfs\-module\-parameters \- ZFS module parameters
18.SH DESCRIPTION
19.sp
20.LP
21Description of the different parameters to the ZFS module.
22
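.sp
.LP
Most of these parameters can be supplied when the module is loaded, and many
of them may also be inspected or changed at runtime through sysfs. The
following sketch is illustrative only; the file name and values are examples,
not recommendations.
.sp
.nf
# Set a parameter at module load time, e.g. in /etc/modprobe.d/zfs.conf:
options zfs zfs_prefetch_disable=1

# Inspect and change a parameter at runtime:
cat /sys/module/zfs/parameters/zfs_prefetch_disable
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
.fi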
23.SS "Module parameters"
24.sp
25.LP
26
27.sp
28.ne 2
29.na
30\fBl2arc_feed_again\fR (int)
31.ad
32.RS 12n
33Turbo L2ARC warmup
34.sp
35Use \fB1\fR for yes (default) and \fB0\fR to disable.
36.RE
37
38.sp
39.ne 2
40.na
41\fBl2arc_feed_min_ms\fR (ulong)
42.ad
43.RS 12n
44Min feed interval in milliseconds
45.sp
46Default value: \fB200\fR.
47.RE
48
49.sp
50.ne 2
51.na
52\fBl2arc_feed_secs\fR (ulong)
53.ad
54.RS 12n
55Seconds between L2ARC writing
56.sp
57Default value: \fB1\fR.
58.RE
59
60.sp
61.ne 2
62.na
63\fBl2arc_headroom\fR (ulong)
64.ad
65.RS 12n
66Number of max device writes to precache
67.sp
68Default value: \fB2\fR.
69.RE
70
71.sp
72.ne 2
73.na
74\fBl2arc_headroom_boost\fR (ulong)
75.ad
76.RS 12n
77Compressed l2arc_headroom multiplier
78.sp
79Default value: \fB200\fR.
80.RE
81
82.sp
83.ne 2
84.na
85\fBl2arc_nocompress\fR (int)
86.ad
87.RS 12n
88Skip compressing L2ARC buffers
89.sp
90Use \fB1\fR for yes and \fB0\fR for no (default).
91.RE
92
93.sp
94.ne 2
95.na
96\fBl2arc_noprefetch\fR (int)
97.ad
98.RS 12n
99Skip caching prefetched buffers
100.sp
101Use \fB1\fR for yes (default) and \fB0\fR to disable.
102.RE
103
104.sp
105.ne 2
106.na
107\fBl2arc_norw\fR (int)
108.ad
109.RS 12n
110No reads during writes
111.sp
112Use \fB1\fR for yes and \fB0\fR for no (default).
113.RE
114
115.sp
116.ne 2
117.na
118\fBl2arc_write_boost\fR (ulong)
119.ad
120.RS 12n
121Extra write bytes during device warmup
122.sp
123Default value: \fB8,388,608\fR.
124.RE
125
126.sp
127.ne 2
128.na
129\fBl2arc_write_max\fR (ulong)
130.ad
131.RS 12n
132Max write bytes per interval
133.sp
134Default value: \fB8,388,608\fR.
135.RE
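.sp
For example, to allow larger L2ARC writes per feed interval at runtime (the
64 MiB value below is purely illustrative):
.sp
.nf
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_max
echo $((64 * 1024 * 1024)) > /sys/module/zfs/parameters/l2arc_write_boost
.fi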
136
137.sp
138.ne 2
139.na
140\fBmetaslab_debug_load\fR (int)
141.ad
142.RS 12n
143Load all metaslabs during pool import.
144.sp
145Use \fB1\fR for yes and \fB0\fR for no (default).
146.RE
147
148.sp
149.ne 2
150.na
151\fBmetaslab_debug_unload\fR (int)
152.ad
153.RS 12n
154Prevent metaslabs from being unloaded.
155.sp
156Use \fB1\fR for yes and \fB0\fR for no (default).
157.RE
158
159.sp
160.ne 2
161.na
162\fBspa_config_path\fR (charp)
163.ad
164.RS 12n
165SPA config file
166.sp
167Default value: \fB/etc/zfs/zpool.cache\fR.
168.RE
169
170.sp
171.ne 2
172.na
173\fBspa_asize_inflation\fR (int)
174.ad
175.RS 12n
176Multiplication factor used to estimate actual disk consumption from the
177size of data being written. The default value is a worst case estimate,
178but lower values may be valid for a given pool depending on its
179configuration. Pool administrators who understand the factors involved
180may wish to specify a more realistic inflation factor, particularly if
181they operate close to quota or capacity limits.
182.sp
183Default value: \fB24\fR.
184.RE
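.sp
As a worked example of the default factor of 24: a 100 MiB write may be
charged as much as
.sp
.nf
    100 MiB * 24 = 2400 MiB
.fi
.sp
of estimated on-disk consumption, which is why writes made close to a quota
or reservation limit may be refused earlier than expected.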
185
186.sp
187.ne 2
188.na
189\fBzfetch_array_rd_sz\fR (ulong)
190.ad
191.RS 12n
192If prefetching is enabled, disable prefetching for reads larger than this size.
193.sp
194Default value: \fB1,048,576\fR.
195.RE
196
197.sp
198.ne 2
199.na
200\fBzfetch_block_cap\fR (uint)
201.ad
202.RS 12n
203Max number of blocks to prefetch at a time
204.sp
205Default value: \fB256\fR.
206.RE
207
208.sp
209.ne 2
210.na
211\fBzfetch_max_streams\fR (uint)
212.ad
213.RS 12n
214Max number of streams per zfetch (prefetch streams per file).
215.sp
216Default value: \fB8\fR.
217.RE
218
219.sp
220.ne 2
221.na
222\fBzfetch_min_sec_reap\fR (uint)
223.ad
224.RS 12n
225Min time before an active prefetch stream can be reclaimed
226.sp
227Default value: \fB2\fR.
228.RE
229
230.sp
231.ne 2
232.na
233\fBzfs_arc_grow_retry\fR (int)
234.ad
235.RS 12n
236Seconds before growing arc size
237.sp
238Default value: \fB5\fR.
239.RE
240
241.sp
242.ne 2
243.na
244\fBzfs_arc_max\fR (ulong)
245.ad
246.RS 12n
247Max arc size
248.sp
249Default value: \fB0\fR.
250.RE
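.sp
For example, to cap the ARC at 8 GiB (an illustrative value, not a
recommendation):
.sp
.nf
# At module load time, e.g. in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_max=8589934592

# Or at runtime:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
.fi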
251
252.sp
253.ne 2
254.na
255\fBzfs_arc_memory_throttle_disable\fR (int)
256.ad
257.RS 12n
258Disable memory throttle
259.sp
260Use \fB1\fR for yes (default) and \fB0\fR to disable.
261.RE
262
263.sp
264.ne 2
265.na
266\fBzfs_arc_meta_limit\fR (ulong)
267.ad
268.RS 12n
269Meta limit for arc size
270.sp
271Default value: \fB0\fR.
272.RE
273
274.sp
275.ne 2
276.na
277\fBzfs_arc_meta_prune\fR (int)
278.ad
279.RS 12n
280Bytes of metadata to prune
281.sp
282Default value: \fB1,048,576\fR.
283.RE
284
285.sp
286.ne 2
287.na
288\fBzfs_arc_min\fR (ulong)
289.ad
290.RS 12n
291Min arc size
292.sp
293Default value: \fB100\fR.
294.RE
295
296.sp
297.ne 2
298.na
299\fBzfs_arc_min_prefetch_lifespan\fR (int)
300.ad
301.RS 12n
302Min life of prefetch block
303.sp
304Default value: \fB100\fR.
305.RE
306
307.sp
308.ne 2
309.na
310\fBzfs_arc_p_aggressive_disable\fR (int)
311.ad
312.RS 12n
313Disable aggressive arc_p growth
314.sp
315Use \fB1\fR for yes (default) and \fB0\fR to disable.
316.RE
317
318.sp
319.ne 2
320.na
321\fBzfs_arc_p_dampener_disable\fR (int)
322.ad
323.RS 12n
324Disable arc_p adapt dampener
325.sp
326Use \fB1\fR for yes (default) and \fB0\fR to disable.
327.RE
328
329.sp
330.ne 2
331.na
332\fBzfs_arc_shrink_shift\fR (int)
333.ad
334.RS 12n
335log2(fraction of arc to reclaim)
336.sp
337Default value: \fB5\fR.
338.RE
339
340.sp
341.ne 2
342.na
343\fBzfs_autoimport_disable\fR (int)
344.ad
345.RS 12n
346Disable pool import at module load by ignoring the cache file (typically \fB/etc/zfs/zpool.cache\fR).
347.sp
348Use \fB1\fR for yes and \fB0\fR for no (default).
349.RE
350
351.sp
352.ne 2
353.na
354\fBzfs_dbuf_state_index\fR (int)
355.ad
356.RS 12n
357Calculate arc header index
358.sp
359Default value: \fB0\fR.
360.RE
361
362.sp
363.ne 2
364.na
365\fBzfs_deadman_enabled\fR (int)
366.ad
367.RS 12n
368Enable deadman timer
369.sp
370Use \fB1\fR for yes (default) and \fB0\fR to disable.
371.RE
372
373.sp
374.ne 2
375.na
376\fBzfs_deadman_synctime_ms\fR (ulong)
377.ad
378.RS 12n
379Expiration time in milliseconds. This value has two meanings. First it is
380used to determine when the spa_deadman() logic should fire. By default the
381spa_deadman() will fire if spa_sync() has not completed in 1000 seconds.
382Secondly, the value determines if an I/O is considered "hung". Any I/O that
383has not completed in zfs_deadman_synctime_ms is considered "hung" resulting
384in a zevent being logged.
385.sp
386Default value: \fB1,000,000\fR.
387.RE
388
389.sp
390.ne 2
391.na
392\fBzfs_dedup_prefetch\fR (int)
393.ad
394.RS 12n
395Enable prefetching of deduplicated blocks
396.sp
397Use \fB1\fR for yes (default) and \fB0\fR to disable.
398.RE
399
400.sp
401.ne 2
402.na
403\fBzfs_delay_min_dirty_percent\fR (int)
404.ad
405.RS 12n
406Start to delay each transaction once there is this amount of dirty data,
407expressed as a percentage of \fBzfs_dirty_data_max\fR.
408This value should be >= \fBzfs_vdev_async_write_active_max_dirty_percent\fR.
409See the section "ZFS TRANSACTION DELAY".
410.sp
411Default value: \fB60\fR.
412.RE
413
414.sp
415.ne 2
416.na
417\fBzfs_delay_scale\fR (int)
418.ad
419.RS 12n
420This controls how quickly the transaction delay approaches infinity.
421Larger values cause longer delays for a given amount of dirty data.
422.sp
423For the smoothest delay, this value should be about 1 billion divided
424by the maximum number of operations per second. This will smoothly
425handle between 10x and 1/10th this number.
426.sp
427See the section "ZFS TRANSACTION DELAY".
428.sp
429Note: \fBzfs_delay_scale\fR * \fBzfs_dirty_data_max\fR must be < 2^64.
430.sp
431Default value: \fB500,000\fR.
432.RE
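.sp
For example, if the backend storage is assumed to sustain roughly 20,000
write operations per second, a reasonable starting point would be:
.sp
.nf
    zfs_delay_scale = 1,000,000,000 / 20,000 = 50,000
.fi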
433
434.sp
435.ne 2
436.na
437\fBzfs_dirty_data_max\fR (int)
438.ad
439.RS 12n
440Determines the dirty space limit in bytes. Once this limit is exceeded, new
441writes are halted until space frees up. This parameter takes precedence
442over \fBzfs_dirty_data_max_percent\fR.
443See the section "ZFS TRANSACTION DELAY".
444.sp
445Default value: 10 percent of all memory, capped at \fBzfs_dirty_data_max_max\fR.
446.RE
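.sp
For example, to pin the dirty data limit to 4 GiB instead of a percentage of
memory (an illustrative value; note that \fBzfs_dirty_data_max_max\fR is only
enforced at module load time):
.sp
.nf
# /etc/modprobe.d/zfs.conf
options zfs zfs_dirty_data_max=4294967296
.fi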
447
448.sp
449.ne 2
450.na
451\fBzfs_dirty_data_max_max\fR (int)
452.ad
453.RS 12n
454Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed in bytes.
455This limit is only enforced at module load time, and will be ignored if
456\fBzfs_dirty_data_max\fR is later changed. This parameter takes
457precedence over \fBzfs_dirty_data_max_max_percent\fR. See the section
458"ZFS TRANSACTION DELAY".
459.sp
460Default value: 25% of physical RAM.
461.RE
462
463.sp
464.ne 2
465.na
466\fBzfs_dirty_data_max_max_percent\fR (int)
467.ad
468.RS 12n
469Maximum allowable value of \fBzfs_dirty_data_max\fR, expressed as a
470percentage of physical RAM. This limit is only enforced at module load
471time, and will be ignored if \fBzfs_dirty_data_max\fR is later changed.
472The parameter \fBzfs_dirty_data_max_max\fR takes precedence over this
473one. See the section "ZFS TRANSACTION DELAY".
474.sp
475Default value: \fB25\fR.
476.RE
477
478.sp
479.ne 2
480.na
481\fBzfs_dirty_data_max_percent\fR (int)
482.ad
483.RS 12n
484Determines the dirty space limit, expressed as a percentage of all
485memory. Once this limit is exceeded, new writes are halted until space frees
486up. The parameter \fBzfs_dirty_data_max\fR takes precedence over this
487one. See the section "ZFS TRANSACTION DELAY".
488.sp
489Default value: 10%, subject to \fBzfs_dirty_data_max_max\fR.
490.RE
491
492.sp
493.ne 2
494.na
495\fBzfs_dirty_data_sync\fR (int)
496.ad
497.RS 12n
498Start syncing out a transaction group if there is at least this much dirty data.
499.sp
500Default value: \fB67,108,864\fR.
501.RE
502
503.sp
504.ne 2
505.na
506\fBzfs_vdev_async_read_max_active\fR (int)
507.ad
508.RS 12n
509Maximum asynchronous read I/Os active to each device.
510See the section "ZFS I/O SCHEDULER".
511.sp
512Default value: \fB3\fR.
513.RE
514
515.sp
516.ne 2
517.na
518\fBzfs_vdev_async_read_min_active\fR (int)
519.ad
520.RS 12n
521Minimum asynchronous read I/Os active to each device.
522See the section "ZFS I/O SCHEDULER".
523.sp
524Default value: \fB1\fR.
525.RE
526
527.sp
528.ne 2
529.na
530\fBzfs_vdev_async_write_active_max_dirty_percent\fR (int)
531.ad
532.RS 12n
533When the pool has more than
534\fBzfs_vdev_async_write_active_max_dirty_percent\fR dirty data, use
535\fBzfs_vdev_async_write_max_active\fR to limit active async writes. If
536the dirty data is between min and max, the active I/O limit is linearly
537interpolated. See the section "ZFS I/O SCHEDULER".
538.sp
539Default value: \fB60\fR.
540.RE
541
542.sp
543.ne 2
544.na
545\fBzfs_vdev_async_write_active_min_dirty_percent\fR (int)
546.ad
547.RS 12n
548When the pool has less than
549\fBzfs_vdev_async_write_active_min_dirty_percent\fR dirty data, use
550\fBzfs_vdev_async_write_min_active\fR to limit active async writes. If
551the dirty data is between min and max, the active I/O limit is linearly
552interpolated. See the section "ZFS I/O SCHEDULER".
553.sp
554Default value: \fB30\fR.
555.RE
556
557.sp
558.ne 2
559.na
560\fBzfs_vdev_async_write_max_active\fR (int)
561.ad
562.RS 12n
563Maximum asynchronous write I/Os active to each device.
564See the section "ZFS I/O SCHEDULER".
565.sp
566Default value: \fB10\fR.
567.RE
568
569.sp
570.ne 2
571.na
572\fBzfs_vdev_async_write_min_active\fR (int)
573.ad
574.RS 12n
575Minimum asynchronous write I/Os active to each device.
576See the section "ZFS I/O SCHEDULER".
577.sp
578Default value: \fB1\fR.
579.RE
580
581.sp
582.ne 2
583.na
584\fBzfs_vdev_max_active\fR (int)
585.ad
586.RS 12n
587The maximum number of I/Os active to each device. Ideally, this will be >=
588the sum of each queue's max_active. It must be at least the sum of each
589queue's min_active. See the section "ZFS I/O SCHEDULER".
590.sp
591Default value: \fB1,000\fR.
592.RE
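.sp
The relationship between the per-queue minimums and the aggregate limit can
be checked at runtime; the sketch below assumes the sysfs layout used by the
Linux module:
.sp
.nf
cd /sys/module/zfs/parameters
awk '{ s += $1 } END { print "sum of minimums:", s }' zfs_vdev_*_min_active
cat zfs_vdev_max_active
.fi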
593
594.sp
595.ne 2
596.na
597\fBzfs_vdev_scrub_max_active\fR (int)
598.ad
599.RS 12n
600Maximum scrub I/Os active to each device.
601See the section "ZFS I/O SCHEDULER".
602.sp
603Default value: \fB2\fR.
604.RE
605
606.sp
607.ne 2
608.na
609\fBzfs_vdev_scrub_min_active\fR (int)
610.ad
611.RS 12n
612Minimum scrub I/Os active to each device.
613See the section "ZFS I/O SCHEDULER".
614.sp
615Default value: \fB1\fR.
616.RE
617
618.sp
619.ne 2
620.na
621\fBzfs_vdev_sync_read_max_active\fR (int)
622.ad
623.RS 12n
624Maximum synchronous read I/Os active to each device.
625See the section "ZFS I/O SCHEDULER".
626.sp
627Default value: \fB10\fR.
628.RE
629
630.sp
631.ne 2
632.na
633\fBzfs_vdev_sync_read_min_active\fR (int)
634.ad
635.RS 12n
636Minimum synchronous read I/Os active to each device.
637See the section "ZFS I/O SCHEDULER".
638.sp
639Default value: \fB10\fR.
640.RE
641
642.sp
643.ne 2
644.na
645\fBzfs_vdev_sync_write_max_active\fR (int)
646.ad
647.RS 12n
648Maximum synchronous write I/Os active to each device.
649See the section "ZFS I/O SCHEDULER".
650.sp
651Default value: \fB10\fR.
652.RE
653
654.sp
655.ne 2
656.na
657\fBzfs_vdev_sync_write_min_active\fR (int)
658.ad
659.RS 12n
660Minimum synchronous write I/Os active to each device.
661See the section "ZFS I/O SCHEDULER".
662.sp
663Default value: \fB10\fR.
664.RE
665
666.sp
667.ne 2
668.na
669\fBzfs_disable_dup_eviction\fR (int)
670.ad
671.RS 12n
672Disable duplicate buffer eviction
673.sp
674Use \fB1\fR for yes and \fB0\fR for no (default).
675.RE
676
677.sp
678.ne 2
679.na
680\fBzfs_expire_snapshot\fR (int)
681.ad
682.RS 12n
683Seconds to expire .zfs/snapshot
684.sp
685Default value: \fB300\fR.
686.RE
687
688.sp
689.ne 2
690.na
691\fBzfs_flags\fR (int)
692.ad
693.RS 12n
694Set additional debugging flags
695.sp
696Default value: \fB1\fR.
697.RE
698
699.sp
700.ne 2
701.na
702\fBzfs_free_leak_on_eio\fR (int)
703.ad
704.RS 12n
705If destroy encounters an EIO while reading metadata (e.g. indirect
706blocks), space referenced by the missing metadata cannot be freed.
707Normally this causes the background destroy to become "stalled", as
708it is unable to make forward progress. While in this stalled state,
709all remaining space to free from the error-encountering filesystem is
710"temporarily leaked". Set this flag to cause it to ignore the EIO,
711permanently leak the space from indirect blocks that cannot be read,
712and continue to free everything else that it can.
713
714The default, "stalling" behavior is useful if the storage partially
715fails (i.e. some but not all I/Os fail), and then later recovers. In
716this case, we will be able to continue pool operations while it is
717partially failed, and when it recovers, we can continue to free the
718space, with no leaks. However, note that this case is actually
719fairly rare.
720
721Typically pools either (a) fail completely (but perhaps temporarily,
722e.g. a top-level vdev going offline), or (b) have localized,
723permanent errors (e.g. disk returns the wrong data due to bit flip or
724firmware bug). In case (a), this setting does not matter because the
725pool will be suspended and the sync thread will not be able to make
726forward progress regardless. In case (b), because the error is
727permanent, the best we can do is leak the minimum amount of space,
728which is what setting this flag will do. Therefore, it is reasonable
729for this flag to normally be set, but we chose the more conservative
730approach of not setting it, so that there is no possibility of
731leaking space in the "partial temporary" failure case.
732.sp
733Default value: \fB0\fR.
734.RE
735
736.sp
737.ne 2
738.na
739\fBzfs_free_min_time_ms\fR (int)
740.ad
741.RS 12n
742Minimum milliseconds to free per txg
743.sp
744Default value: \fB1,000\fR.
745.RE
746
747.sp
748.ne 2
749.na
750\fBzfs_immediate_write_sz\fR (long)
751.ad
752.RS 12n
753Largest data block to write to zil
754.sp
755Default value: \fB32,768\fR.
756.RE
757
758.sp
759.ne 2
760.na
761\fBzfs_mdcomp_disable\fR (int)
762.ad
763.RS 12n
764Disable metadata compression
765.sp
766Use \fB1\fR for yes and \fB0\fR for no (default).
767.RE
768
769.sp
770.ne 2
771.na
772\fBzfs_mg_noalloc_threshold\fR (int)
773.ad
774.RS 12n
775Defines a threshold at which metaslab groups should be eligible for
776allocations. The value is expressed as a percentage of free space
777beyond which a metaslab group is always eligible for allocations.
778If a metaslab group's free space is less than or equal to the
779threshold, the allocator will avoid allocating to that group
780unless all groups in the pool have reached the threshold. Once all
781groups have reached the threshold, all groups are allowed to accept
782allocations. The default value of 0 disables the feature and causes
783all metaslab groups to be eligible for allocations.
784
785This parameter allows one to deal with pools having heavily imbalanced
786vdevs such as would be the case when a new vdev has been added.
787Setting the threshold to a non-zero percentage will stop allocations
788from being made to vdevs that aren't filled to the specified percentage
789and allow lesser filled vdevs to acquire more allocations than they
790otherwise would under the old \fBzfs_mg_alloc_failures\fR facility.
791.sp
792Default value: \fB0\fR.
793.RE
794
795.sp
796.ne 2
797.na
798\fBzfs_no_scrub_io\fR (int)
799.ad
800.RS 12n
801Set for no scrub I/O
802.sp
803Use \fB1\fR for yes and \fB0\fR for no (default).
804.RE
805
806.sp
807.ne 2
808.na
809\fBzfs_no_scrub_prefetch\fR (int)
810.ad
811.RS 12n
812Set for no scrub prefetching
813.sp
814Use \fB1\fR for yes and \fB0\fR for no (default).
815.RE
816
817.sp
818.ne 2
819.na
820\fBzfs_nocacheflush\fR (int)
821.ad
822.RS 12n
823Disable cache flushes
824.sp
825Use \fB1\fR for yes and \fB0\fR for no (default).
826.RE
827
828.sp
829.ne 2
830.na
831\fBzfs_nopwrite_enabled\fR (int)
832.ad
833.RS 12n
834Enable NOP writes
835.sp
836Use \fB1\fR for yes (default) and \fB0\fR to disable.
837.RE
838
839.sp
840.ne 2
841.na
842\fBzfs_pd_blks_max\fR (int)
843.ad
844.RS 12n
845Max number of blocks to prefetch
846.sp
847Default value: \fB100\fR.
848.RE
849
850.sp
851.ne 2
852.na
853\fBzfs_prefetch_disable\fR (int)
854.ad
855.RS 12n
856Disable all ZFS prefetching
857.sp
858Use \fB1\fR for yes and \fB0\fR for no (default).
859.RE
860
861.sp
862.ne 2
863.na
864\fBzfs_read_chunk_size\fR (long)
865.ad
866.RS 12n
867Bytes to read per chunk
868.sp
869Default value: \fB1,048,576\fR.
870.RE
871
872.sp
873.ne 2
874.na
875\fBzfs_read_history\fR (int)
876.ad
877.RS 12n
878Historic statistics for the last N reads
879.sp
880Default value: \fB0\fR.
881.RE
882
883.sp
884.ne 2
885.na
886\fBzfs_read_history_hits\fR (int)
887.ad
888.RS 12n
889Include cache hits in read history
890.sp
891Use \fB1\fR for yes and \fB0\fR for no (default).
892.RE
893
894.sp
895.ne 2
896.na
897\fBzfs_recover\fR (int)
898.ad
899.RS 12n
900Set to attempt to recover from fatal errors. This should only be used as a
901last resort, as it typically results in leaked space, or worse.
902.sp
903Use \fB1\fR for yes and \fB0\fR for no (default).
904.RE
905
906.sp
907.ne 2
908.na
909\fBzfs_resilver_delay\fR (int)
910.ad
911.RS 12n
912Number of ticks to delay prior to issuing a resilver I/O operation when
913a non-resilver or non-scrub I/O operation has occurred within the past
914\fBzfs_scan_idle\fR ticks.
915.sp
916Default value: \fB2\fR.
917.RE
918
919.sp
920.ne 2
921.na
922\fBzfs_resilver_min_time_ms\fR (int)
923.ad
924.RS 12n
925Minimum milliseconds to resilver per txg
926.sp
927Default value: \fB3,000\fR.
928.RE
929
930.sp
931.ne 2
932.na
933\fBzfs_scan_idle\fR (int)
934.ad
935.RS 12n
936Idle window in clock ticks. During a scrub or a resilver, if
937a non-scrub or non-resilver I/O operation has occurred during this
938window, the next scrub or resilver operation is delayed by, respectively
939\fBzfs_scrub_delay\fR or \fBzfs_resilver_delay\fR ticks.
940.sp
941Default value: \fB50\fR.
942.RE
943
944.sp
945.ne 2
946.na
947\fBzfs_scan_min_time_ms\fR (int)
948.ad
949.RS 12n
950Minimum milliseconds to scrub per txg
951.sp
952Default value: \fB1,000\fR.
953.RE
954
955.sp
956.ne 2
957.na
958\fBzfs_scrub_delay\fR (int)
959.ad
960.RS 12n
961Number of ticks to delay prior to issuing a scrub I/O operation when
962a non-scrub or non-resilver I/O operation has occurred within the past
963\fBzfs_scan_idle\fR ticks.
964.sp
965Default value: \fB4\fR.
966.RE
967
968.sp
969.ne 2
970.na
971\fBzfs_send_corrupt_data\fR (int)
972.ad
973.RS 12n
974Allow sending corrupt data (ignore read/checksum errors when sending data)
975.sp
976Use \fB1\fR for yes and \fB0\fR for no (default).
977.RE
978
979.sp
980.ne 2
981.na
982\fBzfs_sync_pass_deferred_free\fR (int)
983.ad
984.RS 12n
985Defer frees starting in this pass
986.sp
987Default value: \fB2\fR.
988.RE
989
990.sp
991.ne 2
992.na
993\fBzfs_sync_pass_dont_compress\fR (int)
994.ad
995.RS 12n
996Don't compress starting in this pass
997.sp
998Default value: \fB5\fR.
999.RE
1000
1001.sp
1002.ne 2
1003.na
1004\fBzfs_sync_pass_rewrite\fR (int)
1005.ad
1006.RS 12n
1007Rewrite new block pointers (bps) starting in this pass
1008.sp
1009Default value: \fB2\fR.
1010.RE
1011
1012.sp
1013.ne 2
1014.na
1015\fBzfs_top_maxinflight\fR (int)
1016.ad
1017.RS 12n
1018Max I/Os per top-level vdev during scrub or resilver operations.
1019.sp
1020Default value: \fB32\fR.
1021.RE
1022
1023.sp
1024.ne 2
1025.na
1026\fBzfs_txg_history\fR (int)
1027.ad
1028.RS 12n
1029Historic statistics for the last N txgs
1030.sp
1031Default value: \fB0\fR.
1032.RE
1033
1034.sp
1035.ne 2
1036.na
1037\fBzfs_txg_timeout\fR (int)
1038.ad
1039.RS 12n
1040Max seconds worth of delta per txg
1041.sp
1042Default value: \fB5\fR.
1043.RE
1044
1045.sp
1046.ne 2
1047.na
1048\fBzfs_vdev_aggregation_limit\fR (int)
1049.ad
1050.RS 12n
1051Max vdev I/O aggregation size
1052.sp
1053Default value: \fB131,072\fR.
1054.RE
1055
1056.sp
1057.ne 2
1058.na
1059\fBzfs_vdev_cache_bshift\fR (int)
1060.ad
1061.RS 12n
1062Shift size to inflate reads to
1063.sp
1064Default value: \fB16\fR.
1065.RE
1066
1067.sp
1068.ne 2
1069.na
1070\fBzfs_vdev_cache_max\fR (int)
1071.ad
1072.RS 12n
1073Inflate reads smaller than max
1074.RE
1075
1076.sp
1077.ne 2
1078.na
1079\fBzfs_vdev_cache_size\fR (int)
1080.ad
1081.RS 12n
1082Total size of the per-disk cache
1083.sp
1084Default value: \fB0\fR.
1085.RE
1086
1087.sp
1088.ne 2
1089.na
1090\fBzfs_vdev_mirror_switch_us\fR (int)
1091.ad
1092.RS 12n
1093Switch mirrors every N usecs
1094.sp
1095Default value: \fB10,000\fR.
1096.RE
1097
1098.sp
1099.ne 2
1100.na
1101\fBzfs_vdev_read_gap_limit\fR (int)
1102.ad
1103.RS 12n
1104Aggregate read I/O over gap
1105.sp
1106Default value: \fB32,768\fR.
1107.RE
1108
1109.sp
1110.ne 2
1111.na
1112\fBzfs_vdev_scheduler\fR (charp)
1113.ad
1114.RS 12n
1115I/O scheduler
1116.sp
1117Default value: \fBnoop\fR.
1118.RE
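.sp
For example, to request the deadline elevator for whole-disk vdevs instead of
the default (which schedulers are available depends on the running kernel):
.sp
.nf
# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_scheduler=deadline
.fi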
1119
1120.sp
1121.ne 2
1122.na
1123\fBzfs_vdev_write_gap_limit\fR (int)
1124.ad
1125.RS 12n
1126Aggregate write I/O over gap
1127.sp
1128Default value: \fB4,096\fR.
1129.RE
1130
1131.sp
1132.ne 2
1133.na
1134\fBzfs_zevent_cols\fR (int)
1135.ad
1136.RS 12n
1137Max event column width
1138.sp
1139Default value: \fB80\fR.
1140.RE
1141
1142.sp
1143.ne 2
1144.na
1145\fBzfs_zevent_console\fR (int)
1146.ad
1147.RS 12n
1148Log events to the console
1149.sp
1150Use \fB1\fR for yes and \fB0\fR for no (default).
1151.RE
1152
1153.sp
1154.ne 2
1155.na
1156\fBzfs_zevent_len_max\fR (int)
1157.ad
1158.RS 12n
1159Max event queue length
1160.sp
1161Default value: \fB0\fR.
1162.RE
1163
1164.sp
1165.ne 2
1166.na
1167\fBzil_replay_disable\fR (int)
1168.ad
1169.RS 12n
1170Disable intent logging replay
1171.sp
1172Use \fB1\fR for yes and \fB0\fR for no (default).
1173.RE
1174
1175.sp
1176.ne 2
1177.na
1178\fBzil_slog_limit\fR (ulong)
1179.ad
1180.RS 12n
1181Max commit bytes to separate log device
1182.sp
1183Default value: \fB1,048,576\fR.
1184.RE
1185
1186.sp
1187.ne 2
1188.na
1189\fBzio_bulk_flags\fR (int)
1190.ad
1191.RS 12n
1192Additional flags to pass to bulk buffers
1193.sp
1194Default value: \fB0\fR.
1195.RE
1196
1197.sp
1198.ne 2
1199.na
1200\fBzio_delay_max\fR (int)
1201.ad
1202.RS 12n
1203Max zio millisecond delay before posting event
1204.sp
1205Default value: \fB30,000\fR.
1206.RE
1207
1208.sp
1209.ne 2
1210.na
1211\fBzio_injection_enabled\fR (int)
1212.ad
1213.RS 12n
1214Enable fault injection
1215.sp
1216Use \fB1\fR for yes and \fB0\fR for no (default).
1217.RE
1218
1219.sp
1220.ne 2
1221.na
1222\fBzio_requeue_io_start_cut_in_line\fR (int)
1223.ad
1224.RS 12n
1225Prioritize requeued I/O
1226.sp
1227Default value: \fB0\fR.
1228.RE
1229
1230.sp
1231.ne 2
1232.na
1233\fBzvol_inhibit_dev\fR (uint)
1234.ad
1235.RS 12n
1236Do not create zvol device nodes
1237.sp
1238Use \fB1\fR for yes and \fB0\fR for no (default).
1239.RE
1240
1241.sp
1242.ne 2
1243.na
1244\fBzvol_major\fR (uint)
1245.ad
1246.RS 12n
1247Major number for zvol device
1248.sp
1249Default value: \fB230\fR.
1250.RE
1251
1252.sp
1253.ne 2
1254.na
1255\fBzvol_max_discard_blocks\fR (ulong)
1256.ad
1257.RS 12n
1258Max number of blocks to discard at once
1259.sp
1260Default value: \fB16,384\fR.
1261.RE
1262
1263.sp
1264.ne 2
1265.na
1266\fBzvol_threads\fR (uint)
1267.ad
1268.RS 12n
1269Number of threads for zvol device
1270.sp
1271Default value: \fB32\fR.
1272.RE
1273
1274.SH ZFS I/O SCHEDULER
1275ZFS issues I/O operations to leaf vdevs to satisfy and complete I/Os.
1276The I/O scheduler determines when and in what order those operations are
1277issued. The I/O scheduler divides operations into five I/O classes
1278prioritized in the following order: sync read, sync write, async read,
1279async write, and scrub/resilver. Each queue defines the minimum and
1280maximum number of concurrent operations that may be issued to the
1281device. In addition, the device has an aggregate maximum,
1282\fBzfs_vdev_max_active\fR. Note that the sum of the per-queue minimums
1283must not exceed the aggregate maximum. If the sum of the per-queue
1284maximums exceeds the aggregate maximum, then the number of active I/Os
1285may reach \fBzfs_vdev_max_active\fR, in which case no further I/Os will
1286be issued regardless of whether all per-queue minimums have been met.
1287.sp
1288For many physical devices, throughput increases with the number of
1289concurrent operations, but latency typically suffers. Further, physical
1290devices typically have a limit at which more concurrent operations have no
1291effect on throughput or can actually cause it to decrease.
1292.sp
1293The scheduler selects the next operation to issue by first looking for an
1294I/O class whose minimum has not been satisfied. Once all are satisfied and
1295the aggregate maximum has not been hit, the scheduler looks for classes
1296whose maximum has not been satisfied. Iteration through the I/O classes is
1297done in the order specified above. No further operations are issued if the
1298aggregate maximum number of concurrent operations has been hit or if there
1299are no operations queued for an I/O class that has not hit its maximum.
1300Every time an I/O is queued or an operation completes, the I/O scheduler
1301looks for new operations to issue.
1302.sp
1303In general, smaller max_active's will lead to lower latency of synchronous
1304operations. Larger max_active's may lead to higher overall throughput,
1305depending on underlying storage.
1306.sp
1307The ratio of the queues' max_actives determines the balance of performance
1308between reads, writes, and scrubs. E.g., increasing
1309\fBzfs_vdev_scrub_max_active\fR will cause the scrub or resilver to complete
1310more quickly, but reads and writes to have higher latency and lower throughput.
1311.sp
1312All I/O classes have a fixed maximum number of outstanding operations
1313except for the async write class. Asynchronous writes represent the data
1314that is committed to stable storage during the syncing stage for
1315transaction groups. Transaction groups enter the syncing state
1316periodically so the number of queued async writes will quickly burst up
1317and then bleed down to zero. Rather than servicing them as quickly as
1318possible, the I/O scheduler changes the maximum number of active async
1319write I/Os according to the amount of dirty data in the pool. Since
1320both throughput and latency typically increase with the number of
1321concurrent operations issued to physical devices, reducing the
1322burstiness in the number of concurrent operations also stabilizes the
1323response time of operations from other -- and in particular synchronous
1324-- queues. In broad strokes, the I/O scheduler will issue more
1325concurrent operations from the async write queue as there's more dirty
1326data in the pool.
1327.sp
1328Async Writes
1329.sp
1330The number of concurrent operations issued for the async write I/O class
1331follows a piece-wise linear function defined by a few adjustable points.
1332.nf
1333
1334 | o---------| <-- zfs_vdev_async_write_max_active
1335 ^ | /^ |
1336 | | / | |
1337active | / | |
1338 I/O | / | |
1339count | / | |
1340 | / | |
1341 |-------o | | <-- zfs_vdev_async_write_min_active
1342 0|_______^______|_________|
1343 0% | | 100% of zfs_dirty_data_max
1344 | |
1345 | `-- zfs_vdev_async_write_active_max_dirty_percent
1346 `--------- zfs_vdev_async_write_active_min_dirty_percent
1347
1348.fi
1349Until the amount of dirty data exceeds a minimum percentage of the dirty
1350data allowed in the pool, the I/O scheduler will limit the number of
1351concurrent operations to the minimum. As that threshold is crossed, the
1352number of concurrent operations issued increases linearly to the maximum at
1353the specified maximum percentage of the dirty data allowed in the pool.
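.sp
The following sketch reproduces the shape of this scaling using the default
values of the four tunables above (45% dirty is an assumed input; the module
uses its own integer arithmetic internally):
.sp
.nf
dirty=45   # dirty data as a percentage of zfs_dirty_data_max
awk -v d="$dirty" -v lo=30 -v hi=60 -v min=1 -v max=10 'BEGIN {
    n = min + (d - lo) * (max - min) / (hi - lo)
    if (n < min) n = min
    if (n > max) n = max
    print "about " int(n) " active async write I/Os"
}'
.fi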
1354.sp
1355Ideally, the amount of dirty data on a busy pool will stay in the sloped
1356part of the function between \fBzfs_vdev_async_write_active_min_dirty_percent\fR
1357and \fBzfs_vdev_async_write_active_max_dirty_percent\fR. If it exceeds the
1358maximum percentage, this indicates that the rate of incoming data is
1359greater than the rate that the backend storage can handle. In this case, we
1360must further throttle incoming writes, as described in the next section.
1361
1362.SH ZFS TRANSACTION DELAY
1363We delay transactions when we've determined that the backend storage
1364isn't able to accommodate the rate of incoming writes.
1365.sp
1366If there is already a transaction waiting, we delay relative to when
1367that transaction will finish waiting. This way the calculated delay time
1368is independent of the number of threads concurrently executing
1369transactions.
1370.sp
1371If we are the only waiter, wait relative to when the transaction
1372started, rather than the current time. This credits the transaction for
1373"time already served", e.g. reading indirect blocks.
1374.sp
1375The minimum time for a transaction to take is calculated as:
1376.nf
1377 min_time = zfs_delay_scale * (dirty - min) / (max - dirty)
1378 min_time is then capped at 100 milliseconds.
1379.fi
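.sp
As a worked example with assumed values: with \fBzfs_dirty_data_max\fR = 4 GiB
and \fBzfs_delay_min_dirty_percent\fR = 60, delays begin once 2.4 GiB is dirty.
With 3.2 GiB of dirty data and the default \fBzfs_delay_scale\fR of 500,000:
.sp
.nf
    min_time = 500,000 * (3.2 GiB - 2.4 GiB) / (4 GiB - 3.2 GiB) = 500,000
.fi
.sp
i.e. about 500us, the 2000 IOPS midpoint shown in the graphs below.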
1380.sp
1381The delay has two degrees of freedom that can be adjusted via tunables. The
1382percentage of dirty data at which we start to delay is defined by
1383\fBzfs_delay_min_dirty_percent\fR. This should typically be at or above
1384\fBzfs_vdev_async_write_active_max_dirty_percent\fR so that we only start to
1385delay after writing at full speed has failed to keep up with the incoming write
1386rate. The scale of the curve is defined by \fBzfs_delay_scale\fR. Roughly speaking,
1387this variable determines the amount of delay at the midpoint of the curve.
1388.sp
1389.nf
1390delay
1391 10ms +-------------------------------------------------------------*+
1392 | *|
1393 9ms + *+
1394 | *|
1395 8ms + *+
1396 | * |
1397 7ms + * +
1398 | * |
1399 6ms + * +
1400 | * |
1401 5ms + * +
1402 | * |
1403 4ms + * +
1404 | * |
1405 3ms + * +
1406 | * |
1407 2ms + (midpoint) * +
1408 | | ** |
1409 1ms + v *** +
1410 | zfs_delay_scale ----------> ******** |
1411 0 +-------------------------------------*********----------------+
1412 0% <- zfs_dirty_data_max -> 100%
1413.fi
1414.sp
1415Note that since the delay is added to the outstanding time remaining on the
1416most recent transaction, the delay is effectively the inverse of IOPS.
1417Here the midpoint of 500us translates to 2000 IOPS. The shape of the curve
1418was chosen such that small changes in the amount of accumulated dirty data
1419in the first 3/4 of the curve yield relatively small differences in the
1420amount of delay.
1421.sp
1422The effects can be easier to understand when the amount of delay is
1423represented on a log scale:
1424.sp
1425.nf
1426delay
1427100ms +-------------------------------------------------------------++
1428 + +
1429 | |
1430 + *+
1431 10ms + *+
1432 + ** +
1433 | (midpoint) ** |
1434 + | ** +
1435 1ms + v **** +
1436 + zfs_delay_scale ----------> ***** +
1437 | **** |
1438 + **** +
1439100us + ** +
1440 + * +
1441 | * |
1442 + * +
1443 10us + * +
1444 + +
1445 | |
1446 + +
1447 +--------------------------------------------------------------+
1448 0% <- zfs_dirty_data_max -> 100%
1449.fi
1450.sp
1451Note here that only as the amount of dirty data approaches its limit does
1452the delay start to increase rapidly. The goal of a properly tuned system
1453should be to keep the amount of dirty data out of that range by first
1454ensuring that the appropriate limits are set for the I/O scheduler to reach
1455optimal throughput on the backend storage, and then by changing the value
1456of \fBzfs_delay_scale\fR to increase the steepness of the curve.