.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd June 2, 2021
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
.Qq write hole
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, which
allow for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
With regard to I/O, performance is similar to raidz since for any read all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
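.Pp
The capacity and IOPS approximations given for mirror, raidz, and dRAID vdevs
above can be worked through numerically.
The following Python sketch only restates those formulas; the function names
are hypothetical, sizes are in arbitrary units, and the minimum allocation
size is assumed to be D times the sector size, as inferred from the
D=8, 4 KiB, 32 KiB example.
Real pools lose additional space to padding and metadata.

```python
import math


def mirror_capacity(n, x):
    """Return (usable size, tolerated failures) for a mirror of n disks of size x."""
    return x, n - 1


def raidz_capacity(n, x, p):
    """Return (approximate usable size, tolerated failures) for a raidz group."""
    # The text states the minimum device count is one more than the parity count.
    if n < p + 1:
        raise ValueError("a raidz group needs at least p + 1 devices")
    return (n - p) * x, p


def draid_capacity(n, x, d, p, s):
    """Return (approximate usable size, tolerated failures) for a dRAID vdev."""
    return (n - s) * d * x / (d + p), p


def draid_random_iops(n, d, p, s, single_drive_iops):
    """Approximate delivered random IOPS: floor((N-S)/(D+P)) * single_drive_IOPS."""
    return math.floor((n - s) / (d + p)) * single_drive_iops


def draid_min_allocation(d, sector_size):
    """Assumed minimum allocation size: one sector on each of the D data disks."""
    return d * sector_size
```

For instance, draid_min_allocation(8, 4096) gives 32768 bytes, matching the
32 KiB figure quoted for the default D=8 with 4 KiB sectors.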
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
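.Pp
How keywords delimit groups can be illustrated with a small Python sketch.
It is not the parser used by zpool; the function name and keyword set are
assumptions made for illustration, and it deliberately ignores dRAID, spare,
log, cache, dedup, and special vdevs.
Devices listed before any keyword are treated as individual disk vdevs.

```python
# Keywords that start a new group in this simplified model (assumption:
# only the mirror and raidz family is handled).
GROUP_KEYWORDS = {"mirror", "raidz", "raidz1", "raidz2", "raidz3"}


def group_vdevs(tokens):
    """Split whitespace-separated vdev arguments into (type, devices) groups."""
    groups = []
    current = None
    for tok in tokens:
        if tok in GROUP_KEYWORDS:
            # A keyword ends the previous group and begins a new one.
            current = (tok, [])
            groups.append(current)
        elif current is None:
            # No keyword seen yet: each device is its own top-level vdev.
            groups.append(("disk", [tok]))
        else:
            current[1].append(tok)
    return groups
```

Applied to the arguments of the example above, the sketch yields two mirror
groups of two disks each.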
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data are checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs,
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
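.Pp
The relationship between failed components, redundancy, and the resulting
health state can be sketched in Python as follows.
This is a deliberately simplified model and not how ZFS computes state:
it counts only failed components, ignores error thresholds and the
OFFLINE, REMOVED, and UNAVAIL distinctions, and assumes that a single
faulted top-level vdev faults the whole pool.
Both function names are hypothetical.

```python
def vdev_state(failed, tolerated):
    """State of a top-level vdev with `failed` bad components.

    `tolerated` is the number of failures the vdev can withstand,
    for example N-1 for a mirror or P for a raidz group.
    """
    if failed == 0:
        return "ONLINE"
    if failed <= tolerated:
        # Sufficient replicas exist to continue functioning.
        return "DEGRADED"
    # Insufficient replicas exist to continue functioning.
    return "FAULTED"


def pool_health(vdev_states):
    """Summarize top-level vdev states as online, degraded, or faulted."""
    if "FAULTED" in vdev_states:
        return "faulted"
    if "DEGRADED" in vdev_states:
        return "degraded"
    return "online"
```

Under this model a raidz2 group with two failed disks is DEGRADED, and a
third failure makes it FAULTED.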
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to online the device automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
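.Pp
The distributed spare naming scheme described above can be decoded with a
short Python sketch.
The function name and the returned dictionary keys are hypothetical, and the
pattern assumes names consist of exactly the parity level, vdev number, and
spare number separated by hyphens, as in the draid1-2-3 example.

```python
import re

# draid<parity>-<vdev>-<spare>, with parity limited to 1-3.
_SPARE_NAME = re.compile(r"^draid([1-3])-(\d+)-(\d+)$")


def parse_distributed_spare(name):
    """Split a distributed spare name into parity, vdev, and spare numbers."""
    match = _SPARE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a distributed spare name: {name!r}")
    parity, vdev, spare = (int(part) for part in match.groups())
    return {"parity": parity, "vdev": vdev, "spare": spare}
```

Parsing draid1-2-3 this way gives parity 1, vdev 2, and spare 3.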
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for
random-read workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
we do not write the metadata structures
required for rebuilding the L2ARC in order not to waste space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add
its label and header will be overwritten and its contents are not going to be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online
its contents will be restored in L2ARC.
This is useful in case of memory pressure
where the contents of the cache device are not fully restored in L2ARC.
The user can off- and online the cache device when there is less memory pressure
in order to fully restore its contents to L2ARC.
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint.
Specifically, these are vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
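.Pp
The placement rules in this section can be summarized with a small Python
sketch.
It is an illustrative model only: the function name, the block kind labels,
and the parameters are hypothetical, a separate dedup vdev is not modeled,
and the comparison used for small file blocks (size less than or equal to
the property value) is an assumption that should be checked against
.Xr zfsprops 7 .

```python
def allocation_class(kind, size=0, special_small_blocks=0,
                     ddt_is_special=True, special_full=False):
    """Pick "special" or "normal" for a block under the simplified rules.

    kind: one of "metadata", "indirect", "ddt", or "file" (labels are
    assumptions made for this sketch).
    """
    if special_full:
        # A full special class spills back into the normal class.
        return "normal"
    if kind in ("metadata", "indirect"):
        return "special"
    if kind == "ddt":
        # Mirrors unsetting zfs_ddt_data_is_special.
        return "special" if ddt_is_special else "normal"
    if kind == "file" and special_small_blocks > 0:
        # Opt-in per dataset; the <= threshold is an assumption.
        if size <= special_small_blocks:
            return "special"
    return "normal"
```

With special_small_blocks left at zero, file blocks of any size stay in the
normal class under this model.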