.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or http://www.opensolaris.org/os/licensing.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd June 2, 2021
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A variation on RAID-5 that allows for better distribution of parity and
eliminates the RAID-5
.Qq write hole
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data.
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
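.Pp
For example, a
.Sy raidz2
group of six disks of size
.Em 4TB
can hold approximately
.Em (6-2)*4TB No = Em 16TB
and can withstand any two of its disks failing without losing data.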
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices.
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4kB No disk sectors the minimum allocation size is Em 32kB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
In regards to I/O, performance is similar to raidz since for any read all
.Em D No data disks must be accessed.
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
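.Pp
For example, a dRAID vdev with
.Em N=12 , S=1 , D=8 , No and Em P=2
delivers approximately
.Em floor((12-1)/(8+2)) No = Em 1
times the random IOPS of a single drive.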
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group, Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
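.Pp
For example, a
.Sy draid2
.Pq Em P=2
with
.Em N=12
disks of size
.Em 4TB , D=8 , No and Em S=2
can hold approximately
.Em (12-2)*(8/10)*4TB No = Em 32TB
and can withstand any two devices failing without losing data.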
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
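.Pp
For example, the following creates a dRAID vdev with double parity, 8 data
disks per redundancy group, 12 children, and 2 distributed hot spares:
.Dl # Nm zpool Cm create Ar tank Sy draid2:8d:12c:2s Ar sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl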
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device dedicated solely for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
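.Pp
For example, the following creates a pool with a mirrored
.Sy special
vdev:
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy special mirror Ar sdc sdd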
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs,
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path, since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to online the device automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
.Ss Hot Spares
ZFS allows devices to be associated with pools as
.Qq hot spares .
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration that will remain there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read-workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots, and is restored
asynchronously in L2ARC when the pool is imported (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1GB ,
we do not write the metadata structures
required for rebuilding the L2ARC, in order not to waste space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add
its label and header will be overwritten and its contents are not going to be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online
its contents will be restored in L2ARC.
This is useful in case of memory pressure
where the contents of the cache device are not fully restored in L2ARC.
The user can off- and online the cache device when there is less memory pressure
in order to fully restore its contents to L2ARC.
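.Pp
For example, a cache device can be taken offline and later brought back online
with:
.Dl # Nm zpool Cm offline Ar pool sdc
.Dl # Nm zpool Cm online Ar pool sdc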
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint:
specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
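.Pp
On Linux, for example, the parameter can be unset at runtime by writing to
.Pa /sys/module/zfs/parameters :
.Dl # echo 0 > /sys/module/zfs/parameters/zfs_ddt_data_is_special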
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.
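.Pp
For example, to store file blocks of 32kB or smaller in the special class for
a given dataset:
.Dl # Nm zfs Cm set Sy special_small_blocks Ns = Ns Ar 32K pool/dataset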