.\"
.\" CDDL HEADER START
.\"
.\" The contents of this file are subject to the terms of the
.\" Common Development and Distribution License (the "License").
.\" You may not use this file except in compliance with the License.
.\"
.\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
.\" or https://opensource.org/licenses/CDDL-1.0.
.\" See the License for the specific language governing permissions
.\" and limitations under the License.
.\"
.\" When distributing Covered Code, include this CDDL HEADER in each
.\" file and include the License file at usr/src/OPENSOLARIS.LICENSE.
.\" If applicable, add the following below this CDDL HEADER, with the
.\" fields enclosed by brackets "[]" replaced with your own identifying
.\" information: Portions Copyright [yyyy] [name of copyright owner]
.\"
.\" CDDL HEADER END
.\"
.\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved.
.\" Copyright (c) 2012, 2018 by Delphix. All rights reserved.
.\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved.
.\" Copyright (c) 2017 Datto Inc.
.\" Copyright (c) 2018 George Melikov. All Rights Reserved.
.\" Copyright 2017 Nexenta Systems, Inc.
.\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
.\"
.Dd April 7, 2023
.Dt ZPOOLCONCEPTS 7
.Os
.
.Sh NAME
.Nm zpoolconcepts
.Nd overview of ZFS storage pools
.
.Sh DESCRIPTION
.Ss Virtual Devices (vdevs)
A "virtual device" describes a single device or a collection of devices,
organized according to certain performance and fault characteristics.
The following virtual devices are supported:
.Bl -tag -width "special"
.It Sy disk
A block device, typically located under
.Pa /dev .
ZFS can use individual slices or partitions, though the recommended mode of
operation is to use whole disks.
A disk can be specified by a full path, or it can be a shorthand name
.Po the relative portion of the path under
.Pa /dev
.Pc .
A whole disk can be specified by omitting the slice or partition designation.
For example,
.Pa sda
is equivalent to
.Pa /dev/sda .
When given a whole disk, ZFS automatically labels the disk, if necessary.
.It Sy file
A regular file.
The use of files as a backing store is strongly discouraged.
It is designed primarily for experimental purposes, as the fault tolerance of a
file is only as good as the file system on which it resides.
A file must be specified by a full path.
.It Sy mirror
A mirror of two or more devices.
Data is replicated in an identical fashion across all components of a mirror.
A mirror with
.Em N No disks of size Em X No can hold Em X No bytes and can withstand Em N-1
devices failing, without losing data.
.It Sy raidz , raidz1 , raidz2 , raidz3
A distributed-parity layout, similar to RAID-5/6, with improved distribution of
parity, and which does not suffer from the RAID-5/6
.Qq write hole ,
.Pq in which data and parity become inconsistent after a power loss .
Data and parity are striped across all disks within a raidz group, though not
necessarily in a consistent stripe width.
.Pp
A raidz group can have single, double, or triple parity, meaning that the
raidz group can sustain one, two, or three failures, respectively, without
losing any data.
The
.Sy raidz1
vdev type specifies a single-parity raidz group; the
.Sy raidz2
vdev type specifies a double-parity raidz group; and the
.Sy raidz3
vdev type specifies a triple-parity raidz group.
The
.Sy raidz
vdev type is an alias for
.Sy raidz1 .
.Pp
A raidz group with
.Em N No disks of size Em X No with Em P No parity disks can hold approximately
.Em (N-P)*X No bytes and can withstand Em P No devices failing without losing data .
The minimum number of devices in a raidz group is one more than the number of
parity disks.
The recommended number is between 3 and 9 to help increase performance.
.It Sy draid , draid1 , draid2 , draid3
A variant of raidz that provides integrated distributed hot spares, allowing
for faster resilvering, while retaining the benefits of raidz.
A dRAID vdev is constructed from multiple internal raidz groups, each with
.Em D No data devices and Em P No parity devices .
These groups are distributed over all of the children in order to fully
utilize the available disk performance.
.Pp
Unlike raidz, dRAID uses a fixed stripe width (padding as necessary with
zeros) to allow fully sequential resilvering.
This fixed stripe width significantly affects both usable capacity and IOPS.
For example, with the default
.Em D=8 No and Em 4 KiB No disk sectors the minimum allocation size is Em 32 KiB .
If using compression, this relatively large allocation size can reduce the
effective compression ratio.
When using ZFS volumes (zvols) and dRAID, the default of the
.Sy volblocksize
property is increased to account for the allocation size.
If a dRAID pool will hold a significant amount of small blocks, it is
recommended to also add a mirrored
.Sy special
vdev to store those blocks.
.Pp
In terms of I/O, performance is similar to raidz since, for any read, all
.Em D No data disks must be accessed .
Delivered random IOPS can be reasonably approximated as
.Sy floor((N-S)/(D+P))*single_drive_IOPS .
.Pp
Like raidz, a dRAID can have single-, double-, or triple-parity.
The
.Sy draid1 ,
.Sy draid2 ,
and
.Sy draid3
types can be used to specify the parity level.
The
.Sy draid
vdev type is an alias for
.Sy draid1 .
.Pp
A dRAID with
.Em N No disks of size Em X , D No data disks per redundancy group , Em P
.No parity level, and Em S No distributed hot spares can hold approximately
.Em (N-S)*(D/(D+P))*X No bytes and can withstand Em P
devices failing without losing data.
.It Sy draid Ns Oo Ar parity Oc Ns Oo Sy \&: Ns Ar data Ns Sy d Oc Ns Oo Sy \&: Ns Ar children Ns Sy c Oc Ns Oo Sy \&: Ns Ar spares Ns Sy s Oc
A non-default dRAID configuration can be specified by appending one or more
of the following optional arguments to the
.Sy draid
keyword:
.Bl -tag -compact -width "children"
.It Ar parity
The parity level (1-3).
.It Ar data
The number of data devices per redundancy group.
In general, a smaller value of
.Em D No will increase IOPS, improve the compression ratio ,
and speed up resilvering at the expense of total usable capacity.
Defaults to
.Em 8 , No unless Em N-P-S No is less than Em 8 .
.It Ar children
The expected number of children.
Useful as a cross-check when listing a large number of devices.
An error is returned when the provided number of children differs.
.It Ar spares
The number of distributed hot spares.
Defaults to zero.
.El
.It Sy spare
A pseudo-vdev which keeps track of available hot spares for a pool.
For more information, see the
.Sx Hot Spares
section.
.It Sy log
A separate intent log device.
If more than one log device is specified, then writes are load-balanced between
devices.
Log devices can be mirrored.
However, raidz vdev types are not supported for the intent log.
For more information, see the
.Sx Intent Log
section.
.It Sy dedup
A device solely dedicated for deduplication tables.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one dedup device is specified, then
allocations are load-balanced between those devices.
.It Sy special
A device dedicated solely for allocating various kinds of internal metadata,
and optionally small file blocks.
The redundancy of this device should match the redundancy of the other normal
devices in the pool.
If more than one special device is specified, then
allocations are load-balanced between those devices.
.Pp
For more information on special allocations, see the
.Sx Special Allocation Class
section.
.It Sy cache
A device used to cache storage pool data.
A cache device cannot be configured as a mirror or raidz group.
For more information, see the
.Sx Cache Devices
section.
.El
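The capacity and IOPS rules of thumb above can be sketched as a small calculator (an illustrative aid only, not part of any ZFS tool; the function names and the 250-IOPS example drive are assumptions):

```python
import math

def raidz_capacity(n, p, x):
    """Approximate usable bytes of a raidz group:
    N disks of size X with P parity disks -> (N-P)*X."""
    return (n - p) * x

def draid_capacity(n, s, d, p, x):
    """Approximate usable bytes of a dRAID vdev: N disks of size X,
    D data disks per group, P parity, S distributed spares
    -> (N-S)*(D/(D+P))*X (integer arithmetic to avoid float error)."""
    return (n - s) * d * x // (d + p)

def draid_random_iops(n, s, d, p, single_drive_iops):
    """Delivered random IOPS of a dRAID vdev:
    floor((N-S)/(D+P)) * single_drive_IOPS."""
    return math.floor((n - s) / (d + p)) * single_drive_iops

# Twelve 4 TB disks as raidz2: (12-2)*4 TB usable.
print(raidz_capacity(12, 2, 4_000_000_000_000))   # 40000000000000
# The same 12 disks as draid2:8d:12c:1s (D=8, P=2, S=1).
print(draid_capacity(12, 1, 8, 2, 4_000_000_000_000))  # 35200000000000
# Random-read IOPS with 250-IOPS drives: floor(11/10)*250.
print(draid_random_iops(12, 1, 8, 2, 250))        # 250
```

Note how the dRAID layout trades some usable capacity (and parallel IOPS) for the built-in spare and faster resilvering.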
.Pp
Virtual devices cannot be nested, so a mirror or raidz virtual device can only
contain files or disks.
Mirrors of mirrors
.Pq or other combinations
are not allowed.
.Pp
A pool can have any number of virtual devices at the top of the configuration
.Po known as
.Qq root vdevs
.Pc .
Data is dynamically distributed across all top-level devices to balance data
among devices.
As new virtual devices are added, ZFS automatically places data on the newly
available devices.
.Pp
Virtual devices are specified one at a time on the command line,
separated by whitespace.
Keywords like
.Sy mirror No and Sy raidz
are used to distinguish where a group ends and another begins.
For example, the following creates a pool with two root vdevs,
each a mirror of two disks:
.Dl # Nm zpool Cm create Ar mypool Sy mirror Ar sda sdb Sy mirror Ar sdc sdd
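As an illustration of the draid[parity][:datad][:childrenc][:sparess] syntax described above, the optional arguments can be sketched as a small parser (hypothetical helper code, not part of zpool; it accepts only the canonical argument order shown in this page):

```python
import re

# Pattern for the canonical form, e.g. "draid2:4d:12c:1s".
DRAID_RE = re.compile(
    r"^draid(?P<parity>[123])?"
    r"(?::(?P<data>\d+)d)?"
    r"(?::(?P<children>\d+)c)?"
    r"(?::(?P<spares>\d+)s)?$"
)

def parse_draid(spec):
    m = DRAID_RE.match(spec)
    if m is None:
        raise ValueError(f"not a draid specifier: {spec!r}")
    parity = int(m.group("parity") or 1)   # draid is an alias for draid1
    spares = int(m.group("spares") or 0)   # spares default to zero
    children = m.group("children")
    children = int(children) if children else None
    data = m.group("data")
    if data is not None:
        data = int(data)
    elif children is not None:
        # D defaults to 8, unless N-P-S is less than 8.
        data = min(8, children - parity - spares)
    else:
        data = 8
    return {"parity": parity, "data": data,
            "children": children, "spares": spares}

print(parse_draid("draid2:4d:12c:1s"))
# {'parity': 2, 'data': 4, 'children': 12, 'spares': 1}
print(parse_draid("draid"))
# {'parity': 1, 'data': 8, 'children': None, 'spares': 0}
```

The defaults mirror the rules stated above: parity 1, eight data disks per group (capped by N-P-S when the child count is known), and no distributed spares.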
.
.Ss Device Failure and Recovery
ZFS supports a rich set of mechanisms for handling device failure and data
corruption.
All metadata and data is checksummed, and ZFS automatically repairs bad data
from a good copy, when corruption is detected.
.Pp
In order to take advantage of these features, a pool must make use of some form
of redundancy, using either mirrored or raidz groups.
While ZFS supports running in a non-redundant configuration, where each root
vdev is simply a disk or file, this is strongly discouraged.
A single case of bit corruption can render some or all of your data unavailable.
.Pp
A pool's health status is described by one of three states:
.Sy online , degraded , No or Sy faulted .
An online pool has all devices operating normally.
A degraded pool is one in which one or more devices have failed, but the data is
still available due to a redundant configuration.
A faulted pool has corrupted metadata, or one or more faulted devices, and
insufficient replicas to continue functioning.
.Pp
The health of the top-level vdev, such as a mirror or raidz device,
is potentially impacted by the state of its associated vdevs
or component devices.
A top-level vdev or component device is in one of the following states:
.Bl -tag -width "DEGRADED"
.It Sy DEGRADED
One or more top-level vdevs is in the degraded state because one or more
component devices are offline.
Sufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the degraded or faulted state, but
sufficient replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The number of checksum errors exceeds acceptable levels and the device is
degraded as an indication that something may be wrong.
ZFS continues to use the device as necessary.
.It
The number of I/O errors exceeds acceptable levels.
The device could not be marked as faulted because there are insufficient
replicas to continue functioning.
.El
.It Sy FAULTED
One or more top-level vdevs is in the faulted state because one or more
component devices are offline.
Insufficient replicas exist to continue functioning.
.Pp
One or more component devices is in the faulted state, and insufficient
replicas exist to continue functioning.
The underlying conditions are as follows:
.Bl -bullet -compact
.It
The device could be opened, but the contents did not match expected values.
.It
The number of I/O errors exceeds acceptable levels and the device is faulted to
prevent further use of the device.
.El
.It Sy OFFLINE
The device was explicitly taken offline by the
.Nm zpool Cm offline
command.
.It Sy ONLINE
The device is online and functioning.
.It Sy REMOVED
The device was physically removed while the system was running.
Device removal detection is hardware-dependent and may not be supported on all
platforms.
.It Sy UNAVAIL
The device could not be opened.
If a pool is imported when a device was unavailable, then the device will be
identified by a unique identifier instead of its path since the path was never
correct in the first place.
.El
.Pp
Checksum errors represent events where a disk returned data that was expected
to be correct, but was not.
In other words, these are instances of silent data corruption.
The checksum errors are reported in
.Nm zpool Cm status
and
.Nm zpool Cm events .
When a block is stored redundantly, a damaged block may be reconstructed
(e.g. from raidz parity or a mirrored copy).
In this case, ZFS reports the checksum error against the disks that contained
damaged data.
If a block is unable to be reconstructed (e.g. due to 3 disks being damaged
in a raidz2 group), it is not possible to determine which disks were silently
corrupted.
In this case, checksum errors are reported for all disks on which the block
is stored.
.Pp
If a device is removed and later re-attached to the system,
ZFS attempts to bring the device online automatically.
Device attachment detection is hardware-dependent
and might not be supported on all platforms.
.
These devices are not actively used in the pool, but when an active device
fails, it is automatically replaced by a hot spare.
To create a pool with hot spares, specify a
.Sy spare
vdev with any number of devices.
For example,
.Dl # Nm zpool Cm create Ar pool Sy mirror Ar sda sdb Sy spare Ar sdc sdd
.Pp
Spares can be shared across multiple pools, and can be added with the
.Nm zpool Cm add
command and removed with the
.Nm zpool Cm remove
command.
Once a spare replacement is initiated, a new
.Sy spare
vdev is created within the configuration, and it remains there until the
original device is replaced.
At this point, the hot spare becomes available again if another device fails.
.Pp
If a pool has a shared spare that is currently being used, the pool cannot be
exported, since other pools may use this shared spare, which may lead to
potential data corruption.
.Pp
Shared spares add some risk.
If the pools are imported on different hosts,
and both pools suffer a device failure at the same time,
both could attempt to use the spare at the same time.
This may not be detected, resulting in data corruption.
.Pp
An in-progress spare replacement can be cancelled by detaching the hot spare.
If the original faulted device is detached, then the hot spare assumes its
place in the configuration, and is removed from the spare list of all active
pools.
.Pp
The
.Sy draid
vdev type provides distributed hot spares.
These hot spares are named after the dRAID vdev they're a part of
.Po Sy draid1 Ns - Ns Ar 2 Ns - Ns Ar 3 No specifies spare Ar 3 No of vdev Ar 2 ,
.No which is a single parity dRAID Pc
and may only be used by that dRAID vdev.
Otherwise, they behave the same as normal hot spares.
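The distributed spare naming convention just described can be sketched as a tiny decoder (hypothetical helper code, not part of any ZFS tool):

```python
import re

# Decode a dRAID distributed spare name such as "draid1-2-3",
# which means: parity 1, spare 3 of dRAID vdev 2.
SPARE_RE = re.compile(r"^draid(?P<parity>[123])-(?P<vdev>\d+)-(?P<spare>\d+)$")

def parse_draid_spare(name):
    m = SPARE_RE.match(name)
    if m is None:
        raise ValueError(f"not a dRAID distributed spare name: {name!r}")
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_draid_spare("draid1-2-3"))
# {'parity': 1, 'vdev': 2, 'spare': 3}
```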
.Pp
Spares cannot replace log devices.
.
.Ss Intent Log
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous
transactions.
For instance, databases often require their transactions to be on stable storage
devices when returning from a system call.
NFS and other applications can also use
.Xr fsync 2
to ensure data stability.
By default, the intent log is allocated from blocks within the main pool.
However, it might be possible to get better performance using separate intent
log devices such as NVRAM or a dedicated disk.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy log Ar sdc
.Pp
Multiple log devices can also be specified, and they can be mirrored.
See the
.Sx EXAMPLES
section for an example of mirroring multiple log devices.
.Pp
Log devices can be added, replaced, attached, detached, and removed.
In addition, log devices are imported and exported as part of the pool
that contains them.
Mirrored devices can be removed by specifying the top-level mirror vdev.
.
.Ss Cache Devices
Devices can be added to a storage pool as
.Qq cache devices .
These devices provide an additional layer of caching between main memory and
disk.
For read-heavy workloads, where the working set size is much larger than what
can be cached in main memory, using cache devices allows much more of this
working set to be served from low latency media.
Using cache devices provides the greatest performance improvement for random
read-workloads of mostly static content.
.Pp
To create a pool with cache devices, specify a
.Sy cache
vdev with any number of devices.
For example:
.Dl # Nm zpool Cm create Ar pool sda sdb Sy cache Ar sdc sdd
.Pp
Cache devices cannot be mirrored or part of a raidz configuration.
If a read error is encountered on a cache device, that read I/O is reissued to
the original storage pool device, which might be part of a mirrored or raidz
configuration.
.Pp
The content of the cache devices is persistent across reboots and restored
asynchronously when importing the pool in L2ARC (persistent L2ARC).
This can be disabled by setting
.Sy l2arc_rebuild_enabled Ns = Ns Sy 0 .
For cache devices smaller than
.Em 1 GiB ,
ZFS does not write the metadata structures
required for rebuilding the L2ARC, to conserve space.
This can be changed with
.Sy l2arc_rebuild_blocks_min_l2size .
The cache device header
.Pq Em 512 B
is updated even if no metadata structures are written.
Setting
.Sy l2arc_headroom Ns = Ns Sy 0
will result in scanning the full-length ARC lists for cacheable content to be
written in L2ARC (persistent ARC).
If a cache device is added with
.Nm zpool Cm add ,
its label and header will be overwritten and its contents will not be
restored in L2ARC, even if the device was previously part of the pool.
If a cache device is onlined with
.Nm zpool Cm online ,
its contents will be restored in L2ARC.
This is useful under memory pressure, when the contents of the cache device
are not fully restored in L2ARC.
The user can offline and then online the cache device when there is less
memory pressure, to fully restore its contents to L2ARC.
.
.Ss Pool checkpoint
Before starting critical procedures that include destructive actions
.Pq like Nm zfs Cm destroy ,
an administrator can checkpoint the pool's state and, in the case of a
mistake or failure, rewind the entire pool back to the checkpoint.
Otherwise, the checkpoint can be discarded when the procedure has completed
successfully.
.Pp
A pool checkpoint can be thought of as a pool-wide snapshot and should be used
with care as it contains every part of the pool's state, from properties to vdev
configuration.
Thus, certain operations are not allowed while a pool has a checkpoint:
specifically, vdev removal/attach/detach, mirror splitting, and
changing the pool's GUID.
Adding a new vdev is supported, but in the case of a rewind it will have to be
added again.
Finally, users of this feature should keep in mind that scrubs in a pool that
has a checkpoint do not repair checkpointed data.
.Pp
To create a checkpoint for a pool:
.Dl # Nm zpool Cm checkpoint Ar pool
.Pp
To later rewind to its checkpointed state, you need to first export it and
then rewind it during import:
.Dl # Nm zpool Cm export Ar pool
.Dl # Nm zpool Cm import Fl -rewind-to-checkpoint Ar pool
.Pp
To discard the checkpoint from a pool:
.Dl # Nm zpool Cm checkpoint Fl d Ar pool
.Pp
Dataset reservations (controlled by the
.Sy reservation No and Sy refreservation
properties) may be unenforceable while a checkpoint exists, because the
checkpoint is allowed to consume the dataset's reservation.
Finally, data that is part of the checkpoint but has been freed in the
current state of the pool won't be scanned during a scrub.
.
.Ss Special Allocation Class
Allocations in the special class are dedicated to specific block types.
By default, this includes all metadata, the indirect blocks of user data, and
any deduplication tables.
The class can also be provisioned to accept small file blocks.
.Pp
A pool must always have at least one normal
.Pq non- Ns Sy dedup Ns /- Ns Sy special
vdev before
other devices can be assigned to the special class.
If the
.Sy special
class becomes full, then allocations intended for it
will spill back into the normal class.
.Pp
Deduplication tables can be excluded from the special class by unsetting the
.Sy zfs_ddt_data_is_special
ZFS module parameter.
.Pp
Inclusion of small file blocks in the special class is opt-in.
Each dataset can control the size of small file blocks allowed
in the special class by setting the
.Sy special_small_blocks
property to nonzero.
See
.Xr zfsprops 7
for more info on this property.