* Copy-on-write clone
-* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
+* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3,
+dRAID, dRAID2, dRAID3
* Can use SSD for cache
data corruption, we recommend the use of high quality ECC RAM.
If you use a dedicated cache and/or log disk, you should use an
-enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
+enterprise class SSD. This can
increase the overall performance significantly.
IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
There are a few factors to take into consideration when choosing the layout of
a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
`vdev`. All vdevs in a pool are used equally and the data is striped among them
-(RAID0). Check the `zpool(8)` manpage for more details on vdevs.
+(RAID0). Check the `zpoolconcepts(7)` manpage for more details on vdevs.
[[sysadmin_zfs_raid_performance]]
Performance
parameters of interest are the IOPS (Input/Output Operations per Second) and
the bandwidth with which data can be written or read.
-A 'mirror' vdev (RAID1) will approximately behave like a single disk in regards
-to both parameters when writing data. When reading data if will behave like the
-number of disks in the mirror.
+A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
+to both parameters when writing data. When reading data the performance will
+scale linearly with the number of disks in the mirror.
A common situation is to have 4 disks. When setting it up as 2 mirror vdevs
(RAID10) the pool will have the write characteristics as two single disks in
-regard of IOPS and bandwidth. For read operations it will resemble 4 single
+regard to IOPS and bandwidth. For read operations it will resemble 4 single
disks.
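+
+As an example, such a 4-disk RAID10 setup could be created as two striped mirror
+vdevs (pool and disk names are placeholders):
+
+----
+# zpool create -f -o ashift=12 <pool> mirror <disk1> <disk2> mirror <disk3> <disk4>
+----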
A 'RAIDZ' of any redundancy level will approximately behave like a single disk
-in regard of IOPS with a lot of bandwidth. How much bandwidth depends on the
+in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the
size of the RAIDZ vdev and the redundancy level.
For running VMs, IOPS is the more important metric in most situations.
The `volblocksize` property can only be set when creating a ZVOL. The default
value can be changed in the storage configuration. When doing this, the guest
needs to be tuned accordingly and depending on the use case, the problem of
-write amplification if just moved from the ZFS layer up to the guest.
+write amplification is just moved from the ZFS layer up to the guest.
Using `ashift=9` when creating the pool can lead to bad
performance, depending on the disks underneath, and cannot be changed later on.
RAIDZ performance characteristics are acceptable.
+ZFS dRAID
+~~~~~~~~~
+
+In a ZFS dRAID (declustered RAID) the hot spare drive(s) participate in the RAID.
+Their spare capacity is reserved and used for rebuilding when one drive fails.
+This provides, depending on the configuration, faster rebuilding compared to a
+RAIDZ in case of drive failure. More information can be found in the official
+OpenZFS documentation. footnote:[OpenZFS dRAID
+https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]
+
+NOTE: dRAID is intended for setups with more than 10-15 disks. A RAIDZ
+setup should be better for a lower number of disks in most use cases.
+
+NOTE: The GUI requires one more disk than the minimum (i.e. dRAID1 needs 3). It
+expects that a spare disk is added as well.
+
+ * `dRAID1` or `dRAID`: requires at least 2 disks; one can fail before data is
+lost
+ * `dRAID2`: requires at least 3 disks; two can fail before data is lost
+ * `dRAID3`: requires at least 4 disks; three can fail before data is lost
+
+
+Additional information can be found on the manual page:
+
+----
+# man zpoolconcepts
+----
+
+Spares and Data
+^^^^^^^^^^^^^^^
+The number of `spares` tells the system how many disks it should keep ready in
+case of a disk failure. The default is `0` spares. Without spares, rebuilding
+won't get any speed benefits.
+
+`data` defines the number of devices in a redundancy group. The default value is
+8. If `disks - parity - spares` is less than 8, that lower number is used
+instead. In general, a smaller number of `data` devices leads to higher IOPS,
+better compression ratios and faster resilvering, but defining fewer data
+devices reduces the available storage capacity of the pool.
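+
+As an illustration, using the `draid[<parity>][:<data>d][:<children>c][:<spares>s]`
+vdev syntax described in `zpoolconcepts(7)`, a dRAID2 with 4 `data` devices per
+redundancy group and one distributed spare over 13 disks could be created like
+this (pool and disk names are placeholders):
+
+----
+# zpool create -f -o ashift=12 <pool> draid2:4d:1s:13c <disk1> ... <disk13>
+----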
+
+
Bootloader
~~~~~~~~~~
Create a new zpool
^^^^^^^^^^^^^^^^^^
-To create a new pool, at least one disk is needed. The `ashift` should
-have the same sector-size (2 power of `ashift`) or larger as the
-underlying disk.
+To create a new pool, at least one disk is needed. The value of `ashift` should
+be chosen so that 2 to the power of `ashift` is at least the sector size of the
+underlying disk.
----
# zpool create -f -o ashift=12 <pool> <device>
----
+[TIP]
+====
+Pool names must adhere to the following rules:
+
+* begin with a letter (a-z or A-Z)
+* contain only alphanumeric, `-`, `_`, `.`, `:` or ` ` (space) characters
+* must *not begin* with one of `mirror`, `raidz`, `draid` or `spare`
+* must not be `log`
+====
+
To activate compression (see section <<zfs_compression,Compression in ZFS>>):
----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----
+Please read the section for
+xref:sysadmin_zfs_raid_considerations[ZFS RAID Level Considerations]
+to get a rough estimate of the IOPS and bandwidth to expect before setting up
+a pool, especially when wanting to use a RAID-Z mode.
+
[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-It is possible to use a dedicated cache drive partition to increase
-the performance (use SSD).
-
-As `<device>` it is possible to use more devices, like it's shown in
-"Create a new pool with RAID*".
+It is possible to use a dedicated device, or partition, as second-level cache to
+increase the performance. Such a cache device will especially help with
+random-read workloads of data that is mostly static. As it acts as an additional
+caching layer between the actual storage and the in-memory ARC, it can also
+help if the ARC must be reduced due to memory constraints.
+.Create ZFS pool with an on-disk cache
----
-# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
+# zpool create -f -o ashift=12 <pool> <device> cache <cache-device>
----
+Here only a single `<device>` and a single `<cache-device>` were used, but it is
+possible to use more devices, as shown in
+xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID].
+
+Note that no mirror or RAID modes exist for cache devices; they are all simply
+accumulated.
+
+If any cache device produces errors on read, ZFS will transparently divert that
+request to the underlying storage layer.
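+
+Whether the cache device is actually being used can be checked at runtime, for
+example with (pool name is a placeholder):
+
+----
+# zpool iostat -v <pool>
+----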
+
+
[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-It is possible to use a dedicated cache drive partition to increase
-the performance(SSD).
+It is possible to use a dedicated drive, or partition, for the ZFS Intent Log
+(ZIL). The ZIL mainly provides safe synchronous transactions, which often occur
+in performance-critical paths like databases, or other programs that issue
+`fsync` operations frequently.
+
+The pool is used as the default ZIL location. Diverting the ZIL IO load to a
+separate device can help to reduce transaction latencies while relieving the
+main pool at the same time, increasing overall performance.
+
+For disks to be used as log devices, directly or through a partition, it's
+recommended to:
-As `<device>` it is possible to use more devices, like it's shown in
-"Create a new pool with RAID*".
+- Use fast SSDs with power-loss protection, as those have much smaller commit
+ latencies.
+- Use at least a few GB for the partition (or whole device), but using more than
+ half of your installed memory won't provide you with any real advantage.
+
+.Create ZFS pool with separate log device
----
-# zpool create -f -o ashift=12 <pool> <device> log <log_device>
+# zpool create -f -o ashift=12 <pool> <device> log <log-device>
----
+In the example above, a single `<device>` and a single `<log-device>` are used,
+but you can also combine this with other RAID variants, as described in the
+xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID] section.
+
+You can also mirror the log device across multiple devices; this is mainly
+useful to ensure that performance doesn't immediately degrade if a single log
+device fails.
+
+If all log devices fail, the ZFS main pool itself will be used again until the
+log device(s) are replaced.
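+
+For example, a mirrored log could be set up like this (pool and device names are
+placeholders):
+
+.Create ZFS pool with a mirrored log device
+----
+# zpool create -f -o ashift=12 <pool> <device> log mirror <log-device1> <log-device2>
+----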
+
[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you have a pool without cache and log. First partition the SSD in
-2 partition with `parted` or `gdisk`
+If you have a pool without cache and log you can still add both, or just one of
+them, at any time.
+
+For example, let's assume you got a good enterprise SSD with power-loss
+protection that you want to use for improving the overall performance of your
+pool.
-IMPORTANT: Always use GPT partition tables.
+As the maximum size of a log device should be about half the size of the
+installed physical memory, the ZIL will most likely only take up a relatively
+small part of the SSD; the remaining space can be used as cache.
-The maximum size of a log device should be about half the size of
-physical memory, so this is usually quite small. The rest of the SSD
-can be used as cache.
+First you have to create two GPT partitions on the SSD with `parted` or `gdisk`.
+Then you're ready to add them to a pool:
+
+.Add both, a separate log device and a second-level cache, to an existing pool
----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----
+Just replace `<pool>`, `<device-part1>` and `<device-part2>` with the pool name
+and the two `/dev/disk/by-id/` paths to the partitions.
+
+You can also add ZIL and cache separately.
+
+.Add a log device to an existing ZFS pool
+----
+# zpool add <pool> log <log-device>
+----
+
+
[[sysadmin_zfs_change_failed_dev]]
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^
----
-# zpool replace -f <pool> <old device> <new device>
+# zpool replace -f <pool> <old-device> <new-device>
----
.Changing a failed bootable device
----
# proxmox-boot-tool format <new disk's ESP>
-# proxmox-boot-tool init <new disk's ESP>
+# proxmox-boot-tool init <new disk's ESP> [grub]
----
NOTE: `ESP` stands for EFI System Partition, which is setup as partition #2 on
bootable disks setup by the {pve} installer since version 5.4. For details, see
xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].
+NOTE: Make sure to pass 'grub' as mode to `proxmox-boot-tool init` if
+`proxmox-boot-tool status` indicates your current disks are using Grub,
+especially if Secure Boot is enabled!
+
.With plain `grub`:
----
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~
+WARNING: Native ZFS encryption in {pve} is experimental. Known limitations and
+issues include Replication with encrypted datasets
+footnote:[https://bugzilla.proxmox.com/show_bug.cgi?id=2350],
+as well as checksum errors when using Snapshots or ZVOLs.
+footnote:[https://github.com/openzfs/zfs/issues/11688]
+
ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:
ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device or a power of
-two in the range between `512B` to `128K`. After setting the property new file
+two in the range between `512B` to `1M`. After setting the property new file
blocks smaller than `size` will be allocated on the `special` device.
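+
+For example, to store all new file blocks smaller than 4K on the `special`
+device (pool and dataset names are placeholders):
+
+----
+# zfs set special_small_blocks=4K <pool>/<dataset>
+----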
IMPORTANT: If the value for `special_small_blocks` is greater than or equal to