* Copy-on-write clone
-* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
+* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3,
+dRAID, dRAID2, dRAID3
* Can use SSD for cache
data corruption, we recommend the use of high quality ECC RAM.
If you use a dedicated cache and/or log disk, you should use an
-enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
+enterprise class SSD. This can
increase the overall performance significantly.
IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
parameters of interest are the IOPS (Input/Output Operations per Second) and
the bandwidth with which data can be written or read.
-A 'mirror' vdev (RAID1) will approximately behave like a single disk in regards
-to both parameters when writing data. When reading data if will behave like the
-number of disks in the mirror.
+A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
+to both parameters when writing data. When reading data the performance will
+scale linearly with the number of disks in the mirror.
A common situation is to have 4 disks. When setting it up as 2 mirror vdevs
(RAID10) the pool will have the write characteristics as two single disks in
-regard of IOPS and bandwidth. For read operations it will resemble 4 single
+regard to IOPS and bandwidth. For read operations it will resemble 4 single
disks.
A 'RAIDZ' of any redundancy level will approximately behave like a single disk
-in regard of IOPS with a lot of bandwidth. How much bandwidth depends on the
+in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the
size of the RAIDZ vdev and the redundancy level.
For running VMs, IOPS is the more important metric in most situations.
The `volblocksize` property can only be set when creating a ZVOL. The default
value can be changed in the storage configuration. When doing this, the guest
needs to be tuned accordingly and depending on the use case, the problem of
-write amplification if just moved from the ZFS layer up to the guest.
+write amplification is just moved from the ZFS layer up to the guest.
Using `ashift=9` when creating the pool can lead to bad
performance, depending on the disks underneath, and cannot be changed later on.
RAIDZ performance characteristics are acceptable.
+ZFS dRAID
+~~~~~~~~~
+
+In a ZFS dRAID (declustered RAID) the hot spare drive(s) participate in the RAID.
+Their spare capacity is reserved and used for rebuilding when one drive fails.
+This provides, depending on the configuration, faster rebuilding compared to a
+RAIDZ in case of drive failure. More information can be found in the official
+OpenZFS documentation. footnote:[OpenZFS dRAID
+https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]
+
+NOTE: dRAID is intended for more than 10-15 disks in a dRAID. A RAIDZ
+setup should be better for a lower amount of disks in most use cases.
+
+NOTE: The GUI requires one more disk than the minimum (i.e. dRAID1 needs 3). It
+expects that a spare disk is added as well.
+
+ * `dRAID1` or `dRAID`: requires at least 2 disks, one can fail before data is
+lost
+ * `dRAID2`: requires at least 3 disks, two can fail before data is lost
+ * `dRAID3`: requires at least 4 disks, three can fail before data is lost
+
+
+Additional information can be found on the manual page:
+
+----
+# man zpoolconcepts
+----
+
+Spares and Data
+^^^^^^^^^^^^^^^
+The number of `spares` tells the system how many disks it should keep ready in
+case of a disk failure. The default value is 0 `spares`. Without spares,
+rebuilding won't get any speed benefits.
+
+`data` defines the number of devices in a redundancy group. The default value is
+8. Except when `disks - parity - spares` equal something less than 8, the lower
+number is used. In general, a smaller number of `data` devices leads to higher
+IOPS, better compression ratios and faster resilvering, but defining fewer data
+devices reduces the available storage capacity of the pool.
+
+
Bootloader
~~~~~~~~~~
-Depending on whether the system is booted in EFI or legacy BIOS mode the
-{pve} installer sets up either `grub` or `systemd-boot` as main bootloader.
-See the chapter on xref:sysboot[{pve} host bootladers] for details.
+{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to manage the
+bootloader configuration.
+See the chapter on xref:sysboot[{pve} host bootloaders] for details.
ZFS Administration
Create a new zpool
^^^^^^^^^^^^^^^^^^
-To create a new pool, at least one disk is needed. The `ashift` should
-have the same sector-size (2 power of `ashift`) or larger as the
-underlying disk.
+To create a new pool, at least one disk is needed. The `ashift` should have the
+same sector-size (2 power of `ashift`) or larger as the underlying disk.
----
# zpool create -f -o ashift=12 <pool> <device>
----
+[TIP]
+====
+Pool names must adhere to the following rules:
+
+* begin with a letter (a-z or A-Z)
+* contain only alphanumeric, `-`, `_`, `.`, `:` or ` ` (space) characters
+* must *not begin* with one of `mirror`, `raidz`, `draid` or `spare`
+* must not be `log`
+====
+
To activate compression (see section <<zfs_compression,Compression in ZFS>>):
----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----
+Please read the section for
+xref:sysadmin_zfs_raid_considerations[ZFS RAID Level Considerations]
+to get a rough estimate on how IOPS and bandwidth expectations before setting up
+a pool, especially when wanting to use a RAID-Z mode.
+
[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-It is possible to use a dedicated cache drive partition to increase
-the performance (use SSD).
-
-As `<device>` it is possible to use more devices, like it's shown in
-"Create a new pool with RAID*".
+It is possible to use a dedicated device, or partition, as second-level cache to
+increase the performance. Such a cache device will especially help with
+random-read workloads of data that is mostly static. As it acts as additional
+caching layer between the actual storage, and the in-memory ARC, it can also
+help if the ARC must be reduced due to memory constraints.
+.Create ZFS pool with a on-disk cache
----
-# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
+# zpool create -f -o ashift=12 <pool> <device> cache <cache-device>
----
+Here only a single `<device>` and a single `<cache-device>` was used, but it is
+possible to use more devices, like it's shown in
+xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID].
+
+Note that for cache devices no mirror or raid modi exist, they are all simply
+accumulated.
+
+If any cache device produces errors on read, ZFS will transparently divert that
+request to the underlying storage layer.
+
+
[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-It is possible to use a dedicated cache drive partition to increase
-the performance(SSD).
+It is possible to use a dedicated drive, or partition, for the ZFS Intent Log
+(ZIL), it is mainly used to provide safe synchronous transactions, so often in
+performance critical paths like databases, or other programs that issue `fsync`
+operations more frequently.
-As `<device>` it is possible to use more devices, like it's shown in
-"Create a new pool with RAID*".
+The pool is used as default ZIL location, diverting the ZIL IO load to a
+separate device can, help to reduce transaction latencies while relieving the
+main pool at the same time, increasing overall performance.
+For disks to be used as log devices, directly or through a partition, it's
+recommend to:
+
+- use fast SSDs with power-loss protection, as those have much smaller commit
+ latencies.
+
+- Use at least a few GB for the partition (or whole device), but using more than
+ half of your installed memory won't provide you with any real advantage.
+
+.Create ZFS pool with separate log device
----
-# zpool create -f -o ashift=12 <pool> <device> log <log_device>
+# zpool create -f -o ashift=12 <pool> <device> log <log-device>
----
+In above example a single `<device>` and a single `<log-device>` is used, but you
+can also combine this with other RAID variants, as described in the
+xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID] section.
+
+You can also mirror the log device to multiple devices, this is mainly useful to
+ensure that performance doesn't immediately degrades if a single log device
+fails.
+
+If all log devices fail the ZFS main pool itself will be used again, until the
+log device(s) get replaced.
+
[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-If you have a pool without cache and log. First partition the SSD in
-2 partition with `parted` or `gdisk`
+If you have a pool without cache and log you can still add both, or just one of
+them, at any time.
-IMPORTANT: Always use GPT partition tables.
+For example, let's assume you got a good enterprise SSD with power-loss
+protection that you want to use for improving the overall performance of your
+pool.
-The maximum size of a log device should be about half the size of
-physical memory, so this is usually quite small. The rest of the SSD
-can be used as cache.
+As the maximum size of a log device should be about half the size of the
+installed physical memory, it means that the ZIL will mostly likely only take up
+a relatively small part of the SSD, the remaining space can be used as cache.
+First you have to create two GPT partitions on the SSD with `parted` or `gdisk`.
+
+Then you're ready to add them to an pool:
+
+.Add both, a separate log device and a second-level cache, to an existing pool
----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----
+Just replay `<pool>`, `<device-part1>` and `<device-part2>` with the pool name
+and the two `/dev/disk/by-id/` paths to the partitions.
+
+You can also add ZIL and cache separately.
+
+.Add a log device to an existing ZFS pool
+----
+# zpool add <pool> log <log-device>
+----
+
+
[[sysadmin_zfs_change_failed_dev]]
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^
----
-# zpool replace -f <pool> <old device> <new device>
+# zpool replace -f <pool> <old-device> <new-device>
----
.Changing a failed bootable device
-Depending on how {pve} was installed it is either using `grub` or `systemd-boot`
-as bootloader (see xref:sysboot[Host Bootloader]).
+Depending on how {pve} was installed it is either using `systemd-boot` or `grub`
+through `proxmox-boot-tool`
+footnote:[Systems installed with {pve} 6.4 or later, EFI systems installed with
+{pve} 5.4 or later] or plain `grub` as bootloader (see
+xref:sysboot[Host Bootloader]). You can check by running:
+
+----
+# proxmox-boot-tool status
+----
The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
NOTE: Use the `zpool status -v` command to monitor how far the resilvering
process of the new disk has progressed.
-.With `systemd-boot`:
+.With `proxmox-boot-tool`:
----
-# pve-efiboot-tool format <new disk's ESP>
-# pve-efiboot-tool init <new disk's ESP>
+# proxmox-boot-tool format <new disk's ESP>
+# proxmox-boot-tool init <new disk's ESP> [grub]
----
NOTE: `ESP` stands for EFI System Partition, which is setup as partition #2 on
bootable disks setup by the {pve} installer since version 5.4. For details, see
-xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].
+xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].
+
+NOTE: make sure to pass 'grub' as mode to `proxmox-boot-tool init` if
+`proxmox-boot-tool status` indicates your current disks are using Grub,
+especially if Secure Boot is enabled!
-.With `grub`:
+.With plain `grub`:
----
# grub-install <new disk>
----
+NOTE: plain `grub` is only used on systems installed with {pve} 6.3 or earlier,
+which have not been manually migrated to using `proxmox-boot-tool` yet.
-Activate E-Mail Notification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-ZFS comes with an event daemon, which monitors events generated by the
-ZFS kernel module. The daemon can also send emails on ZFS events like
-pool errors. Newer ZFS packages ship the daemon in a separate package,
-and you can install it using `apt-get`:
+Configure E-Mail Notification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----
-# apt-get install zfs-zed
-----
+ZFS comes with an event daemon `ZED`, which monitors events generated by the ZFS
+kernel module. The daemon can also send emails on ZFS events like pool errors.
+Newer ZFS packages ship the daemon in a separate `zfs-zed` package, which should
+already be installed by default in {pve}.
-To activate the daemon it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
-favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting:
+You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with your
+favorite editor. The required setting for email notification is
+`ZED_EMAIL_ADDR`, which is set to `root` by default.
--------
ZED_EMAIL_ADDR="root"
Please note {pve} forwards mails to `root` to the email address
configured for the root user.
-IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
-other settings are optional.
-
[[sysadmin_zfs_limit_memory_usage]]
Limit ZFS Memory Usage
time this value changes:
----
-# update-initramfs -u
+# update-initramfs -u -k all
----
You *must reboot* to activate these changes.
We strongly recommend to use enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
-preferred to create a partition on a physical disk and use it as swapdevice.
+preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the
``swappiness'' value. A good value for servers is 10:
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~
+WARNING: Native ZFS encryption in {pve} is experimental. Known limitations and
+issues include Replication with encrypted datasets
+footnote:[https://bugzilla.proxmox.com/show_bug.cgi?id=2350],
+as well as checksum errors when using Snapshots or ZVOLs.
+footnote:[https://github.com/openzfs/zfs/issues/11688]
+
ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:
ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device or a power of
-two in the range between `512B` to `128K`. After setting the property new file
+two in the range between `512B` to `1M`. After setting the property new file
blocks smaller than `size` will be allocated on the `special` device.
IMPORTANT: If the value for `special_small_blocks` is greater than or equal to