[[chapter_zfs]]
ZFS on Linux
------------
ifdef::wiki[]
:pve-toplevel:
endif::wiki[]

* Copy-on-write clone

* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3,
dRAID, dRAID2, dRAID3

* Can use SSD for cache

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD. This can increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
own cache management. ZFS needs to communicate directly with the disks. An
HBA adapter or something like an LSI controller flashed in ``IT'' mode is more
appropriate.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for disks of that VM,
as they are not supported by ZFS. Use IDE or SCSI instead (also works
with the `virtio` SCSI controller type).


Installation as Root File System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `zfs` command is used to configure and manage your ZFS file systems. The
following command lists all file systems after installation:

----
# zfs list
rpool/swap        4.25G  7.69T    64K  -
----


[[sysadmin_zfs_raid_considerations]]
ZFS RAID Level Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are a few factors to take into consideration when choosing the layout of
a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
`vdev`. All vdevs in a pool are used equally and the data is striped among them
(RAID0). Check the `zpoolconcepts(7)` manpage for more details on vdevs.

[[sysadmin_zfs_raid_performance]]
Performance
^^^^^^^^^^^

Each `vdev` type has different performance behaviors. The two
parameters of interest are the IOPS (Input/Output Operations per Second) and
the bandwidth with which data can be written or read.
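To get a feel for these two metrics on an existing pool, one convenient way
(not the only one) is to watch the per-vdev statistics live with
`zpool iostat`; `<pool>` is a placeholder for your pool name:

----
# zpool iostat -v <pool> 1
----

The `-v` flag breaks the numbers down per vdev, so the difference between, for
example, a mirror and a RAIDZ layout becomes directly visible under load.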
A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
to both parameters when writing data. When reading data, the performance will
scale linearly with the number of disks in the mirror.

A common situation is to have 4 disks. When setting it up as 2 mirror vdevs
(RAID10), the pool will have the write characteristics of two single disks in
regard to IOPS and bandwidth. For read operations, it will resemble 4 single
disks.

A 'RAIDZ' of any redundancy level will approximately behave like a single disk
in regard to IOPS, with a lot of bandwidth. How much bandwidth depends on the
size of the RAIDZ vdev and the redundancy level.

A 'dRAID' pool should match the performance of an equivalent 'RAIDZ' pool.

For running VMs, IOPS is the more important metric in most situations.


[[sysadmin_zfs_raid_size_space_usage_redundancy]]
Size, Space usage and Redundancy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While a pool made of 'mirror' vdevs will have the best performance
characteristics, the usable space will be 50% of the available disk capacity;
less if a mirror vdev consists of more than 2 disks, for example in a 3-way
mirror. At least one healthy disk per mirror is needed for the pool to stay
functional.

The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with P being
the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail
without losing data. A special case is a 4 disk pool with RAIDZ2. In this
situation it is usually better to use 2 mirror vdevs for the better
performance, as the usable space will be the same.

Another important factor when using any RAIDZ level is how ZVOL datasets, which
are used for VM disks, behave. For each data block the pool needs parity data
which is at least the size of the minimum block size defined by the `ashift`
value of the pool. With an ashift of 12 the block size of the pool is 4k. The
default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
written will cause two additional 4k parity blocks to be written,
8k + 4k + 4k = 16k. This is of course a simplified approach and the real
situation will be slightly different, with metadata, compression and such not
being accounted for in this example.

This behavior can be observed when checking the following properties of the
ZVOL:

 * `volsize`
 * `refreservation` (if the pool is not thin provisioned)
 * `used` (if the pool is thin provisioned and without snapshots present)

----
# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
----

`volsize` is the size of the disk as it is presented to the VM, while
`refreservation` shows the reserved space on the pool which includes the
expected space needed for the parity data. If the pool is thin provisioned, the
`refreservation` will be set to 0. Another way to observe the behavior is to
compare the used disk space within the VM and the `used` property. Be aware
that snapshots will skew the value.

There are a few options to counter the increased use of space:

* Increase the `volblocksize` to improve the data to parity ratio
* Use 'mirror' vdevs instead of 'RAIDZ'
* Use `ashift=9` (block size of 512 bytes)

The `volblocksize` property can only be set when creating a ZVOL. The default
value can be changed in the storage configuration. When doing this, the guest
needs to be tuned accordingly, and depending on the use case, the problem of
write amplification is just moved from the ZFS layer up to the guest.

Using `ashift=9` when creating the pool can lead to bad
performance, depending on the disks underneath, and cannot be changed later on.
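Picking up the first option from the list above: in {pve}, the default
`volblocksize` for newly created disks is taken from the `blocksize` option of
the ZFS storage definition. As a hedged example, assuming a storage entry named
`local-zfs`, raising it could look like this (existing ZVOLs keep their current
block size):

----
# pvesm set local-zfs --blocksize 16k
----

The same can be achieved by adding `blocksize 16k` to the corresponding storage
section in `/etc/pve/storage.cfg`.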
Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use
them, unless your environment has specific needs and characteristics where
RAIDZ performance is acceptable.


ZFS dRAID
~~~~~~~~~

In a ZFS dRAID (declustered RAID), the hot spare drive(s) participate in the
RAID. Their spare capacity is reserved and used for rebuilding when one drive
fails. This provides, depending on the configuration, faster rebuilding
compared to a RAIDZ in case of drive failure. More information can be found in
the official OpenZFS documentation. footnote:[OpenZFS dRAID
https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]

NOTE: dRAID is intended for setups with more than 10-15 disks. A RAIDZ setup
should be better for a smaller number of disks in most use cases.

NOTE: The GUI requires one more disk than the minimum (i.e. dRAID1 needs 3). It
expects that a spare disk is added as well.

 * `dRAID1` or `dRAID`: requires at least 2 disks, one can fail before data is
lost
 * `dRAID2`: requires at least 3 disks, two can fail before data is lost
 * `dRAID3`: requires at least 4 disks, three can fail before data is lost


Additional information can be found on the manual page:

----
# man zpoolconcepts
----

Spares and Data
^^^^^^^^^^^^^^^

The number of `spares` tells the system how many disks it should keep ready in
case of a disk failure. The default value is 0 `spares`. Without spares,
rebuilding won't get any speed benefits.

`data` defines the number of devices in a redundancy group. The default value
is 8, except when `disks - parity - spares` equals something less than 8; in
that case, the lower number is used. In general, a smaller number of `data`
devices leads to higher IOPS, better compression ratios and faster
resilvering, but defining fewer data devices reduces the available storage
capacity of the pool.
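Putting these values together follows the
`draid[<parity>][:<data>d][:<children>c][:<spares>s]` form described in
`zpoolconcepts(7)`. As a rough sketch (the disk count and group size here are
only an example), a dRAID2 over six disks with 3 data devices per redundancy
group and one distributed spare could be created like this:

----
# zpool create -f -o ashift=12 <pool> draid2:3d:1s \
      <device1> <device2> <device3> <device4> <device5> <device6>
----

With this layout, two disks' worth of capacity go to parity and one to the
distributed spare, leaving roughly three disks' worth of usable space.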
Bootloader
~~~~~~~~~~

{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to manage the
bootloader configuration.
See the chapter on xref:sysboot[{pve} host bootloaders] for details.


ZFS Administration
~~~~~~~~~~~~~~~~~~

The main commands to manage ZFS are `zfs` and `zpool`. Both come with great
manual pages, which can be read with:

----
# man zpool
# man zfs
----

[[sysadmin_zfs_create_new_zpool]]
Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should
correspond to a sector size (2 to the power of `ashift`) equal to or larger
than that of the underlying disk.

----
# zpool create -f -o ashift=12 <pool> <device>
----

[TIP]
====
Pool names must adhere to the following rules:

* begin with a letter (a-z or A-Z)
* contain only alphanumeric, `-`, `_`, `.`, `:` or ` ` (space) characters
* must *not begin* with one of `mirror`, `raidz`, `draid` or `spare`
* must not be `log`
====

To activate compression (see section <<zfs_compression,Compression in ZFS>>):

----
# zfs set compression=lz4 <pool>
----

[[sysadmin_zfs_create_new_zpool_raid0]]
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid1]]
Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid10]]
Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

[[sysadmin_zfs_create_new_zpool_raidz1]]
Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

Please read the xref:sysadmin_zfs_raid_considerations[ZFS RAID Level
Considerations] section to get a rough estimate of the IOPS and bandwidth you
can expect before setting up a pool, especially when planning to use a RAID-Z
mode.

[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated device, or partition, as a second-level cache
to increase performance. Such a cache device will especially help with
random-read workloads of data that is mostly static. As it acts as an
additional caching layer between the actual storage and the in-memory ARC, it
can also help if the ARC must be reduced due to memory constraints.

.Create ZFS pool with an on-disk cache
----
# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
----

Here only a single `<device>` and a single `<cache_device>` were used, but it
is possible to use more devices, as shown in
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID].

Note that no mirror or RAID modes exist for cache devices; they are all simply
accumulated.

If any cache device produces errors on read, ZFS will transparently divert that
request to the underlying storage layer.
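Whether an L2ARC actually pays off can only be judged after some runtime. One
simple way to check (the `arcstat` and `arc_summary` tools, if installed, show
the same counters more comfortably) is to look at the L2ARC hit and miss
counters in the kernel's ARC statistics:

----
# grep -E '^l2_(hits|misses|size)\b' /proc/spl/kstat/zfs/arcstats
----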
[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive, or partition, for the ZFS Intent Log
(ZIL). The ZIL is mainly used to provide safe synchronous transactions, and is
therefore often found in performance-critical paths like databases, or other
programs that issue frequent `fsync` operations.

The pool itself is used as the default ZIL location. Diverting the ZIL IO load
to a separate device can help to reduce transaction latencies while relieving
the main pool at the same time, increasing overall performance.

For disks to be used as log devices, directly or through a partition, it's
recommended to:

- use fast SSDs with power-loss protection, as those have much smaller commit
  latencies.

- use at least a few GB for the partition (or whole device), but using more
  than half of your installed memory won't provide you with any real advantage.

.Create ZFS pool with separate log device
----
# zpool create -f -o ashift=12 <pool> <device> log <log_device>
----

In the above example, a single `<device>` and a single `<log_device>` are used,
but you can also combine this with other RAID variants, as described in the
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID] section.

You can also mirror the log device across multiple devices; this is mainly
useful to ensure that performance doesn't immediately degrade if a single log
device fails.

If all log devices fail, the ZFS main pool itself will be used again, until the
log device(s) get replaced.


[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, you can still add both, or just one
of them, at any time.

For example, let's assume you got a good enterprise SSD with power-loss
protection that you want to use for improving the overall performance of your
pool.

As the maximum size of a log device should be about half the size of the
installed physical memory, the ZIL will most likely only take up a relatively
small part of the SSD; the remaining space can be used as cache.

First you have to create two GPT partitions on the SSD with `parted` or
`gdisk`.
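A rough sketch of that partitioning step, assuming the SSD is addressed as
`/dev/disk/by-id/<ssd>` and using an example size of 32 GB for the log
partition (adjust it according to the sizing note above):

----
# sgdisk -n1:0:+32G -n2:0:0 /dev/disk/by-id/<ssd>
----

This creates partition 1 with 32 GB for the ZIL and partition 2 with the
remaining space for the cache.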
- zpool create -f -o ashift=12 log +---- +# sgdisk -R +# sgdisk -G +# zpool replace -f +---- -.Add cache and log to an existing pool +NOTE: Use the `zpool status -v` command to monitor how far the resilvering +process of the new disk has progressed. -If you have an pool without cache and log. First partition the SSD in -2 partition with `parted` or `gdisk` +.With `proxmox-boot-tool`: -IMPORTANT: Always use GPT partition tables. +---- +# proxmox-boot-tool format +# proxmox-boot-tool init [grub] +---- -The maximum size of a log device should be about half the size of -physical memory, so this is usually quite small. The rest of the SSD -can be used as cache. +NOTE: `ESP` stands for EFI System Partition, which is setup as partition #2 on +bootable disks setup by the {pve} installer since version 5.4. For details, see +xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP]. - zpool add -f log cache +NOTE: Make sure to pass 'grub' as mode to `proxmox-boot-tool init` if +`proxmox-boot-tool status` indicates your current disks are using GRUB, +especially if Secure Boot is enabled! -.Changing a failed device +.With plain GRUB: - zpool replace -f +---- +# grub-install +---- +NOTE: Plain GRUB is only used on systems installed with {pve} 6.3 or earlier, +which have not been manually migrated to using `proxmox-boot-tool` yet. -Activate E-Mail Notification -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Configure E-Mail Notification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -ZFS comes with an event daemon, which monitors events generated by the -ZFS kernel module. The daemon can also send emails on ZFS events like -pool errors. +ZFS comes with an event daemon `ZED`, which monitors events generated by the ZFS +kernel module. The daemon can also send emails on ZFS events like pool errors. +Newer ZFS packages ship the daemon in a separate `zfs-zed` package, which should +already be installed by default in {pve}. -To activate the daemon it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your -favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting: +You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with your +favorite editor. The required setting for email notification is +`ZED_EMAIL_ADDR`, which is set to `root` by default. -------- ZED_EMAIL_ADDR="root" @@ -275,44 +563,83 @@ ZED_EMAIL_ADDR="root" Please note {pve} forwards mails to `root` to the email address configured for the root user. -IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All -other settings are optional. - +[[sysadmin_zfs_limit_memory_usage]] Limit ZFS Memory Usage ~~~~~~~~~~~~~~~~~~~~~~ -It is good to use at most 50 percent (which is the default) of the -system memory for ZFS ARC to prevent performance shortage of the -host. Use your preferred editor to change the configuration in -`/etc/modprobe.d/zfs.conf` and insert: +ZFS uses '50 %' of the host memory for the **A**daptive **R**eplacement +**C**ache (ARC) by default. For new installations starting with {pve} 8.1, the +ARC usage limit will be set to '10 %' of the installed physical memory, clamped +to a maximum of +16 GiB+. This value is written to `/etc/modprobe.d/zfs.conf`. + +Allocating enough memory for the ARC is crucial for IO performance, so reduce it +with caution. As a general rule of thumb, allocate at least +2 GiB Base + 1 +GiB/TiB-Storage+. For example, if you have a pool with +8 TiB+ of available +storage space then you should use +10 GiB+ of memory for the ARC. + +ZFS also enforces a minimum value of +64 MiB+. 
You can change the ARC usage limit for the current boot (a reboot resets this
change again) by writing to the +zfs_arc_max+ module parameter directly:

----
 echo "$[10 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

To *permanently change* the ARC limits, add (or change if already present) the
following line to `/etc/modprobe.d/zfs.conf`:

--------
options zfs zfs_arc_max=8589934592
--------

This example setting limits the usage to 8 GiB ('8 * 2^30^').

IMPORTANT: In case your desired +zfs_arc_max+ value is lower than or equal to
+zfs_arc_min+ (which defaults to 1/32 of the system memory), +zfs_arc_max+ will
be ignored unless you also set +zfs_arc_min+ to at most +zfs_arc_max - 1+.

----
echo "$[8 * 1024*1024*1024 - 1]" >/sys/module/zfs/parameters/zfs_arc_min
echo "$[8 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

This example setting (temporarily) limits the usage to 8 GiB ('8 * 2^30^') on
systems with more than 256 GiB of total memory, where simply setting
+zfs_arc_max+ alone would not work.

[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u -k all
----

You *must reboot* to activate these changes.
====


[[zfs_swap]]
SWAP on ZFS
~~~~~~~~~~~

Swap-space created on a zvol may generate some troubles, like blocking the
server or generating a high IO load, often seen when starting a Backup
to an external Storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the
``swappiness'' value. A good value for servers is 10:

----
# sysctl -w vm.swappiness=10
----

To make the swappiness persistent, open `/etc/sysctl.conf` with an editor of
your choice and add the line `vm.swappiness = 10`.

.Linux kernel `swappiness` parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value                | Strategy
| vm.swappiness = 10   | This value is sometimes recommended to
improve performance when sufficient memory exists in a system.
| vm.swappiness = 60   | The default value.
| vm.swappiness = 100  | The kernel will swap aggressively.
|===========================================================
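If you do set up a swap partition as suggested above, a minimal sketch of the
required steps could look like the following, with `<swap-partition>` as a
placeholder for the partition you created or left free during installation:

----
# mkswap /dev/<swap-partition>
# swapon /dev/<swap-partition>
# echo '/dev/<swap-partition> none swap defaults 0 0' >> /etc/fstab
----

The last line makes the swap device persistent across reboots; referencing it
via `/dev/disk/by-id/` or its UUID instead of a plain device name is more
robust.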
[[zfs_encryption]]
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~

WARNING: Native ZFS encryption in {pve} is experimental. Known limitations and
issues include Replication with encrypted datasets
footnote:[https://bugzilla.proxmox.com/show_bug.cgi?id=2350],
as well as checksum errors when using Snapshots or ZVOLs.
footnote:[https://github.com/openzfs/zfs/issues/11688]

ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:

----
# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  disabled         local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  enabled          local
----

WARNING: There is currently no support for booting from pools with encrypted
datasets using GRUB, and only limited support for automatically unlocking
encrypted datasets on boot. Older versions of ZFS without encryption support
will not be able to decrypt stored data.

NOTE: It is recommended to either unlock storage datasets manually after
booting, or to write a custom unit to pass the key material needed for
unlocking on boot to `zfs load-key`.

WARNING: Establish and test a backup procedure before enabling encryption of
production data. If the associated key material/passphrase/keyfile has been
lost, accessing the encrypted data is no longer possible.

Encryption needs to be set up when creating datasets/zvols, and is inherited by
default to child datasets. For example, to create an encrypted dataset
`tank/encrypted_data` and configure it as storage in {pve}, run the following
commands:

----
# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
Enter passphrase:
Re-enter passphrase:

# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data
----

All guest volumes/disks created on this storage will be encrypted with the
shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be loaded
and the dataset needs to be mounted. This can be done in one step with:

----
# zfs mount -l tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':
----

It is also possible to use a (random) keyfile instead of prompting for a
passphrase by setting the `keylocation` and `keyformat` properties, either at
creation time or with `zfs change-key` on existing datasets:

----
# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1

# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
----

WARNING: When using a keyfile, special care needs to be taken to secure the
keyfile against unauthorized access or accidental loss. Without the keyfile,
it is not possible to access the plaintext data!

A guest volume created underneath an encrypted dataset will have its
`encryptionroot` property set accordingly. The key material only needs to be
loaded once per encryptionroot to be available to all encrypted datasets
underneath it.

See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
`keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
change-key` commands and the `Encryption` section from `man zfs` for more
details and advanced usage.
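For example, to check which datasets below the pool `tank` used above are
currently locked, and to load all of their keys in one go after a reboot
(passphrase-based keys will be prompted for), you could run:

----
# zfs get -r keystatus,keylocation tank
# zfs load-key -r tank
# zfs mount -a
----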
[[zfs_compression]]
Compression in ZFS
~~~~~~~~~~~~~~~~~~

When compression is enabled on a dataset, ZFS tries to compress all *new*
blocks before writing them, and decompresses them on reading. Already
existing data will not be compressed retroactively.

You can enable compression with:

----
# zfs set compression=<algorithm> <dataset>
----

We recommend using the `lz4` algorithm, because it adds very little CPU
overhead. Other algorithms like `lzjb` and `gzip-N`, where `N` is an
integer from `1` (fastest) to `9` (best compression ratio), are also
available. Depending on the algorithm and how compressible the data is,
having compression enabled can even increase I/O performance.

You can disable compression at any time with:

----
# zfs set compression=off <dataset>
----

Again, only new blocks will be affected by this change.


[[sysadmin_zfs_special_device]]
ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

IMPORTANT: The redundancy of the `special` device should match the one of the
pool, since the `special` device is a point of failure for the whole pool.

WARNING: Adding a `special` device to a pool cannot be undone!

.Create a pool with `special` device and RAID-1:

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
----

.Add a `special` device to an existing pool with RAID-1:

----
# zpool add <pool> special mirror <device1> <device2>
----

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range between `512B` to `1M`. After setting the property, new file
blocks smaller than `size` will be allocated on the `special` device.

IMPORTANT: If the value for `special_small_blocks` is greater than or equal to
the `recordsize` (default `128K`) of the dataset, *all* data will be written to
the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all containers
in the pool will opt in for small file blocks).

.Opt in for all files smaller than 4K-blocks pool-wide:

----
# zfs set special_small_blocks=4K <pool>
----

.Opt in for small file blocks for a single dataset:

----
# zfs set special_small_blocks=4K <pool>/<filesystem>
----

.Opt out from small file blocks for a single dataset:

----
# zfs set special_small_blocks=0 <pool>/<filesystem>
----
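To get an idea of how much of the `special` vdev is already in use compared to
the rest of the pool, the per-vdev capacity listing is one simple way to check:

----
# zpool list -v <pool>
----

The `special` mirror shows up as its own entry with its allocated and free
space.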
[[sysadmin_zfs_features]]
ZFS Pool Features
~~~~~~~~~~~~~~~~~

Changes to the on-disk format in ZFS are only made between major version
changes and are specified through *features*. All features, as well as the
general mechanism, are well documented in the `zpool-features(5)` manpage.

Since enabling new features can render a pool not importable by an older
version of ZFS, this needs to be done actively by the administrator, by
running `zpool upgrade` on the pool (see the `zpool-upgrade(8)` manpage).

Unless you need to use one of the new features, there is no upside to enabling
them.

In fact, there are some downsides to enabling new features:

* A system with root on ZFS, that still boots using GRUB will become
  unbootable if a new feature is active on the rpool, due to the incompatible
  implementation of ZFS in GRUB.
* The system will not be able to import any upgraded pool when booted with an
  older kernel, which still ships with the old ZFS modules.
* Booting an older {pve} ISO to repair a non-booting system will likewise not
  work.

IMPORTANT: Do *not* upgrade your rpool if your system is still booted with
GRUB, as this will render your system unbootable. This includes systems
installed before {pve} 5.4, and systems booting with legacy BIOS boot (see
xref:sysboot_determine_bootloader_used[how to determine the bootloader]).

.Enable new features for a ZFS pool:
----
# zpool upgrade <pool>
----
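If you want to see what an upgrade would change beforehand, running
`zpool upgrade` without arguments only lists pools that do not yet have all
supported features enabled, and the feature states of a pool can be inspected
as properties:

----
# zpool upgrade
# zpool get all <pool> | grep feature@
----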