X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=local-zfs.adoc;h=ab0f6adaa0b8e6071c9c7fc4a7129c5b729972dd;hb=10a2a4aa2e07f720da624f9d0611af8d6be13152;hp=bb03506e760b18e58a0d5d6d767de78246ed4bd5;hpb=eaefe614232992de1a528cd08e8f9dbe3c2aead8;p=pve-docs.git

diff --git a/local-zfs.adoc b/local-zfs.adoc
index bb03506..ab0f6ad 100644
--- a/local-zfs.adoc
+++ b/local-zfs.adoc
@@ -42,8 +42,6 @@ management.

* Designed for high storage capacities

-* Protection against data corruption
-
* Asynchronous replication over network

* Open Source

@@ -57,22 +55,22 @@ Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
-practice, use as much you can get for your hardware/budget. To prevent
+practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

-IMPORTANT: Do not use ZFS on top of hardware controller which has its
-own cache management. ZFS needs to directly communicate with disks. An
-HBA adapter is the way to go, or something like LSI controller flashed
-in ``IT'' mode.
+IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
+own cache management. ZFS needs to communicate directly with the disks. An
+HBA adapter or something like an LSI controller flashed in ``IT'' mode is more
+appropriate.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for disks of that VM,
-since they are not supported by ZFS. Use IDE or SCSI instead (works
-also with `virtio` SCSI controller type).
+as they are not supported by ZFS. Use IDE or SCSI instead (also works
+with the `virtio` SCSI controller type).


Installation as Root File System
@@ -151,12 +149,107 @@ rpool/swap 4.25G 7.69T 64K -
----


+[[sysadmin_zfs_raid_considerations]]
+ZFS RAID Level Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are a few factors to take into consideration when choosing the layout of
+a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
+`vdev`. All vdevs in a pool are used equally and the data is striped among them
+(RAID0). Check the `zpool(8)` manpage for more details on vdevs.
+
+[[sysadmin_zfs_raid_performance]]
+Performance
+^^^^^^^^^^^
+
+Each `vdev` type has different performance behaviors. The two
+parameters of interest are the IOPS (Input/Output Operations per Second) and
+the bandwidth with which data can be written or read.
+
+A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
+to both parameters when writing data. When reading data, it will behave like
+the number of disks in the mirror.
+
+A common situation is to have 4 disks. When setting them up as 2 mirror vdevs
+(RAID10), the pool will have the write characteristics of two single disks in
+regard to IOPS and bandwidth. For read operations, it will resemble 4 single
+disks.
+
+A 'RAIDZ' of any redundancy level will approximately behave like a single disk
+in regard to IOPS, with a lot of bandwidth. How much bandwidth depends on the
+size of the RAIDZ vdev and the redundancy level.
+
+For running VMs, IOPS is the more important metric in most situations.
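For an existing pool, a quick way to see how IOPS and bandwidth are distributed across the vdevs is the per-vdev statistics view of the standard `zpool iostat` command (a minimal illustration; `<pool>` is a placeholder for your pool name and `5` is an arbitrary reporting interval in seconds):

----
# zpool iostat -v <pool> 5
----

The `operations` columns correspond to IOPS and the `bandwidth` columns to throughput, reported separately for reads and writes on each vdev.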
+
+
+[[sysadmin_zfs_raid_size_space_usage_redundancy]]
+Size, Space usage and Redundancy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+While a pool made of 'mirror' vdevs will have the best performance
+characteristics, the usable space will be 50% of the raw disk capacity. It is
+less if a mirror vdev consists of more than 2 disks, for example in a 3-way
+mirror. At least one healthy disk per mirror is needed for the pool to stay
+functional.
+
+The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with P being
+the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail
+without losing data. A special case is a 4 disk pool with RAIDZ2. In this
+situation it is usually better to use 2 mirror vdevs for the better
+performance, as the usable space will be the same.
+
+Another important factor when using any RAIDZ level is how ZVOL datasets, which
+are used for VM disks, behave. For each data block, the pool needs parity data
+which is at least the size of the minimum block size defined by the `ashift`
+value of the pool. With an ashift of 12, the block size of the pool is 4k. The
+default block size for a ZVOL is 8k. Therefore, in a RAIDZ2, each 8k block
+written will cause two additional 4k parity blocks to be written,
+8k + 4k + 4k = 16k. This is of course a simplified approach and the real
+situation will be slightly different, with metadata, compression and such not
+being accounted for in this example.
+
+This behavior can be observed when checking the following properties of the
+ZVOL:
+
+ * `volsize`
+ * `refreservation` (if the pool is not thin provisioned)
+ * `used` (if the pool is thin provisioned and without snapshots present)
+
+----
+# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
+----
+
+`volsize` is the size of the disk as it is presented to the VM, while
+`refreservation` shows the reserved space on the pool which includes the
+expected space needed for the parity data. If the pool is thin provisioned, the
+`refreservation` will be set to 0. Another way to observe the behavior is to
+compare the used disk space within the VM and the `used` property. Be aware
+that snapshots will skew the value.
+
+There are a few options to counter the increased use of space:
+
+* Increase the `volblocksize` to improve the data to parity ratio
+* Use 'mirror' vdevs instead of 'RAIDZ'
+* Use `ashift=9` (block size of 512 bytes)
+
+The `volblocksize` property can only be set when creating a ZVOL. The default
+value can be changed in the storage configuration. When doing this, the guest
+needs to be tuned accordingly and, depending on the use case, the problem of
+write amplification is just moved from the ZFS layer up to the guest.
+
+Using `ashift=9` when creating the pool can lead to bad
+performance, depending on the disks underneath, and cannot be changed later on.
+
+Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use
+them, unless your environment has specific needs and characteristics where
+RAIDZ performance is acceptable.
+
+
Bootloader
~~~~~~~~~~

-Depending on whether the system is booted in EFI or legacy BIOS mode the
-{pve} installer sets up either `grub` or `systemd-boot` as main bootloader.
-See the chapter on xref:sysboot[{pve} host bootladers] for details.
+{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to manage the
+bootloader configuration.
+See the chapter on xref:sysboot[{pve} host bootloaders] for details.
ZFS Administration
@@ -172,7 +265,9 @@ manual pages, which can be read with:
# man zfs
----

-.Create a new zpool
+[[sysadmin_zfs_create_new_zpool]]
+Create a new zpool
+^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should
have the same sector-size (2 to the power of `ashift`) or larger as the
@@ -188,47 +283,58 @@ To activate compression (see section <<zfs_compression,Compression in ZFS>>):
# zfs set compression=lz4 <pool>
----

-.Create a new pool with RAID-0
+[[sysadmin_zfs_create_new_zpool_raid0]]
+Create a new pool with RAID-0
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Minimum 1 Disk
+Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

-.Create a new pool with RAID-1
+[[sysadmin_zfs_create_new_zpool_raid1]]
+Create a new pool with RAID-1
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Minimum 2 Disks
+Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

-.Create a new pool with RAID-10
+[[sysadmin_zfs_create_new_zpool_raid10]]
+Create a new pool with RAID-10
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Minimum 4 Disks
+Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

-.Create a new pool with RAIDZ-1
+[[sysadmin_zfs_create_new_zpool_raidz1]]
+Create a new pool with RAIDZ-1
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Minimum 3 Disks
+Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

-.Create a new pool with RAIDZ-2
+Create a new pool with RAIDZ-2
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Minimum 4 Disks
+Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

-.Create a new pool with cache (L2ARC)
+[[sysadmin_zfs_create_new_zpool_with_cache]]
+Create a new pool with cache (L2ARC)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated cache drive partition to increase
the performance (use SSD).

As `<device>` it is possible to use more devices, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
----

-.Create a new pool with log (ZIL)
+[[sysadmin_zfs_create_new_zpool_with_log]]
+Create a new pool with log (ZIL)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated cache drive partition to increase
the performance (use SSD).

As `<device>` it is possible to use more devices, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> log <log_device>
----

-.Add cache and log to an existing pool
+[[sysadmin_zfs_add_cache_and_log_dev]]
+Add cache and log to an existing pool
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk`.
@@ -264,44 +374,73 @@ physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

----
-# zpool add -f log cache
+# zpool add -f <pool> log <device-part1> cache <device-part2>
----

-.Changing a failed device
+[[sysadmin_zfs_change_failed_dev]]
+Changing a failed device
+^^^^^^^^^^^^^^^^^^^^^^^^

----
# zpool replace -f <pool> <old device> <new device>
----

-.Changing a failed bootable device when using systemd-boot
+.Changing a failed bootable device
+
+Depending on how {pve} was installed, it is either using `systemd-boot` or `grub`
+through `proxmox-boot-tool`
+footnote:[Systems installed with {pve} 6.4 or later, EFI systems installed with
+{pve} 5.4 or later] or plain `grub` as bootloader (see
+xref:sysboot[Host Bootloader]). You can check by running:
+
+----
+# proxmox-boot-tool status
+----
+
+The first steps of copying the partition table, reissuing GUIDs and replacing
+the ZFS partition are the same. To make the system bootable from the new disk,
+different steps are needed which depend on the bootloader in use.
----
# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
-# pve-efiboot-tool format
-# pve-efiboot-tool init
----

-NOTE: `ESP` stands for EFI System Partition, which is setup as partition #2 on
-bootable disks setup by the {pve} installer since version 5.4. For details, see
-xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].
+NOTE: Use the `zpool status -v` command to monitor how far the resilvering
+process of the new disk has progressed.

+.With `proxmox-boot-tool`:
+
+----
+# proxmox-boot-tool format <new disk's ESP>
+# proxmox-boot-tool init <new disk's ESP>
+----

-Activate E-Mail Notification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+NOTE: `ESP` stands for EFI System Partition, which is set up as partition #2 on
+bootable disks set up by the {pve} installer since version 5.4. For details, see
+xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].

-ZFS comes with an event daemon, which monitors events generated by the
-ZFS kernel module. The daemon can also send emails on ZFS events like
-pool errors. Newer ZFS packages ship the daemon in a separate package,
-and you can install it using `apt-get`:
+.With plain `grub`:

----
-# apt-get install zfs-zed
+# grub-install <new disk>
----
+NOTE: plain `grub` is only used on systems installed with {pve} 6.3 or earlier,
+which have not been manually migrated to using `proxmox-boot-tool` yet.
+
-To activate the daemon it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
-favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting:
+Configure E-Mail Notification
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ZFS comes with an event daemon `ZED`, which monitors events generated by the ZFS
+kernel module. The daemon can also send emails on ZFS events like pool errors.
+Newer ZFS packages ship the daemon in a separate `zfs-zed` package, which should
+already be installed by default in {pve}.
+
+You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with your
+favorite editor. The required setting for email notification is
+`ZED_EMAIL_ADDR`, which is set to `root` by default.

--------
ZED_EMAIL_ADDR="root"
@@ -310,32 +449,57 @@ ZED_EMAIL_ADDR="root"
Please note that {pve} forwards mails sent to `root` to the email address
configured for the root user.

-IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
-other settings are optional.
-
+[[sysadmin_zfs_limit_memory_usage]]
Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

-It is good to use at most 50 percent (which is the default) of the
-system memory for ZFS ARC to prevent performance shortage of the
-host. Use your preferred editor to change the configuration in
-`/etc/modprobe.d/zfs.conf` and insert:
+ZFS uses '50 %' of the host memory for the **A**daptive **R**eplacement
+**C**ache (ARC) by default. Allocating enough memory for the ARC is crucial for
+IO performance, so reduce it with caution. As a general rule of thumb, allocate
+at least +2 GiB Base + 1 GiB/TiB-Storage+. For example, if you have a pool with
++8 TiB+ of available storage space, then you should use +10 GiB+ of memory for
+the ARC.
+
+You can change the ARC usage limit for the current boot (a reboot resets this
+change again) by writing to the +zfs_arc_max+ module parameter directly:
+
+----
+ echo "$[10 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
+----
+
+To *permanently change* the ARC limits, add the following line to
+`/etc/modprobe.d/zfs.conf`:

--------
options zfs zfs_arc_max=8589934592
--------

-This example setting limits the usage to 8GB.
+This example setting limits the usage to 8 GiB ('8 * 2^30^').
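To check that the limit is actually in effect, you can compare the current ARC size with the configured limits (a small sketch, assuming the standard ARC statistics exposed by the ZFS kernel module at `/proc/spl/kstat/zfs/arcstats`; values are printed in GiB):

----
# awk '$1 == "c_min" || $1 == "c_max" || $1 == "size" { printf "%-6s %8.2f GiB\n", $1, $3 / 2^30 }' /proc/spl/kstat/zfs/arcstats
----

Here `c_max` should match the configured +zfs_arc_max+ value and `size` is the amount of memory the ARC currently uses.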
+
+IMPORTANT: In case your desired +zfs_arc_max+ value is lower than or equal to
++zfs_arc_min+ (which defaults to 1/32 of the system memory), +zfs_arc_max+ will
+be ignored unless you also set +zfs_arc_min+ to at most +zfs_arc_max - 1+.
+
+----
+echo "$[8 * 1024*1024*1024 - 1]" >/sys/module/zfs/parameters/zfs_arc_min
+echo "$[8 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
+----
+
+This example setting (temporarily) limits the usage to 8 GiB ('8 * 2^30^') on
+systems with more than 256 GiB of total memory, where simply setting
++zfs_arc_max+ alone would not work.

[IMPORTANT]
====
-If your root file system is ZFS you must update your initramfs every
+If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u
----
+
+You *must reboot* to activate these changes.
====

@@ -349,7 +513,7 @@ to an external Storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
-preferred to create a partition on a physical disk and use it as swapdevice.
+preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the ``swappiness'' value. A good
value for servers is 10:
@@ -429,10 +593,10 @@ All guest volumes/disks create on this storage will be encrypted with the
shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be loaded
-with `zfs load-key`:
+and the dataset needs to be mounted. This can be done in one step with:

----
-# zfs load-key tank/encrypted_data
+# zfs mount -l tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':
----

@@ -490,6 +654,7 @@ You can disable compression at any time with:

Again, only new blocks will be affected by this change.

+[[sysadmin_zfs_special_device]]
ZFS Special Device
~~~~~~~~~~~~~~~~~~

@@ -551,3 +716,38 @@ in the pool will opt in for small file blocks).

----
# zfs set special_small_blocks=0 <pool>/<filesystem>
----
+
+[[sysadmin_zfs_features]]
+ZFS Pool Features
+~~~~~~~~~~~~~~~~~
+
+Changes to the on-disk format in ZFS are only made between major version changes
+and are specified through *features*. All features, as well as the general
+mechanism, are well documented in the `zpool-features(5)` manpage.
+
+Since enabling new features can render a pool not importable by an older version
+of ZFS, this needs to be done actively by the administrator, by running
+`zpool upgrade` on the pool (see the `zpool-upgrade(8)` manpage).
+
+Unless you need to use one of the new features, there is no upside to enabling
+them.
+
+In fact, there are some downsides to enabling new features:
+
+* A system with root on ZFS that still boots using `grub` will become
+  unbootable if a new feature is active on the rpool, due to the incompatible
+  implementation of ZFS in grub.
+* The system will not be able to import any upgraded pool when booted with an
+  older kernel, which still ships with the old ZFS modules.
+* Booting an older {pve} ISO to repair a non-booting system will likewise not
+  work.
+
+IMPORTANT: Do *not* upgrade your rpool if your system is still booted with
+`grub`, as this will render your system unbootable. This includes systems
+installed before {pve} 5.4, and systems booting with legacy BIOS boot (see
+xref:sysboot_determine_bootloader_used[how to determine the bootloader]).
+
+.Enable new features for a ZFS pool:
+----
+# zpool upgrade <pool>
+----
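To review the state of all features on a pool before or after upgrading, you can list the feature properties (a brief sketch using standard `zpool` property syntax; `<pool>` is again a placeholder for the pool name):

----
# zpool get all <pool> | grep feature@
----

Features shown as `disabled` would be turned on by `zpool upgrade`, while `enabled` and `active` features are already recorded for this pool.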