X-Git-Url: https://git.proxmox.com/?a=blobdiff_plain;f=local-zfs.adoc;h=89ab8bd847d2bc5da9c5ee8c95ab3df6b0f3ca3a;hb=1658c673519fea697132d9a24c42a5a3a4667586;hp=76a1ac209391df5169714956262cc765ced42fed;hpb=11a6e022148636689094664285937c205498a890;p=pve-docs.git

diff --git a/local-zfs.adoc b/local-zfs.adoc
index 76a1ac2..89ab8bd 100644
--- a/local-zfs.adoc
+++ b/local-zfs.adoc
@@ -151,6 +151,101 @@ rpool/swap 4.25G 7.69T 64K -
 ----
 
+[[sysadmin_zfs_raid_considerations]]
+ZFS RAID Level Considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There are a few factors to take into consideration when choosing the layout of
+a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
+`vdev`. All vdevs in a pool are used equally and the data is striped among them
+(RAID0). Check the `zpool(8)` manpage for more details on vdevs.
+
+[[sysadmin_zfs_raid_performance]]
+Performance
+^^^^^^^^^^^
+
+Each `vdev` type has different performance behaviors. The two parameters of
+interest are the IOPS (Input/Output Operations per Second) and the bandwidth
+with which data can be written or read.
+
+A 'mirror' vdev (RAID1) will approximately behave like a single disk with
+regard to both parameters when writing data. When reading data, it will behave
+like the number of disks in the mirror.
+
+A common situation is to have 4 disks. When setting them up as 2 mirror vdevs
+(RAID10), the pool will have the write characteristics of two single disks in
+terms of IOPS and bandwidth. For read operations, it will resemble 4 single
+disks.
+
+A 'RAIDZ' vdev of any redundancy level will approximately behave like a single
+disk in terms of IOPS, but with a lot of bandwidth. How much bandwidth depends
+on the size of the RAIDZ vdev and the redundancy level.
+
+For running VMs, IOPS is the more important metric in most situations.
+
+
+[[sysadmin_zfs_raid_size_space_usage_redundancy]]
+Size, Space usage and Redundancy
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+While a pool made of 'mirror' vdevs will have the best performance
+characteristics, the usable space will be 50% of the raw disk capacity. It is
+even less if a mirror vdev consists of more than 2 disks, for example in a
+3-way mirror. At least one healthy disk per mirror is needed for the pool to
+stay functional.
+
+The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with P being
+the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail
+without losing data. A special case is a 4 disk pool with RAIDZ2. In this
+situation it is usually better to use 2 mirror vdevs, as they offer better
+performance while the usable space stays the same.
+
+Another important factor when using any RAIDZ level is how ZVOL datasets, which
+are used for VM disks, behave. For each data block, the pool needs parity data
+which is at least the size of the minimum block size defined by the `ashift`
+value of the pool. With an `ashift` of 12, the block size of the pool is 4k.
+The default block size for a ZVOL is 8k. Therefore, in a RAIDZ2, each 8k block
+written will cause two additional 4k parity blocks to be written:
+8k + 4k + 4k = 16k. This is of course a simplified approach, and the real
+situation will be slightly different, as metadata, compression and such are not
+accounted for in this example.
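+
+As a rough, hypothetical illustration using only the numbers from above
+(RAIDZ2, `ashift=12`, 8k `volblocksize`), a fully written 32 GiB ZVOL would end
+up using roughly twice its nominal size on the pool:
+
+----
+data per 8k block written:    8k
+parity per 8k block written:  2 x 4k = 8k
+total per 8k block written:   16k  (factor 2)
+
+32 GiB of data  ->  roughly 64 GiB used on the pool
+----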
+
+This behavior can be observed when checking the following properties of the
+ZVOL:
+
+ * `volsize`
+ * `refreservation` (if the pool is not thin provisioned)
+ * `used` (if the pool is thin provisioned and without snapshots present)
+
+----
+# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
+----
+
+`volsize` is the size of the disk as it is presented to the VM, while
+`refreservation` shows the reserved space on the pool, which includes the
+expected space needed for the parity data. If the pool is thin provisioned, the
+`refreservation` will be set to 0. Another way to observe the behavior is to
+compare the used disk space within the VM and the `used` property. Be aware
+that snapshots will skew the value.
+
+There are a few options to counter the increased use of space:
+
+* Increase the `volblocksize` to improve the data to parity ratio
+* Use 'mirror' vdevs instead of 'RAIDZ'
+* Use `ashift=9` (block size of 512 bytes)
+
+The `volblocksize` property can only be set when creating a ZVOL. The default
+value can be changed in the storage configuration. When doing this, the guest
+needs to be tuned accordingly and, depending on the use case, the problem of
+write amplification is just moved from the ZFS layer up to the guest.
+
+Using `ashift=9` when creating the pool can lead to bad performance, depending
+on the disks underneath, and cannot be changed later on.
+
+Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use
+them, unless your environment has specific needs and characteristics where the
+performance characteristics of RAIDZ are acceptable.
+
+
 Bootloader
 ~~~~~~~~~~
 
@@ -172,7 +267,9 @@ manual pages, which can be read with:
 # man zfs
 -----
 
-.Create a new zpool
+[[sysadmin_zfs_create_new_zpool]]
+Create a new zpool
+^^^^^^^^^^^^^^^^^^
 
 To create a new pool, at least one disk is needed. The `ashift` should
 have the same sector-size (2 power of `ashift`) or larger as the
@@ -188,7 +285,9 @@ To activate compression (see section <<zfs_compression,Compression in ZFS>>):
 # zfs set compression=lz4 <pool>
 ----
 
-.Create a new pool with RAID-0
+[[sysadmin_zfs_create_new_zpool_raid0]]
+Create a new pool with RAID-0
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Minimum 1 disk
 
@@ -196,7 +295,9 @@ Minimum 1 disk
 # zpool create -f -o ashift=12 <pool> <device1> <device2>
 ----
 
-.Create a new pool with RAID-1
+[[sysadmin_zfs_create_new_zpool_raid1]]
+Create a new pool with RAID-1
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Minimum 2 disks
 
@@ -204,7 +305,9 @@ Minimum 2 disks
 # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
 ----
 
-.Create a new pool with RAID-10
+[[sysadmin_zfs_create_new_zpool_raid10]]
+Create a new pool with RAID-10
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Minimum 4 disks
 
@@ -212,7 +315,9 @@ Minimum 4 disks
 # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
 ----
 
-.Create a new pool with RAIDZ-1
+[[sysadmin_zfs_create_new_zpool_raidz1]]
+Create a new pool with RAIDZ-1
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Minimum 3 disks
 
@@ -220,7 +325,8 @@ Minimum 3 disks
 # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
 ----
 
-.Create a new pool with RAIDZ-2
+Create a new pool with RAIDZ-2
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Minimum 4 disks
 
@@ -228,7 +334,9 @@ Minimum 4 disks
 # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
 ----
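+
+Whichever layout is chosen, it can be worth double-checking the result after
+creating the pool. A minimal, hypothetical check, assuming the pool was named
+`tank` (standard `zpool`/`zfs` tooling):
+
+----
+# zpool status tank
+# zpool get ashift tank
+# zfs get compression tank
+----
+
+`zpool status` shows the resulting vdev layout, while the two `get` commands
+print the `ashift` and compression settings discussed above.
+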
-.Create a new pool with cache (L2ARC)
+[[sysadmin_zfs_create_new_zpool_with_cache]]
+Create a new pool with cache (L2ARC)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 It is possible to use a dedicated cache drive partition to increase
 the performance (use an SSD).
 
@@ -240,7 +348,9 @@ As `<device>` it is possible to use more devices, like it's shown in
 # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
 ----
 
-.Create a new pool with log (ZIL)
+[[sysadmin_zfs_create_new_zpool_with_log]]
+Create a new pool with log (ZIL)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 It is possible to use a dedicated drive partition as log device to increase
 the performance (use an SSD).
 
@@ -252,7 +362,9 @@ As `<device>` it is possible to use more devices, like it's shown in
 # zpool create -f -o ashift=12 <pool> <device> log <log_device>
 ----
 
-.Add cache and log to an existing pool
+[[sysadmin_zfs_add_cache_and_log_dev]]
+Add cache and log to an existing pool
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 If you have a pool without cache and log, first partition the SSD into
 2 partitions with `parted` or `gdisk`
 
@@ -267,7 +379,9 @@ can be used as cache.
 # zpool add -f <pool> log <device-part1> cache <device-part2>
 ----
 
-.Changing a failed device
+[[sysadmin_zfs_change_failed_dev]]
+Changing a failed device
+^^^^^^^^^^^^^^^^^^^^^^^^
 
 ----
 # zpool replace -f <pool> <old device> <new device>
@@ -288,10 +402,10 @@ different steps are needed which depend on the bootloader in use.
 # zpool replace -f <pool> <old zfs partition> <new zfs partition>
 ----
 
-NOTE: Use the `zpool status -v` command to monitor how far the resivlering
+NOTE: Use the `zpool status -v` command to monitor how far the resilvering
 process of the new disk has progressed.
 
-With `systemd-boot`:
+.With `systemd-boot`:
 
 ----
 # pve-efiboot-tool format <new disk's ESP>
 # pve-efiboot-tool init <new disk's ESP>
 ----
 
@@ -302,7 +416,7 @@ NOTE: `ESP` stands for EFI System Partition, which is set up as partition #2 on
 bootable disks set up by the {pve} installer since version 5.4. For details, see
 xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].
 
-With `grub`:
+.With `grub`:
 
 ----
 # grub-install <new disk>
 ----
@@ -334,6 +448,7 @@
 IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
 other settings are optional.
 
+[[sysadmin_zfs_limit_memory_usage]]
 Limit ZFS Memory Usage
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -449,10 +564,10 @@ All guest volumes/disks created on this storage will be encrypted with the
 shared key material of the parent dataset.
 
 To actually use the storage, the associated key material needs to be loaded
-with `zfs load-key`:
+and the dataset needs to be mounted. This can be done in one step with:
 
 ----
-# zfs load-key tank/encrypted_data
+# zfs mount -l tank/encrypted_data
 Enter passphrase for 'tank/encrypted_data':
 ----
@@ -510,6 +625,7 @@ You can disable compression at any time with:
 
 Again, only new blocks will be affected by this change.
 
+[[sysadmin_zfs_special_device]]
 ZFS Special Device
 ~~~~~~~~~~~~~~~~~~
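+
+A special device is a fast (typically SSD/NVMe) vdev that stores metadata and,
+optionally, small data blocks. As a hypothetical sketch only, with `tank` and
+the device names used as placeholders, a mirrored special device could be added
+to an existing pool like this:
+
+----
+# zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1
+# zfs set special_small_blocks=4K tank
+----
+
+A mirror is advisable here, since losing the special device also means losing
+the pool.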