ZFS on Linux
------------
include::attributes.txt[]

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. Starting with {pve} 3.4, the native Linux
kernel port of the ZFS file system is introduced as an optional
file system, and also as an additional selection for the root
file system. There is no need to manually compile ZFS modules - all
packages are included.

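To verify that the packaged ZFS kernel module is available on a host,
you can query it directly (the version printed depends on the
installed packages):

 modinfo -F version zfs
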
By using ZFS, it is possible to achieve maximum enterprise features
with low budget hardware, but also high performance systems by
leveraging SSD caching or even SSD-only setups. ZFS can replace
expensive hardware RAID cards with moderate CPU and memory load,
combined with easy management.

.General ZFS advantages

* Easy configuration and management with {pve} GUI and CLI.

* Reliable

* Protection against data corruption

* Data compression on file-system level

* Snapshots

* Copy-on-write clone

* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3

* Can use SSD for cache

* Self healing

* Continuous integrity checking

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source

* Encryption

* ...


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To
prevent data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware controller which has
its own cache management. ZFS needs to communicate directly with the
disks. An HBA adapter is the way to go, or something like an LSI
controller flashed in 'IT' mode.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use 'virtio' for the disks of that VM,
since they are not supported by ZFS. Use IDE or SCSI instead (this
also works with the 'virtio' SCSI controller type).

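For example, if the outer host is itself a {pve} system, the test VM's
disk can be attached via SCSI on a 'virtio' SCSI controller like this
(a sketch; the VM ID and storage name are placeholders):

 qm set <vmid> --scsihw virtio-scsi-pci --scsi0 <storage>:32
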
Installation as root file system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you install using the {pve} installer, you can choose ZFS for the
root file system. You need to select the RAID type at installation
time:

[horizontal]
RAID0:: Also called 'striping'. The capacity of such a volume is the
sum of the capacities of all disks. But RAID0 does not add any
redundancy, so the failure of a single drive makes the volume unusable.

RAID1:: Also called 'mirroring'. Data is written identically to all
disks. This mode requires at least 2 disks with the same size. The
resulting capacity is that of a single disk.

RAID10:: A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1:: A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2:: A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3:: A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool
called 'rpool', and installs the root file system on the ZFS subvolume
'rpool/ROOT/pve-1'.

Another subvolume called 'rpool/data' is created to store VM
images. In order to use that with the {pve} tools, the installer
creates the following configuration entry in '/etc/pve/storage.cfg':

----
zfspool: local-zfs
	pool rpool/data
	sparse
	content images,rootdir
----
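
You can check that this storage definition is known and active with
the {pve} storage manager CLI:

 pvesm status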

After installation, you can view your ZFS pool status using the
'zpool' command:

----
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda2    ONLINE       0     0     0
	    sdb2    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0

errors: No known data errors
----

The 'zfs' command is used to configure and manage your ZFS file
systems. The following command lists all file systems after
installation:

----
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
----
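
Individual properties can be inspected with 'zfs get'; for example, to
see space usage and the achieved compression ratio of the pool:

 zfs get used,available,compressratio rpool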


Bootloader
~~~~~~~~~~

The default ZFS disk partitioning scheme does not use the first 2048
sectors. This gives enough room to install a GRUB boot partition. The
{pve} installer automatically allocates that space, and installs the
GRUB boot loader there. If you use a redundant RAID setup, it installs
the boot loader on all disks required for booting, so you can boot
even if some disks fail.

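If you later replace one of those disks, reinstall the boot loader on
the new disk manually; a minimal sketch, assuming the hypothetical
replacement disk '/dev/sdX':

 grub-install /dev/sdX
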
NOTE: It is not possible to use ZFS as root partition with UEFI
boot.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are 'zfs' and 'zpool'. Both commands come with great
manual pages, which are worth reading:

----
# man zpool
# man zfs
----

.Create a new ZPool

To create a new pool, at least one disk is needed. The 'ashift' value
should be chosen so that 2^'ashift' is equal to or larger than the
sector size of the underlying disks.

 zpool create -f -o ashift=12 <pool> <device>

To activate compression:

 zfs set compression=lz4 <pool>

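To check that compression is in effect and how well the stored data
compresses, query the corresponding properties:

 zfs get compression,compressratio <pool>
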
.Create a new pool with RAID-0

Minimum 1 Disk

 zpool create -f -o ashift=12 <pool> <device1> <device2>

.Create a new pool with RAID-1

Minimum 2 Disks

 zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

.Create a new pool with RAID-10

Minimum 4 Disks

 zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

.Create a new pool with RAIDZ-1

Minimum 3 Disks

 zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

.Create a new pool with RAIDZ-2

Minimum 4 Disks

 zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

.Create a new pool with Cache (L2ARC)

It is possible to use a dedicated cache drive partition to increase
the performance (use an SSD).

Instead of a single '<device>', you can use multiple devices, as shown
in "Create a new pool with RAID*".

 zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

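After creation, the cache device shows up as a separate vdev in the
pool layout; its activity can be watched together with the rest of the
pool, for example at a two second interval:

 zpool iostat -v <pool> 2
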
.Create a new pool with Log (ZIL)

It is possible to use a dedicated drive partition as log device to
increase the performance (use an SSD).

Instead of a single '<device>', you can use multiple devices, as shown
in "Create a new pool with RAID*".

 zpool create -f -o ashift=12 <pool> <device> log <log_device>

.Add Cache and Log to an existing pool

If you have a pool without cache and log, first partition the SSD into
two partitions with 'parted' or 'gdisk'.

IMPORTANT: Always use GPT partition tables (gdisk or parted).

The maximum size of a log device should be about half the size of
physical memory, so it is usually quite small. The rest of the SSD
can be used as cache.

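A possible layout with 'parted', assuming a hypothetical SSD
'/dev/sdX', 8GB of physical memory, and therefore a 4GB log partition
(adjust the sizes to your system):

 parted -s /dev/sdX mklabel gpt
 parted -s /dev/sdX mkpart log 1MiB 4GiB
 parted -s /dev/sdX mkpart cache 4GiB 100%

Afterwards, attach both partitions to the pool:
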
 zpool add -f <pool> log <device-part1> cache <device-part2>

.Changing a failed Device

 zpool replace -f <pool> <old device> <new-device>
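
For example, to replace the failed disk 'sdd' with the new disk 'sde'
in a pool named 'tank' (hypothetical names), and then watch the
resilver progress:

 zpool replace -f tank sdd sde
 zpool status tank

For the boot pool 'rpool', remember to also reinstall the boot loader
on the new disk, as described in the Bootloader section above.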


Activate E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send e-mails on ZFS events like
pool errors.

To activate the daemon, it is necessary to edit '/etc/zfs/zed.d/zed.rc'
with your favorite editor, and uncomment the 'ZED_EMAIL_ADDR' setting:

 ZED_EMAIL_ADDR="root"

Please note that {pve} forwards mails to 'root' to the email address
configured for the root user.

IMPORTANT: The only setting that is required is 'ZED_EMAIL_ADDR'. All
other settings are optional.
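
Afterwards, make sure the event daemon is running; on Debian-based
systems the service is typically named 'zfs-zed':

 systemctl status zfs-zed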


Limit ZFS memory usage
~~~~~~~~~~~~~~~~~~~~~~

It is good to use at most 50 percent of the system memory for the ZFS
ARC, to prevent performance degradation of the host. Use your
preferred editor to change the configuration in
'/etc/modprobe.d/zfs.conf' and insert:

 options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8GB.

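The value is given in bytes (8GB = 8 * 1024^3 = 8589934592 bytes). The
effective limit can be read back from the ARC statistics exported by
the kernel module:

 grep c_max /proc/spl/kstat/zfs/arcstats
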
[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

 update-initramfs -u
====


.SWAP on ZFS

SWAP on ZFS on Linux can cause problems, like blocking the server or
generating a high IO load. This is often seen when starting a backup
to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Additionally, you can lower the
'swappiness' value. A good value for servers is 10:

 sysctl -w vm.swappiness=10

To make the swappiness value persistent, open '/etc/sysctl.conf' with
an editor of your choice and add the following line:

 vm.swappiness = 10

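The currently active value can be checked at any time with:

 sysctl vm.swappiness
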
.Linux Kernel 'swappiness' parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value               | Strategy
| vm.swappiness = 0   | The kernel will swap only to avoid
an 'out of memory' condition
| vm.swappiness = 1   | Minimum amount of swapping without
disabling it entirely.
| vm.swappiness = 10  | This value is sometimes recommended to
improve performance when sufficient memory exists in a system.
| vm.swappiness = 60  | The default value.
| vm.swappiness = 100 | The kernel will swap aggressively.
|===========================================================