ZFS on Linux
------------

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it is possible to achieve enterprise-class features with
low-budget hardware, as well as high-performance systems by leveraging SSD
caching or even SSD-only setups. ZFS can replace expensive hardware RAID
cards with moderate CPU and memory load, combined with easy management.

General ZFS advantages:

* Easy configuration and management with GUI and CLI.
* Reliable
* Protection against data corruption
* Data compression on file system level
* Snapshots
* Copy-on-write clone
* Various RAID levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
* Can use SSD for cache
* Self healing
* Continuous integrity checking
* Designed for high storage capacities
* Asynchronous replication over network
* Open Source
* Encryption

Hardware
~~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
own cache management. ZFS needs to communicate directly with the disks. An
HBA adapter, or something like an LSI controller flashed in ``IT`` mode, is
the way to go.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

.. code-block:: console

  # man zpool
  # man zfs

Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` value should
match the sector size of the underlying disk, that is, 2 to the power of
`ashift` should be equal to or larger than the disk's sector size.

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device>

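For example, `ashift=12` corresponds to 2^12 = 4096 byte (4K) sectors. If you
are unsure which value fits your disk, you can check its physical and logical
sector size beforehand; the device name `/dev/sda` below is only a placeholder:

.. code-block:: console

  # lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sda
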
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

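Whichever layout you choose, you can verify the resulting pool structure and
health afterwards; `<pool>` is the name you picked above:

.. code-block:: console

  # zpool status <pool>
  # zfs list <pool>
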
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated cache drive partition to increase
the performance (use an SSD).

For `<device>`, you can also specify multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive partition as a log device to
increase the performance (use an SSD).

For `<device>`, you can also specify multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> log <log_device>

Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, first partition the SSD into two
partitions with `parted` or `gdisk`.

.. important:: Always use GPT partition tables.

The maximum size of a log device should be about half the size of physical
memory, so it is usually quite small. The rest of the SSD can be used as
cache.

.. code-block:: console

  # zpool add -f <pool> log <device-part1> cache <device-part2>

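As an illustration only, the SSD could be split with `sgdisk` as follows; the
device name `/dev/sdf` and the 16G log size (about half the RAM of a 32G host)
are assumptions, not recommendations:

.. code-block:: console

  # sgdisk -n1:0:+16G /dev/sdf   # partition 1: log device
  # sgdisk -N2 /dev/sdf          # partition 2: cache, remaining space
  # zpool add -f <pool> log /dev/sdf1 cache /dev/sdf2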

Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: console

  # zpool replace -f <pool> <old device> <new device>


Changing a failed bootable device
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depending on how Proxmox Backup was installed, it uses either `grub` or
`systemd-boot` as the bootloader.

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed, depending on the bootloader in use.

.. code-block:: console

  # sgdisk <healthy bootable device> -R <new device>
  # sgdisk -G <new device>
  # zpool replace -f <pool> <old zfs partition> <new zfs partition>

.. NOTE:: Use the `zpool status -v` command to monitor how far the resilvering
   process of the new disk has progressed.

With `systemd-boot`:

.. code-block:: console

  # pve-efiboot-tool format <new disk's ESP>
  # pve-efiboot-tool init <new disk's ESP>

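For example, if the new disk is `/dev/sdb` and its ESP is partition #2 (see
the note below), the commands could look like this; the device name is only
an assumption:

.. code-block:: console

  # pve-efiboot-tool format /dev/sdb2
  # pve-efiboot-tool init /dev/sdb2
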
.. NOTE:: `ESP` stands for EFI System Partition, which is set up as partition
   #2 on bootable disks set up by the Proxmox VE installer since version 5.4.
   For details, see
   xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].

With `grub`:

Usually `grub.cfg` is located in `/boot/grub/grub.cfg`.

.. code-block:: console

  # grub-install <new disk>
  # grub-mkconfig -o /path/to/grub.cfg

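A minimal concrete sketch, assuming the new disk is `/dev/sdb` and the default
`grub.cfg` location shown above:

.. code-block:: console

  # grub-install /dev/sdb
  # grub-mkconfig -o /boot/grub/grub.cfg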

Activate E-Mail Notification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send emails on ZFS events like
pool errors. Newer ZFS packages ship the daemon in a separate package,
and you can install it using `apt-get`:

.. code-block:: console

  # apt-get install zfs-zed

To activate the daemon, it is necessary to edit `/etc/zfs/zed.d/zed.rc` with
your favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting:

.. code-block:: console

  ZED_EMAIL_ADDR="root"

Please note that Proxmox Backup forwards mail addressed to `root` to the
email address configured for the root user.

IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
other settings are optional.
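
After editing `zed.rc`, the daemon usually has to be restarted to pick up the
change; the service name below is the one shipped by the Debian `zfs-zed`
package:

.. code-block:: console

  # systemctl restart zfs-zed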

Limit ZFS Memory Usage
^^^^^^^^^^^^^^^^^^^^^^

It is good to use at most 50 percent (which is the default) of the system
memory for the ZFS ARC, to prevent performance bottlenecks on the host. Use
your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

.. code-block:: console

  options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8GB.
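
The value is given in bytes; 8GB corresponds to 8 * 1024 * 1024 * 1024 =
8589934592 bytes. You can check the limit that is currently in effect via the
ARC statistics exposed by the kernel module:

.. code-block:: console

  # grep c_max /proc/spl/kstat/zfs/arcstats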

.. IMPORTANT:: If your root file system is ZFS, you must update your initramfs
   every time this value changes:

.. code-block:: console

  # update-initramfs -u


SWAP on ZFS
^^^^^^^^^^^

Swap space created on a zvol may cause some problems, such as blocking the
server or generating a high IO load, often seen when starting a backup to an
external storage.

We strongly recommend using enough memory, so that you normally do not run
into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap
device. You can leave some space free for this purpose in the advanced
options of the installer. Additionally, you can lower the `swappiness` value.
A good value for servers is 10:

.. code-block:: console

  # sysctl -w vm.swappiness=10

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

.. code-block:: console

  vm.swappiness = 10

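You can verify the currently active value with:

.. code-block:: console

  # sysctl vm.swappiness
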
.. table:: Linux kernel `swappiness` parameter values
   :widths: auto

   ==================== ===============================================================
   Value                Strategy
   ==================== ===============================================================
   vm.swappiness = 0    The kernel will swap only to avoid an 'out of memory' condition
   vm.swappiness = 1    Minimum amount of swapping without disabling it entirely.
   vm.swappiness = 10   Sometimes recommended to improve performance when sufficient memory exists in a system.
   vm.swappiness = 60   The default value.
   vm.swappiness = 100  The kernel will swap aggressively.
   ==================== ===============================================================

ZFS Compression
^^^^^^^^^^^^^^^

To activate compression:

.. code-block:: console

  # zfs set compression=lz4 <pool>

We recommend using the `lz4` algorithm, since it adds very little CPU
overhead. Other algorithms such as `lzjb` and `gzip-N` (where `N` is an
integer `1-9` representing the compression level, 1 being fastest and 9 best
compression) are also available. Depending on the algorithm and how
compressible the data is, having compression enabled can even increase I/O
performance.
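
To see how well your data actually compresses, you can query the
`compressratio` property of the pool or of a single dataset:

.. code-block:: console

  # zfs get compressratio <pool>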

You can disable compression at any time with:

.. code-block:: console

  # zfs set compression=off <dataset>

Only new blocks will be affected by this change.

ZFS Special Device
^^^^^^^^^^^^^^^^^^

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

.. IMPORTANT:: The redundancy of the `special` device should match the one of the
   pool, since the `special` device is a point of failure for the whole pool.

.. WARNING:: Adding a `special` device to a pool cannot be undone!

Create a pool with `special` device and RAID-1:

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a `special` device to an existing pool with RAID-1:

.. code-block:: console

  # zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power
of two in the range between `512B` and `128K`. After setting the property,
new file blocks smaller than `size` will be allocated on the `special` device.

.. IMPORTANT:: If the value for `special_small_blocks` is greater than or equal to
   the `recordsize` (default `128K`) of the dataset, *all* data will be written to
   the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all containers
in the pool will opt in for small file blocks).

Opt in for all file blocks smaller than 4K, pool-wide:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=0 <pool>/<filesystem>
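
To check which value is in effect for a dataset and where it is inherited
from, query the property with `zfs get` (the SOURCE column shows the origin):

.. code-block:: console

  # zfs get special_small_blocks <pool>/<filesystem>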

Troubleshooting
^^^^^^^^^^^^^^^

Corrupted cachefile

`zfs-import-cache.service` imports ZFS pools using the ZFS cachefile. If this
file becomes corrupted, the service may fail to import pools that are not
present in it, and the corresponding volumes will not be mounted during boot
until they are mounted manually later.

To regenerate the cachefile, run the following command for each pool:

.. code-block:: console

  # zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

and afterwards update the `initramfs` by running:

.. code-block:: console

  # update-initramfs -u -k all

and finally reboot your node.

Another workaround to this problem is enabling the `zfs-import-scan.service`,
which searches for and imports pools via device scanning (usually slower).
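
A minimal sketch of enabling the scan-based import; the unit name is the one
shipped with the OpenZFS packages:

.. code-block:: console

  # systemctl enable --now zfs-import-scan.service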