.. _chapter-zfs:

ZFS on Linux
------------

ZFS is a combined file system and logical volume manager, originally designed
by Sun Microsystems. There is no need to compile the ZFS modules manually,
since all required packages are included.

ZFS makes enterprise-grade storage features available on low-budget hardware,
and also supports high-performance systems that use SSD caching or SSD-only
setups. It can replace expensive hardware RAID cards, at the cost of moderate
CPU and memory load, and is easy to manage.

General advantages of ZFS:

* Easy configuration and management with GUI and CLI
* Reliable
* Protection against data corruption
* Data compression on file system level
* Snapshots
* Copy-on-write clones
* Various RAID levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
* Can use SSDs for cache
* Self-healing
* Continuous integrity checking
* Designed for high storage capacities
* Asynchronous replication over network
* Open source
* Encryption

Hardware
~~~~~~~~~

ZFS depends heavily on memory, so it's recommended to have at least 8 GB to
start. In practice, use as much as you can get for your hardware/budget. To
prevent data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (for example, Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

.. IMPORTANT:: Do not use ZFS on top of a hardware controller which has its
   own cache management. ZFS needs to communicate directly with the disks. An
   HBA, or something like an LSI controller flashed in ``IT`` mode, is
   recommended.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with extensive
manual pages, which can be read with:

.. code-block:: console

  # man zpool
  # man zfs

Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The sector size used by the
pool is 2 to the power of `ashift` bytes. It should be equal to or larger than
the sector size of the underlying disk.

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device>

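The relation between sector size and `ashift` can be checked with a few lines
of shell. The following is a minimal sketch, not part of the ZFS tooling: the
function name `ashift_for_sector_size` is made up for this example, and the
script only does the arithmetic without touching any disk.

.. code-block:: bash

  ashift_for_sector_size() {
    # Print the smallest ashift for which 2^ashift >= sector size (in bytes).
    size=$1
    ashift=0
    while [ $((1 << ashift)) -lt "$size" ]; do
      ashift=$((ashift + 1))
    done
    echo "$ashift"
  }

  ashift_for_sector_size 512    # prints 9
  ashift_for_sector_size 4096   # prints 12

The sector size that a disk reports can be listed with
`lsblk -o NAME,PHY-SEC`. A disk that reports 4096 bytes matches the
`ashift=12` used in the examples of this section.
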
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive, or a partition of it, as a cache to
increase performance. Use an SSD for this.

For `<device>`, you can use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive, or a partition of it, as a log device
to increase performance. Use an SSD for this.

For `<device>`, you can use multiple devices, as shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> log <log_device>

Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can add cache and log devices to a pool after its creation. In this example,
we will use a single drive for both cache and log. First, you need to create
two partitions on the SSD with `parted` or `gdisk`.

.. important:: Always use GPT partition tables.

The maximum size of a log device should be about half the size of the
physical memory, so it is usually quite small. The rest of the SSD
can be used as cache.

.. code-block:: console

  # zpool add -f <pool> log <device-part1> cache <device-part2>


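To size the log partition according to the rule of thumb above, half of the
installed memory can be derived from `/proc/meminfo`. This is a minimal
sketch. The helper name `log_size_kib` is made up for this example, and the
result is only a suggestion for the partitioning step.

.. code-block:: bash

  log_size_kib() {
    # Half of the given memory size; input and output are in KiB.
    echo $(( $1 / 2 ))
  }

  # MemTotal in /proc/meminfo is reported in KiB.
  if [ -r /proc/meminfo ]; then
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo "Suggested maximum log partition size: $(log_size_kib "$mem_kib") KiB"
  fi

On a host with 16 GiB of memory (16777216 KiB), this suggests a log partition
of at most 8388608 KiB, which is 8 GiB.
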
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: console

  # zpool replace -f <pool> <old device> <new device>


Changing a failed bootable device
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depending on how Proxmox Backup was installed, it uses either `grub` or
`systemd-boot` as a bootloader.

In either case, the first steps of copying the partition table, reissuing GUIDs
and replacing the ZFS partition are the same. The steps needed to make the
system bootable from the new disk depend on the bootloader in use.

.. code-block:: console

  # sgdisk <healthy bootable device> -R <new device>
  # sgdisk -G <new device>
  # zpool replace -f <pool> <old zfs partition> <new zfs partition>

.. NOTE:: Use the `zpool status -v` command to monitor how far the resilvering process of the new disk has progressed.

With `systemd-boot`:

.. code-block:: console

  # proxmox-boot-tool format <new ESP>
  # proxmox-boot-tool init <new ESP>

.. NOTE:: `ESP` stands for EFI System Partition, which is set up as partition #2 on
  bootable disks set up by the `Proxmox Backup`_ installer. For details, see
  :ref:`Setting up a new partition for use as synced ESP <systembooting-proxmox-boot-setup>`.

With `grub`:

Usually, `grub.cfg` is located at `/boot/grub/grub.cfg`.

.. code-block:: console

  # grub-install <new disk>
  # grub-mkconfig -o /path/to/grub.cfg


Activate e-mail notification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ZFS comes with an event daemon, ``ZED``, which monitors events generated by the
ZFS kernel module. The daemon can also send emails upon ZFS events, such as pool
errors. Newer ZFS packages ship the daemon in a separate package ``zfs-zed``,
which should already be installed by default in `Proxmox Backup`_.

You can configure the daemon via the file ``/etc/zfs/zed.d/zed.rc``, using your
preferred editor. The required setting for email notification is
``ZED_EMAIL_ADDR``, which is set to ``root`` by default.

.. code-block:: console

  ZED_EMAIL_ADDR="root"

Please note that `Proxmox Backup`_ forwards mail addressed to `root` to the
email address configured for the root user.


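As an illustration, a ``zed.rc`` fragment that sends notifications to a
dedicated address could look like the following sketch. The address is a
placeholder. ``ZED_NOTIFY_INTERVAL_SECS`` and ``ZED_NOTIFY_VERBOSE`` are
further options found in the stock ``zed.rc`` that are not covered by this
section, so check the comments in your installed file before relying on them.

.. code-block:: bash

  # Send notifications to this address instead of root (placeholder address).
  ZED_EMAIL_ADDR="admin@example.com"

  # Minimum number of seconds between notifications for a similar event.
  ZED_NOTIFY_INTERVAL_SECS=3600

  # Set to 1 to also be notified about events that leave the pool healthy,
  # such as a finished scrub.
  ZED_NOTIFY_VERBOSE=1

Changes to this file typically take effect after the daemon is restarted, for
example with `systemctl restart zfs-zed.service`.
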
Limit ZFS memory usage
^^^^^^^^^^^^^^^^^^^^^^

It is good practice to use at most 50 percent (which is the default) of the
system memory for the ZFS ARC, to prevent performance degradation of the
host. Use your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

.. code-block:: console

  options zfs zfs_arc_max=8589934592

The above example limits the usage to 8 GiB (8 * 2^30 bytes).

.. IMPORTANT:: In case your desired `zfs_arc_max` value is lower than or equal
   to `zfs_arc_min` (which defaults to 1/32 of the system memory), `zfs_arc_max`
   will be ignored. Thus, for it to work in this case, you must set
   `zfs_arc_min` to at most `zfs_arc_max - 1`. This would require updating the
   configuration in `/etc/modprobe.d/zfs.conf`, with:

.. code-block:: console

  options zfs zfs_arc_min=8589934591
  options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8 GiB (8 * 2^30 bytes) on
systems with more than 256 GiB of total memory. On such systems, the default
`zfs_arc_min` of 1/32 of the system memory exceeds 8 GiB, so setting
`zfs_arc_max` alone would not work.

.. IMPORTANT:: If your root file system is ZFS, you must update your initramfs
   every time this value changes.

.. code-block:: console

  # update-initramfs -u


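The byte values used for `zfs_arc_max` and `zfs_arc_min` can be computed with
shell arithmetic. The following is a minimal sketch. The helper names
`gib_to_bytes` and `default_arc_min_bytes` are made up for this example, and
the 1/32 default is taken from the note above rather than read from the
running system.

.. code-block:: bash

  gib_to_bytes() {
    # 1 GiB = 2^30 bytes
    echo $(( $1 * 1024 * 1024 * 1024 ))
  }

  default_arc_min_bytes() {
    # Assumption from the text: zfs_arc_min defaults to 1/32 of the memory.
    echo $(( $1 / 32 ))
  }

  arc_max=$(gib_to_bytes 8)
  echo "options zfs zfs_arc_max=$arc_max"   # options zfs zfs_arc_max=8589934592

  # With 256 GiB of memory, the default minimum already equals 8 GiB.
  default_arc_min_bytes "$(gib_to_bytes 256)"   # prints 8589934592

On a running system, the value currently in effect can be read from
`/sys/module/zfs/parameters/zfs_arc_max`, where `0` means that the built-in
default is used.
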
Swap on ZFS
^^^^^^^^^^^

Swap space created on a zvol may cause some issues, such as blocking the
server or generating a high IO load.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the `swappiness` value.
A good value for servers is 10:

.. code-block:: console

  # sysctl -w vm.swappiness=10

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

.. code-block:: console

  vm.swappiness = 10

.. table:: Linux kernel `swappiness` parameter values
  :widths: auto

  ====================  ===============================================================
   Value                Strategy
  ====================  ===============================================================
   vm.swappiness = 0    The kernel will swap only to avoid an 'out of memory' condition.
   vm.swappiness = 1    Minimum amount of swapping without disabling it entirely.
   vm.swappiness = 10   Sometimes recommended to improve performance when sufficient memory exists in a system.
   vm.swappiness = 60   The default value.
   vm.swappiness = 100  The kernel will swap aggressively.
  ====================  ===============================================================

ZFS compression
^^^^^^^^^^^^^^^

To activate compression:

.. code-block:: console

  # zfs set compression=lz4 <pool>

We recommend using the `lz4` algorithm, since it adds very little CPU overhead.
Other algorithms such as `lzjb`, `zstd` and `gzip-N` (where `N` is an integer
from `1` to `9` representing the compression level, where 1 is fastest and 9
gives the best compression) are also available. Depending on the algorithm and
how compressible the data is, having compression enabled can even increase I/O
performance.

You can disable compression at any time with:

.. code-block:: console

  # zfs set compression=off <dataset>

Only new blocks will be affected by this change.

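To check which algorithm is active and how well the existing data compresses,
the `compression` and `compressratio` properties can be queried. The wrapper
below is a small sketch. The function name `show_compression` is made up for
this example, and it only reads properties without changing anything.

.. code-block:: bash

  show_compression() {
    # -H prints tab-separated output without a header line,
    # -o selects the columns to print.
    zfs get -H -o name,property,value compression,compressratio "$1"
  }

Call it with the name of a pool or dataset, for example
`show_compression <dataset>`.
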
.. _local_zfs_special_device:

ZFS special device
^^^^^^^^^^^^^^^^^^

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

.. IMPORTANT:: The redundancy of the `special` device should match that of the
  pool, since the `special` device is a point of failure for the entire pool.

.. WARNING:: Adding a `special` device to a pool cannot be undone!

To create a pool with `special` device and RAID-1:

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a `special` device to an existing pool with RAID-1:

.. code-block:: console

  # zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range from `512B` to `128K`. After setting this property, new file
blocks smaller than or equal to `size` will be allocated on the `special`
device.

.. IMPORTANT:: If the value for `special_small_blocks` is greater than or equal to
  the `recordsize` (default `128K`) of the dataset, *all* data will be written to
  the `special` device, so be careful!

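The two rules above can be checked before a value is applied. The following
is a minimal sketch that works on plain byte values, so `4K` is written as
`4096`. Both function names are made up for this example, and the accepted
range is the one stated in this section.

.. code-block:: bash

  is_valid_small_blocks_size() {
    # Accept 0, or a power of two between 512 and 131072 bytes (128K).
    n=$1
    [ "$n" -eq 0 ] && return 0
    [ "$n" -ge 512 ] && [ "$n" -le 131072 ] && [ $(( n & (n - 1) )) -eq 0 ]
  }

  writes_all_data_to_special() {
    # True if the threshold ($1) is >= the recordsize ($2), both in bytes.
    [ "$1" -ge "$2" ]
  }

  is_valid_small_blocks_size 4096 && echo "4096 is a valid size"
  writes_all_data_to_special 131072 131072 && echo "all data would go to the special device"

Both messages are printed here, since 4096 is a power of two within the range
and a threshold of 128K equals the default `recordsize`.
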
Setting the `special_small_blocks` property on a pool changes the default
value of that property for all child ZFS datasets. This means that every
dataset in the pool opts in for small file blocks, unless it overrides the
property.

Opt in for all file blocks of 4K or smaller, pool-wide:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=0 <pool>/<filesystem>

Troubleshooting
^^^^^^^^^^^^^^^

Corrupt cache file
""""""""""""""""""

`zfs-import-cache.service` imports ZFS pools using the ZFS cache file. If this
file becomes corrupted, the service cannot import the pools whose entries it
is unable to read.

As a result, in case of a corrupted ZFS cache file, some volumes may not be
mounted during boot and must be mounted manually later.

To regenerate the cache file, run the following for each pool:

.. code-block:: console

  # zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

Then, update the `initramfs` by running:

.. code-block:: console

  # update-initramfs -u -k all

Finally, reboot the node.

Another workaround to this problem is enabling the `zfs-import-scan.service`,
which searches and imports pools via device scanning (usually slower).
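
On hosts with several pools, the per-pool step can be generated instead of
typed by hand. The following is a minimal dry-run sketch: the function name
`print_cachefile_commands` and the pool names `rpool` and `tank` are made up
for this example, and the commands are only printed, not executed.

.. code-block:: bash

  print_cachefile_commands() {
    # Read pool names from stdin, one per line, and print the matching
    # "zpool set cachefile" command for each one without running it.
    while IFS= read -r pool; do
      [ -n "$pool" ] && echo "zpool set cachefile=/etc/zfs/zpool.cache $pool"
    done
    return 0
  }

  # Dry run with two example pool names.
  printf 'rpool\ntank\n' | print_cachefile_commands

On a real host, the pool names could come from `zpool list -H -o name`
instead of the `printf` call. Review the printed commands before running
them.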