ZFS on Linux
------------

ZFS is a combined file system and logical volume manager, designed by
Sun Microsystems. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it's possible to achieve maximum enterprise features with
low budget hardware, and also high performance systems by leveraging
SSD caching or even SSD-only setups. ZFS can replace expensive hardware
RAID cards with moderate CPU and memory load, combined with easy
management.

General ZFS advantages

* Easy configuration and management with GUI and CLI.
* Reliable
* Protection against data corruption
* Data compression on file system level
* Snapshots
* Copy-on-write clone
* Various RAID levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3
* Can use SSD for cache
* Self healing
* Continuous integrity checking
* Designed for high storage capacities
* Asynchronous replication over network
* Open Source
* Encryption

Hardware
~~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has
its own cache management. ZFS needs to communicate directly with the
disks. An HBA adapter, or something like an LSI controller flashed in
``IT`` mode, is the way to go.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

.. code-block:: console

  # man zpool
  # man zfs

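For a quick overview of pool health and existing datasets, the following
read-only commands can be used (they assume at least one pool already
exists):

.. code-block:: console

  # zpool status
  # zpool list
  # zfs list
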
Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should
be chosen so that 2 to the power of `ashift` is at least as large as the
sector size of the underlying disk.

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device>

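`ashift=12` corresponds to 2^12 = 4096-byte blocks, matching the 4 KiB
sectors of most modern disks. To check a disk's logical and physical sector
size before choosing a value, you can, for example, use `lsblk` (the device
name below is only an example):

.. code-block:: console

  # lsblk -o NAME,PHY-SEC,LOG-SEC /dev/sdX
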
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device1> <device2>

Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2>

Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>

Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>

Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated cache drive partition to increase
the performance (use an SSD).

For `<device>`, you can use multiple devices, as is shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> cache <cache_device>

Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive or partition as a log device to
increase the performance (use an SSD).

For `<device>`, you can use multiple devices, as is shown in
"Create a new pool with RAID*".

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> <device> log <log_device>

Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk`.

.. important:: Always use GPT partition tables.

The maximum size of a log device should be about half the size of
physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

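As an illustration, the two partitions could be created with `sgdisk` as
follows (the device name and the 16G log size are examples only; size the
log partition according to the guideline above):

.. code-block:: console

  # sgdisk -n1:0:+16G -c1:"zfs-log" /dev/sdX
  # sgdisk -n2:0:0 -c2:"zfs-cache" /dev/sdX

Afterwards, add the log and cache partitions to the pool:
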
.. code-block:: console

  # zpool add -f <pool> log <device-part1> cache <device-part2>


Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: console

  # zpool replace -f <pool> <old device> <new device>


Changing a failed bootable device
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depending on how Proxmox Backup was installed, it uses either `grub` or
`systemd-boot` as the bootloader.

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed, which depend on the bootloader in use.

.. code-block:: console

  # sgdisk <healthy bootable device> -R <new device>
  # sgdisk -G <new device>
  # zpool replace -f <pool> <old zfs partition> <new zfs partition>

.. NOTE:: Use the `zpool status -v` command to monitor how far the resilvering process of the new disk has progressed.

With `systemd-boot`:

.. code-block:: console

  # pve-efiboot-tool format <new disk's ESP>
  # pve-efiboot-tool init <new disk's ESP>

.. NOTE:: `ESP` stands for EFI System Partition, which is set up as partition #2 on
   bootable disks set up by the Proxmox VE installer since version 5.4. For details, see
   xref:sysboot_systemd_boot_setup[Setting up a new partition for use as synced ESP].

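To identify the ESP on the new disk before running the commands above (the
device name below is only an example), you can, for instance, list its
partitions with `lsblk`:

.. code-block:: console

  # lsblk -o NAME,SIZE,FSTYPE,PARTTYPE /dev/sdX
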
With `grub`:

Usually `grub.cfg` is located in `/boot/grub/grub.cfg`.

.. code-block:: console

  # grub-install <new disk>
  # grub-mkconfig -o /path/to/grub.cfg


Activate E-Mail Notification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ZFS comes with an event daemon, which monitors events generated by the
ZFS kernel module. The daemon can also send emails on ZFS events like
pool errors. Newer ZFS packages ship the daemon in a separate package,
and you can install it using `apt-get`:

.. code-block:: console

  # apt-get install zfs-zed

To activate the daemon, it is necessary to edit `/etc/zfs/zed.d/zed.rc` with your
favourite editor, and uncomment the `ZED_EMAIL_ADDR` setting:

.. code-block:: console

  ZED_EMAIL_ADDR="root"

Please note that Proxmox Backup forwards mails sent to `root` to the email
address configured for the root user.

IMPORTANT: The only setting that is required is `ZED_EMAIL_ADDR`. All
other settings are optional.

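The event daemon runs as the `zfs-zed` systemd service. After changing
`zed.rc`, you may want to restart it so the new settings take effect, for
example:

.. code-block:: console

  # systemctl restart zfs-zed
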
Limit ZFS Memory Usage
^^^^^^^^^^^^^^^^^^^^^^

It is good to use at most 50 percent (which is the default) of the
system memory for the ZFS ARC, to prevent performance degradation of the
host. Use your preferred editor to change the configuration in
`/etc/modprobe.d/zfs.conf` and insert:

.. code-block:: console

  options zfs zfs_arc_max=8589934592

This example setting limits the usage to 8 GiB.

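The value is the desired ARC limit in bytes; 8 GiB is 8 * 1024^3 = 8589934592
bytes. As a sketch, the limit can usually also be applied at runtime, without
a reboot, by writing to the module parameter (the persistent setting in
`/etc/modprobe.d/zfs.conf` is still required):

.. code-block:: console

  # echo $((8 * 1024*1024*1024))
  8589934592
  # echo $((8 * 1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max
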
.. IMPORTANT:: If your root file system is ZFS, you must update your initramfs every time this value changes:

.. code-block:: console

  # update-initramfs -u


SWAP on ZFS
^^^^^^^^^^^

Swap space created on a zvol may cause some trouble, such as blocking the
server or generating high I/O load, often seen when starting a backup
to external storage.

We strongly recommend using enough memory, so that you normally do not
run into low-memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the `swappiness` value.
A good value for servers is 10:

.. code-block:: console

  # sysctl -w vm.swappiness=10

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

.. code-block:: console

  vm.swappiness = 10

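The currently active value can be checked at any time, for example with:

.. code-block:: console

  # sysctl vm.swappiness
  vm.swappiness = 10
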
.. table:: Linux kernel `swappiness` parameter values
  :widths: auto

  ==================== ===============================================================
  Value                Strategy
  ==================== ===============================================================
  vm.swappiness = 0    The kernel will swap only to avoid an 'out of memory' condition
  vm.swappiness = 1    Minimum amount of swapping without disabling it entirely.
  vm.swappiness = 10   Sometimes recommended to improve performance when sufficient memory exists in a system.
  vm.swappiness = 60   The default value.
  vm.swappiness = 100  The kernel will swap aggressively.
  ==================== ===============================================================

ZFS Compression
^^^^^^^^^^^^^^^

To activate compression:

.. code-block:: console

  # zfs set compression=lz4 <pool>

We recommend using the `lz4` algorithm, since it adds very little CPU overhead.
Other algorithms such as `lzjb` and `gzip-N` (where `N` is an integer from `1` to `9`
representing the compression level, where 1 is fastest and 9 compresses best)
are also available. Depending on the algorithm and how compressible the data is,
having compression enabled can even increase I/O performance.

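To check which algorithm is active and what compression ratio has been
achieved on existing data, you can query the corresponding properties, for
example:

.. code-block:: console

  # zfs get compression,compressratio <pool>
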
You can disable compression at any time with:

.. code-block:: console

  # zfs set compression=off <dataset>

Only new blocks will be affected by this change.

ZFS Special Device
^^^^^^^^^^^^^^^^^^

Since version 0.8.0, ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve the
performance. Use fast SSDs for the `special` device.

.. IMPORTANT:: The redundancy of the `special` device should match that of the
   pool, since the `special` device is a point of failure for the whole pool.

.. WARNING:: Adding a `special` device to a pool cannot be undone!

Create a pool with `special` device and RAID-1:

.. code-block:: console

  # zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>

Adding a `special` device to an existing pool with RAID-1:

.. code-block:: console

  # zpool add <pool> special mirror <device1> <device2>

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power of
two in the range from `512B` to `128K`. After setting the property, new file
blocks smaller than `size` will be allocated on the `special` device.

.. IMPORTANT:: If the value for `special_small_blocks` is greater than or equal to
   the `recordsize` (default `128K`) of the dataset, *all* data will be written to
   the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example, all datasets
in the pool will opt in for small file blocks).

Opt in for all file blocks smaller than 4K, pool-wide:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>

Opt in for small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=4K <pool>/<filesystem>

Opt out from small file blocks for a single dataset:

.. code-block:: console

  # zfs set special_small_blocks=0 <pool>/<filesystem>

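To review which values are currently in effect for a pool and all of its
child datasets, you can query the property recursively, for example:

.. code-block:: console

  # zfs get -r special_small_blocks <pool>
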
Troubleshooting
^^^^^^^^^^^^^^^

Corrupted cachefile

In case of a corrupted ZFS cachefile, some volumes may not be mounted during
boot and will have to be mounted manually later.

For each pool, run:

.. code-block:: console

  # zpool set cachefile=/etc/zfs/zpool.cache POOLNAME

and afterwards update the `initramfs` by running:

.. code-block:: console

  # update-initramfs -u -k all

and finally reboot your node.

Sometimes the ZFS cachefile can get corrupted, and `zfs-import-cache.service`
then does not import pools that are not present in the cachefile.

Another workaround for this problem is enabling the `zfs-import-scan.service`,
which searches for and imports pools via device scanning (usually slower).
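
As a sketch, the scan-based import service can be enabled with `systemctl`:

.. code-block:: console

  # systemctl enable zfs-import-scan.service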