[[chapter_zfs]]
ZFS on Linux
------------
ifdef::wiki[]
:pve-toplevel:
endif::wiki[]

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. Starting with {pve} 3.4, the native Linux
kernel port of the ZFS file system is introduced as an optional
file system and also as an additional selection for the root
file system. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it is possible to achieve maximum enterprise features with
low budget hardware, but also high performance systems by leveraging
SSD caching or even SSD only setups. ZFS can replace costly hardware
RAID cards with moderate CPU and memory load, combined with easy
management.

.General ZFS advantages

* Easy configuration and management with {pve} GUI and CLI.

* Reliable

* Protection against data corruption

* Data compression on file system level

* Snapshots

* Copy-on-write clone

* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3,
dRAID, dRAID2, dRAID3

* Can use SSD for cache

* Self healing

* Continuous integrity checking

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source

* Encryption

* ...


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD. This can increase the overall performance
significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
own cache management. ZFS needs to communicate directly with the disks. An
HBA adapter or something like an LSI controller flashed in ``IT'' mode is more
appropriate.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for disks of that VM,
as they are not supported by ZFS. Use IDE or SCSI instead (also works
with the `virtio` SCSI controller type).


Installation as Root File System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you install using the {pve} installer, you can choose ZFS for the
root file system. You need to select the RAID type at installation
time:

[horizontal]
RAID0:: Also called ``striping''. The capacity of such a volume is the sum
of the capacities of all disks. But RAID0 does not add any redundancy,
so the failure of a single drive makes the volume unusable.

RAID1:: Also called ``mirroring''. Data is written identically to all
disks. This mode requires at least 2 disks with the same size. The
resulting capacity is that of a single disk.

RAID10:: A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1:: A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2:: A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3:: A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool
called `rpool`, and installs the root file system on the ZFS subvolume
`rpool/ROOT/pve-1`.

Another subvolume called `rpool/data` is created to store VM
images. In order to use that with the {pve} tools, the installer
creates the following configuration entry in `/etc/pve/storage.cfg`:

----
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
----

After installation, you can view your ZFS pool status using the
`zpool` command:

----
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
----

The `zfs` command is used to configure and manage your ZFS file
systems. The following command lists all file systems after
installation:

----
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
----


[[sysadmin_zfs_raid_considerations]]
ZFS RAID Level Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are a few factors to take into consideration when choosing the layout of
a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
`vdev`. All vdevs in a pool are used equally and the data is striped among them
(RAID0). Check the `zpoolconcepts(7)` manpage for more details on vdevs.

[[sysadmin_zfs_raid_performance]]
Performance
^^^^^^^^^^^

Each `vdev` type has different performance behaviors. The two
parameters of interest are the IOPS (Input/Output Operations per Second) and
the bandwidth with which data can be written or read.

A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
to both parameters when writing data. When reading data the performance will
scale linearly with the number of disks in the mirror.

A common situation is to have 4 disks. When setting it up as 2 mirror vdevs
(RAID10) the pool will have the write characteristics of two single disks in
regard to IOPS and bandwidth. For read operations it will resemble 4 single
disks.

A 'RAIDZ' of any redundancy level will approximately behave like a single disk
in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the
size of the RAIDZ vdev and the redundancy level.

A 'dRAID' pool should match the performance of an equivalent 'RAIDZ' pool.

For running VMs, IOPS is the more important metric in most situations.


[[sysadmin_zfs_raid_size_space_usage_redundancy]]
Size, Space usage and Redundancy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While a pool made of 'mirror' vdevs will have the best performance
characteristics, the usable space will be 50% of the disks available. Less if a
mirror vdev consists of more than 2 disks, for example in a 3-way mirror. At
least one healthy disk per mirror is needed for the pool to stay functional.

The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with P being
the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail
without losing data. A special case is a 4 disk pool with RAIDZ2. In this
situation it is usually better to use 2 mirror vdevs for the better performance
as the usable space will be the same.

Another important factor when using any RAIDZ level is how ZVOL datasets, which
are used for VM disks, behave. For each data block the pool needs parity data
which is at least the size of the minimum block size defined by the `ashift`
value of the pool. With an ashift of 12 the block size of the pool is 4k. The
default block size for a ZVOL is 8k. Therefore, in a RAIDZ2 each 8k block
written will cause two additional 4k parity blocks to be written,
8k + 4k + 4k = 16k. This is of course a simplified approach and the real
situation will be slightly different with metadata, compression and such not
being accounted for in this example.

This behavior can be observed when checking the following properties of the
ZVOL:

 * `volsize`
 * `refreservation` (if the pool is not thin provisioned)
 * `used` (if the pool is thin provisioned and without snapshots present)

----
# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
----

`volsize` is the size of the disk as it is presented to the VM, while
`refreservation` shows the reserved space on the pool which includes the
expected space needed for the parity data. If the pool is thin provisioned, the
`refreservation` will be set to 0. Another way to observe the behavior is to
compare the used disk space within the VM and the `used` property. Be aware
that snapshots will skew the value.

There are a few options to counter the increased use of space:

* Increase the `volblocksize` to improve the data to parity ratio
* Use 'mirror' vdevs instead of 'RAIDZ'
* Use `ashift=9` (block size of 512 bytes)

The `volblocksize` property can only be set when creating a ZVOL. The default
value can be changed in the storage configuration. When doing this, the guest
needs to be tuned accordingly and depending on the use case, the problem of
write amplification is just moved from the ZFS layer up to the guest.

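As a minimal sketch of changing the default block size in the storage
configuration, the `blocksize` option of a ZFS storage can be adjusted; the
storage name `local-zfs` and the value `16k` below are assumptions, adapt them
to your setup:

----
# pvesm set local-zfs --blocksize 16k
----
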
Using `ashift=9` when creating the pool can lead to bad
performance, depending on the disks underneath, and cannot be changed later on.

Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use
them, unless your environment has specific needs and characteristics where
RAIDZ performance characteristics are acceptable.


ZFS dRAID
~~~~~~~~~

In a ZFS dRAID (declustered RAID) the hot spare drive(s) participate in the RAID.
Their spare capacity is reserved and used for rebuilding when one drive fails.
This provides, depending on the configuration, faster rebuilding compared to a
RAIDZ in case of drive failure. More information can be found in the official
OpenZFS documentation. footnote:[OpenZFS dRAID
https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]

NOTE: dRAID is intended for more than 10-15 disks. A RAIDZ
setup should be better for a smaller number of disks in most use cases.

NOTE: The GUI requires one more disk than the minimum (e.g. dRAID1 needs 3). It
expects that a spare disk is added as well.

 * `dRAID1` or `dRAID`: requires at least 2 disks, one can fail before data is
lost
 * `dRAID2`: requires at least 3 disks, two can fail before data is lost
 * `dRAID3`: requires at least 4 disks, three can fail before data is lost


Additional information can be found on the manual page:

----
# man zpoolconcepts
----

Spares and Data
^^^^^^^^^^^^^^^

The number of `spares` tells the system how many disks it should keep ready in
case of a disk failure. The default value is 0 `spares`. Without spares,
rebuilding won't get any speed benefits.

`data` defines the number of devices in a redundancy group. The default value is
8. Except when `disks - parity - spares` equals something less than 8, the lower
number is used. In general, a smaller number of `data` devices leads to higher
IOPS, better compression ratios and faster resilvering, but defining fewer data
devices reduces the available storage capacity of the pool.
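
As an illustration of how `data` and `spares` can be specified when creating a
dRAID pool (a sketch only; the group layout and disk count are assumptions and
should be adapted to your setup), a dRAID2 vdev with redundancy groups of 3
data devices and one distributed spare over 8 disks could be created like this:

----
# zpool create -f -o ashift=12 <pool> draid2:3d:1s <device1> <device2> <device3> <device4> <device5> <device6> <device7> <device8>
----
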
Bootloader
~~~~~~~~~~

{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to manage the
bootloader configuration.
See the chapter on xref:sysboot[{pve} host bootloaders] for details.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

----
# man zpool
# man zfs
----

[[sysadmin_zfs_create_new_zpool]]
Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should have the
same sector-size (2 to the power of `ashift`) as the underlying disk, or larger.

----
# zpool create -f -o ashift=12 <pool> <device>
----

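To verify the sector size of the underlying disk before picking the `ashift`,
you can for example check the physical and logical sector sizes reported by the
kernel (a sketch, the device path is a placeholder):

----
# lsblk -o NAME,PHY-SEC,LOG-SEC /dev/<device>
----
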
[TIP]
====
Pool names must adhere to the following rules:

* begin with a letter (a-z or A-Z)
* contain only alphanumeric, `-`, `_`, `.`, `:` or ` ` (space) characters
* must *not begin* with one of `mirror`, `raidz`, `draid` or `spare`
* must not be `log`
====

To activate compression (see section <<zfs_compression,Compression in ZFS>>):

----
# zfs set compression=lz4 <pool>
----

[[sysadmin_zfs_create_new_zpool_raid0]]
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid1]]
Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid10]]
Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

[[sysadmin_zfs_create_new_zpool_raidz1]]
Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

Please read the section on
xref:sysadmin_zfs_raid_considerations[ZFS RAID Level Considerations]
to get a rough estimate of the IOPS and bandwidth to expect before setting up
a pool, especially when wanting to use a RAID-Z mode.

[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated device, or partition, as second-level cache to
increase the performance. Such a cache device will especially help with
random-read workloads of data that is mostly static. As it acts as additional
caching layer between the actual storage, and the in-memory ARC, it can also
help if the ARC must be reduced due to memory constraints.

.Create ZFS pool with an on-disk cache
----
# zpool create -f -o ashift=12 <pool> <device> cache <cache-device>
----

Here only a single `<device>` and a single `<cache-device>` were used, but it is
possible to use more devices, as shown in
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID].

Note that for cache devices no mirror or RAID modes exist, they are all simply
accumulated.

If any cache device produces errors on read, ZFS will transparently divert that
request to the underlying storage layer.


[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive, or partition, for the ZFS Intent Log
(ZIL). It is mainly used to provide safe synchronous transactions, so often in
performance critical paths like databases, or other programs that issue `fsync`
operations frequently.

The pool is used as the default ZIL location. Diverting the ZIL IO load to a
separate device can help to reduce transaction latencies while relieving the
main pool at the same time, increasing overall performance.

For disks to be used as log devices, directly or through a partition, it's
recommended to:

- use fast SSDs with power-loss protection, as those have much smaller commit
  latencies.

- Use at least a few GB for the partition (or whole device), but using more than
  half of your installed memory won't provide you with any real advantage.

.Create ZFS pool with separate log device
----
# zpool create -f -o ashift=12 <pool> <device> log <log-device>
----

In the above example a single `<device>` and a single `<log-device>` are used, but
you can also combine this with other RAID variants, as described in the
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID] section.

You can also mirror the log device to multiple devices. This is mainly useful to
ensure that performance doesn't immediately degrade if a single log device
fails.

If all log devices fail, the ZFS main pool itself will be used again, until the
log device(s) get replaced.

[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, you can still add both, or just one of
them, at any time.

For example, let's assume you got a good enterprise SSD with power-loss
protection that you want to use for improving the overall performance of your
pool.

As the maximum size of a log device should be about half the size of the
installed physical memory, the ZIL will most likely only take up a relatively
small part of the SSD; the remaining space can be used as cache.

First you have to create two GPT partitions on the SSD with `parted` or `gdisk`.

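For example, with `gdisk`'s scriptable companion `sgdisk` this could look like
the following sketch; the partition size and device path are placeholders,
adjust them to your SSD and installed memory:

----
# sgdisk -n1:0:+32G /dev/disk/by-id/<ssd>
# sgdisk -n2:0:0 /dev/disk/by-id/<ssd>
----
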
Then you're ready to add them to a pool:

.Add both, a separate log device and a second-level cache, to an existing pool
----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----

Just replace `<pool>`, `<device-part1>` and `<device-part2>` with the pool name
and the two `/dev/disk/by-id/` paths to the partitions.

You can also add ZIL and cache separately.

.Add a log device to an existing ZFS pool
----
# zpool add <pool> log <log-device>
----
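
A cache device alone can be added in the same way (a sketch following the same
pattern):

----
# zpool add <pool> cache <cache-device>
----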


[[sysadmin_zfs_change_failed_dev]]
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

----
# zpool replace -f <pool> <old-device> <new-device>
----

.Changing a failed bootable device

Depending on how {pve} was installed, it is either using `systemd-boot` or GRUB
through `proxmox-boot-tool` footnote:[Systems installed with {pve} 6.4 or later,
EFI systems installed with {pve} 5.4 or later] or plain GRUB as bootloader (see
xref:sysboot[Host Bootloader]). You can check by running:

----
# proxmox-boot-tool status
----

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed which depend on the bootloader in use.

----
# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
----

NOTE: Use the `zpool status -v` command to monitor how far the resilvering
process of the new disk has progressed.

.With `proxmox-boot-tool`:

----
# proxmox-boot-tool format <new disk's ESP>
# proxmox-boot-tool init <new disk's ESP> [grub]
----

NOTE: `ESP` stands for EFI System Partition, which is set up as partition #2 on
bootable disks set up by the {pve} installer since version 5.4. For details, see
xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].

NOTE: Make sure to pass 'grub' as mode to `proxmox-boot-tool init` if
`proxmox-boot-tool status` indicates your current disks are using GRUB,
especially if Secure Boot is enabled!

.With plain GRUB:

----
# grub-install <new disk>
----

NOTE: Plain GRUB is only used on systems installed with {pve} 6.3 or earlier,
which have not been manually migrated to using `proxmox-boot-tool` yet.


Configure E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon `ZED`, which monitors events generated by the ZFS
kernel module. The daemon can also send emails on ZFS events like pool errors.
Newer ZFS packages ship the daemon in a separate `zfs-zed` package, which should
already be installed by default in {pve}.

You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with your
favorite editor. The required setting for email notification is
`ZED_EMAIL_ADDR`, which is set to `root` by default.

--------
ZED_EMAIL_ADDR="root"
--------

Please note that {pve} forwards mails sent to `root` to the email address
configured for the root user.

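For the daemon to pick up changes to `zed.rc`, it needs to be restarted. A
minimal sketch, assuming the standard `zfs-zed` service name:

----
# systemctl restart zfs-zed
----
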

[[sysadmin_zfs_limit_memory_usage]]
Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

ZFS uses '50 %' of the host memory for the **A**daptive **R**eplacement
**C**ache (ARC) by default. Allocating enough memory for the ARC is crucial for
IO performance, so reduce it with caution. As a general rule of thumb, allocate
at least +2 GiB Base + 1 GiB/TiB-Storage+. For example, if you have a pool with
+8 TiB+ of available storage space then you should use +10 GiB+ of memory for
the ARC.

You can change the ARC usage limit for the current boot (a reboot resets this
change again) by writing to the +zfs_arc_max+ module parameter directly:

----
echo "$[10 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

To *permanently change* the ARC limits, add the following line to
`/etc/modprobe.d/zfs.conf`:

--------
options zfs zfs_arc_max=8589934592
--------

This example setting limits the usage to 8 GiB ('8 * 2^30^').

IMPORTANT: In case your desired +zfs_arc_max+ value is lower than or equal to
+zfs_arc_min+ (which defaults to 1/32 of the system memory), +zfs_arc_max+ will
be ignored unless you also set +zfs_arc_min+ to at most +zfs_arc_max - 1+.

----
echo "$[8 * 1024*1024*1024 - 1]" >/sys/module/zfs/parameters/zfs_arc_min
echo "$[8 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

This example setting (temporarily) limits the usage to 8 GiB ('8 * 2^30^') on
systems with more than 256 GiB of total memory, where simply setting
+zfs_arc_max+ alone would not work.

[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u -k all
----

You *must reboot* to activate these changes.
====


[[zfs_swap]]
SWAP on ZFS
~~~~~~~~~~~

Swap-space created on a zvol may cause some problems, like blocking the
server or generating a high IO load, often seen when starting a backup
to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the
``swappiness'' value. A good value for servers is 10:

----
# sysctl -w vm.swappiness=10
----

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

--------
vm.swappiness = 10
--------

.Linux kernel `swappiness` parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value               | Strategy
| vm.swappiness = 0   | The kernel will swap only to avoid
an 'out of memory' condition
| vm.swappiness = 1   | Minimum amount of swapping without
disabling it entirely.
| vm.swappiness = 10  | This value is sometimes recommended to
improve performance when sufficient memory exists in a system.
| vm.swappiness = 60  | The default value.
| vm.swappiness = 100 | The kernel will swap aggressively.
|===========================================================

[[zfs_encryption]]
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~

WARNING: Native ZFS encryption in {pve} is experimental. Known limitations and
issues include Replication with encrypted datasets
footnote:[https://bugzilla.proxmox.com/show_bug.cgi?id=2350],
as well as checksum errors when using Snapshots or ZVOLs.
footnote:[https://github.com/openzfs/zfs/issues/11688]

ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:

----
# zpool get feature@encryption tank
NAME  PROPERTY            VALUE     SOURCE
tank  feature@encryption  disabled  local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE     SOURCE
tank  feature@encryption  enabled   local
----
687WARNING: There is currently no support for booting from pools with encrypted
7c73a209 688datasets using GRUB, and only limited support for automatically unlocking
cca0540e
FG
689encrypted datasets on boot. Older versions of ZFS without encryption support
690will not be able to decrypt stored data.
691
692NOTE: It is recommended to either unlock storage datasets manually after
693booting, or to write a custom unit to pass the key material needed for
694unlocking on boot to `zfs load-key`.
695
696WARNING: Establish and test a backup procedure before enabling encryption of
5dfeeece 697production data. If the associated key material/passphrase/keyfile has been
cca0540e
FG
698lost, accessing the encrypted data is no longer possible.
699
700Encryption needs to be setup when creating datasets/zvols, and is inherited by
701default to child datasets. For example, to create an encrypted dataset
702`tank/encrypted_data` and configure it as storage in {pve}, run the following
703commands:
704
705----
706# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
707Enter passphrase:
708Re-enter passphrase:
709
710# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data
711----
712
713All guest volumes/disks create on this storage will be encrypted with the
714shared key material of the parent dataset.
715
716To actually use the storage, the associated key material needs to be loaded
7353437b 717and the dataset needs to be mounted. This can be done in one step with:
cca0540e
FG
718
719----
7353437b 720# zfs mount -l tank/encrypted_data
cca0540e
FG
721Enter passphrase for 'tank/encrypted_data':
722----
723
724It is also possible to use a (random) keyfile instead of prompting for a
725passphrase by setting the `keylocation` and `keyformat` properties, either at
229426eb 726creation time or with `zfs change-key` on existing datasets:
cca0540e
FG
727
728----
729# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1
730
731# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
732----
733
734WARNING: When using a keyfile, special care needs to be taken to secure the
735keyfile against unauthorized access or accidental loss. Without the keyfile, it
736is not possible to access the plaintext data!
737
738A guest volume created underneath an encrypted dataset will have its
739`encryptionroot` property set accordingly. The key material only needs to be
740loaded once per encryptionroot to be available to all encrypted datasets
741underneath it.
742
743See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
744`keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
745change-key` commands and the `Encryption` section from `man zfs` for more
746details and advanced usage.
68029ec8
FE
747
748
e06707f2
FE
749[[zfs_compression]]
750Compression in ZFS
751~~~~~~~~~~~~~~~~~~
752
753When compression is enabled on a dataset, ZFS tries to compress all *new*
754blocks before writing them and decompresses them on reading. Already
755existing data will not be compressed retroactively.
756
757You can enable compression with:
758
759----
760# zfs set compression=<algorithm> <dataset>
761----
762
763We recommend using the `lz4` algorithm, because it adds very little CPU
764overhead. Other algorithms like `lzjb` and `gzip-N`, where `N` is an
765integer from `1` (fastest) to `9` (best compression ratio), are also
766available. Depending on the algorithm and how compressible the data is,
767having compression enabled can even increase I/O performance.
768
769You can disable compression at any time with:
770
771----
772# zfs set compression=off <dataset>
773----
774
775Again, only new blocks will be affected by this change.
776
777
42449bdf 778[[sysadmin_zfs_special_device]]
68029ec8
FE
779ZFS Special Device
780~~~~~~~~~~~~~~~~~~
781
782Since version 0.8.0 ZFS supports `special` devices. A `special` device in a
783pool is used to store metadata, deduplication tables, and optionally small
784file blocks.
785
786A `special` device can improve the speed of a pool consisting of slow spinning
51e544b6
TL
787hard disks with a lot of metadata changes. For example workloads that involve
788creating, updating or deleting a large number of files will benefit from the
789presence of a `special` device. ZFS datasets can also be configured to store
790whole small files on the `special` device which can further improve the
791performance. Use fast SSDs for the `special` device.
68029ec8
FE
792
793IMPORTANT: The redundancy of the `special` device should match the one of the
794pool, since the `special` device is a point of failure for the whole pool.
795
796WARNING: Adding a `special` device to a pool cannot be undone!
797
798.Create a pool with `special` device and RAID-1:
799
eaefe614
FE
800----
801# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
802----
68029ec8
FE
803
804.Add a `special` device to an existing pool with RAID-1:
805
eaefe614
FE
806----
807# zpool add <pool> special mirror <device1> <device2>
808----
68029ec8
FE
809
810ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
811`0` to disable storing small file blocks on the `special` device or a power of
9deec2e2 812two in the range between `512B` to `1M`. After setting the property new file
68029ec8
FE
813blocks smaller than `size` will be allocated on the `special` device.
814
815IMPORTANT: If the value for `special_small_blocks` is greater than or equal to
51e544b6
TL
816the `recordsize` (default `128K`) of the dataset, *all* data will be written to
817the `special` device, so be careful!
68029ec8
FE
818
819Setting the `special_small_blocks` property on a pool will change the default
820value of that property for all child ZFS datasets (for example all containers
821in the pool will opt in for small file blocks).
822
51e544b6 823.Opt in for all file smaller than 4K-blocks pool-wide:
68029ec8 824
eaefe614
FE
825----
826# zfs set special_small_blocks=4K <pool>
827----
68029ec8
FE
828
829.Opt in for small file blocks for a single dataset:
830
eaefe614
FE
831----
832# zfs set special_small_blocks=4K <pool>/<filesystem>
833----
68029ec8
FE
834
835.Opt out from small file blocks for a single dataset:
836
eaefe614
FE
837----
838# zfs set special_small_blocks=0 <pool>/<filesystem>
839----
18d0d68e
SI
840
841[[sysadmin_zfs_features]]
842ZFS Pool Features
843~~~~~~~~~~~~~~~~~
844
845Changes to the on-disk format in ZFS are only made between major version changes
846and are specified through *features*. All features, as well as the general
847mechanism are well documented in the `zpool-features(5)` manpage.
848
849Since enabling new features can render a pool not importable by an older version
850of ZFS, this needs to be done actively by the administrator, by running
851`zpool upgrade` on the pool (see the `zpool-upgrade(8)` manpage).
852
853Unless you need to use one of the new features, there is no upside to enabling
854them.
855
856In fact, there are some downsides to enabling new features:
857
7c73a209 858* A system with root on ZFS, that still boots using GRUB will become
18d0d68e 859 unbootable if a new feature is active on the rpool, due to the incompatible
7c73a209 860 implementation of ZFS in GRUB.
18d0d68e
SI
861* The system will not be able to import any upgraded pool when booted with an
862 older kernel, which still ships with the old ZFS modules.
863* Booting an older {pve} ISO to repair a non-booting system will likewise not
864 work.
865
27adc096 866IMPORTANT: Do *not* upgrade your rpool if your system is still booted with
7c73a209 867GRUB, as this will render your system unbootable. This includes systems
27adc096 868installed before {pve} 5.4, and systems booting with legacy BIOS boot (see
18d0d68e
SI
869xref:sysboot_determine_bootloader_used[how to determine the bootloader]).
870
27adc096 871.Enable new features for a ZFS pool:
18d0d68e
SI
872----
873# zpool upgrade <pool>
874----