[[chapter_zfs]]
ZFS on Linux
------------
ifdef::wiki[]
:pve-toplevel:
endif::wiki[]

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. Starting with {pve} 3.4, the native Linux kernel port
of the ZFS file system was introduced as an optional file system and
also as an additional selection for the root file system. There is no
need to manually compile ZFS modules - all packages are included.

By using ZFS, it is possible to achieve enterprise-grade features even
with low budget hardware, as well as high performance systems by
leveraging SSD caching or even SSD-only setups. ZFS can replace costly
hardware RAID cards with moderate CPU and memory load, combined with
easy management.

.General ZFS advantages

* Easy configuration and management with {pve} GUI and CLI.

* Reliable

* Protection against data corruption

* Data compression on file system level

* Snapshots

* Copy-on-write clone

* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2 and RAIDZ-3

* Can use SSD for cache

* Self healing

* Continuous integrity checking

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source

* Encryption

* ...


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8GB to start. In
practice, use as much as you can get for your hardware/budget. To
prevent data corruption, we recommend the use of high quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise class SSD (e.g. Intel SSD DC S3700 Series). This can
increase the overall performance significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which
has its own cache management. ZFS needs to communicate directly with
the disks. An HBA adapter or something like an LSI controller flashed
in ``IT'' mode is more appropriate.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for the disks of that VM,
as they are not supported by ZFS. Use IDE or SCSI instead (this also
works with the `virtio` SCSI controller type).

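For example, a hypothetical guest with VM ID 100 on the outer {pve}
host could get a 32 GiB disk attached via the `virtio` SCSI controller
like this (VM ID, storage name and size are placeholders, not values
from this guide):

----
# qm set 100 --scsihw virtio-scsi-pci --scsi0 <storage>:32
----
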
Installation as Root File System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you install using the {pve} installer, you can choose ZFS for the
root file system. You need to select the RAID type at installation
time:

[horizontal]
RAID0:: Also called ``striping''. The capacity of such a volume is the
sum of the capacities of all disks. But RAID0 does not add any
redundancy, so the failure of a single drive makes the volume unusable.

RAID1:: Also called ``mirroring''. Data is written identically to all
disks. This mode requires at least 2 disks with the same size. The
resulting capacity is that of a single disk.

RAID10:: A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1:: A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2:: A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3:: A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool
called `rpool`, and installs the root file system on the ZFS subvolume
`rpool/ROOT/pve-1`.

Another subvolume called `rpool/data` is created to store VM
images. In order to use that with the {pve} tools, the installer
creates the following configuration entry in `/etc/pve/storage.cfg`:

----
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
----

After installation, you can view your ZFS pool status using the
`zpool` command:

----
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda2    ONLINE       0     0     0
            sdb2    ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

errors: No known data errors
----

The `zfs` command is used to configure and manage your ZFS file
systems. The following command lists all file systems after
installation:

----
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
----


[[sysadmin_zfs_raid_considerations]]
ZFS RAID Level Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are a few factors to take into consideration when choosing the
layout of a ZFS pool. The basic building block of a ZFS pool is the
virtual device, or `vdev`. All vdevs in a pool are used equally and the
data is striped among them (RAID0). Check the `zpool(8)` manpage for
more details on vdevs.

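To inspect how an existing pool is composed of vdevs and individual
devices, `zpool list` can be used with its verbose flag:

----
# zpool list -v
----
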
[[sysadmin_zfs_raid_performance]]
Performance
^^^^^^^^^^^

Each `vdev` type has different performance behaviors. The two
parameters of interest are the IOPS (Input/Output Operations per
Second) and the bandwidth with which data can be written or read.

A 'mirror' vdev (RAID1) will approximately behave like a single disk in
regard to both parameters when writing data. When reading data it will
behave like the number of disks in the mirror.

A common situation is to have 4 disks. When setting it up as 2 mirror
vdevs (RAID10), the pool will have the write characteristics of two
single disks in regard to IOPS and bandwidth. For read operations, it
will resemble 4 single disks.

A 'RAIDZ' of any redundancy level will approximately behave like a
single disk in regard to IOPS, with a lot of bandwidth. How much
bandwidth depends on the size of the RAIDZ vdev and the redundancy
level.

For running VMs, IOPS is the more important metric in most situations.

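As a rough illustration of the difference between the two metrics, both
can be measured with `fio`. This is only a sketch: the file path and
sizes are placeholders, and results on ZFS are skewed by ARC caching,
so treat the numbers as relative rather than absolute. Random 4k writes
approximate the IOPS limit, sequential 1M writes the bandwidth limit:

----
# fio --name=iops --filename=/rpool/data/fio.test --size=1G \
      --rw=randwrite --bs=4k --runtime=60 --time_based

# fio --name=bw --filename=/rpool/data/fio.test --size=1G \
      --rw=write --bs=1M --runtime=60 --time_based
----
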
[[sysadmin_zfs_raid_size_space_usage_redundancy]]
Size, Space usage and Redundancy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While a pool made of 'mirror' vdevs will have the best performance
characteristics, the usable space will be 50% of the available disk
capacity. It is less if a mirror vdev consists of more than 2 disks,
for example in a 3-way mirror. At least one healthy disk per mirror is
needed for the pool to stay functional.

The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with
P being the RAIDZ-level. The RAIDZ-level indicates how many arbitrary
disks can fail without losing data. A special case is a 4 disk pool
with RAIDZ2. In this situation it is usually better to use 2 mirror
vdevs for the better performance, as the usable space will be the same.

Another important factor when using any RAIDZ level is how ZVOL
datasets, which are used for VM disks, behave. For each data block the
pool needs parity data which is at least the size of the minimum block
size defined by the `ashift` value of the pool. With an ashift of 12,
the block size of the pool is 4k. The default block size for a ZVOL is
8k. Therefore, in a RAIDZ2 each 8k block written will cause two
additional 4k parity blocks to be written, 8k + 4k + 4k = 16k. This is
of course a simplified approach and the real situation will be slightly
different, with metadata, compression and such not being accounted for
in this example.

This behavior can be observed when checking the following properties of
the ZVOL:

* `volsize`
* `refreservation` (if the pool is not thin provisioned)
* `used` (if the pool is thin provisioned and without snapshots present)

----
# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
----

`volsize` is the size of the disk as it is presented to the VM, while
`refreservation` shows the reserved space on the pool which includes
the expected space needed for the parity data. If the pool is thin
provisioned, the `refreservation` will be set to 0. Another way to
observe the behavior is to compare the used disk space within the VM
and the `used` property. Be aware that snapshots will skew the value.

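As an illustration, assuming a hypothetical guest with VM ID 100 whose
disk `vm-100-disk-0` lives on `rpool/data` (all names here are
placeholders), the two values to compare would be:

----
# zfs get used rpool/data/vm-100-disk-0    # on the host
$ df -h /                                  # inside the guest
----
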
There are a few options to counter the increased use of space:

* Increase the `volblocksize` to improve the data to parity ratio
* Use 'mirror' vdevs instead of 'RAIDZ'
* Use `ashift=9` (block size of 512 bytes)

The `volblocksize` property can only be set when creating a ZVOL. The
default value can be changed in the storage configuration (a sketch
follows below). When doing this, the guest needs to be tuned
accordingly and, depending on the use case, the problem of write
amplification is just moved from the ZFS layer up to the guest.

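For illustration, the default can be changed through the `blocksize`
option of the `zfspool` storage definition in `/etc/pve/storage.cfg`.
This sketch extends the entry created by the installer; the value `16k`
is only an example, not a recommendation:

--------
zfspool: local-zfs
        pool rpool/data
        sparse
        content images,rootdir
        blocksize 16k
--------
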
Using `ashift=9` when creating the pool can lead to bad performance,
depending on the disks underneath, and cannot be changed later on.

Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads.
Use them, unless your environment has specific needs and
characteristics where RAIDZ performance characteristics are acceptable.


Bootloader
~~~~~~~~~~

{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to
manage the bootloader configuration.
See the chapter on xref:sysboot[{pve} host bootloaders] for details.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

----
# man zpool
# man zfs
----

[[sysadmin_zfs_create_new_zpool]]
Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should
be chosen so that the resulting block size (2 to the power of `ashift`)
is the same as, or larger than, the sector size of the underlying disk.

----
# zpool create -f -o ashift=12 <pool> <device>
----

To activate compression (see section <<zfs_compression,Compression in ZFS>>):

----
# zfs set compression=lz4 <pool>
----

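To read both settings back afterwards, as an optional sanity check:

----
# zpool get ashift <pool>
# zfs get compression <pool>
----
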
[[sysadmin_zfs_create_new_zpool_raid0]]
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid1]]
Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid10]]
Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

[[sysadmin_zfs_create_new_zpool_raidz1]]
Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

[[sysadmin_zfs_create_new_zpool_raidz2]]
Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated cache drive partition to increase
the performance (use a SSD).

For `<device>`, multiple devices can be used, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> cache <cache_device>
----

[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated log drive partition to increase
the performance (use a SSD).

For `<device>`, multiple devices can be used, as shown in
"Create a new pool with RAID*".

----
# zpool create -f -o ashift=12 <pool> <device> log <log_device>
----

[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, first partition the SSD into
2 partitions with `parted` or `gdisk` (a sketch follows below).

IMPORTANT: Always use GPT partition tables.

The maximum size of a log device should be about half the size of
physical memory, so this is usually quite small. The rest of the SSD
can be used as cache.

----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----

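For example, a minimal partitioning sketch assuming 16 GiB of physical
memory and the SSD at `/dev/sdf` (device name and sizes are
placeholders for your setup): partition 1 gets 8 GiB for the log,
partition 2 the remainder for the cache. `/dev/sdf1` and `/dev/sdf2`
then take the place of `<device-part1>` and `<device-part2>` above:

----
# sgdisk -n 1:0:+8G -n 2:0:0 /dev/sdf
----
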
[[sysadmin_zfs_change_failed_dev]]
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

----
# zpool replace -f <pool> <old device> <new device>
----

.Changing a failed bootable device

Depending on how {pve} was installed, it is either using `systemd-boot`
or `grub` through `proxmox-boot-tool`
footnote:[Systems installed with {pve} 6.4 or later, EFI systems
installed with {pve} 5.4 or later] or plain `grub` as bootloader (see
xref:sysboot[Host Bootloader]). You can check by running:

----
# proxmox-boot-tool status
----

The first steps of copying the partition table, reissuing GUIDs and
replacing the ZFS partition are the same. To make the system bootable
from the new disk, different steps are needed which depend on the
bootloader in use.

----
# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
----

NOTE: Use the `zpool status -v` command to monitor how far the
resilvering process of the new disk has progressed.

.With `proxmox-boot-tool`:

----
# proxmox-boot-tool format <new disk's ESP>
# proxmox-boot-tool init <new disk's ESP>
----

NOTE: `ESP` stands for EFI System Partition, which is set up as
partition #2 on bootable disks set up by the {pve} installer since
version 5.4. For details, see
xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].

.With plain `grub`:

----
# grub-install <new disk>
----

NOTE: Plain `grub` is only used on systems installed with {pve} 6.3 or
earlier, which have not been manually migrated to use
`proxmox-boot-tool` yet.


Configure E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon `ZED`, which monitors events generated
by the ZFS kernel module. The daemon can also send emails on ZFS events
like pool errors. Newer ZFS packages ship the daemon in a separate
`zfs-zed` package, which should already be installed by default in
{pve}.

You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with
your favorite editor. The required setting for email notification is
`ZED_EMAIL_ADDR`, which is set to `root` by default.

--------
ZED_EMAIL_ADDR="root"
--------

Please note that {pve} forwards mails to `root` to the email address
configured for the root user.

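To test whether mails actually get delivered, one optional approach is
to also set `ZED_NOTIFY_VERBOSE=1` in the same file, so that events for
healthy pools, such as a finished scrub, generate mail too, then
restart the daemon and trigger an event (pool name is an example):

----
# systemctl restart zfs-zed
# zpool scrub rpool
----
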
[[sysadmin_zfs_limit_memory_usage]]
Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

ZFS uses '50 %' of the host memory for the **A**daptive **R**eplacement
**C**ache (ARC) by default. Allocating enough memory for the ARC is
crucial for IO performance, so reduce it with caution. As a general
rule of thumb, allocate at least +2 GiB Base + 1 GiB/TiB-Storage+. For
example, if you have a pool with +8 TiB+ of available storage space,
then you should use +10 GiB+ of memory for the ARC.

You can change the ARC usage limit for the current boot (a reboot
resets this change again) by writing to the +zfs_arc_max+ module
parameter directly:

----
echo "$[10 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

To *permanently change* the ARC limits, add the following line to
`/etc/modprobe.d/zfs.conf`:

--------
options zfs zfs_arc_max=8589934592
--------

This example setting limits the usage to 8 GiB ('8 * 2^30^').

IMPORTANT: In case your desired +zfs_arc_max+ value is lower than or
equal to +zfs_arc_min+ (which defaults to 1/32 of the system memory),
+zfs_arc_max+ will be ignored unless you also set +zfs_arc_min+ to at
most +zfs_arc_max - 1+.

----
echo "$[8 * 1024*1024*1024 - 1]" >/sys/module/zfs/parameters/zfs_arc_min
echo "$[8 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

This example setting (temporarily) limits the usage to 8 GiB
('8 * 2^30^') on systems with more than 256 GiB of total memory, where
simply setting +zfs_arc_max+ alone would not work.

[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u -k all
----

You *must reboot* to activate these changes.
====


[[zfs_swap]]
SWAP on ZFS
~~~~~~~~~~~

Swap-space created on a zvol may generate some trouble, like blocking
the server or generating a high IO load, often seen when starting a
backup to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low memory situations. Should you need or want to add swap, it
is preferred to create a partition on a physical disk and use it as a
swap device. You can leave some space free for this purpose in the
advanced options of the installer. Additionally, you can lower the
``swappiness'' value. A good value for servers is 10:

----
# sysctl -w vm.swappiness=10
----

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

--------
vm.swappiness = 10
--------

.Linux kernel `swappiness` parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value                | Strategy
| vm.swappiness = 0    | The kernel will swap only to avoid an 'out of memory' condition
| vm.swappiness = 1    | Minimum amount of swapping without disabling it entirely.
| vm.swappiness = 10   | This value is sometimes recommended to improve performance when sufficient memory exists in a system.
| vm.swappiness = 60   | The default value.
| vm.swappiness = 100  | The kernel will swap aggressively.
|===========================================================

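To apply the value from `/etc/sysctl.conf` without a reboot and read
back the active setting:

----
# sysctl -p
# sysctl vm.swappiness
----
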
[[zfs_encryption]]
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~

ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the
encryption feature can be enabled per pool:

----
# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  disabled         local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  enabled          local
----

WARNING: There is currently no support for booting from pools with
encrypted datasets using Grub, and only limited support for
automatically unlocking encrypted datasets on boot. Older versions of
ZFS without encryption support will not be able to decrypt stored data.

NOTE: It is recommended to either unlock storage datasets manually
after booting, or to write a custom unit to pass the key material
needed for unlocking on boot to `zfs load-key` (see the sketch below).

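A minimal sketch of such a unit, assuming the key material is stored in
a keyfile (with `keylocation` pointing at it, as shown further below)
so that loading works non-interactively. The unit name, ordering and
binary path are illustrative assumptions and must be adapted and tested
for your setup:

--------
# /etc/systemd/system/zfs-load-key.service -- hypothetical example
[Unit]
Description=Load ZFS encryption keys
DefaultDependencies=no
After=zfs-import.target
Before=zfs-mount.service

[Service]
Type=oneshot
RemainAfterExit=yes
# loads all keys whose keylocation is a readable file;
# adjust the path if `which zfs` differs on your system
ExecStart=/usr/sbin/zfs load-key -a

[Install]
WantedBy=zfs-mount.service
--------
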
WARNING: Establish and test a backup procedure before enabling
encryption of production data. If the associated key
material/passphrase/keyfile has been lost, accessing the encrypted data
is no longer possible.

Encryption needs to be set up when creating datasets/zvols, and is
inherited by default to child datasets. For example, to create an
encrypted dataset `tank/encrypted_data` and configure it as storage in
{pve}, run the following commands:

----
# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
Enter passphrase:
Re-enter passphrase:

# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data
----

All guest volumes/disks created on this storage will be encrypted with
the shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be
loaded and the dataset needs to be mounted. This can be done in one
step with:

----
# zfs mount -l tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':
----

It is also possible to use a (random) keyfile instead of prompting for
a passphrase by setting the `keylocation` and `keyformat` properties,
either at creation time or with `zfs change-key` on existing datasets:

----
# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1

# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
----

WARNING: When using a keyfile, special care needs to be taken to secure
the keyfile against unauthorized access or accidental loss. Without the
keyfile, it is not possible to access the plaintext data!

A guest volume created underneath an encrypted dataset will have its
`encryptionroot` property set accordingly. The key material only needs
to be loaded once per encryptionroot to be available to all encrypted
datasets underneath it.

See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
`keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
change-key` commands and the `Encryption` section from `man zfs` for
more details and advanced usage.


[[zfs_compression]]
Compression in ZFS
~~~~~~~~~~~~~~~~~~

When compression is enabled on a dataset, ZFS tries to compress all
*new* blocks before writing them and decompresses them on reading.
Already existing data will not be compressed retroactively.

You can enable compression with:

----
# zfs set compression=<algorithm> <dataset>
----

We recommend using the `lz4` algorithm, because it adds very little CPU
overhead. Other algorithms like `lzjb` and `gzip-N`, where `N` is an
integer from `1` (fastest) to `9` (best compression ratio), are also
available. Depending on the algorithm and how compressible the data is,
having compression enabled can even increase I/O performance.

You can disable compression at any time with:

----
# zfs set compression=off <dataset>
----

Again, only new blocks will be affected by this change.

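To judge how effective compression is on already written data, the
read-only `compressratio` property can be checked:

----
# zfs get compressratio <dataset>
----

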
[[sysadmin_zfs_special_device]]
ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0, ZFS supports `special` devices. A `special` device
in a pool is used to store metadata, deduplication tables, and
optionally small file blocks.

A `special` device can improve the speed of a pool consisting of slow
spinning hard disks with a lot of metadata changes. For example,
workloads that involve creating, updating or deleting a large number of
files will benefit from the presence of a `special` device. ZFS
datasets can also be configured to store whole small files on the
`special` device, which can further improve the performance. Use fast
SSDs for the `special` device.

IMPORTANT: The redundancy of the `special` device should match the one
of the pool, since the `special` device is a point of failure for the
whole pool.

WARNING: Adding a `special` device to a pool cannot be undone!

.Create a pool with `special` device and RAID-1:

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
----

.Add a `special` device to an existing pool with RAID-1:

----
# zpool add <pool> special mirror <device1> <device2>
----

ZFS datasets expose the `special_small_blocks=<size>` property. `size`
can be `0` to disable storing small file blocks on the `special`
device, or a power of two in the range from `512B` to `128K`. After
setting the property, new file blocks smaller than `size` will be
allocated on the `special` device.

IMPORTANT: If the value for `special_small_blocks` is greater than or
equal to the `recordsize` (default `128K`) of the dataset, *all* data
will be written to the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the
default value of that property for all child ZFS datasets (for example,
all containers in the pool will opt in for small file blocks).

.Opt in for all files smaller than 4K pool-wide:

----
# zfs set special_small_blocks=4K <pool>
----

.Opt in for small file blocks for a single dataset:

----
# zfs set special_small_blocks=4K <pool>/<filesystem>
----

.Opt out from small file blocks for a single dataset:

----
# zfs set special_small_blocks=0 <pool>/<filesystem>
----

[[sysadmin_zfs_features]]
ZFS Pool Features
~~~~~~~~~~~~~~~~~

Changes to the on-disk format in ZFS are only made between major
version changes and are specified through *features*. All features, as
well as the general mechanism, are well documented in the
`zpool-features(5)` manpage.

Since enabling new features can render a pool not importable by an
older version of ZFS, this needs to be done actively by the
administrator, by running `zpool upgrade` on the pool (see the
`zpool-upgrade(8)` manpage).

Unless you need to use one of the new features, there is no upside to
enabling them.

In fact, there are some downsides to enabling new features:

* A system with root on ZFS, that still boots using `grub`, will become
  unbootable if a new feature is active on the rpool, due to the
  incompatible implementation of ZFS in grub.
* The system will not be able to import any upgraded pool when booted
  with an older kernel, which still ships with the old ZFS modules.
* Booting an older {pve} ISO to repair a non-booting system will
  likewise not work.

IMPORTANT: Do *not* upgrade your rpool if your system is still booted
with `grub`, as this will render your system unbootable. This includes
systems installed before {pve} 5.4, and systems booting with legacy
BIOS boot (see
xref:sysboot_determine_bootloader_used[how to determine the bootloader]).

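Running `zpool upgrade` without any arguments only lists the pools
that do not yet have all supported features enabled, without changing
anything:

----
# zpool upgrade
----
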
.Enable new features for a ZFS pool:
----
# zpool upgrade <pool>
----