[[chapter_zfs]]
ZFS on Linux
------------

ifdef::wiki[]
:pve-toplevel:
endif::wiki[]

ZFS is a combined file system and logical volume manager designed by
Sun Microsystems. Starting with {pve} 3.4, the native Linux
kernel port of the ZFS file system is introduced as an optional
file system and also as an additional selection for the root
file system. There is no need to manually compile ZFS modules - all
packages are included.

By using ZFS, it is possible to achieve maximum enterprise features with
low-budget hardware, but also high-performance systems by leveraging
SSD caching or even SSD-only setups. ZFS can replace costly hardware
RAID cards with moderate CPU and memory load, combined with easy
management.

.General ZFS advantages

* Easy configuration and management with {pve} GUI and CLI.

* Reliable

* Protection against data corruption

* Data compression on file system level

* Snapshots

* Copy-on-write clone

* Various raid levels: RAID0, RAID1, RAID10, RAIDZ-1, RAIDZ-2, RAIDZ-3,
dRAID, dRAID2, dRAID3

* Can use SSD for cache

* Self healing

* Continuous integrity checking

* Designed for high storage capacities

* Asynchronous replication over network

* Open Source

* Encryption

* ...


Hardware
~~~~~~~~

ZFS depends heavily on memory, so you need at least 8 GB to start. In
practice, use as much as you can get for your hardware/budget. To prevent
data corruption, we recommend the use of high-quality ECC RAM.

If you use a dedicated cache and/or log disk, you should use an
enterprise-class SSD. This can increase the overall performance
significantly.

IMPORTANT: Do not use ZFS on top of a hardware RAID controller which has its
own cache management. ZFS needs to communicate directly with the disks. An
HBA adapter or something like an LSI controller flashed in ``IT'' mode is more
appropriate.

If you are experimenting with an installation of {pve} inside a VM
(Nested Virtualization), don't use `virtio` for the disks of that VM,
as they are not supported by ZFS. Use IDE or SCSI instead (this also works
with the `virtio` SCSI controller type).


Installation as Root File System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you install using the {pve} installer, you can choose ZFS for the
root file system. You need to select the RAID type at installation
time:

[horizontal]
RAID0:: Also called ``striping''. The capacity of such a volume is the sum
of the capacities of all disks. But RAID0 does not add any redundancy,
so the failure of a single drive makes the volume unusable.

RAID1:: Also called ``mirroring''. Data is written identically to all
disks. This mode requires at least 2 disks with the same size. The
resulting capacity is that of a single disk.

RAID10:: A combination of RAID0 and RAID1. Requires at least 4 disks.

RAIDZ-1:: A variation on RAID-5, single parity. Requires at least 3 disks.

RAIDZ-2:: A variation on RAID-5, double parity. Requires at least 4 disks.

RAIDZ-3:: A variation on RAID-5, triple parity. Requires at least 5 disks.

The installer automatically partitions the disks, creates a ZFS pool
called `rpool`, and installs the root file system on the ZFS subvolume
`rpool/ROOT/pve-1`.

Another subvolume called `rpool/data` is created to store VM
images. In order to use that with the {pve} tools, the installer
creates the following configuration entry in `/etc/pve/storage.cfg`:

----
zfspool: local-zfs
	pool rpool/data
	sparse
	content images,rootdir
----

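To verify that {pve} picked up this storage entry, you can query its status
with the `pvesm` tool (a sketch; the storage name `local-zfs` is the installer
default from the entry above):

----
# pvesm status
----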
After installation, you can view your ZFS pool status using the
`zpool` command:

----
# zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    sda2    ONLINE       0     0     0
	    sdb2    ONLINE       0     0     0
	  mirror-1  ONLINE       0     0     0
	    sdc     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0

errors: No known data errors
----

The `zfs` command is used to configure and manage your ZFS file systems. The
following command lists all file systems after installation:

----
# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool             4.94G  7.68T    96K  /rpool
rpool/ROOT         702M  7.68T    96K  /rpool/ROOT
rpool/ROOT/pve-1   702M  7.68T   702M  /
rpool/data          96K  7.68T    96K  /rpool/data
rpool/swap        4.25G  7.69T    64K  -
----


[[sysadmin_zfs_raid_considerations]]
ZFS RAID Level Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are a few factors to take into consideration when choosing the layout of
a ZFS pool. The basic building block of a ZFS pool is the virtual device, or
`vdev`. All vdevs in a pool are used equally and the data is striped among them
(RAID0). Check the `zpoolconcepts(7)` manpage for more details on vdevs.

[[sysadmin_zfs_raid_performance]]
Performance
^^^^^^^^^^^

Each `vdev` type has different performance behaviors. The two
parameters of interest are the IOPS (Input/Output Operations per Second) and
the bandwidth with which data can be written or read.

A 'mirror' vdev (RAID1) will approximately behave like a single disk in regard
to both parameters when writing data. When reading data, the performance will
scale linearly with the number of disks in the mirror.

A common situation is to have 4 disks. When setting it up as 2 mirror vdevs
(RAID10), the pool will have the write characteristics of two single disks in
regard to IOPS and bandwidth. For read operations, it will resemble 4 single
disks.

A 'RAIDZ' of any redundancy level will approximately behave like a single disk
in regard to IOPS with a lot of bandwidth. How much bandwidth depends on the
size of the RAIDZ vdev and the redundancy level.

A 'dRAID' pool should match the performance of an equivalent 'RAIDZ' pool.

For running VMs, IOPS is the more important metric in most situations.


[[sysadmin_zfs_raid_size_space_usage_redundancy]]
Size, Space usage and Redundancy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While a pool made of 'mirror' vdevs will have the best performance
characteristics, the usable space will be 50% of the available disk capacity.
It is less if a mirror vdev consists of more than 2 disks, for example in a
3-way mirror. At least one healthy disk per mirror is needed for the pool to
stay functional.

The usable space of a 'RAIDZ' type vdev of N disks is roughly N-P, with P being
the RAIDZ-level. The RAIDZ-level indicates how many arbitrary disks can fail
without losing data. A special case is a 4 disk pool with RAIDZ2. In this
situation it is usually better to use 2 mirror vdevs for the better performance,
as the usable space will be the same.

Another important factor when using any RAIDZ level is how ZVOL datasets, which
are used for VM disks, behave. For each data block, the pool needs parity data
which is at least the size of the minimum block size defined by the `ashift`
value of the pool. With an ashift of 12, the block size of the pool is 4k. The
default block size for a ZVOL is 8k. Therefore, in a RAIDZ2, each 8k block
written will cause two additional 4k parity blocks to be written,
8k + 4k + 4k = 16k. This is of course a simplified approach, and the real
situation will be slightly different, with metadata, compression and such not
being accounted for in this example.

This behavior can be observed when checking the following properties of the
ZVOL:

* `volsize`
* `refreservation` (if the pool is not thin provisioned)
* `used` (if the pool is thin provisioned and without snapshots present)

----
# zfs get volsize,refreservation,used <pool>/vm-<vmid>-disk-X
----

`volsize` is the size of the disk as it is presented to the VM, while
`refreservation` shows the reserved space on the pool which includes the
expected space needed for the parity data. If the pool is thin provisioned, the
`refreservation` will be set to 0. Another way to observe the behavior is to
compare the used disk space within the VM and the `used` property. Be aware
that snapshots will skew the value.

There are a few options to counter the increased use of space:

* Increase the `volblocksize` to improve the data to parity ratio
* Use 'mirror' vdevs instead of 'RAIDZ'
* Use `ashift=9` (block size of 512 bytes)

The `volblocksize` property can only be set when creating a ZVOL. The default
value can be changed in the storage configuration. When doing this, the guest
needs to be tuned accordingly, and depending on the use case, the problem of
write amplification is just moved from the ZFS layer up to the guest.

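For example, assuming the installer-default `local-zfs` storage, the default
block size for newly created ZVOLs could be raised like this (a sketch; storage
name and value depend on your setup):

----
# pvesm set local-zfs --blocksize 16k
----

This only affects disks created afterwards; existing ZVOLs keep their
`volblocksize`.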
Using `ashift=9` when creating the pool can lead to bad
performance, depending on the disks underneath, and cannot be changed later on.

Mirror vdevs (RAID1, RAID10) have favorable behavior for VM workloads. Use
them, unless your environment has specific needs and characteristics where the
performance characteristics of RAIDZ are acceptable.


ZFS dRAID
~~~~~~~~~

In a ZFS dRAID (declustered RAID), the hot spare drive(s) participate in the
RAID. Their spare capacity is reserved and used for rebuilding when one drive
fails. Depending on the configuration, this provides faster rebuilding compared
to a RAIDZ in case of drive failure. More information can be found in the
official OpenZFS documentation. footnote:[OpenZFS dRAID
https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html]

NOTE: dRAID is intended for pools with more than 10-15 disks. For a lower
number of disks, a RAIDZ setup is usually the better choice.

NOTE: The GUI requires one more disk than the minimum (i.e. dRAID1 needs 3). It
expects that a spare disk is added as well.

* `dRAID1` or `dRAID`: requires at least 2 disks, one can fail before data is
lost
* `dRAID2`: requires at least 3 disks, two can fail before data is lost
* `dRAID3`: requires at least 4 disks, three can fail before data is lost


Additional information can be found on the manual page:

----
# man zpoolconcepts
----

Spares and Data
^^^^^^^^^^^^^^^
The number of `spares` tells the system how many disks it should keep ready in
case of a disk failure. The default value is 0 `spares`. Without spares,
rebuilding won't get any speed benefits.

`data` defines the number of devices in a redundancy group. The default value is
8, except when `disks - parity - spares` equals something less than 8, in which
case that lower number is used. In general, a smaller number of `data` devices
leads to higher IOPS, better compression ratios and faster resilvering, but
defining fewer data devices reduces the available storage capacity of the pool.


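The `parity`, `data` and `spares` values are given as part of the dRAID vdev
specification, `draid[<parity>][:<data>d][:<children>c][:<spares>s]` (see
`zpoolconcepts(7)`). As a sketch in the placeholder style used elsewhere in this
chapter, a dRAID2 pool over 12 disks with 4 data devices per redundancy group
and one distributed spare could look like this:

----
# zpool create -f -o ashift=12 <pool> draid2:4d:12c:1s <device1> ... <device12>
----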
Bootloader
~~~~~~~~~~

{pve} uses xref:sysboot_proxmox_boot_tool[`proxmox-boot-tool`] to manage the
bootloader configuration.
See the chapter on xref:sysboot[{pve} host bootloaders] for details.


ZFS Administration
~~~~~~~~~~~~~~~~~~

This section gives you some usage examples for common tasks. ZFS
itself is really powerful and provides many options. The main commands
to manage ZFS are `zfs` and `zpool`. Both commands come with great
manual pages, which can be read with:

----
# man zpool
# man zfs
----

[[sysadmin_zfs_create_new_zpool]]
Create a new zpool
^^^^^^^^^^^^^^^^^^

To create a new pool, at least one disk is needed. The `ashift` should
correspond to a sector size (2 to the power of `ashift` bytes) equal to or
larger than the sector size of the underlying disk.

----
# zpool create -f -o ashift=12 <pool> <device>
----

[TIP]
====
Pool names must adhere to the following rules:

* begin with a letter (a-z or A-Z)
* contain only alphanumeric, `-`, `_`, `.`, `:` or ` ` (space) characters
* must *not begin* with one of `mirror`, `raidz`, `draid` or `spare`
* must not be `log`
====

To activate compression (see section <<zfs_compression,Compression in ZFS>>):

----
# zfs set compression=lz4 <pool>
----

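To verify the setting and see the achieved compression ratio afterwards, you
can query the corresponding properties (a sketch):

----
# zfs get compression,compressratio <pool>
----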
[[sysadmin_zfs_create_new_zpool_raid0]]
Create a new pool with RAID-0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 1 disk

----
# zpool create -f -o ashift=12 <pool> <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid1]]
Create a new pool with RAID-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 2 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2>
----

[[sysadmin_zfs_create_new_zpool_raid10]]
Create a new pool with RAID-10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> mirror <device3> <device4>
----

[[sysadmin_zfs_create_new_zpool_raidz1]]
Create a new pool with RAIDZ-1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 3 disks

----
# zpool create -f -o ashift=12 <pool> raidz1 <device1> <device2> <device3>
----

Create a new pool with RAIDZ-2
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Minimum 4 disks

----
# zpool create -f -o ashift=12 <pool> raidz2 <device1> <device2> <device3> <device4>
----

Please read the section
xref:sysadmin_zfs_raid_considerations[ZFS RAID Level Considerations]
to get a rough estimate of the IOPS and bandwidth to expect before setting up
a pool, especially when wanting to use a RAID-Z mode.

[[sysadmin_zfs_create_new_zpool_with_cache]]
Create a new pool with cache (L2ARC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated device, or partition, as second-level cache to
increase the performance. Such a cache device will especially help with
random-read workloads of data that is mostly static. As it acts as an additional
caching layer between the actual storage and the in-memory ARC, it can also
help if the ARC must be reduced due to memory constraints.

.Create ZFS pool with an on-disk cache
----
# zpool create -f -o ashift=12 <pool> <device> cache <cache-device>
----

Here only a single `<device>` and a single `<cache-device>` were used, but it is
possible to use more devices, as shown in
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID].

Note that for cache devices no mirror or RAID modes exist; they are all simply
accumulated.

If any cache device produces errors on read, ZFS will transparently divert that
request to the underlying storage layer.


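Whether a cache device is actually being used can be observed with
`zpool iostat`, which lists cache devices in a separate section of its
per-vdev output (a sketch):

----
# zpool iostat -v <pool>
----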
[[sysadmin_zfs_create_new_zpool_with_log]]
Create a new pool with log (ZIL)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is possible to use a dedicated drive, or partition, for the ZFS Intent Log
(ZIL). It is mainly used to provide safe synchronous transactions, often in
performance-critical paths like databases, or other programs that issue `fsync`
operations frequently.

The pool is used as the default ZIL location. Diverting the ZIL IO load to a
separate device can help to reduce transaction latencies while relieving the
main pool at the same time, increasing overall performance.

For disks to be used as log devices, directly or through a partition, it's
recommended to:

- Use fast SSDs with power-loss protection, as those have much smaller commit
latencies.

- Use at least a few GB for the partition (or whole device), but using more than
half of your installed memory won't provide you with any real advantage.

.Create ZFS pool with separate log device
----
# zpool create -f -o ashift=12 <pool> <device> log <log-device>
----

In the above example, a single `<device>` and a single `<log-device>` are used,
but you can also combine this with other RAID variants, as described in the
xref:sysadmin_zfs_create_new_zpool_raid0[Create a new pool with RAID] section.

You can also mirror the log device to multiple devices; this is mainly useful to
ensure that performance doesn't immediately degrade if a single log device
fails.

If all log devices fail, the ZFS main pool itself will be used again, until the
log device(s) get replaced.

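A mirrored log can be requested at creation time by prefixing the log devices
with the `mirror` keyword (a sketch in the placeholder style used above):

.Create ZFS pool with a mirrored log device
----
# zpool create -f -o ashift=12 <pool> <device> log mirror <log-device1> <log-device2>
----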
[[sysadmin_zfs_add_cache_and_log_dev]]
Add cache and log to an existing pool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have a pool without cache and log, you can still add both, or just one of
them, at any time.

For example, let's assume you got a good enterprise SSD with power-loss
protection that you want to use for improving the overall performance of your
pool.

As the maximum size of a log device should be about half the size of the
installed physical memory, it means that the ZIL will most likely only take up
a relatively small part of the SSD; the remaining space can be used as cache.

First you have to create two GPT partitions on the SSD with `parted` or `gdisk`.

Then you're ready to add them to a pool:

.Add both, a separate log device and a second-level cache, to an existing pool
----
# zpool add -f <pool> log <device-part1> cache <device-part2>
----

Just replace `<pool>`, `<device-part1>` and `<device-part2>` with the pool name
and the two `/dev/disk/by-id/` paths to the partitions.

You can also add ZIL and cache separately.

.Add a log device to an existing ZFS pool
----
# zpool add <pool> log <log-device>
----


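Analogously, a cache device can be added on its own (a sketch):

.Add a cache device to an existing ZFS pool
----
# zpool add <pool> cache <cache-device>
----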
[[sysadmin_zfs_change_failed_dev]]
Changing a failed device
^^^^^^^^^^^^^^^^^^^^^^^^

----
# zpool replace -f <pool> <old-device> <new-device>
----

.Changing a failed bootable device

Depending on how {pve} was installed, it is either using `systemd-boot` or GRUB
through `proxmox-boot-tool` footnote:[Systems installed with {pve} 6.4 or later,
EFI systems installed with {pve} 5.4 or later] or plain GRUB as bootloader (see
xref:sysboot[Host Bootloader]). You can check by running:

----
# proxmox-boot-tool status
----

The first steps of copying the partition table, reissuing GUIDs and replacing
the ZFS partition are the same. To make the system bootable from the new disk,
different steps are needed which depend on the bootloader in use.

----
# sgdisk <healthy bootable device> -R <new device>
# sgdisk -G <new device>
# zpool replace -f <pool> <old zfs partition> <new zfs partition>
----

NOTE: Use the `zpool status -v` command to monitor how far the resilvering
process of the new disk has progressed.

.With `proxmox-boot-tool`:

----
# proxmox-boot-tool format <new disk's ESP>
# proxmox-boot-tool init <new disk's ESP> [grub]
----

NOTE: `ESP` stands for EFI System Partition, which is set up as partition #2 on
bootable disks set up by the {pve} installer since version 5.4. For details, see
xref:sysboot_proxmox_boot_setup[Setting up a new partition for use as synced ESP].

NOTE: Make sure to pass 'grub' as mode to `proxmox-boot-tool init` if
`proxmox-boot-tool status` indicates your current disks are using GRUB,
especially if Secure Boot is enabled!

.With plain GRUB:

----
# grub-install <new disk>
----
NOTE: Plain GRUB is only used on systems installed with {pve} 6.3 or earlier,
which have not been manually migrated to using `proxmox-boot-tool` yet.


Configure E-Mail Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ZFS comes with an event daemon `ZED`, which monitors events generated by the ZFS
kernel module. The daemon can also send emails on ZFS events like pool errors.
Newer ZFS packages ship the daemon in a separate `zfs-zed` package, which should
already be installed by default in {pve}.

You can configure the daemon via the file `/etc/zfs/zed.d/zed.rc` with your
favorite editor. The required setting for email notification is
`ZED_EMAIL_ADDR`, which is set to `root` by default.

--------
ZED_EMAIL_ADDR="root"
--------

Please note that {pve} forwards mails sent to `root` to the email address
configured for the root user.

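After changing `/etc/zfs/zed.d/zed.rc`, restart the daemon so it picks up the
new configuration (a sketch, assuming the standard `zfs-zed` systemd service):

----
# systemctl restart zfs-zed
----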

[[sysadmin_zfs_limit_memory_usage]]
Limit ZFS Memory Usage
~~~~~~~~~~~~~~~~~~~~~~

ZFS uses '50 %' of the host memory for the **A**daptive **R**eplacement
**C**ache (ARC) by default. For new installations starting with {pve} 8.1, the
ARC usage limit will be set to '10 %' of the installed physical memory, clamped
to a maximum of +16 GiB+. This value is written to `/etc/modprobe.d/zfs.conf`.

Allocating enough memory for the ARC is crucial for IO performance, so reduce it
with caution. As a general rule of thumb, allocate at least +2 GiB Base + 1
GiB/TiB-Storage+. For example, if you have a pool with +8 TiB+ of available
storage space, then you should use +10 GiB+ of memory for the ARC.

ZFS also enforces a minimum value of +64 MiB+.

You can change the ARC usage limit for the current boot (a reboot resets this
change again) by writing to the +zfs_arc_max+ module parameter directly:

----
echo "$[10 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

To *permanently change* the ARC limits, add (or change if already present) the
following line to `/etc/modprobe.d/zfs.conf`:

--------
options zfs zfs_arc_max=8589934592
--------

This example setting limits the usage to 8 GiB ('8 * 2^30^').

IMPORTANT: In case your desired +zfs_arc_max+ value is lower than or equal to
+zfs_arc_min+ (which defaults to 1/32 of the system memory), +zfs_arc_max+ will
be ignored unless you also set +zfs_arc_min+ to at most +zfs_arc_max - 1+.

----
echo "$[8 * 1024*1024*1024 - 1]" >/sys/module/zfs/parameters/zfs_arc_min
echo "$[8 * 1024*1024*1024]" >/sys/module/zfs/parameters/zfs_arc_max
----

This example setting (temporarily) limits the usage to 8 GiB ('8 * 2^30^') on
systems with more than 256 GiB of total memory, where simply setting
+zfs_arc_max+ alone would not work.

[IMPORTANT]
====
If your root file system is ZFS, you must update your initramfs every
time this value changes:

----
# update-initramfs -u -k all
----

You *must reboot* to activate these changes.
====


[[zfs_swap]]
SWAP on ZFS
~~~~~~~~~~~

Swap space created on a zvol may cause some problems, like blocking the
server or generating a high IO load, often seen when starting a backup
to an external storage.

We strongly recommend using enough memory, so that you normally do not
run into low-memory situations. Should you need or want to add swap, it is
preferred to create a partition on a physical disk and use it as a swap device.
You can leave some space free for this purpose in the advanced options of the
installer. Additionally, you can lower the ``swappiness'' value. A good value
for servers is 10:

----
# sysctl -w vm.swappiness=10
----

To make the swappiness persistent, open `/etc/sysctl.conf` with
an editor of your choice and add the following line:

--------
vm.swappiness = 10
--------

.Linux kernel `swappiness` parameter values
[width="100%",cols="<m,2d",options="header"]
|===========================================================
| Value               | Strategy
| vm.swappiness = 0   | The kernel will swap only to avoid an 'out of memory' condition
| vm.swappiness = 1   | Minimum amount of swapping without disabling it entirely.
| vm.swappiness = 10  | This value is sometimes recommended to improve performance when sufficient memory exists in a system.
| vm.swappiness = 60  | The default value.
| vm.swappiness = 100 | The kernel will swap aggressively.
|===========================================================

[[zfs_encryption]]
Encrypted ZFS Datasets
~~~~~~~~~~~~~~~~~~~~~~

WARNING: Native ZFS encryption in {pve} is experimental. Known limitations and
issues include Replication with encrypted datasets
footnote:[https://bugzilla.proxmox.com/show_bug.cgi?id=2350],
as well as checksum errors when using Snapshots or ZVOLs.
footnote:[https://github.com/openzfs/zfs/issues/11688]

ZFS on Linux version 0.8.0 introduced support for native encryption of
datasets. After an upgrade from previous ZFS on Linux versions, the encryption
feature can be enabled per pool:

----
# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  disabled         local

# zpool set feature@encryption=enabled tank

# zpool get feature@encryption tank
NAME  PROPERTY            VALUE            SOURCE
tank  feature@encryption  enabled          local
----

691 | WARNING: There is currently no support for booting from pools with encrypted | |
7c73a209 | 692 | datasets using GRUB, and only limited support for automatically unlocking |
cca0540e FG |
693 | encrypted datasets on boot. Older versions of ZFS without encryption support |
694 | will not be able to decrypt stored data. | |
695 | ||
696 | NOTE: It is recommended to either unlock storage datasets manually after | |
697 | booting, or to write a custom unit to pass the key material needed for | |
698 | unlocking on boot to `zfs load-key`. | |
699 | ||
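Such a unit could look like the following sketch. It assumes that the
`keylocation` property of the encrypted datasets points to a keyfile that is
readable early during boot (a passphrase-based key would block waiting for
input); the unit name and dependencies are assumptions to adapt to your setup:

----
# /etc/systemd/system/zfs-load-key.service (example name)
[Unit]
Description=Load ZFS encryption keys
DefaultDependencies=no
Before=zfs-mount.service
After=zfs-import.target

[Service]
Type=oneshot
RemainAfterExit=yes
# Load the keys for all datasets whose keylocation is reachable at this point
ExecStart=/usr/sbin/zfs load-key -a

[Install]
WantedBy=zfs-mount.service
----
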
WARNING: Establish and test a backup procedure before enabling encryption of
production data. If the associated key material/passphrase/keyfile has been
lost, accessing the encrypted data is no longer possible.

Encryption needs to be set up when creating datasets/zvols, and is by default
inherited by child datasets. For example, to create an encrypted dataset
`tank/encrypted_data` and configure it as storage in {pve}, run the following
commands:

----
# zfs create -o encryption=on -o keyformat=passphrase tank/encrypted_data
Enter passphrase:
Re-enter passphrase:

# pvesm add zfspool encrypted_zfs -pool tank/encrypted_data
----

All guest volumes/disks created on this storage will be encrypted with the
shared key material of the parent dataset.

To actually use the storage, the associated key material needs to be loaded
and the dataset needs to be mounted. This can be done in one step with:

----
# zfs mount -l tank/encrypted_data
Enter passphrase for 'tank/encrypted_data':
----

It is also possible to use a (random) keyfile instead of prompting for a
passphrase by setting the `keylocation` and `keyformat` properties, either at
creation time or with `zfs change-key` on existing datasets:

----
# dd if=/dev/urandom of=/path/to/keyfile bs=32 count=1

# zfs change-key -o keyformat=raw -o keylocation=file:///path/to/keyfile tank/encrypted_data
----

WARNING: When using a keyfile, special care needs to be taken to secure the
keyfile against unauthorized access or accidental loss. Without the keyfile, it
is not possible to access the plaintext data!

A guest volume created underneath an encrypted dataset will have its
`encryptionroot` property set accordingly. The key material only needs to be
loaded once per encryptionroot to be available to all encrypted datasets
underneath it.

See the `encryptionroot`, `encryption`, `keylocation`, `keyformat` and
`keystatus` properties, the `zfs load-key`, `zfs unload-key` and `zfs
change-key` commands and the `Encryption` section from `man zfs` for more
details and advanced usage.
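
For example, to check whether the key for the dataset created above is
currently loaded, and to unload it again after unmounting (the output shown is
only illustrative):

----
# zfs get keystatus,encryptionroot tank/encrypted_data
NAME                 PROPERTY        VALUE                SOURCE
tank/encrypted_data  keystatus       available            -
tank/encrypted_data  encryptionroot  tank/encrypted_data  -

# zfs unmount tank/encrypted_data
# zfs unload-key tank/encrypted_data
----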


[[zfs_compression]]
Compression in ZFS
~~~~~~~~~~~~~~~~~~

When compression is enabled on a dataset, ZFS tries to compress all *new*
blocks before writing them and decompresses them on reading. Already
existing data will not be compressed retroactively.

You can enable compression with:

----
# zfs set compression=<algorithm> <dataset>
----

We recommend using the `lz4` algorithm, because it adds very little CPU
overhead. Other algorithms like `lzjb` and `gzip-N`, where `N` is an
integer from `1` (fastest) to `9` (best compression ratio), are also
available. Depending on the algorithm and how compressible the data is,
having compression enabled can even increase I/O performance.

You can disable compression at any time with:

----
# zfs set compression=off <dataset>
----

Again, only new blocks will be affected by this change.
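
To judge how effective compression is for the data already written, you can
inspect the read-only `compressratio` property alongside the configured
algorithm:

----
# zfs get compression,compressratio <dataset>
----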


[[sysadmin_zfs_special_device]]
ZFS Special Device
~~~~~~~~~~~~~~~~~~

Since version 0.8.0 ZFS supports `special` devices. A `special` device in a
pool is used to store metadata, deduplication tables, and optionally small
file blocks.

A `special` device can improve the speed of a pool consisting of slow spinning
hard disks with a lot of metadata changes. For example, workloads that involve
creating, updating or deleting a large number of files will benefit from the
presence of a `special` device. ZFS datasets can also be configured to store
whole small files on the `special` device, which can further improve
performance. Use fast SSDs for the `special` device.

IMPORTANT: The redundancy of the `special` device should match that of the
pool, since the `special` device is a point of failure for the whole pool.

WARNING: Adding a `special` device to a pool cannot be undone!

.Create a pool with `special` device and RAID-1:

----
# zpool create -f -o ashift=12 <pool> mirror <device1> <device2> special mirror <device3> <device4>
----

.Add a `special` device to an existing pool with RAID-1:

----
# zpool add <pool> special mirror <device1> <device2>
----

ZFS datasets expose the `special_small_blocks=<size>` property. `size` can be
`0` to disable storing small file blocks on the `special` device, or a power
of two from `512B` to `1M`. After setting the property, new file blocks
smaller than `size` will be allocated on the `special` device.

IMPORTANT: If the value for `special_small_blocks` is greater than or equal to
the `recordsize` (default `128K`) of the dataset, *all* data will be written to
the `special` device, so be careful!

Setting the `special_small_blocks` property on a pool will change the default
value of that property for all child ZFS datasets (for example all containers
in the pool will opt in for small file blocks).

.Opt in for all files smaller than 4K pool-wide:

----
# zfs set special_small_blocks=4K <pool>
----

.Opt in for small file blocks for a single dataset:

----
# zfs set special_small_blocks=4K <pool>/<filesystem>
----

.Opt out from small file blocks for a single dataset:

----
# zfs set special_small_blocks=0 <pool>/<filesystem>
----
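
To verify which value is in effect for the pool and every dataset below it,
the property can be listed recursively:

----
# zfs get -r special_small_blocks <pool>
----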

[[sysadmin_zfs_features]]
ZFS Pool Features
~~~~~~~~~~~~~~~~~

Changes to the on-disk format in ZFS are only made between major version changes
and are specified through *features*. All features, as well as the general
mechanism, are well documented in the `zpool-features(5)` manpage.

Since enabling new features can render a pool not importable by an older version
of ZFS, this needs to be done actively by the administrator, by running
`zpool upgrade` on the pool (see the `zpool-upgrade(8)` manpage).

Unless you need to use one of the new features, there is no upside to enabling
them.

In fact, there are some downsides to enabling new features:

* A system with root on ZFS that still boots using GRUB will become
  unbootable if a new feature is active on the rpool, due to the incompatible
  implementation of ZFS in GRUB.
* The system will not be able to import any upgraded pool when booted with an
  older kernel, which still ships with the old ZFS modules.
* Booting an older {pve} ISO to repair a non-booting system will likewise not
  work.

IMPORTANT: Do *not* upgrade your rpool if your system is still booted with
GRUB, as this will render your system unbootable. This includes systems
installed before {pve} 5.4, and systems booting with legacy BIOS boot (see
xref:sysboot_determine_bootloader_used[how to determine the bootloader]).

.Enable new features for a ZFS pool:
----
# zpool upgrade <pool>
----
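
To check beforehand which pools have features that could be enabled, run
`zpool upgrade` without arguments; `zpool upgrade -v` additionally lists the
features supported by the currently running ZFS version:

----
# zpool upgrade
# zpool upgrade -v
----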