From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Rob Norris <rob.norris@klarasystems.com>
Date: Tue, 18 Jul 2023 11:11:29 +1000
Subject: [PATCH] vdev_disk: rewrite BIO filling machinery to avoid split pages

This commit tackles a number of issues in the way BIOs (`struct bio`)
are constructed for submission to the Linux block layer.

The kernel has a hard upper limit on the number of pages/segments that
can be added to a BIO, as well as a separate limit for each device
(related to its queue depth and other scheduling characteristics).

ZFS counts the number of memory pages in the request ABD
(`abd_nr_pages_off()`), and then uses that as the number of segments to
put into the BIO, up to the hard upper limit. If it requires more than
the limit, it will create multiple BIOs.

Leaving aside the fact that the page count method is wrong (see below),
not limiting to the device segment max means that the device driver will
need to split the BIO in half. This alone is not necessarily a problem,
but it interacts with another issue to cause a much larger problem.

The kernel function to add a segment to a BIO (`bio_add_page()`) takes a
`struct page` pointer, and offset+len within it. `struct page` can
represent a run of contiguous memory pages (known as a "compound page").
It can be of arbitrary length.

The ZFS functions that count ABD pages and load them into the BIO
(`abd_nr_pages_off()`, `bio_map()` and `abd_bio_map_off()`) will never
consider a page to be more than `PAGE_SIZE` (4K), even if the `struct
page` is for multiple pages. In this case, it will load the same `struct
page` into the BIO multiple times, with the offset adjusted each time.

With a sufficiently large ABD, this can easily lead to the BIO being
entirely filled much earlier than it could have been. This also further
contributes to the problem caused by the incorrect segment limit
calculation, as it's much easier to go past the device limit, and so
require a split.

Again, this is not a problem on its own.

The logic for "never submit more than `PAGE_SIZE`" is actually a little
more subtle. It will actually never submit a buffer that crosses a 4K
page boundary.

In practice, this is fine, as most ABDs are scattered, that is, a list
of complete 4K pages, and so are loaded in as such.

Linear ABDs are typically allocated from slabs, and for small sizes they
are frequently not aligned to page boundaries. For example, a 12K
allocation can span four pages, eg:

     -- 4K -- -- 4K -- -- 4K -- -- 4K --
    |        |        |        |        |
       :## ######## ######## ######:        [1K, 4K, 4K, 3K]

Such an allocation would be loaded into a BIO as you see:

    [1K, 4K, 4K, 3K]

This tends not to be a problem in practice, because even if the BIO were
filled and needed to be split, each half would still have either a start
or end aligned to the logical block size of the device (assuming 4K at
least).

---

In ideal circumstances, these shortcomings don't cause any particular
problems. It's when they start to interact with other ZFS features that
things get interesting.

Aggregation will create a "gang" ABD, which is simply a list of other
ABDs. Iterating over a gang ABD is just iterating over each ABD within
it in turn.

Because the segments are simply loaded in order, we can end up with
uneven segments either side of the "gap" between the two ABDs. For
example, two 12K ABDs might be aggregated and then loaded as:

    [1K, 4K, 4K, 3K, 2K, 4K, 4K, 2K]

Should a split occur, each individual BIO can end up with either a
start or end offset that is not aligned to the logical block size, which
some drivers (eg SCSI) will reject. However, this tends not to happen
because the default aggregation limit usually keeps the BIO small enough
to not require more than one split, and most pages are actually full 4K
pages, so hitting an uneven gap is very rare anyway.

If the pool is under particular memory pressure, then an IO can be
broken down into a "gang block", a 512-byte block composed of a header
and up to three block pointers. Each points to a fragment of the
original write, or in turn, another gang block, breaking the original
data up over and over until space can be found in the pool for each of
them.

Each gang header is a separate 512-byte memory allocation from a slab,
which needs to be written down to disk. When the gang header is added to
the BIO, it's a single 512-byte segment.

Pulling all this together, consider a large aggregated write of gang
blocks. This results in a BIO containing lots of 512-byte segments.
Given our tendency to overfill the BIO, a split is likely, and most
possible split points will yield a pair of BIOs that are misaligned.
Drivers that care, like the SCSI driver, will reject them.

---

This commit is a substantial refactor and rewrite of much of `vdev_disk`
to sort all this out.

`vdev_bio_max_segs()` now returns the ideal maximum size for the device,
if available. There's also a tuneable `zfs_vdev_disk_max_segs` to
override this, to assist with testing.

We scan the ABD up front to count the number of pages within it, and to
confirm that if we submitted all those pages to one or more BIOs, it
could be split at any point without creating a misaligned BIO. If the
pages in the ABD are not usable (as in any of the above situations), the
ABD is linearised, and then checked again. This is the same technique
used in `vdev_geom` on FreeBSD, adjusted for Linux's variable page size
and allocator quirks.

`vbio_t` is a cleanup and enhancement of the old `dio_request_t`. The
idea is simply that it can hold all the state needed to create, submit
and return multiple BIOs, including all the refcounts, the ABD copy if
it was needed, and so on. Apart from what I hope is a clearer interface,
the major difference is that because we know how many BIOs we'll need up
front, we don't need the old overflow logic that would grow the BIO
array, throw away all the old work and restart. We can get it right from
the start.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Closes #15533
Closes #15588
(cherry picked from commit 06a196020e6f70d2fedbd4d0d05bbe0c1ac6e4d8)
---
 include/os/linux/kernel/linux/mod_compat.h |   1 +
 man/man4/zfs.4                             |  10 +-
 module/os/linux/zfs/vdev_disk.c            | 439 ++++++++++++++++++++-
 3 files changed, 447 insertions(+), 3 deletions(-)

diff --git a/include/os/linux/kernel/linux/mod_compat.h b/include/os/linux/kernel/linux/mod_compat.h
index 8e20a9613..039865b70 100644
--- a/include/os/linux/kernel/linux/mod_compat.h
+++ b/include/os/linux/kernel/linux/mod_compat.h
@@ -68,6 +68,7 @@ enum scope_prefix_types {
 	zfs_trim,
 	zfs_txg,
 	zfs_vdev,
+	zfs_vdev_disk,
 	zfs_vdev_file,
 	zfs_vdev_mirror,
 	zfs_vnops,
diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index 352990e02..b5679f2f0 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -2,6 +2,7 @@
 .\" Copyright (c) 2013 by Turbo Fredriksson <turbo@bayour.com>. All rights reserved.
 .\" Copyright (c) 2019, 2021 by Delphix. All rights reserved.
 .\" Copyright (c) 2019 Datto Inc.
+.\" Copyright (c) 2023, 2024 Klara, Inc.
 .\" The contents of this file are subject to the terms of the Common Development
 .\" and Distribution License (the "License"). You may not use this file except
 .\" in compliance with the License. You can obtain a copy of the license at
@@ -15,7 +16,7 @@
 .\" own identifying information:
 .\" Portions Copyright [yyyy] [name of copyright owner]
 .\"
-.Dd July 21, 2023
+.Dd January 9, 2024
 .Dt ZFS 4
 .Os
 .
@@ -1345,6 +1346,13 @@ _
 4	Driver	No driver retries on driver errors.
 .TE
 .
+.It Sy zfs_vdev_disk_max_segs Ns = Ns Sy 0 Pq uint
+Maximum number of segments to add to a BIO (min 4).
+If this is higher than the maximum allowed by the device queue or the kernel
+itself, it will be clamped.
+Setting it to zero will cause the kernel's ideal size to be used.
+This parameter only applies on Linux.
+.
 .It Sy zfs_expire_snapshot Ns = Ns Sy 300 Ns s Pq int
 Time before expiring
 .Pa .zfs/snapshot .
diff --git a/module/os/linux/zfs/vdev_disk.c b/module/os/linux/zfs/vdev_disk.c
index de4dba72f..0ccb9ad96 100644
--- a/module/os/linux/zfs/vdev_disk.c
+++ b/module/os/linux/zfs/vdev_disk.c
@@ -24,6 +24,7 @@
  * Rewritten for Linux by Brian Behlendorf <behlendorf1@llnl.gov>.
  * LLNL-CODE-403049.
  * Copyright (c) 2012, 2019 by Delphix. All rights reserved.
+ * Copyright (c) 2023, 2024, Klara Inc.
  */
 
 #include <sys/zfs_context.h>
@@ -66,6 +67,13 @@ typedef struct vdev_disk {
 	krwlock_t vd_lock;
 } vdev_disk_t;
 
+/*
+ * Maximum number of segments to add to a bio (min 4). If this is higher than
+ * the maximum allowed by the device queue or the kernel itself, it will be
+ * clamped. Setting it to zero will cause the kernel's ideal size to be used.
+ */
+uint_t zfs_vdev_disk_max_segs = 0;
+
 /*
  * Unique identifier for the exclusive vdev holder.
  */
@@ -607,10 +615,433 @@ vdev_bio_alloc(struct block_device *bdev, gfp_t gfp_mask,
 	return (bio);
 }
 
+static inline uint_t
+vdev_bio_max_segs(struct block_device *bdev)
+{
+	/*
+	 * Smallest of the device max segs and the tuneable max segs. Minimum
+	 * 4, so there's room to finish split pages if they come up.
+	 */
+	const uint_t dev_max_segs = queue_max_segments(bdev_get_queue(bdev));
+	const uint_t tune_max_segs = (zfs_vdev_disk_max_segs > 0) ?
+	    MAX(4, zfs_vdev_disk_max_segs) : dev_max_segs;
+	const uint_t max_segs = MIN(tune_max_segs, dev_max_segs);
+
+#ifdef HAVE_BIO_MAX_SEGS
+	return (bio_max_segs(max_segs));
+#else
+	return (MIN(max_segs, BIO_MAX_PAGES));
+#endif
+}
+
+static inline uint_t
+vdev_bio_max_bytes(struct block_device *bdev)
+{
+	return (queue_max_sectors(bdev_get_queue(bdev)) << 9);
+}
+
+
+/*
+ * Virtual block IO object (VBIO)
+ *
+ * Linux block IO (BIO) objects have a limit on how many data segments (pages)
+ * they can hold. Depending on how they're allocated and structured, a large
+ * ZIO can require more than one BIO to be submitted to the kernel, which then
+ * all have to complete before we can return the completed ZIO back to ZFS.
+ *
+ * A VBIO is a wrapper around multiple BIOs, carrying everything needed to
+ * translate a ZIO down into the kernel block layer and back again.
+ *
+ * Note that these are only used for data ZIOs (read/write). Meta-operations
+ * (flush/trim) don't need multiple BIOs and so can just make the call
+ * directly.
+ */
+typedef struct {
+	zio_t		*vbio_zio;	/* parent zio */
+
+	struct block_device *vbio_bdev;	/* blockdev to submit bios to */
+
+	abd_t		*vbio_abd;	/* abd carrying borrowed linear buf */
+
+	atomic_t	vbio_ref;	/* bio refcount */
+	int		vbio_error;	/* error from failed bio */
+
+	uint_t		vbio_max_segs;	/* max segs per bio */
+
+	uint_t		vbio_max_bytes;	/* max bytes per bio */
+	uint_t		vbio_lbs_mask;	/* logical block size mask */
+
+	uint64_t	vbio_offset;	/* start offset of next bio */
+
+	struct bio	*vbio_bio;	/* pointer to the current bio */
+	struct bio	*vbio_bios;	/* list of all bios */
+} vbio_t;
+
+static vbio_t *
+vbio_alloc(zio_t *zio, struct block_device *bdev)
+{
+	vbio_t *vbio = kmem_zalloc(sizeof (vbio_t), KM_SLEEP);
+
+	vbio->vbio_zio = zio;
+	vbio->vbio_bdev = bdev;
+	atomic_set(&vbio->vbio_ref, 0);
+	vbio->vbio_max_segs = vdev_bio_max_segs(bdev);
+	vbio->vbio_max_bytes = vdev_bio_max_bytes(bdev);
+	vbio->vbio_lbs_mask = ~(bdev_logical_block_size(bdev) - 1);
+	vbio->vbio_offset = zio->io_offset;
+
+	return (vbio);
+}
+
+static int
+vbio_add_page(vbio_t *vbio, struct page *page, uint_t size, uint_t offset)
+{
+	struct bio *bio;
+	uint_t ssize;
+
+	while (size > 0) {
+		bio = vbio->vbio_bio;
+		if (bio == NULL) {
+			/* New BIO, allocate and set up */
+			bio = vdev_bio_alloc(vbio->vbio_bdev, GFP_NOIO,
+			    vbio->vbio_max_segs);
+			if (unlikely(bio == NULL))
+				return (SET_ERROR(ENOMEM));
+			BIO_BI_SECTOR(bio) = vbio->vbio_offset >> 9;
+
+			bio->bi_next = vbio->vbio_bios;
+			vbio->vbio_bios = vbio->vbio_bio = bio;
+		}
+
+		/*
+		 * Only load as much of the current page data as will fit in
+		 * the space left in the BIO, respecting lbs alignment. Older
+		 * kernels will error if we try to overfill the BIO, while
+		 * newer ones will accept it and split the BIO. This ensures
+		 * everything works on older kernels, and avoids an additional
+		 * overhead on the new.
+		 */
+		ssize = MIN(size, (vbio->vbio_max_bytes - BIO_BI_SIZE(bio)) &
+		    vbio->vbio_lbs_mask);
+		if (ssize > 0 &&
+		    bio_add_page(bio, page, ssize, offset) == ssize) {
+			/* Accepted, adjust and load any remaining. */
+			size -= ssize;
+			offset += ssize;
+			continue;
+		}
+
+		/* No room, set up for a new BIO and loop */
+		vbio->vbio_offset += BIO_BI_SIZE(bio);
+
+		/* Signal new BIO allocation wanted */
+		vbio->vbio_bio = NULL;
+	}
+
+	return (0);
+}
+
+BIO_END_IO_PROTO(vdev_disk_io_rw_completion, bio, error);
+static void vbio_put(vbio_t *vbio);
+
+static void
+vbio_submit(vbio_t *vbio, int flags)
+{
+	ASSERT(vbio->vbio_bios);
+	struct bio *bio = vbio->vbio_bios;
+	vbio->vbio_bio = vbio->vbio_bios = NULL;
+
+	/*
+	 * We take a reference for each BIO as we submit it, plus one to
+	 * protect us from BIOs completing before we're done submitting them
+	 * all, causing vbio_put() to free vbio out from under us and/or the
+	 * zio to be returned before all its IO has completed.
+	 */
+	atomic_set(&vbio->vbio_ref, 1);
+
+	/*
+	 * If we're submitting more than one BIO, inform the block layer so
+	 * it can batch them if it wants.
+	 */
+	struct blk_plug plug;
+	boolean_t do_plug = (bio->bi_next != NULL);
+	if (do_plug)
+		blk_start_plug(&plug);
+
+	/* Submit all the BIOs */
+	while (bio != NULL) {
+		atomic_inc(&vbio->vbio_ref);
+
+		struct bio *next = bio->bi_next;
+		bio->bi_next = NULL;
+
+		bio->bi_end_io = vdev_disk_io_rw_completion;
+		bio->bi_private = vbio;
+		bio_set_op_attrs(bio,
+		    vbio->vbio_zio->io_type == ZIO_TYPE_WRITE ?
+		    WRITE : READ, flags);
+
+		vdev_submit_bio(bio);
+
+		bio = next;
+	}
+
+	/* Finish the batch */
+	if (do_plug)
+		blk_finish_plug(&plug);
+
+	/* Release the extra reference */
+	vbio_put(vbio);
+}
+
+static void
+vbio_return_abd(vbio_t *vbio)
+{
+	zio_t *zio = vbio->vbio_zio;
+	if (vbio->vbio_abd == NULL)
+		return;
+
+	/*
+	 * If we copied the ABD before issuing it, clean up and return the copy
+	 * to the ABD, with changes if appropriate.
+	 */
+	void *buf = abd_to_buf(vbio->vbio_abd);
+	abd_free(vbio->vbio_abd);
+	vbio->vbio_abd = NULL;
+
+	if (zio->io_type == ZIO_TYPE_READ)
+		abd_return_buf_copy(zio->io_abd, buf, zio->io_size);
+	else
+		abd_return_buf(zio->io_abd, buf, zio->io_size);
+}
+
+static void
+vbio_free(vbio_t *vbio)
+{
+	VERIFY0(atomic_read(&vbio->vbio_ref));
+
+	vbio_return_abd(vbio);
+
+	kmem_free(vbio, sizeof (vbio_t));
+}
+
+static void
+vbio_put(vbio_t *vbio)
+{
+	if (atomic_dec_return(&vbio->vbio_ref) > 0)
+		return;
+
+	/*
+	 * This was the last reference, so the entire IO is completed. Clean
+	 * up and submit it for processing.
+	 */
+
+	/*
+	 * Get any data buf back to the original ABD, if necessary. We do this
+	 * now so we can get the ZIO into the pipeline as quickly as possible,
+	 * and then do the remaining cleanup after.
+	 */
+	vbio_return_abd(vbio);
+
+	zio_t *zio = vbio->vbio_zio;
+
+	/*
+	 * Set the overall error. If multiple BIOs returned an error, only the
+	 * first will be taken; the others are dropped (see
+	 * vdev_disk_io_rw_completion()). It's pretty much impossible for
+	 * multiple IOs to the same device to fail with different errors, so
+	 * there's no real risk.
+	 */
+	zio->io_error = vbio->vbio_error;
+	if (zio->io_error)
+		vdev_disk_error(zio);
+
+	/* All done, submit for processing */
+	zio_delay_interrupt(zio);
+
+	/* Finish cleanup */
+	vbio_free(vbio);
+}
+
+BIO_END_IO_PROTO(vdev_disk_io_rw_completion, bio, error)
+{
+	vbio_t *vbio = bio->bi_private;
+
+	if (vbio->vbio_error == 0) {
+#ifdef HAVE_1ARG_BIO_END_IO_T
+		vbio->vbio_error = BIO_END_IO_ERROR(bio);
+#else
+		if (error)
+			vbio->vbio_error = -(error);
+		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+			vbio->vbio_error = EIO;
+#endif
+	}
+
+	/*
+	 * Destroy the BIO. This is safe to do; the vbio owns its data and the
+	 * kernel won't touch it again after the completion function runs.
+	 */
+	bio_put(bio);
+
+	/* Drop this BIO's reference acquired by vbio_submit() */
+	vbio_put(vbio);
+}
+
+/*
+ * Iterator callback to count ABD pages and check their size & alignment.
+ *
+ * On Linux, each BIO segment can take a page pointer, and an offset+length of
+ * the data within that page. A page can be arbitrarily large ("compound"
+ * pages) but we still have to ensure the data portion is correctly sized and
+ * aligned to the logical block size, to ensure that if the kernel wants to
+ * split the BIO, the two halves will still be properly aligned.
+ */
+typedef struct {
+	uint_t	bmask;
+	uint_t	npages;
+	uint_t	end;
+} vdev_disk_check_pages_t;
+
+static int
+vdev_disk_check_pages_cb(struct page *page, size_t off, size_t len, void *priv)
+{
+	vdev_disk_check_pages_t *s = priv;
+
+	/*
+	 * If we didn't finish on a block size boundary last time, then there
+	 * would be a gap if we tried to use this ABD as-is, so abort.
+	 */
+	if (s->end != 0)
+		return (1);
+
+	/*
+	 * Note if we're taking less than a full block, so we can check it
+	 * above on the next call.
+	 */
+	s->end = len & s->bmask;
+
+	/* All blocks after the first must start on a block size boundary. */
+	if (s->npages != 0 && (off & s->bmask) != 0)
+		return (1);
+
+	s->npages++;
+	return (0);
+}
+
+/*
+ * Check if we can submit the pages in this ABD to the kernel as-is.
+ * Returns B_TRUE if they can be submitted directly, B_FALSE if they
+ * can't and the ABD needs to be linearised first.
+ */
+static boolean_t
+vdev_disk_check_pages(abd_t *abd, uint64_t size, struct block_device *bdev)
+{
+	vdev_disk_check_pages_t s = {
+		.bmask = bdev_logical_block_size(bdev) - 1,
+		.npages = 0,
+		.end = 0,
+	};
+
+	if (abd_iterate_page_func(abd, 0, size, vdev_disk_check_pages_cb, &s))
+		return (B_FALSE);
+
+	return (B_TRUE);
+}
+
+/* Iterator callback to submit ABD pages to the vbio. */
+static int
+vdev_disk_fill_vbio_cb(struct page *page, size_t off, size_t len, void *priv)
+{
+	vbio_t *vbio = priv;
+	return (vbio_add_page(vbio, page, len, off));
+}
+
+static int
+vdev_disk_io_rw(zio_t *zio)
+{
+	vdev_t *v = zio->io_vd;
+	vdev_disk_t *vd = v->vdev_tsd;
+	struct block_device *bdev = BDH_BDEV(vd->vd_bdh);
+	int flags = 0;
+
+	/*
+	 * Accessing outside the block device is never allowed.
+	 */
+	if (zio->io_offset + zio->io_size > bdev->bd_inode->i_size) {
+		vdev_dbgmsg(zio->io_vd,
+		    "Illegal access %llu size %llu, device size %llu",
+		    (u_longlong_t)zio->io_offset,
+		    (u_longlong_t)zio->io_size,
+		    (u_longlong_t)i_size_read(bdev->bd_inode));
+		return (SET_ERROR(EIO));
+	}
+
+	if (!(zio->io_flags & (ZIO_FLAG_IO_RETRY | ZIO_FLAG_TRYHARD)) &&
+	    v->vdev_failfast == B_TRUE) {
+		bio_set_flags_failfast(bdev, &flags, zfs_vdev_failfast_mask & 1,
+		    zfs_vdev_failfast_mask & 2, zfs_vdev_failfast_mask & 4);
+	}
+
+	/*
+	 * Check alignment of the incoming ABD. If any part of it would require
+	 * submitting a page that is not aligned to the logical block size,
+	 * then we take a copy into a linear buffer and submit that instead.
+	 * This should be impossible on a 512b LBS, and fairly rare on 4K,
+	 * usually requiring abnormally-small data blocks (eg gang blocks)
+	 * mixed into the same ABD as larger ones (eg aggregated).
+	 */
+	abd_t *abd = zio->io_abd;
+	if (!vdev_disk_check_pages(abd, zio->io_size, bdev)) {
+		void *buf;
+		if (zio->io_type == ZIO_TYPE_READ)
+			buf = abd_borrow_buf(zio->io_abd, zio->io_size);
+		else
+			buf = abd_borrow_buf_copy(zio->io_abd, zio->io_size);
+
+		/*
+		 * Wrap the copy in an abd_t, so we can use the same iterators
+		 * to count and fill the vbio later.
+		 */
+		abd = abd_get_from_buf(buf, zio->io_size);
+
+		/*
+		 * False here would mean the borrowed copy has an invalid
+		 * alignment too, which would mean we've somehow been passed a
+		 * linear ABD with an interior page that has a non-zero offset
+		 * or a size not a multiple of PAGE_SIZE. This is not possible.
+		 * It would mean either zio_buf_alloc() or its underlying
+		 * allocators have done something extremely strange, or our
+		 * math in vdev_disk_check_pages() is wrong. In either case,
+		 * something is seriously wrong and it's not safe to continue.
+		 */
+		VERIFY(vdev_disk_check_pages(abd, zio->io_size, bdev));
+	}
+
+	/* Allocate vbio, with a pointer to the borrowed ABD if necessary */
+	int error = 0;
+	vbio_t *vbio = vbio_alloc(zio, bdev);
+	if (abd != zio->io_abd)
+		vbio->vbio_abd = abd;
+
+	/* Fill it with pages */
+	error = abd_iterate_page_func(abd, 0, zio->io_size,
+	    vdev_disk_fill_vbio_cb, vbio);
+	if (error != 0) {
+		vbio_free(vbio);
+		return (error);
+	}
+
+	vbio_submit(vbio, flags);
+	return (0);
+}
+
 /* ========== */
 
 /*
- * This is the classic, battle-tested BIO submission code.
+ * This is the classic, battle-tested BIO submission code. Until we're totally
+ * sure that the new code is safe and correct in all cases, this will remain
+ * available and can be enabled by setting zfs_vdev_disk_classic=1 at module
+ * load time.
  *
  * These functions have been renamed to vdev_classic_* to make it clear what
  * they belong to, but their implementations are unchanged.
@@ -1116,7 +1547,8 @@ vdev_disk_init(spa_t *spa, nvlist_t *nv, void **tsd)
 	(void) tsd;
 
 	if (vdev_disk_io_rw_fn == NULL)
-		vdev_disk_io_rw_fn = vdev_classic_physio;
+		/* XXX make configurable */
+		vdev_disk_io_rw_fn = 0 ? vdev_classic_physio : vdev_disk_io_rw;
 
 	return (0);
 }
@@ -1215,3 +1647,6 @@ ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, open_timeout_ms, UINT, ZMOD_RW,
 
 ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, failfast_mask, UINT, ZMOD_RW,
 	"Defines failfast mask: 1 - device, 2 - transport, 4 - driver");
+
+ZFS_MODULE_PARAM(zfs_vdev_disk, zfs_vdev_disk_, max_segs, UINT, ZMOD_RW,
+	"Maximum number of data segments to add to an IO request (min 4)");