John Garry [Wed, 1 Jul 2020 08:06:25 +0000 (16:06 +0800)]
sbitmap: Consider cleared bits in sbitmap_bitmap_show()
sbitmap works by maintaining separate bitmaps of set and cleared bits.
The set bits are cleared in a batch, to save the burden of continuously
locking the "word" map to unset.
sbitmap_bitmap_show() only shows the set bits (in "word"), which is not
too much use, so mask out the cleared bits.
Fixes: ea86ea2cdced ("sbitmap: ammortize cost of clearing bits") Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: remove the bd_block_size field from struct block_device
We can trivially calculate the block size from the inodes i_blkbits
variable. Use that instead of keeping two redundant copies of the
information in slightly different formats.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The loop to increase the initial block size doesn't really make any
sense, as the AND operation won't match for powers of two if it didn't
for the initial block size.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bd_block_size contains a value that matches the logic block size when
opening, so the statement is redundant. Even if it wasn't the dumb
assignment would cause a a mismatch with bd_inode->i_blkbits.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hongnan Li [Wed, 1 Jul 2020 08:09:38 +0000 (16:09 +0800)]
blk-iolatency: only call ktime_get() if needed
ktime_to_ns(ktime_get()), which is expensive, does not need to be called
if blk_iolatency_enabled() return false in blkcg_iolatency_done_bio().
Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when
blk_iolatency is disabled.
Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: shortcut __submit_bio_noacct for blk-mq drivers
For blk-mq drivers bios can only be inserted for the same queue. So
bypass the complicated sorting logic in __submit_bio_noacct with
a blk-mq simpler submission helper.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Split out a __submit_bio_noacct helper for the actual de-recursion
algorithm, and simplify the loop by using a continue when we can't
enter the queue for a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: rename generic_make_request to submit_bio_noacct
generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector. Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does). Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hou Tao [Mon, 27 Apr 2020 13:12:50 +0000 (21:12 +0800)]
blk-mq: remove pointless call of list_entry_rq() in hctx_show_busy_rq()
Just use rq directly, the usage of list_entry_rq() doesn't make any
sense.
Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 14:03:56 +0000 (22:03 +0800)]
blk-mq: move blk_mq_put_driver_tag() into blk-mq.c
It is used by blk-mq.c only, so move it to the source file.
Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 14:03:55 +0000 (22:03 +0800)]
blk-mq: move blk_mq_get_driver_tag into blk-mq.c
blk_mq_get_driver_tag() is only used by blk-mq.c and is supposed to
stay in blk-mq.c, so move it and preparing for cleanup code of
get/put driver tag.
Meantime hctx_may_queue() is moved to header file and it is fine
since it is defined as inline always.
No functional change.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:25:01 +0000 (18:25 +0800)]
blk-mq: support batching dispatch in case of io
More and more drivers want to get batching requests queued from
block layer, such as mmc, and tcp based storage drivers. Also
current in-tree users have virtio-scsi, virtio-blk and nvme.
For none, we already support batching dispatch.
But for io scheduler, every time we just take one request from scheduler
and pass the single request to blk_mq_dispatch_rq_list(). This way makes
batching dispatch not possible when io scheduler is applied. One reason
is that we don't want to hurt sequential IO performance, becasue IO
merge chance is reduced if more requests are dequeued from scheduler
queue.
Try to support batching dispatch for io scheduler by starting with the
following simple approach:
1) still make sure we can get budget before dequeueing request
2) use hctx->dispatch_busy to evaluate if queue is busy, if it is busy
we fackback to non-batching dispatch, otherwise dequeue as many as
possible requests from scheduler, and pass them to blk_mq_dispatch_rq_list().
Wrt. 2), we use similar policy for none, and turns out that SCSI SSD
performance got improved much.
In future, maybe we can develop more intelligent algorithem for batching
dispatch.
Baolin has tested this patch and found that MMC performance is improved[3].
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:25:00 +0000 (18:25 +0800)]
blk-mq: pass obtained budget count to blk_mq_dispatch_rq_list
Pass obtained budget count to blk_mq_dispatch_rq_list(), and prepare
for supporting fully batching submission.
With the obtained budget count, it is easier to put extra budgets
in case of .queue_rq failure.
Meantime remove the old 'got_budget' parameter.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:24:59 +0000 (18:24 +0800)]
blk-mq: remove dead check from blk_mq_dispatch_rq_list
When BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE is returned from
.queue_rq, the 'list' variable always holds this rq which isn't
queued to LLD successfully.
So blk_mq_dispatch_rq_list() always returns false from the branch
of '!list_empty(list)'.
No functional change.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:24:58 +0000 (18:24 +0800)]
blk-mq: move getting driver tag and budget into one helper
Move code for getting driver tag and budget into one helper, so
blk_mq_dispatch_rq_list gets a bit simplified, and easier to read.
Meantime move updating of 'no_tag' and 'no_budget_available' into
the branch for handling partial dispatch because that is exactly
consumer of the two local variables.
Also rename the parameter of 'got_budget' as 'ask_budget'.
No functional change.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:24:57 +0000 (18:24 +0800)]
blk-mq: pass hctx to blk_mq_dispatch_rq_list
All requests in the 'list' of blk_mq_dispatch_rq_list belong to same
hctx, so it is better to pass hctx instead of request queue, because
blk-mq's dispatch target is hctx instead of request queue.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 30 Jun 2020 10:24:56 +0000 (18:24 +0800)]
blk-mq: pass request queue into get/put budget callback
blk-mq budget is abstract from scsi's device queue depth, and it is
always per-request-queue instead of hctx.
It can be quite absurd to get a budget from one hctx, then dequeue a
request from scheduler queue, and this request may not belong to this
hctx, at least for bfq and deadline.
So fix the mess and always pass request queue to get/put budget
callback.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Baolin Wang <baolin.wang7@gmail.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Baolin Wang <baolin.wang7@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg_bio_issue_check is a giant inline function that does three entirely
different things. Factor out the blk-cgroup related bio initalization
into a new helper, and the open code the sequence in the only caller,
relying on the fact that all the actual functionality is stubbed out for
non-cgroup builds.
block: bypass blkg_tryget_closest for the root_blkg
The root_blkg is only torn down at the very end of removing a queue.
So in the I/O submission path is always has a life reference and we
can just grab another one using blkg_get instead of doing a tryget
and parent walk that won't lead anywhere.
block: move bio_associate_blkg_from_page to mm/page_io.c
bio_associate_blkg_from_page is a special purpose helper for swap bios
that doesn't need access to bio internals. Move it to the swap code
instead of having it in bio.c.
Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: really clone the block cgroup in bio_clone_blkg_association
bio_clone_blkg_association is supposed to clone the associatation, but
actually ends up doing a search with a tryget. As we know we have a
reference on the source cgroup just get an unconditional additional
reference to it and call it a day. That also removes the need for
a RCU critical section.
Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
dm: use bio_uninit instead of bio_disassociate_blkg
bio_uninit is the proper API to clean up a BIO that has been allocated
on stack or inside a structure that doesn't come from the BIO allocator.
Switch dm to use that instead of bio_disassociate_blkg, which really is
an implementation detail. Note that the bio_uninit calls are also moved
to the two callers of __send_empty_flush, so that they better pair with
the bio_init calls used to initialize them.
Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Wed, 17 Jun 2020 13:58:23 +0000 (15:58 +0200)]
blktrace: Provide event for request merging
Currently blk-mq does not report any event when two requests get merged
in the elevator. This then results in difficult to understand sequence
of events like:
and you can only guess (after quite some headscratching since the above
excerpt is intermixed with a lot of other IO) that request 215023544+56
got merged to request 215023504+40. Provide proper event for request
merging like we used to do in the legacy block layer.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: reduce ifdef CONFIG_BLOCK madness in headers
Large part of bio.h, blkdev.h and genhd.h are under ifdef CONFIG_BLOCK
for no good reason. Only stub out function that are called from
code that is not dependent on CONFIG_BLOCK and leave the harmless
other declarations around.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:30 +0000 (20:47 +0000)]
block: create the request_queue debugfs_dir on registration
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.
Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:29 +0000 (20:47 +0000)]
blktrace: ensure our debugfs dir exists
We make an assumption that a debugfs directory exists, but since
this can fail ensure it exists before allowing blktrace setup to
complete. Otherwise we end up stuffing blktrace files on the debugfs
root directory. In the worst case scenario this *in theory* can create
an eventual panic *iff* in the future a similarly named file is created
prior on the debugfs root directory. This theoretical crash can happen
due to a recursive removal followed by a specific dentry removal.
This doesn't fix any known crash, however I have seen the files
go into the main debugfs root directory in cases where the debugfs
directory was not created due to other internal bugs with blktrace
now fixed.
blktrace is also completely useless without this directory, so
this ensures to userspace we only setup blktrace if the kernel
can stuff files where they are supposed to go into.
debugfs directory creations typically aren't checked for, and we have
maintainers doing sweep removals of these checks, but since we need this
check to ensure proper userspace blktrace functionality we make sure
to annotate the justification for the check.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:28 +0000 (20:47 +0000)]
blktrace: fix debugfs use after free
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.
We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.
We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].
We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:
debugfs: Directory 'loop0' with parent 'block' already present
This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.
We don't create the block debugfs directory on init for these types of
block devices:
* request-based block driver block devices
* every possible partition
* scsi-generic
And so, this race should in theory only be possible with make_request
drivers.
We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.
This goes tested with:
o nvme partitions
o ISCSI with tgt, and blktracing against scsi-generic with:
o block
o tape
o cdrom
o media changer
o blktests
This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.
Fixes: 6ac93117ab00 ("blktrace: use existing disk debugfs directory") Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: yu kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:27 +0000 (20:47 +0000)]
loop: be paranoid on exit and prevent new additions / removals
Be pedantic on removal as well and hold the mutex.
This should prevent uses of addition while we exit.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:25 +0000 (20:47 +0000)]
block: revert back to synchronous request_queue removal
Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.
blk_put_queue() decrements the refcount for the request_queue kobject, and
upon reaching 0 blk_release_queue() is called. Although blk_exit_rl() is
now removed through commit db6d99523560 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within
blk_release_queue() context.
The last reference for the request_queue must not be called from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as
possible so we can avoid future issues, and with the hopes that the
synchronous request_queue removal sticks.
We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.
Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.
While at it, update the docs with the context expectations for the
request_queue / gendisk refcount decrement, and make these
expectations explicit by using might_sleep().
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression") Suggested-by: Nicolai Stange <nstange@suse.de> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Omar Sandoval <osandov@fb.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Nicolai Stange <nstange@suse.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: yu kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Luis Chamberlain [Fri, 19 Jun 2020 20:47:24 +0000 (20:47 +0000)]
block: clarify context for refcount increment helpers
Let us clarify the context under which the helpers to increment the
refcount for the gendisk and request_queue can be called under. We
make this explicit on the places where we may sleep with might_sleep().
We don't address the decrement context yet, as that needs some extra
work and fixes, but will be addressed in the next patch.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq: add a new blk_mq_complete_request_remote API
This is a variant of blk_mq_complete_request_remote that only completes
the request if it needs to be bounced to another CPU or a softirq. If
the request can be completed locally the function returns false and lets
the driver complete it without requring and indirect function call.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq: remove the get_cpu/put_cpu pair in blk_mq_complete_request
We don't really care if we get migrated during the I/O completion.
In the worth case we either perform an IPI that wasn't required, or
complete the request on a CPU which we just migrated off.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk-mq: merge the softirq vs non-softirq IPI logic
Both the softirq path for single queue devices and the multi-queue
completion handler share the same logic to figure out if we need an
IPI for the completion and eventually issue it. Merge the two
versions into a single unified code path.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Even for single queue devices there is no point in offloading a polled
completion to the softirq, given that blk_mq_force_complete_rq is called
from the polling thread in that case and thus there are no starvation
issues.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
By open coding raise_blk_irq in the only caller, and replacing the
ifdef CONFIG_SMP with an IS_ENABLED check the flow in the caller
can be significantly simplified.
Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
__blk_complete_request is only called from the blk-mq code, and
duplicates a lot of code from blk-mq.c. Move it there to prepare
for better code sharing and simplifications.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 21 Jun 2020 22:41:24 +0000 (15:41 -0700)]
Merge tag 'selinux-pr-20200621' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux fixes from Paul Moore:
"Three small patches to fix problems in the SELinux code, all found via
clang.
Two patches fix potential double-free conditions and one fixes an
undefined return value"
* tag 'selinux-pr-20200621' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: fix undefined return of cond_evaluate_expr
selinux: fix a double free in cond_read_node()/cond_read_list()
selinux: fix double free
Linus Torvalds [Sun, 21 Jun 2020 20:04:57 +0000 (13:04 -0700)]
Merge tag 'pinctrl-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Some early fixes collected during the first week after the merge
window, all pretty self-evident, with the details below. The revert is
the crucial thing.
- Fix a warning on the Qualcomm SPMI GPIO chip being instatiated
twice without a unique irqchip struct
- Use the noirq variants of the suspend and resume callbacks in the
Tegra driver
- Clean up the errorpath on the MCP23s08 driver
- Revert the use of devm_of_iomap() in the Freescale driver as it was
regressing the platform
- Add some missing pins in the Qualcomm IPQ6018 driver
- Fix a simple documentation bug in the pinctrl-single driver"
* tag 'pinctrl-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: single: fix function name in documentation
pinctrl: qcom: ipq6018 Add missing pins in qpic pin group
Revert "pinctrl: freescale: imx: Use 'devm_of_iomap()' to avoid a resource leak in case of error in 'imx_pinctrl_probe()'"
pinctrl: mcp23s08: Split to three parts: fix ptr_ret.cocci warnings
pinctrl: tegra: Use noirq suspend/resume callbacks
pinctrl: qcom: spmi-gpio: fix warning about irq chip reusage
Linus Torvalds [Sun, 21 Jun 2020 19:44:52 +0000 (12:44 -0700)]
Merge tag 'kbuild-fixes-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- fix -gz=zlib compiler option test for CONFIG_DEBUG_INFO_COMPRESSED
- improve cc-option in scripts/Kbuild.include to clean up temp files
- improve cc-option in scripts/Kconfig.include for more reliable
compile option test
- do not copy modules.builtin by 'make install' because it would break
existing systems
- use 'userprogs' syntax for watch_queue sample
* tag 'kbuild-fixes-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
samples: watch_queue: build sample program for target architecture
Revert "Makefile: install modules.builtin even if CONFIG_MODULES=n"
scripts: Fix typo in headers_install.sh
kconfig: unify cc-option and as-option
kbuild: improve cc-option to clean up all temporary files
Makefile: Improve compressed debug info support detection
Linus Torvalds [Sun, 21 Jun 2020 17:02:53 +0000 (10:02 -0700)]
Merge tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- One fix for the interrupt rework we did last release which broke
KVM-PR
- Three commits fixing some fallout from the READ_ONCE() changes
interacting badly with our 8xx 16K pages support, which uses a pte_t
that is a structure of 4 actual PTEs
- A cleanup of the 8xx pte_update() to use the newly added pmd_off()
- A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
enabled
- A minor fix for the SPU syscall generation
Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.
* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/8xx: Provide ptep_get() with 16k pages
mm: Allow arches to provide ptep_get()
mm/gup: Use huge_ptep_get() in gup_hugepte()
powerpc/syscalls: Use the number when building SPU syscall table
powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
powerpc/64s: Fix KVM interrupt using wrong save area
powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL
Masahiro Yamada [Wed, 17 Jun 2020 02:08:38 +0000 (11:08 +0900)]
samples: watch_queue: build sample program for target architecture
This userspace program includes UAPI headers exported to usr/include/.
'make headers' always works for the target architecture (i.e. the same
architecture as the kernel), so the sample program should be built for
the target as well. Kbuild now supports 'userprogs' for that.
I also guarded the CONFIG option by 'depends on CC_CAN_LINK' because
$(CC) may not provide libc.
Masahiro Yamada [Fri, 19 Jun 2020 15:09:55 +0000 (00:09 +0900)]
Revert "Makefile: install modules.builtin even if CONFIG_MODULES=n"
This reverts commit e0b250b57dcf403529081e5898a9de717f96b76b,
which broke build systems that need to install files to a certain
path, but do not set INSTALL_MOD_PATH when invoking 'make install'.
While modules.builtin is useful also for CONFIG_MODULES=n, this change
in the behavior is quite unexpected. Maybe "make modules_install"
can install modules.builtin irrespective of CONFIG_MODULES as Jonas
originally suggested.
Anyway, that commit should be reverted ASAP.
Reported-by: Douglas Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Jonas Karlman <jonas@kwiboo.se> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net>