Milan Broz [Wed, 9 Aug 2017 15:47:26 +0000 (17:47 +0200)]
bio-integrity: Fix regression if profile verify_fn is NULL
In the dm-integrity target we register an integrity profile that has
both generate_fn and verify_fn callbacks set to NULL.
This is used if dm-integrity is stacked under a dm-crypt device
for authenticated encryption (integrity payload contains authentication
tag and IV seed).
In this case the verification is done through dm-crypt's own crypto API
processing; the integrity profile is only the holder
of this data. (And the memory is owned by dm-crypt as well.)
After the commit (and previous changes)
Commit 7c20f11680a441df09de7235206f70115fbf6290
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jul 3 16:58:43 2017 -0600
block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer always remaps partitions before calling into the
->make_request methods of drivers. Thus the call to get_start_sect in
in_chunk_boundary will always return 0 and can be removed.
Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs: index check-integrity state hash by a dev_t
We won't have the struct block_device available in the bio soon, so switch
to the numerical dev_t instead of the block_device pointer for looking up
the check-integrity state.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 23 Aug 2017 17:56:33 +0000 (10:56 -0700)]
skd: Change default interrupt mode to MSI-X
Since MSI support on some motherboards is unreliable, change the
default interrupt mode from MSI to MSI-X. This patch prevents
the following message from appearing sporadically in the kernel logs of
my test setup:
do_IRQ: 3.193 No irq handler for vector
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 23 Aug 2017 17:56:32 +0000 (10:56 -0700)]
skd: Avoid double completions in case of a timeout
Avoid that normal request completion and the timeout handler can
run concurrently by calling blk_mq_complete_request() instead of
blk_mq_end_request() from skd_end_request(). Avoid that the block
layer can reuse a request while the firmware is still processing
it. Convert skd_softirq_done() to blk-mq. Pass the pointer to
skd_softirq_done() to the block layer core through
blk_mq_ops.complete instead of by calling blk_queue_softirq_done().
Pass the pointer to skd_timed_out() to the block layer core
through blk_mq_ops.timeout instead of by calling
blk_queue_timed_out(). The timeout handler has been tested as
follows:
Bart Van Assche [Wed, 23 Aug 2017 17:56:29 +0000 (10:56 -0700)]
block: Warn if blk_queue_rq_timed_out() is called for a blk-mq queue
The timeout handler set by blk_queue_rq_timed_out() is only used
in single queue mode. Calling this function for blk-mq drivers is
wrong. Hence issue a warning if this function is called by a blk-mq
driver.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Mon, 14 Aug 2017 22:05:00 +0000 (15:05 -0700)]
nullb: badbblocks support
Sometimes a disk has broken tracks whose data is inaccessible, while
data on the rest of the disk can still be accessed normally. MD RAID
supports such disks, but we don't have a good way to test that support,
because we can't control which part of a physical disk is bad. For a
virtual disk, this can be easily controlled.
This patch adds a new 'badblock' attribute. Configure it this way:
echo "+1-100" > xxx/badblock marks sectors [1-100] as bad
blocks.
echo "-20-30" > xxx/badblock marks sectors [20-30] as good again.
If bad blocks are accessed, the nullb disk returns an I/O error. Other
parts of the disk can be accessed normally.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
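As an illustration of the 'badblock' command format described above, here is a minimal userspace model in Python. It is not the driver's code: the function names are hypothetical, the set-of-sectors representation is chosen only for brevity, and treating both ends of a range as inclusive is an assumption based on the "[1-100]" notation.

```python
def apply_badblock_command(bad_sectors, command):
    """Apply one '+start-end' or '-start-end' command to a set of bad sectors."""
    op, body = command[0], command[1:]
    if op not in ("+", "-"):
        raise ValueError(f"command must start with '+' or '-': {command!r}")
    start_text, end_text = body.split("-", 1)
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"empty range: {command!r}")
    sectors = range(start, end + 1)  # assumption: both ends are inclusive
    if op == "+":
        bad_sectors.update(sectors)             # mark the range as bad
    else:
        bad_sectors.difference_update(sectors)  # mark the range as good again
    return bad_sectors


def read_sector(bad_sectors, sector):
    """Model of a read: 'EIO' for a bad sector, 'ok' otherwise."""
    return "EIO" if sector in bad_sectors else "ok"
```

Applying "+1-100" followed by "-20-30" to an empty set leaves sectors 1-19 and 31-100 bad in this model, so a read of sector 25 succeeds while a read of sector 50 fails.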
Shaohua Li [Mon, 14 Aug 2017 22:04:59 +0000 (15:04 -0700)]
nullb: emulate cache
Software must flush the disk cache to guarantee data safety. To check
whether software flushes the cache correctly, we must know the behavior
of the disk. But physical disk behavior is uncontrollable: even if
software doesn't do the flush, the disk probably does. This patch
emulates a cache in the test disk.
All writes go to a cache first. When the cache is full, we flush some
data to disk storage. A flush request flushes all data in the cache to
disk storage. A FUA write goes to the memory store directly and
revalidates the data in the cache. If there is a power failure (triggered
by writing to the power attribute, 'echo 0 > disk_name/power'), we discard
all data in the cache but preserve the data in disk storage. Later we can
power on the disk again as usual (write 1 to the 'power' attribute) and
check data integrity to verify that software does everything correctly.
A new attribute 'cache_size' (in MB) is added to configure cache size.
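The cache semantics described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not the driver's implementation: the class and method names are hypothetical, capacity is counted in sectors rather than MB, eviction picks the oldest cached sector, and a FUA write is modeled as dropping any stale cached copy of that sector.

```python
class CachedDisk:
    """Toy model of a disk with a volatile write cache."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity  # assumption: counted in sectors
        self.cache = {}  # sector -> data, lost on power failure
        self.store = {}  # sector -> data, survives power failure

    def write(self, sector, data, fua=False):
        if fua:
            # FUA: write straight to the backing store, drop any cached copy.
            self.store[sector] = data
            self.cache.pop(sector, None)
            return
        self.cache[sector] = data
        while len(self.cache) > self.cache_capacity:
            # Cache is full: push the oldest cached sector to the store.
            oldest = next(iter(self.cache))
            self.store[oldest] = self.cache.pop(oldest)

    def flush(self):
        """Flush request: move everything in the cache to the store."""
        self.store.update(self.cache)
        self.cache.clear()

    def read(self, sector):
        if sector in self.cache:
            return self.cache[sector]
        return self.store.get(sector)

    def power_cycle(self):
        """Power failure followed by power on: the cache content is lost."""
        self.cache.clear()
```

In this model a plain write followed by power_cycle() is lost, while a write followed by flush(), or a write with fua=True, survives. That is the property a test of flush handling would check.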
Shaohua Li [Mon, 14 Aug 2017 22:04:58 +0000 (15:04 -0700)]
nullb: bandwidth control
In tests, we usually want controllable disk speed. For example, in a
RAID array, we'd like some disks to be fast and some to be slow. MD RAID
actually has a feature for this. To test the feature, we'd like to make
the disk run at a specific speed.
Block throttling could probably be used for this purpose, but it requires
cgroup setup. Here we just implement a simple throttling mechanism in
the driver. There is slight fluctuation in the mechanism, but it's good
enough for testing.
To configure the bandwidth cap, the user sets the 'mbps' attribute, which is
in MB/s.
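To give a feel for what a bandwidth cap means for request timing, here is a rough Python calculation. It is not the driver's throttling mechanism, which is not described in this log. The function name is hypothetical, requests are assumed to be served one after another, and 1 MB is assumed to be 1024 * 1024 bytes.

```python
def completion_times(request_sizes_bytes, mbps):
    """Earliest completion time in seconds for each request under the cap."""
    bytes_per_second = mbps * 1024 * 1024  # assumption: 1 MB = 2**20 bytes
    elapsed = 0.0
    times = []
    for size in request_sizes_bytes:
        elapsed += size / bytes_per_second  # time this request needs at the cap
        times.append(elapsed)
    return times
```

With a cap of 2 MB/s, four 1 MiB requests cannot complete earlier than 0.5, 1.0, 1.5 and 2.0 seconds in this model.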
Shaohua Li [Mon, 14 Aug 2017 22:04:54 +0000 (15:04 -0700)]
nullb: add interface to power on disk
A device created through the nullb configfs interface isn't powered on by
default. After configuring the device, the user can run 'echo 1 >
xxx/nullb/device_name/power' to power on the device, which creates a
disk. xxx/nullb/device_name/index is the disk index, so if the index
is 2, the newly created disk is named /dev/nullb2. Note that
'index' is only valid after the disk is powered on.
'echo 0 > xxx/nullb/device_name/power' removes the disk. Note that this
doesn't remove the device. To remove the device, the user should run 'rmdir
xxx/nullb/device_name'. Removing the device removes the disk too.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
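The two-phase sequence described above (create the device, set attributes, then power on) could be scripted roughly as follows. This is a sketch only: the helper names are hypothetical, and the configfs root (for example /mnt/nullb) is passed in as a parameter so the code can also be tried against an ordinary directory. In an ordinary directory no 'index' file appears, because that file is provided by the kernel.

```python
from pathlib import Path


def create_and_power_on(nullb_root, name, attributes):
    """mkdir the device, write its attributes, then write 1 to 'power'."""
    device_dir = Path(nullb_root) / name
    device_dir.mkdir()  # equivalent of 'mkdir xxx/nullb/<name>'
    for attribute, value in attributes.items():
        (device_dir / attribute).write_text(f"{value}\n")
    (device_dir / "power").write_text("1\n")  # powering on creates the disk
    return device_dir


def disk_path(device_dir):
    """Map the 'index' attribute to a /dev/nullbN name, if index is present."""
    index_file = Path(device_dir) / "index"
    if not index_file.exists():
        return None
    return f"/dev/nullb{index_file.read_text().strip()}"
```

For example, create_and_power_on("/mnt/nullb", "a", {"size": 1024}) followed by disk_path(...) would return "/dev/nullb2" if the kernel reports index 2.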
Shaohua Li [Mon, 14 Aug 2017 22:04:53 +0000 (15:04 -0700)]
nullb: add configfs interface
Add a configfs interface for nullb. The configfs interface is more flexible
and makes it easy to configure on a per-disk basis.
Configuration is something like this:
mount -t configfs none /mnt
Checking which features the driver supports:
cat /mnt/nullb/features
The 'features' attribute is for future extension. We probably will add
new features into the driver, userspace can check this attribute to find
the supported features.
Create/remove a device:
mkdir/rmdir /mnt/nullb/a
Then configure the device by setting attributes under /mnt/nullb/a. Most
of the module parameters nullb supports are converted to attributes:
size; /* device size in MB */
completion_nsec; /* time in ns to complete a request */
submit_queues; /* number of submission queues */
home_node; /* home node for the device */
queue_mode; /* block interface */
blocksize; /* block size */
irqmode; /* IRQ completion handler */
hw_queue_depth; /* queue depth */
use_lightnvm; /* register as a LightNVM device */
blocking; /* blocking blk-mq device */
use_per_node_hctx; /* use per-node allocation for hardware context */
Note that creating a device doesn't create a disk immediately. Creating a
disk is done in two phases: create a device and then power on the
device. The next patch introduces device power on.
Shaohua Li [Mon, 14 Aug 2017 22:04:52 +0000 (15:04 -0700)]
nullb: factor disk parameters
When we switch to the configfs interface, each disk can have a different
configuration. To prepare for the change, we move most disk settings to a
separate data structure. The existing module parameter interface is
kept. 'nr_devices' and 'shared_tags' don't make sense as per-disk
settings, so they remain global settings.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Wed, 23 Aug 2017 11:20:57 +0000 (14:20 +0300)]
skd: error pointer dereference in skd_cons_disk()
My initial impulse was to check for IS_ERR_OR_NULL() but when I looked
at this code a bit more closely, we should only need to check for
IS_ERR().
The blk_mq_alloc_tag_set() returns negative error codes and zero on
success so we can just do an "if (rc) goto err_out;". It's better to
preserve the error code anyhow. The blk_mq_init_queue() returns error
pointers on failure, it never returns NULL. We can also remove the
"q = NULL;" at the start because that's no longer needed.
Fixes: ca33dd92968b ("skd: Convert to blk-mq") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
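Why a NULL check cannot catch an error pointer can be shown with a small Python model of the kernel's error-pointer encoding. The sketch assumes a 64-bit unsigned long and uses the kernel's MAX_ERRNO value of 4095. The Python functions only mimic the C macros of the same names and are not kernel code.

```python
MAX_ERRNO = 4095
ULONG_MASK = (1 << 64) - 1  # assumption: 64-bit unsigned long


def ERR_PTR(error):
    """Encode a negative errno as a pointer value (two's complement)."""
    return error & ULONG_MASK


def IS_ERR(ptr):
    """True if ptr lies in the top MAX_ERRNO values of the address space."""
    return ptr >= ((-MAX_ERRNO) & ULONG_MASK)


def IS_ERR_OR_NULL(ptr):
    return ptr == 0 or IS_ERR(ptr)


ENOMEM = 12
failed_queue = ERR_PTR(-ENOMEM)  # what a failing allocation would hand back
```

Here failed_queue is non-zero, so a test such as "if (!q)" would treat it as a valid queue, while IS_ERR() recognizes it. Since blk_mq_init_queue() never returns NULL, IS_ERR() alone is sufficient.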
Bart Van Assche [Thu, 17 Aug 2017 20:13:38 +0000 (13:13 -0700)]
skd: Bump driver version
Bump the driver version. Remove the build ID because build IDs do
not make sense for an upstream kernel driver. Keep the driver
version in the module information but do not report it during every
load, unload or probe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:35 +0000 (13:13 -0700)]
skd: Reduce memory usage
Every single coherent DMA memory buffer occupies at least one page.
Reduce memory usage by switching from coherent buffers to streaming
DMA for I/O requests (struct skd_fitmsg_context) and S/G-lists
(struct fit_sg_descriptor[]).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
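The cost of "at least one page per coherent buffer" can be estimated with simple arithmetic. The Python sketch below is only a back-of-the-envelope model: the function names are hypothetical, the 4096-byte page size is an assumption, and the buffer size and count in the usage note are made-up numbers, not values taken from the skd driver. Allocator overhead for the non-coherent case is ignored.

```python
def coherent_footprint(buffer_size, count, page_size=4096):
    """Bytes used when every buffer is rounded up to whole pages."""
    pages_per_buffer = -(-buffer_size // page_size)  # ceiling division
    return count * pages_per_buffer * page_size


def packed_footprint(buffer_size, count):
    """Bytes used when buffers are only as large as their payload."""
    return count * buffer_size
```

For 64 hypothetical buffers of 512 bytes each, the page-rounded total is 262144 bytes versus 32768 bytes of payload, an eightfold difference in this model.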
Bart Van Assche [Thu, 17 Aug 2017 20:13:34 +0000 (13:13 -0700)]
skd: Remove skd_device.in_flight
Since skd_device.in_flight is only used to display the number of
in-flight requests in debug messages, remove that member and
introduce skd_in_flight(). That last function relies on the block
layer to determine the number of in flight requests.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:32 +0000 (13:13 -0700)]
skd: Convert to blk-mq
Introduce a tag set and a blk_mq_ops structure. Set .cmd_size such
that struct request and struct skd_request_context are allocated
through a single allocation. Remove the skd_request_context.req
pointer. Make queue starting asynchronous such that this can occur
safely from interrupt context. Use locking to protect skdev->skmsg
and *skdev->skmsg against concurrent access from concurrent
.queue_rq() calls. Introduce the functions skd_init_request() and
skd_exit_request() to set up / clean up the per-request S/G-list.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:28 +0000 (13:13 -0700)]
skd: Introduce skd_process_request()
The only functional change in this patch is that the skd_fitmsg_context
in which requests are accumulated is changed from a local variable into
a member of struct skd_device. This patch will make the blk-mq conversion
easier.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:27 +0000 (13:13 -0700)]
skd: Convert several per-device scalar variables into atomics
Convert the per-device scalar variables that are protected by the
queue lock into atomics such that it becomes safe to access these
variables without holding the queue lock.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:26 +0000 (13:13 -0700)]
skd: Enable request tags for the block layer queue
Use the request tag when allocating a skd_fitmsg_context or
skd_request_context such that the lists used to track free elements
can be eliminated. Swap the skd_end_request() and skd_release_req()
calls to avoid triggering a use-after-free. Remove
skd_fitmsg_context.state and .outstanding because FIT messages are
shared among requests and because updating a FIT message after a
request has finished would trigger a use-after-free.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:25 +0000 (13:13 -0700)]
skd: Initialize skd_special_context.req.n_sg to one
The debug code in skd_send_special_fitmsg() assumes that req.n_sg
represents the number of S/G descriptors. However, skd_construct()
initializes that member variable to zero. Set req.n_sg to one such
that the debugging code in skd_send_special_fitmsg() works as
expected.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:22 +0000 (13:13 -0700)]
skd: Convert explicit skd_request_fn() calls
This will make it easier to convert this driver to the blk-mq
approach. This patch also reduces interrupt latency by moving
skd_request_fn() calls out of the skd_isr() interrupt.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:17 +0000 (13:13 -0700)]
skd: Remove superfluous occurrences of the 'volatile' keyword
mem_map[i] is accessed through readl() / writel() hence declaring
mem_map as volatile is not necessary.
Remove the volatile declarations from struct fit_completion_entry_v1
pointers and struct fit_comp_error_info since reading these structures
multiple times is safe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:13 +0000 (13:13 -0700)]
skd: Remove superfluous initializations from skd_isr_completion_posted()
The value of skcmp, cmp_cntxt etc. is overwritten during every
loop iteration and is not used after the loop has finished. Hence
initializing these variables outside the loop is not necessary.
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:05 +0000 (13:13 -0700)]
skd: Simplify the code for deciding whether or not to send a FIT msg
Due to the previous patch it is guaranteed that the FIT msg contains
at least one request after the for-loop has finished. Use this to
simplify the code.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:58 +0000 (13:12 -0700)]
skd: Switch from the pr_*() to the dev_*() logging functions
Use dev_err() and dev_info() instead of pr_err() and pr_info().
Since dev_dbg() is able to report file name and line number
information, remove __FILE__ and __LINE__ from the dev_dbg() calls.
Remove the struct skd_device members and the function (skd_name())
that became superfluous due to these changes.
This patch removes the device name and serial number from log
statements. An example of the old log line format:
(skd0:STM000196603:[0000:00:09.0]): Driver state STARTING(3)=>ONLINE(4)
An example of the new log line format:
skd:0000:00:09.0: Driver state STARTING(3)=>ONLINE(4)
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:57 +0000 (13:12 -0700)]
skd: Remove useless barrier() calls
The purpose of barrier() is to prevent reordering by the compiler.
Since the compiler does not reorder calls to non-pure functions,
remove the barrier() calls from skd_reg_{read,write}{32,64}().
Since pr_debug() is able to report file name and line number
information, remove __FILE__ and __LINE__ from the pr_debug() calls.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:55 +0000 (13:12 -0700)]
skd: Remove set-but-not-used local variables
These variables have been detected by building with W=1. Declare
'acc' as __maybe_unused because most access_ok() implementations
ignore their first argument. This patch does not change any
functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:54 +0000 (13:12 -0700)]
skd: Fix a function name in a comment
There is no function skd_completion_posted_isr() in the skd driver
but there is a function called skd_isr_completion_posted(). Fix
the function name in the comment.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:47 +0000 (13:12 -0700)]
skd: Switch to GPLv2
This change does not affect any skd driver version derived from a
dual licensed code base but makes all code derived from future
upstream skd driver versions GPLv2 only.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:12:46 +0000 (13:12 -0700)]
skd: Submit requests to firmware before triggering the doorbell
Ensure that the members of struct skd_msg_buf have been transferred
to the PCIe adapter before the doorbell is triggered. This patch
avoids that I/O fails sporadically and that the following error
message is reported:
Bart Van Assche [Thu, 17 Aug 2017 20:12:45 +0000 (13:12 -0700)]
skd: Avoid that module unloading triggers a use-after-free
Since put_disk() triggers a disk_release() call and since that
last function calls blk_put_queue() if disk->queue != NULL, clear
the disk->queue pointer before calling put_disk(). This avoids
that unloading the skd kernel module triggers the following
use-after-free:
Bart Van Assche [Thu, 17 Aug 2017 20:12:44 +0000 (13:12 -0700)]
block: Relax a check in blk_start_queue()
Calling blk_start_queue() from interrupt context with the queue
lock held and without disabling IRQs, as the skd driver does, is
safe. This patch avoids that loading the skd driver triggers the
following warning:
Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Andrew Morton <akpm@osdl.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 23:23:08 +0000 (16:23 -0700)]
virtio_blk: Use blk_rq_is_scsi()
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 23:23:07 +0000 (16:23 -0700)]
ide-floppy: Use blk_rq_is_scsi()
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Acked-by: David S. Miller <davem@davemloft.net> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: linux-ide@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 23:23:06 +0000 (16:23 -0700)]
genhd: Annotate all part and part_tbl pointer dereferences
Annotate gendisk.part_tbl and disk_part_tbl.part dereferences with
rcu_dereference_protected(). This patch does not change the behavior
of the modified code but ensures that sparse does not complain about
disk->part_tbl manipulations nor about part_tbl->part accesses.
Additionally, improve documentation of the locking requirements of
the modified functions.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 23:23:03 +0000 (16:23 -0700)]
blk-mq: Make blk_mq_reinit_tagset() calls easier to read
Since blk_mq_ops.reinit_request is only called from inside
blk_mq_reinit_tagset(), make this function pointer an argument of
blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
This patch does not change any functionality but makes
blk_mq_reinit_tagset() calls easier to read and to analyze.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: James Smart <james.smart@broadcom.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 23:23:00 +0000 (16:23 -0700)]
block: Fix two comments that refer to .queue_rq() return values
Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
.queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
value. Hence refer to the former in comments about .queue_rq()
return values.
Fixes: commit 39a70c76b89b ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 14 Aug 2017 18:56:16 +0000 (18:56 +0000)]
nbd: change the default nbd partitions
There's no reason to have partitions disabled for nbd by default, it costs us
nothing to have it enabled and is just confusing/obnoxious to users who try to
use partitions with nbd.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 14 Aug 2017 18:25:33 +0000 (18:25 +0000)]
nbd: allow device creation at a specific index
If users really want to use a particular index for their nbd device and it
doesn't already exist there's no reason we can't just create it for them. Do
this instead of erroring out.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Anton Volkov [Mon, 7 Aug 2017 12:37:50 +0000 (15:37 +0300)]
loop: fix to a race condition due to the early registration of device
The early device registration made possible a race leading to allocations
of disks with wrong minors.
This patch moves the device registration further down the loop_init
function to make the race infeasible.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Anton Volkov <avolkov@ispras.ru> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ritesh Harjani [Wed, 9 Aug 2017 12:58:32 +0000 (18:28 +0530)]
cfq: Give a chance for arming slice idle timer in case of group_idle
In the scenario below, blkio cgroups do not get bandwidth according to
their assigned weights:
1. The underlying device is nonrotational with a single HW queue
with a depth of >= CFQ_HW_QUEUE_MIN.
2. The use case forms two blkio cgroups, cg1 (weight 1000) and
cg2 (weight 100), with two processes (file1 and file2) doing sync IO in
their respective blkio cgroups.
fio results for the above use case (without this patch):
file1: (groupid=0, jobs=1): err= 0: pid=685: Thu Jan 1 19:41:49 1970
write: IOPS=1315, BW=41.1MiB/s (43.1MB/s)(1024MiB/24906msec)
<...>
file2: (groupid=0, jobs=1): err= 0: pid=686: Thu Jan 1 19:41:49 1970
write: IOPS=1295, BW=40.5MiB/s (42.5MB/s)(1024MiB/25293msec)
<...>
// The bandwidth of both processes is nearly equal even though they belong
to different cgroups with weights of 1000 (cg1) and 100 (cg2).
In the above case (for nonrotational NCQ devices),
as soon as the request from cg1 completes, and even
though cg1 is given a higher set_slice=10, the CFQ
algorithm expires this group when the driver tries to fetch the next
request, without providing any idle time or weight priority,
and schedules another cfq group (in this case cg2).
Thus both cfq groups (cg1 and cg2) keep alternating to get
disk time, and cgroup weight based scheduling is lost.
This patch gives the cfq algorithm (cfq_arm_slice_timer) a chance
to arm the slice timer when group_idle is enabled.
If group_idle is not required either (including for nonrotational
NCQ drives), group_idle = 0 must be set explicitly through sysfs
for such cases.
fio results for the above use case with this patch:
file1: (groupid=0, jobs=1): err= 0: pid=690: Thu Jan 1 00:06:08 1970
write: IOPS=1706, BW=53.3MiB/s (55.9MB/s)(1024MiB/19197msec)
<..>
file2: (groupid=0, jobs=1): err= 0: pid=691: Thu Jan 1 00:06:08 1970
write: IOPS=1043, BW=32.6MiB/s (34.2MB/s)(1024MiB/31401msec)
<..>
// Here the bandwidth of the processes follows their respective cgroup weights.
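The effect of the patch can be read off the fio figures quoted above by comparing bandwidth ratios. The short Python calculation below uses only the MiB/s values from those fio lines. The function and constant names are illustrative.

```python
def bandwidth_ratio(bw_cg1, bw_cg2):
    """Ratio of cg1 bandwidth to cg2 bandwidth."""
    return bw_cg1 / bw_cg2


WEIGHT_RATIO = 1000 / 100                       # cg1 weight : cg2 weight
RATIO_WITHOUT_PATCH = bandwidth_ratio(41.1, 40.5)  # file1 vs file2, MiB/s
RATIO_WITH_PATCH = bandwidth_ratio(53.3, 32.6)     # file1 vs file2, MiB/s
```

Without the patch the two processes get almost identical bandwidth (a ratio close to 1). With the patch cg1 gets roughly 1.6 times the bandwidth of cg2, so the split moves in the direction of the weights, although it remains far from the 10:1 weight ratio in this particular run.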
Paolo Valente [Fri, 4 Aug 2017 05:35:11 +0000 (07:35 +0200)]
block, bfq: boost throughput with flash-based non-queueing devices
When a queue associated with a process remains empty, there are cases
where throughput gets boosted if the device is idled to await the
arrival of a new I/O request for that queue. Currently, BFQ assumes
that one of these cases is when the device has no internal queueing
(regardless of the properties of the I/O being served). Unfortunately,
this condition has proved to be too general. So, this commit refines it
as "the device has no internal queueing and is rotational".
This refinement provides a significant throughput boost with random
I/O, on flash-based storage without internal queueing. For example, on
a HiKey board, throughput increases by up to 125%, growing, e.g., from
6.9MB/s to 15.6MB/s with two or three random readers in parallel.
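The refinement described above amounts to tightening one boolean condition. The Python sketch below states the old and new condition side by side and recomputes the quoted throughput gain. The function names are hypothetical, and the predicates cover only this one case, not BFQ's complete idling decision.

```python
def idling_case_old(has_internal_queueing, rotational):
    """Old condition: the device has no internal queueing."""
    return not has_internal_queueing


def idling_case_new(has_internal_queueing, rotational):
    """Refined condition: no internal queueing and the device is rotational."""
    return not has_internal_queueing and rotational


def percent_increase(old, new):
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100
```

For flash storage without internal queueing (has_internal_queueing=False, rotational=False) the old condition holds and the new one does not, which is the case the commit targets. The rounded endpoints quoted above, 6.9MB/s and 15.6MB/s, give an increase of roughly 126%, consistent with the "up to 125%" figure once rounding of the endpoints is taken into account.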
Paolo Valente [Fri, 4 Aug 2017 05:35:10 +0000 (07:35 +0200)]
block,bfq: refactor device-idling logic
The logic that decides whether to idle the device is scattered across
three functions. Almost all of the logic is in the function
bfq_bfqq_may_idle, but (1) part of the decision is made in
bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
switch off idling regardless of the output of bfq_bfqq_may_idle. In
addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
their decisions as a function of parameters that are used, for similar
purposes, also in bfq_bfqq_may_idle. This commit addresses these
issues by moving all the logic into bfq_bfqq_may_idle.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
We haven't used these in years, but somehow the definitions still
remained. Kill them, and renumber the QUEUE_FLAG_ space. We had
a hole in the beginning of the space, too.
Jens Axboe [Tue, 8 Aug 2017 23:53:33 +0000 (17:53 -0600)]
blk-mq: enable checking two part inflight counts at the same time
Modify blk_mq_in_flight() to count both a partition and root at
the same time. Then we only have to call it once, instead of
potentially looping the tags twice.
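Counting both values in a single pass over the busy tags can be modeled as follows. This Python sketch is an illustration, not the kernel function: the function name is hypothetical, and it assumes each busy request records the partition number it was issued to, with every busy request also counting as in flight on the whole disk.

```python
def count_in_flight(busy_request_parts, partno):
    """Return (in flight on partition partno, in flight on the whole disk)."""
    part_count = 0
    root_count = 0
    for request_partno in busy_request_parts:
        if request_partno == partno:
            part_count += 1
        root_count += 1  # every busy request is in flight on the root device
    return part_count, root_count
```

With busy requests issued to partitions [1, 1, 2, 0], a query for partition 1 yields (2, 4) after one iteration instead of two separate walks over the tags.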
Jens Axboe [Tue, 8 Aug 2017 23:51:45 +0000 (17:51 -0600)]
blk-mq: provide internal in-flight variant
We don't have to inc/dec some counter, since we can just
iterate the tags. That makes inc/dec a noop, but means we
have to iterate busy tags to get an in-flight count.