scsi: virtio_scsi: fix pi_bytes{out,in} on 4 KiB block size devices
When the underlying device is a 4 KiB logical block size device with a
protection interval exponent of 0, i.e. 4096 bytes data + 8 bytes PI, the
driver miscalculates the pi_bytes{out,in} by a factor of 8x (64 bytes).
This leads to errors on all reads and writes on 4 KiB logical block size
devices when CONFIG_BLK_DEV_INTEGRITY is enabled and the
VIRTIO_SCSI_F_T10_PI feature bit has been negotiated.
Fixes: e6dc783a38ec0 ("virtio-scsi: Enable DIF/DIX modes in SCSI host LLD") Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Edwards <gedwards@ddn.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block
Pull NVMe updates from Christoph:
"Highlights:
- massively improved tracepoints (Keith Busch)
- support for larger inline data in the RDMA host and target
(Steve Wise)
- RDMA setup/teardown path fixes and refactor (Sagi Grimberg)
- Command Supported and Effects log support for the NVMe target
(Chaitanya Kulkarni)
- buffered I/O support for the NVMe target (Chaitanya Kulkarni)
plus the usual set of cleanups and small enhancements."
* 'nvme-4.19' of git://git.infradead.org/nvme:
nvmet: don't use uuid_le type
nvmet: check fileio lba range access boundaries
nvmet: fix file discard return status
nvme-rdma: centralize admin/io queue teardown sequence
nvme-rdma: centralize controller setup sequence
nvme-rdma: unquiesce queues when deleting the controller
nvme-rdma: mark expected switch fall-through
nvme: add disk name to trace events
nvme: add controller name to trace events
nvme: use hw qid in trace events
nvme: cache struct nvme_ctrl reference to struct nvme_request
nvmet-rdma: add an error flow for post_recv failures
nvmet-rdma: add unlikely check in the fast path
nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
nvme-rdma: support up to 4 segments of inline data
nvmet: add buffered I/O support for file backed ns
nvmet: add commands supported and effects log page
nvme: move init of keep_alive work item to controller initialization
nvme.h: resync with nvme-cli
Fixes: 1e739730c5b9e ("block: optionally merge discontiguous discard bios into a single request") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
We immediately overwrite the biovec array, so instead just allocate
a new bio and copy over the disk, setor and size.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
bio_check_pages_dirty currently inviolates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be zero, and that is going to become
really annoying with multpath biovecs. Fortunately there isn't any
all that good reason for it - once we decide to defer freeing the bio
to a workqueue holding onto a few additional pages isn't really an
issue anymore. So just check if there is a clean page that needs
dirtying in the first path, and do a second pass to free them if there
was none, while the cache is still hot.
Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: Rename the null_blk_mod kernel module back into null_blk
Commit ca4b2a011948 ("null_blk: add zone support") breaks several
blktests scripts because it renamed the null_blk kernel module into
null_blk_mod. Hence rename null_blk_mod back into null_blk.
Fixes: ca4b2a011948 ("null_blk: add zone support") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Matias Bjorling <matias.bjorling@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andy Shevchenko [Tue, 17 Jul 2018 14:17:36 +0000 (17:17 +0300)]
nvmet: don't use uuid_le type
Don't use sizeof(uuid_le) where none of the parameters is type of uuid_le.
Since both arguments are u8 [16], use size of destination there.
Moreover, uuid_le is a deprecated type, and nvmet is using uuid_t
already.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Centralize controller sequence to a single routine that correctly cleans
up after failures instead of having multiple apperances in several flows
(create, reset, reconnect).
One thing that we also gain here are the sanity/boundary checks also
when connecting back to a dynamic controller.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme-rdma: unquiesce queues when deleting the controller
If the controller is going away, we need to unquiesce the IO queues so
that all pending request can fail gracefully before moving forward with
controller deletion. Do that before we destroy the IO queues so
blk_cleanup_queue won't block in freeze.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Fri, 29 Jun 2018 22:50:03 +0000 (16:50 -0600)]
nvme: add disk name to trace events
This will print the disk name to the nvme event trace for io requests so
a user can better distinguish traffic to different disks. This can be used
to create disk based filters. For example, to see only nvme0n2 traffic:
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: turned __assign_disk_name into an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Fri, 29 Jun 2018 22:50:01 +0000 (16:50 -0600)]
nvme: use hw qid in trace events
We can not match a command to its completion based on the command
id alone. We need the submitting queue identifier to pair with the
completion, so this patch adds that to the trace buffer.
This patch is also collapsing the admin and IO submission traces into a
single one so we don't need to duplicate this and creating unnecessary
code branches: we know if the command is an admin vs IO based on the qid.
And since we're here, the patch fixes code formatting in the area.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: move the qid helper to nvme.h and made it an inline function] Signed-off-by: Christoph Hellwig <hch@lst.de>
Max Gurtovoy [Sun, 1 Jul 2018 09:20:24 +0000 (12:20 +0300)]
nvmet-rdma: add an error flow for post_recv failures
Posting receive buffer operation can fail, thus we should make
sure to have an error flow during initialization phase. While
we're here, add a debug print in case of a failure.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Steve Wise [Wed, 20 Jun 2018 14:15:10 +0000 (07:15 -0700)]
nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
The patch enables inline data sizes using up to 4 recv sges, and capping
the size at 16KB or at least 1 page size. So on a 4K page system, up to
16KB is supported, and for a 64K page system 1 page of 64KB is supported.
We avoid > 0 order page allocations for the inline buffers by using
multiple recv sges, one for each page. If the device cannot support
the configured inline data size due to lack of enough recv sges, then
log a warning and reduce the inline size.
Add a new configfs port attribute, called param_inline_data_size,
to allow configuring the size of inline data for a given nvmf port.
The maximum size allowed is still enforced by nvmet-rdma with
NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is now max(16KB, PAGE_SIZE).
And the default size, if not specified via configfs, is still PAGE_SIZE.
This preserves the existing behavior, but allows larger inline sizes
for small page systems. If the configured inline data size exceeds
NVMET_RDMA_MAX_INLINE_DATA_SIZE, a warning is logged and the size is
reduced. If param_inline_data_size is set to 0, then inline data is
disabled for that nvmf port.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Steve Wise [Wed, 20 Jun 2018 14:15:05 +0000 (07:15 -0700)]
nvme-rdma: support up to 4 segments of inline data
Allow up to 4 segments of inline data for NVMF WRITE operations. This
reduces latency for small WRITEs by removing the need for the target to
issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp.
Also cap the inline segments used based on the limitations of the
device.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: add buffered I/O support for file backed ns
Add a new "buffered_io" attribute, which disabled direct I/O and thus
enables page cache based caching when enabled. The attribute can only
be changed when the namespace is disabled as the file has to be reopend
for the change to take effect.
The possibly blocking read/write are deferred to a newly introduced
global workqueue.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: add commands supported and effects log page
This patch adds support for Commands Supported and Effects log page
(Log Identifier 05h) for NVMeOF. This also makes it easier to find
which commands are supported, e.g. :-
James Smart [Tue, 12 Jun 2018 23:28:24 +0000 (16:28 -0700)]
nvme: move init of keep_alive work item to controller initialization
Currently, the code initializes the keep alive work item whenever
nvme_start_keep_alive() is called. However, this routine is called
several times while reconnecting, etc. Although it's hoped that keep
alive is always disabled and not scheduled when start is called,
re-initing if it were scheduled or completing can have very bad
side effects. There's no need for re-initialization.
Move the keep_alive work item and cmd struct initialization to
controller init.
Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Sun, 22 Jul 2018 06:10:15 +0000 (14:10 +0800)]
blk-mq: fail the request in case issue failure
Inside blk_mq_try_issue_list_directly(), if the request is issued as
failed, we shouldn't try to do it again, otherwise the warning in
blk_mq_start_request() will be triggered. This change is aligned to
behaviour of other ways of request issue & dispatch.
Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: kernel test robot <rong.a.chen@intel.com> Cc: LKP <lkp@01.org> Reported-by: kernel test robot <rong.a.chen@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Fri, 20 Jul 2018 01:42:13 +0000 (21:42 -0400)]
blk-rq-qos: make depth comparisons unsigned
With the change to use UINT_MAX I broke the depth check as any value of
inflight (ie 0) would be less than (int)UINT_MAX. Fix this by changing
everything to unsigned int to match the depth.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blkcg: Track DISCARD statistics and output them in cgroup io.stat
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat. Two
fields, dbytes and dios, to respectively count the total bytes and
number of discards are added.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Cc: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Callahan [Wed, 18 Jul 2018 11:47:40 +0000 (04:47 -0700)]
block: Track DISCARD statistics and output them in stat and diskstat
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats. These are tracked with the same four stats as reads
and writes:
Number of discard ios completed.
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests
This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.
tj: Refreshed on top of v4.17 and other previous updates.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Callahan [Wed, 18 Jul 2018 11:47:39 +0000 (04:47 -0700)]
block: Add and use op_stat_group() for indexing disk_stat fields.
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should et updated.
In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.
Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function. It's now
indexed by op_is_write().
tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Matias Bjorling <mb@lightnvm.io> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Alasdair Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Callahan [Wed, 18 Jul 2018 11:47:38 +0000 (04:47 -0700)]
block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.
tj: Refreshed on top of v4.17.
Signed-off-by: Michael Callahan <michaelcallahan@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Michael Callahan [Wed, 18 Jul 2018 11:47:37 +0000 (04:47 -0700)]
block: Add part_stat_read_accum to read across field entries.
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries. For example to sum up the number read and write
sectors completed. In addition to being ar reasonable cleanup by
itself this will make it easier to add new stat fields in the future.
block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5bb9 ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.
Unfortunately, we want to track discards separately and @is_write
isn't enough information. This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Mike Christie <mchristi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 10 Jul 2018 01:03:31 +0000 (09:03 +0800)]
blk-mq: issue directly if hw queue isn't busy in case of 'none'
In case of 'none' io scheduler, when hw queue isn't busy, it isn't
necessary to enqueue request to sw queue and dequeue it from
sw queue because request may be submitted to hw queue asap without
extra cost, meantime there shouldn't be much request in sw queue,
and we don't need to worry about effect on IO merge.
There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.
This patch improves IOPS and decreases CPU unilization on megaraid_sas,
per Kashyap's test.
Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 16 Jul 2018 16:12:23 +0000 (12:12 -0400)]
blk-iolatency: truncate our current time
In our longer tests we noticed that some boxes would degrade to the
point of uselessness. This is because we truncate the current time when
saving it in our bio, but I was using the raw current time to subtract
from. So once the box had been up a certain amount of time it would
appear as if our IO's were taking several years to complete. Fix this
by truncating the current time so it matches the issue time. Verified
this worked by running with this patch for a week on our test tier.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Mon, 16 Jul 2018 16:12:22 +0000 (12:12 -0400)]
blk-iolatency: don't change the latency window
Early versions of these patches had us waiting for seconds at a time
during submission, so we had to adjust the timing window we monitored
for latency. Now we don't do things like that so this is unnecessary
code.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
In the read path, partial reads are currently performed synchronously
which affects performance for workloads that generate many partial
reads. This patch adds an asynchronous partial read path as well as
the required partial read ctx.
lightnvm: pblk: expose generic disk name on pr_* msgs
The error messages in pblk does not say which pblk instance that
a message occurred from. Update each error message to reflect the
instance it belongs to, and also prefix it with pblk, so we know
the message comes from the pblk module.
For devices that does not specify a limit on its transfer size, the
get_chk_meta command may send down a single I/O retrieving the full
chunk metadata table. Resulting in large 2-4MB I/O requests. Instead,
split up the I/Os to a maximum of 256KB and issue them separately to
reduce memory requirements.
If using pblk on a 32bit architecture, and there is a need to
perform a partial read, the partial read bitmap will only have
allocated 32 entries, where as 64 are needed.
Make sure that the read_bitmap is initialized to 64bits on 32bit
architectures as well.
Signed-off-by: Matias Bjørling <mb@lightnvm.io> Reviewed-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier GonzĂ¡lez <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since both blk_old_get_request() and blk_mq_alloc_request() initialize
rq->__data_len to zero, it is not necessary to initialize that member
in nvme_nvm_alloc_request(). Hence remove the rq->__data_len
initialization from nvme_nvm_alloc_request().
lightnvm: pblk: enable line minor version detection
When recovering a line, an extra check was added when debugging was
active, such that minor version where also checked. Unfortunately,
this used the ifdef NVM_DEBUG, which is not correct.
Instead use the proper DEBUG def, and now that it compiles, also fix
the variable.
Signed-off-by: Matias Bjørling <mb@lightnvm.io> Fixes: d0ab0b1ab991f ("lightnvm: pblk: check data lines version on recovery") Reviewed-by: Javier GonzĂ¡lez <javier@cnexlabs.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
lightnvm: pblk: handle case when mw_cunits equals to 0
Some devices can expose mw_cunits equal to 0, it can cause the
creation of too small write buffer and cause performance to drop
on write workloads.
Additionally, write buffer size must cover write data requirements,
such as WS_MIN and MW_CUNITS - it must be greater than or equal to
the larger one multiplied by the number of PUs. However, for
performance reasons, use the WS_OPT value to calculation instead of
WS_MIN.
Because the place where buffer size is calculated was changed, this
patch also removes pgs_in_buffer filed in pblk structure.
Signed-off-by: Marcin Dziegielewski <marcin.dziegielewski@intel.com> Signed-off-by: Igor Konopko <igor.j.konopko@intel.com> Reviewed-by: Javier GonzĂ¡lez <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove blkdev_entry_to_request() macro, which remained unused through
the observable history, also note that it repeats list_entry_rq() macro
verbatim.
Signed-off-by: Vladimir Zapolskiy <vz@mleia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: skd: Use %pad printk format for dma_addr_t values
Use the existing %pad printk format to print dma_addr_t values.
This avoids the following warnings when compiling on the parisc64 platform:
drivers/block/skd_main.c: In function 'skd_preop_sg_list':
drivers/block/skd_main.c:660:4: warning: format '%llx' expects argument of type 'long long unsigned int', but argument 6 has type 'dma_addr_t {aka unsigned int}' [-Wformat=]
The code poses a security risk due to user memory access in ->release
and had an API that can't be used reliably. As far as we know it was
never used for real, but if that turns out wrong we'll have to revert
this commit and come up with a band aid.
Jann Horn did look software archives for users of this interface,
and the only users found were example code in sg3_utils, and optional
support in an optional module of the tgt user space iscsi target,
which looks like a proof of concept extension of the /dev/sg
read/write support.
Tony Battersby chimes in that the code is basically unsafe to use in
general:
The read/write interface on /dev/bsg is impossible to use safely
because the list of completed commands is per-device (bd->done_list)
rather than per-fd like it is with /dev/sg. So if program A and
program B are both using the write/read interface on the same bsg
device, then their command responses will get mixed up, and program
A will read() some command results from program B and vice versa.
So no, I don't use read/write on /dev/bsg. From a security standpoint,
it should definitely be fixed or removed.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Wed, 11 Jul 2018 14:34:42 +0000 (10:34 -0400)]
blk-iolatency: fix max_depth comparisons
max_depth used to be a u64, but I changed it to a unsigned int but
didn't convert my comparisons over everywhere. Fix by using UINT_MAX
everywhere instead of (u64)-1.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Randy Dunlap [Sat, 7 Jul 2018 03:49:19 +0000 (20:49 -0700)]
block/DAC960.c: fix defined but not used build warnings
Fix build warnings in DAC960.c when CONFIG_PROC_FS is not enabled
by marking the unused functions as __maybe_unused.
../drivers/block/DAC960.c:6429:12: warning: 'dac960_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6449:12: warning: 'dac960_initial_status_proc_show' defined but not used [-Wunused-function]
../drivers/block/DAC960.c:6456:12: warning: 'dac960_current_status_proc_show' defined but not used [-Wunused-function]
Adds support for exposing a null_blk device through the zone device
interface.
The interface is managed with the parameters zoned and zone_size.
If zoned is set, the null_blk instance registers as a zoned block
device. The zone_size parameter defines how big each zone will be.
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: Add default switch case to blk_pm_allow_request() to kill warning
With gcc 4.9.0 and 7.3.0:
block/blk-core.c: In function 'blk_pm_allow_request':
block/blk-core.c:2747:2: warning: enumeration value 'RPM_ACTIVE' not handled in switch [-Wswitch]
switch (rq->q->rpm_status) {
^
Convert the return statement below the switch() block into a default
case to fix this.
Fixes: e4f36b249b4d4e75 ("block: fix peeking requests during PM") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
block: fix infinite loop if the device loses discard capability
If __blkdev_issue_discard is in progress and a device mapper device is
reloaded with a table that doesn't support discard,
q->limits.max_discard_sectors is set to zero. This results in infinite
loop in __blkdev_issue_discard.
This patch checks if max_discard_sectors is zero and aborts with
-EOPNOTSUPP.
Liu Bo [Thu, 5 Jul 2018 19:07:13 +0000 (03:07 +0800)]
null_blk: remove NULLB_DEV_FL_CONFIGURED on turning off nullb device
Currently mbps knob could only be set once before switching power knob to
on, after power knob has been set at least once, there is no way to set
mbps knob again due to -EBUSY.
As nullb is mainly used for testing, in order to make it flexible, this
removes the flag NULLB_DEV_FL_CONFIGURED so that mbps knob can be reset
when power knob is off, e.g.
Josef Bacik [Tue, 3 Jul 2018 15:15:03 +0000 (11:15 -0400)]
mm: skip readahead if the cgroup is congested
We noticed in testing we'd get pretty bad latency stalls under heavy
pressure because read ahead would try to do its thing while the cgroup
was under severe pressure. If we're under this much pressure we want to
do as little IO as possible so we can still make progress on real work
if we're a throttled cgroup, so just skip readahead if our group is
under pressure.
Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Tue, 3 Jul 2018 15:15:01 +0000 (11:15 -0400)]
block: introduce blk-iolatency io controller
Current IO controllers for the block layer are less than ideal for our
use case. The io.max controller is great at hard limiting, but it is
not work conserving. This patch introduces io.latency. You provide a
latency target for your group and we monitor the io in short windows to
make sure we are not exceeding those latency targets. This makes use of
the rq-qos infrastructure and works much like the wbt stuff. There are
a few differences from wbt
- It's bio based, so the latency covers the whole block layer in addition to
the actual io.
- We will throttle all IO types that comes in here if we need to.
- We use the mean latency over the 100ms window. This is because writes can
be particularly fast, which could give us a false sense of the impact of
other workloads on our protected workload.
- By default there's no throttling, we set the queue_depth to INT_MAX so that
we can have as many outstanding bio's as we're allowed to. Only at
throttle time do we pay attention to the actual queue depth.
- We backcharge cgroups for root cg issued IO and induce artificial
delays in order to deal with cases like metadata only or swap heavy
workloads.
In testing this has worked out relatively well. Protected workloads
will throttle noisy workloads down to 1 io at time if they are doing
normal IO on their own, or induce up to a 1 second delay per syscall if
they are doing a lot of root issued IO (metadata/swap IO).
Our testing has revolved mostly around our production web servers where
we have hhvm (the web server application) in a protected group and
everything else in another group. We see slightly higher requests per
second (RPS) on the test tier vs the control tier, and much more stable
RPS across all machines in the test tier vs the control tier.
Another test we run is a slow memory allocator in the unprotected group.
Before this would eventually push us into swap and cause the whole box
to die and not recover at all. With these patches we see slight RPS
drops (usually 10-15%) before the memory consumer is properly killed and
things recover within seconds.
Josef Bacik [Tue, 3 Jul 2018 15:15:00 +0000 (11:15 -0400)]
rq-qos: introduce dio_bio callback
wbt cares only about request completion time, but controllers may need
information that is on the bio itself, so add a done_bio callback for
rq-qos so things like blk-iolatency can use it to have the bio when it
completes.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Tue, 3 Jul 2018 15:14:59 +0000 (11:14 -0400)]
block: remove external dependency on wbt_flags
We don't really need to save this stuff in the core block code, we can
just pass the bio back into the helpers later on to derive the same
flags and update the rq->wbt_flags appropriately.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Tue, 3 Jul 2018 15:32:35 +0000 (09:32 -0600)]
blk-rq-qos: refactor out common elements of blk-wbt
blkcg-qos is going to do essentially what wbt does, only on a cgroup
basis. Break out the common code that will be shared between blkcg-qos
and wbt into blk-rq-qos.* so they can both utilize the same
infrastructure.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
memcontrol: schedule throttling if we are congested
Memory allocations can induce swapping via kswapd or direct reclaim. If
we are having IO done for us by kswapd and don't actually go into direct
reclaim we may never get scheduled for throttling. So instead check to
see if our cgroup is congested, and if so schedule the throttling.
Before we return to user space the throttling stuff will only throttle
if we actually required it.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Tue, 3 Jul 2018 15:14:55 +0000 (11:14 -0400)]
blkcg: add generic throttling mechanism
Since IO can be issued from literally anywhere it's almost impossible to
do throttling without having some sort of adverse effect somewhere else
in the system because of locking or other dependencies. The best way to
solve this is to do the throttling when we know we aren't holding any
other kernel resources. Do this by tracking throttling in a per-blkg
basis, and if we require throttling flag the task that it needs to check
before it returns to user space and possibly sleep there.
This is to address the case where a process is doing work that is
generating IO that can't be throttled, whether that is directly with a
lot of REQ_META IO, or indirectly by allocating so much memory that it
is swamping the disk with REQ_SWAP. We can't use task_add_work as we
don't want to induce a memory allocation in the IO path, so simply
saving the request queue in the task and flagging it to do the
notify_resume thing achieves the same result without the overhead of a
memory allocation.
swap,blkcg: issue swap io with the appropriate context
For backcharging we need to know who the page belongs to when swapping
it out. We don't worry about things that do ->rw_page (zram etc) at the
moment, we're only worried about pages that actually go to a block
device.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Josef Bacik [Tue, 3 Jul 2018 15:14:53 +0000 (11:14 -0400)]
blk: introduce REQ_SWAP
Just like REQ_META, it's important to know the IO coming down is swap
in order to guard against potential IO priority inversion issues with
cgroups. Add REQ_SWAP and use it for all swap IO, and add it to our
bio_issue_as_root_blkg helper.
Josef Bacik [Tue, 3 Jul 2018 15:14:52 +0000 (11:14 -0400)]
blk-cgroup: allow controllers to output their own stats
blk-iolatency has a few stats that it would like to print out, and
instead of adding a bunch of crap to the generic code just provide a
helper so that controllers can add stuff to the stat line if they want
to.
Hide it behind a boot option since it changes the output of io.stat from
normal, and these stats are only interesting to developers.
Josef Bacik [Tue, 3 Jul 2018 15:14:51 +0000 (11:14 -0400)]
block: introduce bio_issue_as_root_blkg
Instead of forcing all file systems to get the right context on their
bio's, simply check for REQ_META to see if we need to issue as the root
blkg. We don't want to force all bio's to have the root blkg associated
with them if REQ_META is set, as some controllers (blk-iolatency) need
to know who the originating cgroup is so it can backcharge them for the
work they are doing. This helper will make sure that the controllers do
the proper thing wrt the IO priority and backcharging.
Josef Bacik [Tue, 3 Jul 2018 15:14:50 +0000 (11:14 -0400)]
block: add bi_blkg to the bio for cgroups
Currently io.low uses a bi_cg_private to stash its private data for the
blkg, however other blkcg policies may want to use this as well. Since
we can get the private data out of the blkg, move this to bi_blkg in the
bio and make it generic, then we can use bio_associate_blkg() to attach
the blkg to the bio.
Theoretically we could simply replace the bi_css with this since we can
get to all the same information from the blkg, however you have to
lookup the blkg, so for example wbc_init_bio() would have to lookup and
possibly allocate the blkg for the css it was trying to attach to the
bio. This could be problematic and result in us either not attaching
the css at all to the bio, or falling back to the root blkcg if we are
unable to allocate the corresponding blkg.
So for now do this, and in the future if possible we could just replace
the bi_css with bi_blkg and update the helpers to do the correct
translation.
Ming Lei [Tue, 3 Jul 2018 15:03:16 +0000 (09:03 -0600)]
blk-mq: dequeue request one by one from sw queue if hctx is busy
It won't be efficient to dequeue request one by one from sw queue,
but we have to do that when queue is busy for better merge performance.
This patch takes the Exponential Weighted Moving Average(EWMA) to figure
out if queue is busy, then only dequeue request one by one from sw queue
when queue is busy.
Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue") Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 2 Jul 2018 09:35:59 +0000 (17:35 +0800)]
blk-mq: only attempt to merge bio if there is rq in sw queue
Only attempt to merge bio iff the ctx->rq_list isn't empty, because:
1) for high-performance SSD, most of times dispatch may succeed, then
there may be nothing left in ctx->rq_list, so don't try to merge over
sw queue if it is empty, then we can save one acquiring of ctx->lock
2) we can't expect good merge performance on per-cpu sw queue, and missing
one merge on sw queue won't be a big deal since tasks can be scheduled from
one CPU to another.
Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 2 Jul 2018 09:35:58 +0000 (17:35 +0800)]
blk-mq: use list_splice_tail_init() to insert requests
list_splice_tail_init() is much more faster than inserting each
request one by one, given all requets in 'list' belong to
same sw queue and ctx->lock is required to insert requests.
Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Minwoo Im [Mon, 2 Jul 2018 14:46:43 +0000 (23:46 +0900)]
blk-mq: code clean-up by adding an API to clear set->mq_map
set->mq_map is now currently cleared if something goes wrong when
establishing a queue map in blk-mq-pci.c. It's also cleared before
updating a queue map in blk_mq_update_queue_map().
This patch provides an API to clear set->mq_map to make it clear.
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Mon, 2 Jul 2018 08:14:19 +0000 (09:14 +0100)]
paride: remove redundant variable n
Variable n is being assigned but is never used hence it is redundant
and can be removed. Also put spacing between variables in declaration
to clean up checkpatch warnings.
Cleans up clang warning:
warning: variable 'n' set but not used [-Wunused-but-set-variable]
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 27 Jun 2018 20:09:05 +0000 (13:09 -0700)]
block: Document how blk_update_request() handles RQF_SPECIAL_PAYLOAD requests
The payload of struct request is stored in the request.bio chain if
the RQF_SPECIAL_PAYLOAD flag is not set and in request.special_vec if
RQF_SPECIAL_PAYLOAD has been set. However, blk_update_request()
iterates over req->bio whether or not RQF_SPECIAL_PAYLOAD has been
set. Additionally, the RQF_SPECIAL_PAYLOAD flag is ignored by
blk_rq_bytes() which means that the value returned by that function
is incorrect if the RQF_SPECIAL_PAYLOAD flag has been set. It is not
clear to me whether this is an oversight or whether this happened on
purpose. Anyway, document that it is known that both functions ignore
RQF_SPECIAL_PAYLOAD. See also commit f9d03f96b988 ("block: improve
handling of the magic discard payload").
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 25 Jun 2018 11:31:49 +0000 (19:31 +0800)]
blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()
SCSI probing may synchronously create and destroy a lot of request_queues
for non-existent devices. Any synchronize_rcu() in queue creation or
destroy path may introduce long latency during booting, see detailed
description in comment of blk_register_queue().
This patch removes one synchronize_rcu() inside blk_cleanup_queue()
for this case, commit c2856ae2f315d75(blk-mq: quiesce queue before freeing queue)
needs synchronize_rcu() for implementing blk_mq_quiesce_queue(), but
when queue isn't initialized, it isn't necessary to do that since
only pass-through requests are involved, no original issue in
scsi_execute() at all.
Without this patch and previous one, it may take more 20+ seconds for
virtio-scsi to complete disk probe. With the two patches, the time becomes
less than 100ms.
Fixes: c2856ae2f315d75 ("blk-mq: quiesce queue before freeing queue") Reported-by: Andrew Jones <drjones@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 25 Jun 2018 11:31:48 +0000 (19:31 +0800)]
blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()
We have to remove synchronize_rcu() from blk_queue_cleanup(),
otherwise long delay can be caused during lun probe. For removing
it, we have to avoid to iterate the set->tag_list in IO path, eg,
blk_mq_sched_restart().
This patch reverts 5b79413946d (Revert "blk-mq: don't handle
TAG_SHARED in restart"). Given we have fixed enough IO hang issue,
and there isn't any reason to restart all queues in one tags any more,
see the following reasons:
1) blk-mq core can deal with shared-tags case well via blk_mq_get_driver_tag(),
which can wake up queues waiting for driver tag.
2) SCSI is a bit special because it may return BLK_STS_RESOURCE if queue,
target or host is ready, but SCSI built-in restart can cover all these well,
see scsi_end_request(), queue will be rerun after any request initiated from
this host/target is completed.
In my test on scsi_debug(8 luns), this patch may improve IOPS by 20% ~ 30%
when running I/O on these 8 luns concurrently.
Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list") Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org Reported-by: Andrew Jones <drjones@redhat.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 25 Jun 2018 11:31:47 +0000 (19:31 +0800)]
blk-mq: introduce new lock for protecting hctx->dispatch_wait
Now hctx->lock is only acquired when adding hctx->dispatch_wait to
one wait queue, but not held when removing it from the wait queue.
IO hang can be observed easily if SCHED RESTART is disabled, that means
now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait().
This patch fixes the issue by introducing hctx->dispatch_wait_lock and
holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(),
since we need to avoid acquiring hctx->lock in irq context.
Fixes: eb619fdb2d4cb8b3d3419 ("blk-mq: fix issue with shared tag queue re-running") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 25 Jun 2018 11:31:46 +0000 (19:31 +0800)]
blk-mq: don't pass **hctx to blk_mq_mark_tag_wait()
'hctx' won't be changed at all, so not necessary to pass
'**hctx' to blk_mq_mark_tag_wait().
Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 25 Jun 2018 11:31:45 +0000 (19:31 +0800)]
blk-mq: cleanup blk_mq_get_driver_tag()
We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence
we never change '**hctx' as well. The last use of these went away
with the flush cleanup, commit 0c2a6fe4dc3e.
So cleanup the usage and remove the two extra parameters.
Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Tested-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>