filemap_range_has_page() return true if the file's mapping has
a page within the range mentioned. This function will be used
to check if a write() call will cause a writeback of previous
writes.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 19 Jun 2017 02:21:08 +0000 (10:21 +0800)]
nvme: host: unquiesce queue in nvme_kill_queues()
When nvme_kill_queues() is run, queues may be in
quiesced state, so we forcibly unquiesce queues to avoid
blocking dispatch, and I/O hang can be avoided in
remove path.
Peviously we use blk_mq_start_stopped_hw_queues() as
counterpart of blk_mq_quiesce_queue(), now we have
introduced blk_mq_unquiesce_queue(), so use it explicitly.
Cc: linux-nvme@lists.infradead.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:10 +0000 (23:22 +0800)]
Revert "blk-mq: don't use sync workqueue flushing from drivers"
This patch reverts commit 2719aa217e0d02(blk-mq: don't use
sync workqueue flushing from drivers) because only
blk_mq_quiesce_queue() need the sync flush, and now
we don't need to stop queue any more, so revert it.
Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue().
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:09 +0000 (23:22 +0800)]
blk-mq: clarify dispatch may not be drained/blocked by stopping queue
BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths,
we can't guarantee that dispatching won't happen after returning
from the APIs of stopping queue.
So clarify the fact and avoid potential misuse.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:08 +0000 (23:22 +0800)]
blk-mq: don't stop queue for quiescing
Queue can be started by other blk-mq APIs and can be used in
different cases, this limits uses of blk_mq_quiesce_queue()
if it is based on stopping queue, and make its usage very
difficult, especially users have to use the stop queue APIs
carefully for avoiding to break blk_mq_quiesce_queue().
We have applied the QUIESCED flag for draining and blocking
dispatch, so it isn't necessary to stop queue any more.
After stopping queue is removed, blk_mq_quiesce_queue() can
be used safely and easily, then users won't worry about queue
restarting during quiescing at all.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sun, 18 Jun 2017 20:24:27 +0000 (14:24 -0600)]
blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue
It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.
But blk_mq_quiesce_queue() still may not block/drain dispatch in the
the case of BLK_MQ_S_START_ON_RUN, so use the new introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.
Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses, and easy to cause race, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:03 +0000 (23:22 +0800)]
blk-mq: introduce blk_mq_unquiesce_queue
blk_mq_start_stopped_hw_queues() is used implictly
as counterpart of blk_mq_quiesce_queue() for unquiescing queue,
so we introduce blk_mq_unquiesce_queue() and make it
as counterpart of blk_mq_quiesce_queue() explicitly.
This function is for improving the current quiescing mechanism
in the following patches.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()
blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.
No driver uses bio_clone_bioset() any more, they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.
The main user of of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
block: remove bio_clone() and all references.
bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().
So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
bcache: use kmalloc to allocate bio in bch_data_verify()
This function allocates a bio, then a collection
of pages. It copes with failure.
It currently uses a mempool() to allocate the bio,
but alloc_page() to allocate the pages. These fail
in different ways, so the usage is inconsistent.
Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.
Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by : Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
xen-blkfront: remove bio splitting.
bios that are re-submitted will pass through blk_queue_split() when
blk_queue_bio() is called, and this will split the bio if necessary.
There is no longer any need to do this splitting in xen-blkfront.
Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
lightnvm/pblk-read: use bio_clone_fast()
pblk_submit_read() uses bio_clone_bioset() but doesn't change the
io_vec, so bio_clone_fast() is a better choice.
It also uses fs_bio_set which is intended for filesystems. Using it
in a device driver can deadlock.
So allocate a new bioset, and and use bio_clone_fast().
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
pktcdvd: use bio_clone_fast() instead of bio_clone()
pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesytems use.
This new bio_set, pkt_bio_set, can also be use for the bio_split()
call as the two allocations (bio_clone_fast, and bio_split) are
independent, neither can block a bio allocated by the other.
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
drbd: use bio_clone_fast() instead of bio_clone()
drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part. So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.
So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().
Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
rbd: use bio_clone_fast() instead of bio_clone()
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.
This requires that we provide a bio_set. bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks. As filesystems
use fs_bio_set, block devices shouldn't.
As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use. So allocate that when the module is
initialised, and use it with bio_clone_fast().
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
block: Improvements to bounce-buffer handling
Since commit 23688bf4f830 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
1/ the comments blk_queue_split() about bounce buffers are
irrelevant, and
2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
split before it arrives at blk_queue_bounce(), leading to the
possibility that bio_clone_bioset() will fail and a NULL
will be dereferenced.
Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.
So:
- allocate 2 private biosets for blk_queue_bounce, one for
splitting enormous bios and one for cloning bios.
- add code to split a bio that exceeds BIO_MAX_PAGES.
- Fix up the comments in blk_queue_split()
Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using single bio_for_each_segment loop) Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: use non-rescuing bioset for q->bio_split.
A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called. This never applies to
q->bio_split.
Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s. The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.
The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possible after some prefix has been split off
(using some other bioset).
generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled. None of these early bios
can possible trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.
The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly. By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue. So no rescuing will ever be
needed.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: make the bioset rescue_workqueue optional.
This patch converts bioset_create() to not create a workqueue by
default, so alloctions will never trigger punt_bios_to_rescuer(). It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.
All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.
biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios. This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.
biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.
It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
Reviewed-by: Christoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: remove bio_set arg from blk_queue_split()
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers. For Kyber and additional limit_depth method
is added that is called before allocating the request.
Also because none of the intializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.
Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_sched_assign_ioc now only handles the assigned of the ioc if
the schedule needs it (bfq only at the moment). The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Fri, 16 Jun 2017 05:02:09 +0000 (15:02 +1000)]
loop: Add PF_LESS_THROTTLE to block/loop device thread.
When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file. This double-throttling can trigger
positive feedback loops that create significant delays. The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.
The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server. It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.
To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
time taken.
When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable. 52-460 seconds. Half below 100seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.
There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 16 Jun 2017 16:14:59 +0000 (10:14 -0600)]
Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-4.13/block
Pull NVMe changes for 4.13 from Christoph:
Highlights:
- UUID identifier support from Johannes
- Lots of cleanups from Sagi
- Host Memory Buffer support from me
And lots of cleanups and smaller fixes of course.
Note that the UUID identifier changes are based on top of the uuid tree.
I am the maintainer of that tree and will send it to Linus as soon as
4.12 is released as various other trees depend on it as well (and the
diffstat includes those changes unfortunately)
Arvind Yadav [Fri, 16 Jun 2017 09:54:39 +0000 (15:24 +0530)]
block: swim3: make of_device_ids const.
of_device_ids are not supposed to change at runtime. All functions
working with of_device_ids provided by <linux/of.h> work with const
of_device_ids. So mark the non-const structs as const.
File size before:
text data bss dec hex filename
8908 1096 624 10628 2984 drivers/block/swim3.o
File size after constify swim3_match:
text data bss dec hex filename
9708 296 624 10628 2984 drivers/block/swim3.o
Bart Van Assche [Tue, 13 Jun 2017 15:07:33 +0000 (08:07 -0700)]
block: Dedicated error code fixups
This patch fixes two sparse warnings introduced by the "dedicated
error codes for the block layer V3" patch series. These changes
have not been tested.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Scott Bauer [Thu, 15 Jun 2017 16:44:30 +0000 (10:44 -0600)]
nvme: implement NS Optimal IO Boundary from 1.3 Spec
The NVMe 1.3 spec introduces Namespace Optimal IO Boundaries (NOIOB),
which standardizes the stripe mechanism we currently have quirks for.
This patch implements the necessary logic to handle this new feature.
Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvme: move reset workqueue handling to common code
This moves the nvme_reset function from the PCIe driver to common code,
renaming it to nvme_reset_ctrl in the process. Additionally a new
helper nvme_reset_ctrl_sync is added for the case where we want to
wait for the reset. To facilitate that the reset_work work structure is
move to the common nvme_ctrl structure and the ->reset_ctrl method is
removed. For now the drivers initialize the reset_work with their own
callback, but longer term we should move to callouts for specific
parts of the reset process and move even more code to the core.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
nvme: move protection information check into nvme_setup_rw
It only applies to read/write commands, and this way non-PCIe drivers
get the check as well instead of having to duplicate it when adding
metadata support.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Dan Carpenter [Wed, 14 Jun 2017 10:46:45 +0000 (13:46 +0300)]
nvme-rdma: fix error code in nvme_rdma_create_ctrl()
We accidentally return ERR_PTR(0) which is NULL. The caller isn't
explicitly checking for that but I couldn't immediately spot whether
this would lead to a NULL dereference. Anyway, we can fix add an
error code easily enough.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Guan Junxiong [Tue, 13 Jun 2017 02:51:24 +0000 (10:51 +0800)]
nvmf: keep track of nvmet connect error status
To let the host know what happends to the connection establishment,
adjust the behavior of nvmf_log_connect_error to make more connect
specifig error codes human-readble.
Bart Van Assche [Thu, 8 Jun 2017 16:43:29 +0000 (09:43 -0700)]
nvmet-fc: Remove a set-but-not-used variable
This was detected by building the nvmet-fc driver with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: James Smart <james.smart@broadcom.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow overriding the announced NVMe Version of a via configfs.
This is particularly helpful when debugging new features for the host
or target side without bumping the hard coded version (as the target
might not be fully compliant to the announced version yet).
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: add uuid field to nvme_ns and populate via configfs
Add the UUID field from the NVMe Namespace Identification Descriptor
to the nvmet_ns structure and allow it's population via configfs.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: implement namespace identify descriptor list
A NVMe Identify NS command with a CNS value of '3' is expecting a list
of Namespace Identification Descriptor structures to be returned to
the host for the namespace requested in the namespace identify
command.
This Namespace Identification Descriptor structure consists of the
type of the namespace identifier, the length of the identifier and the
actual identifier.
Valid types are NGUID and UUID which we have saved in our nvme_ns
structure if they have been configured via configfs. If no value has
been assigened to one of these we return an "invalid opcode" back to
the host to maintain backward compatibiliy with older implementations
without Namespace Identify Descriptor list support.
Also as the Namespace Identify Descriptor list is the only mandatory
feature change between 1.2.1 and 1.3 we can bump the advertised
version as well.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that we have a way for getting the UUID from a target, provide it
to userspace as well.
Unfortunately there is already a sysfs attribute called UUID which is
a misnomer as it holds the NGUID value. So instead of creating yet
another wrong name, create a new 'nguid' sysfs attribute for the
NGUID. For the UUID attribute add a check wheter the namespace has a
UUID assigned to it and return this or return the NGUID to maintain
backwards compatibility. This should give userspace a chance to catch
up.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@rimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
If a target identifies itself as NVMe 1.3 compliant, try to get the
list of Namespace Identification Descriptors and populate the UUID,
NGUID and EUI64 fileds in the NVMe namespace structure with these
values.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
The uuid field in the nvme_ns structure represents the nguid field
from the identify namespace command. And as NVMe 1.3 introduced an
UUID in the NVMe Namespace Identification Descriptor this will
collide.
So rename the uuid to nguid to prevent any further
confusion. Unfortunately we export the nguid to sysfs in the uuid
sysfs attribute, but this can't be changed anymore without possibly
breaking existing userspace.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Use NVME_IDENTIFY_DATA_SIZE define instead of hard coding the magic
4096 value.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com>
[hch: converted three more users] Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Wed, 7 Jun 2017 18:32:50 +0000 (20:32 +0200)]
nvme-pci: Remove watchdog timer
The controller status polling was added to preemptively reset a failed
controller. This early detection would allow commands that would normally
timeout a chance for a retry, or find broken links when the platform
didn't support hotplug.
This once-per-second MMIO read, however, created more problems than
it solves. This often races with PCIe Hotplug events that required
complicated syncing between work queues, frequently triggered PCIe
Completion Timeout errors that also lead to fatal machine checks, and
unnecessarily disrupts low power modes by running on idle controllers.
This patch removes the watchdog timer, and instead checks controller
health only on an IO timeout when we have a reason to believe something
is wrong. If the controller is failed, the driver will disable immediately
and request scheduling a reset.
Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Xu Yu [Wed, 24 May 2017 08:39:55 +0000 (16:39 +0800)]
nvme-pci: remap BAR0 to cover admin CQ doorbell for large stride
The existing driver initially maps 8192 bytes of BAR0 which is
intended to cover doorbells of admin SQ and CQ. However, if a
large stride, e.g. 10, is used, the doorbell of admin CQ will
be out of 8192 bytes. Consequently, a page fault will be raised
when the admin CQ doorbell is accessed in nvme_configure_admin_queue().
This patch fixes this issue by remapping BAR0 before accessing
admin CQ doorbell if the initial mapping is not enough.
Sagi Grimberg [Thu, 4 May 2017 10:33:12 +0000 (13:33 +0300)]
nvme: Don't allow to reset a reconnecting controller
The reset operation is guaranteed to fail for all scenarios
but the esoteric case where in the last reconnect attempt
concurrent with the reset we happen to successfully reconnect.
We just deny initiating a reset if we are reconnecting.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Instead of introducing a flag for if the queue is allocated,
simply free the rdma resources when we get the error.
We allocate the queue rdma resources when we have an address
resolution, their we allocate (or take a reference on) our device
so we should free it when we have error after the address resolution
namely:
1. route resolution error
2. connect reject
3. connect error
4. peer unreachable error
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
If a controller supports the host memory buffer we try to provide
it with the requested size up to an upper cap set as a module
parameter. We try to give as few as possible descriptors, eventually
working our way down.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Arnav Dawn [Fri, 12 May 2017 15:12:03 +0000 (17:12 +0200)]
nvme.h: add dword 12 - 15 fields to struct nvme_features
Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
[hch: split from a larger patch, new changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
The merge of 4.12-rc5 into the for-4.13/block tree didn't handle the queue
ready case correctly. Fix this by propagating blk_status_t into
nvme_rdma_queue_is_ready.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Linus Torvalds [Sun, 11 Jun 2017 23:17:29 +0000 (16:17 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull key subsystem fixes from James Morris:
"Here are a bunch of fixes for Linux keyrings, including:
- Fix up the refcount handling now that key structs use the
refcount_t type and the refcount_t ops don't allow a 0->1
transition.
- Fix a potential NULL deref after error in x509_cert_parse().
- Don't put data for the crypto algorithms to use on the stack.
- Fix the handling of a null payload being passed to add_key().
- Fix incorrect cleanup an uninitialised key_preparsed_payload in
key_update().
- Explicit sanitisation of potentially secure data before freeing.
- Fixes for the Diffie-Helman code"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (23 commits)
KEYS: fix refcount_inc() on zero
KEYS: Convert KEYCTL_DH_COMPUTE to use the crypto KPP API
crypto : asymmetric_keys : verify_pefile:zero memory content before freeing
KEYS: DH: add __user annotations to keyctl_kdf_params
KEYS: DH: ensure the KDF counter is properly aligned
KEYS: DH: don't feed uninitialized "otherinfo" into KDF
KEYS: DH: forbid using digest_null as the KDF hash
KEYS: sanitize key structs before freeing
KEYS: trusted: sanitize all key material
KEYS: encrypted: sanitize all key material
KEYS: user_defined: sanitize key payloads
KEYS: sanitize add_key() and keyctl() key payloads
KEYS: fix freeing uninitialized memory in key_update()
KEYS: fix dereferencing NULL payload with nonzero length
KEYS: encrypted: use constant-time HMAC comparison
KEYS: encrypted: fix race causing incorrect HMAC calculations
KEYS: encrypted: fix buffer overread in valid_master_desc()
KEYS: encrypted: avoid encrypting/decrypting stack buffers
KEYS: put keyring if install_session_keyring_to_cred() fails
KEYS: Delete an error message for a failed memory allocation in get_derived_key()
...
Linus Torvalds [Sun, 11 Jun 2017 22:51:56 +0000 (15:51 -0700)]
compiler, clang: properly override 'inline' for clang
Commit abb2ea7dfd82 ("compiler, clang: suppress warning for unused
static inline functions") just caused more warnings due to re-defining
the 'inline' macro.
So undef it before re-defining it, and also add the 'notrace' attribute
like the gcc version that this is overriding does.
Linus Torvalds [Sun, 11 Jun 2017 19:02:01 +0000 (12:02 -0700)]
Merge tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random
Pull randomness fixes from Ted Ts'o:
"Improve performance by using a lockless update mechanism suggested by
Linus, and make sure we refresh per-CPU entropy returned get_random_*
as soon as the CRNG is initialized"
* tag 'random_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random:
random: invalidate batched entropy after crng init
random: use lockless method of accessing and updating f->reg_idx
Linus Torvalds [Sun, 11 Jun 2017 18:57:47 +0000 (11:57 -0700)]
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix various bug fixes in ext4 caused by races and memory allocation
failures"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix fdatasync(2) after extent manipulation operations
ext4: fix data corruption for mmap writes
ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
ext4: fix quota charging for shared xattr blocks
ext4: remove redundant check for encrypted file on dio write path
ext4: remove unused d_name argument from ext4_search_dir() et al.
ext4: fix off-by-one error when writing back pages before dio read
ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
ext4: keep existing extra fields when inode expands
ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
ext4: fix SEEK_HOLE
jbd2: preserve original nofs flag during journal restart
ext4: clear lockdep subtype for quota files on quota off
Linus Torvalds [Sun, 11 Jun 2017 18:34:27 +0000 (11:34 -0700)]
Merge tag 'gpio-v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"A few overdue GPIO patches for the v4.12 kernel.
- Fix debounce logic on the Aspeed platform.
- Fix the "virtual gpio" things on the Intel Crystal Cove.
- Fix the blink counter selection on the MVEBU platform"
* tag 'gpio-v4.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: mvebu: fix gpio bank registration when pwm is used
gpio: mvebu: fix blink counter register selection
MAINTAINERS: remove self from GPIO maintainers
gpio: crystalcove: Do not write regular gpio registers for virtual GPIOs
gpio: aspeed: Don't attempt to debounce if disabled
Linus Torvalds [Sun, 11 Jun 2017 18:29:15 +0000 (11:29 -0700)]
Merge tag 'char-misc-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small driver fixes for 4.12-rc5. Nothing major here,
just some small bugfixes found by people testing, and a MAINTAINERS
file update for the genwqe driver.
All have been in linux-next with no reported issues"
[ The cxl driver fix came in through the powerpc tree earlier ]
* tag 'char-misc-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
cxl: Avoid double free_irq() for psl,slice interrupts
mei: make sysfs modalias format similar as uevent modalias
drivers: char: mem: Fix wraparound check to allow mappings up to the end
MAINTAINERS: Change maintainer of genwqe driver
goldfish_pipe: use GFP_ATOMIC under spin lock
firmware: vpd: do not leak kobjects
firmware: vpd: avoid potential use-after-free when destroying section
firmware: vpd: do not leave freed section attributes to the list
Linus Torvalds [Sun, 11 Jun 2017 18:25:51 +0000 (11:25 -0700)]
Merge tag 'staging-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Greg KH:
"These are mostly all IIO driver fixes, resolving a number of tiny
issues. There's also a ccree and lustre fix in here as well, both fix
problems found in those codebases.
All have been in linux-next with no reported issues"
* tag 'staging-4.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: ccree: fix buffer copy
staging/lustre/lov: remove set_fs() call from lov_getstripe()
staging: ccree: add CRYPTO dependency
iio: adc: sun4i-gpadc-iio: fix parent device being used in devm function
iio: light: ltr501 Fix interchanged als/ps register field
iio: adc: bcm_iproc_adc: swap primary and secondary isr handler's
iio: trigger: fix NULL pointer dereference in iio_trigger_write_current()
iio: adc: max9611: Fix attribute measure unit
iio: adc: ti_am335x_adc: allocating too much in probe
iio: adc: sun4i-gpadc-iio: Fix module autoload when OF devices are registered
iio: adc: sun4i-gpadc-iio: Fix module autoload when PLATFORM devices are registered
iio: proximity: as3935: fix iio_trigger_poll issue
iio: proximity: as3935: fix AS3935_INT mask
iio: adc: Max9611: checking for ERR_PTR instead of NULL in probe
iio: proximity: as3935: recalibrate RCO after resume
Linus Torvalds [Sun, 11 Jun 2017 18:21:08 +0000 (11:21 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of user visible fixes (excepting one format string
change).
Four of the qla2xxx fixes only affect the firmware dump path, but it's
still important to the enterprise. The rest are various NULL pointer
crash conditions or outright driver hangs"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: cxgb4i: libcxgbi: in error case RST tcp conn
scsi: scsi_debug: Avoid PI being disabled when TPGS is enabled
scsi: qla2xxx: Fix extraneous ref on sp's after adapter break
scsi: lpfc: prevent potential null pointer dereference
scsi: lpfc: Avoid NULL pointer dereference in lpfc_els_abort()
scsi: lpfc: nvmet_fc: fix format string
scsi: qla2xxx: Fix crash due to NULL pointer dereference of ctx
scsi: qla2xxx: Fix mailbox pointer error in fwdump capture
scsi: qla2xxx: Set bit 15 for DIAG_ECHO_TEST MBC
scsi: qla2xxx: Modify T262 FW dump template to specify same start/end to debug customer issues
scsi: qla2xxx: Fix crash due to mismatch mumber of Q-pair creation for Multi queue
scsi: qla2xxx: Fix NULL pointer access due to redundant fc_host_port_name call
scsi: qla2xxx: Fix recursive loop during target mode configuration for ISP25XX leaving system unresponsive
scsi: bnx2fc: fix race condition in bnx2fc_get_host_stats()
scsi: qla2xxx: don't disable a not previously enabled PCI device
Linus Torvalds [Sun, 11 Jun 2017 18:15:09 +0000 (11:15 -0700)]
Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fix from Dan Williams:
"We expanded the device-dax fs type in 4.12 to be a generic provider of
a struct dax_device with an embedded inode. However, Sasha found some
basic negative testing was not run to verify that this fs cleanly
handles being mounted directly.
Note that the fresh rebase was done to remove an unnecessary Cc:
<stable> tag, but this commit otherwise had a build success
notification from the 0day robot."
Wanpeng Li [Fri, 9 Jun 2017 03:13:40 +0000 (20:13 -0700)]
KVM: async_pf: avoid async pf injection when in guest mode
INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
Not tainted 4.12.0-rc4+ #8
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gnome-terminal- D 0 1734 1015 0x00000000
Call Trace:
__schedule+0x3cd/0xb30
schedule+0x40/0x90
kvm_async_pf_task_wait+0x1cc/0x270
? __vfs_read+0x37/0x150
? prepare_to_swait+0x22/0x70
do_async_page_fault+0x77/0xb0
? do_async_page_fault+0x77/0xb0
async_page_fault+0x28/0x30
This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
and then gives stress to memory on L1, I can observed this hang on L1 when
at least ~70% swap area is occupied on L0.
This is due to async pf was injected to L2 which should be injected to L1,
L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
actually), and L1 guest starts accumulating tasks stuck in D state in
kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
This patch fixes the hang by doing async pf when executing L1 guest.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>