Roland Kammerer [Tue, 29 Aug 2017 08:20:46 +0000 (10:20 +0200)]
drbd: move global variables to drbd namespace and make some static
This is a follow-up to Gregs complaints that drbd clutteres the global
namespace.
Some of DRBD's module parameters are only used within one compilation
unit. Make these static.
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
drbd: rename "usermode_helper" to "drbd_usermode_helper"
Nothing like having a very generic global variable in a tiny driver
subsystem to make a mess of the global namespace...
Note, there are many other "generic" named global variables in the drbd
subsystem, someone should fix those up one day before they hit a linking
error.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:44 +0000 (10:20 +0200)]
drbd: fix race between handshake and admin disconnect/down
conn_try_disconnect() could potentialy hit the BUG_ON()
in _conn_set_state() where it iterates over _drbd_set_state()
and "asserts" via BUG_ON() that the latter was successful.
If the STATE_SENT bit was not yet visible to conn_is_valid_transition()
early in _conn_request_state(), but became visible before conn_set_state()
later in that call path, we could hit the BUG_ON() after _drbd_set_state(),
because it returned SS_IN_TRANSIENT_STATE.
To avoid that race, we better protect set_bit(SENT_STATE) with the spinlock.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:43 +0000 (10:20 +0200)]
drbd: fix potential deadlock when trying to detach during handshake
When requesting a detach, we first suspend IO, and also inhibit meta-data IO
by means of drbd_md_get_buffer(), because we don't want to "fail" the disk
while there is IO in-flight: the transition into D_FAILED for detach purposes
may get misinterpreted as actual IO error in a confused endio function.
We wrap it all into wait_event(), to retry in case the drbd_req_state()
returns SS_IN_TRANSIENT_STATE, as it does for example during an ongoing
connection handshake.
In that example, the receiver thread may need to grab drbd_md_get_buffer()
during the handshake to make progress. To avoid potential deadlock with
detach, detach needs to grab and release the meta data buffer inside of
that wait_event retry loop. To avoid lock inversion between
mutex_lock(&device->state_mutex) and drbd_md_get_buffer(device),
introduce a new enum chg_state_flag CS_INHIBIT_MD_IO, and move the
call to drbd_md_get_buffer() inside the state_mutex grabbed in
drbd_req_state().
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
If there are still resources defined, but "empty", no more volumes
or connections configured, they don't hold module reference counts,
so rmmod is possible.
To avoid DRBD leftovers in debugfs, we need to call our global
drbd_debugfs_cleanup() only after all resources have been cleaned up.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:39 +0000 (10:20 +0200)]
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
Race:
drbd_adm_attach() | async drbd_md_endio()
|
device->ldev is still NULL. |
|
drbd_md_read( |
.endio = drbd_md_endio; |
submit; |
.... |
wait for done == 1; | done = 1;
); | wake_up();
.. lot of other stuff, |
.. includeing taking and |
...giving up locks, |
.. doing further IO, |
.. stuff that takes "some time" |
| while in this context,
| this is the next statement.
| which means this context was scheduled
.. only then, finally, | away for "some time".
device->ldev = nbc; |
| if (device->ldev)
| put_ldev()
Unlikely, but possible. I was able to provoke it "reliably"
by adding an mdelay(500); after the wake_up().
Fixed by moving the if (!NULL) put_ldev() before done = 1;
Impact of the bug was that the resulting refcount imbalance
could lead to premature destruction of the object, potentially
causing a NULL pointer dereference during a subsequent detach.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Philipp Reisner [Tue, 29 Aug 2017 08:20:37 +0000 (10:20 +0200)]
drbd: Fix resource role for newly created resources in events2
The conn_higest_role() (a terribly misnamed function) returns
the role of the resource. It returned R_UNKNOWN as long as the
resource has not a single device.
Resources without devices are short living objects.
But it matters for the NOTIFY_CREATE netwlink message. It makes
a lot more sense to report R_SECONDARY for the newly created
resource than R_UNKNOWN.
I reviewd all call sites of conn_highest_role(), that change
does not matter for the other call sites.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Baoyou Xie [Tue, 29 Aug 2017 08:20:36 +0000 (10:20 +0200)]
drbd: mark symbols static where possible
We get a few warnings when building kernel with W=1:
drbd/drbd_receiver.c:1224:6: warning: no previous prototype for 'one_flush_endio' [-Wmissing-prototypes]
drbd/drbd_req.c:1450:6: warning: no previous prototype for 'send_and_submit_pending' [-Wmissing-prototypes]
drbd/drbd_main.c:924:6: warning: no previous prototype for 'assign_p_sizes_qlim' [-Wmissing-prototypes]
....
In fact, these functions are only used in the file in which they are
declared and don't need a declaration, but can be made static.
So this patch marks these functions with 'static'.
Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:35 +0000 (10:20 +0200)]
drbd: Send P_NEG_ACK upon write error in protocol != C
In protocol != C, we forgot to send the P_NEG_ACK for failing writes.
Once we no longer submit to local disk, because we already "detached",
due to the typical "on-io-error detach;" config setting,
we already send the neg acks right away.
Only those requests that have been submitted,
and have been error-completed by the local disk,
would forget to send the neg-ack,
and only in asynchronous replication (protocol != C).
Unless this happened during resync,
where we already always send acks, regardless of protocol.
The primary side needs the P_NEG_ACK in order to mark
the affected block(s) for resync in its out-of-sync bitmap.
If the blocks in question are not re-written again,
we may miss to resync them later, causing data inconsistencies.
This patch will always send the neg-acks, and also at least try to
persist the out-of-sync status on the local node already.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:34 +0000 (10:20 +0200)]
drbd: add explicit plugging when submitting batches
When submitting batches of requests which had been queued on the
submitter thread, typically because they needed to wait for an
activity log transactions, use explicit plugging to help potential
merging of requests in the backend io-scheduler.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Lars Ellenberg [Tue, 29 Aug 2017 08:20:32 +0000 (10:20 +0200)]
drbd: introduce drbd_recv_header_maybe_unplug
Recently, drbd_recv_header() was changed to potentially
implicitly "unplug" the backend device(s), in case there
is currently nothing to receive.
Be more explicit about it: re-introduce the original drbd_recv_header(),
and introduce a new drbd_recv_header_maybe_unplug() for use by the
receiver "main loop".
Using explicit plugging via blk_start_plug(); blk_finish_plug();
really helps the io-scheduler of the backend with merging requests.
Wrap the receiver "main loop" with such a plug.
Also catch unplug events on the Primary,
and try to propagate.
This is performance relevant. Without this, if the receiving side does
not merge requests, number of IOPS on the peer can me significantly
higher than IOPS on the Primary, and can easily become the bottleneck.
Together, both changes should help to reduce the number of IOPS
as seen on the backend of the receiving side, by increasing
the chance of merging mergable requests, without trading latency
for more throughput.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ben Hutchings [Sun, 13 Aug 2017 17:03:15 +0000 (18:03 +0100)]
mq-deadline: Enable auto-loading when built as module
The block core requests modules with the "-iosched" name suffix, but
mq-deadline does not have that suffix. Add an alias.
Fixes: 945ffb60c11d ("mq-deadline: add blk-mq adaptation of the deadline ...") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Tue, 29 Aug 2017 02:54:37 +0000 (11:54 +0900)]
block: Make blk_dequeue_request() static
The only caller of this function is blk_start_request() in the same
file. Fix blk_start_request() description accordingly.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 29 Aug 2017 15:32:10 +0000 (08:32 -0700)]
skd: Let the block layer core choose .nr_requests
Since blk_mq_init_queue() initializes .nr_requests to the tag set
size and since that value is a good default for the skd driver, do
not overwrite the value set by blk_mq_init_queue(). This change
doubles the default value of .nr_requests.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 29 Aug 2017 15:32:09 +0000 (08:32 -0700)]
skd: Remove blk_queue_bounce_limit() call
Since sTec s1120 devices support 64-bit DMA it is not necessary
to request data buffer bouncing. Hence remove the
blk_queue_bounce_limit() call.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bhumika Goyal [Mon, 21 Aug 2017 11:43:08 +0000 (17:13 +0530)]
nbd: make device_attribute const
Make this const as is is only passed as an argument to the
function device_create_file and device_remove_file and the corresponding
arguments are of type const.
Done using Coccinelle
Shaohua Li [Mon, 28 Aug 2017 20:49:31 +0000 (13:49 -0700)]
block/nullb: delete unnecessary memory free
Commit 2984c86(nullb: factor disk parameters) has a typo. The
nullb_device allocation/free is done outside of null_add_dev. The commit
accidentally frees the nullb_device in error code path.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
David Jeffery [Mon, 28 Aug 2017 16:52:44 +0000 (10:52 -0600)]
block: fix warning when I/O elevator is changed as request_queue is being removed
There is a race between changing I/O elevator and request_queue removal
which can trigger the warning in kobject_add_internal. A program can
use sysfs to request a change of elevator at the same time another task
is unregistering the request_queue the elevator would be attached to.
The elevator's kobject will then attempt to be connected to the
request_queue in the object tree when the request_queue has just been
removed from sysfs. This triggers the warning in kobject_add_internal
as the request_queue no longer has a sysfs directory:
kobject_add_internal failed for iosched (error: -2 parent: queue)
------------[ cut here ]------------
WARNING: CPU: 3 PID: 14075 at lib/kobject.c:244 kobject_add_internal+0x103/0x2d0
To fix this warning, we can check the QUEUE_FLAG_REGISTERED flag when
changing the elevator and use the request_queue's sysfs_lock to
serialize between clearing the flag and the elevator testing the flag.
Signed-off-by: David Jeffery <djeffery@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 25 Aug 2017 21:24:14 +0000 (14:24 -0700)]
skd: Remove SKD_ID_INCR
The SKD_ID_INCR flag in skd_request_context.id duplicates information
that is already available otherwise, e.g. through the block layer
request state and through skd_request_context.state. Hence remove
the code that manipulates this flag and also the flag itself.
Since skd_isr_completion_posted() only uses the lower bits of
skd_request_context.id as hardware tag, this patch does not change
the behavior of the skd driver. I'm referring to the following code:
tag = req_id & SKD_ID_SLOT_AND_TABLE_MASK;
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 25 Aug 2017 21:24:13 +0000 (14:24 -0700)]
skd: Make it easier for static analyzers to analyze skd_free_disk()
Although it is easy to see that skdev->disk != NULL if skdev->queue
!= NULL, add a test for skdev->disk to avoid that smatch reports the
following warning:
drivers/block/skd_main.c:3080 skd_free_disk()
error: we previously assumed 'disk' could be null (see line 3074)
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 25 Aug 2017 21:24:12 +0000 (14:24 -0700)]
skd: Inline skd_end_request()
It is not worth to keep the debug statements in skd_end_request().
Without debug statements that function only consists of two
statements. Hence inline skd_end_request().
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 25 Aug 2017 21:24:11 +0000 (14:24 -0700)]
skd: Rename skd_softirq_done() into skd_complete_rq()
The latter name follows more closely the function names used in
other blk-mq drivers.
Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Omar Sandoval [Thu, 24 Aug 2017 18:09:25 +0000 (11:09 -0700)]
block: update comments to reflect REQ_FLUSH -> REQ_PREFLUSH rename
Normally I wouldn't bother with this, but in my opinion the comments are
the most important part of this whole file since without them no one
would have any clue how this insanity works.
Milan Broz [Wed, 9 Aug 2017 15:47:26 +0000 (17:47 +0200)]
bio-integrity: Fix regression if profile verify_fn is NULL
In dm-integrity target we register integrity profile that have
both generate_fn and verify_fn callbacks set to NULL.
This is used if dm-integrity is stacked under a dm-crypt device
for authenticated encryption (integrity payload contains authentication
tag and IV seed).
In this case the verification is done through own crypto API
processing inside dm-crypt; integrity profile is only holder
of these data. (And memory is owned by dm-crypt as well.)
After the commit (and previous changes)
Commit 7c20f11680a441df09de7235206f70115fbf6290
Author: Christoph Hellwig <hch@lst.de>
Date: Mon Jul 3 16:58:43 2017 -0600
block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).
For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.
Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
The block layer always remaps partitions before calling into the
->make_request methods of drivers. Thus the call to get_start_sect in
in_chunk_boundary will always return 0 and can be removed.
Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
btrfs: index check-integrity state hash by a dev_t
We won't have the struct block_device available in the bio soon, so switch
to the numerical dev_t instead of the block_device pointer for looking up
the check-integrity state.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 23 Aug 2017 17:56:33 +0000 (10:56 -0700)]
skd: Change default interrupt mode to MSI-X
Since MSI support on some motherboards is unreliable, change the
default interrupt mode from MSI to MSI-X. This patch avoids that
the following message appears sporadially in the kernel logs of
my test setup:
do_IRQ: 3.193 No irq handler for vector
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 23 Aug 2017 17:56:32 +0000 (10:56 -0700)]
skd: Avoid double completions in case of a timeout
Avoid that normal request completion and the timeout handler can
run concurrently by calling blk_mq_complete_request() instead of
blk_mq_end_request() from skd_end_request(). Avoid that the block
layer can reuse a request while the firmware is still processing
it. Convert skd_softirq_done() to blk-mq. Pass the pointer to
skd_softirq_done() to the block layer core through
blk_mq_ops.complete instead of by calling blk_queue_softirq_done().
Pass the pointer to skd_timed_out() to the block layer core
through blk_mq_ops.timeout instead of by calling
blk_queue_timed_out(). The timeout handler has been tested as
follows:
Bart Van Assche [Wed, 23 Aug 2017 17:56:29 +0000 (10:56 -0700)]
block: Warn if blk_queue_rq_timed_out() is called for a blk-mq queue
The timeout handler set by blk_queue_rq_timed_out() is only used
in single queue mode. Calling this function for blk-mq drivers is
wrong. Hence issue a warning if this function is called by a blk-mq
driver.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Mon, 14 Aug 2017 22:05:00 +0000 (15:05 -0700)]
nullb: badbblocks support
Sometime disk could have tracks broken and data there is inaccessable,
but data in other parts can be accessed in normal way. MD RAID supports
such disks. But we don't have a good way to test it, because we can't
control which part of a physical disk is bad. For a virtual disk, this
can be easily controlled.
This patch adds a new 'badblock' attribute. Configure it in this way:
echo "+1-100" > xxx/badblock, this will make sector [1-100] as bad
blocks.
echo "-20-30" > xxx/badblock, this will make sector [20-30] good
If badblocks are accessed, the nullb disk will return IO error. Other
parts of the disk can accessed in normal way.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Mon, 14 Aug 2017 22:04:59 +0000 (15:04 -0700)]
nullb: emulate cache
Software must flush disk cache to guarantee data safety. To check if
software correctly does disk cache flush, we must know the behavior of
disk. But physical disk behavior is uncontrollable. Even software
doesn't do the flush, the disk probably does the flush. This patch tries
to emulate a cache in the test disk.
All write will go to a cache first, when the cache is full, we then
flush some data to disk storage. A flush request will flush all data of
the cache to disk storage. A FUA write will write to memory store
directly and revalidate data in cache. If there is a power failure (by
writing to power attribute, 'echo 0 > disk_name/power'), we discard all
data in the cache, but preserve the data in disk storage. Later we can
power on the disk again as usual (write 1 to 'power' attribute), then we
can check data integrity and very if software does everything correctly.
A new attribute 'cache_size' (in MB) is added to configure cache size.
Shaohua Li [Mon, 14 Aug 2017 22:04:58 +0000 (15:04 -0700)]
nullb: bandwidth control
In test, we usually expect controllable disk speed. For example, in a
raid array, we'd like some disks are fast and some are slow. MD RAID
actually has a feature for this. To test the feature, we'd like to make
the disk run in specific speed.
block throttling probably can be used for this purpose, but it requires
cgroup setup. Here we just implement a simple throttling mechanism in
the driver. There is slight fluctuation in the mechanism, but it's good
enough for test.
To configure the bandwidth cap, user sets the 'mbps' attribute. mbps is
MB/s.
Shaohua Li [Mon, 14 Aug 2017 22:04:54 +0000 (15:04 -0700)]
nullb: add interface to power on disk
The device created in nullb configfs interface isn't power on by
default. After user configures the device, user can do 'echo 1 >
xxx/nullb/device_name/power' to power on the device, which will create a
disk. the xxx/nullb/device_name/index is the disk index, so if the index
is 2, the new created disk should be named as /dev/nullb2. Note, the
'index' is only valid after disk is power on.
'echo 0 > xxx/nullb/device_name/power' will remove the disk. Note, this
doesn't remove the device. To remove the device, user should do 'rmdir
xxx/nullb/device_name'. Removing the device will remove the disk too.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Shaohua Li [Mon, 14 Aug 2017 22:04:53 +0000 (15:04 -0700)]
nullb: add configfs interface
Add configfs interface for nullb. configfs interface is more flexible
and easy to configure in a per-disk basis.
Configuration is something like this:
mount -t configfs none /mnt
Checking which features the driver supports:
cat /mnt/nullb/features
The 'features' attribute is for future extension. We probably will add
new features into the driver, userspace can check this attribute to find
the supported features.
Create/remove a device:
mkdir/rmdir /mnt/nullb/a
Then configure the device by setting attributes under /mnt/nullb/a, most
of nullb supported module parameters are converted to attributes:
size; /* device size in MB */
completion_nsec; /* time in ns to complete a request */
submit_queues; /* number of submission queues */
home_node; /* home node for the device */
queue_mode; /* block interface */
blocksize; /* block size */
irqmode; /* IRQ completion handler */
hw_queue_depth; /* queue depth */
use_lightnvm; /* register as a LightNVM device */
blocking; /* blocking blk-mq device */
use_per_node_hctx; /* use per-node allocation for hardware context */
Note, creating a device doesn't create a disk immediately. Creating a
disk is done in two phases: create a device and then power on the
device. Next patch will introduce device power on.
Shaohua Li [Mon, 14 Aug 2017 22:04:52 +0000 (15:04 -0700)]
nullb: factor disk parameters
When we switch to configfs interface, each disk could have different
configuration. To prepare for the change, we move most disk setting to a
separate data structure. The existing module parameter interface is
kept. The 'nr_devices' and 'shared_tags' don't make sense for per-disk
setting, so they are remained as global settings.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Wed, 23 Aug 2017 11:20:57 +0000 (14:20 +0300)]
skd: error pointer dereference in skd_cons_disk()
My initial impulse was to check for IS_ERR_OR_NULL() but when I looked
at this code a bit more closely, we should only need to check for
IS_ERR().
The blk_mq_alloc_tag_set() returns negative error codes and zero on
success so we can just do an "if (rc) goto err_out;". It's better to
preserve the error code anyhow. The blk_mq_init_queue() returns error
pointers on failure, it never returns NULL. We can also remove the
"q = NULL;" at the start because that's no longer needed.
Fixes: ca33dd92968b ("skd: Convert to blk-mq") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:38 +0000 (13:13 -0700)]
skd: Bump driver version
Bump the driver version. Remove the build ID because build IDs do
not make sense for an upstream kernel driver. Keep the driver
version in the module information but do not report it during every
load, unload or probe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:35 +0000 (13:13 -0700)]
skd: Reduce memory usage
Every single coherent DMA memory buffer occupies at least one page.
Reduce memory usage by switching from coherent buffers to streaming
DMA for I/O requests (struct skd_fitmsg_context) and S/G-lists
(struct fit_sg_descriptor[]).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:34 +0000 (13:13 -0700)]
skd: Remove skd_device.in_flight
Since skd_device.in_flight is only used to display the number of
in-flight requests in debug messages, remove that member and
introduce skd_in_flight(). That last function relies on the block
layer to determine the number of in flight requests.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:32 +0000 (13:13 -0700)]
skd: Convert to blk-mq
Introduce a tag set and a blk_mq_ops structure. Set .cmd_size such
that struct request and struct skd_request_context are allocated
through a single allocation. Remove the skd_request_context.req
pointer. Make queue starting asynchronous such that this can occur
safely from interrupt context. Use locking to protect skdev->skmsg
and *skdev->skmsg against concurrent access from concurrent
.queue_rq() calls. Introduce the functions skd_init_request() and
skd_exit_request() to set up / clean up the per-request S/G-list.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:28 +0000 (13:13 -0700)]
skd: Introduce skd_process_request()
The only functional change in this patch is that the skd_fitmsg_context
in which requests are accumulated is changed from a local variable into
a member of struct skd_device. This patch will make the blk-mq conversion
easier.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:27 +0000 (13:13 -0700)]
skd: Convert several per-device scalar variables into atomics
Convert the per-device scalar variables that are protected by the
queue lock into atomics such that it becomes safe to access these
variables without holding the queue lock.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:26 +0000 (13:13 -0700)]
skd: Enable request tags for the block layer queue
Use the request tag when allocating a skd_fitmsg_context or
skd_request_context such that the lists used to track free elements
can be eliminated. Swap the skd_end_request() and skd_release_req()
calls to avoid triggering a use-after-free. Remove
skd_fitmsg_context.state and .outstanding because FIT messages are
shared among requests and because updating a FIT message after a
request has finished whould trigger a use-after-free.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:25 +0000 (13:13 -0700)]
skd: Initialize skd_special_context.req.n_sg to one
The debug code in skd_send_special_fitmsg() assumes that req.n_sg
represents the number of S/G descriptors. However, skd_construct()
initializes that member variable to zero. Set req.n_sg to one such
that the debugging code in skd_send_special_fitmsg() works as
expected.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:22 +0000 (13:13 -0700)]
skd: Convert explicit skd_request_fn() calls
This will make it easier to convert this driver to the blk-mq
approach. This patch also reduces interrupt latency by moving
skd_request_fn() calls out of the skd_isr() interrupt.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:17 +0000 (13:13 -0700)]
skd: Remove superfluous occurrences of the 'volatile' keyword
mem_map[i] is accessed through readl() / writel() hence declaring
mem_map as volatile is not necessary.
Remove the volatile declarations from struct fit_completion_entry_v1
pointers and struct fit_comp_error_info since reading these structures
multiple times is safe.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:13 +0000 (13:13 -0700)]
skd: Remove superfluous initializations from skd_isr_completion_posted()
The value of skcmp, cmp_cntxt etc. is overwritten during every
loop iteration and is not used after the loop has finished. Hence
initializing these variables outside the loop is not necessary.
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Thu, 17 Aug 2017 20:13:05 +0000 (13:13 -0700)]
skd: Simplify the code for deciding whether or not to send a FIT msg
Due to the previous patch it is guaranteed that the FIT msg contains
at least one request after the for-loop has finished. Use this to
simplify the code.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>