Jason Gunthorpe [Wed, 4 Jul 2018 05:50:23 +0000 (08:50 +0300)]
RDMA/uverbs: Store the specs_root in the struct ib_uverbs_device
The specs are required to operate the uverbs file, so they belong inside
the ib_uverbs_device, not inside the ib_device. The spec passed in the
ib_device is just a communication from the driver and should not be used
during runtime.
This also changes the lifetime of the spec memory to match the
ib_uverbs_device, however at this time the spec_root can still contain
driver pointers after disassociation, so it cannot be used if ib_dev is
NULL. This is preparation for another series.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Wed, 4 Jul 2018 19:19:46 +0000 (13:19 -0600)]
Merge branch 'mlx5-dump-fill-mkey' into rdma.git for-next
For dependencies, branch based on 'mellanox/mlx5-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git
Pull Dump and fill MKEY from Leon Romanovsky:
====================
MLX5 IB HCA offers the memory key, dump_fill_mkey to increase performance,
when used in a send or receive operations.
It is used to force local HCA operations to skip the PCI bus access, while
keeping track of the processed length in the ibv_sge handling.
In this three patch series, we expose various bits in our HW spec
file (mlx5_ifc.h), move unneeded for mlx5_core FW command and export such
memory key to user space thought our mlx5-abi header file.
====================
Botched auto-merge in mlx5_ib_alloc_ucontext() resolved by hand.
* branch 'mlx5-dump-fill-mkey':
IB/mlx5: Expose dump and fill memory key
net/mlx5: Add hardware definitions for dump_fill_mkey
net/mlx5: Limit scope of dump_fill_mkey function
net/mlx5: Rate limit errors in command interface
Yonatan Cohen [Tue, 19 Jun 2018 05:47:24 +0000 (08:47 +0300)]
IB/mlx5: Expose dump and fill memory key
MLX5 IB HCA offers the memory key, dump_fill_mkey to boost
performance, when used in a send or receive operations.
It is used to force local HCA operations to skip the PCI bus access,
while keeping track of the processed length in the ibv_sge handling.
Meaning, instead of a PCI write access the HCA leaves the target
memory untouched, and skips filling that packet section. Similar
behavior is done upon send, the HCA skips data in memory relevant
to this key and saves PCI bus access.
This functionality saves PCI read/write operations.
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Wed, 4 Jul 2018 09:58:02 +0000 (12:58 +0300)]
RDMA/bnxt_re: Fix a bunch of off by one bugs in qplib_fp.c
The srq->swq[] is allocated in bnxt_qplib_create_srq(). It has
srq->hwq.max_elements elements so these tests should be > instead of >=
or we might go beyond the end of the array.
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Wed, 4 Jul 2018 09:57:11 +0000 (12:57 +0300)]
RDMA/bnxt_re: Fix a couple off by one bugs
The sgid_tbl->tbl[] array is allocated in bnxt_qplib_alloc_sgid_tbl().
It has sgid_tbl->max elements. So the > should be >= to prevent
accessing one element beyond the end of the array.
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Wed, 4 Jul 2018 09:32:12 +0000 (12:32 +0300)]
IB/core: type promotion bug in rdma_rw_init_one_mr()
"nents" is an unsigned int, so if ib_map_mr_sg() returns a negative
error code then it's type promoted to a high unsigned int which is
treated as success.
Fixes: a060b5629ab0 ("IB/core: generic RDMA READ/WRITE API") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Or Gerlitz [Tue, 3 Jul 2018 15:02:37 +0000 (18:02 +0300)]
MAINTAINERS: Moving out...
The 2.6.18... was a hell of a ride, and by now both me and
Roi are not dealing with iser any more. Max will replace us
as maintainer, good luck to you dear!
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
rdma_ah_find_type() can reach into ib_device->port_immutable with a
potentially out-of-bounds port number, so check that the port number is
valid first.
Fixes: 44c58487d51a ("IB/core: Define 'ib' and 'roce' rdma_ah_attr types") Signed-off-by: Tarick Bedeir <tarick@google.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This occurs because vmw_pvrdma on probe stores a pointer to the netdev
that exists on function 0 of the same bus/device/slot (which represents
the vmxnet3 ethernet driver). However, it never removes this pointer if
the vmxnet3 module is removed, leading to crashes resulting from use after
free dereferencing incidents like the one above.
The fix is pretty straightforward. vmw_pvrdma should listen for
NETDEV_REGISTER and NETDEV_UNREGISTER events in its event listener code
block, and update the stored netdev pointer accordingly. This solution
has been tested by myself and the reporter with successful results. This
fix also allows the pvrdma driver to find its underlying ethernet device
in the event that vmxnet3 is loaded after pvrdma, which it was not able to
do before.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: ruquin@redhat.com Tested-by: Adit Ranadive <aditr@vmware.com> Acked-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Maor Gottlieb [Sun, 1 Jul 2018 12:50:17 +0000 (15:50 +0300)]
IB/mlx5: Fix GRE flow specification
Currently the driver sets the mask of the gre_protocol to 0xffff
without consideration in the user request.
Fix it by copy the mask from the verbs spec.
Fixes: da2f22ae7707 ("IB/mlx5: Add support for GRE flow specification") Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 2 Jul 2018 15:08:37 +0000 (08:08 -0700)]
IB/hfi1: Remove incorrect call to do_interrupt callback
The general interrupt handler is_rcv_avail_int() has two paths,
do_interrupt() (callback) and handle_user_interrupt(). The
do_interrupt() callback is for the threaded receive handling.
is_rcv_avail_int() cannot handle threaded IRQs.
If the do_interrupt() path is taken, and the IRQ returns
IRQ_WAKE_THREAD, the IRQ behavior will be indeterminate.
Remove incorrect call to do_interrupt() from is_rcv_avail_int(),
leaving the un-threaded (handle_user_interrupt()) path.
Fixes: f4f30031c33c ("staging/rdma/hfi1: Thread the receive interrupt.") Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Kamenee Arumugam <kamenee.arumugam@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 2 Jul 2018 15:08:27 +0000 (08:08 -0700)]
IB/hfi1: Set in_use_ctxts bits for user ctxts only
The in_use_ctxts bitmask is for user receive contexts only. Setting it for
any other type of receive context is incorrect.
Move initial set of in_use_ctxts bits from the general context init to the
user context specific init. Having this bit set can allow contexts to be
incorrectly identified by some IRQ handlers. This will allow
handle_user_interrupt() will now filter user contexts correctly.
Clean up redundant is_rcv_urgent_int() user context check.
A follow on patch will clean up an incorrect code path in the
is_rcv_avail_int().
Fixes: 8737ce95c463 ("IB/hfi1: Fix an assign/ordering issue with shared context IDs") Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Kamenee Arumugam <kamenee.arumugam@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Avoid that the compiler complains about set-but-not-used variables when
building with W=1. This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Leon Romanovsky <leonro@mellanox.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB/srp: Remove driver version and release data information
Remove the driver version and release date information because such
information is not relevant for an upstream driver. See also commit e1267b01240a ("RDMA: Remove useless MODULE_VERSION").
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hence use be16_to_cpu() to convert it to CPU endianness. Compile-tested
only.
Fixes: af808ece5ce9 ("IB/SA: Check dlid before SA agent queries for ClassPortInfo") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Venkata Sandeep Dhanalakota <venkata.s.dhanalakota@intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Wed, 20 Jun 2018 14:11:39 +0000 (17:11 +0300)]
IB: Improve uverbs_cleanup_ucontext algorithm
Improve uverbs_cleanup_ucontext algorithm to work properly when the
topology graph of the objects cannot be determined at compile time. This
is the case with objects created via the devx interface in mlx5.
Typically uverbs objects must be created in a strict topologically sorted
order, so that LIFO ordering will generally cause them to be freed
properly. There are only a few cases (eg memory windows) where objects can
point to things out of the strict LIFO order.
Instead of using an explicit ordering scheme where the HW destroy is not
allowed to fail, go over the list multiple times and allow the destroy
function to fail. If progress halts then a final, desperate, cleanup is
done before leaking the memory. This indicates a driver bug.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Tue, 26 Jun 2018 22:24:48 +0000 (15:24 -0700)]
IB/srpt: Support HCAs with more than two ports
Since there are adapters that have four ports, increase the size of
the srpt_device.port[] array. This patch avoids that the following
warning is hit with quad port Chelsio adapters:
Reported-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Sagi Grimberg [Wed, 27 Jun 2018 08:03:45 +0000 (11:03 +0300)]
IB/iser: set can_queue earlier to allow setting higher queue depth
We need to set can_queue earlier than when enabling the scsi host.
in a blk-mq enabled environment, the tagset allocation is taken
from can_queue which cannot be modified later. Also, pass an updated
.can_queue to iscsi_session_setup to have enough iscsi tasks allocated
in the session kfifo.
Reported-by: Karandeep Chahal <karandeepchahal@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Vijay Immanuel [Tue, 19 Jun 2018 01:48:56 +0000 (18:48 -0700)]
IB/rxe: don't clear the tx queue on every transfer
Do not call sk_dst_set() on every packet transfer because
that calls sk_tx_queue_clear(), which clears the tx queue.
A QP must stay on the same tx queue to maintain packet order.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Acked-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 27 Jun 2018 07:44:26 +0000 (10:44 +0300)]
IB/cm: Remove now useless rcu_lock in dst_fetch_ha
This lock used to be protecting a call to dst_get_neighbour_noref,
however the below commit changed it to dst_neigh_lookup which no longer
requires rcu.
Access to nud_state, neigh_event_send or rdma_copy_addr does not require
RCU, so delete the lock.
Fixes: 02b619555ad6 ("infiniband: Convert dst_fetch_ha() over to dst_neigh_lookup().") Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Those prints are extremely valuable to diagnose issues with running system
and it is important to keep them. However, not-so-careful user can trigger
endless number of such prints by depleting HW resources and will spam
dmesg.
Rate limiting of such messages solves this issue.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Sun, 24 Jun 2018 13:57:50 +0000 (16:57 +0300)]
IB/mlx4: Create slave AH's directly
Since slave GID's do not exist in the core gid table we can no longer use
the core code to help do this without creating inconsistencies. Directly
create the AH using mlx4 internal APIs.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Leon Romanovsky [Sun, 24 Jun 2018 08:23:47 +0000 (11:23 +0300)]
RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
Number of specs is provided by user and in valid case can be equal to zero.
Such argument causes to call to kcalloc() with zero-length request and in
return the ZERO_SIZE_PTR is assigned. This pointer is different from NULL
and makes various if (..) checks to success.
Fixes: b6ba4a9aa59f ("IB/uverbs: Add support for flow counters") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Mon, 25 Jun 2018 21:21:15 +0000 (15:21 -0600)]
RDMA/uverbs: Check existence of create_flow callback
In the accepted series "Refactor ib_uverbs_write path", we presented the
roadmap to get rid of uverbs_cmd_mask and uverbs_ex_cmd_mask fields in
favor of simple check of function pointer. So let's put NULL check of
create_flow function callback despite the fact that uverbs_ex_cmd_mask
still exists.
Link: https://www.spinics.net/lists/linux-rdma/msg60753.html Suggested-by: Michael J Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 22 Jun 2018 15:11:17 +0000 (08:11 -0700)]
MAINTAINERS: Update SRP entries
Reflect the acquisition of SanDisk by Western Digital in my e-mail
address. Remove the reference to David Dillow's git tree since SRP patches
are queued by Doug and Jason. Remove the reference to the OpenFabrics
website since the srp_daemon source code has been moved from that website
into the rdma-core project. Add an entry for the SRP target driver.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Bart Van Assche <bart.vanassche@sandisk.com> Cc: David Dillow <dillow@google.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 13 Jun 2018 17:19:42 +0000 (11:19 -0600)]
IB/usnic: Update with bug fixes from core code
usnic has a modified version of the core codes' ib_umem_get() and
related, and the copy misses many of the bug fixes done over the years:
Commit bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages")
Commit 87773dd56d54 ("IB: ib_umem_release() should decrement mm->pinned_vm
from ib_umem_get")
Commit 8494057ab5e4 ("IB/uverbs: Prevent integer overflow in ib_umem_get
address arithmetic")
Commit 8abaae62f3fd ("IB/core: disallow registering 0-sized memory region")
Commit 66578b0b2f69 ("IB/core: don't disallow registering region starting
at 0x0")
Commit 53376fedb9da ("RDMA/core: not to set page dirty bit if it's already
set.")
Commit 8e907ed48827 ("IB/umem: Use the correct mm during ib_umem_release")
Yishai Hadas [Tue, 19 Jun 2018 07:43:56 +0000 (10:43 +0300)]
IB/mlx4: Add support for drain SQ & RQ
This patch follows the logic from ib_core but considers the internal
device state upon executing the involved commands.
Specifically, Upon internal error state modify QP to an error state can
be assumed to be success as each in-progress WR going to be flushed in
error in any case as expected by that modify command.
In addition,
As the drain should never fail the driver makes sure that post_send/recv
will succeed even if the device is already in an internal error state.
As such once the driver will supply the simulated/SW CQEs the CQE for
the drain WR will be handled as well.
In case of an internal error state the CQE for the drain WR may be
completed as part of the main task that handled the error state or by
the task that issued the drain WR.
As the above depends on scheduling the code takes the relevant locks
and actions to make sure that the completion handler for that WR will
always be called after that the post_send/recv were issued but not in
parallel to the other task that handles the error flow.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Tue, 19 Jun 2018 07:43:55 +0000 (10:43 +0300)]
IB/mlx5: Add support for drain SQ & RQ
This patch follows the logic from ib_core but considers the internal
device state upon executing the involved commands.
Specifically,
Upon internal error state modify QP to an error state can be assumed to
be success as each in-progress WR going to be flushed in error in any
case as expected by that modify command.
In addition,
As the drain should never fail the driver makes sure that post_send/recv
will succeed even if the device is already in an internal error state.
As such once the driver will supply the simulated/SW CQEs the CQE for
the drain WR will be handled as well.
In case of an internal error state the CQE for the drain WR may be
completed as part of the main task that handled the error state or by
the task that issued the drain WR.
As the above depends on scheduling the code takes the relevant locks and
actions to make sure that the completion handler for that WR will always
be called after that the post_send/recv were issued but not in parallel
to the other task that handles the error flow.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 19 Jun 2018 07:59:19 +0000 (10:59 +0300)]
IB/cm: Replace members of sa_path_rec with 'struct sgid_attr *'
While processing a path record entry in CM messages the associated GID
attribute is now also supplied.
Currently for RoCE a netdevice's net namespace pointer and ifindex are
stored in path record entry. Both of these fields of the netdev can change
anytime while processing CM messages. Additionally storing net namespace
without holding reference will lead to use-after-free crash. Therefore it
is removed. Netdevice information for RoCE is instead provided via
referenced gid attribute in ib_cm requests.
Such a design leads to a situation where the kernel can crash when the net
pointer becomes invalid. However today it is always initialized to
init_net, which cannot become invalid. In order to support processing
packets in any arbitrary namespace of the received packet, it is necessary
to avoid such conditions.
This patch removes the dependency on the net pointer and ifindex; instead
it will rely on SGID attribute which contains a pointer to netdev.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Parav Pandit [Tue, 19 Jun 2018 07:59:18 +0000 (10:59 +0300)]
IB/cm: Pass the sgid_attr through various events
Make the sgid_attr available along with path information to the event
consumer, this allows the consumer to keep using the same GID table entry
as the event is related to.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Parav Pandit [Tue, 19 Jun 2018 07:59:15 +0000 (10:59 +0300)]
IB: Make ib_init_ah_from_mcmember set sgid_attr
This is really just a CM support function, normally a multicast address
does not have a specific SGID - but the RDMA CM usage model does restrict
things to the netdevice the CM id is bound to, at least for roce case.
Store the selected table entry in the sgid_attr for everything else to
use.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Parav Pandit [Tue, 19 Jun 2018 07:59:14 +0000 (10:59 +0300)]
IB: Make ib_init_ah_attr_from_wc set sgid_attr
The work completion is inspected to determine what dgid table entry was
used to receieve the packet, produces a sgid_attr that matches and sticks
it in the ah_attr.
All callers of this function are now required to release the ah_attr on
success.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Michael J. Ruhl [Wed, 20 Jun 2018 16:43:23 +0000 (09:43 -0700)]
IB/hfi1: Remove INTx support and simplify MSIx usage
The INTx IRQ support does not work for all HF1 IRQ handlers
(specifically the receive data IRQs).
Remove all supporting code for the INTx IRQ.
If the requested MSIx vector request is unsuccessful, do not allow the
driver to continue.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Kamenee Arumugam <kamenee.arumugam@intel.com> Reviewed-by: Sadanand Warrier <sadanand.warrier@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:43:14 +0000 (09:43 -0700)]
IB/hfi1: Reorg ctxtdata and rightsize fields
Many fields in ctxtdata are incorrectly sized and the organization of the
fields within the structure is a jumble.
Fix by:
- Correcting oversize fields.
- Putting fields common to all contexts at the top with hot fields
at the top.
- Moving PSM fields to the bottom of the structure.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:43:06 +0000 (09:43 -0700)]
IB/hfi1: Remove caches of chip CSRs
Remove the sizeable cache of the chip sizing CSRs and replace with CSR
reads as needed.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:42:57 +0000 (09:42 -0700)]
IB/hfi1: Remove unused/writeonly devdata fields
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:42:49 +0000 (09:42 -0700)]
IB/hfi1: Rightsize ctxt_eager_bufs fields
Fields in this structure are sized excessively based on hardware
limitations and input values.
Fix by reducing fields as appropriate and repositioning to close holes in
the structure.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:42:40 +0000 (09:42 -0700)]
IB/hfi1: Remove rcvctrl from ctxtdata
It is only ever written.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mike Marciniszyn [Wed, 20 Jun 2018 16:42:31 +0000 (09:42 -0700)]
IB/hfi1: Remove rcvhdrq_size
The usage of this ctxt data field is not hot path and the value can be
computed on demand to cut down the ctxtdata bloat.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Thu, 21 Jun 2018 12:31:25 +0000 (15:31 +0300)]
IB/core: Free GID table entry during GID deletion
If we already hold the table->lock when doing the kref_put it means we are
in a context where it is safe to do the deletion synchronously, with no
need for the work queue.
This helps to eliminate issues when GID change is requested as part of MAC
address change or bonding event change where expectation is to replace the
GID almost immediately.
Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts") Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Thu, 21 Jun 2018 12:31:24 +0000 (15:31 +0300)]
RDMA/cma: Consider net namespace while leaving multicast group
When sending multicast leave request, consider the net ns in which this
cm_id is created.
Code was duplicated in cma_leave_mc_groups() and rdma_leave_multicast(),
which is now done using a helper function cma_leave_roce_mc_group().
Fixes: bee3c3c91865 ("IB/cma: Join and leave multicast groups with IGMP") Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Move some s_flags defines out of rdmavt and into hfi1 because they are
hfi1 specific and therefore should remain in the driver instead of
bubbling up to rdmavt.
Document device specific ranges in rdmavt and remap
those in hfi1.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The field is based on a constant that can never change.
Use the define to assign the register instead.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
This field should be in ctxtdata to allow for better locality of access by
eliminating a dd dereference.
The new field is now side-by-side with rcvhdrqentsize since the rhf_offset
is a function of the rcvhdrqentsize.
Both fields are now correctly sized as u8.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB/hfi1: Move normal functions from hfi1_devdata to const array
The current implementation precludes having receive context specific
packet type receive handlers.
Fix this by adding adding c99 const array for the existing handlers and
remove the current 72 bytes of pointers from devdata.
A new pointer in hfi1_ctxtdata will point to the const array.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Sun, 17 Jun 2018 10:00:05 +0000 (13:00 +0300)]
IB/mlx5: Add DEVX query EQN support
Return the matching device EQN for a given user vector number via the
DEVX interface.
Note:
EQs are owned by the kernel and shared by all user processes.
Basically, a user CQ can point to any EQ.
The kernel doesn't enforce any such limitation today either.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Sun, 17 Jun 2018 10:00:04 +0000 (13:00 +0300)]
IB/mlx5: Add DEVX support for memory registration
Add support to register a memory with the firmware via the DEVX
interface.
The driver translates a given user address to ib_umem then it will
register the physical addresses with the firmware and get a unique id
for this registration to be used for this virtual address.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Sun, 17 Jun 2018 10:00:03 +0000 (13:00 +0300)]
IB/mlx5: Add support for DEVX query UAR
Return a device UAR index for a given user index via the DEVX interface.
Security note:
The hardware protection mechanism works like this: Each device object that
is subject to UAR doorbells (QP/SQ/CQ) gets a UAR ID (called uar_page in
the device specification manual) upon its creation. Then upon doorbell,
hardware fetches the object context for which the doorbell was rang, and
validates that the UAR through which the DB was rang matches the UAR ID
of the object.
If no match the doorbell is silently ignored by the hardware. Of
course, the user cannot ring a doorbell on a UAR that was not mapped to
it.
Now in devx, as the devx kernel does not manipulate the QP/SQ/CQ command
mailboxes (except tagging them with UID), we expose to the user its UAR
ID, so it can embed it in these objects in the expected specification
format. So the only thing the user can do is hurt itself by creating a
QP/SQ/CQ with a UAR ID other than his, and then in this case other users
may ring a doorbell on its objects.
The consequence of that will be that another user can schedule a QP/SQ
of the buggy user for execution (just insert it to the hardware schedule
queue or arm its CQ for event generation), no further harm is expected.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Sun, 17 Jun 2018 09:59:57 +0000 (12:59 +0300)]
IB/mlx5: Introduce DEVX
Introduce DEVX to enable direct device commands in downstream patches
from this series.
In that mode of work the firmware manages the isolation between
processes' resources and as such a DEVX user id is created and assigned
to the given user context upon allocation request.
A capability check is done to make sure that this feature is really
supported by the firmware prior to creating the DEVX user id.
Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 17 Jun 2018 09:59:54 +0000 (12:59 +0300)]
IB/uverbs: Allow an empty namespace in ioctl() framework
The ioctl parser framework wrongly assumed that each namespace is
populated. This could lead to NULL dereferences. Fix the parser to
always check that a given namespace indeed exists.
Fixes: fac9658cabb9 ("IB/core: Add new ioctl interface") Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 17 Jun 2018 09:59:53 +0000 (12:59 +0300)]
IB/uverbs: Add a macro to define a type with no kernel known size
Sometimes the uverbs uAPI doesn't really care about the structure it gets
from user-space. All it wants to do is to allocate enough space and send
it to the hardware/provider driver. Adding a UVERBS_ATTR_MIN_SIZE that
could be used for this scenarios. We use USHRT_MAX as the kernel known
size to bypass any zero validations.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 17 Jun 2018 09:59:52 +0000 (12:59 +0300)]
IB/uverbs: Add PTR_IN attributes that are allocated/copied automatically
Adding UVERBS_ATTR_SPEC_F_ALLOC_AND_COPY flag to PTR_IN attributes.
By using this flag, the parse automatically allocates and copies the
user-space data. This data is accessible by using uverbs_attr_get_len
and uverbs_attr_get_alloced_ptr inline accessor functions from the
handler.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 17 Jun 2018 09:59:51 +0000 (12:59 +0300)]
IB/uverbs: Refactor uverbs_finalize_objects
uverbs_finalize_objects is currently used only to commit or abort
objects. Since we want to add automatic allocation/free of PTR_IN
attributes, moving it to uverbs_ioctl.c and renamit it to
uverbs_finalize_attrs.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Steve Wise [Mon, 18 Jun 2018 15:05:26 +0000 (08:05 -0700)]
IB/core: add max_send_sge and max_recv_sge attributes
This patch replaces the ib_device_attr.max_sge with max_send_sge and
max_recv_sge. It allows ulps to take advantage of devices that have very
different send and recv sge depths. For example cxgb4 has a max_recv_sge
of 4, yet a max_send_sge of 16. Splitting out these attributes allows
much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW
API. Consider a large RDMA WRITE that has 16 scattergather entries.
With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of
16, it can be done with 1 WRITE WR.
Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Vijay Immanuel [Wed, 13 Jun 2018 01:12:05 +0000 (18:12 -0700)]
IB/rxe: support for 802.1q VLAN on the listener
Set the vlan flag and vlan_id field in the wc for rdma_listen()
to work over VLAN. This is required by ib_init_ah_attr_from_wc()
which is called by the CM REQ handler.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Reviewed-by: Yonatan Cohen <yonatanc@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Vijay Immanuel [Thu, 14 Jun 2018 01:48:37 +0000 (18:48 -0700)]
IB/rxe: increase max MR limit
Increase the max MR limit to support more I/O queues
for NVMe over Fabrics hosts.
Signed-off-by: Vijay Immanuel <vijayi@attalasystems.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Allocate agent IDs from a global IDR instead of an atomic variable.
This eliminates the possibility of reusing an ID which is already in
use after 4 billion registrations. We limit the assigned ID to be less
than 2^24 as the mlx4 driver uses the most significant byte of the agent
ID to store the slave number. Users unlucky enough to see a collision
between agent numbers and slave numbers see messages like:
mlx4_ib: egress mad has non-null tid msb:1 class:4 slave:0
and the MAD layer stops working.
We look up the agent under protection of the RCU lock, which means we
have to free the agent using kfree_rcu, and only increment the reference
counter if it is not 0.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reported-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Tested-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Allow users of the IDR to use the XArray lock for their own
synchronisation purposes. The IDR continues to rely on the caller to
handle locking, but this lets the caller use the lock embedded in the
IDR data structure instead of allocating their own lock.
Signed-off-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Wed, 13 Jun 2018 07:22:09 +0000 (10:22 +0300)]
RDMA: Convert drivers to use the AH's sgid_attr in post_wr paths
For UD the drivers were doing a sgid_index lookup into the cache to get
the attrs, however we can now directly access the same attrs stores in
the ib_ah instead and remove the lookup.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Parav Pandit [Wed, 13 Jun 2018 07:22:07 +0000 (10:22 +0300)]
IB/mlx4: Use GID attribute from ah attribute
While converting GID index from attribute to that of the HCA, GID
attribute is available from the ah_attr. Make use of GID attribute
to simplify the code and also avoid avoid GID query.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Parav Pandit [Wed, 13 Jun 2018 07:22:06 +0000 (10:22 +0300)]
RDMA: Convert drivers to use sgid_attr instead of sgid_index
The core code now ensures that all driver callbacks that receive an
rdma_ah_attrs will have a sgid_attr's pointer if there is a GRH present.
Drivers can use this pointer instead of calling a query function with
sgid_index. This simplifies the drivers and also avoids races where a
gid_index lookup may return different data if it is changed.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Wed, 13 Jun 2018 07:22:05 +0000 (10:22 +0300)]
IB{cm, core}: Introduce and use ah_attr copy, move, replace APIs
Introduce AH attribute copy, move and replace APIs to be used by core and
provider drivers.
In CM code flow when ah attribute might be re-initialized twice while
processing incoming request, or initialized once while from path record
while sending out CM requests. Therefore use rdma_move_ah_attr API to
handle such scenarios instead of memcpy().
Provider drivers keeps a copy ah_attr during the lifetime of the ah.
Therefore, use rdma_replace_ah_attr() which conditionally release
reference to old ah_attr and holds reference to new attribute whose
referrence is released when the AH is freed.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Jason Gunthorpe [Wed, 13 Jun 2018 07:22:03 +0000 (10:22 +0300)]
IB/core: Add a sgid_attr pointer to struct rdma_ah_attr
The sgid_attr will ultimately replace the sgid_index in the ah_attr.
This will allow for all layers to have a consistent view of what
gid table entry was selected as processing runs through all stages of the
stack.
This commit introduces the pointer and ensures it is set before calling
any driver callback that includes a struct ah_attr callback, allowing
future patches to adjust both the drivers and the callers to use
sgid_attr instead of sgid_index.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Matthew Wilcox [Fri, 8 Jun 2018 17:42:17 +0000 (10:42 -0700)]
IB/mad: Agent registration is process context only
Document this (it's implicitly true due to sleeping operations already
in use in both registration and deregistration). Use this fact to use
spin_lock_irq instead of spin_lock_irqsave. This improves performance
slightly.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>