Jason Gunthorpe [Sun, 16 Sep 2018 17:48:11 +0000 (20:48 +0300)]
RDMA/umem: Avoid synchronize_srcu in the ODP MR destruction path
synchronize_rcu is slow enough that it should be avoided on the syscall
path when user space is destroying MRs. After all the rework we can now
trivially do this by having call_srcu kfree the per_mm.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:48:10 +0000 (20:48 +0300)]
RDMA/umem: Handle a half-complete start/end sequence
mmu_notifier_unregister() can race between a invalidate_start/end and
cause the invalidate_end to be skipped. This causes an imbalance in the
locking, which lockdep complains about.
This is not actually a bug, as we immediately kfree the memory holding the
lock, but it simple enough to fix.
Mark when the notifier is being destroyed and abort the start callback.
This can be done under the lock we already obtained, and can re-purpose
the invalidate_range test we already have.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:48:08 +0000 (20:48 +0300)]
RDMA/umem: Use umem->owning_mm inside ODP
Since ODP had a single struct mmu_notifier located in the ucontext it
could only handle a single MM at a time, and this prevented it from using
the new owning_mm system.
With the prior rework it is now simple to let ODP track multiple MMs per
ucontext, finish the job so that the per_mm is allocated on a mm by mm
basis, and freed when the last umem is dropped from the ucontext.
As a side effect the new saner locking removes the lockdep splat about
nesting the umem_rwsem between mmu_notifier_unregister and
ib_umem_odp_release.
It also makes ODP work with multiple processes, across, fork, etc.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:48:07 +0000 (20:48 +0300)]
RDMA/umem: Move all the ODP related stuff out of ucontext and into per_mm
This is the first step to make ODP use the owning_mm that is now part of
struct ib_umem.
Each ODP umem is linked to a single per_mm structure, which in turn, is
linked to a single mm, via the embedded mmu_notifier. This first patch
introduces the structure and reworks eveything to use it.
This also needs to introduce tgid into the ib_ucontext_per_mm, as
get_user_pages_remote() requires the originating task for statistics
tracking.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:48:06 +0000 (20:48 +0300)]
RDMA/umem: Get rid of struct ib_umem.odp_data
This no longer has any use, we can use container_of to get to the
umem_odp, and a simple flag to indicate if this is an odp MR. Remove the
few remaining references to it.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:48:04 +0000 (20:48 +0300)]
RDMA/umem: Use ib_umem_odp in all function signatures connected to ODP
All of these functions already require the ODP version of the umem struct,
make this very clear by having the signature require it. This paves the
way to using the container_of() pattern to link umem_odp and umem
together.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:44:45 +0000 (20:44 +0300)]
RDMA/umem: Do not use current->tgid to track the mm_struct
This is just wrong, the process that calls into the reg_mr is the process
associated with the umem, and that does not have to be the same process
that created the context.
When this code was first written mmgrab() didn't exist, however these days
we can just directly hold the mm_struct pointer in the umem and have no
ambiguity when it comes to releasing the umem as to which mm it was
associated with.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Sun, 16 Sep 2018 17:43:08 +0000 (20:43 +0300)]
RDMA/ucontext: Add a core API for mmaping driver IO memory
To support disassociation and PCI hot unplug, we have to track all the
VMAs that refer to the device IO memory. When disassociation occurs the
VMAs have to be revised to point to the zero page, not the IO memory, to
allow the physical HW to be unplugged.
The three drivers supporting this implemented three different versions
of this algorithm, all leaving something to be desired. This new common
implementation has a few differences from the driver versions:
- Track all VMAs, including splitting/truncating/etc. Tie the lifetime of
the private data allocation to the lifetime of the vma. This avoids any
tricks with setting vm_ops which Linus didn't like. (see link)
- Support multiple mms, and support properly tracking mmaps triggered by
processes other than the one first opening the uverbs fd. This makes
fork behavior of disassociation enabled drivers the same as fork support
in normal drivers.
- Don't use crazy get_task stuff.
- Simplify the approach for to racing between vm_ops close and
disassociation, fixing the related bugs most of the driver
implementations had. Since we are in core code the tracking list can be
placed in struct ib_uverbs_ufile, which has a lifetime strictly longer
than any VMAs created by mmap on the uverbs FD.
It will trigger unnecessary interrupts caused by time out if prints inside
aeq handle under some configurations. Thus, move all prints out of aeq
handle to work queue.
Signed-off-by: liuyixian <liuyixian@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Mon, 17 Sep 2018 21:44:46 +0000 (15:44 -0600)]
RDMA/uverbs: Fix error unwind in ib_uverbs_add_one
The error path has several mistakes
- cdev_del should not be called if cdev_device_add fails
- We must call put_device on all the goto exit paths as that is what frees
the uapi, SRCU and the struct itself.
While we are here consolidate all the uvdev_dev init that cannot fail at
the top.
Fixes: c5c4d92e70f3 ("RDMA/uverbs: Use cdev_device_add() instead of cdev_add()") Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com>
RDMA/core: Properly return the error code of rdma_set_src_addr_rcu
rdma_set_src_addr_rcu should check copy_src_l2_addr fails, rather than
always return 0. Also copy_src_l2_addr should return 'ret' as its return
value when rdma_translate_ip fails.
Fixes: c31d4b2ddf07 ("RDMA/core: Protect against changing dst->dev during destination resolve") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Commit f27b4746f378 ("i40iw: add connection management code") uses an
incorrect rcu iterator, whilst holding the rtnl_lock. Since the
critical region invokes i40iw_manage_qhash(), which is a sleeping
function, the rcu locking and traversal cannot be used.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 14 Aug 2018 22:33:02 +0000 (15:33 -0700)]
IB/rxe: Revise the ib_wr_opcode enum
This enum has become part of the uABI, as both RXE and the
ib_uverbs_post_send() command expect userspace to supply values from this
enum. So it should be properly placed in include/uapi/rdma.
In userspace this enum is called 'enum ibv_wr_opcode' as part of
libibverbs.h. That enum defines different values for IB_WR_LOCAL_INV,
IB_WR_SEND_WITH_INV, and IB_WR_LSO. These were introduced (incorrectly, it
turns out) into libiberbs in 2015.
The kernel has changed its mind on the numbering for several of the IB_WC
values over the years, but has remained stable on IB_WR_LOCAL_INV and
below.
Based on this we can conclude that there is no real user space user of the
values beyond IB_WR_ATOMIC_FETCH_AND_ADD, as they have never worked via
rdma-core. This is confirmed by inspection, only rxe uses the kernel enum
and implements the latter operations. rxe has clearly never worked with
these attributes from userspace. Other drivers that support these opcodes
implement the functionality without calling out to the kernel.
To make IB_WR_SEND_WITH_INV and related work for RXE in userspace we
choose to renumber the IB_WR enum in the kernel to match the uABI that
userspace has bee using since before Soft RoCE was merged. This is an
overall simpler configuration for the whole software stack, and obviously
can't break anything existing.
Reported-by: Seth Howell <seth.howell@intel.com> Tested-by: Seth Howell <seth.howell@intel.com> Fixes: 8700e3e7c485 ("Soft RoCE driver") Cc: <stable@vger.kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
IB/ipoib: Use dev_port to expose network interface port numbers
Some InfiniBand network devices have multiple ports on the same PCI
function. This initializes the `dev_port' sysfs field of those
network interfaces with their port number.
Prior to this the kernel erroneously used the `dev_id' sysfs
field of those network interfaces to convey the port number to userspace.
The use of `dev_id' was considered correct until Linux 3.15,
when another field, `dev_port', was defined for this particular
purpose and `dev_id' was reserved for distinguishing stacked ifaces
(e.g: VLANs) with the same hardware address as their parent device.
Similar fixes to net/mlx4_en and many other drivers, which started
exporting this information through `dev_id' before 3.15, were accepted
into the kernel 4 years ago.
See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs').
The sysfs field was introduced 4 years ago along with fixes to various
drivers that erroneously used `dev_id' for that purpose, but it was not
properly documented anywhere.
See commit v3.14-rc3-739-g3f85944fe207.
RDMA/core: Consider net ns of gid attribute for RoCE
When resolving destination address or route, when net namespace is
unavailable, refer to the net namespace of the netdevice of the SGID
attribute. This is typically the case for requests arriving from the
network for RoCE ports.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Introduce rdma_read_gid_attr_ndev_rcu() to check GID attribute
Introduce an API rdma_read_gid_attr_ndev_rcu() to return GID attribute
netdevice which is in UP state for accessing netdevice's fields such as
net namespace and ifindex.
This is useful for users who intent to access netdevice fields under rcu
lock.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently RoCE route resolve functionality is split between two
functions. (a) roce_resolve_route_from_path() and its helper function
rdma_resolve_ip_route().
Due to this multiple sockaddr src structures are created in both functions
with rdma_dev_addr is an interface between the two for checks.
Since there is only one user of rdma_resolve_ip_route() as RoCE, combine
the functionality of both functions to roce_resolve_route_from_path() and
further reduce the scope of rdma_dev_addr to core/addr.c
This also allow to extend addr_resolve() in subsequent patch to consider
netdev properties of GID in safer way under rcu lock.
Additionally src and dst addresses were always provided, so skip the src
addr NULL pointer check as they are present on the stack now.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Protect against changing dst->dev during destination resolve
During resolving address process, during route lookup and while performing
src address translation in case of loopback mode, hold the rcu lock so
that if netdevice is moving to different net namespace, or being
unregistered, it can be synchronized with net/core/dev.c, ie
change_net_namespace()
->dev_close_many()
->rt6_uncached_list_flush_dev() who would change dst->dev
to loopback device of the given net namespace.
Therefore, hold the rcu lock and sync with synchronize_net() of
change_net_namespace() to ensure that netdevice cannot get freed while
dst->dev is being used.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Rename rdma_copy_addr to rdma_copy_src_l2_addr
Now that rdma_copy_addr() only copies the source addresses and all callers
are interested in copying only source addresses, simplify it to drop the
destination address argument.
Given that it only copies source layer2 addresses, rename it to
rdma_copy_src_l2_addr for better code readability.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Introduce and use rdma_set_src_addr() between IPv4 and IPv6
rdma_translate_ip() is done while resolving address for the loopback
addresses. The current flow is convoluted with resolve neighbor being
optional.
This patch simplifies the code in following ways.
(a) Use common code between IPv4 and IPv6 for address translation,
loopback checks and acquiring netdevice.
(b) During neigh resolve in addr_resolve_neigh(), only copy destination
address.
(c) Always resolve the source address before the destination address,
because it doesn't depend on resolving neigh being requested or not.
This helps to reduce 3 calls of rdma_copy_addr and rdma_translate_ip to
one and makes it easier to follow the code flow.
Now that ib_nl_fetch_ha() doesn't depend on dst, drop dst argument from
ib_nl_fetch_ha().
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core Introduce and use rdma_find_ndev_for_src_ip_rcu
This fixes two issues:
1. When address family is other than IPv4 or v6, rdma_translate_ip()
returns success which is incorrect.
2. When address familty is AF_INET6, and if the source address is not
found, it returns success, which is also incorrect.
Therefore, introduce and use rdma_find_ndev_for_src_ip_rcu() helper
function which returns correct success or error status and is also useful
for future code refactor in addr_resolve().
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Moni Shoua [Wed, 12 Sep 2018 06:33:55 +0000 (09:33 +0300)]
IB/mlx5: Allow transition of DCI QP to reset
The transition is allowed from any state and the atrribute mask must be
IB_QP_STATE.
Fixes: c32a4f296e1d ("IB/mlx5: Add support for DC Initiator QP") Signed-off-by: Moni Shoua <monis@mellanox.com> Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 14:54:04 +0000 (07:54 -0700)]
IB/hfi1: set_intr_bits uses incorrect source for register modification
HFI IRQ enable bits are not being set correctly. Send context error and
DC IRQs are not being enabled correctly. In addition, send context error
IRQs are not being delivered.
Because of this, send context errors are not being handled correctly when
they occur.
When setting the IRQ bits, if an IRQ range is used, and the last bit is on
a register boundary (bit 63), the calculated index for the final register
modification is incorrect (index + 1 vs. index).
The incorrect index calculation causes incorrect IRQ bits to be set. In
this case the send context error IRQ is NOT enabled.
Fix by using the 'last' value rather than the counted 'src' value to
determine the final index to use. This satisfies all cases.
Fixes: a2f7bbdc2dba ("IB/hfi1: Rework the IRQ API to be more flexible") Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 16:39:28 +0000 (09:39 -0700)]
IB/hfi1: Missing return value in error path for user sdma
If the set_txreq_header_agh() function returns an error, the exit path
is chosen.
In this path, the code fails to set the return value. This will cause
the caller to not realize an error has occurred.
Set the return value correctly in the error path.
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 16:39:20 +0000 (09:39 -0700)]
IB/hfi1: Right size user_sdma sequence numbers and related variables
Hardware limits the maximum number of packets to u16 packets.
Match that size for all relevant sequence numbers in the user_sdma
engine.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 16:39:11 +0000 (09:39 -0700)]
IB/hfi1: Remove race conditions in user_sdma send path
Packet queue state is over used to determine SDMA descriptor
availablitity and packet queue request state.
cpu 0 ret = user_sdma_send_pkts(req, pcount);
cpu 0 if (atomic_read(&pq->n_reqs))
cpu 1 IRQ user_sdma_txreq_cb calls pq_update() (state to _INACTIVE)
cpu 0 xchg(&pq->state, SDMA_PKT_Q_ACTIVE);
At this point pq->n_reqs == 0 and pq->state is incorrectly
SDMA_PKT_Q_ACTIVE. The close path will hang waiting for the state
to return to _INACTIVE.
This can also change the state from _DEFERRED to _ACTIVE. However,
this is a mostly benign race.
Remove the racy code path.
Use n_reqs to determine if a packet queue is active or not.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 16:39:03 +0000 (09:39 -0700)]
IB/hfi1: Eliminate races in the SDMA send error path
pq_update() can only be called in two places: from the completion
function when the complete (npkts) sequence of packets has been
submitted and processed, or from setup function if a subset of the
packets were submitted (i.e. the error path).
Currently both paths can call pq_update() if an error occurrs. This
race will cause the n_req value to go negative, hanging file_close(),
or cause a crash by freeing the txlist more than once.
Several variables are used to determine SDMA send state. Most of
these are unnecessary, and have code inspectible races between the
setup function and the completion function, in both the send path and
the error path.
The request 'status' value can be set by the setup or by the
completion function. This is code inspectibly racy. Since the status
is not needed in the completion code or by the caller it has been
removed.
The request 'done' value races between usage by the setup and the
completion function. The completion function does not need this.
When the number of processed packets matches npkts, it is done.
The 'has_error' value races between usage of the setup and the
completion function. This can cause incorrect error handling and leave
the n_req in an incorrect value (i.e. negative).
Simplify the code by removing all of the unneeded state checks and
variables.
Clean up iovs node when it is freed.
Eliminate race conditions in the error path:
If all packets are submitted, the completion handler will set the
completion status correctly (ok or aborted).
If all packets are not submitted, the caller must wait until the
submitted packets have completed, and then set the completion status.
These two change eliminate the race condition in the error path.
Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Michael J. Ruhl [Mon, 10 Sep 2018 16:49:27 +0000 (09:49 -0700)]
IB/{hfi1, qib, rdmavt}: Schedule multi RC/UC packets instead of posting
The post_send() path determines if it should post directly or, schedule
the post for later. The current logic is:
if the swqe ring is empty or (for hfi1) wqe->length <= piothreshold
post the send
else
schedule
This can allow large requests to call the send engine directly. Large
requests can potentially produce a large number of packets prior to
returning to the caller, blocking the caller from posting more requests,
and allowing better parallel processing.
Allow the driver(s) more say in this logic (pass call_send to the driver,
rather than examining a return value).
Update hfi1/qib logic to schedule the send engine if an RC or UC message
is larger than the QP MTU size.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Thu, 6 Sep 2018 14:27:08 +0000 (17:27 +0300)]
RDMA/mlx5: Allow creating a matcher for a NIC TX flow table
Currently a matcher can only be created and attached to a NIC RX flow
table. Extend it to allow it on NIC TX flow tables as well.
In order to achieve that, we:
1) Expose a new attribute: MLX5_IB_ATTR_FLOW_MATCHER_FLOW_FLAGS.
enum ib_flow_flags is used as valid flags. Only
IB_FLOW_ATTR_FLAGS_EGRESS is supported.
2) Remove the requirement to have a DEVX or QP destination when creating a
flow. A flow added to NIC TX flow table will forward the packet outside
of the vport (Wire or E-Switch in the SR-iOV case).
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Thu, 6 Sep 2018 14:27:06 +0000 (17:27 +0300)]
RDMA/mlx5: Add flow actions support to raw create flow
Support attaching flow actions to a flow rule via raw create flow.
For now only NIC RX path is supported. This change requires to export
flow resources management functions so we can maintain proper bookkeeping
of flow actions.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Guy Levi [Thu, 6 Sep 2018 14:27:01 +0000 (17:27 +0300)]
IB/uverbs: Add IDRs array attribute type to ioctl() interface
Methods sometimes need to get a flexible set of IDRs and not a strict set
as can be achieved today by the conventional IDR attribute. Add a new
IDRS_ARRAY attribute to the generic uverbs ioctl layer.
IDRS_ARRAY points to array of idrs of the same object type and same access
rights, only write and read are supported.
Signed-off-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>`` Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Sun, 2 Sep 2018 09:51:35 +0000 (12:51 +0300)]
RDMA/mlx5: Enable reformat on NIC RX if supported
A L3_TUNNEL_TO_L2 decap flow action requires to enable the encap bit on
the flow table, enable it if supported. This will allow to attach those
flow actions to NIC RX steering. We don't enable if running on a
representor.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Sun, 2 Sep 2018 09:51:33 +0000 (12:51 +0300)]
RDMA/mlx5: Enable decap and packet reformat on flow tables
If NIC RX flow tables support decap operation, enable it on creation,
This allows to perform decapsulation of tunnelled packets by steering
rules. If NIC TX flow tables support reformat operation, enable it on
creation.
We don't enable those capabilities on representors as the E-Switch should
handle packet modification (can be configured via TC) and as current
hardware can't handle both FDB and NIC flow tables with decap/packet
reformat support.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Sun, 2 Sep 2018 09:51:31 +0000 (12:51 +0300)]
RDMA/mlx5: Add NIC TX steering support
Just like ingress steering, allow a user to create steering rules that
match egress vport traffic. We expose the same number of priorities as
the bypass (NIC RX) steering.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Chuck Lever [Tue, 4 Sep 2018 15:45:14 +0000 (11:45 -0400)]
RDMA/core: Document CM @event_handler function
Code audit suggests that the RDMA CM event handler callback function is
_always_ invoked in a context that is safe to block. That's important for
consumer implementers to know, so document that in the comment before
rdma_create_id (where the handler function is set up by the consumer).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Assign device ifindex before publishing the device
Even though device->ifindex is assigned before adding the device in the
list which is read by netlink flow, it is better to assign rdma device
index before publishing the device in the system to users and clients.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 12:08:45 +0000 (15:08 +0300)]
RDMA/core: Define client_data_lock as rwlock instead of spinlock
Even though device registration/unregistration and client
registration/unregistration is not a performance path, define the
client_data_lock as rwlock for code clarity.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 12:08:44 +0000 (15:08 +0300)]
RDMA/core: Use simpler spin lock irq API from blocking context
add_client_context(), ib_unregister_device() and ib_unregister_client()
are designed to call from blocking context. There is no need to save and
restore last interrupt state when called from such blocking context. Even
though this is not a performance path, using the right spin lock API is
desired for code clarity.
To avoid checkpatch warning while removing flags, sizeof() is used.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 12:08:43 +0000 (15:08 +0300)]
RDMA/core: Remove context entries from list while unregistering device
While unregistering a device, remove the context elements from the list to
not have any stale entries. With that any errors/bugs can be checked when
device is freed.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 12:08:42 +0000 (15:08 +0300)]
RDMA/core: Use simplified list_for_each
While traversing client_data_list in following conditions, linked list is
only read, no elements of the list are removed. Therefore, use
list_for_each_entry(), instead of list_for_each_safe().
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 12:08:41 +0000 (15:08 +0300)]
RDMA/core: No need to protect kfree with spin lock and semaphore
While unregistering a client, only context removal should be protected
with lock. There is no need to protect a freeing of such context which is
already removed from the list.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 11:45:32 +0000 (14:45 +0300)]
RDMA/{cma, core}: Avoid callback on rdma_addr_cancel()
Currently rdma_addr_cancel() is an async operation, which notifies that
cancel is done by executing the callback function given during
rdma_resolve_ip(). If resolve_ip request is already completed than
callback is not executed.
Instead, now rdma_resolve_addr() and rdma_addr_cancel() simplified in
following ways.
1. rdma_addr_cancel() now a synchronous method. If request was
pending, after it is cancelled, no callback is notified.
2. rdma_resolve_addr() and respective addr_handler() callback doesn't
need to hold reference to cm_id.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Muhammad Sammar [Tue, 28 Aug 2018 11:45:30 +0000 (14:45 +0300)]
IB/ipoib: Ensure that MTU isn't less than minimum permitted
It is illegal to change MTU to a value lower than the minimum MTU
stated in ethernet spec. In addition to that we need to add 4 bytes
for encapsulation header (IPOIB_ENCAP_LEN).
Before "ifconfig ib0 mtu 0" command, succeeds while it obviously shouldn't.
Signed-off-by: Muhammad Sammar <muhammads@mellanox.com> Reviewed-by: Feras Daoud <ferasda@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 11:45:29 +0000 (14:45 +0300)]
IB/mlx5: Don't hold spin lock while checking device state
mdev->state device state is not protected by the QP for which WRs are
being processed. Therefore, there is no need to hold spin lock while
checking mdev state.
Given that device fatal error is unlikely situation, wrap the condition
check with unlikely().
Additionally, kernel QP1 is also a kernel ULP for which soft CQEs needs
to be generated. Therefore, check for device fatal error before
processing QP1 work requests.
Fixes: 89ea94a7b6c4 ("IB/mlx5: Reset flow support for IB kernel ULPs") Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 28 Aug 2018 11:45:28 +0000 (14:45 +0300)]
RDMA/core: Fail early if unsupported QP is provided
When requested QP type is not supported for a {device, port}, return the
error right away before validating all parameters during mad agent
registration time.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Wed, 5 Sep 2018 22:21:22 +0000 (16:21 -0600)]
Merge branch 'uverbs_dev_cleanups' into rdma.git for-next
For dependencies, branch based on rdma.git 'for-rc' of
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/
Pull 'uverbs_dev_cleanups' from Leon Romanovsky:
====================
Reuse the char device code interfaces to simplify ib_uverbs_device
creation and destruction. As part of this series, we are sending fix to
cleanup path, which was discovered during internal review,
The fix definitely can go to -rc, but it means that this series will be
dependent on rdma-rc.
====================
* branch 'uverbs_dev_cleanups':
RDMA/uverbs: Use device.groups to initialize device attributes
RDMA/uverbs: Use cdev_device_add() instead of cdev_add()
RDMA/core: Depend on device_add() to add device attributes
RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one()
RDMA/uverbs: Use device.groups to initialize device attributes
Instead of explicitly adding device attribute files and handling such
error conditions, depend on device core layer to create device attributes
files based group pointer NULL terminated array.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Depend on device_add() to add device attributes
Instead of adding/removing device attribute files, depend on device_add()
which considers adding these device files based on NULL terminated
attributes group array.
Signed-off-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/uverbs: Fix error cleanup path of ib_uverbs_add_one()
If ib_uverbs_create_uapi() fails, dev_num should be freed from the bitmap.
Fixes: 7d96c9b17636 ("IB/uverbs: Have the core code create the uverbs_root_spec") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
bnxt_re: Fix couple of memory leaks that could lead to IOMMU call traces
1. DMA-able memory allocated for Shadow QP was not being freed.
2. bnxt_qplib_alloc_qp_hdr_buf() had a bug wherein the SQ pointer was
erroneously pointing to the RQ. But since the corresponding
free_qp_hdr_buf() was correct, memory being free was less than what was
allocated.
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
RDMA/core: Replace open-coded variant of get_device
Reuse existing get_device() API to do it symmetric to already used
put_device() in commit 924b8900a49d ("RDMA/core: Replace open-coded
variant of put_device")
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Mon, 3 Sep 2018 17:18:03 +0000 (20:18 +0300)]
RDMA/uverbs: Declare closing variable as boolean
The "closing" variable is used as boolean and set to "true" in one
place, update the declaration of that variable and their other
assignment to proper type.
Fixes: e951747a087a ("IB/uverbs: Rework the locking for cleaning up the ucontext") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 5 Sep 2018 05:58:31 +0000 (08:58 +0300)]
RDMA/nes: Delete impossible debug prints
The pci-core and net-core logic ensure that parameters provided
to nes_probe() and nes_netdev_open() are valid, hence the assert
print are not possible.
Cc: Faisal Latif <faisal.latif@intel.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
drivers/infiniband/hw/qedr/verbs.c: In function 'qedr_create_srq':
drivers/infiniband/hw/qedr/verbs.c:1450:24: warning:
variable 'ctx' set but not used [-Wunused-but-set-variable]
Igor Stoppa [Fri, 31 Aug 2018 20:16:45 +0000 (23:16 +0300)]
IB/srp: Remove unnecessary unlikely()
WARN_ON() already contains an unlikely(), so it's not necessary to wrap it
into another.
Signed-off-by: Igor Stoppa <igor.stoppa@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jack Morgenstein [Mon, 27 Aug 2018 05:35:55 +0000 (08:35 +0300)]
IB/core: Add an unbound WQ type to the new CQ API
The upstream kernel commit cited below modified the workqueue in the
new CQ API to be bound to a specific CPU (instead of being unbound).
This caused ALL users of the new CQ API to use the same bound WQ.
Specifically, MAD handling was severely delayed when the CPU bound
to the WQ was busy handling (higher priority) interrupts.
This caused a delay in the MAD "heartbeat" response handling,
which resulted in ports being incorrectly classified as "down".
To fix this, add a new "unbound" WQ type to the new CQ API, so that users
have the option to choose either a bound WQ or an unbound WQ.
For MADs, choose the new "unbound" WQ.
Fixes: b7363e67b23e ("IB/device: Convert ib-comp-wq to be CPU-bound") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.m> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Joe Perches [Fri, 10 Aug 2018 18:42:46 +0000 (11:42 -0700)]
RDMA/bnxt_re: QPLIB: Add and use #define dev_fmt(fmt) "QPLIB: " fmt
Consistently use the "QPLIB: " prefix for dev_<level> logging.
Miscellanea:
o Add missing newlines to avoid possible message interleaving
o Coalesce consecutive dev_<level> uses that emit a message header to
avoid < 80 column lengths and mistakenly output on multiple lines
o Reflow modified lines to use 80 columns where appropriate
o Consistently use "%s: " where __func__ is output
o QPLIB: is now always output immediately after the dev_<level> header
Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Aaron Knister [Fri, 24 Aug 2018 12:42:46 +0000 (08:42 -0400)]
IB/ipoib: Avoid a race condition between start_xmit and cm_rep_handler
Inside of start_xmit() the call to check if the connection is up and the
queueing of the packets for later transmission is not atomic which leaves
a window where cm_rep_handler can run, set the connection up, dequeue
pending packets and leave the subsequently queued packets by start_xmit()
sitting on neigh->queue until they're dropped when the connection is torn
down. This only applies to connected mode. These dropped packets can
really upset TCP, for example, and cause multi-minute delays in
transmission for open connections.
Here's the code in start_xmit where we check to see if the connection is
up:
if (ipoib_cm_get(neigh)) {
if (ipoib_cm_up(neigh)) {
ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
goto unref;
}
}
The race occurs if cm_rep_handler execution occurs after the above
connection check (specifically if it gets to the point where it acquires
priv->lock to dequeue pending skb's) but before the below code snippet in
start_xmit where packets are queued.
Jason Gunthorpe [Wed, 5 Sep 2018 21:24:58 +0000 (15:24 -0600)]
Merge branch 'mlx5-flow-mutate' into rdma.git for-next
For dependencies, branch based on 'mellanox/mlx5-next' of
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git
Pull Flow actions to mutate packets from Leon Romanovsky:
====================
This series exposes the ability to create flow actions which can
mutate packet headers. We do that by exposing two new verbs:
* modify header - can change existing packet headers. packet
* reformat - can encapsulate or decapsulate a packet.
Once created a flow action must be attached to a steering
rule for it to take effect.
The first 10 patches refactor mlx5_core code, rename internal structures
to better reflect their operation and export needed functions so the RDMA
side can allocate the action.
The last 5 patches expose via the IOCTL infrastructure mlx5_ib methods
which do the actual allocation of resources and return an handle to the
user. A user of this API is expected to know how to work with the device's
spec as the input to those function is HW depended.
An example usage of the modify header action is routing, A user can create
an action which edits the L2 header and decrease the TTL.
An example usage of the packet reformat action is VXLAN encap/decap which
is done by the HW.
====================
* branch 'mlx5-flow-mutate':
RDMA/mlx5: Extend packet reformat verbs
RDMA/mlx5: Add new flow action verb - packet reformat
RDMA/uverbs: Add generic function to fill in flow action object
RDMA/mlx5: Add a new flow action verb - modify header
RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language
net/mlx5: Export packet reformat alloc/dealloc functions
net/mlx5: Pass a namespace for packet reformat ID allocation
net/mlx5: Expose new packet reformat capabilities
{net, RDMA}/mlx5: Rename encap to reformat packet
net/mlx5: Move header encap type to IFC header file
net/mlx5: Break encap/decap into two separated flow table creation flags
net/mlx5: Add support for more namespaces when allocating modify header
net/mlx5: Export modify header alloc/dealloc functions
net/mlx5: Add proper NIC TX steering flow tables support
net/mlx5: Cleanup flow namespace getter switch logic
Mark Bloch [Tue, 28 Aug 2018 11:18:53 +0000 (14:18 +0300)]
RDMA/mlx5: Add new flow action verb - packet reformat
For now, only add L2_TUNNEL_TO_L2 option. This will allow to perform
generic decap operation if the encapsulating protocol is L2 based, and the
inner packet is also L2 based. For example this can be used to decap VXLAN
packets.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Tue, 28 Aug 2018 11:18:51 +0000 (14:18 +0300)]
RDMA/mlx5: Add a new flow action verb - modify header
Expose the ability to create a flow action which changes packet
headers. The data passed from userspace should be modify header actions as
defined by HW specification.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Mark Bloch [Tue, 28 Aug 2018 11:18:50 +0000 (14:18 +0300)]
RDMA/uverbs: Add UVERBS_ATTR_CONST_IN to the specs language
This makes it clear and safe to access constants passed in from user
space. We define a consistent ABI of u64 for all constants, and verify
that the data passed in can be represented by the type the user supplies.
The expectation is this will always be used with an enum declaring the
constant values, and the user will use the enum type as input to the
accessor.
To retrieve the attribute value we introduce two helper calls - one
standard which may fail if attribute is not valid and one where caller can
provide a default value which will be used in case the attribute is not
valid (useful when attribute is optional).
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Mark Bloch [Tue, 28 Aug 2018 11:18:48 +0000 (14:18 +0300)]
net/mlx5: Pass a namespace for packet reformat ID allocation
Currently we attach packet reformat actions only to the FDB namespace.
In preparation to be able to use that for NIC steering, pass the actual
namespace as a parameter.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Mark Bloch [Tue, 28 Aug 2018 11:18:46 +0000 (14:18 +0300)]
{net, RDMA}/mlx5: Rename encap to reformat packet
Renames all encap mlx5_{core,ib} code to use the new naming of packet
reformat. This change doesn't introduce any function change and is
needed to properly reflect the operation being done by this action.
For example not only can we encapsulate a packet, but also decapsulate it.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>