Max Gurtovoy [Tue, 11 Jun 2019 15:52:46 +0000 (18:52 +0300)]
RDMA/mlx5: Introduce and implement new IB_WR_REG_MR_INTEGRITY work request
This new WR will be used to perform PI (protection information) handover
using the new API. Using the new API, the user will post a single WR that
will internally perform all the needed actions to complete PI operation.
This new WR will use a memory region that was allocated as
IB_MR_TYPE_INTEGRITY and was mapped using ib_map_mr_sg_pi to perform the
registration. In the old API, in order to perform a signature handover
operation, each ULP should perform the following:
1. Map and register the data buffers.
2. Map and register the protection buffers.
3. Post a special reg WR to configure the signature handover operation
layout.
4. Invalidate the signature memory key.
5. Invalidate protection buffers memory key.
6. Invalidate data buffers memory key.
In the new API, the mapping of both data and protection buffers is
performed using a single call to ib_map_mr_sg_pi function. Also the
registration of the buffers and the configuration of the signature
operation layout is done by a single new work request called
IB_WR_REG_MR_INTEGRITY.
This patch implements this operation for mlx5 devices that are capable to
offload data integrity generation/validation while performing the actual
buffer transfer.
This patch will not remove the old signature API that is used by the iSER
initiator and target drivers. This will be done in the future.
In the internal implementation, for each IB_WR_REG_MR_INTEGRITY work
request, we are using a single UMR operation to register both data and
protection buffers using KLM's.
Afterwards, another UMR operation will describe the strided block format.
These will be followed by 2 SET_PSV operations to set the memory/wire
domains initial signature parameters passed by the user.
In the end of the whole transaction, only the signature memory key
(the one that exposed for the RDMA operation) will be invalidated.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:45 +0000 (18:52 +0300)]
RDMA/mlx5: Update set_sig_data_segment attribute for new signature API
Explicitly pass the sig_mr and the access flags for the mkey segment
configuration. This function will be used also in the new signature
API, so modify it in order to use it in both APIs. This is a preparation
commit before adding new signature API.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:44 +0000 (18:52 +0300)]
RDMA/mlx5: Pass UMR segment flags instead of boolean
UMR ctrl segment flags can vary between UMR operations. for example,
using inline UMR or adding free/not-free checks for a memory key.
This is a preparation commit before adding new signature API that
will not need not-free checks for the internal memory key during the
UMR operation.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:43 +0000 (18:52 +0300)]
RDMA/mlx5: Add attr for max number page list length for PI operation
PI offload (protection information) is a feature that each RDMA provider
can implement differently. Thus, introduce new device attribute to define
the maximal length of the page list for PI fast registration operation. For
example, mlx5 driver uses a single internal MR to map both data and
protection SGL's, so it's equal to max_fast_reg_page_list_len / 2.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:42 +0000 (18:52 +0300)]
RDMA/mlx5: Implement mlx5_ib_map_mr_sg_pi and mlx5_ib_alloc_mr_integrity
mlx5_ib_map_mr_sg_pi() will map the PI and data dma mapped SG lists to the
mlx5 memory region prior to the registration operation. In the new
API, the mlx5 driver will allocate an internal memory region for the
UMR operation to register both PI and data SG lists. The internal MR
will use KLM mode in order to map 2 (possibly non-contiguous/non-align)
SG lists using 1 memory key. In the new API, each ULP will use 1 memory
region for the signature operation (instead of 3 in the old API). This
memory region will have a key that will be exposed to remote server to
perform RDMA operation. The internal memory key that will map the SG lists
will stay private.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:41 +0000 (18:52 +0300)]
RDMA/core: Add signature attrs element for ib_mr structure
This element will describe the needed characteristics for the signature
operation per signature enabled memory region (type IB_MR_TYPE_INTEGRITY).
Also add meta_length attribute to ib_sig_attrs structure for saving the
mapped metadata length (needed for the new API implementation).
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:40 +0000 (18:52 +0300)]
RDMA/core: Introduce ib_map_mr_sg_pi to map data/protection sgl's
This function will map the previously dma mapped SG lists for PI
(protection information) and data to an appropriate memory region for
future registration.
The given MR must be allocated as IB_MR_TYPE_INTEGRITY.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Israel Rukshin [Tue, 11 Jun 2019 15:52:39 +0000 (18:52 +0300)]
RDMA/core: Introduce IB_MR_TYPE_INTEGRITY and ib_alloc_mr_integrity API
This is a preparation for signature verbs API re-design. In the new
design a single MR with IB_MR_TYPE_INTEGRITY type will be used to perform
the needed mapping for data integrity operations.
Signed-off-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:38 +0000 (18:52 +0300)]
RDMA/core: Save the MR type in the ib_mr structure
This is a preparation for the signature verbs API change. This change is
needed since the MR type will define, in the upcoming patches, the need
for allocating internal resources in LLD for signature handover related
operations. It will also help to make sure that signature related
functions are called with an appropriate MR type and fail otherwise.
Also introduce new mr types IB_MR_TYPE_USER, IB_MR_TYPE_DMA and
IB_MR_TYPE_DM for correctness.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Max Gurtovoy [Tue, 11 Jun 2019 15:52:37 +0000 (18:52 +0300)]
RDMA/core: Introduce new header file for signature operations
Ease the exhausted ib_verbs.h file and make the code more readable.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Gal Pressman [Thu, 13 Jun 2019 09:10:13 +0000 (12:10 +0300)]
RDMA/efa: Be consistent with success flow return value
The EFA driver is written with success oriented flows in mind, meaning
that functions should mostly end with a return 0 statement.
Error flows return their error value on their own instead of assuming
that the function will return the error at the end.
This commit fixes a bunch of functions that were not aligned with this
behavior.
Gal Pressman [Thu, 13 Jun 2019 09:10:12 +0000 (12:10 +0300)]
RDMA/efa: Use API to get contiguous memory blocks aligned to device supported page size
Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
registering an MR instead of coding it in the driver.
ib_umem_find_best_pgsz() is used to find the best suitable page size
which replaces the existing efa_cont_pages() implementation.
rdma_for_each_block() is used to iterate the umem in aligned contiguous
memory blocks.
Mike Marciniszyn [Thu, 13 Jun 2019 12:30:52 +0000 (08:30 -0400)]
IB/{rdmavt, qib, hfi1}: Convert to new completion API
Convert all completions to use the new completion routine that
fixes a race between post send and completion where fields from
a SWQE can be read after SWQE has been freed.
This patch also addresses issues reported in
https://marc.info/?l=linux-kernel&m=155656897409107&w=2.
The reserved operation path has no need for any barrier.
The barrier for the other path is addressed by the
smp_load_acquire() barrier.
Cc: Andrea Parri <andrea.parri@amarulasolutions.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Thu, 13 Jun 2019 12:30:44 +0000 (08:30 -0400)]
IB/rdmavt: Add new completion inline
There is opencoded send completion logic all over all
the drivers.
We need to convert to this routine to enforce ordering
issues for completions. This routine fixes an ordering
issue where the read of the SWQE fields necessary for creating
the completion can race with a post send if the post send catches
a send queue at the edge of being full. Is is possible in that situation
to read SWQE fields that are being written.
This new routine insures that SWQE fields are read prior to advancing
the index that post send uses to determine queue fullness.
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Fri, 14 Jun 2019 00:46:45 +0000 (21:46 -0300)]
RDMA/odp: Do not leak dma maps when working with huge pages
The ib_dma_unmap_page() must match the length of the ib_dma_map_page(),
which is based on odp_shift. Otherwise iommu resources under this API
will not be properly freed.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Sun, 16 Jun 2019 12:05:58 +0000 (15:05 +0300)]
RDMa/hns: Don't stuck in endless timeout loop
The "end" variable is declared as unsigned and can't be negative, it
leads to the situation where timeout limit is not honored, so let's
convert logic to ensure that loop is bounded.
drivers/infiniband/hw/hns/hns_roce_hw_v1.c: In function _hns_roce_v1_clear_hem_:
drivers/infiniband/hw/hns/hns_roce_hw_v1.c:2471:12: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
2471 | if (end < 0) {
| ^
Fixes: 669cefb654cb ("RDMA/hns: Remove jiffies operation in disable interrupt context") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Sun, 16 Jun 2019 12:05:20 +0000 (15:05 +0300)]
RDMA: Check umem pointer validity prior to release
Update ib_umem_release() to behave similarly to kfree() and allow
submitting NULL pointer as safe input to this function.
Fixes: a52c8e2469c3 ("RDMA: Clean destroy CQ in drivers do not return errors") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Lang Cheng [Fri, 14 Jun 2019 14:56:03 +0000 (22:56 +0800)]
RDMA/hns: reset function when removing module
During removing the driver, we needs to notify the roce engine to
stop working immediately,and symmetrically recycle the hardware
resources requested during initialization.
The hardware provides a command called function clear that can package
these operations,so that the driver can only focus on releasing
resources that applied from the operating system.
This patch implements the call of this command.
Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
When the wqe num is larger than 16384, hns_roce_table_find will return an
invalid mtt, this will lead an kernel paging requet error if the driver try
to access it. It's the mtt design defect which can't support up to the max
wqe num of hip08.
This patch fixs it by replacing mtt with mtr for wqe.
Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC") Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Lijun Ou [Sat, 8 Jun 2019 06:46:08 +0000 (14:46 +0800)]
RDMA/hns: Add mtr support for mixed multihop addressing
Currently, the MTT(memory translate table) design required a buffer
space must has the same hopnum, but the hip08 hw can support mixed
hopnum config in a buffer space.
This patch adds the MTR(memory translate region) design for supporting
mixed multihop.
Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Wed, 19 Jun 2019 13:20:49 +0000 (09:20 -0400)]
RDMA/netlink: Resort policy array
Sort the netlink policy array by netlink attribute name. This will make
it easier in the future to find the entry you are looking for when you
need to make changes, or to make sure you don't add the same entry
twice.
Fix the whitespace while we are there.
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 12 Jun 2019 12:20:14 +0000 (15:20 +0300)]
RDMA/mlx5: Enable decap and packet reformat on FDB
If FDB flow tables support decap operation, enable it on creation,
This allows to perform decapsulation of tunnelled packets by steering
rules. If FDB flow tables support reformat operation, enable it on
creation as well.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Maor Gottlieb [Wed, 12 Jun 2019 12:20:13 +0000 (15:20 +0300)]
RDMA/mlx5: Consider eswitch encap mode
When flow steering is created, then the encap support should
consider the eswitch encap mode. If the eswitch flow table (FDB)
supports encap then it shouldn't be supported on NIC RX flow tables.
Fixes: 4adda1122c490 ('RDMA/mlx5: Enable decap and packet reformat on flow tables') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reviewed-by: Petr Vorel <pvorel@suse.cz> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Fri, 14 Jun 2019 00:38:19 +0000 (21:38 -0300)]
RDMA: Report available cdevs through RDMA_NLDEV_CMD_GET_CHARDEV
Update the struct ib_client for all modules exporting cdevs related to the
ibdevice to also implement RDMA_NLDEV_CMD_GET_CHARDEV. All cdevs are now
autoloadable and discoverable by userspace over netlink instead of relying
on sysfs.
uverbs also exposes the DRIVER_ID for drivers that are able to support
driver id binding in rdma-core.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Fri, 14 Jun 2019 00:38:18 +0000 (21:38 -0300)]
RDMA: Add NLDEV_GET_CHARDEV to allow char dev discovery and autoload
Allow userspace to issue a netlink query against the ib_device for
something like "uverbs" and get back the char dev name, inode major/minor,
and interface ABI information for "uverbs0".
Since we are now in netlink this can also trigger a module autoload to
make the uverbs device come into existence.
Largely this will let us replace searching and reading inside sysfs to
setup devices, and provides an alternative (using driver_id) to device
name based provider binding for things like rxe.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Yuval Avnery [Mon, 10 Jun 2019 23:38:42 +0000 (23:38 +0000)]
net/mlx5: Add EQ enable/disable API
Previously, EQ joined the chain notifier on creation.
This forced the caller to be ready to handle events before creating
the EQ through eq_create_generic interface.
To help the caller control when the created EQ will be attached to the
IRQ, add enable/disable API.
Ariel Levkovich [Mon, 10 Jun 2019 23:38:41 +0000 (23:38 +0000)]
net/mlx5: Use a single IRQ for all async EQs
The patch modifies the IRQ allocation so that all async EQs are
assigned to the same IRQ resulting in more available IRQs for
completion EQs.
The changes are using the support for IRQ sharing and EQ polling budget
that was introduced in previous patches so when the shared interrupt is
triggered, the kernel will serially call the handler of each of the
sharing EQs with a certain budget of EQEs to poll in order to prevent
starvation.
Yuval Avnery [Mon, 10 Jun 2019 23:38:27 +0000 (23:38 +0000)]
net/mlx5: Separate IRQ data from EQ table data
IRQ table should only exist for mlx5_core_dev for PF and VF only.
EQ table of mediated devices should hold a pointer to the IRQ table
of the parent PCI device.
Yuval Avnery [Mon, 10 Jun 2019 23:38:25 +0000 (23:38 +0000)]
net/mlx5: Separate IRQ request/free from EQ life cycle
Instead of requesting IRQ with eq creation, IRQs will be requested
before EQ table creation.
Instead of freeing the IRQs after EQ destroy, free IRQs after eq
table destroy.
Yuval Avnery [Mon, 10 Jun 2019 23:38:21 +0000 (23:38 +0000)]
net/mlx5: Introduce EQ polling budget
Multiple EQs may share the same irq in subsequent patches.
To avoid starvation, a budget is set per EQ's interrupt handler.
Because of this change, it is no longer required to check that
MLX5_NUM_SPARE_EQE eqes were polled (to detect that arm is required).
It is guaranteed that MLX5_NUM_SPARE_EQE > budget, therefore the
handler will arm and exit the handler before all the entries in the
eq are polled.
In the scenario where the handler is out of budget and there are more
EQEs to poll, arming the EQ guarantees that the HW will send another
interrupt and the handler will be called again.
Bodong Wang [Mon, 10 Jun 2019 23:38:19 +0000 (23:38 +0000)]
net/mlx5: Support querying max VFs from device
For ECPF with eswitch manager privilege, query the host max VF count
by querying the device using query_functions command.
With this enhancement:
1. flow steering entries are created only for valid vports based on
the max VF count of the PF.
2. Driver only queries cap of valid vport.
Eswitch requires the max VFs when doing initialization, so do sr-iov
init before eswitch init.
Vu Pham [Mon, 10 Jun 2019 23:38:16 +0000 (23:38 +0000)]
net/mlx5: E-Switch, Handle representors creation in handler context
Unified representors creation in esw_functions_changed context
handler. Emulate the esw_function_changed event for FW/HW that
does not support this event.
Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Daniel Jurgens [Mon, 10 Jun 2019 23:38:14 +0000 (23:38 +0000)]
net/mlx5: Increase wait time for fw initialization
Firmware FLR happens sequentially, in some cases, like when destroying
a VM that had many VFs, may require waiting much longer than 10 seconds.
Increase the timeout to 2 minutes, and print a wait countdown status
every 20 seconds.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Jason Gunthorpe [Mon, 10 Jun 2019 19:49:11 +0000 (16:49 -0300)]
rdma: Remove nes
This driver was first merged over 10 years ago and has not seen major
activity by the authors in the last 7 years. However, in that time it has
been patched 150 times to adapt it to changing kernel APIs.
Further, the hardware has several issues, like not supporting 64 bit DMA,
that make it rather uninteresting for use with modern systems and RDMA.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Kamal Heib [Thu, 30 May 2019 13:18:17 +0000 (16:18 +0300)]
RDMA/ipoib: Remove check for ETH_SS_TEST
The default action for unlisted tests is "not-supported", so given that
ipoib doesn't support ETH_SS_TEST, there is no need to check for it
in the case statements, just let it get caught by the default: case.
Fixes: e3614bc9dc44 ("IB/ipoib: Add readout of statistics using ethtool") Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Tue, 28 May 2019 11:37:29 +0000 (14:37 +0300)]
RDMA: Convert CQ allocations to be under core responsibility
Ensure that CQ is allocated and freed by IB/core and not by drivers.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Tue, 28 May 2019 11:37:27 +0000 (14:37 +0300)]
RDMA/nes: Avoid memory allocation during CQ destroy
The memory allocation call can fail and cause to early return
from nes_desotroy_cq() function. This situation will cause to
memory leak of struct nes_cq. Rewrite function to avoid memory
allocation.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Lijun Ou [Fri, 31 May 2019 10:28:03 +0000 (18:28 +0800)]
RDMA/hns: Bugfix for filling the sge of srq
When user post recv a srq with multiple sges, the hardware will get the
last correct sge and count the sge numbers according to the specific
identifier with lkey. For example, when the driver fills the sges with
every wr less than the max sge that the user configured when creating srq,
the hardware will stop getting the sge according to the specific lkey in
the sge. However, it will always end with the first sge in the current
post srq recv interface implementation.
Fixes: c7bcb13442e1 ("RDMA/hns: Add SRQ support for hip08 kernel mode") Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Colin Ian King [Fri, 31 May 2019 09:21:01 +0000 (10:21 +0100)]
RDMA/hns: fix inverted logic of readl read and shift
A previous change incorrectly changed the inverted logic and logically
negated the readl rather than the shifted readl result. Fix this by
adding in missing parentheses around the expression that needs to be
logically negated.
Addresses-Coverity: ("Logically dead code") Fixes: 669cefb654cb ("RDMA/hns: Remove jiffies operation in disable interrupt context") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Wed, 29 May 2019 16:38:31 +0000 (09:38 -0700)]
RDMA/srp: Accept again source addresses that do not have a port number
The function srp_parse_in() is used both for parsing source address
specifications and for target address specifications. Target addresses
must have a port number. Having to specify a port number for source
addresses is inconvenient. Make sure that srp_parse_in() supports again
parsing addresses with no port number.
Cc: <stable@vger.kernel.org> Fixes: c62adb7def71 ("IB/srp: Fix IPv6 address parsing") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add support for reporting link state for ipoib net devices.
$ ip l set dev mlx4_ib0 up
$ ethtool mlx4_ib0 | grep Link
Link detected: yes
$ ip l set dev mlx4_ib0 down
$ ethtool mlx4_ib0 | grep Link
Link detected: no
Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Currently for every representor type and for every single vport,
representer function pointers copy is stored even though they don't
change from one to other vport.
Additionally priv data entry for the rep is not passed during
registration, but its copied. It is used (set and cleared) by the user
of the reps.
As we want to scale vports, to simplify and also to split constants
from data,
1. Rename mlx5_eswitch_rep_if to mlx5_eswitch_rep_ops as to match _ops
prefix with other standard netdev, ibdev ops.
2. Constify the IB and Ethernet rep ops structure.
3. Instead of storing copy of all rep function pointers, store copy
per eswitch rep type.
4. Split data and function pointers to mlx5_eswitch_rep_ops and
mlx5_eswitch_rep_data.
Eli Britstein [Wed, 29 May 2019 22:50:29 +0000 (22:50 +0000)]
net/mlx5: Introduce termination table bits
Termination table is a flow table with a termination flag. The flag
allows the firmware to assume that the the specified actions are the last
actions list. This assumption allows the FW to safely perform potential
looping logic (e.g. hairpin). Introduce the bits for this attribute.
Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Lijun Ou [Thu, 30 May 2019 15:55:53 +0000 (23:55 +0800)]
RDMA/hns: Bugfix for posting multiple srq work request
When the user submits more than 32 work request to a srq queue
at a time, it needs to find the corresponding number of entries
in the bitmap in the idx queue. However, the original lookup
function named ffs only processes 32 bits of the array element,
When the number of srq wqe issued exceeds 32, the ffs will only
process the lower 32 bits of the elements, it will not be able
to get the correct wqe index for srq wqe.
Signed-off-by: Xi Wang <wangxi11@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Thu, 30 May 2019 08:20:24 +0000 (11:20 +0300)]
RDMA/uverbs: check for allocation failure in uapi_add_elm()
If the kzalloc() fails then we should return ERR_PTR(-ENOMEM). In the
current code it's possible that the kzalloc() fails and the
radix_tree_insert() inserts the NULL pointer successfully and we return
the NULL "elm" pointer to the caller. That results in a NULL pointer
dereference.
Fixes: 9ed3e5f44772 ("IB/uverbs: Build the specs into a radix tree at runtime") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
So, replace the following form:
sizeof(*pkt) + sizeof(pkt->addr[0])*n
with:
struct_size(pkt, addr, n)
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Make use of the struct_size() helper instead of an open-coded version
in order to avoid any potential type mistakes, in particular in the
context in which this code is being used.
Also, notice that variable size is unnecessary, hence it is removed.
This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Gal Pressman [Tue, 28 May 2019 12:46:17 +0000 (15:46 +0300)]
RDMA/efa: Use rdma block iterator in chunk list creation
When creating the chunks list the rdma_for_each_block() iterator is used
in order to iterate over the payload in EFA_CHUNK_PAYLOAD_SIZE (device
defined) strides.
Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Reviewed-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The admin commands abort flow is buggy (use-after-free) and not really
necessary as it is guaranteed that after ib_unregister_device() is called
there are no user verbs threads running in parallel, delete it.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Gal Pressman [Tue, 28 May 2019 12:46:14 +0000 (15:46 +0300)]
RDMA/efa: Use kvzalloc instead of kzalloc with fallback
Use kvzalloc which attempts to allocate a physically continuous buffer and
fallbacks to virtually continuous on failure instead of open coding it in
the driver.
The is_vmalloc_addr function is used to determine whether the buffer is
physically continuous or not (which determines direct vs indirect MR
registration mode).
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Dan Carpenter [Fri, 3 May 2019 12:28:39 +0000 (15:28 +0300)]
net/mlx5: potential error pointer dereference in error handling
The error handling was a bit flipped around. If the mlx5_create_flow_group()
function failed then it would have resulted in dereferencing "fg" when
it was an error pointer.
Fixes: 80f09dfc237f ("net/mlx5: Eswitch, enable RoCE loopback traffic") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
John Hubbard [Sat, 25 May 2019 01:45:22 +0000 (18:45 -0700)]
RDMA: Convert put_page() to put_user_page*()
For infiniband code that retains pages via get_user_pages*(), release
those pages via the new put_user_page(), or put_user_pages*(), instead of
put_page()
This is a tiny part of the second step of fixing the problem described in
[1]. The steps are:
1) Provide put_user_page*() routines, intended to be used for releasing
pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to invoke
put_user_page*(), instead of put_page(). This involves dozens of call
sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to
implement tracking of these pages. This tracking will be separate from
the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement
special handling (especially in writeback paths) when the pages are
backed by a filesystem. Again, [1] provides details as to why that is
desirable.
[1] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Jérôme Glisse <jglisse@redhat.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
YueHaibing [Sat, 25 May 2019 12:57:37 +0000 (20:57 +0800)]
IB/hfi1: Remove set but not used variables 'offset' and 'fspsn'
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/infiniband/hw/hfi1/tid_rdma.c: In function tid_rdma_rcv_error:
drivers/infiniband/hw/hfi1/tid_rdma.c:2029:7: warning: variable offset set but not used [-Wunused-but-set-variable]
drivers/infiniband/hw/hfi1/tid_rdma.c: In function hfi1_rc_rcv_tid_rdma_ack:
drivers/infiniband/hw/hfi1/tid_rdma.c:4555:35: warning: variable fspsn set but not used [-Wunused-but-set-variable]
'offset' is never used since introduction in
commit d0d564a1caac ("IB/hfi1: Add functions to receive TID RDMA READ request")
'fspsn' is never used since introduciotn in
commit 9e93e967f7b4 ("IB/hfi1: Add a function to receive TID RDMA ACK packet")
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lang Cheng [Fri, 24 May 2019 07:31:23 +0000 (15:31 +0800)]
RDMA/hns: Remove jiffies operation in disable interrupt context
In some functions, the jiffies operation is unnecessary, and we can
control delay using mdelay and udelay functions only. Especially, in
hns_roce_v1_clear_hem, the function calls spin_lock_irqsave, the context
disables interrupt, so we can not use jiffies and msleep functions.
Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Lang Cheng [Fri, 24 May 2019 07:31:22 +0000 (15:31 +0800)]
RDMA/hns: Move spin_lock_irqsave to the correct place
When hip08 set gid, it will call spin_unlock_bh when send cmq. if main.ko
call spin_lock_irqsave firstly, and the kernel is before commit f71b74bca637 ("irq/softirqs: Use lockdep to assert IRQs are
disabled/enabled"), it will cause WARN_ON_ONCE because of calling
spin_unlock_bh in disable context.
In fact, the spin_lock_irqsave in main.ko is only used for hip06, and
should be placed in hns_roce_hw_v1.c. hns_roce_hw_v2.c uses its own
spin_unlock_bh and does not need main.ko manage spin_lock.
Signed-off-by: Lang Cheng <chenglang@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Add await in destroy_qp() so that all references to qp are dereferenced
and qp is freed in destroy_qp() itself. This ensures freeing of all QPs
before invocation of dealloc_ucontext(), which prevents loss of in use
qpids stored in the ucontext.
Leon Romanovsky [Mon, 20 May 2019 06:54:30 +0000 (09:54 +0300)]
RDMA/cxgb3: Delete and properly mark unimplemented resize CQ function
Resize CQ implementation was guarded by undeclared "notyet" define while
cxgb3 was added to the kernel. Twelve years later, this call is still
unimplemented, so safely delete it and fix improper return error code when
.resize_cq() is not implemented.
Fixes: b038ced7b370 ("RDMA/cxgb3: Add driver for Chelsio T3 RNIC") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Mon, 20 May 2019 06:54:29 +0000 (09:54 +0300)]
RDMA/cxgb3: Don't expose DMA addresses
DMA addresses like all other kernel addresses should be printed with
special %p* formatter. It is needed to allow control of exposure of such
information through a dedicated knob.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Mon, 20 May 2019 06:54:22 +0000 (09:54 +0300)]
RDMA/efa: Remove check that prevents destroy of resources in error flows
Drivers cannot check the udata for validity when doing destroy as there
will be no way to report this error back to the uverbs.
Since udata is new for destroy no driver should start to use it - instead
drivers should opt for the ioctl interface and define it in a way where it
cannot fail due to incorrect data.
Remove the checks on udata construction so EFA is consistent with
everything else.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Mon, 20 May 2019 06:54:25 +0000 (09:54 +0300)]
RDMA/nes: Remove second wait queue initialization call
The same wait queue is initialized a couple of lines above.
Fixes: 3c2d774cad5b ("RDMA/nes: Add a driver for NetEffect RNICs") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>