Idan Burstein [Wed, 2 May 2018 10:16:39 +0000 (13:16 +0300)]
IB/mlx5: posting klm/mtt list inline in the send queue for reg_wr
As most kernel RDMA ULPs, (e.g. NVMe over Fabrics in its default
"register_always=Y" mode) registers and invalidates user buffer
upon each IO.
Today the mlx5 driver is posting the registration work
request using scatter/gather entry for the MTT/KLM list.
The fetch of the MTT/KLM list becomes the bottleneck in
number of IO operation could be done by NVMe over Fabrics
host driver on a single adapter as shown below.
This patch is adding the support for inline registration
work request upon MTT/KLM list of size <=64B.
The result for NVMe over Fabrics is increase of > x3.5 for small
IOs as shown below, I expect other ULPs (e.g iSER, SRP, NFS over RDMA)
performance to be enhanced as well.
The following results were taken against a single NVMe-oF (RoCE link layer)
subsystem with a single namespace backed by null_blk using fio benchmark
(with rw=randread, numjobs=48, iodepth={16,64}, ioengine=libaio direct=1):
Parav Pandit [Wed, 2 May 2018 10:12:56 +0000 (13:12 +0300)]
IB/core: Reuse gid_table_release_one() in table allocation failure
_gid_table_setup_one() only performs GID table cache memory allocation,
marks entries as invalid (free) and marks the reserved entries.
At this point GID table is empty and no entries are added.
On dual port device if _gid_table_setup_one() fails to allocate the gid
table for 2nd port, there is no need to perform cleanup_gid_table_port()
to delete GID entries, as GID table is empty.
Therefore make use of existing gid_table_release_one() routine which
frees the GID table memory and avoid code duplication.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 3 May 2018 15:41:30 +0000 (08:41 -0700)]
RDMA/nldev: add driver-specific resource tracking
Each driver can register a "fill entry" function with the restrack core.
This function will be called when filling out a resource, allowing the
driver to add driver-specific details. The details consist of a
nltable of nested attributes, that are in the form of <key, [print-type],
value> tuples. Both key and value attributes are mandatory. The key
nlattr must be a string, and the value nlattr can be one of the driver
attributes that are generic, but typed, allowing the attributes to be
validated. Currently the driver nlattr types include string, s32,
u32, s64, and u64. The print-type nlattr allows a driver to specify
an alternative display format for user tools displaying the attribute.
For example, a u32 attribute will default to "%u", but a print-type
attribute can be included for it to be displayed in hex. This allows
the user tool to print the number in the format desired by the driver
driver.
More attrs can be defined as they become needed by drivers.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 3 May 2018 15:40:49 +0000 (08:40 -0700)]
RDMA/nldev: Add explicit pad attribute
Add a specific RDMA_NLDEV_ATTR_PAD attribute to be used for 64b
attribute padding. To preserve the ABI, make this attribute equal to
RDMA_NLDEV_ATTR_UNSPEC, which has a value of 0, because that has been
used up until now as the pad attribute.
Change all the previous use of 0 as the pad with this
new enum.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Wed, 2 May 2018 10:18:59 +0000 (13:18 +0300)]
RDMA/cma: Do not query GID during QP state transition to RTR
When commit [1] was added, SGID was queried to derive the SMAC address.
Then, later on during a refactor [2], SMAC was no longer needed. However,
the now useless GID query remained. Then during additional code changes
later on, the GID query was being done in such a way that it caused iWARP
queries to start breaking. Remove the useless GID query and resolve the
iWARP breakage at the same time.
This is discussed in [3].
[1] commit dd5f03beb4f7 ("IB/core: Ethernet L2 attributes in verbs/cm structures")
[2] commit 5c266b2304fb ("IB/cm: Remove the usage of smac and vid of qp_attr and cm_av")
[3] https://www.spinics.net/lists/linux-rdma/msg63951.html
IB/mlx4: Fix integer overflow when calculating optimal MTT size
When the kernel was compiled using the UBSAN option,
we saw the following stack trace:
[ 1184.827917] UBSAN: Undefined behaviour in drivers/infiniband/hw/mlx4/mr.c:349:27
[ 1184.828114] signed integer overflow:
[ 1184.828247] -2147483648 - 1 cannot be represented in type 'int'
The problem was caused by calling round_up in procedure
mlx4_ib_umem_calc_optimal_mtt_size (on line 349, as noted in the stack
trace) with the second parameter (1 << block_shift) (which is an int).
The second parameter should have been (1ULL << block_shift) (which
is an unsigned long long).
(1 << block_shift) is treated by the compiler as an int (because 1 is
an integer).
Now, local variable block_shift is initialized to 31.
If block_shift is 31, 1 << block_shift is 1 << 31 = 0x80000000=-214748368.
This is the most negative int value.
Inside the round_up macro, there is a cast applied to ((1 << 31) - 1).
However, this cast is applied AFTER ((1 << 31) - 1) is calculated.
Since (1 << 31) is treated as an int, we get the negative overflow
identified by UBSAN in the process of calculating ((1 << 31) - 1).
The fix is to change (1 << block_shift) to (1ULL << block_shift) on
line 349.
Fixes: 9901abf58368 ("IB/mlx4: Use optimal numbers of MTT entries") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Fix memory leak in exception path in get_irq_affinity()
When IRQ affinity is set and the interrupt type is unknown, a cpu
mask allocated within the function is never freed. Fix this memory
leak by allocating memory within the scope where it is used.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/{hfi1, rdmavt}: Fix memory leak in hfi1_alloc_devdata() upon failure
When allocating device data, if there's an allocation failure, the
already allocated memory won't be freed such as per-cpu counters.
Fix memory leaks in exception path by creating a common reentrant
clean up function hfi1_clean_devdata() to be used at driver unload
time and device data allocation failure.
To accomplish this, free_platform_config() and clean_up_i2c() are
changed to be reentrant to remove dependencies when they are called
in different order. This helps avoid NULL pointer dereferences
introduced by this patch if those two functions weren't reentrant.
In addition, set dd->int_counter, dd->rcv_limit,
dd->send_schedule and dd->tx_opstats to NULL after they're freed in
hfi1_clean_devdata(), so that hfi1_clean_devdata() is fully reentrant.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Fix NULL pointer dereference when invalid num_vls is used
When an invalid num_vls is used as a module parameter, the code
execution follows an exception path where the macro dd_dev_err()
expects dd->pcidev->dev not to be NULL in hfi1_init_dd(). This
causes a NULL pointer dereference.
Fix hfi1_init_dd() by initializing dd->pcidev and dd->pcidev->dev
earlier in the code. If a dd exists, then dd->pcidev and
dd->pcidev->dev always exists.
Cc: <stable@vger.kernel.org> # 4.9.x Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
AHG may be armed to use the stored header, which by design is limited
to edits in the PSN/A 32 bit word (bth2).
When the code is trying to send a BECN, the use of the stored header
will lose the BECN bit.
Fix by avoiding AHG when getting ready to send a BECN. This is
accomplished by always claiming the packet is not a middle packet which
is an AHG precursor. BECNs are not a normal case and this should not
hurt AHG optimizations.
Cc: <stable@vger.kernel.org> # 4.14.x Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Michael J. Ruhl [Tue, 1 May 2018 12:35:43 +0000 (05:35 -0700)]
IB/hfi1 Use correct type for num_user_context
The module parameter num_user_context is defined as 'int' and
defaults to -1. The module_param_named() says that it is uint.
Correct module_param_named() type information and update the modinfo
text to reflect the default value.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Fix handling of FECN marked multicast packet
The code for handling a marked UD packet unconditionally returns the
dlid in the header of the FECN marked packet. This is not correct
for multicast packets where the DLID is in the multicast range.
The subsequent attempt to send the CNP with the multicast lid will
cause the chip to halt the ack send context because the source
lid doesn't match the chip programming. The send context will
be halted and flush any other pending packets in the pio ring causing
the CNP to not be sent.
A part of investigating the fix, it was determined that the 16B work
broke the FECN routine badly with inconsistent use of 16 bit and 32 bits
types for lids and pkeys. Since the port's source lid was correctly 32
bits the type mixmatches need to be dealt with at the same time as
fixing the CNP header issue.
Fix these issues by:
- Using the ports lid for as the SLID for responding to FECN marked UD
packets
- Insure pkey is always 16 bit in this and subordinate routines
- Insure lids are 32 bits in this and subordinate routines
Cc: <stable@vger.kernel.org> # 4.14.x Fixes: 88733e3b8450 ("IB/hfi1: Add 16B UD support") Reviewed-by: Don Hiatt <don.hiatt@intel.com> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Currently, the kernel protects access to the agent ID allocator on a per
port basis using a spinlock, so it is impossible for two apps/threads on
the same port to get the same TID, but it is entirely possible for two
threads on different ports to end up with the same TID.
As this can be confusing (regardless of it being legal according to the
IB Spec 1.3, C13-18.1.1, in section 13.4.6.4 - TransactionID usage),
and as the rdma-core user space API for /dev/umad devices implies unique
TIDs even across ports, make the TID an atomic type so that no two
allocations, regardless of port number, will be the same.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
When a CQ is shared by multiple QPs, c4iw_flush_hw_cq() needs to acquire
corresponding QP lock before moving the CQEs into its corresponding SW
queue and accessing the SQ contents for completing a WR.
Ignore CQEs if corresponding QP is already flushed.
IB/uverbs: Fix kernel crash during MR deregistration flow
This patch fixes a crash that happens due to access to an
uninitialized DM pointer within the MR object.
The change makes sure the DM pointer in the MR object is set to
NULL during a non-DM MR creation to prevent a false indication
that this MR is related to a DM in the dereg flow.
IB/uverbs: Prevent reregistration of DM_MR to regular MR
This patch adds a check in the ib_uverbs_rereg_mr flow to make
sure there's no attempt to rereg a device memory MR to regular MR.
In such case the command will fail with -EINVAL status.
In the above functions, if error occurs in the above functions or
iptables rules drop skb after ip_local_out, kfree_skb will be called.
So it is not necessary to call kfree_skb in soft roce module again.
Or else crash will occur.
The steps to reproduce:
server client
--------- ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
--------- ---------
On server: rping -s -a 1.1.1.1 -v -C 10000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 10000 -S 512
The kernel configs CONFIG_DEBUG_KMEMLEAK and
CONFIG_DEBUG_OBJECTS are enabled on both server and client.
When rping runs, run the following command in server:
Jianchao Wang [Thu, 26 Apr 2018 03:52:39 +0000 (11:52 +0800)]
IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV
w/o RXE_START_MASK, the last_psn of IB_OPCODE_RC_SEND_ONLY_INV
will not be updated in update_wqe_psn, and the corresponding
wqe will not be acked in rxe_completer due to its last_psn is
zero. Finally, the other wqe will also not be able to be acked,
because the wqe of IB_OPCODE_RC_SEND_ONLY_INV with last_psn 0
is still there. This causes large amount of io timeout when
nvmeof is over rxe.
Add RXE_START_MASK for IB_OPCODE_RC_SEND_ONLY_INV to fix this.
Colin Ian King [Wed, 25 Apr 2018 16:24:04 +0000 (17:24 +0100)]
RDMA/iwpm: fix memory leak on map_info
In the cases where iwpm_hash_bucket is NULL and where function
get_mapinfo_hash_bucket returns NULL then the map_info is never added
to hash_bucket_head and hence there is a leak of map_info. Fix this
by nullifying hash_bucket_head and if that is null we know that
that map_info was not added to hash_bucket_head and hence map_info
should be free'd.
Detected by CoverityScan, CID#1222481 ("Resource Leak")
Fixes: 30dc5e63d6a5 ("RDMA/core: Add support for iWARP Port Mapper user space service") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.
Fix this by returning 'netdev_tx_t' in this driver too.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
which is a typedef for an enum type, but the implementation in this
driver returns an 'int'.
Fix this by returning 'netdev_tx_t' in this driver too.
Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
RDMA/cma: Fix use after destroy access to net namespace for IPoIB
There are few issues with validation of netdevice and listen id lookup
for IB (IPoIB) while processing incoming CM request as below.
1. While performing lookup of bind_list in cma_ps_find(), net namespace
of the netdevice can get deleted in cma_exit_net(), resulting in use
after free access of idr and/or net namespace structures.
This lookup occurs from the workqueue context (and not userspace
context where net namespace is always valid).
CPU0 CPU1
==== ====
bind_list = cma_ps_find();
move netdevice to new namespace
delete net namespace
cma_exit_net()
idr_destroy(idr);
[..]
cma_find_listener(bind_list, ..);
2. While netdevice is validated for IP address in given net namespace,
netdevice's net namespace and/or ifindex can change in
cma_get_net_dev() and cma_match_net_dev().
Above issues are overcome by using rcu lock along with netdevice
UP/DOWN state as described below.
When a net namespace is getting deleted, netdevice is closed and
shutdown before moving it back to init_net namespace.
change_net_namespace() synchronizes with any existing use of netdevice
before changing the netdev properties such as net or ifindex.
Once netdevice IFF_UP flags is cleared, such fields are not guaranteed
to be valid.
Therefore, rcu lock along with netdevice state check ensures that,
while route lookup and cm_id lookup is in progress, netdevice of
interest won't migrate to any other net namespace.
This ensures that associated net namespace of netdevice won't get
deleted while rcu lock is held for netdevice which is in IFF_UP state.
Fixes: fa20105e09e9 ("IB/cma: Add support for network namespaces") Fixes: 4be74b42a6d0 ("IB/cma: Separate port allocation to network namespaces") Fixes: f887f2ac87c2 ("IB/cma: Validate routing of incoming requests") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Previously, if a method contained mandatory attributes in a namespace
that wasn't given by the user, these attributes weren't validated.
Fixing this by iterating over all specification namespaces.
RDMA/cxgb4: release hw resources on device removal
The c4iw_rdev_close() logic was not releasing all the hw
resources (PBL and RQT memory) during the device removal
event (driver unload / system reboot). This can cause panic
in gen_pool_destroy().
The module remove function will wait for all the hw
resources to be released during the device removal event.
Fixes c12a67fe(iw_cxgb4: free EQ queue memory on last deref) Signed-off-by: Raju Rangoju <rajur@chelsio.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Cc: stable@vger.kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
In the functions rxe_mem_init_dma, rxe_mem_init_user, rxe_mem_init_fast
and copy_data, the function variable rxe is not used. So this function
variable rxe is removed.
CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
infiniband: hw: qib: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
Reference id -> 1c8f422059ae ("mm: change return type to
vm_fault_t")
infiniband: hw: hfi1: Change return type to vm_fault_t
Use new return type vm_fault_t for fault handler. For
now, this is just documenting that the function returns
a VM_FAULT value rather than an errno. Once all instances
are converted, vm_fault_t will become a distinct type.
Reference id -> 1c8f422059ae ("mm: change return type to
vm_fault_t")
INFINIBAND_SRP code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
CIFS_SMB_DIRECT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Tarick Bedeir <tarick@google.com> Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
INFINIBAND_SRPT code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
NVME_TARGET_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols.
So declare the kconfig dependency. This is necessary to allow for
enabling INFINIBAND without INFINIBAND_ADDR_TRANS.
NVME_RDMA code depends on INFINIBAND_ADDR_TRANS provided symbols. So
declare the kconfig dependency. This is necessary to allow for enabling
INFINIBAND without INFINIBAND_ADDR_TRANS.
Leon Romanovsky [Mon, 23 Apr 2018 14:01:56 +0000 (17:01 +0300)]
RDMA/mlx5: Properly check return value of mlx5_get_uars_page
Starting from commit 72f36be06138 ("net/mlx5: Fix mlx5_get_uars_page to
return error code") the mlx5_get_uars_page() call returns error in case
of failure, but it was mistakenly overlooked in the merge commit.
Fixes: e7996a9a77fc ("Merge tag v4.15 of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git") Reported-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/mlx5: Fix represent correct netdevice in dual port RoCE
In commit bcf87f1dbbec ("IB/mlx5: Listen to netdev register/unresiter events in switchdev mode")
incorrectly mapped primary device's netdevice to 2nd port netdevice.
It always represented primary port's netdevice for 2nd port netdevice
when ib representors were not used.
This results into failing to process CM request arriving on 2nd port due
to incorrect mapping of netdevice.
This fix corrects it by considering the right mdev.
Cc: <stable@vger.kernel.org> # 4.16 Fixes: bcf87f1dbbec ("IB/mlx5: Listen to netdev register/unresiter events in switchdev mode") Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/mlx5: Use unlimited rate when static rate is not supported
Before the change, if the user passed a static rate value different
than zero and the FW doesn't support static rate,
it would end up configuring rate of 2.5 GBps.
Fix this by using rate 0; unlimited, in cases where FW
doesn't support static rate configuration.
Cc: <stable@vger.kernel.org> # 3.10 Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Reviewed-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Danit Goldberg <danitg@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Mon, 23 Apr 2018 14:01:52 +0000 (17:01 +0300)]
RDMA/mlx5: Fix multiple NULL-ptr deref errors in rereg_mr flow
Failure in rereg MR releases UMEM but leaves the MR to be destroyed
by the user. As a result the following scenario may happen:
"create MR -> rereg MR with failure -> call to rereg MR again" and
hit "NULL-ptr deref or user memory access" errors.
Ensure that rereg MR is only performed on a non-dead MR.
IB/core: Fix deleting default GIDs when changing mac adddress
Before [1], When MAC address of the netdevice is changed, default GID is
supposed to get deleted and added back which affects the node and/or port
GUID in below sequence.
However, ib_cache_gid_del() was not getting invoked in non bonding
scenarios because event_ndev and rdma_ndev are same.
Therefore, fix such condition to ignore checking upper device when event
ndev and rdma_dev are same; similar to bond_set_netdev_default_gids().
Which this fix ib_cache_gid_del() is invoked correctly; however
ib_cache_gid_del() doesn't find the default GID for deletion because
find_gid() was given default_gid = false with
GID_ATTR_FIND_MASK_DEFAULT set.
But it was getting overwritten by ib_cache_gid_set_default_gid() later
on as part of add_cmd().
Therefore, mac address change used to work for default GID.
With refactor series [1], this incorrect behavior is detected.
Therefore,
when deleting default GID, set default_gid and set MASK flag.
when deleting IP based GID, clear default_gid and set MASK flag.
IB/core: Fix to avoid deleting IPv6 look alike default GIDs
When IPv6 link local address is removed, if it matches with the default
GID, default GID(s)s gets removed which may not be a desired behavior.
This behavior is introduced by refactor work in Fixes tag.
When IPv6 link address is removed, removing its equivalent RoCEv2 GID
which exactly matches with default RoCEv2 GID, is right thing to do.
However achieving it correctly requires lot more changes, likely in
roce_gid_mgmt.c and core/cache.c. This should be done as independent
patch.
Therefore, this patch preserves behavior of not deleteing default GIDs.
This is done by providing explicit hint to consider default GID property
using mask and default_gid; similar to add_gid().
IB/core: Don't allow default GID addition at non reseved slots
Default GIDs are marked reserved at the start of the GID table at index
0 and 1 by gid_table_reserve_default(). Currently when default GID is
requested, it can still allocates an empty slot which was not marked as
RESERVED for default GID, which is incorrect.
At least in current code flow of roce_gid_mgmt.c, in theory we can
still request to allocate more than one/two default GIDs depending
on how upper devices are setup.
Therefore, it is better for cache layer to only allow our reserved slots
to be used by default GID allocation requests.
Jason Gunthorpe [Fri, 20 Apr 2018 15:49:10 +0000 (09:49 -0600)]
uapi: Fix SPDX tags for files referring to the 'OpenIB.org' license
Based on discussion with Kate Stewart this license is not a
BSD-2-Clause, but is now formally identified as Linux-OpenIB
by SPDX.
The key difference between the licenses is in the 'warranty'
paragraph.
if_infiniband.h refers to the 'OpenIB.org' license, but
does not include the text, instead it links to an obsolete
web site that contains a license that matches the BSD-2-Clause
SPX. There is no 'three clause' version of the OpenIB.org
license.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Doug Ledford <dledford@redhat.com>
The RDMA CM will select a source device and address by consulting
the routing table if no source address is passed into
rdma_resolve_address(). Userspace will ask for this by passing an
all-zero source address in the RESOLVE_IP command. Unfortunately
the new check for non-zero address size rejects this with EINVAL,
which breaks valid userspace applications.
Fix this by explicitly allowing a zero address family for the source.
Fixes: 2975d5de6428 ("RDMA/ucma: Check AF family prior resolving address") Cc: <stable@vger.kernel.org> Signed-off-by: Roland Dreier <roland@purestorage.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Thu, 5 Apr 2018 03:00:01 +0000 (21:00 -0600)]
RDMA/ucma: Check for a cm_id->device in all user calls that need it
This is done by auditing all callers of ucma_get_ctx and switching the
ones that unconditionally touch ->device to ucma_get_ctx_dev. This covers
a little less than half of the call sites.
The 11 remaining call sites to ucma_get_ctx() were manually audited.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Follow the advice from Bart, the function refcount_inc is replaced
with skb_get in commit 99dae690255e ("IB/rxe: optimize mcast recv process")
and commit 86af61764151 ("IB/rxe: remove unnecessary skb_clone").
IB/uverbs: Add missing braces in anonymous union initializers
With gcc-4.1.2:
drivers/infiniband/core/uverbs_std_types_flow_action.c:366: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:367: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:368: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:369: warning: (near initialization for ‘uverbs_flow_action_esp_keymat[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:376: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: missing braces around initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:377: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>.<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:379: warning: (near initialization for ‘uverbs_flow_action_esp_replay[0].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:383: error: unknown field ‘ptr’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:384: error: unknown field ‘type’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘min_len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: error: unknown field ‘len’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:385: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: error: unknown field ‘flags’ specified in initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: excess elements in union initializer
drivers/infiniband/core/uverbs_std_types_flow_action.c:386: warning: (near initialization for ‘uverbs_flow_action_esp_replay[1].<anonymous>’)
Jia-Ju Bai [Wed, 11 Apr 2018 07:33:06 +0000 (15:33 +0800)]
infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_l2param_change
i40iw_l2param_change() is never called in atomic context.
i40iw_make_listen_node() is only set as ".l2_param_change"
in struct i40e_client_ops, and this function pointer is not called
in atomic context.
Despite never getting called from atomic context,
i40iw_l2param_change() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.
This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jia-Ju Bai [Wed, 11 Apr 2018 07:32:48 +0000 (15:32 +0800)]
infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_make_listen_node
i40iw_make_listen_node() is never called in atomic context.
i40iw_make_listen_node() is only called by i40iw_create_listen, which is
set as ".create_listen" in struct iw_cm_verbs.
Despite never getting called from atomic context,
i40iw_make_listen_node() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.
This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jia-Ju Bai [Wed, 11 Apr 2018 07:32:25 +0000 (15:32 +0800)]
infiniband: i40iw: Replace GFP_ATOMIC with GFP_KERNEL in i40iw_add_mqh_4
i40iw_add_mqh_4() is never called in atomic context, because it
calls rtnl_lock() that can sleep.
Despite never getting called from atomic context,
i40iw_add_mqh_4() calls kzalloc() with GFP_ATOMIC,
which does not sleep for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
which can sleep and improve the possibility of sucessful allocation.
This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Acked-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 3 Apr 2018 04:52:04 +0000 (07:52 +0300)]
RDMA/rdma_cm: Delete rdma_addr_client
The only thing it does is block module unload while work is posted from
rdma_resolve_ip().
However, this is not the right place to do this. The users of
rdma_resolve_ip() must ensure their own module does not unload until
rdma_resolve_ip() calls the callback, or until rdma_addr_cancel() is
called.
Similarly callers to rdma_addr_find_l2_eth_by_grh() must ensure their
module does not unload while they are calling code.
The only two users are already safe, so there is no need for this.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 3 Apr 2018 04:52:03 +0000 (07:52 +0300)]
RDMA/rdma_cm: Make rdma_addr_cancel into a fence
Currently rdma_addr_cancel does not prevent the callback from being used,
this is surprising and hard to reason about. There does not appear to be a
bug here as the only user of this API does refcount properly, fixing it
only to increase clarity.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 3 Apr 2018 04:52:02 +0000 (07:52 +0300)]
RDMA/rdma_cm: Remove process_req and timer sorting
Now that the work queue is used directly to launch and track the work
there is no need for the second processing function to do 'all list
entries'. Just schedule all entries onto the main work queue directly.
We can also drop all of the useless list sorting now, as the workqueue
sorts by expiration time automatically.
This change requires switching lock to a spinlock as netdev notifiers
are called in an atomic context, this is now easy since the lock does
not need to be held across the lookup, that is already single
threaded due to the work queue.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Randy Dunlap [Tue, 17 Apr 2018 01:51:50 +0000 (18:51 -0700)]
infiniband: mlx5: fix build errors when INFINIBAND_USER_ACCESS=m
Fix build errors when INFINIBAND_USER_ACCESS=m and MLX5_INFINIBAND=y.
The build error occurs when the mlx5 driver code attempts to use
USER_ACCESS interfaces, which are built as a loadable module.
Fixes these build errors:
drivers/infiniband/hw/mlx5/main.o: In function `populate_specs_root':
../drivers/infiniband/hw/mlx5/main.c:4982: undefined reference to `uverbs_default_get_objects'
../drivers/infiniband/hw/mlx5/main.c:4994: undefined reference to `uverbs_alloc_spec_tree'
drivers/infiniband/hw/mlx5/main.o: In function `depopulate_specs_root':
../drivers/infiniband/hw/mlx5/main.c:5001: undefined reference to `uverbs_free_spec_tree'
Build-tested with multiple config combinations.
Fixes: 8c84660bb437 ("IB/mlx5: Initialize the parsing tree root without the help of uverbs") Cc: stable@vger.kernel.org # reported against 4.16 Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
The header file fs_helpers.h is included twice. So it should be removed.
Fixes: 802c2125689d ("IB/mlx5: Add IPsec support for egress and ingress") CC: Srinivas Eeda <srinivas.eeda@oracle.com> CC: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Merge tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"We have queued a few more fixes (error handling, log replay,
softlockup) and the rest is SPDX updates that touche almost all files
so the diffstat is long"
* tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: Only check first key for committed tree blocks
btrfs: add SPDX header to Kconfig
btrfs: replace GPL boilerplate by SPDX -- sources
btrfs: replace GPL boilerplate by SPDX -- headers
Btrfs: fix loss of prealloc extents past i_size after fsync log replay
Btrfs: clean up resources during umount after trans is aborted
btrfs: Fix possible softlock on single core machines
Btrfs: bail out on error during replay_dir_deletes
Btrfs: fix NULL pointer dereference in log_dir_items
Merge tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"SMB3 fixes, a few for stable, and some important cleanup work from
Ronnie of the smb3 transport code"
* tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: change validate_buf to validate_iov
cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data()
cifs: Change SMB2_open to return an iov for the error parameter
cifs: add resp_buf_size to the mid_q_entry structure
smb3.11: replace a 4 with server->vals->header_preamble_size
cifs: replace a 4 with server->vals->header_preamble_size
cifs: add pdu_size to the TCP_Server_Info structure
SMB311: Improve checking of negotiate security contexts
SMB3: Fix length checking of SMB3.11 negotiate request
CIFS: add ONCE flag for cifs_dbg type
cifs: Use ULL suffix for 64-bit constant
SMB3: Log at least once if tree connect fails during reconnect
cifs: smb2pdu: Fix potential NULL pointer dereference
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of minor (and safe changes) that didn't make the initial
pull request plus some bug fixes.
The status handling code is actually a running regression from the
previous merge window which had an incomplete fix (now reverted) and
most of the remaining bug fixes are for problems older than the
current merge window"
[ Side note: this merge also takes the base kernel git repository to 6+
million objects for the first time. Technically we hit it a couple of
merges ago already if you count all the tag objects, but now it
reaches 6M+ objects reachable from HEAD.
I was joking around that that's when I should switch to 5.0, because
3.0 happened at the 2M mark, and 4.0 happened at 4M objects. But
probably not, even if numerology is about as good a reason as any.
- Linus ]
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: devinfo: Add Microsoft iSCSI target to 1024 sector blacklist
scsi: cxgb4i: silence overflow warning in t4_uld_rx_handler()
scsi: dpt_i2o: Use after free in I2ORESETCMD ioctl
scsi: core: Make scsi_result_to_blk_status() recognize CONDITION MET
scsi: core: Rename __scsi_error_from_host_byte() into scsi_result_to_blk_status()
Revert "scsi: core: return BLK_STS_OK for DID_OK in __scsi_error_from_host_byte()"
scsi: aacraid: Insure command thread is not recursively stopped
scsi: qla2xxx: Correct setting of SAM_STAT_CHECK_CONDITION
scsi: qla2xxx: correctly shift host byte
scsi: qla2xxx: Fix race condition between iocb timeout and initialisation
scsi: qla2xxx: Avoid double completion of abort command
scsi: qla2xxx: Fix small memory leak in qla2x00_probe_one on probe failure
scsi: scsi_dh: Don't look for NULL devices handlers by name
scsi: core: remove redundant assignment to shost->use_blk_mq
Merge tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- pass HOSTLDFLAGS when compiling single .c host programs
- build genksyms lexer and parser files instead of using shipped
versions
- rename *-asn1.[ch] to *.asn1.[ch] for suffix consistency
- let the top .gitignore globally ignore artifacts generated by flex,
bison, and asn1_compiler
- let the top Makefile globally clean artifacts generated by flex,
bison, and asn1_compiler
- use safer .SECONDARY marker instead of .PRECIOUS to prevent
intermediate files from being removed
- support -fmacro-prefix-map option to make __FILE__ a relative path
- fix # escaping to prepare for the future GNU Make release
- clean up deb-pkg by using debian tools instead of handrolled
source/changes generation
- improve rpm-pkg portability by supporting kernel-install as a
fallback of new-kernel-pkg
- extend Kconfig listnewconfig target to provide more information
* tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: extend output of 'listnewconfig'
kbuild: rpm-pkg: use kernel-install as a fallback for new-kernel-pkg
Kbuild: fix # escaping in .cmd files for future Make
kbuild: deb-pkg: split generating packaging and build
kbuild: use -fmacro-prefix-map to make __FILE__ a relative path
kbuild: mark $(targets) as .SECONDARY and remove .PRECIOUS markers
kbuild: rename *-asn1.[ch] to *.asn1.[ch]
kbuild: clean up *-asn1.[ch] patterns from top-level Makefile
.gitignore: move *-asn1.[ch] patterns to the top-level .gitignore
kbuild: add %.dtb.S and %.dtb to 'targets' automatically
kbuild: add %.lex.c and %.tab.[ch] to 'targets' automatically
genksyms: generate lexer and parser during build instead of shipping
kbuild: clean up *.lex.c and *.tab.[ch] patterns from top-level Makefile
.gitignore: move *.lex.c *.tab.[ch] patterns to the top-level .gitignore
kbuild: use HOSTLDFLAGS for single .c executables
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes and updates for x86:
- Address a swiotlb regression which was caused by the recent DMA
rework and made driver fail because dma_direct_supported() returned
false
- Fix a signedness bug in the APIC ID validation which caused invalid
APIC IDs to be detected as valid thereby bloating the CPU possible
space.
- Fix inconsisten config dependcy/select magic for the MFD_CS5535
driver.
- Fix a corruption of the physical address space bits when encryption
has reduced the address space and late cpuinfo updates overwrite
the reduced bit information with the original value.
- Dominiks syscall rework which consolidates the architecture
specific syscall functions so all syscalls can be wrapped with the
same macros. This allows to switch x86/64 to struct pt_regs based
syscalls. Extend the clearing of user space controlled registers in
the entry patch to the lower registers"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Fix signedness bug in APIC ID validity checks
x86/cpu: Prevent cpuinfo_x86::x86_phys_bits adjustment corruption
x86/olpc: Fix inconsistent MFD_CS5535 configuration
swiotlb: Use dma_direct_supported() for swiotlb_ops
syscalls/x86: Adapt syscall_wrapper.h to the new syscall stub naming convention
syscalls/core, syscalls/x86: Rename struct pt_regs-based sys_*() to __x64_sys_*()
syscalls/core, syscalls/x86: Clean up compat syscall stub naming convention
syscalls/core, syscalls/x86: Clean up syscall stub naming convention
syscalls/x86: Extend register clearing on syscall entry to lower registers
syscalls/x86: Unconditionally enable 'struct pt_regs' based syscalls on x86_64
syscalls/x86: Use 'struct pt_regs' based syscall calling for IA32_EMULATION and x32
syscalls/core: Prepare CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y for compat syscalls
syscalls/x86: Use 'struct pt_regs' based syscall calling convention for 64-bit syscalls
syscalls/core: Introduce CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
x86/syscalls: Don't pointlessly reload the system call number
x86/mm: Fix documentation of module mapping range with 4-level paging
x86/cpuid: Switch to 'static const' specifier
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Thomas Gleixner:
"Another series of PTI related changes:
- Remove the manual stack switch for user entries from the idtentry
code. This debloats entry by 5k+ bytes of text.
- Use the proper types for the asm/bootparam.h defines to prevent
user space compile errors.
- Use PAGE_GLOBAL for !PCID systems to gain back performance
- Prevent setting of huge PUD/PMD entries when the entries are not
leaf entries otherwise the entries to which the PUD/PMD points to
and are populated get lost"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
x86/pti: Leave kernel text global for !PCID
x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image
x86/pti: Enable global pages for shared areas
x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init
x86/mm: Comment _PAGE_GLOBAL mystery
x86/mm: Remove extra filtering in pageattr code
x86/mm: Do not auto-massage page protections
x86/espfix: Document use of _PAGE_GLOBAL
x86/mm: Introduce "default" kernel PTE mask
x86/mm: Undo double _PAGE_PSE clearing
x86/mm: Factor out pageattr _PAGE_GLOBAL setting
x86/entry/64: Drop idtentry's manual stack switch for user entries
x86/uapi: Fix asm/bootparam.h userspace compilation errors
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more perf updates from Thomas Gleixner:
"A rather large set of perf updates:
Kernel:
- Fix various initialization issues
- Prevent creating [ku]probes for not CAP_SYS_ADMIN users
Tooling:
- Show only failing syscalls with 'perf trace --failure' (Arnaldo
Carvalho de Melo)
e.g: See what 'openat' syscalls are failing:
# perf trace --failure -e openat
762.323 ( 0.007 ms): VideoCapture/4566 openat(dfd: CWD, filename: /dev/video2) = -1 ENOENT No such file or directory
<SNIP N /dev/videoN open attempts... sigh, where is that improvised camera lid?!? >
790.228 ( 0.008 ms): VideoCapture/4566 openat(dfd: CWD, filename: /dev/video63) = -1 ENOENT No such file or directory
^C#
- Show information about the event (freq, nr_samples, total
period/nr_events) in the annotate --tui and --stdio2 'perf
annotate' output, similar to the first line in the 'perf report
--tui', but just for the samples for a the annotated symbol
(Arnaldo Carvalho de Melo)
- Introduce 'perf version --build-options' to show what features were
linked, aliased as well as a shorter 'perf -vv' (Jin Yao)
- Add a "dso_size" sort order (Kim Phillips)
- Remove redundant ')' in the tracepoint output in 'perf trace'
(Changbin Du)
- Synchronize x86's cpufeatures.h, no effect on toolss (Arnaldo
Carvalho de Melo)
- Show group details on the title line in the annotate browser and
'perf annotate --stdio2' output, so that the per-event columns can
have headers (Arnaldo Carvalho de Melo)
- Fixup vertical line separating metrics from instructions and
cleaning unused lines at the bottom, both in the annotate TUI
browser (Arnaldo Carvalho de Melo)
- Remove duplicated 'samples' in lost samples warning in
'perf report' (Arnaldo Carvalho de Melo)
- Synchronize i915_drm.h, silencing the perf build process,
automagically adding support for the new DRM_I915_QUERY ioctl
(Arnaldo Carvalho de Melo)
- Make auxtrace_queues__add_buffer() allocate struct buffer, from a
patchkit already applied (Adrian Hunter)
- Fix the --stdio2/TUI annotate output to include group details, be
it for a recorded '{a,b,f}' explicit event group or when forcing
group display using 'perf report --group' for a set of events not
recorded as a group (Arnaldo Carvalho de Melo)
- Fix display artifacts in the ui browser (base class for the
annotate and main report/top TUI browser) related to the extra
title lines work (Arnaldo Carvalho de Melo)
- perf auxtrace refactorings, leftovers from a previously partially
processed patchset (Adrian Hunter)
- Fix the builtin clang build (Sandipan Das, Arnaldo Carvalho de
Melo)
- Synchronize i915_drm.h, silencing a perf build warning and in the
process automagically adding support for a new ioctl command
(Arnaldo Carvalho de Melo)
- Fix a strncpy issue in uprobe tracing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
perf/core: Need CAP_SYS_ADMIN to create k/uprobe with perf_event_open()
tracing/uprobe_event: Fix strncpy corner case
perf/core: Fix perf_uprobe_init()
perf/core: Fix perf_kprobe_init()
perf/core: Fix use-after-free in uprobe_perf_close()
perf tests clang: Fix function name for clang IR test
perf clang: Add support for recent clang versions
perf tools: Fix perf builds with clang support
perf tools: No need to include namespaces.h in util.h
perf hists browser: Remove leftover from row returned from refresh
perf hists browser: Show extra_title_lines in the 'D' debug hotkey
perf auxtrace: Make auxtrace_queues__add_buffer() do CPU filtering
tools headers uapi: Synchronize i915_drm.h
perf report: Remove duplicated 'samples' in lost samples warning
perf ui browser: Fixup cleaning unused lines at the bottom
perf annotate browser: Fixup vertical line separating metrics from instructions
perf annotate: Show group details on the title line
perf auxtrace: Make auxtrace_queues__add_buffer() allocate struct buffer
perf/x86/intel: Move regs->flags EXACT bit init
perf trace: Remove redundant ')'
...
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 EFI bootup fixlet from Thomas Gleixner:
"A single fix for an early boot warning caused by invoking
this_cpu_has() before SMP initialization"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix bogus warning during EFI bootup, use boot_cpu_has() instead of this_cpu_has() in build_cr3_noflush()
Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq affinity fixes from Thomas Gleixner:
- Fix error path handling in the affinity spreading code
- Make affinity spreading smarter to avoid issues on systems which
claim to have hotpluggable CPUs while in fact they can't hotplug
anything.
So instead of trying to spread the vectors (and thereby the
associated device queues) to all possibe CPUs, spread them on all
present CPUs first. If there are left over vectors after that first
step they are spread among the possible, but not present CPUs which
keeps the code backwards compatible for virtual decives and NVME
which allocate a queue per possible CPU, but makes the spreading
smarter for devices which have less queues than possible or present
CPUs.
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/affinity: Spread irq vectors among present CPUs as far as possible
genirq/affinity: Allow irq spreading from a given starting point
genirq/affinity: Move actual irq vector spreading into a helper function
genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
genirq/affinity: Don't return with empty affinity masks on error
Merge tag 'for-linus' of git://github.com/openrisc/linux
Pull OpenRISC fixlet from Stafford Horne:
"Just one small thing here, it came in a while back but I didnt have
anything in my 4.16 queue, still its the only thing for 4.17 so
sending it alone.
Small cleanup: remove unused __ARCH_HAVE_MMU define"
* tag 'for-linus' of git://github.com/openrisc/linux:
openrisc: remove unused __ARCH_HAVE_MMU define
Merge tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix crashes when loading modules built with a different
CONFIG_RELOCATABLE value by adding CONFIG_RELOCATABLE to vermagic.
- Fix busy loops in the OPAL NVRAM driver if we get certain error
conditions from firmware.
- Remove tlbie trace points from KVM code that's called in real mode,
because it causes crashes.
- Fix checkstops caused by invalid tlbiel on Power9 Radix.
- Ensure the set of CPU features we "know" are always enabled is
actually the minimal set when we build with support for firmware
supplied CPU features.
Thanks to: Aneesh Kumar K.V, Anshuman Khandual, Nicholas Piggin.
* tag 'powerpc-4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Fix CPU_FTRS_ALWAYS vs DT CPU features
powerpc/mm/radix: Fix checkstops caused by invalid tlbiel
KVM: PPC: Book3S HV: trace_tlbie must not be called in realmode
powerpc/8xx: Fix build with hugetlbfs enabled
powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
powerpc/fscr: Enable interrupts earlier before calling get_user()
powerpc/64s: Fix section mismatch warnings from setup_rfi_flush()
powerpc/modules: Fix crashes by adding CONFIG_RELOCATABLE to vermagic
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits)
kernel/kexec_file.c: move purgatories sha256 to common code
kernel/kexec_file.c: allow archs to set purgatory load address
kernel/kexec_file.c: remove mis-use of sh_offset field during purgatory load
kernel/kexec_file.c: remove unneeded variables in kexec_purgatory_setup_sechdrs
kernel/kexec_file.c: remove unneeded for-loop in kexec_purgatory_setup_sechdrs
kernel/kexec_file.c: split up __kexec_load_puragory
kernel/kexec_file.c: use read-only sections in arch_kexec_apply_relocations*
kernel/kexec_file.c: search symbols in read-only kexec_purgatory
kernel/kexec_file.c: make purgatory_info->ehdr const
kernel/kexec_file.c: remove checks in kexec_purgatory_load
include/linux/kexec.h: silence compile warnings
kexec_file, x86: move re-factored code to generic side
x86: kexec_file: clean up prepare_elf64_headers()
x86: kexec_file: lift CRASH_MAX_RANGES limit on crash_mem buffer
x86: kexec_file: remove X86_64 dependency from prepare_elf64_headers()
x86: kexec_file: purge system-ram walking from prepare_elf64_headers()
kexec_file,x86,powerpc: factor out kexec_file_ops functions
kexec_file: make use of purgatory optional
proc: revalidate misc dentries
mm, slab: reschedule cache_reap() on the same CPU
...
Philipp Rudo [Fri, 13 Apr 2018 22:36:46 +0000 (15:36 -0700)]
kernel/kexec_file.c: move purgatories sha256 to common code
The code to verify the new kernels sha digest is applicable for all
architectures. Move it to common code.
One problem is the string.c implementation on x86. Currently sha256
includes x86/boot/string.h which defines memcpy and memset to be gcc
builtins. By moving the sha256 implementation to common code and
changing the include to linux/string.h both functions are no longer
defined. Thus definitions have to be provided in x86/purgatory/string.c
Link: http://lkml.kernel.org/r/20180321112751.22196-12-prudo@linux.vnet.ibm.com Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Philipp Rudo [Fri, 13 Apr 2018 22:36:43 +0000 (15:36 -0700)]
kernel/kexec_file.c: allow archs to set purgatory load address
For s390 new kernels are loaded to fixed addresses in memory before they
are booted. With the current code this is a problem as it assumes the
kernel will be loaded to an 'arbitrary' address. In particular,
kexec_locate_mem_hole searches for a large enough memory region and sets
the load address (kexec_bufer->mem) to it.
Luckily there is a simple workaround for this problem. By returning 1
in arch_kexec_walk_mem, kexec_locate_mem_hole is turned off. This
allows the architecture to set kbuf->mem by hand. While the trick works
fine for the kernel it does not for the purgatory as here the
architectures don't have access to its kexec_buffer.
Give architectures access to the purgatories kexec_buffer by changing
kexec_load_purgatory to take a pointer to it. With this change
architectures have access to the buffer and can edit it as they need.
A nice side effect of this change is that we can get rid of the
purgatory_info->purgatory_load_address field. As now the information
stored there can directly be accessed from kbuf->mem.
Link: http://lkml.kernel.org/r/20180321112751.22196-11-prudo@linux.vnet.ibm.com Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com> Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Philipp Rudo [Fri, 13 Apr 2018 22:36:39 +0000 (15:36 -0700)]
kernel/kexec_file.c: remove mis-use of sh_offset field during purgatory load
The current code uses the sh_offset field in purgatory_info->sechdrs to
store a pointer to the current load address of the section. Depending
whether the section will be loaded or not this is either a pointer into
purgatory_info->purgatory_buf or kexec_purgatory. This is not only a
violation of the ELF standard but also makes the code very hard to
understand as you cannot tell if the memory you are using is read-only
or not.
Remove this misuse and store the offset of the section in
pugaroty_info->purgatory_buf in sh_offset.
Link: http://lkml.kernel.org/r/20180321112751.22196-10-prudo@linux.vnet.ibm.com Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Philipp Rudo [Fri, 13 Apr 2018 22:36:32 +0000 (15:36 -0700)]
kernel/kexec_file.c: remove unneeded for-loop in kexec_purgatory_setup_sechdrs
To update the entry point there is an extra loop over all section
headers although this can be done in the main loop. So move it there
and eliminate the extra loop and variable to store the 'entry section
index'.
Also, in the main loop, move the usual case, i.e. non-bss section, out
of the extra if-block.
Link: http://lkml.kernel.org/r/20180321112751.22196-8-prudo@linux.vnet.ibm.com Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com> Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Philipp Rudo [Fri, 13 Apr 2018 22:36:28 +0000 (15:36 -0700)]
kernel/kexec_file.c: split up __kexec_load_puragory
When inspecting __kexec_load_purgatory you find that it has two tasks
1) setting up the kexec_buffer for the new kernel and,
2) setting up pi->sechdrs for the final load address.
The two tasks are independent of each other. To improve readability
split up __kexec_load_purgatory into two functions, one for each task,
and call them directly from kexec_load_purgatory.
Link: http://lkml.kernel.org/r/20180321112751.22196-7-prudo@linux.vnet.ibm.com Signed-off-by: Philipp Rudo <prudo@linux.vnet.ibm.com> Acked-by: Dave Young <dyoung@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>