Arınç ÜNAL [Thu, 14 Mar 2024 09:33:42 +0000 (12:33 +0300)]
net: dsa: mt7530: fix handling of all link-local frames
Currently, the MT753X switches treat frames with :01-0D and :0F MAC DAs as
regular multicast frames, therefore flooding them to user ports.
On page 205, section "8.6.3 Frame filtering" of the active standard, IEEE
Std 802.1Q™-2022, it is stated that frames with 01:80:C2:00:00:00-0F as MAC
DA must only be propagated to C-VLAN and MAC Bridge components. That means
VLAN-aware and VLAN-unaware bridges. On the switch designs with CPU ports,
these frames are supposed to be processed by the CPU (software). So we make
the switch only forward them to the CPU port. And if received from a CPU
port, forward to a single port. The software is responsible of making the
switch conform to the latter by setting a single port as destination port
on the special tag.
This switch intellectual property cannot conform to this part of the
standard fully. Whilst the REV_UN frame tag covers the remaining :04-0D and
:0F MAC DAs, it also includes :22-FF which the scope of propagation is not
supposed to be restricted for these MAC DAs.
Set frames with :01-03 MAC DAs to be trapped to the CPU port(s). Add a
comment for the remaining MAC DAs.
Note that the ingress port must have a PVID assigned to it for the switch
to forward untagged frames. A PVID is set by default on VLAN-aware and
VLAN-unaware ports. However, when the network interface that pertains to
the ingress port is attached to a vlan_filtering enabled bridge, the user
can remove the PVID assignment from it which would prevent the link-local
frames from being trapped to the CPU port. I am yet to see a way to forward
link-local frames while preventing other untagged frames from being
forwarded too.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Whether VLAN-aware or not, on every VID VLAN table entry that has the CPU
port as a member of it, frames are set to egress the CPU port with the VLAN
tag stacked. This is so that VLAN tags can be appended after hardware
special tag (called DSA tag in the context of Linux drivers).
For user ports on a VLAN-unaware bridge, frame ingressing the user port
egresses CPU port with only the special tag.
For user ports on a VLAN-aware bridge, frame ingressing the user port
egresses CPU port with the special tag and the VLAN tag.
This causes issues with link-local frames, specifically BPDUs, because the
software expects to receive them VLAN-untagged.
There are two options to make link-local frames egress untagged. Setting
CONSISTENT or UNTAGGED on the EG_TAG bits on the relevant register.
CONSISTENT means frames egress exactly as they ingress. That means
egressing with the VLAN tag they had at ingress or egressing untagged if
they ingressed untagged. Although link-local frames are not supposed to be
transmitted VLAN-tagged, if they are done so, when egressing through a CPU
port, the special tag field will be broken.
BPDU egresses CPU port with VLAN tag egressing stacked, received on
software:
To prevent confusing the software, force the frame to egress UNTAGGED
instead of CONSISTENT. This way, frames can't possibly be received TAGGED
by software which would have the special tag field broken.
VLAN Tag Egress Procedure
For all frames, one of these options set the earliest in this order will
apply to the frame:
- EG_TAG in certain registers for certain frames.
This will apply to frame with matching MAC DA or EtherType.
- EG_TAG in the address table.
This will apply to frame at its incoming port.
- EG_TAG in the PVC register.
This will apply to frame at its incoming port.
- EG_CON and [EG_TAG per port] in the VLAN table.
This will apply to frame at its outgoing port.
- EG_TAG in the PCR register.
This will apply to frame at its outgoing port.
EG_TAG in certain registers for certain frames:
PPPoE Discovery_ARP/RARP: PPP_EG_TAG and ARP_EG_TAG in the APC register.
IGMP_MLD: IGMP_EG_TAG and MLD_EG_TAG in the IMC register.
BPDU and PAE: BPDU_EG_TAG and PAE_EG_TAG in the BPC register.
REV_01 and REV_02: R01_EG_TAG and R02_EG_TAG in the RGAC1 register.
REV_03 and REV_0E: R03_EG_TAG and R0E_EG_TAG in the RGAC2 register.
REV_10 and REV_20: R10_EG_TAG and R20_EG_TAG in the RGAC3 register.
REV_21 and REV_UN: R21_EG_TAG and RUN_EG_TAG in the RGAC4 register.
With this change, it can be observed that a bridge interface with stp_state
and vlan_filtering enabled will properly block ports now.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
Report RCU QS for busy network kthreads
This changeset fixes a common problem for busy networking kthreads.
These threads, e.g. NAPI threads, typically will do:
* polling a batch of packets
* if there are more work, call cond_resched() to allow scheduling
* continue to poll more packets when rx queue is not empty
We observed this being a problem in production, since it can block RCU
tasks from making progress under heavy load. Investigation indicates
that just calling cond_resched() is insufficient for RCU tasks to reach
quiescent states. This also has the side effect of frequently clearing
the TIF_NEED_RESCHED flag on voluntary preempt kernels. As a result,
schedule() will not be called in these circumstances, despite schedule()
in fact provides required quiescent states. This at least affects NAPI
threads, napi_busy_loop, and also cpumap kthread.
By reporting RCU QSes in these kthreads periodically before cond_resched, the
blocked RCU waiters can correctly progress. Instead of just reporting QS for
RCU tasks, these code share the same concern as noted in the commit d28139c4e967 ("rcu: Apply RCU-bh QSes to RCU-sched and RCU-preempt when safe").
So report a consolidated QS for safety.
It is worth noting that, although this problem is reproducible in
napi_busy_loop, it only shows up when setting the polling interval to as high
as 2ms, which is far larger than recommended 50us-100us in the documentation.
So napi_busy_loop is left untouched.
Lastly, this does not affect RT kernels, which does not enter the scheduler
through cond_resched(). Without the mentioned side effect, schedule() will
be called time by time, and clear the RCU task holdouts.
Yan Zhai [Tue, 19 Mar 2024 20:44:40 +0000 (13:44 -0700)]
bpf: report RCU QS in cpumap kthread
When there are heavy load, cpumap kernel threads can be busy polling
packets from redirect queues and block out RCU tasks from reaching
quiescent states. It is insufficient to just call cond_resched() in such
context. Periodically raise a consolidated RCU QS before cond_resched
fixes the problem.
Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/c17b9f1517e19d813da3ede5ed33ee18496bb5d8.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yan Zhai [Tue, 19 Mar 2024 20:44:37 +0000 (13:44 -0700)]
net: report RCU QS on threaded NAPI repolling
NAPI threads can keep polling packets under load. Currently it is only
calling cond_resched() before repolling, but it is not sufficient to
clear out the holdout of RCU tasks, which prevent BPF tracing programs
from detaching for long period. This can be reproduced easily with
following set up:
ip netns add test1
ip netns add test2
ip -n test1 link add veth1 type veth peer name veth2 netns test2
ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up
ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo
ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2
for i in `seq 10 210`; do
for j in `seq 10 210`; do
ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
done
done
ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off
Then run an iperf3 client/server and a bpftrace script can trigger it:
Yan Zhai [Tue, 19 Mar 2024 20:44:34 +0000 (13:44 -0700)]
rcu: add a helper to report consolidated flavor QS
When under heavy load, network processing can run CPU-bound for many
tens of seconds. Even in preemptible kernels (non-RT kernel), this can
block RCU Tasks grace periods, which can cause trace-event removal to
take more than a minute, which is unacceptably long.
This commit therefore creates a new helper function that passes through
both RCU and RCU-Tasks quiescent states every 100 milliseconds. This
hard-coded value suffices for current workloads.
Suggested-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://lore.kernel.org/r/90431d46ee112d2b0af04dbfe936faaca11810a5.1710877680.git.yan@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Herve Codina [Thu, 14 Mar 2024 12:00:06 +0000 (13:00 +0100)]
lib/bitmap: Fix bitmap_scatter() and bitmap_gather() kernel doc
The make htmldoc command failed with the following error
... include/linux/bitmap.h:524: ERROR: Unexpected indentation.
... include/linux/bitmap.h:524: CRITICAL: Unexpected section title or transition.
Move the visual representation to a literal block.
Fixes: de5f84338970 ("lib/bitmap: Introduce bitmap_scatter() and bitmap_gather() helpers") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/linux-kernel/20240312153059.3ffde1b7@canb.auug.org.au/ Signed-off-by: Herve Codina <herve.codina@bootlin.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Yury Norov <yury.norov@gmail.com> Link: https://lore.kernel.org/r/20240314120006.458580-1-herve.codina@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patchset fixes the problems related to RVU mailbox.
During long run tests some times VF commands like setting
MTU or toggling interface fails because VF mailbox is timedout
waiting for response from PF.
Below are the fixes
Patch 1: There are two types of messages in RVU mailbox namely up and down
messages. Down messages are synchronous messages where a PF/VF sends
a message to AF and AF replies back with response. UP messages are
notifications and are asynchronous like AF sending link events to
PF. When VF sends a down message to PF, PF forwards to AF and sends
the response from AF back to VF. PF has to forward VF messages since
there is no path in hardware for VF to send directly to AF.
There is one mailbox interrupt from AF to PF when raised could mean
two scenarios one is where AF sending reply to PF for a down message
sent by PF and another one is AF sending up message asynchronously
when link changed for that PF. Receiving the up message interrupt while
PF is in middle of forwarding down message causes mailbox errors.
Fix this by receiver detecting the type of message from the mbox data register
set by sender.
Patch 2:
During VF driver remove, VF has to wait until last message is
completed and then turn off mailbox interrupts from PF.
Patch 3:
Do not use ordered workqueue for message processing since multiple works are
queued simultaneously by all the VFs and PF link UP messages.
Patch 4:
When sending link event to VF by PF check whether VF is really up to
receive this message.
Patch 5:
In AF driver, use separate interrupt handlers for the AF-VF interrupt and
AF-PF interrupt. Sometimes both interrupts are raised to two CPUs at same
time and both CPUs execute same function at same time corrupting the data.
v2 changes:
Added missing mutex unlock in error path in patch 1
Refactored if else logic in patch 1 as suggested by Paolo Abeni
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-af: Use separate handlers for interrupts
For PF to AF interrupt vector and VF to AF vector same
interrupt handler is registered which is causing race condition.
When two interrupts are raised to two CPUs at same time
then two cores serve same event corrupting the data.
Fixes: 7304ac4567bc ("octeontx2-af: Add mailbox IRQ and msg handlers") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-pf: Send UP messages to VF only when VF is up.
When PF sending link status messages to VF, it is possible
that by the time link_event_task work function is executed
VF might have brought down. Hence before sending VF link
status message check whether VF is up to receive it.
Fixes: ad513ed938c9 ("octeontx2-vf: Link event notification support") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-pf: Use default max_active works instead of one
Only one execution context for the workqueue used for PF and
VFs mailbox communication is incorrect since multiple works are
queued simultaneously by all the VFs and PF link UP messages.
Hence use default number of execution contexts by passing zero
as max_active to alloc_workqueue function. With this fix in place,
modify UP messages also to wait until completion.
Fixes: d424b6c02415 ("octeontx2-pf: Enable SRIOV and added VF mbox handling") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2-pf: Wait till detach_resources msg is complete
During VF driver remove, a message is sent to detach VF
resources to PF but VF is not waiting until message is
complete. Also mailbox interrupts need to be turned off
after the detach resource message is complete. This patch
fixes that problem.
Fixes: 05fcc9e08955 ("octeontx2-pf: Attach NIX and NPA block LFs") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
octeontx2: Detect the mbox up or down message via register
A single line of interrupt is used to receive up notifications
and down reply messages from AF to PF (similarly from PF to its VF).
PF acts as bridge and forwards VF messages to AF and sends respsones
back from AF to VF. When an async event like link event is received
by up message when PF is in middle of forwarding VF message then
mailbox errors occur because PF state machine is corrupted.
Since VF is a separate driver or VF driver can be in a VM it is
not possible to serialize from the start of communication at VF.
Hence to differentiate between type of messages at PF this patch makes
sender to set mbox data register with distinct values for up and down
messages. Sender also checks whether previous interrupt is received
before triggering current interrupt by waiting for mailbox data register
to become zero.
Fixes: 5a6d7c9daef3 ("octeontx2-pf: Mailbox communication with AF") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) Fix possible page_pool leak triggered by esp_output.
From Dragos Tatulea.
2) Fix UDP encapsulation in software GSO path.
From Leon Romanovsky.
* tag 'ipsec-2024-03-19' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
xfrm: Allow UDP encapsulation only in offload modes
net: esp: fix bad handling of pages from page_pool
====================
syzkaller reported a read of uninit req->syncookie. [0]
Originally, req->syncookie was used only in tcp_conn_request()
to indicate if we need to encode SYN cookie in SYN+ACK, so the
field remains uninitialised in other places.
The commit 695751e31a63 ("bpf: tcp: Handle BPF SYN Cookie in
cookie_v[46]_check().") added another meaning in ACK path;
req->syncookie is set true if SYN cookie is validated by BPF
kfunc.
After the change, cookie_v[46]_check() always read req->syncookie,
but it is not initialised in the normal SYN cookie case as reported
by KMSAN.
Let's make sure we always initialise req->syncookie in reqsk_alloc().
Thinh Tran [Fri, 15 Mar 2024 20:55:35 +0000 (15:55 -0500)]
net/bnx2x: Prevent access to a freed page in page_pool
Fix race condition leading to system crash during EEH error handling
During EEH error recovery, the bnx2x driver's transmit timeout logic
could cause a race condition when handling reset tasks. The
bnx2x_tx_timeout() schedules reset tasks via bnx2x_sp_rtnl_task(),
which ultimately leads to bnx2x_nic_unload(). In bnx2x_nic_unload()
SGEs are freed using bnx2x_free_rx_sge_range(). However, this could
overlap with the EEH driver's attempt to reset the device using
bnx2x_io_slot_reset(), which also tries to free SGEs. This race
condition can result in system crashes due to accessing freed memory
locations in bnx2x_free_rx_sge()
799 static inline void bnx2x_free_rx_sge(struct bnx2x *bp,
800 struct bnx2x_fastpath *fp, u16 index)
801 {
802 struct sw_rx_page *sw_buf = &fp->rx_page_ring[index];
803 struct page *page = sw_buf->page;
....
where sw_buf was set to NULL after the call to dma_unmap_page()
by the preceding thread.
Nikita Kiryushin [Fri, 15 Mar 2024 17:50:52 +0000 (20:50 +0300)]
net: phy: fix phy_read_poll_timeout argument type in genphy_loopback
read_poll_timeout inside phy_read_poll_timeout can set val negative
in some cases (for example, __mdiobus_read inside phy_read can return
-EOPNOTSUPP).
Supposedly, commit 4ec732951702 ("net: phylib: fix phy_read*_poll_timeout()")
should fix problems with wrong-signed vals, but I do not see how
as val is sent to phy_read as is and __val = phy_read (not val)
is checked for sign.
Change val type for signed to allow better error handling as done in other
phy_read_poll_timeout callers. This will not fix any error handling
by itself, but allows, for example, to modify cond with appropriate
sign check or check resulting val separately.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 014068dcb5b1 ("net: phy: genphy_loopback: add link speed configuration") Signed-off-by: Nikita Kiryushin <kiryushin@ancud.ru> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20240315175052.8049-1-kiryushin@ancud.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Michal Koutný [Fri, 15 Mar 2024 16:02:10 +0000 (17:02 +0100)]
net/sched: Add module alias for sch_fq_pie
The commit 2c15a5aee2f3 ("net/sched: Load modules via their alias")
starts loading modules via aliases and not canonical names. The new
aliases were added in commit 241a94abcf46 ("net/sched: Add module
aliases for cls_,sch_,act_ modules") via a Coccinele script.
sch_fq_pie.c is missing module.h header and thus Coccinele did not patch
it. Add the include and module alias manually, so that autoloading works
for sch_fq_pie too.
(Note: commit message in commit 241a94abcf46 ("net/sched: Add module
aliases for cls_,sch_,act_ modules") was mangled due to '#'
misinterpretation. The predicate haskernel is:
Tobias Brunner [Fri, 15 Mar 2024 14:35:40 +0000 (15:35 +0100)]
ipv4: raw: Fix sending packets from raw sockets via IPsec tunnels
Since the referenced commit, the xfrm_inner_extract_output() function
uses the protocol field to determine the address family. So not setting
it for IPv4 raw sockets meant that such packets couldn't be tunneled via
IPsec anymore.
IPv6 raw sockets are not affected as they already set the protocol since 9c9c9ad5fae7 ("ipv6: set skb->protocol on tcp, raw and ip6_append_data
genereated skbs").
Fixes: f4796398f21b ("xfrm: Remove inner/outer modes from output path") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/c5d9a947-eb19-4164-ac99-468ea814ce20@strongswan.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Felix Maurer [Fri, 15 Mar 2024 12:04:52 +0000 (13:04 +0100)]
hsr: Handle failures in module init
A failure during registration of the netdev notifier was not handled at
all. A failure during netlink initialization did not unregister the netdev
notifier.
Handle failures of netdev notifier registration and netlink initialization.
Both functions should only return negative values on failure and thereby
lead to the hsr module not being loaded.
Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Felix Maurer <fmaurer@redhat.com> Reviewed-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/3ce097c15e3f7ace98fc7fd9bcbf299f092e63d1.1710504184.git.fmaurer@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yewon Choi [Fri, 15 Mar 2024 09:28:38 +0000 (18:28 +0900)]
rds: introduce acquire/release ordering in acquire/release_in_xmit()
acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they
are expected to ensure acquire/release memory ordering semantics.
However, test_and_set_bit/clear_bit() don't imply such semantics, on
top of this, following smp_mb__after_atomic() does not guarantee release
ordering (memory barrier actually should be placed before clear_bit()).
Instead, we use clear_bit_unlock/test_and_set_bit_lock() here.
Fixes: 0f4b1c7e89e6 ("rds: fix rds_send_xmit() serialization") Fixes: 1f9ecd7eacfd ("RDS: Pass rds_conn_path to rds_send_xmit()") Signed-off-by: Yewon Choi <woni9911@gmail.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://lore.kernel.org/r/ZfQUxnNTO9AJmzwc@libra05 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 15 Mar 2024 00:21:08 +0000 (17:21 -0700)]
tools: ynl: add header guards for nlctrl
I "extracted" YNL C into a GitHub repo to make it easier
to use in other projects: https://github.com/linux-netdev/ynl-c
GitHub actions use Ubuntu by default, and the kernel headers
there are missing f329a0ebeaba ("genetlink: correct uAPI defines").
Add the direct include workaround for nlctrl.
Fixes: 768e044a5fd4 ("doc/netlink/specs: Add spec for nlctrl netlink family") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240315002108.523232-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 19 Mar 2024 10:22:54 +0000 (11:22 +0100)]
Merge branch 'wireguard-fixes-for-6-9-rc1'
Jason A. Donenfeld says:
====================
wireguard fixes for 6.9-rc1
This series has four WireGuard fixes:
1) Annotate a data race that KCSAN found by using READ_ONCE/WRITE_ONCE,
which has been causing syzkaller noise.
2) Use the generic netdev tstats allocation and stats getters instead of
doing this within the driver.
3) Explicitly check a flag variable instead of an empty list in the
netlink code, to prevent a UaF situation when paging through GET
results during a remove-all SET operation.
4) Set a flag in the RISC-V CI config so the selftests continue to boot.
====================
wireguard: selftests: set RISCV_ISA_FALLBACK on riscv{32,64}
This option is needed to continue booting with QEMU. Recent changes that
made this optional meant that it gets unset in the test harness, and so
WireGuard CI has been broken. Fix this by simply setting this option.
Cc: stable@vger.kernel.org Fixes: 496ea826d1e1 ("RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: netlink: access device through ctx instead of peer
The previous commit fixed a bug that led to a NULL peer->device being
dereferenced. It's actually easier and faster performance-wise to
instead get the device from ctx->wg. This semantically makes more sense
too, since ctx->wg->peer_allowedips.seq is compared with
ctx->allowedips_seq, basing them both in ctx. This also acts as a
defence in depth provision against freed peers.
Cc: stable@vger.kernel.org Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: netlink: check for dangling peer via is_dead instead of empty list
If all peers are removed via wg_peer_remove_all(), rather than setting
peer_list to empty, the peer is added to a temporary list with a head on
the stack of wg_peer_remove_all(). If a netlink dump is resumed and the
cursored peer is one that has been removed via wg_peer_remove_all(), it
will iterate from that peer and then attempt to dump freed peers.
Fix this by instead checking peer->is_dead, which was explictly created
for this purpose. Also move up the device_update_lock lockdep assertion,
since reading is_dead relies on that.
It can be reproduced by a small script like:
echo "Setting config..."
ip link add dev wg0 type wireguard
wg setconf wg0 /big-config
(
while true; do
echo "Showing config..."
wg showconf wg0 > /dev/null
done
) &
sleep 4
wg setconf wg0 <(printf "[Peer]\nPublicKey=$(wg genkey)\n")
Resulting in:
BUG: KASAN: slab-use-after-free in __lock_acquire+0x182a/0x1b20
Read of size 8 at addr ffff88811956ec70 by task wg/59
CPU: 2 PID: 59 Comm: wg Not tainted 6.8.0-rc2-debug+ #5
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x70
print_address_description.constprop.0+0x2c/0x380
print_report+0xab/0x250
kasan_report+0xba/0xf0
__lock_acquire+0x182a/0x1b20
lock_acquire+0x191/0x4b0
down_read+0x80/0x440
get_peer+0x140/0xcb0
wg_get_device_dump+0x471/0x1130
Cc: stable@vger.kernel.org Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Reported-by: Lillian Berry <lillian@star-ark.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Breno Leitao [Thu, 14 Mar 2024 22:49:07 +0000 (16:49 -0600)]
wireguard: device: leverage core stats allocator
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core
and convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in this driver and leverage the network core
allocation instead.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: receive: annotate data-race around receiving_counter.counter
Syzkaller with KCSAN identified a data-race issue when accessing
keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE()
annotations to mark the data race as intentional.
BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll
write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0:
counter_validate drivers/net/wireguard/receive.c:321 [inline]
wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461
__napi_poll+0x60/0x3b0 net/core/dev.c:6536
napi_poll net/core/dev.c:6605 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6738
__do_softirq+0xc4/0x279 kernel/softirq.c:553
do_softirq+0x5e/0x90 kernel/softirq.c:454
__local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381
__raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline]
_raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210
spin_unlock_bh include/linux/spinlock.h:396 [inline]
ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499
process_one_work kernel/workqueue.c:2633 [inline]
...
read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1:
decrypt_packet drivers/net/wireguard/receive.c:252 [inline]
wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501
process_one_work kernel/workqueue.c:2633 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
worker_thread+0x525/0x730 kernel/workqueue.c:2787
...
Fixes: a9e90d9931f3 ("wireguard: noise: separate receive counter from send counter") Reported-by: syzbot+d1de830e4ecdaac83d89@syzkaller.appspotmail.com Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The patch currently broke the bpf selftest test_tc_dtime because
uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which
does not necessarily mean mono with the original fix as the bit was re-used
for userspace timestamp as well to avoid tstamp reset in the forwarding
path. To solve this we need to keep mono_delivery_time as is and
introduce another bit called user_delivery_time and fall back to the
initial proposal of setting the user_delivery_time bit based on
sk_clockid set from userspace.
Fixes: 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets") Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/ Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Arınç ÜNAL [Thu, 14 Mar 2024 09:28:35 +0000 (12:28 +0300)]
net: dsa: mt7530: prevent possible incorrect XTAL frequency selection
On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
value that represents the frequency of the crystal oscillator connected to
the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
ESW_P4_LED_0 pins, which is done right after reset is deasserted.
On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
it's low.
These pins are also used with LEDs, therefore, their state can be set to
something other than the bootstrapping configuration. For example, a link
may be established on port 3 before the DSA subdriver takes control of the
switch which would set ESW_P3_LED_0 to high.
Currently on mt7530_setup() and mt7531_setup(), 1000 - 1100 usec delay is
described between reset assertion and deassertion. Some switch ICs in real
life conditions cannot always have these pins set back to the bootstrapping
configuration before reset deassertion in this amount of delay. This causes
wrong crystal frequency to be selected which puts the switch in a
nonfunctional state after reset deassertion.
The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
by Justin Swartz.
With a cable from an active peer connected to port 3 before reset, an
incorrect crystal frequency (0b11 = 25MHz) is selected:
[1] Reset is asserted.
[2] Period of 1000 - 1100 usec.
[3] Reset is deasserted.
[4] Period of 315 usec. HWTRAP register is populated with incorrect
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Increase the delay between reset_control_assert() and
reset_control_deassert(), and gpiod_set_value_cansleep(priv->reset, 0) and
gpiod_set_value_cansleep(priv->reset, 1) to 5000 - 5100 usec. This amount
ensures a higher possibility that the switch IC will have these pins back
to the bootstrapping configuration before reset deassertion.
With a cable from an active peer connected to port 3 before reset, the
correct crystal frequency (0b10 = 40MHz) is selected:
[1] Reset is asserted.
[2] Period of 5000 - 5100 usec.
[2-1] ESW_P3_LED_0 goes low.
[2-2] Remaining period of 5000 - 5100 usec.
[3] Reset is deasserted.
[4] Period of 310 usec. HWTRAP register is populated with bootstrapped
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Revert commit 2920dd92b980 ("net: dsa: mt7530: disable LEDs before reset").
Changing the state of pins via reset assertion is simpler and more
efficient than doing so by setting the LED controller off.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Fixes: c288575f7810 ("net: dsa: mt7530: Add the support of MT7531 switch") Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 18 Mar 2024 12:25:52 +0000 (12:25 +0000)]
Merge branch 'veth-xdp-gro'
Ignat Korchagin says:
====================
net: veth: ability to toggle GRO and XDP independently
It is rather confusing that GRO is automatically enabled, when an XDP program
is attached to a veth interface. Moreover, it is not possible to disable GRO
on a veth, if an XDP program is attached (which might be desirable in some use
cases).
Make GRO and XDP independent for a veth interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignat Korchagin [Wed, 13 Mar 2024 18:37:58 +0000 (19:37 +0100)]
net: veth: do not manipulate GRO when using XDP
Commit d3256efd8e8b ("veth: allow enabling NAPI even without XDP") tried to fix
the fact that GRO was not possible without XDP, because veth did not use NAPI
without XDP. However, it also introduced the behaviour that GRO is always
enabled, when XDP is enabled.
While it might be desired for most cases, it is confusing for the user at best
as the GRO flag suddenly changes, when an XDP program is attached. It also
introduces some complexities in state management as was partially addressed in
commit fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down").
But the biggest problem is that it is not possible to disable GRO at all, when
an XDP program is attached, which might be needed for some use cases.
Fix this by not touching the GRO flag on XDP enable/disable as the code already
supports switching to NAPI if either GRO or XDP is requested.
Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/ Fixes: d3256efd8e8b ("veth: allow enabling NAPI even without XDP") Fixes: fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dragos Tatulea [Fri, 8 Mar 2024 15:26:00 +0000 (17:26 +0200)]
net: esp: fix bad handling of pages from page_pool
When the skb is reorganized during esp_output (!esp->inline), the pages
coming from the original skb fragments are supposed to be released back
to the system through put_page. But if the skb fragment pages are
originating from a page_pool, calling put_page on them will trigger a
page_pool leak which will eventually result in a crash.
This leak can be easily observed when using CONFIG_DEBUG_VM and doing
ipsec + gre (non offloaded) forwarding:
BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
do_sock_setsockopt net/socket.c:2311 [inline]
__sys_setsockopt+0x1d8/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
xmit_one net/core/dev.c:3527 [inline]
dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
__dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
worker_thread+0x526/0x730 kernel/workqueue.c:3416
kthread+0x1d1/0x210 kernel/kthread.c:388
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G W 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Fixes: fa788d986a3a ("packet: add sockopt to ignore outgoing packets") Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 14 Mar 2024 12:33:46 +0000 (13:33 +0100)]
net: wan: fsl_qmc_hdlc: Fix module compilation
The fsl_qmc_driver does not compile as module:
error: ‘qmc_hdlc_driver’ undeclared here (not in a function);
405 | MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
| ^~~~~~~~~~~~~~~
Fix the typo.
Fixes: b40f00ecd463 ("net: wan: Add support for QMC HDLC") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Closes: https://lore.kernel.org/linux-kernel/87ttl93f7i.fsf@mail.lhotse/ Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Golle [Wed, 13 Mar 2024 22:50:40 +0000 (22:50 +0000)]
net: ethernet: mtk_eth_soc: fix PPE hanging issue
A patch to resolve an issue was found in MediaTek's GPL-licensed SDK:
In the mtk_ppe_stop() function, the PPE scan mode is not disabled before
disabling the PPE. This can potentially lead to a hang during the process
of disabling the PPE.
Without this patch, the PPE may experience a hang during the reboot test.
Duanqiang Wen [Wed, 13 Mar 2024 08:06:34 +0000 (16:06 +0800)]
net: txgbe: fix clk_name exceed MAX_DEV_ID limits
txgbe register clk which name is i2c_designware.pci_dev_id(),
clk_name will be stored in clk_lookup_alloc. If PCIe bus number
is larger than 0x39, clk_name size will be larger than 20 bytes.
It exceeds clk_lookup_alloc MAX_DEV_ID limits. So the driver
shortened clk_name.
David Howells [Tue, 12 Mar 2024 23:37:18 +0000 (23:37 +0000)]
rxrpc: Fix error check on ->alloc_txbuf()
rxrpc_alloc_*_txbuf() and ->alloc_txbuf() return NULL to indicate no
memory, but rxrpc_send_data() uses IS_ERR().
Fix rxrpc_send_data() to check for NULL only and set -ENOMEM if it sees
that.
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Howells [Tue, 12 Mar 2024 23:37:17 +0000 (23:37 +0000)]
rxrpc: Fix use of changed alignment param to page_frag_alloc_align()
Commit 411c5f36805c ("mm/page_alloc: modify page_frag_alloc_align() to
accept align as an argument") changed the way page_frag_alloc_align()
worked, but it didn't fix AF_RXRPC as that use of that allocator function
hadn't been merged yet at the time. Now, when the AFS filesystem is used,
this results in:
WARNING: CPU: 4 PID: 379 at include/linux/gfp.h:323 rxrpc_alloc_data_txbuf+0x9d/0x2b0 [rxrpc]
Fix this by using __page_frag_alloc_align() instead.
Note that it might be better to use an order-based alignment rather than a
mask-based alignment.
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Yunsheng Lin <linyunsheng@huawei.com>
cc: Alexander Duyck <alexander.duyck@gmail.com>
cc: Michael S. Tsirkin <mst@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4a66f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
=====================================================
If the packet type ID field in the Ethernet header is either ETH_P_PRP or
ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
reads an invalid value as a sequence number. This causes the above issue.
This patch fixes the issue by returning NULL if the Ethernet header is not
followed by an HSR tag.
tcp: Fix refcnt handling in __inet_hash_connect().
syzbot reported a warning in sk_nulls_del_node_init_rcu().
The commit 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc
failure after check_estalblished().") tried to fix an issue that an
unconnected socket occupies an ehash entry when bhash2 allocation fails.
In such a case, we need to revert changes done by check_established(),
which does not hold refcnt when inserting socket into ehash.
So, to revert the change, we need to __sk_nulls_add_node_rcu() instead
of sk_nulls_add_node_rcu().
Otherwise, sock_put() will cause refcnt underflow and leak the socket.
Shay Drory [Tue, 12 Mar 2024 10:52:38 +0000 (12:52 +0200)]
devlink: Fix devlink parallel commands processing
Commit 870c7ad4a52b ("devlink: protect devlink->dev by the instance
lock") added devlink instance locking inside a loop that iterates over
all the registered devlink instances on the machine in the pre-doit
phase. This can lead to serialization of devlink commands over
different devlink instances.
For example: While the first devlink instance is executing firmware
flash, all commands to other devlink instances on the machine are
forced to wait until the first devlink finishes.
Therefore, in the pre-doit phase, take the devlink instance lock only
for the devlink instance the command is targeting. Devlink layer is
taking a reference on the devlink instance, ensuring the devlink->dev
pointer is valid. This reference taking was introduced by commit a380687200e0 ("devlink: take device reference for devlink object").
Without this commit, it would not be safe to access devlink->dev
lockless.
Fixes: 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
taprio_parse_tc_entry() is not correctly checking
TCA_TAPRIO_TC_ENTRY_INDEX attribute:
int tc; // Signed value
tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
if (tc >= TC_QOPT_MAX_QUEUE) {
NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
return -ERANGE;
}
syzbot reported that it could fed arbitary negative values:
Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU") Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linu Cherian [Tue, 12 Mar 2024 07:06:22 +0000 (12:36 +0530)]
octeontx2-af: Use matching wake_up API variant in CGX command interface
Use wake_up API instead of wake_up_interruptible, since
wait_event_timeout API is used for waiting on command completion.
Fixes: 1463f382f58d ("octeontx2-af: Add support for CGX link management") Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:30 +0000 (12:38 -0400)]
soc: fsl: qbman: Use raw spinlock for cgr_lock
smp_call_function always runs its callback in hard IRQ context, even on
PREEMPT_RT, where spinlocks can sleep. So we need to use a raw spinlock
for cgr_lock to ensure we aren't waiting on a sleeping task.
Although this bug has existed for a while, it was not apparent until
commit ef2a8d5478b9 ("net: dpaa: Adjust queue depth on rate change")
which invokes smp_call_function_single via qman_update_cgr_safe every
time a link goes up or down.
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()") CC: stable@vger.kernel.org Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com> Closes: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/ Reported-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Closes: https://lore.kernel.org/linux-arm-kernel/87wmsyvclu.fsf@pengutronix.de/ Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:29 +0000 (12:38 -0400)]
soc: fsl: qbman: Always disable interrupts when taking cgr_lock
smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()") CC: stable@vger.kernel.org Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
tcp/rds: Fix use-after-free around kernel TCP reqsk.
syzkaller reported an warning of netns ref tracker for RDS TCP listener,
which commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed for per-netns ehash.
This series fixes the bug in the partial fix and fixes the reported bug
in the global ehash.
The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
So, the scenario would be:
1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
rds_tcp_listen_init().
2. syz-executor connect()s to it and creates a reqsk.
3. syz-executor exit()s immediately.
4. netns is dismantled. [0]
5. reqsk timer is fired, and UAF happens while freeing reqsk. [1]
6. listener is freed after RCU grace period. [2]
Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.
Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash.
Allocated by task 258 on cpu 0 at 83.612050s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:343)
kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
Freed by task 27 on cpu 0 at 329.158864s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:643)
__kasan_slab_free (mm/kasan/common.c:265)
kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
process_one_work (kernel/workqueue.c:2638)
worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
The buggy address belongs to the object at ffff88801b370000
which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
freed 4352-byte region [ffff88801b370000, ffff88801b371100)
Linus Torvalds [Wed, 13 Mar 2024 00:44:08 +0000 (17:44 -0700)]
Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
Linus Torvalds [Tue, 12 Mar 2024 22:18:34 +0000 (15:18 -0700)]
Merge tag 'docs-6.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A moderatly busy cycle for development this time around.
- Some cleanup of the main index page for easier navigation
- Rework some of the other top-level pages for better readability
and, with luck, fewer merge conflicts in the future.
- Submit-checklist improvements, hopefully the first of many.
- New Italian translations
- A fair number of kernel-doc fixes and improvements. We have also
dropped the recommendation to use an old version of Sphinx.
- A new document from Thorsten on bisection
... and lots of fixes and updates"
* tag 'docs-6.9' of git://git.lwn.net/linux: (54 commits)
docs: verify/bisect: fixes, finetuning, and support for Arch
docs: Makefile: Add dependency to $(YNL_INDEX) for targets other than htmldocs
docs: Move ja_JP/howto.rst to ja_JP/process/howto.rst
docs: submit-checklist: use subheadings
docs: submit-checklist: structure by category
docs: new text on bisecting which also covers bug validation
docs: drop the version constraints for sphinx and dependencies
docs: kerneldoc-preamble.sty: Remove code for Sphinx <2.4
docs: Restore "smart quotes" for quotes
docs/zh_CN: accurate translation of "function"
docs: Include simplified link titles in main index
docs: Correct formatting of title in admin-guide/index.rst
docs: kernel_feat.py: fix build error for missing files
MAINTAINERS: Set the field name for subsystem profile section
kasan: Add documentation for CONFIG_KASAN_EXTRA_INFO
Fixed case issue with 'fault-injection' in documentation
kernel-doc: handle #if in enums as well
Documentation: update mailing list addresses
doc: kerneldoc.py: fix indentation
scripts/kernel-doc: simplify signature printing
...
Linus Torvalds [Tue, 12 Mar 2024 22:10:51 +0000 (15:10 -0700)]
Merge tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Two small audit patches:
- Use the KMEM_CACHE() macro instead of kmem_cache_create()
The guidance appears to be to use the KMEM_CACHE() macro when
possible and there is no reason why we can't use the macro, so
let's use it.
- Remove an unnecessary assignment in audit_dupe_lsm_field()
A return value variable was assigned a value in its declaration,
but the declaration value is overwritten before the return value
variable is ever referenced; drop the assignment at declaration
time"
* tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: use KMEM_CACHE() instead of kmem_cache_create()
audit: remove unnecessary assignment in audit_dupe_lsm_field()
Linus Torvalds [Tue, 12 Mar 2024 22:08:06 +0000 (15:08 -0700)]
Merge tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
- Improvements to the initialization of in-memory inodes
- A fix in ramfs to propery ensure the initialization of in-memory
inodes
- Removal of duplicated code in smack_cred_transfer()
* tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next:
Smack: use init_task_smack() in smack_cred_transfer()
ramfs: Initialize security of in-memory inodes
smack: Initialize the in-memory inode in smack_inode_init_security()
smack: Always determine inode labels in smack_inode_init_security()
smack: Handle SMACK64TRANSMUTE in smack_inode_setsecurity()
smack: Set SMACK64TRANSMUTE only for dirs in smack_inode_setxattr()
Linus Torvalds [Tue, 12 Mar 2024 22:05:27 +0000 (15:05 -0700)]
Merge tag 'seccomp-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are no core kernel changes here; it's entirely selftests and
samples:
- Improve reliability of selftests (Terry Tritton, Kees Cook)
- Fix strict-aliasing warning in samples (Arnd Bergmann)"
* tag 'seccomp-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
samples: user-trap: fix strict-aliasing warning
selftests/seccomp: Pin benchmark to single CPU
selftests/seccomp: user_notification_addfd check nextfd is available
selftests/seccomp: Change the syscall used in KILL_THREAD test
selftests/seccomp: Handle EINVAL on unshare(CLONE_NEWPID)
Linus Torvalds [Tue, 12 Mar 2024 21:49:30 +0000 (14:49 -0700)]
Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As is pretty normal for this tree, there are changes all over the
place, especially for small fixes, selftest improvements, and improved
macro usability.
Some header changes ended up landing via this tree as they depended on
the string header cleanups. Also, a notable set of changes is the work
for the reintroduction of the UBSAN signed integer overflow sanitizer
so that we can continue to make improvements on the compiler side to
make this sanitizer a more viable future security hardening option.
Summary:
- string.h and related header cleanups (Tanzir Hasan, Andy
Shevchenko)
- Add comments to explain how __is_constexpr() works
- Fix m68k stack alignment expectations in stackinit Kunit test
- Convert string selftests to KUnit
- Add KUnit tests for fortified string functions
- Improve reporting during fortified string warnings
- Allow non-type arg to type_max() and type_min()
- Allow strscpy() to be called with only 2 arguments
- Add binary mode to leaking_addresses scanner
- Various small cleanups to leaking_addresses scanner
- Adding wrapping_*() arithmetic helper
- Annotate initial signed integer wrap-around in refcount_t
- Add explicit UBSAN section to MAINTAINERS
- Fix UBSAN self-test warnings
- Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL
- Reintroduce UBSAN's signed overflow sanitizer"
* tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
selftests/powerpc: Fix load_unaligned_zeropad build failure
string: Convert helpers selftest to KUnit
string: Convert selftest to KUnit
sh: Fix build with CONFIG_UBSAN=y
compiler.h: Explain how __is_constexpr() works
overflow: Allow non-type arg to type_max() and type_min()
VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
lib/string_helpers: Add flags param to string_get_size()
x86, relocs: Ignore relocations in .notes section
objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
overflow: Use POD in check_shl_overflow()
lib: stackinit: Adjust target string to 8 bytes for m68k
sparc: vdso: Disable UBSAN instrumentation
kernel.h: Move lib/cmdline.c prototypes to string.h
leaking_addresses: Provide mechanism to scan binary files
leaking_addresses: Ignore input device status lines
leaking_addresses: Use File::Temp for /tmp files
MAINTAINERS: Update LEAKING_ADDRESSES details
fortify: Improve buffer overflow reporting
fortify: Add KUnit tests for runtime overflows
...
- Avoid potential double-free during pstorefs umount
* tag 'pstore-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/zone: Don't clear memory twice
pstore/zone: Add a null pointer check to the psz_kmsg_read
efi: pstore: Allow dynamic initialization based on module parameter
arm64: defconfig: Enable PSTORE_RAM
pstore/ram: Register to module device table
pstore: inode: Only d_invalidate() is needed
Linus Torvalds [Tue, 12 Mar 2024 21:27:37 +0000 (14:27 -0700)]
Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are optimizations, code
clean-ups, and minor bug fixes.
One new feature to mention is that NFSD administrators now have the
ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has
had this capability for some time.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits)
NFSD: Clean up nfsd4_encode_replay()
NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
NFSD: Document nfsd_setattr() fill-attributes behavior
nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
nfsd: Fix a regression in nfsd_setattr()
NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
NFSD: handle GETATTR conflict with write delegation
NFSD: add support for CB_GETATTR callback
NFSD: Document the phases of CREATE_SESSION
NFSD: Fix the NFSv4.1 CREATE_SESSION operation
nfsd: clean up comments over nfs4_client definition
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Post WRs for Write chunks in svc_rdma_sendto()
svcrdma: Post the Reply chunk and Send WR together
svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt
svcrdma: Post Send WR chain
svcrdma: Fix retry loop in svc_rdma_send()
svcrdma: Prevent a UAF in svc_rdma_send()
svcrdma: Fix SQ wake-ups
svcrdma: Increase the per-transport rw_ctx count
...
Linus Torvalds [Tue, 12 Mar 2024 20:25:53 +0000 (13:25 -0700)]
Merge tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we introduce compressed inode support over fscache
since a lot of native EROFS images are explicitly compressed so that
EROFS over fscache can be more widely used even without Dragonfly
Nydus [1].
Apart from that, there are some folio conversions for compressed
inodes available as well as a lockdep false positive fix.
Summary:
- Some folio conversions for compressed inodes;
- Add compressed inode support over fscache;
- Fix lockdep false positives of erofs_pseudo_mnt"
Link: https://nydus.dev
* tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support compressed inodes over fscache
erofs: make iov_iter describe target buffers over fscache
erofs: fix lockdep false positives on initializing erofs_pseudo_mnt
erofs: refine managed cache operations to folios
erofs: convert z_erofs_submissionqueue_endio() to folios
erofs: convert z_erofs_fill_bio_vec() to folios
erofs: get rid of `justfound` debugging tag
erofs: convert z_erofs_do_read_page() to folios
erofs: convert z_erofs_onlinepage_.* to folios
Linus Torvalds [Tue, 12 Mar 2024 20:17:36 +0000 (13:17 -0700)]
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"Fix flakiness in a test by releasing the quota synchronously when a
key is removed, and other minor cleanups"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: shrink the size of struct fscrypt_inode_info slightly
fscrypt: write CBC-CTS instead of CTS-CBC
fscrypt: clear keyring before calling key_put()
fscrypt: explicitly require that inode->i_blkbits be set
Linus Torvalds [Tue, 12 Mar 2024 19:28:34 +0000 (12:28 -0700)]
Merge tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly stabilization, refactoring and cleanup changes. There rest are
minor performance optimizations due to caching or lock contention
reduction and a few notable fixes.
Performance improvements:
- minor speedup in logging when repeatedly allocated structure is
preallocated only once, improves latency and decreases lock
contention
- minor throughput increase (+6%), reduced lock contention after
clearing delayed allocation bits, applies to several common
workload types
- skip full quota rescan if a new relation is added in the same
transaction
Fixes:
- zstd fix for inline compressed file in subpage mode, updated
version from the 6.8 time
- more fiemap followup fixes after reduced locking done in 6.8:
- fix race when detecting delalloc ranges
Core changes:
- more debugging code:
- added assertions for a very rare crash in raid56 calculation
- tree-checker dumps page state to give more insights into
possible reference counting issues
- add checksum calculation offloading sysfs knob, for now enabled
under DEBUG only to determine a good heuristic for deciding the
offload or synchronous, depends on various factors (block group
profile, device speed) and is not as clear as initially thought
(checksum type)
- error handling improvements, added assertions
- more page to folio conversion (defrag, truncate), cached size and
shift
- preparation for more fine grained locking of sectors in subpage
mode
- cleanups and refactoring:
- include cleanups, forward declarations
- pointer-to-structure helpers
- redundant argument removals
- removed unused code
- slab cache updates, last use of SLAB_MEM_SPREAD removed"
* tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
btrfs: fix race when detecting delalloc ranges during fiemap
btrfs: fix off-by-one chunk length calculation at contains_pending_extent()
btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent
btrfs: qgroup: validate btrfs_qgroup_inherit parameter
btrfs: include device major and minor numbers in the device scan notice
btrfs: mark btrfs_put_caching_control() static
btrfs: remove SLAB_MEM_SPREAD flag use
btrfs: qgroup: always free reserved space for extent records
btrfs: tree-checker: dump the page status if hit something wrong
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
btrfs: subpage: make writer lock utilize bitmap
btrfs: subpage: make reader lock utilize bitmap
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
btrfs: pass a valid extent map cache pointer to __get_extent_map()
btrfs: merge btrfs_del_delalloc_inode() helpers
btrfs: pass btrfs_device to btrfs_scratch_superblocks()
btrfs: handle transaction commit errors in flush_reservations()
btrfs: use KMEM_CACHE() to create btrfs_free_space cache
btrfs: use KMEM_CACHE() to create delayed ref caches
...
Linus Torvalds [Tue, 12 Mar 2024 17:56:28 +0000 (10:56 -0700)]
Merge tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Just two small updates this time:
- A series I did to unify the definition of PAGE_SIZE through
Kconfig, intended to help with a vdso rework that needs the
constant but cannot include the normal kernel headers when building
the compat VDSO on arm64 and potentially others
- a patch from Yan Zhao to remove the pfn_to_virt() definitions from
a couple of architectures after finding they were both incorrect
and entirely unused"
* tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: define CONFIG_PAGE_SIZE_*KB on all architectures
arch: simplify architecture specific page size configuration
arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc
Linus Torvalds [Tue, 12 Mar 2024 17:52:18 +0000 (10:52 -0700)]
Merge tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM defconfig updates from Arnd Bergmann:
"This has the usual updates to enable platform specific driver modules
as new hardware gets supported, as well as an update to the
virt.config fragment so we disable all newly added platforms again"
* tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
arm64: defconfig: Enable support for cbmem entries in the coreboot table
ARM: defconfig: enable STMicroelectronics accelerometer and gyro for Exynos
arm64: defconfig: drop ext2 filesystem and redundant ext3
arm64: defconfig: Enable Rockchip HDMI/eDP Combo PHY
arm64: defconfig: Enable Wave5 Video Encoder/Decoder
arm64: config: disable new platforms in virt.config
arm64: defconfig: Enable QCOM PBS
arm64: deconfig: enable Goodix Berlin SPI touchscreen driver as module
arm64: defconfig: Enable X1E80100 multimedia clock controllers configs
arm64: defconfig: Enable GCC and interconnect for QDU1000/QRU1000
arm64: defconfig: enable i.MX8MP ldb bridge
arm64: defconfig: enable the vf610 gpio driver
ARM: imx_v6_v7_defconfig: enable the vf610 gpio driver
ARM: multi_v7_defconfig: Add more TI Keystone support
arm64: defconfig: enable WCD939x USBSS driver as module
arm64: defconfig: enable audio drivers for SM8650 QRD board
arm64: defconfig: Enable Qualcomm interconnect providers
ARM: multi_v7_defconfig: Enable BACKLIGHT_CLASS_DEVICE
arm64: defconfig: Enable i.MX8QXP device drivers
ARM: multi_v7_defconfig: Add more TI Keystone support
...
Linus Torvalds [Tue, 12 Mar 2024 17:47:33 +0000 (10:47 -0700)]
Merge tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC code updates from Arnd Bergmann:
"These are mostly minor updates, including a number of kerneldoc fixes
from Randy Dunlap across multiple platforms. OMAP gets a few bugfixes,
and the MAINTAINERS file gets updated for AMD Zynq and NXP S32G"
* tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
ARM: s32c: update MAINTAINERS entry
ARM: AM33xx: PRM: Implement REBOOT_COLD
ARM: AM33xx: PRM: Remove redundand defines
ARM: omap1: remove duplicated 'select ARCH_OMAP'
ARM: s3c64xx: make bus_type const
ARM: imx: Remove usage of the deprecated ida_simple_xx() API
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix a kernel-doc warning
ARM: OMAP2+: PRM: fix kernel-doc warnings
ARM: OMAP2+: prm44xx: fix a kernel-doc warning
ARM: OMAP2+: pmic-cpcap: fix kernel-doc warnings
ARM: OMAP2+: hwmod: fix kernel-doc warnings
ARM: OMAP2+: hwmod: remove misuse of kernel-doc
ARM: OMAP2+: CMINST: use matching function name in kernel-doc
ARM: OMAP2+: cm33xx: use matching function name in kernel-doc
ARM: OMAP2+: clock: fix a function name in kernel-doc
ARM: OMAP2+: clockdomain: fix kernel-doc warnings
ARM: OMAP2+: am33xx-restart: fix function name in kernel-doc
soc: xilinx: update maintainer of event manager driver
...
Linus Torvalds [Tue, 12 Mar 2024 17:35:24 +0000 (10:35 -0700)]
Merge tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code
- Feature updates for Tegra and Qualcomm platforms for added hardware
support
- A number of platforms get soc_device additions for identifying
newly added chips from Renesas, Qualcomm, Mediatek and Google
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems"
* tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits)
tee: make tee_bus_type const
soc: qcom: aoss: add missing kerneldoc for qmp members
soc: qcom: geni-se: drop unused kerneldoc struct geni_wrapper param
soc: qcom: spm: fix building with CONFIG_REGULATOR=n
bus: ti-sysc: constify the struct device_type usage
memory: stm32-fmc2-ebi: keep power domain on
memory: stm32-fmc2-ebi: add MP25 RIF support
memory: stm32-fmc2-ebi: add MP25 support
memory: stm32-fmc2-ebi: check regmap_read return value
dt-bindings: memory-controller: st,stm32: add MP25 support
dt-bindings: bus: imx-weim: convert to YAML
watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs
soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regs
MAINTAINERS: Update SCMI entry with HWMON driver
MAINTAINERS: samsung: gs101: match patches touching Google Tensor SoC
memory: tegra: Fix indentation
memory: tegra: Add BPMP and ICC info for DLA clients
memory: tegra: Correct DLA client names
dt-bindings: memory: renesas,rpc-if: Document R-Car V4M support
firmware: arm_scmi: Update the supported clock protocol version
...
Linus Torvalds [Tue, 12 Mar 2024 17:29:57 +0000 (10:29 -0700)]
Merge tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC device tree updates from Arnd Bergmann:
"There is very little going on with new SoC support this time, all the
new chips are variations of others that we already support, and they
are all based on ARMv8 cores:
- Mediatek MT7981B (Filogic 820) and MT7988A (Filogic 880) are
networking SoCs designed to be used in wireless routers, similar to
the already supported MT7986A (Filogic 830).
- NXP i.MX8DXP is a variant of i.MX8QXP, with two CPU cores less.
These are used in many embedded and industrial applications.
- Renesas R8A779G2 (R-Car V4H ES2.0) and R8A779H0 (R-Car V4M) are
automotive SoCs.
- TI J722S is another automotive variant of its K3 family, related to
the AM62 series.
There are a total of 7 new arm32 machines and 45 arm64 ones, including
- Two Android phones based on the old Tegra30 chip
- Two machines using Cortex-A53 SoCs from Allwinner, a mini PC and a
SoM development board
- A set-top box using Amlogic Meson G12A S905X2
- Eight embedded board using NXP i.MX6/8/9
- Three machines using Mediatek network router chips
- Ten Chromebooks, all based on Mediatek MT8186
- One development board based on Mediatek MT8395 (Genio 1200)
- Seven tablets and phones based on Qualcomm SoCs, most of them from
Samsung.
- A third development board for Qualcomm SM8550 (Snapdragon 8 Gen 2)
- Three variants of the "White Hawk" board for Renesas automotive
SoCs
- Ten Rockchips RK35xx based machines, including NAS, Tablet, Game
console and industrial form factors.
- Three evaluation boards for TI K3 based SoCs
The other changes are mainly the usual feature additions for existing
hardware, cleanups, and dtc compile time fixes. One notable change is
the inclusion of PowerVR SGX GPU nodes on TI SoCs"
* tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (824 commits)
riscv: dts: Move BUILTIN_DTB_SOURCE to common Kconfig
riscv: dts: starfive: jh7100: fix root clock names
ARM: dts: samsung: exynos4412: decrease memory to account for unusable region
arm64: dts: qcom: sm8250-xiaomi-elish: set rotation
arm64: dts: qcom: sm8650: Fix SPMI channels size
arm64: dts: qcom: sm8550: Fix SPMI channels size
arm64: dts: rockchip: Fix name for UART pin header on qnap-ts433
arm: dts: marvell: clearfog-gtr-l8: align port numbers with enclosure
arm: dts: marvell: clearfog-gtr-l8: add support for second sfp connector
dt-bindings: soc: renesas: renesas-soc: Add pattern for gray-hawk
dtc: Enable dtc interrupt_provider check
arm64: dts: st: add video encoder support to stm32mp255
arm64: dts: st: add video decoder support to stm32mp255
ARM: dts: stm32: enable crypto accelerator on stm32mp135f-dk
ARM: dts: stm32: enable CRC on stm32mp135f-dk
ARM: dts: stm32: add CRC on stm32mp131
ARM: dts: add stm32f769-disco-mb1166-reva09
ARM: dts: stm32: add display support on stm32f769-disco
ARM: dts: stm32: rename mmc_vcard to vcc-3v3 on stm32f769-disco
ARM: dts: stm32: add DSI support on stm32f769
...
Linus Torvalds [Tue, 12 Mar 2024 17:27:52 +0000 (10:27 -0700)]
Merge tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Make the Zorro bus type constant
- defconfig updates
* tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: defconfig: Update defconfigs for v6.8-rc1
zorro: Make zorro_bus_type const
Linus Torvalds [Tue, 12 Mar 2024 17:14:22 +0000 (10:14 -0700)]
Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a too small heap
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support witch clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using singe instruction inline assemblies. This
increases readability, but also allows makes it easier to add
proper instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
Linus Torvalds [Tue, 12 Mar 2024 16:58:57 +0000 (09:58 -0700)]
Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Continuing work by Ard Biesheuvel to improve the x86 early startup
code, with the long-term goal to make it position independent:
- Get rid of early accesses to global objects, either by moving
them to the stack, deferring the access until later, or dropping
the globals entirely
- Move all code that runs early via the 1:1 mapping into
.head.text, and move code that does not out of it, so that build
time checks can be added later to ensure that no inadvertent
absolute references were emitted into code that does not
tolerate them
- Remove fixup_pointer() and occurrences of __pa_symbol(), which
rely on the compiler emitting absolute references, which is not
guaranteed
- Improve the early console code
- Add early console message about ignored NMIs, so that users are at
least warned about their existence - even if we cannot do anything
about them
- Improve the kexec code's kernel load address handling
- Enable more X86S (simplified x86) bits
- Simplify early boot GDT handling
- Micro-optimize the boot code a bit
- Misc cleanups
* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/sev: Move early startup code into .head.text section
x86/sme: Move early SME kernel encryption handling into .head.text
x86/boot: Move mem_encrypt= parsing to the decompressor
efi/libstub: Add generic support for parsing mem_encrypt=
x86/startup_64: Simplify virtual switch on primary boot
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify CR4 handling in startup code
x86/boot: Use 32-bit XOR to clear registers
efi/x86: Set the PE/COFF header's NX compat flag unconditionally
x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
x86/boot/64: Use RIP_REL_REF() to access early page tables
x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
x86/boot/64: Simplify global variable accesses in GDT/IDT programming
x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
kexec: Allocate kernel above bzImage's pref_address
x86/boot: Add a message about ignored early NMIs
...
Linus Torvalds [Tue, 12 Mar 2024 16:31:39 +0000 (09:31 -0700)]
Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
"RFDS is a CPU vulnerability that may allow a malicious userspace to
infer stale register values from kernel space. Kernel registers can
have all kinds of secrets in them so the mitigation is basically to
wait until the kernel is about to return to userspace and has user
values in the registers. At that point there is little chance of
kernel secrets ending up in the registers and the microarchitectural
state can be cleared.
This leverages some recent robustness fixes for the existing MDS
vulnerability. Both MDS and RFDS use the VERW instruction for
mitigation"
* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
x86/rfds: Mitigate Register File Data Sampling (RFDS)
Documentation/hw-vuln: Add documentation for RFDS
x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
It was originally in x86/urgent, but was deemed wrong so got zapped.
But in the meantime, x86/urgent had been merged into x86/apic to
resolve a conflict. I didn't notice the merge so didn't zap it
from x86/apic and it managed to make it up with the x86/apic
material.
The reverted commit is known to cause some KASAN problems.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
====================
nexthop: Fix two nexthop group statistics issues
Fix two issues that were introduced as part of the recent nexthop group
statistics submission. See the commit messages for more details.
====================
Ido Schimmel [Mon, 11 Mar 2024 16:23:07 +0000 (18:23 +0200)]
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
Locally generated packets can increment the new nexthop statistics from
process context, resulting in the following splat [1] due to preemption
being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr() which will
which take care of disabling / enabling preemption.
BUG: using smp_processor_id() in preemptible [00000000] code: ping/949
caller is nexthop_select_path+0xcf8/0x1e30
CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xbd/0xe0
check_preemption_disabled+0xce/0xe0
nexthop_select_path+0xcf8/0x1e30
fib_select_multipath+0x865/0x18b0
fib_select_path+0x311/0x1160
ip_route_output_key_hash_rcu+0xe54/0x2720
ip_route_output_key_hash+0x193/0x380
ip_route_output_flow+0x25/0x130
raw_sendmsg+0xbab/0x34a0
inet_sendmsg+0xa2/0xe0
__sys_sendto+0x2ad/0x430
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
Fixes: f4676ea74b85 ("net: nexthop: Add nexthop group entry stats") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240311162307.545385-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:06 +0000 (18:23 +0200)]
nexthop: Fix out-of-bounds access during attribute validation
Passing a maximum attribute type to nlmsg_parse() that is larger than
the size of the passed policy will result in an out-of-bounds access [1]
when the attribute type is used as an index into the policy array.
Fix by setting the maximum attribute type according to the policy size,
as is already done for RTM_NEWNEXTHOP messages. Add a test case that
triggers the bug.
Ido Schimmel [Mon, 11 Mar 2024 16:23:05 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
The attribute is parsed in __nh_valid_dump_req() which is called by the
dump handlers of RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET although it is
only used by the former and rejected by the policy of the latter.
Move the parsing to nh_valid_dump_req() which is only called by the dump
handler of RTM_GETNEXTHOP.
Ido Schimmel [Mon, 11 Mar 2024 16:23:04 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which
is called from the handlers of three message types: RTM_DELNEXTHOP,
RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by
the latter and rejected by the policies of the other two.
Pass 'op_flags' as NULL from the handlers of the other two and only
parse the attribute when the argument is not NULL.
Linus Torvalds [Tue, 12 Mar 2024 03:07:52 +0000 (20:07 -0700)]
Merge tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
- Add a warning when memory encryption conversions fail. These
operations require VMM cooperation, even in CoCo environments where
the VMM is untrusted. While it's _possible_ that memory pressure
could trigger the new warning, the odds are that a guest would only
see this from an attacking VMM.
- Simplify page fault code by re-enabling interrupts unconditionally
- Avoid truncation issues when pfns are passed in to pfn_to_kaddr()
with small (<64-bit) types.
* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
x86/mm: Get rid of conditional IF flag handling in page fault path
x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
Linus Torvalds [Tue, 12 Mar 2024 02:53:15 +0000 (19:53 -0700)]
Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
- The biggest change is the rework of the percpu code, to support the
'Named Address Spaces' GCC feature, by Uros Bizjak:
- This allows C code to access GS and FS segment relative memory
via variables declared with such attributes, which allows the
compiler to better optimize those accesses than the previous
inline assembly code.
- The series also includes a number of micro-optimizations for
various percpu access methods, plus a number of cleanups of %gs
accesses in assembly code.
- These changes have been exposed to linux-next testing for the
last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally working handling
of FPU switching - which also generates better code
- Propagate more RIP-relative addressing in assembly code, to generate
slightly better code
- Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
make it easier for distros to follow & maintain these options
- Rework the x86 idle code to cure RCU violations and to clean up the
logic
- Clean up the vDSO Makefile logic
- Misc cleanups and fixes
* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/idle: Select idle routine only once
x86/idle: Let prefer_mwait_c1_over_halt() return bool
x86/idle: Cleanup idle_setup()
x86/idle: Clean up idle selection
x86/idle: Sanitize X86_BUG_AMD_E400 handling
sched/idle: Conditionally handle tick broadcast in default_idle_call()
x86: Increase brk randomness entropy for 64-bit systems
x86/vdso: Move vDSO to mmap region
x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
x86/retpoline: Ensure default return thunk isn't used at runtime
x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
x86/vdso: Use $(addprefix ) instead of $(foreach )
x86/vdso: Simplify obj-y addition
x86/vdso: Consolidate targets and clean-files
x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
...
Linus Torvalds [Tue, 12 Mar 2024 02:37:56 +0000 (19:37 -0700)]
Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, including a large series from Thomas Gleixner to cure
sparse warnings"
* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Drop unused declaration of proc_nmi_enabled()
x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
x86/percpu: Cure per CPU madness on UP
smp: Consolidate smp_prepare_boot_cpu()
x86/msr: Add missing __percpu annotations
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
perf/x86/amd/uncore: Fix __percpu annotation
x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
x86/apm_32: Remove dead function apm_get_battery_status()
x86/insn-eval: Fix function param name in get_eff_addr_sib()
Linus Torvalds [Tue, 12 Mar 2024 02:23:16 +0000 (19:23 -0700)]
Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
- Reduce <asm/bootparam.h> dependencies
- Simplify <asm/efi.h>
- Unify *_setup_data definitions into <asm/setup_data.h>
- Reduce the size of <asm/bootparam.h>
* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Do not include <asm/bootparam.h> in several files
x86/efi: Implement arch_ima_efi_boot_mode() in source file
x86/setup: Move internal setup_data structures into setup_data.h
x86/setup: Move UAPI setup structures into setup_data.h