Tobias Brunner [Fri, 15 Mar 2024 14:35:40 +0000 (15:35 +0100)]
ipv4: raw: Fix sending packets from raw sockets via IPsec tunnels
Since the referenced commit, the xfrm_inner_extract_output() function
uses the protocol field to determine the address family. So not setting
it for IPv4 raw sockets meant that such packets couldn't be tunneled via
IPsec anymore.
IPv6 raw sockets are not affected as they already set the protocol since 9c9c9ad5fae7 ("ipv6: set skb->protocol on tcp, raw and ip6_append_data
genereated skbs").
Fixes: f4796398f21b ("xfrm: Remove inner/outer modes from output path") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/c5d9a947-eb19-4164-ac99-468ea814ce20@strongswan.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Felix Maurer [Fri, 15 Mar 2024 12:04:52 +0000 (13:04 +0100)]
hsr: Handle failures in module init
A failure during registration of the netdev notifier was not handled at
all. A failure during netlink initialization did not unregister the netdev
notifier.
Handle failures of netdev notifier registration and netlink initialization.
Both functions should only return negative values on failure and thereby
lead to the hsr module not being loaded.
Fixes: f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") Signed-off-by: Felix Maurer <fmaurer@redhat.com> Reviewed-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/3ce097c15e3f7ace98fc7fd9bcbf299f092e63d1.1710504184.git.fmaurer@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yewon Choi [Fri, 15 Mar 2024 09:28:38 +0000 (18:28 +0900)]
rds: introduce acquire/release ordering in acquire/release_in_xmit()
acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they
are expected to ensure acquire/release memory ordering semantics.
However, test_and_set_bit/clear_bit() don't imply such semantics, on
top of this, following smp_mb__after_atomic() does not guarantee release
ordering (memory barrier actually should be placed before clear_bit()).
Instead, we use clear_bit_unlock/test_and_set_bit_lock() here.
Fixes: 0f4b1c7e89e6 ("rds: fix rds_send_xmit() serialization") Fixes: 1f9ecd7eacfd ("RDS: Pass rds_conn_path to rds_send_xmit()") Signed-off-by: Yewon Choi <woni9911@gmail.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://lore.kernel.org/r/ZfQUxnNTO9AJmzwc@libra05 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Fri, 15 Mar 2024 00:21:08 +0000 (17:21 -0700)]
tools: ynl: add header guards for nlctrl
I "extracted" YNL C into a GitHub repo to make it easier
to use in other projects: https://github.com/linux-netdev/ynl-c
GitHub actions use Ubuntu by default, and the kernel headers
there are missing f329a0ebeaba ("genetlink: correct uAPI defines").
Add the direct include workaround for nlctrl.
Fixes: 768e044a5fd4 ("doc/netlink/specs: Add spec for nlctrl netlink family") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20240315002108.523232-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 19 Mar 2024 10:22:54 +0000 (11:22 +0100)]
Merge branch 'wireguard-fixes-for-6-9-rc1'
Jason A. Donenfeld says:
====================
wireguard fixes for 6.9-rc1
This series has four WireGuard fixes:
1) Annotate a data race that KCSAN found by using READ_ONCE/WRITE_ONCE,
which has been causing syzkaller noise.
2) Use the generic netdev tstats allocation and stats getters instead of
doing this within the driver.
3) Explicitly check a flag variable instead of an empty list in the
netlink code, to prevent a UaF situation when paging through GET
results during a remove-all SET operation.
4) Set a flag in the RISC-V CI config so the selftests continue to boot.
====================
wireguard: selftests: set RISCV_ISA_FALLBACK on riscv{32,64}
This option is needed to continue booting with QEMU. Recent changes that
made this optional meant that it gets unset in the test harness, and so
WireGuard CI has been broken. Fix this by simply setting this option.
Cc: stable@vger.kernel.org Fixes: 496ea826d1e1 ("RISC-V: provide Kconfig & commandline options to control parsing "riscv,isa"") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: netlink: access device through ctx instead of peer
The previous commit fixed a bug that led to a NULL peer->device being
dereferenced. It's actually easier and faster performance-wise to
instead get the device from ctx->wg. This semantically makes more sense
too, since ctx->wg->peer_allowedips.seq is compared with
ctx->allowedips_seq, basing them both in ctx. This also acts as a
defence in depth provision against freed peers.
Cc: stable@vger.kernel.org Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: netlink: check for dangling peer via is_dead instead of empty list
If all peers are removed via wg_peer_remove_all(), rather than setting
peer_list to empty, the peer is added to a temporary list with a head on
the stack of wg_peer_remove_all(). If a netlink dump is resumed and the
cursored peer is one that has been removed via wg_peer_remove_all(), it
will iterate from that peer and then attempt to dump freed peers.
Fix this by instead checking peer->is_dead, which was explictly created
for this purpose. Also move up the device_update_lock lockdep assertion,
since reading is_dead relies on that.
It can be reproduced by a small script like:
echo "Setting config..."
ip link add dev wg0 type wireguard
wg setconf wg0 /big-config
(
while true; do
echo "Showing config..."
wg showconf wg0 > /dev/null
done
) &
sleep 4
wg setconf wg0 <(printf "[Peer]\nPublicKey=$(wg genkey)\n")
Resulting in:
BUG: KASAN: slab-use-after-free in __lock_acquire+0x182a/0x1b20
Read of size 8 at addr ffff88811956ec70 by task wg/59
CPU: 2 PID: 59 Comm: wg Not tainted 6.8.0-rc2-debug+ #5
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x70
print_address_description.constprop.0+0x2c/0x380
print_report+0xab/0x250
kasan_report+0xba/0xf0
__lock_acquire+0x182a/0x1b20
lock_acquire+0x191/0x4b0
down_read+0x80/0x440
get_peer+0x140/0xcb0
wg_get_device_dump+0x471/0x1130
Cc: stable@vger.kernel.org Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") Reported-by: Lillian Berry <lillian@star-ark.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Breno Leitao [Thu, 14 Mar 2024 22:49:07 +0000 (16:49 -0600)]
wireguard: device: leverage core stats allocator
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core
and convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in this driver and leverage the network core
allocation instead.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
wireguard: receive: annotate data-race around receiving_counter.counter
Syzkaller with KCSAN identified a data-race issue when accessing
keypair->receiving_counter.counter. Use READ_ONCE() and WRITE_ONCE()
annotations to mark the data race as intentional.
BUG: KCSAN: data-race in wg_packet_decrypt_worker / wg_packet_rx_poll
write to 0xffff888107765888 of 8 bytes by interrupt on cpu 0:
counter_validate drivers/net/wireguard/receive.c:321 [inline]
wg_packet_rx_poll+0x3ac/0xf00 drivers/net/wireguard/receive.c:461
__napi_poll+0x60/0x3b0 net/core/dev.c:6536
napi_poll net/core/dev.c:6605 [inline]
net_rx_action+0x32b/0x750 net/core/dev.c:6738
__do_softirq+0xc4/0x279 kernel/softirq.c:553
do_softirq+0x5e/0x90 kernel/softirq.c:454
__local_bh_enable_ip+0x64/0x70 kernel/softirq.c:381
__raw_spin_unlock_bh include/linux/spinlock_api_smp.h:167 [inline]
_raw_spin_unlock_bh+0x36/0x40 kernel/locking/spinlock.c:210
spin_unlock_bh include/linux/spinlock.h:396 [inline]
ptr_ring_consume_bh include/linux/ptr_ring.h:367 [inline]
wg_packet_decrypt_worker+0x6c5/0x700 drivers/net/wireguard/receive.c:499
process_one_work kernel/workqueue.c:2633 [inline]
...
read to 0xffff888107765888 of 8 bytes by task 3196 on cpu 1:
decrypt_packet drivers/net/wireguard/receive.c:252 [inline]
wg_packet_decrypt_worker+0x220/0x700 drivers/net/wireguard/receive.c:501
process_one_work kernel/workqueue.c:2633 [inline]
process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
worker_thread+0x525/0x730 kernel/workqueue.c:2787
...
Fixes: a9e90d9931f3 ("wireguard: noise: separate receive counter from send counter") Reported-by: syzbot+d1de830e4ecdaac83d89@syzkaller.appspotmail.com Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The patch currently broke the bpf selftest test_tc_dtime because
uapi field __sk_buff->tstamp_type depends on skb->mono_delivery_time which
does not necessarily mean mono with the original fix as the bit was re-used
for userspace timestamp as well to avoid tstamp reset in the forwarding
path. To solve this we need to keep mono_delivery_time as is and
introduce another bit called user_delivery_time and fall back to the
initial proposal of setting the user_delivery_time bit based on
sk_clockid set from userspace.
Fixes: 885c36e59f46 ("net: Re-use and set mono_delivery_time bit for userspace tstamp packets") Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/ Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Arınç ÜNAL [Thu, 14 Mar 2024 09:28:35 +0000 (12:28 +0300)]
net: dsa: mt7530: prevent possible incorrect XTAL frequency selection
On MT7530, the HT_XTAL_FSEL field of the HWTRAP register stores a 2-bit
value that represents the frequency of the crystal oscillator connected to
the switch IC. The field is populated by the state of the ESW_P4_LED_0 and
ESW_P4_LED_0 pins, which is done right after reset is deasserted.
On MT7531, the XTAL25 bit of the STRAP register stores this. The LAN0LED0
pin is used to populate the bit. 25MHz when the pin is high, 40MHz when
it's low.
These pins are also used with LEDs, therefore, their state can be set to
something other than the bootstrapping configuration. For example, a link
may be established on port 3 before the DSA subdriver takes control of the
switch which would set ESW_P3_LED_0 to high.
Currently on mt7530_setup() and mt7531_setup(), 1000 - 1100 usec delay is
described between reset assertion and deassertion. Some switch ICs in real
life conditions cannot always have these pins set back to the bootstrapping
configuration before reset deassertion in this amount of delay. This causes
wrong crystal frequency to be selected which puts the switch in a
nonfunctional state after reset deassertion.
The tests below are conducted on an MT7530 with a 40MHz crystal oscillator
by Justin Swartz.
With a cable from an active peer connected to port 3 before reset, an
incorrect crystal frequency (0b11 = 25MHz) is selected:
[1] Reset is asserted.
[2] Period of 1000 - 1100 usec.
[3] Reset is deasserted.
[4] Period of 315 usec. HWTRAP register is populated with incorrect
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Increase the delay between reset_control_assert() and
reset_control_deassert(), and gpiod_set_value_cansleep(priv->reset, 0) and
gpiod_set_value_cansleep(priv->reset, 1) to 5000 - 5100 usec. This amount
ensures a higher possibility that the switch IC will have these pins back
to the bootstrapping configuration before reset deassertion.
With a cable from an active peer connected to port 3 before reset, the
correct crystal frequency (0b10 = 40MHz) is selected:
[1] Reset is asserted.
[2] Period of 5000 - 5100 usec.
[2-1] ESW_P3_LED_0 goes low.
[2-2] Remaining period of 5000 - 5100 usec.
[3] Reset is deasserted.
[4] Period of 310 usec. HWTRAP register is populated with bootstrapped
XTAL frequency.
[5] Signals reflect the bootstrapped configuration.
Revert commit 2920dd92b980 ("net: dsa: mt7530: disable LEDs before reset").
Changing the state of pins via reset assertion is simpler and more
efficient than doing so by setting the LED controller off.
Fixes: b8f126a8d543 ("net-next: dsa: add dsa support for Mediatek MT7530 switch") Fixes: c288575f7810 ("net: dsa: mt7530: Add the support of MT7531 switch") Co-developed-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 18 Mar 2024 12:25:52 +0000 (12:25 +0000)]
Merge branch 'veth-xdp-gro'
Ignat Korchagin says:
====================
net: veth: ability to toggle GRO and XDP independently
It is rather confusing that GRO is automatically enabled, when an XDP program
is attached to a veth interface. Moreover, it is not possible to disable GRO
on a veth, if an XDP program is attached (which might be desirable in some use
cases).
Make GRO and XDP independent for a veth interface.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ignat Korchagin [Wed, 13 Mar 2024 18:37:58 +0000 (19:37 +0100)]
net: veth: do not manipulate GRO when using XDP
Commit d3256efd8e8b ("veth: allow enabling NAPI even without XDP") tried to fix
the fact that GRO was not possible without XDP, because veth did not use NAPI
without XDP. However, it also introduced the behaviour that GRO is always
enabled, when XDP is enabled.
While it might be desired for most cases, it is confusing for the user at best
as the GRO flag suddenly changes, when an XDP program is attached. It also
introduces some complexities in state management as was partially addressed in
commit fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down").
But the biggest problem is that it is not possible to disable GRO at all, when
an XDP program is attached, which might be needed for some use cases.
Fix this by not touching the GRO flag on XDP enable/disable as the code already
supports switching to NAPI if either GRO or XDP is requested.
Link: https://lore.kernel.org/lkml/20240311124015.38106-1-ignat@cloudflare.com/ Fixes: d3256efd8e8b ("veth: allow enabling NAPI even without XDP") Fixes: fe9f801355f0 ("net: veth: clear GRO when clearing XDP even when down") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
BUG: KCSAN: data-race in dev_queue_xmit_nit / packet_setsockopt
write to 0xffff888107804542 of 1 bytes by task 22618 on cpu 0:
packet_setsockopt+0xd83/0xfd0 net/packet/af_packet.c:4003
do_sock_setsockopt net/socket.c:2311 [inline]
__sys_setsockopt+0x1d8/0x250 net/socket.c:2334
__do_sys_setsockopt net/socket.c:2343 [inline]
__se_sys_setsockopt net/socket.c:2340 [inline]
__x64_sys_setsockopt+0x66/0x80 net/socket.c:2340
do_syscall_64+0xd3/0x1d0
entry_SYSCALL_64_after_hwframe+0x6d/0x75
read to 0xffff888107804542 of 1 bytes by task 27 on cpu 1:
dev_queue_xmit_nit+0x82/0x620 net/core/dev.c:2248
xmit_one net/core/dev.c:3527 [inline]
dev_hard_start_xmit+0xcc/0x3f0 net/core/dev.c:3547
__dev_queue_xmit+0xf24/0x1dd0 net/core/dev.c:4335
dev_queue_xmit include/linux/netdevice.h:3091 [inline]
batadv_send_skb_packet+0x264/0x300 net/batman-adv/send.c:108
batadv_send_broadcast_skb+0x24/0x30 net/batman-adv/send.c:127
batadv_iv_ogm_send_to_if net/batman-adv/bat_iv_ogm.c:392 [inline]
batadv_iv_ogm_emit net/batman-adv/bat_iv_ogm.c:420 [inline]
batadv_iv_send_outstanding_bat_ogm_packet+0x3f0/0x4b0 net/batman-adv/bat_iv_ogm.c:1700
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0x465/0x990 kernel/workqueue.c:3335
worker_thread+0x526/0x730 kernel/workqueue.c:3416
kthread+0x1d1/0x210 kernel/kthread.c:388
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
value changed: 0x00 -> 0x01
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 27 Comm: kworker/u8:1 Tainted: G W 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024
Workqueue: bat_events batadv_iv_send_outstanding_bat_ogm_packet
Fixes: fa788d986a3a ("packet: add sockopt to ignore outgoing packets") Reported-by: syzbot+c669c1136495a2e7c31f@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/CANn89i+Z7MfbkBLOv=p7KZ7=K1rKHO4P1OL5LYDCtBiyqsa9oQ@mail.gmail.com/T/#t Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herve Codina [Thu, 14 Mar 2024 12:33:46 +0000 (13:33 +0100)]
net: wan: fsl_qmc_hdlc: Fix module compilation
The fsl_qmc_driver does not compile as module:
error: ‘qmc_hdlc_driver’ undeclared here (not in a function);
405 | MODULE_DEVICE_TABLE(of, qmc_hdlc_driver);
| ^~~~~~~~~~~~~~~
Fix the typo.
Fixes: b40f00ecd463 ("net: wan: Add support for QMC HDLC") Reported-by: Michael Ellerman <mpe@ellerman.id.au> Closes: https://lore.kernel.org/linux-kernel/87ttl93f7i.fsf@mail.lhotse/ Signed-off-by: Herve Codina <herve.codina@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Golle [Wed, 13 Mar 2024 22:50:40 +0000 (22:50 +0000)]
net: ethernet: mtk_eth_soc: fix PPE hanging issue
A patch to resolve an issue was found in MediaTek's GPL-licensed SDK:
In the mtk_ppe_stop() function, the PPE scan mode is not disabled before
disabling the PPE. This can potentially lead to a hang during the process
of disabling the PPE.
Without this patch, the PPE may experience a hang during the reboot test.
Duanqiang Wen [Wed, 13 Mar 2024 08:06:34 +0000 (16:06 +0800)]
net: txgbe: fix clk_name exceed MAX_DEV_ID limits
txgbe register clk which name is i2c_designware.pci_dev_id(),
clk_name will be stored in clk_lookup_alloc. If PCIe bus number
is larger than 0x39, clk_name size will be larger than 20 bytes.
It exceeds clk_lookup_alloc MAX_DEV_ID limits. So the driver
shortened clk_name.
David Howells [Tue, 12 Mar 2024 23:37:18 +0000 (23:37 +0000)]
rxrpc: Fix error check on ->alloc_txbuf()
rxrpc_alloc_*_txbuf() and ->alloc_txbuf() return NULL to indicate no
memory, but rxrpc_send_data() uses IS_ERR().
Fix rxrpc_send_data() to check for NULL only and set -ENOMEM if it sees
that.
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
David Howells [Tue, 12 Mar 2024 23:37:17 +0000 (23:37 +0000)]
rxrpc: Fix use of changed alignment param to page_frag_alloc_align()
Commit 411c5f36805c ("mm/page_alloc: modify page_frag_alloc_align() to
accept align as an argument") changed the way page_frag_alloc_align()
worked, but it didn't fix AF_RXRPC as that use of that allocator function
hadn't been merged yet at the time. Now, when the AFS filesystem is used,
this results in:
WARNING: CPU: 4 PID: 379 at include/linux/gfp.h:323 rxrpc_alloc_data_txbuf+0x9d/0x2b0 [rxrpc]
Fix this by using __page_frag_alloc_align() instead.
Note that it might be better to use an order-based alignment rather than a
mask-based alignment.
Fixes: 49489bb03a50 ("rxrpc: Do zerocopy using MSG_SPLICE_PAGES and page frags") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Marc Dionne <marc.dionne@auristor.com>
cc: Yunsheng Lin <linyunsheng@huawei.com>
cc: Alexander Duyck <alexander.duyck@gmail.com>
cc: Michael S. Tsirkin <mst@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
CPU: 1 PID: 5033 Comm: syz-executor334 Not tainted 6.7.0-syzkaller-00562-g9f8413c4a66f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
=====================================================
If the packet type ID field in the Ethernet header is either ETH_P_PRP or
ETH_P_HSR, but it is not followed by an HSR tag, hsr_get_skb_sequence_nr()
reads an invalid value as a sequence number. This causes the above issue.
This patch fixes the issue by returning NULL if the Ethernet header is not
followed by an HSR tag.
tcp: Fix refcnt handling in __inet_hash_connect().
syzbot reported a warning in sk_nulls_del_node_init_rcu().
The commit 66b60b0c8c4a ("dccp/tcp: Unhash sk from ehash for tb2 alloc
failure after check_estalblished().") tried to fix an issue that an
unconnected socket occupies an ehash entry when bhash2 allocation fails.
In such a case, we need to revert changes done by check_established(),
which does not hold refcnt when inserting socket into ehash.
So, to revert the change, we need to __sk_nulls_add_node_rcu() instead
of sk_nulls_add_node_rcu().
Otherwise, sock_put() will cause refcnt underflow and leak the socket.
Shay Drory [Tue, 12 Mar 2024 10:52:38 +0000 (12:52 +0200)]
devlink: Fix devlink parallel commands processing
Commit 870c7ad4a52b ("devlink: protect devlink->dev by the instance
lock") added devlink instance locking inside a loop that iterates over
all the registered devlink instances on the machine in the pre-doit
phase. This can lead to serialization of devlink commands over
different devlink instances.
For example: While the first devlink instance is executing firmware
flash, all commands to other devlink instances on the machine are
forced to wait until the first devlink finishes.
Therefore, in the pre-doit phase, take the devlink instance lock only
for the devlink instance the command is targeting. Devlink layer is
taking a reference on the devlink instance, ensuring the devlink->dev
pointer is valid. This reference taking was introduced by commit a380687200e0 ("devlink: take device reference for devlink object").
Without this commit, it would not be safe to access devlink->dev
lockless.
Fixes: 870c7ad4a52b ("devlink: protect devlink->dev by the instance lock") Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
taprio_parse_tc_entry() is not correctly checking
TCA_TAPRIO_TC_ENTRY_INDEX attribute:
int tc; // Signed value
tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]);
if (tc >= TC_QOPT_MAX_QUEUE) {
NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range");
return -ERANGE;
}
syzbot reported that it could fed arbitary negative values:
Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU") Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linu Cherian [Tue, 12 Mar 2024 07:06:22 +0000 (12:36 +0530)]
octeontx2-af: Use matching wake_up API variant in CGX command interface
Use wake_up API instead of wake_up_interruptible, since
wait_event_timeout API is used for waiting on command completion.
Fixes: 1463f382f58d ("octeontx2-af: Add support for CGX link management") Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:30 +0000 (12:38 -0400)]
soc: fsl: qbman: Use raw spinlock for cgr_lock
smp_call_function always runs its callback in hard IRQ context, even on
PREEMPT_RT, where spinlocks can sleep. So we need to use a raw spinlock
for cgr_lock to ensure we aren't waiting on a sleeping task.
Although this bug has existed for a while, it was not apparent until
commit ef2a8d5478b9 ("net: dpaa: Adjust queue depth on rate change")
which invokes smp_call_function_single via qman_update_cgr_safe every
time a link goes up or down.
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()") CC: stable@vger.kernel.org Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com> Closes: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/ Reported-by: Steffen Trumtrar <s.trumtrar@pengutronix.de> Closes: https://lore.kernel.org/linux-arm-kernel/87wmsyvclu.fsf@pengutronix.de/ Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Mon, 11 Mar 2024 16:38:29 +0000 (12:38 -0400)]
soc: fsl: qbman: Always disable interrupts when taking cgr_lock
smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.
Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()") CC: stable@vger.kernel.org Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Camelia Groza <camelia.groza@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
tcp/rds: Fix use-after-free around kernel TCP reqsk.
syzkaller reported an warning of netns ref tracker for RDS TCP listener,
which commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed for per-netns ehash.
This series fixes the bug in the partial fix and fixes the reported bug
in the global ehash.
The notable thing here is 0x4001 in connect(), which is RDS_TCP_PORT.
So, the scenario would be:
1. unshare(CLONE_NEWNET) creates a per netns tcp listener in
rds_tcp_listen_init().
2. syz-executor connect()s to it and creates a reqsk.
3. syz-executor exit()s immediately.
4. netns is dismantled. [0]
5. reqsk timer is fired, and UAF happens while freeing reqsk. [1]
6. listener is freed after RCU grace period. [2]
Basically, reqsk assumes that the listener guarantees netns safety
until all reqsk timers are expired by holding the listener's refcount.
However, this was not the case for kernel sockets.
Commit 740ea3c4a0b2 ("tcp: Clean up kernel listener's reqsk in
inet_twsk_purge()") fixed this issue only for per-netns ehash.
Allocated by task 258 on cpu 0 at 83.612050s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
__kasan_slab_alloc (mm/kasan/common.c:343)
kmem_cache_alloc (mm/slub.c:3813 mm/slub.c:3860 mm/slub.c:3867)
copy_net_ns (./include/linux/slab.h:701 net/core/net_namespace.c:421 net/core/net_namespace.c:480)
create_new_namespaces (kernel/nsproxy.c:110)
unshare_nsproxy_namespaces (kernel/nsproxy.c:228 (discriminator 4))
ksys_unshare (kernel/fork.c:3429)
__x64_sys_unshare (kernel/fork.c:3496)
do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
Freed by task 27 on cpu 0 at 329.158864s:
kasan_save_stack (mm/kasan/common.c:48)
kasan_save_track (mm/kasan/common.c:68)
kasan_save_free_info (mm/kasan/generic.c:643)
__kasan_slab_free (mm/kasan/common.c:265)
kmem_cache_free (mm/slub.c:4299 mm/slub.c:4363)
cleanup_net (net/core/net_namespace.c:456 net/core/net_namespace.c:446 net/core/net_namespace.c:639)
process_one_work (kernel/workqueue.c:2638)
worker_thread (kernel/workqueue.c:2700 kernel/workqueue.c:2787)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
The buggy address belongs to the object at ffff88801b370000
which belongs to the cache net_namespace of size 4352
The buggy address is located 1024 bytes inside of
freed 4352-byte region [ffff88801b370000, ffff88801b371100)
Linus Torvalds [Wed, 13 Mar 2024 00:44:08 +0000 (17:44 -0700)]
Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Large effort by Eric to lower rtnl_lock pressure and remove locks:
- Make commonly used parts of rtnetlink (address, route dumps
etc) lockless, protected by RCU instead of rtnl_lock.
- Add a netns exit callback which already holds rtnl_lock,
allowing netns exit to take rtnl_lock once in the core instead
of once for each driver / callback.
- Remove locks / serialization in the socket diag interface.
- Remove 6 calls to synchronize_rcu() while holding rtnl_lock.
- Remove the dev_base_lock, depend on RCU where necessary.
- Support busy polling on a per-epoll context basis. Poll length and
budget parameters can be set independently of system defaults.
- Introduce struct net_hotdata, to make sure read-mostly global
config variables fit in as few cache lines as possible.
- Add optional per-nexthop statistics to ease monitoring / debug of
ECMP imbalance problems.
- Support TCP_NOTSENT_LOWAT in MPTCP.
- Ensure that IPv6 temporary addresses' preferred lifetimes are long
enough, compared to other configured lifetimes, and at least 2 sec.
- Support forwarding of ICMP Error messages in IPSec, per RFC 4301.
- Add support for the independent control state machine for bonding
per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled
control state machine.
- Add "network ID" to MCTP socket APIs to support hosts with multiple
disjoint MCTP networks.
- Re-use the mono_delivery_time skbuff bit for packets which user
space wants to be sent at a specified time. Maintain the timing
information while traversing veth links, bridge etc.
- Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets.
- Simplify many places iterating over netdevs by using an xarray
instead of a hash table walk (hash table remains in place, for use
on fastpaths).
- Speed up scanning for expired routes by keeping a dedicated list.
- Speed up "generic" XDP by trying harder to avoid large allocations.
- Support attaching arbitrary metadata to netconsole messages.
Things we sprinkled into general kernel code:
- Enforce VM_IOREMAP flag and range in ioremap_page_range and
introduce VM_SPARSE kind and vm_area_[un]map_pages (used by
bpf_arena).
- Rework selftest harness to enable the use of the full range of ksft
exit code (pass, fail, skip, xfail, xpass).
Netfilter:
- Allow userspace to define a table that is exclusively owned by a
daemon (via netlink socket aliveness) without auto-removing this
table when the userspace program exits. Such table gets marked as
orphaned and a restarting management daemon can re-attach/regain
ownership.
- Speed up element insertions to nftables' concatenated-ranges set
type. Compact a few related data structures.
BPF:
- Add BPF token support for delegating a subset of BPF subsystem
functionality from privileged system-wide daemons such as systemd
through special mount options for userns-bound BPF fs to a trusted
& unprivileged application.
- Introduce bpf_arena which is sparse shared memory region between
BPF program and user space where structures inside the arena can
have pointers to other areas of the arena, and pointers work
seamlessly for both user-space programs and BPF programs.
- Introduce may_goto instruction that is a contract between the
verifier and the program. The verifier allows the program to loop
assuming it's behaving well, but reserves the right to terminate
it.
- Extend the BPF verifier to enable static subprog calls in spin lock
critical sections.
- Support registration of struct_ops types from modules which helps
projects like fuse-bpf that seeks to implement a new struct_ops
type.
- Add support for retrieval of cookies for perf/kprobe multi links.
- Support arbitrary TCP SYN cookie generation / validation in the TC
layer with BPF to allow creating SYN flood handling in BPF
firewalls.
- Add code generation to inline the bpf_kptr_xchg() helper which
improves performance when stashing/popping the allocated BPF
objects.
Wireless:
- Add SPP (signaling and payload protected) AMSDU support.
- Support wider bandwidth OFDMA, as required for EHT operation.
Driver API:
- Major overhaul of the Energy Efficient Ethernet internals to
support new link modes (2.5GE, 5GE), share more code between
drivers (especially those using phylib), and encourage more
uniform behavior. Convert and clean up drivers.
- Define an API for querying per netdev queue statistics from
drivers.
- IPSec: account in global stats for fully offloaded sessions.
- Create a concept of Ethernet PHY Packages at the Device Tree level,
to allow parameterizing the existing PHY package code.
- Enable Rx hashing (RSS) on GTP protocol fields.
Misc:
- Improvements and refactoring all over networking selftests.
- Create uniform module aliases for TC classifiers, actions, and
packet schedulers to simplify creating modprobe policies.
- Address all missing MODULE_DESCRIPTION() warnings in networking.
- Extend the Netlink descriptions in YAML to cover message
encapsulation or "Netlink polymorphism", where interpretation of
nested attributes depends on link type, classifier type or some
other "class type".
Drivers:
- Ethernet high-speed NICs:
- Add a new driver for Marvell's Octeon PCI Endpoint NIC VF.
- Intel (100G, ice, idpf):
- support E825-C devices
- nVidia/Mellanox:
- support devices with one port and multiple PCIe links
- Broadcom (bnxt):
- support n-tuple filters
- support configuring the RSS key
- Wangxun (ngbe/txgbe):
- implement irq_domain for TXGBE's sub-interrupts
- Pensando/AMD:
- support XDP
- optimize queue submission and wakeup handling (+17% bps)
- optimize struct layout, saving 28% of memory on queues
- Ethernet NICs embedded and virtual:
- Google cloud vNIC:
- refactor driver to perform memory allocations for new queue
config before stopping and freeing the old queue memory
- Synopsys (stmmac):
- obey queueMaxSDU and implement counters required by 802.1Qbv
- Renesas (ravb):
- support packet checksum offload
- suspend to RAM and runtime PM support
- Ethernet switches:
- nVidia/Mellanox:
- support for nexthop group statistics
- Microchip:
- ksz8: implement PHY loopback
- add support for KSZ8567, a 7-port 10/100Mbps switch
- PTP:
- New driver for RENESAS FemtoClock3 Wireless clock generator.
- Support OCP PTP cards designed and built by Adva.
- CAN:
- Support recvmsg() flags for own, local and remote traffic on CAN
BCM sockets.
- Support for esd GmbH PCIe/402 CAN device family.
- m_can:
- Rx/Tx submission coalescing
- wake on frame Rx
- WiFi:
- Intel (iwlwifi):
- enable signaling and payload protected A-MSDUs
- support wider-bandwidth OFDMA
- support for new devices
- bump FW API to 89 for AX devices; 90 for BZ/SC devices
- MediaTek (mt76):
- mt7915: newer ADIE version support
- mt7925: radio temperature sensor support
- Qualcomm (ath11k):
- support 6 GHz station power modes: Low Power Indoor (LPI),
Standard Power) SP and Very Low Power (VLP)
- QCA6390 & WCN6855: support 2 concurrent station interfaces
- QCA2066 support
- Qualcomm (ath12k):
- refactoring in preparation for Multi-Link Operation (MLO)
support
- 1024 Block Ack window size support
- firmware-2.bin support
- support having multiple identical PCI devices (firmware needs
to have ATH12K_FW_FEATURE_MULTI_QRTR_ID)
- QCN9274: support split-PHY devices
- WCN7850: enable Power Save Mode in station mode
- WCN7850: P2P support
- RealTek:
- rtw88: support for more rtw8811cu and rtw8821cu devices
- rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL
- rtlwifi: speed up USB firmware initialization
- rtwl8xxxu:
- RTL8188F: concurrent interface support
- Channel Switch Announcement (CSA) support in AP mode
- Broadcom (brcmfmac):
- per-vendor feature support
- per-vendor SAE password setup
- DMI nvram filename quirk for ACEPC W5 Pro"
* tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits)
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
nexthop: Fix out-of-bounds access during attribute validation
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
selftests/bpf: Add kprobe multi triggering benchmarks
ptp: Move from simple ida to xarray
vxlan: Remove generic .ndo_get_stats64
vxlan: Do not alloc tstats manually
devlink: Add comments to use netlink gen tool
nfp: flower: handle acti_netdevs allocation failure
net/packet: Add getsockopt support for PACKET_COPY_THRESH
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
selftests/bpf: Add bpf_arena_htab test.
selftests/bpf: Add bpf_arena_list test.
selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages
bpf: Add helper macro bpf_addr_space_cast()
libbpf: Recognize __arena global variables.
bpftool: Recognize arena map type
...
Linus Torvalds [Tue, 12 Mar 2024 22:18:34 +0000 (15:18 -0700)]
Merge tag 'docs-6.9' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"A moderatly busy cycle for development this time around.
- Some cleanup of the main index page for easier navigation
- Rework some of the other top-level pages for better readability
and, with luck, fewer merge conflicts in the future.
- Submit-checklist improvements, hopefully the first of many.
- New Italian translations
- A fair number of kernel-doc fixes and improvements. We have also
dropped the recommendation to use an old version of Sphinx.
- A new document from Thorsten on bisection
... and lots of fixes and updates"
* tag 'docs-6.9' of git://git.lwn.net/linux: (54 commits)
docs: verify/bisect: fixes, finetuning, and support for Arch
docs: Makefile: Add dependency to $(YNL_INDEX) for targets other than htmldocs
docs: Move ja_JP/howto.rst to ja_JP/process/howto.rst
docs: submit-checklist: use subheadings
docs: submit-checklist: structure by category
docs: new text on bisecting which also covers bug validation
docs: drop the version constraints for sphinx and dependencies
docs: kerneldoc-preamble.sty: Remove code for Sphinx <2.4
docs: Restore "smart quotes" for quotes
docs/zh_CN: accurate translation of "function"
docs: Include simplified link titles in main index
docs: Correct formatting of title in admin-guide/index.rst
docs: kernel_feat.py: fix build error for missing files
MAINTAINERS: Set the field name for subsystem profile section
kasan: Add documentation for CONFIG_KASAN_EXTRA_INFO
Fixed case issue with 'fault-injection' in documentation
kernel-doc: handle #if in enums as well
Documentation: update mailing list addresses
doc: kerneldoc.py: fix indentation
scripts/kernel-doc: simplify signature printing
...
Linus Torvalds [Tue, 12 Mar 2024 22:10:51 +0000 (15:10 -0700)]
Merge tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"Two small audit patches:
- Use the KMEM_CACHE() macro instead of kmem_cache_create()
The guidance appears to be to use the KMEM_CACHE() macro when
possible and there is no reason why we can't use the macro, so
let's use it.
- Remove an unnecessary assignment in audit_dupe_lsm_field()
A return value variable was assigned a value in its declaration,
but the declaration value is overwritten before the return value
variable is ever referenced; drop the assignment at declaration
time"
* tag 'audit-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: use KMEM_CACHE() instead of kmem_cache_create()
audit: remove unnecessary assignment in audit_dupe_lsm_field()
Linus Torvalds [Tue, 12 Mar 2024 22:08:06 +0000 (15:08 -0700)]
Merge tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next
Pull smack updates from Casey Schaufler:
- Improvements to the initialization of in-memory inodes
- A fix in ramfs to propery ensure the initialization of in-memory
inodes
- Removal of duplicated code in smack_cred_transfer()
* tag 'Smack-for-6.9' of https://github.com/cschaufler/smack-next:
Smack: use init_task_smack() in smack_cred_transfer()
ramfs: Initialize security of in-memory inodes
smack: Initialize the in-memory inode in smack_inode_init_security()
smack: Always determine inode labels in smack_inode_init_security()
smack: Handle SMACK64TRANSMUTE in smack_inode_setsecurity()
smack: Set SMACK64TRANSMUTE only for dirs in smack_inode_setxattr()
Linus Torvalds [Tue, 12 Mar 2024 22:05:27 +0000 (15:05 -0700)]
Merge tag 'seccomp-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp updates from Kees Cook:
"There are no core kernel changes here; it's entirely selftests and
samples:
- Improve reliability of selftests (Terry Tritton, Kees Cook)
- Fix strict-aliasing warning in samples (Arnd Bergmann)"
* tag 'seccomp-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
samples: user-trap: fix strict-aliasing warning
selftests/seccomp: Pin benchmark to single CPU
selftests/seccomp: user_notification_addfd check nextfd is available
selftests/seccomp: Change the syscall used in KILL_THREAD test
selftests/seccomp: Handle EINVAL on unshare(CLONE_NEWPID)
Linus Torvalds [Tue, 12 Mar 2024 21:49:30 +0000 (14:49 -0700)]
Merge tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening updates from Kees Cook:
"As is pretty normal for this tree, there are changes all over the
place, especially for small fixes, selftest improvements, and improved
macro usability.
Some header changes ended up landing via this tree as they depended on
the string header cleanups. Also, a notable set of changes is the work
for the reintroduction of the UBSAN signed integer overflow sanitizer
so that we can continue to make improvements on the compiler side to
make this sanitizer a more viable future security hardening option.
Summary:
- string.h and related header cleanups (Tanzir Hasan, Andy
Shevchenko)
- Add comments to explain how __is_constexpr() works
- Fix m68k stack alignment expectations in stackinit Kunit test
- Convert string selftests to KUnit
- Add KUnit tests for fortified string functions
- Improve reporting during fortified string warnings
- Allow non-type arg to type_max() and type_min()
- Allow strscpy() to be called with only 2 arguments
- Add binary mode to leaking_addresses scanner
- Various small cleanups to leaking_addresses scanner
- Adding wrapping_*() arithmetic helper
- Annotate initial signed integer wrap-around in refcount_t
- Add explicit UBSAN section to MAINTAINERS
- Fix UBSAN self-test warnings
- Simplify UBSAN build via removal of CONFIG_UBSAN_SANITIZE_ALL
- Reintroduce UBSAN's signed overflow sanitizer"
* tag 'hardening-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (51 commits)
selftests/powerpc: Fix load_unaligned_zeropad build failure
string: Convert helpers selftest to KUnit
string: Convert selftest to KUnit
sh: Fix build with CONFIG_UBSAN=y
compiler.h: Explain how __is_constexpr() works
overflow: Allow non-type arg to type_max() and type_min()
VMCI: Fix possible memcpy() run-time warning in vmci_datagram_invoke_guest_handler()
lib/string_helpers: Add flags param to string_get_size()
x86, relocs: Ignore relocations in .notes section
objtool: Fix UNWIND_HINT_{SAVE,RESTORE} across basic blocks
overflow: Use POD in check_shl_overflow()
lib: stackinit: Adjust target string to 8 bytes for m68k
sparc: vdso: Disable UBSAN instrumentation
kernel.h: Move lib/cmdline.c prototypes to string.h
leaking_addresses: Provide mechanism to scan binary files
leaking_addresses: Ignore input device status lines
leaking_addresses: Use File::Temp for /tmp files
MAINTAINERS: Update LEAKING_ADDRESSES details
fortify: Improve buffer overflow reporting
fortify: Add KUnit tests for runtime overflows
...
- Avoid potential double-free during pstorefs umount
* tag 'pstore-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/zone: Don't clear memory twice
pstore/zone: Add a null pointer check to the psz_kmsg_read
efi: pstore: Allow dynamic initialization based on module parameter
arm64: defconfig: Enable PSTORE_RAM
pstore/ram: Register to module device table
pstore: inode: Only d_invalidate() is needed
Linus Torvalds [Tue, 12 Mar 2024 21:27:37 +0000 (14:27 -0700)]
Merge tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are optimizations, code
clean-ups, and minor bug fixes.
One new feature to mention is that NFSD administrators now have the
ability to revoke NFSv4 open and lock state. NFSD's NFSv3 support has
had this capability for some time.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (75 commits)
NFSD: Clean up nfsd4_encode_replay()
NFSD: send OP_CB_RECALL_ANY to clients when number of delegations reaches its limit
NFSD: Document nfsd_setattr() fill-attributes behavior
nfsd: Fix NFSv3 atomicity bugs in nfsd_setattr()
nfsd: Fix a regression in nfsd_setattr()
NFSD: OP_CB_RECALL_ANY should recall both read and write delegations
NFSD: handle GETATTR conflict with write delegation
NFSD: add support for CB_GETATTR callback
NFSD: Document the phases of CREATE_SESSION
NFSD: Fix the NFSv4.1 CREATE_SESSION operation
nfsd: clean up comments over nfs4_client definition
svcrdma: Add Write chunk WRs to the RPC's Send WR chain
svcrdma: Post WRs for Write chunks in svc_rdma_sendto()
svcrdma: Post the Reply chunk and Send WR together
svcrdma: Move write_info for Reply chunks into struct svc_rdma_send_ctxt
svcrdma: Post Send WR chain
svcrdma: Fix retry loop in svc_rdma_send()
svcrdma: Prevent a UAF in svc_rdma_send()
svcrdma: Fix SQ wake-ups
svcrdma: Increase the per-transport rw_ctx count
...
Linus Torvalds [Tue, 12 Mar 2024 20:25:53 +0000 (13:25 -0700)]
Merge tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, we introduce compressed inode support over fscache
since a lot of native EROFS images are explicitly compressed so that
EROFS over fscache can be more widely used even without Dragonfly
Nydus [1].
Apart from that, there are some folio conversions for compressed
inodes available as well as a lockdep false positive fix.
Summary:
- Some folio conversions for compressed inodes;
- Add compressed inode support over fscache;
- Fix lockdep false positives of erofs_pseudo_mnt"
Link: https://nydus.dev
* tag 'erofs-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: support compressed inodes over fscache
erofs: make iov_iter describe target buffers over fscache
erofs: fix lockdep false positives on initializing erofs_pseudo_mnt
erofs: refine managed cache operations to folios
erofs: convert z_erofs_submissionqueue_endio() to folios
erofs: convert z_erofs_fill_bio_vec() to folios
erofs: get rid of `justfound` debugging tag
erofs: convert z_erofs_do_read_page() to folios
erofs: convert z_erofs_onlinepage_.* to folios
Linus Torvalds [Tue, 12 Mar 2024 20:17:36 +0000 (13:17 -0700)]
Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux
Pull fscrypt updates from Eric Biggers:
"Fix flakiness in a test by releasing the quota synchronously when a
key is removed, and other minor cleanups"
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: shrink the size of struct fscrypt_inode_info slightly
fscrypt: write CBC-CTS instead of CTS-CBC
fscrypt: clear keyring before calling key_put()
fscrypt: explicitly require that inode->i_blkbits be set
Linus Torvalds [Tue, 12 Mar 2024 19:28:34 +0000 (12:28 -0700)]
Merge tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Mostly stabilization, refactoring and cleanup changes. There rest are
minor performance optimizations due to caching or lock contention
reduction and a few notable fixes.
Performance improvements:
- minor speedup in logging when repeatedly allocated structure is
preallocated only once, improves latency and decreases lock
contention
- minor throughput increase (+6%), reduced lock contention after
clearing delayed allocation bits, applies to several common
workload types
- skip full quota rescan if a new relation is added in the same
transaction
Fixes:
- zstd fix for inline compressed file in subpage mode, updated
version from the 6.8 time
- more fiemap followup fixes after reduced locking done in 6.8:
- fix race when detecting delalloc ranges
Core changes:
- more debugging code:
- added assertions for a very rare crash in raid56 calculation
- tree-checker dumps page state to give more insights into
possible reference counting issues
- add checksum calculation offloading sysfs knob, for now enabled
under DEBUG only to determine a good heuristic for deciding the
offload or synchronous, depends on various factors (block group
profile, device speed) and is not as clear as initially thought
(checksum type)
- error handling improvements, added assertions
- more page to folio conversion (defrag, truncate), cached size and
shift
- preparation for more fine grained locking of sectors in subpage
mode
- cleanups and refactoring:
- include cleanups, forward declarations
- pointer-to-structure helpers
- redundant argument removals
- removed unused code
- slab cache updates, last use of SLAB_MEM_SPREAD removed"
* tag 'for-6.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (114 commits)
btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations
btrfs: fix race when detecting delalloc ranges during fiemap
btrfs: fix off-by-one chunk length calculation at contains_pending_extent()
btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent
btrfs: qgroup: validate btrfs_qgroup_inherit parameter
btrfs: include device major and minor numbers in the device scan notice
btrfs: mark btrfs_put_caching_control() static
btrfs: remove SLAB_MEM_SPREAD flag use
btrfs: qgroup: always free reserved space for extent records
btrfs: tree-checker: dump the page status if hit something wrong
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
btrfs: subpage: make writer lock utilize bitmap
btrfs: subpage: make reader lock utilize bitmap
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
btrfs: pass a valid extent map cache pointer to __get_extent_map()
btrfs: merge btrfs_del_delalloc_inode() helpers
btrfs: pass btrfs_device to btrfs_scratch_superblocks()
btrfs: handle transaction commit errors in flush_reservations()
btrfs: use KMEM_CACHE() to create btrfs_free_space cache
btrfs: use KMEM_CACHE() to create delayed ref caches
...
Linus Torvalds [Tue, 12 Mar 2024 17:56:28 +0000 (10:56 -0700)]
Merge tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic updates from Arnd Bergmann:
"Just two small updates this time:
- A series I did to unify the definition of PAGE_SIZE through
Kconfig, intended to help with a vdso rework that needs the
constant but cannot include the normal kernel headers when building
the compat VDSO on arm64 and potentially others
- a patch from Yan Zhao to remove the pfn_to_virt() definitions from
a couple of architectures after finding they were both incorrect
and entirely unused"
* tag 'asm-generic-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
arch: define CONFIG_PAGE_SIZE_*KB on all architectures
arch: simplify architecture specific page size configuration
arch: consolidate existing CONFIG_PAGE_SIZE_*KB definitions
mm: Remove broken pfn_to_virt() on arch csky/hexagon/openrisc
Linus Torvalds [Tue, 12 Mar 2024 17:52:18 +0000 (10:52 -0700)]
Merge tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM defconfig updates from Arnd Bergmann:
"This has the usual updates to enable platform specific driver modules
as new hardware gets supported, as well as an update to the
virt.config fragment so we disable all newly added platforms again"
* tag 'soc-defconfig-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (24 commits)
arm64: defconfig: Enable support for cbmem entries in the coreboot table
ARM: defconfig: enable STMicroelectronics accelerometer and gyro for Exynos
arm64: defconfig: drop ext2 filesystem and redundant ext3
arm64: defconfig: Enable Rockchip HDMI/eDP Combo PHY
arm64: defconfig: Enable Wave5 Video Encoder/Decoder
arm64: config: disable new platforms in virt.config
arm64: defconfig: Enable QCOM PBS
arm64: deconfig: enable Goodix Berlin SPI touchscreen driver as module
arm64: defconfig: Enable X1E80100 multimedia clock controllers configs
arm64: defconfig: Enable GCC and interconnect for QDU1000/QRU1000
arm64: defconfig: enable i.MX8MP ldb bridge
arm64: defconfig: enable the vf610 gpio driver
ARM: imx_v6_v7_defconfig: enable the vf610 gpio driver
ARM: multi_v7_defconfig: Add more TI Keystone support
arm64: defconfig: enable WCD939x USBSS driver as module
arm64: defconfig: enable audio drivers for SM8650 QRD board
arm64: defconfig: Enable Qualcomm interconnect providers
ARM: multi_v7_defconfig: Enable BACKLIGHT_CLASS_DEVICE
arm64: defconfig: Enable i.MX8QXP device drivers
ARM: multi_v7_defconfig: Add more TI Keystone support
...
Linus Torvalds [Tue, 12 Mar 2024 17:47:33 +0000 (10:47 -0700)]
Merge tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC code updates from Arnd Bergmann:
"These are mostly minor updates, including a number of kerneldoc fixes
from Randy Dunlap across multiple platforms. OMAP gets a few bugfixes,
and the MAINTAINERS file gets updated for AMD Zynq and NXP S32G"
* tag 'soc-arm-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
ARM: s32c: update MAINTAINERS entry
ARM: AM33xx: PRM: Implement REBOOT_COLD
ARM: AM33xx: PRM: Remove redundand defines
ARM: omap1: remove duplicated 'select ARCH_OMAP'
ARM: s3c64xx: make bus_type const
ARM: imx: Remove usage of the deprecated ida_simple_xx() API
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix kernel-doc warnings
ARM: OMAP2+: fix a kernel-doc warning
ARM: OMAP2+: PRM: fix kernel-doc warnings
ARM: OMAP2+: prm44xx: fix a kernel-doc warning
ARM: OMAP2+: pmic-cpcap: fix kernel-doc warnings
ARM: OMAP2+: hwmod: fix kernel-doc warnings
ARM: OMAP2+: hwmod: remove misuse of kernel-doc
ARM: OMAP2+: CMINST: use matching function name in kernel-doc
ARM: OMAP2+: cm33xx: use matching function name in kernel-doc
ARM: OMAP2+: clock: fix a function name in kernel-doc
ARM: OMAP2+: clockdomain: fix kernel-doc warnings
ARM: OMAP2+: am33xx-restart: fix function name in kernel-doc
soc: xilinx: update maintainer of event manager driver
...
Linus Torvalds [Tue, 12 Mar 2024 17:35:24 +0000 (10:35 -0700)]
Merge tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC driver updates from Arnd Bergmann:
"This is the usual mix of updates for drivers that are used on (mostly
ARM) SoCs with no other top-level subsystem tree, including:
- The SCMI firmware subsystem gains support for version 3.2 of the
specification and updates to the notification code
- Feature updates for Tegra and Qualcomm platforms for added hardware
support
- A number of platforms get soc_device additions for identifying
newly added chips from Renesas, Qualcomm, Mediatek and Google
- Trivial improvements for firmware and memory drivers amongst
others, in particular 'const' annotations throughout multiple
subsystems"
* tag 'soc-drivers-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (96 commits)
tee: make tee_bus_type const
soc: qcom: aoss: add missing kerneldoc for qmp members
soc: qcom: geni-se: drop unused kerneldoc struct geni_wrapper param
soc: qcom: spm: fix building with CONFIG_REGULATOR=n
bus: ti-sysc: constify the struct device_type usage
memory: stm32-fmc2-ebi: keep power domain on
memory: stm32-fmc2-ebi: add MP25 RIF support
memory: stm32-fmc2-ebi: add MP25 support
memory: stm32-fmc2-ebi: check regmap_read return value
dt-bindings: memory-controller: st,stm32: add MP25 support
dt-bindings: bus: imx-weim: convert to YAML
watchdog: s3c2410_wdt: use exynos_get_pmu_regmap_by_phandle() for PMU regs
soc: samsung: exynos-pmu: Add regmap support for SoCs that protect PMU regs
MAINTAINERS: Update SCMI entry with HWMON driver
MAINTAINERS: samsung: gs101: match patches touching Google Tensor SoC
memory: tegra: Fix indentation
memory: tegra: Add BPMP and ICC info for DLA clients
memory: tegra: Correct DLA client names
dt-bindings: memory: renesas,rpc-if: Document R-Car V4M support
firmware: arm_scmi: Update the supported clock protocol version
...
Linus Torvalds [Tue, 12 Mar 2024 17:29:57 +0000 (10:29 -0700)]
Merge tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC device tree updates from Arnd Bergmann:
"There is very little going on with new SoC support this time, all the
new chips are variations of others that we already support, and they
are all based on ARMv8 cores:
- Mediatek MT7981B (Filogic 820) and MT7988A (Filogic 880) are
networking SoCs designed to be used in wireless routers, similar to
the already supported MT7986A (Filogic 830).
- NXP i.MX8DXP is a variant of i.MX8QXP, with two CPU cores less.
These are used in many embedded and industrial applications.
- Renesas R8A779G2 (R-Car V4H ES2.0) and R8A779H0 (R-Car V4M) are
automotive SoCs.
- TI J722S is another automotive variant of its K3 family, related to
the AM62 series.
There are a total of 7 new arm32 machines and 45 arm64 ones, including
- Two Android phones based on the old Tegra30 chip
- Two machines using Cortex-A53 SoCs from Allwinner, a mini PC and a
SoM development board
- A set-top box using Amlogic Meson G12A S905X2
- Eight embedded board using NXP i.MX6/8/9
- Three machines using Mediatek network router chips
- Ten Chromebooks, all based on Mediatek MT8186
- One development board based on Mediatek MT8395 (Genio 1200)
- Seven tablets and phones based on Qualcomm SoCs, most of them from
Samsung.
- A third development board for Qualcomm SM8550 (Snapdragon 8 Gen 2)
- Three variants of the "White Hawk" board for Renesas automotive
SoCs
- Ten Rockchips RK35xx based machines, including NAS, Tablet, Game
console and industrial form factors.
- Three evaluation boards for TI K3 based SoCs
The other changes are mainly the usual feature additions for existing
hardware, cleanups, and dtc compile time fixes. One notable change is
the inclusion of PowerVR SGX GPU nodes on TI SoCs"
* tag 'soc-dt-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (824 commits)
riscv: dts: Move BUILTIN_DTB_SOURCE to common Kconfig
riscv: dts: starfive: jh7100: fix root clock names
ARM: dts: samsung: exynos4412: decrease memory to account for unusable region
arm64: dts: qcom: sm8250-xiaomi-elish: set rotation
arm64: dts: qcom: sm8650: Fix SPMI channels size
arm64: dts: qcom: sm8550: Fix SPMI channels size
arm64: dts: rockchip: Fix name for UART pin header on qnap-ts433
arm: dts: marvell: clearfog-gtr-l8: align port numbers with enclosure
arm: dts: marvell: clearfog-gtr-l8: add support for second sfp connector
dt-bindings: soc: renesas: renesas-soc: Add pattern for gray-hawk
dtc: Enable dtc interrupt_provider check
arm64: dts: st: add video encoder support to stm32mp255
arm64: dts: st: add video decoder support to stm32mp255
ARM: dts: stm32: enable crypto accelerator on stm32mp135f-dk
ARM: dts: stm32: enable CRC on stm32mp135f-dk
ARM: dts: stm32: add CRC on stm32mp131
ARM: dts: add stm32f769-disco-mb1166-reva09
ARM: dts: stm32: add display support on stm32f769-disco
ARM: dts: stm32: rename mmc_vcard to vcc-3v3 on stm32f769-disco
ARM: dts: stm32: add DSI support on stm32f769
...
Linus Torvalds [Tue, 12 Mar 2024 17:27:52 +0000 (10:27 -0700)]
Merge tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k updates from Geert Uytterhoeven:
- Make the Zorro bus type constant
- defconfig updates
* tag 'm68k-for-v6.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: defconfig: Update defconfigs for v6.8-rc1
zorro: Make zorro_bus_type const
Linus Torvalds [Tue, 12 Mar 2024 17:14:22 +0000 (10:14 -0700)]
Merge tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Various virtual vs physical address usage fixes
- Fix error handling in Processor Activity Instrumentation device
driver, and export number of counters with a sysfs file
- Allow for multiple events when Processor Activity Instrumentation
counters are monitored in system wide sampling
- Change multiplier and shift values of the Time-of-Day clock source to
improve steering precision
- Remove a couple of unneeded GFP_DMA flags from allocations
- Disable mmap alignment if randomize_va_space is also disabled, to
avoid a too small heap
- Various changes to allow s390 to be compiled with LLVM=1, since
ld.lld and llvm-objcopy will have proper s390 support witch clang 19
- Add __uninitialized macro to Compiler Attributes. This is helpful
with s390's FPU code where some users have up to 520 byte stack
frames. Clearing such stack frames (if INIT_STACK_ALL_PATTERN or
INIT_STACK_ALL_ZERO is enabled) before they are used contradicts the
intention (performance improvement) of such code sections.
- Convert switch_to() to an out-of-line function, and use the generic
switch_to header file
- Replace the usage of s390's debug feature with pr_debug() calls
within the zcrypt device driver
- Improve hotplug support of the Adjunct Processor device driver
- Improve retry handling in the zcrypt device driver
- Various changes to the in-kernel FPU code:
- Make in-kernel FPU sections preemptible
- Convert various larger inline assemblies and assembler files to
C, mainly by using singe instruction inline assemblies. This
increases readability, but also allows makes it easier to add
proper instrumentation hooks
- Cleanup of the header files
- Provide fast variants of csum_partial() and
csum_partial_copy_nocheck() based on vector instructions
- Introduce and use a lock to synchronize accesses to zpci device data
structures to avoid inconsistent states caused by concurrent accesses
- Compile the kernel without -fPIE. This addresses the following
problems if the kernel is compiled with -fPIE:
- It uses dynamic symbols (.dynsym), for which the linker refuses
to allow more than 64k sections. This can break features which
use '-ffunction-sections' and '-fdata-sections', including
kpatch-build and function granular KASLR
- It unnecessarily uses GOT relocations, adding an extra layer of
indirection for many memory accesses
- Fix shared_cpu_list for CPU private L2 caches, which incorrectly were
reported as globally shared
* tag 's390-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (117 commits)
s390/tools: handle rela R_390_GOTPCDBL/R_390_GOTOFF64
s390/cache: prevent rebuild of shared_cpu_list
s390/crypto: remove retry loop with sleep from PAES pkey invocation
s390/pkey: improve pkey retry behavior
s390/zcrypt: improve zcrypt retry behavior
s390/zcrypt: introduce retries on in-kernel send CPRB functions
s390/ap: introduce mutex to lock the AP bus scan
s390/ap: rework ap_scan_bus() to return true on config change
s390/ap: clarify AP scan bus related functions and variables
s390/ap: rearm APQNs bindings complete completion
s390/configs: increase number of LOCKDEP_BITS
s390/vfio-ap: handle hardware checkstop state on queue reset operation
s390/pai: change sampling event assignment for PMU device driver
s390/boot: fix minor comment style damages
s390/boot: do not check for zero-termination relocation entry
s390/boot: make type of __vmlinux_relocs_64_start|end consistent
s390/boot: sanitize kaslr_adjust_relocs() function prototype
s390/boot: simplify GOT handling
s390: vmlinux.lds.S: fix .got.plt assertion
s390/boot: workaround current 'llvm-objdump -t -j ...' behavior
...
Linus Torvalds [Tue, 12 Mar 2024 16:58:57 +0000 (09:58 -0700)]
Merge tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Ingo Molnar:
- Continuing work by Ard Biesheuvel to improve the x86 early startup
code, with the long-term goal to make it position independent:
- Get rid of early accesses to global objects, either by moving
them to the stack, deferring the access until later, or dropping
the globals entirely
- Move all code that runs early via the 1:1 mapping into
.head.text, and move code that does not out of it, so that build
time checks can be added later to ensure that no inadvertent
absolute references were emitted into code that does not
tolerate them
- Remove fixup_pointer() and occurrences of __pa_symbol(), which
rely on the compiler emitting absolute references, which is not
guaranteed
- Improve the early console code
- Add early console message about ignored NMIs, so that users are at
least warned about their existence - even if we cannot do anything
about them
- Improve the kexec code's kernel load address handling
- Enable more X86S (simplified x86) bits
- Simplify early boot GDT handling
- Micro-optimize the boot code a bit
- Misc cleanups
* tag 'x86-boot-2024-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
x86/sev: Move early startup code into .head.text section
x86/sme: Move early SME kernel encryption handling into .head.text
x86/boot: Move mem_encrypt= parsing to the decompressor
efi/libstub: Add generic support for parsing mem_encrypt=
x86/startup_64: Simplify virtual switch on primary boot
x86/startup_64: Simplify calculation of initial page table address
x86/startup_64: Defer assignment of 5-level paging global variables
x86/startup_64: Simplify CR4 handling in startup code
x86/boot: Use 32-bit XOR to clear registers
efi/x86: Set the PE/COFF header's NX compat flag unconditionally
x86/boot/64: Load the final kernel GDT during early boot directly, remove startup_gdt[]
x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]
x86/boot/64: Use RIP_REL_REF() to access early page tables
x86/boot/64: Use RIP_REL_REF() to access '__supported_pte_mask'
x86/boot/64: Use RIP_REL_REF() to access early_dynamic_pgts[]
x86/boot/64: Use RIP_REL_REF() to assign 'phys_base'
x86/boot/64: Simplify global variable accesses in GDT/IDT programming
x86/trampoline: Bypass compat mode in trampoline_start64() if not needed
kexec: Allocate kernel above bzImage's pref_address
x86/boot: Add a message about ignored early NMIs
...
Linus Torvalds [Tue, 12 Mar 2024 16:31:39 +0000 (09:31 -0700)]
Merge tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RFDS mitigation from Dave Hansen:
"RFDS is a CPU vulnerability that may allow a malicious userspace to
infer stale register values from kernel space. Kernel registers can
have all kinds of secrets in them so the mitigation is basically to
wait until the kernel is about to return to userspace and has user
values in the registers. At that point there is little chance of
kernel secrets ending up in the registers and the microarchitectural
state can be cleared.
This leverages some recent robustness fixes for the existing MDS
vulnerability. Both MDS and RFDS use the VERW instruction for
mitigation"
* tag 'rfds-for-linus-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
KVM/x86: Export RFDS_NO and RFDS_CLEAR to guests
x86/rfds: Mitigate Register File Data Sampling (RFDS)
Documentation/hw-vuln: Add documentation for RFDS
x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is set
It was originally in x86/urgent, but was deemed wrong so got zapped.
But in the meantime, x86/urgent had been merged into x86/apic to
resolve a conflict. I didn't notice the merge so didn't zap it
from x86/apic and it managed to make it up with the x86/apic
material.
The reverted commit is known to cause some KASAN problems.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
====================
nexthop: Fix two nexthop group statistics issues
Fix two issues that were introduced as part of the recent nexthop group
statistics submission. See the commit messages for more details.
====================
Ido Schimmel [Mon, 11 Mar 2024 16:23:07 +0000 (18:23 +0200)]
nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y
Locally generated packets can increment the new nexthop statistics from
process context, resulting in the following splat [1] due to preemption
being enabled. Fix by using get_cpu_ptr() / put_cpu_ptr() which will
which take care of disabling / enabling preemption.
BUG: using smp_processor_id() in preemptible [00000000] code: ping/949
caller is nexthop_select_path+0xcf8/0x1e30
CPU: 12 PID: 949 Comm: ping Not tainted 6.8.0-rc7-custom-gcb450f605fae #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xbd/0xe0
check_preemption_disabled+0xce/0xe0
nexthop_select_path+0xcf8/0x1e30
fib_select_multipath+0x865/0x18b0
fib_select_path+0x311/0x1160
ip_route_output_key_hash_rcu+0xe54/0x2720
ip_route_output_key_hash+0x193/0x380
ip_route_output_flow+0x25/0x130
raw_sendmsg+0xbab/0x34a0
inet_sendmsg+0xa2/0xe0
__sys_sendto+0x2ad/0x430
__x64_sys_sendto+0xe5/0x1c0
do_syscall_64+0xc5/0x1d0
entry_SYSCALL_64_after_hwframe+0x63/0x6b
[...]
Fixes: f4676ea74b85 ("net: nexthop: Add nexthop group entry stats") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240311162307.545385-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel [Mon, 11 Mar 2024 16:23:06 +0000 (18:23 +0200)]
nexthop: Fix out-of-bounds access during attribute validation
Passing a maximum attribute type to nlmsg_parse() that is larger than
the size of the passed policy will result in an out-of-bounds access [1]
when the attribute type is used as an index into the policy array.
Fix by setting the maximum attribute type according to the policy size,
as is already done for RTM_NEWNEXTHOP messages. Add a test case that
triggers the bug.
Ido Schimmel [Mon, 11 Mar 2024 16:23:05 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for dump messages that require it
The attribute is parsed in __nh_valid_dump_req() which is called by the
dump handlers of RTM_GETNEXTHOP and RTM_GETNEXTHOPBUCKET although it is
only used by the former and rejected by the policy of the latter.
Move the parsing to nh_valid_dump_req() which is only called by the dump
handler of RTM_GETNEXTHOP.
Ido Schimmel [Mon, 11 Mar 2024 16:23:04 +0000 (18:23 +0200)]
nexthop: Only parse NHA_OP_FLAGS for get messages that require it
The attribute is parsed into 'op_flags' in nh_valid_get_del_req() which
is called from the handlers of three message types: RTM_DELNEXTHOP,
RTM_GETNEXTHOPBUCKET and RTM_GETNEXTHOP. The attribute is only used by
the latter and rejected by the policies of the other two.
Pass 'op_flags' as NULL from the handlers of the other two and only
parse the attribute when the argument is not NULL.
Linus Torvalds [Tue, 12 Mar 2024 03:07:52 +0000 (20:07 -0700)]
Merge tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
- Add a warning when memory encryption conversions fail. These
operations require VMM cooperation, even in CoCo environments where
the VMM is untrusted. While it's _possible_ that memory pressure
could trigger the new warning, the odds are that a guest would only
see this from an attacking VMM.
- Simplify page fault code by re-enabling interrupts unconditionally
- Avoid truncation issues when pfns are passed in to pfn_to_kaddr()
with small (<64-bit) types.
* tag 'x86_mm_for_6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/cpa: Warn for set_memory_XXcrypted() VMM fails
x86/mm: Get rid of conditional IF flag handling in page fault path
x86/mm: Ensure input to pfn_to_kaddr() is treated as a 64-bit type
Linus Torvalds [Tue, 12 Mar 2024 02:53:15 +0000 (19:53 -0700)]
Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core x86 updates from Ingo Molnar:
- The biggest change is the rework of the percpu code, to support the
'Named Address Spaces' GCC feature, by Uros Bizjak:
- This allows C code to access GS and FS segment relative memory
via variables declared with such attributes, which allows the
compiler to better optimize those accesses than the previous
inline assembly code.
- The series also includes a number of micro-optimizations for
various percpu access methods, plus a number of cleanups of %gs
accesses in assembly code.
- These changes have been exposed to linux-next testing for the
last ~5 months, with no known regressions in this area.
- Fix/clean up __switch_to()'s broken but accidentally working handling
of FPU switching - which also generates better code
- Propagate more RIP-relative addressing in assembly code, to generate
slightly better code
- Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
make it easier for distros to follow & maintain these options
- Rework the x86 idle code to cure RCU violations and to clean up the
logic
- Clean up the vDSO Makefile logic
- Misc cleanups and fixes
* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
x86/idle: Select idle routine only once
x86/idle: Let prefer_mwait_c1_over_halt() return bool
x86/idle: Cleanup idle_setup()
x86/idle: Clean up idle selection
x86/idle: Sanitize X86_BUG_AMD_E400 handling
sched/idle: Conditionally handle tick broadcast in default_idle_call()
x86: Increase brk randomness entropy for 64-bit systems
x86/vdso: Move vDSO to mmap region
x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
x86/retpoline: Ensure default return thunk isn't used at runtime
x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
x86/vdso: Use $(addprefix ) instead of $(foreach )
x86/vdso: Simplify obj-y addition
x86/vdso: Consolidate targets and clean-files
x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK
x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO
x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY
x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY
x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS
...
Linus Torvalds [Tue, 12 Mar 2024 02:37:56 +0000 (19:37 -0700)]
Merge tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups, including a large series from Thomas Gleixner to cure
sparse warnings"
* tag 'x86-cleanups-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Drop unused declaration of proc_nmi_enabled()
x86/callthunks: Use EXPORT_PER_CPU_SYMBOL_GPL() for per CPU variables
x86/cpu: Provide a declaration for itlb_multihit_kvm_mitigation
x86/cpu: Use EXPORT_PER_CPU_SYMBOL_GPL() for x86_spec_ctrl_current
x86/uaccess: Add missing __force to casts in __access_ok() and valid_user_address()
x86/percpu: Cure per CPU madness on UP
smp: Consolidate smp_prepare_boot_cpu()
x86/msr: Add missing __percpu annotations
x86/msr: Prepare for including <linux/percpu.h> into <asm/msr.h>
perf/x86/amd/uncore: Fix __percpu annotation
x86/nmi: Remove an unnecessary IS_ENABLED(CONFIG_SMP)
x86/apm_32: Remove dead function apm_get_battery_status()
x86/insn-eval: Fix function param name in get_eff_addr_sib()
Linus Torvalds [Tue, 12 Mar 2024 02:23:16 +0000 (19:23 -0700)]
Merge tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build updates from Ingo Molnar:
- Reduce <asm/bootparam.h> dependencies
- Simplify <asm/efi.h>
- Unify *_setup_data definitions into <asm/setup_data.h>
- Reduce the size of <asm/bootparam.h>
* tag 'x86-build-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Do not include <asm/bootparam.h> in several files
x86/efi: Implement arch_ima_efi_boot_mode() in source file
x86/setup: Move internal setup_data structures into setup_data.h
x86/setup: Move UAPI setup structures into setup_data.h
Linus Torvalds [Tue, 12 Mar 2024 02:13:06 +0000 (19:13 -0700)]
Merge tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"Two changes to simplify the x86 decoder logic a bit"
* tag 'x86-asm-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/insn: Directly assign x86_64 state in insn_init()
x86/insn: Remove superfluous checks from instruction decoding routines
Linus Torvalds [Tue, 12 Mar 2024 01:45:16 +0000 (18:45 -0700)]
Merge tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- Fix inconsistency in misfit task load-balancing
- Fix CPU isolation bugs in the task-wakeup logic
- Rework and unify the sched_use_asym_prio() and sched_asym_prefer()
logic
- Clean up and simplify ->avg_* accesses
- Misc cleanups and fixes
* tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC
sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio()
sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer()
sched/fair: Remove unused parameter from sched_asym()
sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS
sched/fair: Simplify the update_sd_pick_busiest() logic
sched/fair: Do strict inequality check for busiest misfit task group
sched/fair: Remove unnecessary goto in update_sd_lb_stats()
sched/fair: Take the scheduling domain into account in select_idle_core()
sched/fair: Take the scheduling domain into account in select_idle_smt()
sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq
sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl
sched/core: Simplify code by removing duplicate #ifdefs
Linus Torvalds [Tue, 12 Mar 2024 01:33:03 +0000 (18:33 -0700)]
Merge tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
- Micro-optimize local_xchg() and the rtmutex code on x86
- Fix percpu-rwsem contention tracepoints
- Simplify debugging Kconfig dependencies
- Update/clarify the documentation of atomic primitives
- Misc cleanups
* tag 'locking-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Use try_cmpxchg_relaxed() in mark_rt_mutex_waiters()
locking/x86: Implement local_xchg() using CMPXCHG without the LOCK prefix
locking/percpu-rwsem: Trigger contention tracepoints only if contended
locking/rwsem: Make DEBUG_RWSEMS and PREEMPT_RT mutually exclusive
locking/rwsem: Clarify that RWSEM_READER_OWNED is just a hint
locking/mutex: Simplify <linux/mutex.h>
locking/qspinlock: Fix 'wait_early' set but not used warning
locking/atomic: scripts: Clarify ordering of conditional atomics
Linus Torvalds [Tue, 12 Mar 2024 01:14:06 +0000 (18:14 -0700)]
Merge tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add a FRU (Field Replaceable Unit) memory poison manager which
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.
The main use case is stacked, on-chip memory which cannot simply be
replaced so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.
- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.
- igen6: Add support for Alder Lake-N SoC
- i10nm: Add Grand Ridge support
- The usual fixlets and cleanups
* tag 'edac_updates_for_v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/versal: Convert to platform remove callback returning void
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Save SPA values
RAS: Export helper to get ras_debugfs_dir
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Add MI300 row retirement support
Documentation: Move RAS section to admin-guide
EDAC/versal: Make the bit position of injected errors configurable
EDAC/i10nm: Add Intel Grand Ridge micro-server support
EDAC/igen6: Add one more Intel Alder Lake-N SoC support
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/ATL: Add MI300 support
Documentation: RAS: Add index and address translation section
EDAC/amd64: Use new AMD Address Translation Library
RAS: Introduce AMD Address Translation Library
EDAC/synopsys: Convert to devm_platform_ioremap_resource()
We've added 59 non-merge commits during the last 9 day(s) which contain
a total of 88 files changed, 4181 insertions(+), 590 deletions(-).
The main changes are:
1) Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce
VM_SPARSE kind and vm_area_[un]map_pages to be used in bpf_arena,
from Alexei.
2) Introduce bpf_arena which is sparse shared memory region between bpf
program and user space where structures inside the arena can have
pointers to other areas of the arena, and pointers work seamlessly for
both user-space programs and bpf programs, from Alexei and Andrii.
3) Introduce may_goto instruction that is a contract between the verifier
and the program. The verifier allows the program to loop assuming it's
behaving well, but reserves the right to terminate it, from Alexei.
4) Use IETF format for field definitions in the BPF standard
document, from Dave.
5) Extend struct_ops libbpf APIs to allow specify version suffixes for
stuct_ops map types, share the same BPF program between several map
definitions, and other improvements, from Eduard.
6) Enable struct_ops support for more than one page in trampolines,
from Kui-Feng.
7) Support kCFI + BPF on riscv64, from Puranjay.
8) Use bpf_prog_pack for arm64 bpf trampoline, from Puranjay.
9) Fix roundup_pow_of_two undefined behavior on 32-bit archs, from Toke.
====================
Linus Torvalds [Tue, 12 Mar 2024 01:02:44 +0000 (18:02 -0700)]
Merge tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Borislav Petkov:
- Fix a wrong check in the function reporting whether a CPU executes
(or not) a NMI handler
- Ratelimit unknown NMIs messages in order to not potentially slow down
the machine
- Other fixlets
* tag 'x86_misc_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the inverse "in NMI handler" check
Documentation/maintainer-tip: Add C++ tail comments exception
Documentation/maintainer-tip: Add Closes tag
x86/nmi: Rate limit unknown NMI messages
Documentation/kernel-parameters: Add spec_rstack_overflow to mitigations=off
Linus Torvalds [Tue, 12 Mar 2024 00:44:11 +0000 (17:44 -0700)]
Merge tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 SEV updates from Borislav Petkov:
- Add the x86 part of the SEV-SNP host support.
This will allow the kernel to be used as a KVM hypervisor capable of
running SNP (Secure Nested Paging) guests. Roughly speaking, SEV-SNP
is the ultimate goal of the AMD confidential computing side,
providing the most comprehensive confidential computing environment
up to date.
This is the x86 part and there is a KVM part which did not get ready
in time for the merge window so latter will be forthcoming in the
next cycle.
- Rework the early code's position-dependent SEV variable references in
order to allow building the kernel with clang and -fPIE/-fPIC and
-mcmodel=kernel
- The usual set of fixes, cleanups and improvements all over the place
* tag 'x86_sev_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/sev: Disable KMSAN for memory encryption TUs
x86/sev: Dump SEV_STATUS
crypto: ccp - Have it depend on AMD_IOMMU
iommu/amd: Fix failure return from snp_lookup_rmpentry()
x86/sev: Fix position dependent variable references in startup code
crypto: ccp: Make snp_range_list static
x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT
Documentation: virt: Fix up pre-formatted text block for SEV ioctls
crypto: ccp: Add the SNP_SET_CONFIG command
crypto: ccp: Add the SNP_COMMIT command
crypto: ccp: Add the SNP_PLATFORM_STATUS command
x86/cpufeatures: Enable/unmask SEV-SNP CPU feature
KVM: SEV: Make AVIC backing, VMSA and VMCB memory allocation SNP safe
crypto: ccp: Add panic notifier for SEV/SNP firmware shutdown on kdump
iommu/amd: Clean up RMP entries for IOMMU pages during SNP shutdown
crypto: ccp: Handle legacy SEV commands when SNP is enabled
crypto: ccp: Handle non-volatile INIT_EX data when SNP is enabled
crypto: ccp: Handle the legacy TMR allocation when SNP is enabled
x86/sev: Introduce an SNP leaked pages list
crypto: ccp: Provide an API to issue SEV and SNP commands
...
Linus Torvalds [Tue, 12 Mar 2024 00:29:55 +0000 (17:29 -0700)]
Merge tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull resource control updates from Borislav Petkov:
- Rework different aspects of the resctrl code like adding
arch-specific accessors and splitting the locking, in order to
accomodate ARM's MPAM implementation of hw resource control and be
able to use the same filesystem control interface like on x86. Work
by James Morse
- Improve the memory bandwidth throttling heuristic to handle workloads
with not too regular load levels which end up penalized unnecessarily
- Use CPUID to detect the memory bandwidth enforcement limit on AMD
- The usual set of fixes
* tag 'x86_cache_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
x86/resctrl: Remove lockdep annotation that triggers false positive
x86/resctrl: Separate arch and fs resctrl locks
x86/resctrl: Move domain helper migration into resctrl_offline_cpu()
x86/resctrl: Add CPU offline callback for resctrl work
x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU
x86/resctrl: Add CPU online callback for resctrl work
x86/resctrl: Add helpers for system wide mon/alloc capable
x86/resctrl: Make rdt_enable_key the arch's decision to switch
x86/resctrl: Move alloc/mon static keys into helpers
x86/resctrl: Make resctrl_mounted checks explicit
x86/resctrl: Allow arch to allocate memory needed in resctrl_arch_rmid_read()
x86/resctrl: Allow resctrl_arch_rmid_read() to sleep
x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86/resctrl: Add cpumask_any_housekeeping() for limbo/overflow
x86/resctrl: Move CLOSID/RMID matching and setting to use helpers
x86/resctrl: Allocate the cleanest CLOSID by searching closid_num_dirty_rmid
x86/resctrl: Use __set_bit()/__clear_bit() instead of open coding
x86/resctrl: Track the number of dirty RMID a CLOSID has
x86/resctrl: Allow RMID allocation to be scoped by CLOSID
x86/resctrl: Access per-rmid structures by index
...
Linus Torvalds [Tue, 12 Mar 2024 00:27:12 +0000 (17:27 -0700)]
Merge tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MTRR update from Borislav Petkov:
- Relax the PAT MSR programming which was unnecessarily using the MTRR
programming protocol of disabling the cache around the changes. The
reason behind this is the current algorithm triggering a #VE
exception for TDX guests and unnecessarily complicating things
* tag 'x86_mtrr_for_v6.9_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pat: Simplify the PAT programming protocol
Andrii Nakryiko [Sat, 9 Mar 2024 00:47:39 +0000 (16:47 -0800)]
bpf: move sleepable flag from bpf_prog_aux to bpf_prog
prog->aux->sleepable is checked very frequently as part of (some) BPF
program run hot paths. So this extra aux indirection seems wasteful and
on busy systems might cause unnecessary memory cache misses.
Let's move sleepable flag into prog itself to eliminate unnecessary
pointer dereference.
Puranjay Mohan [Mon, 11 Mar 2024 12:27:22 +0000 (12:27 +0000)]
bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes()
On some architectures like ARM64, PMD_SIZE can be really large in some
configurations. Like with CONFIG_ARM64_64K_PAGES=y the PMD_SIZE is
512MB.
Use 2MB * num_possible_nodes() as the size for allocations done through
the prog pack allocator. On most architectures, PMD_SIZE will be equal
to 2MB in case of 4KB pages and will be greater than 2MB for bigger page
sizes.
Linus Torvalds [Mon, 11 Mar 2024 23:15:43 +0000 (16:15 -0700)]
Merge tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 entry update from Thomas Gleixner:
"A single update for the x86 entry code:
The current CR3 handling for kernel page table isolation in the
paranoid return paths which are relevant for #NMI, #MCE, #VC, #DB and
#DF is unconditionally writing CR3 with the value retrieved on
exception entry.
In the vast majority of cases when returning to the kernel this is a
pointless exercise because CR3 was not modified on exception entry.
The only situation where this is necessary is when the exception
interrupts a entry from user before switching to kernel CR3 or
interrupts an exit to user after switching back to user CR3.
As CR3 writes can be expensive on some systems this becomes measurable
overhead with high frequency #NMIs such as perf.
Avoid this overhead by checking the CR3 value, which was saved on
entry, and write it back to CR3 only when it is a user CR3"
* tag 'x86-entry-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Avoid redundant CR3 write on paranoid returns
Kory Maincent [Mon, 11 Mar 2024 14:47:29 +0000 (15:47 +0100)]
ptp: Move from simple ida to xarray
Move from simple ida to xarray for storing and loading the ptp_clock
pointer. This prepares support for future hardware timestamp selection by
being able to link the ptp clock index to its pointer.
Breno Leitao [Mon, 11 Mar 2024 11:24:31 +0000 (04:24 -0700)]
vxlan: Remove generic .ndo_get_stats64
Commit 3e2f544dd8a33 ("net: get stats64 if device if driver is
configured") moved the callback to dev_get_tstats64() to net core, so,
unless the driver is doing some custom stats collection, it does not
need to set .ndo_get_stats64.
Since this driver is now relying in NETDEV_PCPU_STAT_TSTATS, then, it
doesn't need to set the dev_get_tstats64() generic .ndo_get_stats64
function pointer.
Breno Leitao [Mon, 11 Mar 2024 11:24:30 +0000 (04:24 -0700)]
vxlan: Do not alloc tstats manually
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of in this driver.
With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.
Remove the allocation in the vxlan driver and leverage the network
core allocation instead.
Linus Torvalds [Mon, 11 Mar 2024 23:00:17 +0000 (16:00 -0700)]
Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FRED support from Thomas Gleixner:
"Support for x86 Fast Return and Event Delivery (FRED).
FRED is a replacement for IDT event delivery on x86 and addresses most
of the technical nightmares which IDT exposes:
1) Exception cause registers like CR2 need to be manually preserved
in nested exception scenarios.
2) Hardware interrupt stack switching is suboptimal for nested
exceptions as the interrupt stack mechanism rewinds the stack on
each entry which requires a massive effort in the low level entry
of #NMI code to handle this.
3) No hardware distinction between entry from kernel or from user
which makes establishing kernel context more complex than it needs
to be especially for unconditionally nestable exceptions like NMI.
4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
is a problem when the perf NMI takes a fault when collecting a
stack trace.
5) Partial restore of ESP when returning to a 16-bit segment
6) Limitation of the vector space which can cause vector exhaustion
on large systems.
7) Inability to differentiate NMI sources
FRED addresses these shortcomings by:
1) An extended exception stack frame which the CPU uses to save
exception cause registers. This ensures that the meta information
for each exception is preserved on stack and avoids the extra
complexity of preserving it in software.
2) Hardware interrupt stack switching is non-rewinding if a nested
exception uses the currently interrupt stack.
3) The entry points for kernel and user context are separate and GS
BASE handling which is required to establish kernel context for
per CPU variable access is done in hardware.
4) NMIs are now nesting protected. They are only reenabled on the
return from NMI.
5) FRED guarantees full restore of ESP
6) FRED does not put a limitation on the vector space by design
because it uses a central entry points for kernel and user space
and the CPUstores the entry type (exception, trap, interrupt,
syscall) on the entry stack along with the vector number. The
entry code has to demultiplex this information, but this removes
the vector space restriction.
The first hardware implementations will still have the current
restricted vector space because lifting this limitation requires
further changes to the local APIC.
7) FRED stores the vector number and meta information on stack which
allows having more than one NMI vector in future hardware when the
required local APIC changes are in place.
The series implements the initial FRED support by:
- Reworking the existing entry and IDT handling infrastructure to
accomodate for the alternative entry mechanism.
- Expanding the stack frame to accomodate for the extra 16 bytes FRED
requires to store context and meta information
- Providing FRED specific C entry points for events which have
information pushed to the extended stack frame, e.g. #PF and #DB.
- Providing FRED specific C entry points for #NMI and #MCE
- Implementing the FRED specific ASM entry points and the C code to
demultiplex the events
- Providing detection and initialization mechanisms and the necessary
tweaks in context switching, GS BASE handling etc.
The FRED integration aims for maximum code reuse vs the existing IDT
implementation to the extent possible and the deviation in hot paths
like context switching are handled with alternatives to minimalize the
impact. The low level entry and exit paths are seperate due to the
extended stack frame and the hardware based GS BASE swichting and
therefore have no impact on IDT based systems.
It has been extensively tested on existing systems and on the FRED
simulation and as of now there are no outstanding problems"
* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
x86/fred: Fix init_task thread stack pointer initialization
MAINTAINERS: Add a maintainer entry for FRED
x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
x86/fred: Invoke FRED initialization code to enable FRED
x86/fred: Add FRED initialization functions
x86/syscall: Split IDT syscall setup code into idt_syscall_init()
KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
x86/traps: Add sysvec_install() to install a system interrupt handler
x86/fred: FRED entry/exit and dispatch code
x86/fred: Add a machine check entry stub for FRED
x86/fred: Add a NMI entry stub for FRED
x86/fred: Add a debug fault entry stub for FRED
x86/idtentry: Incorporate definitions/declarations of the FRED entries
x86/fred: Make exc_page_fault() work for FRED
x86/fred: Allow single-step trap and NMI when starting a new task
x86/fred: No ESPFIX needed when FRED is enabled
...
The kmalloc_array() in nfp_fl_lag_do_work() will return null, if
the physical memory has run out. As a result, if we dereference
the acti_netdevs, the null pointer dereference bugs will happen.
This patch adds a check to judge whether allocation failure occurs.
If it happens, the delayed work will be rescheduled and try again.
Fixes: bb9a8d031140 ("nfp: flower: monitor and offload LAG groups") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Louis Peens <louis.peens@corigine.com> Link: https://lore.kernel.org/r/20240308142540.9674-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Juntong Deng [Fri, 8 Mar 2024 11:33:04 +0000 (11:33 +0000)]
net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID
Currently getsockopt does not support NETLINK_LISTEN_ALL_NSID,
and we are unable to get the value of NETLINK_LISTEN_ALL_NSID
socket option through getsockopt.
This patch adds getsockopt support for NETLINK_LISTEN_ALL_NSID.