Miao-chen Chou [Wed, 17 Jun 2020 14:39:18 +0000 (16:39 +0200)]
Bluetooth: Update background scan and report device based on advertisement monitors
This calls hci_update_background_scan() when there is any update on the
advertisement monitors. If there is at least one advertisement monitor,
the filtering policy of scan parameters should be 0x00. This also reports
device found mgmt events if there is at least one monitor.
The following cases were tested with btmgmt advmon-* commands.
(1) add an ADV monitor and observe that passive scanning is
triggered.
(2) remove the last ADV monitor and observe that passive scanning is
terminated.
(3) with an LE peripheral paired, repeat (1) and observe that passive
scanning continues.
(4) with an LE peripheral paired, repeat (2) and observe that passive
scanning continues.
(5) with an ADV monitor, suspend/resume the host and observe that passive
scanning continues.
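The scan decision exercised by the test cases above can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: the function names, the two counters, and the use of 0x01 as the "accept list only" policy are illustrative and not taken from the patch; only the 0x00 policy for the monitor case comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_POLICY_ACCEPT_ALL  0x00 /* value named in the commit message */
#define FILTER_POLICY_ACCEPT_LIST 0x01 /* assumption: "accept list only" policy */

/*
 * Hypothetical model: passive scanning is needed when at least one
 * advertisement monitor exists or a known LE device is waiting to be seen.
 */
static bool passive_scan_needed(unsigned int adv_monitors,
                                unsigned int pending_le_devices)
{
    return adv_monitors > 0 || pending_le_devices > 0;
}

/*
 * With any monitor registered, every advertisement must reach the host,
 * so the accept-all policy is chosen regardless of paired devices.
 */
static uint8_t scan_filter_policy(unsigned int adv_monitors,
                                  unsigned int pending_le_devices)
{
    (void)pending_le_devices; /* does not affect the policy in this model */
    return adv_monitors > 0 ? FILTER_POLICY_ACCEPT_ALL
                            : FILTER_POLICY_ACCEPT_LIST;
}
```

In this model, cases (1) and (2) correspond to `passive_scan_needed(1, 0)` and `passive_scan_needed(0, 0)`, while cases (3) and (4) keep scanning because the second counter is non-zero.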
Miao-chen Chou [Wed, 17 Jun 2020 14:39:17 +0000 (16:39 +0200)]
Bluetooth: Notify adv monitor removed event
This notifies management sockets on MGMT_EV_ADV_MONITOR_REMOVED event.
The following test was performed.
- Start two btmgmt consoles, issue a btmgmt advmon-remove command on one
console and observe a MGMT_EV_ADV_MONITOR_REMOVED event on the other.
Miao-chen Chou [Wed, 17 Jun 2020 14:39:16 +0000 (16:39 +0200)]
Bluetooth: Notify adv monitor added event
This notifies management sockets on MGMT_EV_ADV_MONITOR_ADDED event.
The following test was performed.
- Start two btmgmt consoles, issue a btmgmt advmon-add command on one
console and observe a MGMT_EV_ADV_MONITOR_ADDED event on the other.
Miao-chen Chou [Wed, 17 Jun 2020 14:39:15 +0000 (16:39 +0200)]
Bluetooth: Add handler of MGMT_OP_REMOVE_ADV_MONITOR
This adds the request handler of MGMT_OP_REMOVE_ADV_MONITOR command.
Note that the controller-based monitoring is not yet in place. This
removes the internal monitor(s) without sending HCI traffic, so the
request returns immediately.
The following test was performed.
- Issue btmgmt advmon-remove with valid and invalid handles.
Miao-chen Chou [Wed, 17 Jun 2020 14:39:14 +0000 (16:39 +0200)]
Bluetooth: Add handler of MGMT_OP_ADD_ADV_PATTERNS_MONITOR
This adds the request handler of MGMT_OP_ADD_ADV_PATTERNS_MONITOR command.
Note that the controller-based monitoring is not yet in place. This tracks
the content of the monitor without sending HCI traffic, so the request
returns immediately.
The following manual test was performed.
- Issue btmgmt advmon-add with valid and invalid inputs.
- Issue btmgmt advmon-add more times than the allowed number of monitors.
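The host-only bookkeeping described for the add and remove handlers (track monitors, hand out handles, reject requests beyond a limit, no HCI traffic) might look roughly like the following. This is a sketch, not the kernel code: `MAX_MONITORS`, the errno values, and the convention that handle 0 removes all monitors are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MONITORS 4 /* hypothetical limit, not the kernel's value */

struct monitor_table {
    bool used[MAX_MONITORS]; /* slot i holds handle i + 1 */
};

/* Returns a non-zero handle on success, or a negative errno when full. */
static int monitor_add(struct monitor_table *t)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_MONITORS; i++) {
        if (!t->used[i]) {
            t->used[i] = true;
            return i + 1;
        }
    }
    return -ENOMEM; /* more monitors requested than allowed */
}

/* Handle 0 removes every monitor (assumed convention). */
static int monitor_remove(struct monitor_table *t, uint16_t handle)
{
    assert(t != NULL);
    if (handle == 0) {
        for (int i = 0; i < MAX_MONITORS; i++)
            t->used[i] = false;
        return 0;
    }
    if (handle > MAX_MONITORS || !t->used[handle - 1])
        return -ENOENT; /* invalid or unknown handle */
    t->used[handle - 1] = false;
    return 0;
}
```

Because nothing is sent to the controller, both operations complete synchronously, which matches the "request returns immediately" remark in the commit messages.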
Miao-chen Chou [Wed, 17 Jun 2020 14:39:13 +0000 (16:39 +0200)]
Bluetooth: Add handler of MGMT_OP_READ_ADV_MONITOR_FEATURES
This adds the request handler of the MGMT_OP_READ_ADV_MONITOR_FEATURES
command. Since the controller-based monitoring is not yet in place, this
reports only the supported features, not the enabled features.
The following test was performed.
- Issue btmgmt advmon-features.
Add the get device flags and set device flags mgmt ops and the device
flags changed event. Their behavior is described in detail in
mgmt-api.txt in bluez.
Sample btmon trace when a HID device is added (trimmed to 75 chars):
Alain Michaud [Thu, 11 Jun 2020 02:01:56 +0000 (02:01 +0000)]
Bluetooth: centralize default value initialization.
This patch centralizes the initialization of default parameters. This
is required to allow clients to more easily customize the default
system parameters.
Alain Michaud [Thu, 11 Jun 2020 02:01:55 +0000 (02:01 +0000)]
Bluetooth: mgmt: read/set system parameter definitions
This patch submits the corresponding kernel definitions to mgmt.h.
This is submitted before the implementation to avoid any conflicts in
value allocations.
Bluetooth: hci_qca: Request Tx clock vote off only when Tx is pending
The Tx pending flag is set to true when the HOST IBS state is AWAKE or
WAKING. If the IBS state is ASLEEP, the Tx clock is already voted
off. As a further optimization, call serial_clock_vote() directly
instead of qca_wq_serial_tx_clock_vote_off(): at this point in
qca_suspend() the data has already been sent, so there is no need to
wake up hci to send data.
Bluetooth: hci_qca: Increase SoC idle timeout to 200ms
In some versions of WCN399x, the SoC idle timeout is configured
as 80ms instead of 20ms or 40ms. To honor all the SoCs
supported by the driver, increase the SoC idle timeout to 200ms.
Bluetooth: hci_qca: Disable SoC debug logging for WCN3991
By default, WCN3991 sends debug packets to the HOST via ACL packets
with header 0xDC2E. This logging is not required on commercial
devices. With this patch, SoC logging is disabled after firmware
download.
Alain Michaud [Thu, 11 Jun 2020 19:50:41 +0000 (19:50 +0000)]
Bluetooth: Add support for BT_PKT_STATUS CMSG data for SCO connections
This change adds support for reporting the BT_PKT_STATUS to the socket
CMSG data to allow the implementation of a packet loss correction on
erroneous data received on the SCO socket.
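From user space, the status would arrive as ancillary (CMSG) data alongside each SCO packet. The sketch below shows how such a control message could be located with the standard CMSG macros. The numeric values of `SOL_BLUETOOTH` and `BT_SCM_PKT_STATUS` are assumptions here and should come from the Bluetooth headers in real code; `demo_roundtrip()` is a purely illustrative helper that hand-builds a control message instead of calling recvmsg() on a socket.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values; real code should take them from the Bluetooth headers. */
#ifndef SOL_BLUETOOTH
#define SOL_BLUETOOTH 274
#endif
#ifndef BT_SCM_PKT_STATUS
#define BT_SCM_PKT_STATUS 0x03
#endif

/*
 * Walk the ancillary data of a received message and extract the packet
 * status byte. Returns 0 and fills *status when found, -1 otherwise.
 */
static int sco_find_pkt_status(struct msghdr *msg, uint8_t *status)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_BLUETOOTH &&
            c->cmsg_type == BT_SCM_PKT_STATUS &&
            c->cmsg_len >= CMSG_LEN(sizeof(*status))) {
            memcpy(status, CMSG_DATA(c), sizeof(*status));
            return 0;
        }
    }
    return -1;
}

/* Demo only: build the control message by hand, then parse it back. */
static int demo_roundtrip(uint8_t sent)
{
    union {
        char buf[CMSG_SPACE(sizeof(uint8_t))];
        struct cmsghdr align; /* forces suitable alignment */
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *c;
    uint8_t got = 0;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_BLUETOOTH;
    c->cmsg_type = BT_SCM_PKT_STATUS;
    c->cmsg_len = CMSG_LEN(sizeof(sent));
    memcpy(CMSG_DATA(c), &sent, sizeof(sent));

    return sco_find_pkt_status(&msg, &got) == 0 ? got : -1;
}
```

An application would presumably enable the feature with a socket option first and then call `sco_find_pkt_status()` on the `struct msghdr` filled in by recvmsg(); a non-zero status could then trigger its packet loss concealment path.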
The patch was partially developed by Marcel Holtmann and validated by
Hsin-yu Chao.
Use device_init_wakeup to allow the Bluetooth dev to wake the system
from suspend. Currently, the device can wake the system but no
power/wakeup entry is created in sysfs to allow userspace to disable
wakeup.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Use the parent device's power/wakeup to control whether we support
remote wake. If remote wakeup is disabled, Bluetooth will not enable
scanning for incoming connections.
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Set the correct parent dev when registering hdev. This allows userspace
tools to find the parent device (for example, to set the power/wakeup
property).
Before this change, the path was /sys/devices/virtual/bluetooth/hci0
and after this change, it looks more like:
/sys/bus/mmc/devices/mmc1:0001/mmc1:0001:2/bluetooth/hci0
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Due to race conditions between qca_hw_error and qca_controller_memdump
during SSR timeout, the same pointer is freed twice. This results in a
double free. Now a lock is acquired before checking the status of the
SSR state.
Bluetooth: Allow suspend even when preparation has failed
It is preferable to allow suspend even when Bluetooth has problems
preparing for sleep. When Bluetooth fails to finish preparing for
suspend, log the error and allow the suspend notifier to continue
instead.
To also make it clearer why suspend failed, change bt_dev_dbg to
bt_dev_err when handling the suspend timeout.
Fixes: dd522a7429b07e ("Bluetooth: Handle LE devices during suspend")
Reported-by: Len Brown <len.brown@intel.com>
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Bluetooth: hci_qca: Refactor error handling in qca_suspend()
If waiting for IBS sleep times out, jump to the error handler. This is
easier to read than multiple 'if' branches and a fall through to the
error handler.
Bluetooth: hci_qca: Skip serdev wait when no transfer is pending
qca_suspend() calls serdev_device_wait_until_sent() regardless of
whether a transfer is pending. While it does no active harm, since
the function should return immediately, it makes the code more
confusing. Add a flag to track whether a transfer is pending and
only call serdev_device_wait_until_sent() if needed.
Bluetooth: hci_qca: Only remove TX clock vote after TX is completed
qca_suspend() removes the vote for the UART TX clock after
writing an IBS sleep request to the serial buffer. This is
not a good idea since there is no guarantee that the request
has been sent at this point. Instead remove the vote after
successfully entering IBS sleep. This also fixes the issue
of the vote being removed in case of an aborted suspend due
to a failure of entering IBS sleep.
Bluetooth: hci_qca: Simplify determination of serial clock on/off state from votes
The serial clocks should be on when there is a vote for at least one
of the clocks (RX or TX), and off when there is no 'on' vote. The
current logic to determine the combined state is repeated in the code
paths for the different types of votes. Use a single statement in the
common path instead.
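The rule stated above reduces to a single boolean expression. The following stand-alone model illustrates it; the struct, enum, and function names are hypothetical and do not mirror the driver's actual identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct clock_votes {
    bool tx; /* current TX vote */
    bool rx; /* current RX vote */
};

enum clock_vote { VOTE_TX_ON, VOTE_TX_OFF, VOTE_RX_ON, VOTE_RX_OFF };

/* Combined state: on if at least one of the two votes is on. */
static bool clocks_on(const struct clock_votes *v)
{
    return v->tx || v->rx;
}

/*
 * Record a vote, then derive the combined state once in the common path.
 * Returns true when the combined on/off state changed, which is the point
 * where a driver would actually switch the clocks.
 */
static bool apply_vote(struct clock_votes *v, enum clock_vote vote)
{
    bool was_on;

    assert(v != NULL);
    was_on = clocks_on(v);

    switch (vote) {
    case VOTE_TX_ON:  v->tx = true;  break;
    case VOTE_TX_OFF: v->tx = false; break;
    case VOTE_RX_ON:  v->rx = true;  break;
    case VOTE_RX_OFF: v->rx = false; break;
    }

    return clocks_on(v) != was_on;
}
```

Each `case` only records the vote; the on/off decision lives in one place, so adding a third kind of vote would not require touching the state computation.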
Dan Carpenter [Fri, 29 May 2020 09:59:48 +0000 (12:59 +0300)]
Bluetooth: hci_qca: Fix an error pointer dereference
When a function like devm_clk_get_optional() returns both error
pointers on error and NULL, then the NULL return means that the optional
feature is deliberately disabled. It is a special sort of success and
should not trigger an error message. The surrounding code should be
written to check for NULL and not crash.
On the other hand, if we encounter an error, then the probe function
should clean up and return a failure.
In this code, if devm_clk_get_optional() returns an error pointer then
the kernel will crash inside the call to:
clk_set_rate(qcadev->susclk, SUSCLK_RATE_32KHZ);
The error handling must be updated to prevent that.
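The three possible outcomes (valid pointer, NULL, error pointer) can be demonstrated in user space with simplified re-implementations of the kernel's ERR_PTR helpers. Everything below is a sketch: `get_optional_clk()`, `probe()`, and the `clk_lookup` enum are hypothetical stand-ins, and -EIO is an arbitrary example error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline bool IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct clk { unsigned long rate; };

enum clk_lookup { CLK_PRESENT, CLK_ABSENT, CLK_LOOKUP_FAILED };

static struct clk fake_susclk;

/* Hypothetical "optional" getter: NULL means "not configured", not failure. */
static struct clk *get_optional_clk(enum clk_lookup outcome)
{
    switch (outcome) {
    case CLK_PRESENT: return &fake_susclk;
    case CLK_ABSENT:  return NULL;
    default:          return ERR_PTR(-EIO);
    }
}

static int probe(enum clk_lookup outcome)
{
    struct clk *susclk = get_optional_clk(outcome);

    if (IS_ERR(susclk))
        return (int)PTR_ERR(susclk); /* real failure: clean up and bail out */

    if (susclk)
        susclk->rate = 32768; /* only touch the clock when it exists */

    return 0; /* NULL is a special sort of success */
}
```

The crash described in the commit message corresponds to skipping the `IS_ERR()` check and dereferencing the error pointer as if it were a clock.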
Fixes: 77131dfec6af ("Bluetooth: hci_qca: Replace devm_gpiod_get() with devm_gpiod_get_optional()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Alain Michaud [Mon, 1 Jun 2020 14:20:59 +0000 (14:20 +0000)]
Bluetooth: Removing noisy dbg message
This patch removes a particularly noisy dbg message. The debug message
isn't particularly interesting for debuggability so it was simply
removed to reduce noise in dbg logs.
When running with conntrack rules, the dropped overlap fragments may cause
EPERM to be returned to sendto. Instead of completely failing, just ignore
those errors and continue. If this causes packets with overlap fragments to
be dropped as expected, that is okay. And if it causes packets that are
expected to be received to be dropped, which should not happen, it will be
detected as failure.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vasily Averin [Tue, 2 Jun 2020 12:55:26 +0000 (15:55 +0300)]
net_failover: fixed rollback in net_failover_open()
found by smatch:
drivers/net/net_failover.c:65 net_failover_open() error:
we previously assumed 'primary_dev' could be null (see line 43)
Fixes: cfc80d9a1163 ("net: Introduce net_failover driver")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no actual tipc_node refcnt leak as stated in the above commit.
The refcnt is held carefully for the case of an asynchronous decryption
(i.e. -EINPROGRESS/-EBUSY and skb = NULL is returned), so that the node
object cannot be freed in the meantime. The counter will be re-balanced
when the operation's callback arrives with the decrypted buffer, if any.
In other cases, e.g. a synchronous crypto operation, the counter will be
decreased immediately when it is done.
Now with that commit, a kernel panic will occur when there is no node
found (i.e. n = NULL) in 'tipc_rcv()', or on a premature release of the
node object.
This commit solves the issues by reverting the said commit, but keeping
the one valid case, where 'skb_linearize()' fails.
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ronak Doshi [Tue, 2 Jun 2020 03:02:39 +0000 (20:02 -0700)]
vmxnet3: allow rx flow hash ops only when rss is enabled
It makes sense to allow the get/set rx flow hash callbacks only
when rss is enabled. This patch restricts the get_rss_hash_opts and
set_rss_hash_opts methods so that Rx flow hash configurations can be
queried and configured only when rss is enabled.
Signed-off-by: Ronak Doshi <doshir@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Luo bin [Tue, 2 Jun 2020 00:40:32 +0000 (08:40 +0800)]
hinic: add set_channels ethtool_ops support
Add support for changing the TX/RX queue number with "ethtool -L combined".
V5 -> V6: remove check for carrier in hinic_xmit_frame
V4 -> V5: change time zone in patch header
V3 -> V4: update date in patch header
V2 -> V3: remove check for zero channels->combined_count
V1 -> V2: update commit message("ethtool -L" to "ethtool -L combined")
V0 -> V1: remove check for channels->tx_count/rx_count/other_count
Signed-off-by: Luo bin <luobin9@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using make kselftest TARGETS=bpf, tools/bpf is built with
MAKEFLAGS=rR, which causes $(CXX) to be undefined, which in turn causes
the build to fail with
CXX test_cpp
/bin/sh: 2: g: not found
Fix by adding a default $(CXX) value, like tools/build/feature/Makefile
already does.
When using make kselftest TARGETS=bpf, tools/bpf is built with
MAKEFLAGS=rR, which causes $(COMPILE.c) to be undefined, which in turn
causes the build to fail with
CC kselftest/bpf/tools/build/bpftool/map_perf_ring.o
/bin/sh: 1: -MMD: not found
Fix by using $(CC) $(CFLAGS) -c instead of $(COMPILE.c).
Since commit 0ebeea8ca8a4 ("bpf: Restrict bpf_probe_read{, str}() only to
archs where they work") 44 verifier tests fail on s390 due to not having
bpf_probe_read anymore. Fix by using bpf_probe_read_kernel.
Certain kernel functions (e.g. get_vtimer/set_vtimer) cause a kernel
panic when the stack is not 8-byte aligned. Currently JITed BPF programs
may trigger this by allocating stack frames with non-rounded sizes and
then being interrupted. Fix by using a rounded fp->aux->stack_depth.
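Rounding a stack depth up to the next multiple of 8 is a one-line bit operation. The helper below is an illustrative user-space sketch (the name `round_up_stack()` is hypothetical), not the JIT's code.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_ALIGN 8u /* required alignment in bytes, a power of two */

/* Round depth up to the next multiple of STACK_ALIGN. */
static uint32_t round_up_stack(uint32_t depth)
{
    /* Guard against wrap-around for absurdly large inputs. */
    assert(depth <= UINT32_MAX - (STACK_ALIGN - 1));
    return (depth + STACK_ALIGN - 1) & ~(STACK_ALIGN - 1);
}
```

For example, a program that uses 13 bytes of stack would get a 16-byte frame, while depths that are already multiples of 8 are left unchanged.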
Andrii Nakryiko [Tue, 2 Jun 2020 05:03:49 +0000 (22:03 -0700)]
selftests/bpf: Fix sample_cnt shared between two threads
Make sample_cnt volatile to fix a possible selftest failure due to a compiler
optimization preventing the latest sample_cnt value from being visible to the
main thread. sample_cnt is incremented in a background thread, which is then
joined into the main thread. So in terms of visibility, the sample_cnt update
is ok. But because it's not volatile, the compiler might make optimizations
that would prevent the main thread from seeing the latest updated value. Fix
this by marking the global variable volatile.
====================
This series fixes an issue originally reported by Lorenz Bauer where using
the bpf_skb_adjust_room() helper hid a checksum bug since it wasn't adjusting
CHECKSUM_UNNECESSARY's skb->csum_level after decap. The fix is two-fold:
i) We do a safe reset in bpf_skb_adjust_room() to CHECKSUM_NONE with an opt-
out flag BPF_F_ADJ_ROOM_NO_CSUM_RESET.
ii) We add a new bpf_csum_level() for the latter in order to allow users to
manually inc/dec the skb->csum_level when needed.
The series is rebased against latest bpf-next tree. It can be applied there,
or to bpf after the merge win sync from net-next.
Daniel Borkmann [Tue, 2 Jun 2020 14:58:34 +0000 (16:58 +0200)]
bpf, selftests: Adapt cls_redirect to call csum_level helper
Adapt bpf_skb_adjust_room() to pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET flag and
use the new bpf_csum_level() helper to inc/dec the checksum level by one after
the encap/decap.
Daniel Borkmann [Tue, 2 Jun 2020 14:58:33 +0000 (16:58 +0200)]
bpf: Add csum_level helper for fixing up csum levels
Add a bpf_csum_level() helper which BPF programs can use in combination
with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
flag to the latter to avoid falling back to CHECKSUM_NONE.
bpf_csum_level() allows adjusting CHECKSUM_UNNECESSARY skb->csum_levels
via BPF_CSUM_LEVEL_{INC,DEC}, which calls __skb_{incr,decr}_checksum_unnecessary()
on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET, which sets the skb's
csum to CHECKSUM_NONE, as well as a BPF_CSUM_LEVEL_QUERY to just return the
current level. Without this helper, there is no other way to adjust the
skb->csum_level. I did not add an extra dummy flags argument, as there is plenty
of free bit space in the level argument itself if ever needed in the future.
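The four operations described above can be approximated with a small user-space model. This is a simplified sketch and not the kernel implementation: `struct mock_skb`, the `csum_op` enum, the maximum level of 3, and the error codes are assumptions chosen to illustrate the state transitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1 };
#define MAX_CSUM_LEVEL 3 /* assumed upper bound for the level counter */

enum csum_op { CSUM_LEVEL_QUERY, CSUM_LEVEL_INC, CSUM_LEVEL_DEC, CSUM_LEVEL_RESET };

struct mock_skb {
    uint8_t ip_summed;  /* CHECKSUM_NONE or CHECKSUM_UNNECESSARY */
    uint8_t csum_level; /* number of extra validated checksum layers */
};

static int mock_csum_level(struct mock_skb *skb, enum csum_op op)
{
    assert(skb != NULL);

    switch (op) {
    case CSUM_LEVEL_INC: /* one more header layer is known to be valid */
        if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
            if (skb->csum_level < MAX_CSUM_LEVEL)
                skb->csum_level++;
        } else {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            skb->csum_level = 0;
        }
        return 0;
    case CSUM_LEVEL_DEC: /* a validated layer was stripped */
        if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
            if (skb->csum_level == 0)
                skb->ip_summed = CHECKSUM_NONE;
            else
                skb->csum_level--;
        }
        return 0;
    case CSUM_LEVEL_RESET: /* force software validation */
        skb->ip_summed = CHECKSUM_NONE;
        skb->csum_level = 0;
        return 0;
    case CSUM_LEVEL_QUERY:
        return skb->ip_summed == CHECKSUM_UNNECESSARY ? skb->csum_level
                                                      : -EACCES;
    }
    return -EINVAL;
}
```

In this model, decapsulating a packet whose only validated checksum was the outer UDP one (level 0) followed by a decrement ends in CHECKSUM_NONE, so the inner TCP checksum would be verified in software.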
Daniel Borkmann [Tue, 2 Jun 2020 14:58:32 +0000 (16:58 +0200)]
bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
Lorenz recently reported:
In our TC classifier cls_redirect [0], we use the following sequence of
helper calls to decapsulate a GUE (basically IP + UDP + custom header)
encapsulated packet:
It seems like some checksums of the inner headers are not validated in
this case. For example, a TCP SYN packet with invalid TCP checksum is
still accepted by the network stack and elicits a SYN ACK. [...]
That is, we receive the following packet from the driver:
| ETH | IP | UDP | GUE | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY
ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
On this packet we run skb_adjust_room_mac(-encap_len), and get the following:
| ETH | IP | TCP |
skb->ip_summed == CHECKSUM_UNNECESSARY
Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
turned into a no-op due to CHECKSUM_UNNECESSARY.
The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.
Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
to opt out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually, e.g. through the bpf_csum_level() helper that is added in a
subsequent patch.
The bpf_skb_proto_{4_to_6,6_to_4}() functions for NAT64/46 translation from the
BPF bpf_skb_change_proto() helper use the bpf_skb_net_hdr_{push,pop}() pair
internally as well, but they don't change layers, only transition between v4 and
v6 and vice versa; therefore no adaptation is required there.
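The "safe default with opt-out" behaviour can be summarised as a tiny model. The sketch below is illustrative only: `struct mock_skb`, `adjust_room_csum_fixup()`, and the bit position chosen for the flag are assumptions rather than the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1 };

/* Stand-in for BPF_F_ADJ_ROOM_NO_CSUM_RESET; the bit position is assumed. */
#define ADJ_ROOM_NO_CSUM_RESET (1ULL << 5)

struct mock_skb {
    uint8_t ip_summed;
    uint8_t csum_level;
};

/*
 * Called after headers were added or removed. Unless the caller opted
 * out, fall back to CHECKSUM_NONE so the stack re-validates checksums.
 */
static void adjust_room_csum_fixup(struct mock_skb *skb, uint64_t flags)
{
    assert(skb != NULL);

    if (flags & ADJ_ROOM_NO_CSUM_RESET)
        return; /* caller promises to fix up the level manually */

    if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
        skb->ip_summed = CHECKSUM_NONE;
        skb->csum_level = 0;
    }
}
```

A program that passes the opt-out flag takes over responsibility for the checksum level, for instance by adjusting it explicitly after the encap or decap.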
Jules Irenge [Mon, 1 Jun 2020 18:45:52 +0000 (19:45 +0100)]
sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
Sparse reports a warning at efx_ef10_try_update_nic_stats_vf()
warning: context imbalance in efx_ef10_try_update_nic_stats_vf()
- unexpected unlock
The root cause is the missing annotation at
efx_ef10_try_update_nic_stats_vf()
Add the missing __must_hold(&efx->stats_lock) annotation.
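For readers unfamiliar with the annotation: `__must_hold()` tells sparse that a function is entered and left with the given lock held, even if it drops the lock in between. The sketch below models the pattern in user space; the mock lock, `struct stats_ctx`, and `try_update_stats()` are hypothetical, and the macro definition is an assumption modelled on how such annotations are typically defined (it expands to nothing for a normal compiler).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only sparse (__CHECKER__) sees the context attribute. */
#ifdef __CHECKER__
#define __must_hold(x) __attribute__((context(x, 1, 1)))
#else
#define __must_hold(x)
#endif

struct mock_lock { bool held; };

static void lock_acquire(struct mock_lock *l) { assert(!l->held); l->held = true; }
static void lock_release(struct mock_lock *l) { assert(l->held); l->held = false; }

struct stats_ctx {
    struct mock_lock stats_lock;
    int refreshes;
};

/*
 * The caller holds ctx->stats_lock on entry and still holds it on return.
 * The lock is dropped temporarily around the slow part, which is what an
 * unannotated function would be flagged for ("unexpected unlock").
 */
static int try_update_stats(struct stats_ctx *ctx)
    __must_hold(&ctx->stats_lock)
{
    assert(ctx != NULL);

    lock_release(&ctx->stats_lock); /* slow operation runs unlocked */
    ctx->refreshes++;
    lock_acquire(&ctx->stats_lock); /* restore the caller's expectation */
    return 0;
}
```

The annotation changes no generated code; it only documents the locking contract so that static analysis can check both the function and its callers.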
Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ayush Sawal [Mon, 1 Jun 2020 17:41:59 +0000 (23:11 +0530)]
Crypto/chcr: Fixes a coccinile check error
This fixes an error observed after running the coccinelle check.
drivers/crypto/chelsio/chcr_algo.c:1462:5-8: Unneeded variable:
"err". Return "0" on line 1480
This line was missed in commit 567be3a5d227 ("crypto:
chelsio - Use multiple txq/rxq per tfm to process the requests").
Fixes: 567be3a5d227 ("crypto: chelsio - Use multiple txq/rxq per tfm to process the requests")
V1->V2
-Modified subject.
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ayush Sawal [Mon, 1 Jun 2020 17:41:58 +0000 (23:11 +0530)]
Crypto/chcr: Fixes compilations warnings
This patch fixes the compilation warnings displayed by sparse tool for
chcr driver.
V1->V2
Avoid type casting by using get_unaligned_be32() and
put_unaligned_be16/32() functions.
The key which comes from the stack is a u8 byte stream, so we store it in
an unsigned char array (ablkctx->key). The function get_aes_decrypt_key()
is used to calculate the reverse round key for decryption. For this
operation the key has to be divided into 4-byte words, so to extract 4 bytes
from a u8 byte stream and store them in a u32 variable, get_unaligned_be32()
is used. Similarly, for copying the key back from a u32 variable to the
original u8 key stream, put_unaligned_be32() is used.
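The idea behind the unaligned big-endian accessors can be shown with portable user-space equivalents. `load_be32()` and `store_be32()` below are illustrative stand-ins written for this example, not the kernel helpers themselves; they assemble the value byte by byte, so no pointer cast and no alignment requirement is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value from an arbitrarily aligned byte stream. */
static uint32_t load_be32(const uint8_t *p)
{
    assert(p != NULL);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a 32-bit value back into the byte stream in big-endian order. */
static void store_be32(uint32_t v, uint8_t *p)
{
    assert(p != NULL);
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

Casting a `u8 *` to a `u32 *` instead would both trip sparse's type checks and risk misaligned accesses, which is why the byte-wise form is preferred for key material.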
Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current design enables ktls setting from start, which is not
efficient. Now the feature will be enabled when user demands
TLS offload on any interface.
v1->v2:
- taking ULD module refcount till any single connection exists.
- taking rtnl_lock() before clearing tls_devops.
v2->v3:
- cxgb4 is now registering to tlsdev_ops.
- module refcount inc/dec in chcr.
- refcount is only for connections.
- removed new code from cxgb_set_feature().
v3->v4:
- fixed warning message.
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu [Mon, 1 Jun 2020 03:55:03 +0000 (11:55 +0800)]
ipv6: fix IPV6_ADDRFORM operation logic
Socket option IPV6_ADDRFORM supports UDP/UDPLITE and TCP at present.
Previously the checking logic looked like:
    if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
        do_some_check;
    else if (sk->sk_protocol != IPPROTO_TCP)
        break;
After commit b6f6118901d1 ("ipv6: restrict IPV6_ADDRFORM operation"), TCP
was blocked as the logic changed to:
    if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
        do_some_check;
    else if (sk->sk_protocol == IPPROTO_TCP)
        do_some_check;
        break;
    else
        break;
Then after commit 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
UDP/UDPLITE were blocked as the logic changed to:
    if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
        do_some_check;
    if (sk->sk_protocol == IPPROTO_TCP)
        do_some_check;
    if (sk->sk_protocol != IPPROTO_TCP)
        break;
Fix it by using Eric's code and simply removing the break in the TCP check,
which looks like:
    if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
        do_some_check;
    else if (sk->sk_protocol == IPPROTO_TCP)
        do_some_check;
    else
        break;
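The difference between the broken and the fixed control flow is easy to check with a small model in which "break" is represented by returning false. The two functions below are illustrative sketches of the pseudocode above; the per-protocol checks are reduced to comments, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <netinet/in.h>

#ifndef IPPROTO_UDPLITE
#define IPPROTO_UDPLITE 136 /* fallback for systems that lack the constant */
#endif

/* Logic after 82c9ae440857: the last test rejects everything but TCP. */
static bool addrform_proceeds_broken(int proto)
{
    /* UDP/UDPLITE and TCP checks would run here, as independent ifs. */
    if (proto != IPPROTO_TCP)
        return false; /* "break": UDP and UDPLITE end up here too */
    return true;
}

/* Fixed logic: one if / else-if / else chain, only unknown protocols break. */
static bool addrform_proceeds_fixed(int proto)
{
    if (proto == IPPROTO_UDP || proto == IPPROTO_UDPLITE)
        return true;  /* UDP-specific checks would run here */
    else if (proto == IPPROTO_TCP)
        return true;  /* TCP-specific checks would run here */
    else
        return false; /* "break" */
}
```

With the fixed chain, UDP, UDPLITE, and TCP sockets all reach the address-family conversion again, while any other protocol still bails out early.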
Fixes: 82c9ae440857 ("ipv6: fix restrict IPV6_ADDRFORM operation")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Thu, 28 May 2020 14:34:07 +0000 (22:34 +0800)]
tipc: Fix NULL pointer dereference in __tipc_sendstream()
tipc_sendstream() may send a zero-length packet. In that case
tipc_msg_append() does not allocate an skb, skb_peek_tail() returns NULL,
and msg_set_ack_required() triggers a NULL pointer dereference.
Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
Fixes: 0a3e060f340d ("tipc: add test for Nagle algorithm effectiveness")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
One of the pieces of feedback from recent review of BPF hooks for socket
lookup [0] was that new program types should use bpf_link-based
attachment.
This series introduces new bpf_link type for attaching to network
namespace. All link operations are supported. Errors returned from ops
follow cgroup example. Patch 4 description goes into error semantics.
The major change in v2 is a switch away from RCU to mutex-only
synchronization. Andrii pointed out that it is not needed, and it makes
sense to keep locking straightforward.
Also, there were a couple of bugs in update_prog and fill_info initial
implementation, one picked up by kbuild. Those are now fixed. Tests have
been extended to cover them. Full changelog below.
The series is organized as follows:
Patches 1-3 prepare a space in struct net to keep state for attached BPF
programs, and massage the code in flow_dissector to make it attach type
agnostic, to finally move it under kernel/bpf/.
Patch 4, the most important one, introduces new bpf_link link type for
attaching to network namespace.
Patch 5 unifies the update error (ENOLINK) between BPF cgroup and netns.
Patches 6-8 make libbpf and bpftool aware of the new link type.
Patches 9-12 add and extend tests to check that the low- and high-level
API for operating on links to netns works as intended.
Thanks to Alexei, Andrii, Lorenz, Marek, and Stanislav for feedback.
- Switch to mutex-only synchronization. Don't rely on RCU grace period
guarantee when accessing struct net from link release / update /
fill_info, and when accessing bpf_link from pernet pre_exit
callback. (Andrii)
- Drop patch 1, no longer needed with mutex-only synchronization.
- Don't leak uninitialized variable contents from fill_info callback
when link is in defunct state. (kbuild)
- Make fill_info treat the link as defunct (i.e. no attached netns) when
struct net refcount is 0, but link has not been yet auto-detached.
- Add missing BPF_LINK_TYPE define in bpf_types.h for new link type.
- Fix link update_prog callback to update the prog that will run, and
not just the link itself.
- Return EEXIST on prog attach when link already exists, and on link
create when prog is already attached directly. (Andrii)
- Return EINVAL on prog detach when link is attached. (Andrii)
- Fold __netns_bpf_link_attach into its only caller. (Stanislav)
- Get rid of a wrapper around container_of() (Andrii)
- Use rcu_dereference_protected instead of rcu_access_pointer on
update-side. (Stanislav)
- Make return-on-success from netns_bpf_link_create less
confusing. (Andrii)
- Adapt bpf_link for cgroup to return ENOLINK when updating a defunct
link. (Andrii, Alexei)
- Order new exported symbols in libbpf.map alphabetically (Andrii)
- Keep libbpf's "failed to attach link" warning message clear as to what
we failed to attach to (cgroup vs netns). (Andrii)
- Extract helpers for printing link attach type. (bpftool, Andrii)
- Switch flow_dissector tests to BPF skeleton and extend them to
exercise link-based flow dissector attachment. (Andrii)
- Harden flow dissector attachment tests with prog query checks after
prog attach/detach, or link create/update/close.
- Extend flow dissector tests to cover fill_info for defunct links.
- Rebase onto recent bpf-next
====================
Jakub Sitnicki [Sun, 31 May 2020 08:28:46 +0000 (10:28 +0200)]
selftests/bpf: Extend test_flow_dissector to cover link creation
Extend the existing flow_dissector test case to run tests once using direct
prog attachments, and then for the second time using indirect attachment
via link.
The intention is to exercise the newly added high-level API for attaching
programs to network namespace with links (bpf_program__attach_netns).
Jakub Sitnicki [Sun, 31 May 2020 08:28:45 +0000 (10:28 +0200)]
selftests/bpf: Convert test_flow_dissector to use BPF skeleton
Switch flow dissector test setup from custom BPF object loader to BPF
skeleton to save boilerplate and prepare for testing higher-level API for
attaching flow dissector with bpf_link.
To avoid depending on program order in the BPF object when populating the
flow dissector PROG_ARRAY map, change the program section names to contain
the program index into the map. This follows the example set by tailcall
tests.
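Encoding the map index in the section name means a loader can recover it by parsing the name instead of relying on program order. The sketch below assumes a naming scheme of the form "flow_dissector/N"; that exact format and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Extract the PROG_ARRAY index from a section name of the assumed form
 * "flow_dissector/<index>". Returns the index, or -1 if the name does
 * not match or carries trailing garbage.
 */
static int prog_index_from_section(const char *sec)
{
    int idx;
    char tail;

    assert(sec != NULL);

    /* Exactly one conversion must succeed: the index, with nothing after it. */
    if (sscanf(sec, "flow_dissector/%d%c", &idx, &tail) != 1 || idx < 0)
        return -1;
    return idx;
}
```

A test could then iterate over all programs in the object, call this on each section name, and store the program FD at the returned slot of the map.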
Jakub Sitnicki [Sun, 31 May 2020 08:28:44 +0000 (10:28 +0200)]
selftests/bpf, flow_dissector: Close TAP device FD after the test
test_flow_dissector leaves a TAP device after it's finished, potentially
interfering with other tests that will run after it. Fix it by closing the
TAP descriptor on cleanup.
Jakub Sitnicki [Sun, 31 May 2020 08:28:42 +0000 (10:28 +0200)]
bpftool: Support link show for netns-attached links
Make `bpftool link show` aware of the new link type, that is links attached to
netns. When listing netns-attached links, display the netns inode number as its
identifier and the link attach type.
Sample session:
# readlink /proc/self/ns/net
net:[4026532251]
# bpftool prog show
357: flow_dissector tag a04f5eef06a7f555 gpl
loaded_at 2020-05-30T16:53:51+0200 uid 0
xlated 16B jited 37B memlock 4096B
358: flow_dissector tag a04f5eef06a7f555 gpl
loaded_at 2020-05-30T16:53:51+0200 uid 0
xlated 16B jited 37B memlock 4096B
# bpftool link show
108: netns prog 357
netns_ino 4026532251 attach_type flow_dissector
# bpftool link -jp show
[{
"id": 108,
"type": "netns",
"prog_id": 357,
"netns_ino": 4026532251,
"attach_type": "flow_dissector"
}
]
(... after netns is gone ...)
# bpftool link show
108: netns prog 357
netns_ino 0 attach_type flow_dissector
# bpftool link -jp show
[{
"id": 108,
"type": "netns",
"prog_id": 357,
"netns_ino": 0,
"attach_type": "flow_dissector"
}
]
Jakub Sitnicki [Sun, 31 May 2020 08:28:41 +0000 (10:28 +0200)]
bpftool: Extract helpers for showing link attach type
Code for printing link attach_type is duplicated in a couple of places, and
likely will be duplicated for future link types as well. Create helpers to
prevent duplication.
Jakub Sitnicki [Sun, 31 May 2020 08:28:40 +0000 (10:28 +0200)]
libbpf: Add support for bpf_link-based netns attachment
Add bpf_program__attach_netns(), which uses the LINK_CREATE subcommand to create
an FD-based kernel bpf_link, for attach types tied to network namespace,
that is BPF_FLOW_DISSECTOR for the moment.
Jakub Sitnicki [Sun, 31 May 2020 08:28:39 +0000 (10:28 +0200)]
bpf, cgroup: Return ENOLINK for auto-detached links on update
Failure to update a bpf_link because it has been auto-detached by a dying
cgroup currently results in EINVAL error, even though the arguments passed
to bpf() syscall are not wrong.
bpf_links attaching to netns in this case will return ENOLINK, which
carries the message that the link is no longer attached to anything.
Change cgroup bpf_links to do the same to keep the uAPI errors consistent.
Fixes: 0c991ebc8c69 ("bpf: Implement bpf_prog replacement for an active bpf_cgroup_link")
Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-6-jakub@cloudflare.com
Jakub Sitnicki [Sun, 31 May 2020 08:28:38 +0000 (10:28 +0200)]
bpf: Add link-based BPF program attachment to network namespace
Extend bpf() syscall subcommands that operate on bpf_link, that is
LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO, to accept attach types tied to
network namespaces (only flow dissector at the moment).
Link-based and prog-based attachment can be used interchangeably, but only
one can exist at a time. Attempts to attach a link when a prog is already
attached directly, and the other way around, will be met with -EEXIST.
Attempts to detach a program when link exists result in -EINVAL.
Attachment of multiple links of same attach type to one netns is not
supported with the intention to lift the restriction when a use-case
presents itself. Because of that link create returns -E2BIG when trying to
create another netns link, when one already exists.
Link-based attachments to netns don't keep a netns alive by holding a ref
to it. Instead links get auto-detached from netns when the latter is being
destroyed, using a pernet pre_exit callback.
When auto-detached, the link lives in a defunct state as long as there are open
FDs for it. -ENOLINK is returned if a user tries to update a defunct link.
Because bpf_link to netns doesn't hold a ref to struct net, special care is
taken when releasing, updating, or filling link info. The netns might be
getting torn down when any of these link operations are in progress. That
is why auto-detach and update/release/fill_info are synchronized by the
same mutex. Also, link ops have to always check if auto-detach has not
happened yet and if netns is still alive (refcnt > 0).
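The error semantics listed in this commit message form a small state machine, which the following user-space model tries to capture. It is a sketch only: the struct and function names are hypothetical, a single attach type is modelled, locking is omitted, and the -ENOENT returned when detaching with nothing attached is an assumption not stated in the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct netns_attach {
    bool prog_attached; /* direct, prog-based attachment */
    bool link_attached; /* bpf_link currently bound to the netns */
    bool link_defunct;  /* link was auto-detached, FD still open */
};

static int prog_attach(struct netns_attach *a)
{
    assert(a != NULL);
    if (a->link_attached)
        return -EEXIST; /* a link already owns the attach point */
    a->prog_attached = true;
    return 0;
}

static int prog_detach(struct netns_attach *a)
{
    assert(a != NULL);
    if (a->link_attached)
        return -EINVAL; /* programs behind a link cannot be detached directly */
    if (!a->prog_attached)
        return -ENOENT; /* assumption: nothing to detach */
    a->prog_attached = false;
    return 0;
}

static int link_create(struct netns_attach *a)
{
    assert(a != NULL);
    if (a->link_attached)
        return -E2BIG;  /* only one link per attach type for now */
    if (a->prog_attached)
        return -EEXIST; /* a directly attached prog is in the way */
    a->link_attached = true;
    a->link_defunct = false;
    return 0;
}

static int link_update(struct netns_attach *a)
{
    assert(a != NULL);
    if (a->link_defunct)
        return -ENOLINK; /* netns is gone, link is attached to nothing */
    return a->link_attached ? 0 : -EINVAL;
}

/* Models the pernet pre_exit callback: links are auto-detached. */
static void netns_pre_exit(struct netns_attach *a)
{
    assert(a != NULL);
    if (a->link_attached) {
        a->link_attached = false;
        a->link_defunct = true;
    }
    a->prog_attached = false;
}
```

In the real code these transitions are serialized by a mutex, so that an update racing with netns teardown sees either a live link or a defunct one, never a half-detached state.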
Jakub Sitnicki [Sun, 31 May 2020 08:28:37 +0000 (10:28 +0200)]
flow_dissector: Move out netns_bpf prog callbacks
Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.
The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.
This is a code move with no functional changes intended.
Jakub Sitnicki [Sun, 31 May 2020 08:28:36 +0000 (10:28 +0200)]
net: Introduce netns_bpf for BPF programs attached to netns
In order to:
(1) attach more than one BPF program type to netns, or
(2) support attaching BPF programs to netns with bpf_link, or
(3) support multi-prog attach points for netns
we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.
Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.
Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.
This is similar to how it is organized for cgroup with cgroup_bpf.
Jakub Sitnicki [Sun, 31 May 2020 08:28:35 +0000 (10:28 +0200)]
flow_dissector: Pull locking up from prog attach callback
Split out the part of attach callback that happens with attach/detach lock
acquired. This structures the prog attach callback in a way that opens up
doors for moving the locking out of flow_dissector and into generic
callbacks for attaching/detaching progs to netns in subsequent patches.
Andrii Nakryiko [Mon, 1 Jun 2020 20:26:01 +0000 (13:26 -0700)]
libbpf: Add _GNU_SOURCE for reallocarray to ringbuf.c
On systems with recent enough glibc, reallocarray compat won't kick in, so
reallocarray() itself has to come from stdlib.h include. But _GNU_SOURCE is
necessary to enable it. So add it.
Fixes: bf99c936f947 ("libbpf: Add BPF ring buffer support")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200601202601.2139477-1-andriin@fb.com
Jiri Olsa [Sun, 31 May 2020 15:42:55 +0000 (17:42 +0200)]
bpf: Use tracing helpers for lsm programs
Currently lsm uses bpf_tracing_func_proto helpers which do
not include stack trace or perf event output. It's useful
to have those for bpftrace lsm support [1].
Use tracing_prog_func_proto helpers for lsm programs.
Introduce xdp_convert_frame_to_buff utility routine to initialize xdp_buff
fields from xdp_frames ones. Rely on xdp_convert_frame_to_buff in veth xdp
code.
====================
This option makes it possible to programmatically bind sockets
to netdevices. With the help of this option, sockets
of VRF-unaware applications can be distributed between
multiple VRFs with an eBPF program. This lets the applications
benefit from multiple possible routes.
v2:
- splitting up the patch to three parts
- lock_sk parameter for optional locking in sock_bindtoindex - Stanislav Fomichev
- testing the SO_BINDTODEVICE option - Andrii Nakryiko
====================
Ferenc Fejes [Sat, 30 May 2020 21:09:02 +0000 (23:09 +0200)]
selftests/bpf: Add test for SO_BINDTODEVICE opt of bpf_setsockopt
This test is intended to verify that the SO_BINDTODEVICE option works in
bpf_setsockopt. Because we are already at the SOL_SOCKET level in this
connect bpf prog, it is safe to verify the sanity at the beginning of
the connect_v4_prog by calling the bind_to_device test helper.
The testing environment is already created by the test_sock_addr.sh
script, so this test assumes that two netdevices already exist in
the system: a veth pair with names test_sock_addr1 and test_sock_addr2.
The test will try to bind the socket to those devices first.
Then the test assumes there is no netdevice named "nonexistent_dev",
so bpf_setsockopt will give us an ENODEV error.
At the end the test removes the device binding from the socket
by binding it to an empty name.
Ferenc Fejes [Sat, 30 May 2020 21:09:01 +0000 (23:09 +0200)]
bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt
Extend the supported sockopts in bpf_setsockopt with
SO_BINDTODEVICE. We call sock_bindtoindex with parameter
lock_sk = false in this context because we already own
the socket.
Ferenc Fejes [Sat, 30 May 2020 21:09:00 +0000 (23:09 +0200)]
net: Make locking in sock_bindtoindex optional
sock_bindtoindex is intended for kernel-wide usage, however
it locks the socket regardless of the context. This modification
relaxes that behavior: the socket is locked only when
sock_bindtoindex is called with lock_sk = true.
All users of sock_bindtoindex are updated accordingly.
====================
If a socket is running a BPF_SK_SKB_STREAM_VERDICT program and KTLS is
enabled, the data stream may be broken if both the TLS stream parser and
the BPF stream parser try to handle data. Fix this here by making the KTLS
stream parser run first to ensure TLS messages are received correctly
and then calling the verdict program. This is analogous to how we handle
a similar conflict on the TX side.
Note, this is a fix but it doesn't make sense to push this late to
bpf tree so targeting bpf-next and keeping fixes tags.
====================
John Fastabend [Fri, 29 May 2020 23:07:19 +0000 (16:07 -0700)]
bpf, selftests: Add test for ktls with skb bpf ingress policy
This adds a test for bpf ingress policy. To ensure data writes happen
as expected with extra TLS headers, we run these tests with data
verification enabled by default. This tests that received packets have
"PASS" stamped into the front of the payload.
====================
Implementation of Daniel's proposal for allowing DEVMAP entries to be
a device index, program fd pair.
Programs are run after XDP_REDIRECT and have access to both Rx device
and Tx device.
v4
- moved struct bpf_devmap_val from uapi to devmap.c, named the union
and dropped the prefix from the elements - Jesper
- fixed 2 bugs in selftests
v3
- renamed struct to bpf_devmap_val
- used offsetofend to check for expected map size, modification of
Toke's comment
- check for explicit value sizes
- adjusted switch statement in dev_map_run_prog per Andrii's comment
- changed SEC shortcut to xdp_devmap
- changed selftests to use skeleton and new map declaration
v2
- moved dev_map_ext_val definition to uapi to formalize the API for devmap
extensions; add bpf_ prefix to the prog_fd and prog_id entries
- changed devmap code to handle struct in a way that it can support future
extensions
- fixed subject in libbpf patch
v1
- fixed prog put on invalid program - Toke
- changed write value from id to fd per Toke's comments about capabilities
- add test cases
====================
John Fastabend [Fri, 29 May 2020 23:06:59 +0000 (16:06 -0700)]
bpf: Fix running sk_skb program types with ktls
KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.
The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock(), which will call tcp_read_sock(),
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done, the KTLS code then calls the data_ready()
callback, which, because we stacked KTLS on top of the bpf stream
verdict program, has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above, calling into tcp_rcv_skb() and dequeuing
an skb from the sk_receive_queue.
At this point the data stream is broken. Part of the stream was
handled by the KTLS side, and some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.
We've already broken the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.
Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halves and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.
Discovered while testing OpenSSL 3.0 Alpha2.0 release.
David Ahern [Fri, 29 May 2020 22:07:16 +0000 (16:07 -0600)]
selftest: Add tests for XDP programs in devmap entries
Add tests to verify the ability to add an XDP program to an
entry in a DEVMAP.
Add negative tests to show DEVMAP programs cannot be
attached to devices as a normal XDP program, and accesses
to egress_ifindex require BPF_XDP_DEVMAP attach type.
John Fastabend [Fri, 29 May 2020 23:06:41 +0000 (16:06 -0700)]
bpf: Refactor sockmap redirect code so its easy to reuse
We will need this block of code called from tls context shortly, so
let's refactor the redirect logic so it's easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.
No logic changes are intended.
Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
David Ahern [Fri, 29 May 2020 22:07:14 +0000 (16:07 -0600)]
xdp: Add xdp_txq_info to xdp_buff
Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.
From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.
Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.
David Ahern [Fri, 29 May 2020 22:07:13 +0000 (16:07 -0600)]
bpf: Add support to attach bpf program to a devmap entry
Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.
Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.
The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the program to see both
ingress and egress device indices.
XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with
Block attach of BPF_XDP_DEVMAP programs to devices.
Yonghong Song [Fri, 29 May 2020 00:48:10 +0000 (17:48 -0700)]
bpf: Use strncpy_from_unsafe_strict() in bpf_seq_printf() helper
In bpf_seq_printf() helper, when user specified a "%s" in the
format string, strncpy_from_unsafe() is used to read the actual string
to a buffer. The string could be a format string or a string in
the kernel data structure. It is really unlikely that the string
will reside in the user memory.
This is different from Commit b2a5212fb634 ("bpf: Restrict bpf_trace_printk()'s %s
usage and add %pks, %pus specifier") which still used
strncpy_from_unsafe() for "%s" to preserve the old behavior.
If in the future, bpf_seq_printf() indeed needs to read user
memory, we can implement "%pus" format string.
Based on discussion in [1], if the intent is to read kernel memory,
strncpy_from_unsafe_strict() should be used. So this patch
switches to strncpy_from_unsafe_strict().