git.proxmox.com Git - mirror_ubuntu-jammy-kernel.git/log

libbpf: refactor bpf_*_get_next_id() functions

In preparation for the introduction of a similar function for retrieving
the id of the next BTF object, consolidate the code from
bpf_prog_get_next_id() and bpf_map_get_next_id() in libbpf.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpf: synchronise BPF UAPI header with tools

Synchronise the bpf.h header under tools, to report the addition of the
new BPF_BTF_GET_NEXT_ID syscall command for bpf().

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add new BPF_BTF_GET_NEXT_ID syscall command

Add a new command for the bpf() system call: BPF_BTF_GET_NEXT_ID is used
to cycle through all BTF objects loaded on the system.

The motivation is to be able to inspect (list) all BTF objects presents
on the system.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

test_bpf: Fix a new clang warning about xor-ing two numbers

r369217 in clang added a new warning about potential misuse of the xor
operator as an exponentiation operator:

../lib/test_bpf.c:870:13: warning: result of '10 ^ 300' is 294; did you
mean '1e300'? [-Wxor-used-as-pow]
                { { 4, 10 ^ 300 }, { 20, 10 ^ 300 } },
                       ~~~^~~~~
                       1e300
../lib/test_bpf.c:870:13: note: replace expression with '0xA ^ 300' to
silence this warning
../lib/test_bpf.c:870:31: warning: result of '10 ^ 300' is 294; did you
mean '1e300'? [-Wxor-used-as-pow]
                { { 4, 10 ^ 300 }, { 20, 10 ^ 300 } },
                                         ~~~^~~~~
                                         1e300
../lib/test_bpf.c:870:31: note: replace expression with '0xA ^ 300' to
silence this warning

The commit link for this new warning has some good logic behind wanting
to add it but this instance appears to be a false positive. Adopt its
suggestion to silence the warning but not change the code. According to
the differential review link in the clang commit, GCC may eventually
adopt this warning as well.

Link: https://github.com/ClangBuiltLinux/linux/issues/643
Link: https://github.com/llvm/llvm-project/commit/920890e26812f808a74c60ebc14cc636dac661c1
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: add include guard to tnum.h

Add a header include guard just in case.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: add BTF ids in procfs for file descriptors to BTF objects

Implement the show_fdinfo hook for BTF FDs file operations, and make it
print the id of the BTF object. This allows for a quick retrieval of the
BTF id from its FD; or it can help understanding what type of object
(BTF) the file descriptor points to.

v2:
- Do not expose data_size, only btf_id, in FD info.

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: Use PTR_ERR_OR_ZERO in xsk_map_inc()

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-af-xdp-xskmap-improvements'

Björn Töpel says:

====================
This series (v5 and counting) add two improvements for the XSKMAP,
used by AF_XDP sockets.

1. Automatic cleanup when an AF_XDP socket goes out of scope/is
   released. Instead of require that the user manually clears the
   "released" state socket from the map, this is done
   automatically. Each socket tracks which maps it resides in, and
   remove itself from those maps at relase. A notable implementation
   change, is that the sockets references the map, instead of the map
   referencing the sockets. Which implies that when the XSKMAP is
   freed, it is by definition cleared of sockets.

2. The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flag on insert,
   which this patch addresses.

v1->v2: Fixed deadlock and broken cleanup. (Daniel)
v2->v3: Rebased onto bpf-next
v3->v4: {READ, WRITE}_ONCE consistency. (Daniel)
        Socket release/map update race. (Daniel)
v4->v5: Avoid use-after-free on XSKMAP self-assignment [1]. (Daniel)
        Removed redundant assignment in xsk_map_update_elem().
        Variable name consistency; Use map_entry everywhere.

[1] https://lore.kernel.org/bpf/20190802081154.30962-1-bjorn.topel@gmail.com/T/#mc68439e97bc07fa301dad9fc4850ed5aa392f385
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: support BPF_EXIST and BPF_NOEXIST flags in XSKMAP

The XSKMAP did not honor the BPF_EXIST/BPF_NOEXIST flags when updating
an entry. This patch addresses that.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: remove AF_XDP socket from map when the socket is released

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-sk-storage-clone'

Stanislav Fomichev says:

====================
Currently there is no way to propagate sk storage from the listener
socket to a newly accepted one. Consider the following use case:

        fd = socket();
        setsockopt(fd, SOL_IP, IP_TOS,...);
        /* ^^^ setsockopt BPF program triggers here and saves something
         * into sk storage of the listener.
         */
        listen(fd, ...);
        while (client = accept(fd)) {
                /* At this point all association between listener
                 * socket and newly accepted one is gone. New
                 * socket will not have any sk storage attached.
                 */
        }

Let's add new BPF_F_CLONE flag that can be specified when creating
a socket storage map. This new flag indicates that map contents
should be cloned when the socket is cloned.

v4:
* drop 'goto err' in bpf_sk_storage_clone (Yonghong Song)
* add comment about race with bpf_sk_storage_map_free to the
  bpf_sk_storage_clone side as well (Daniel Borkmann)

v3:
* make sure BPF_F_NO_PREALLOC is always present when creating
  a map (Martin KaFai Lau)
* don't call bpf_sk_storage_free explicitly, rely on
  sk_free_unlock_clone to do the cleanup (Martin KaFai Lau)

v2:
* remove spinlocks around selem_link_map/sk (Martin KaFai Lau)
* BPF_F_CLONE on a map, not selem (Martin KaFai Lau)
* hold a map while cloning (Martin KaFai Lau)
* use BTF maps in selftests (Yonghong Song)
* do proper cleanup selftests; don't call close(-1) (Yonghong Song)
* export bpf_map_inc_not_zero
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: add sockopt clone/inheritance test

Add a test that calls setsockopt on the listener socket which triggers
BPF program. This BPF program writes to the sk storage and sets
clone flag. Make sure that sk storage is cloned for a newly
accepted connection.

We have two cloned maps in the tests to make sure we hit both cases
in bpf_sk_storage_clone: first element (sk_storage_alloc) and
non-first element(s) (selem_link_map).

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: sync bpf.h to tools/

Sync new sk storage clone flag.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: support cloning sk storage on accept()

Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: export bpf_map_inc_not_zero

Rename existing bpf_map_inc_not_zero to __bpf_map_inc_not_zero to
indicate that it's caller's responsibility to do proper locking.
Create and export bpf_map_inc_not_zero wrapper that properly
locks map_idr_lock. Will be used in the next commit to
hold a map while cloning a socket.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: fix race in test_tcp_rtt test

There is a race in this test between receiving the ACK for the
single-byte packet sent in the test, and reading the values from the
map.

This patch fixes this by having the client wait until there are no more
unacknowledged packets.

Before:
for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
< trimmed error messages >
993

After:
for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
10000

Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: relicense bpf_helpers.h and bpf_endian.h

bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper
definitions essential to almost every BPF program. Which makes them
useful not just for selftests. To be able to expose them as part of
libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR
BSD-2-Clause. This patch updates licensing of those two files.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hechao Li <hechaol@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Adam Barth <arb@fb.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Josef Bacik <jbacik@fb.com>
Acked-by: Joe Stringer <joe@wand.net.nz>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Acked-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: Teng Qin <palmtenor@gmail.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Michal Rostecki <mrostecki@opensuse.org>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: Don't call XDP_SETUP_PROG when nothing is changed

Don't uninstall an XDP program when none is installed, and don't install
an XDP program that has the same ID as the one already installed.

dev_change_xdp_fd doesn't perform any checks in case it uninstalls an
XDP program. It means that the driver's ndo_bpf can be called with
XDP_SETUP_PROG asking to set it to NULL even if it's already NULL. This
case happens if the user runs `ip link set eth0 xdp off` when there is
no XDP program attached.

The symmetrical case is possible when the user tries to set the program
that is already set.

The drivers typically perform some heavy operations on XDP_SETUP_PROG,
so they all have to handle these cases internally to return early if
they happen. This patch puts this check into the kernel code, so that
all drivers will benefit from it.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge branch 'bpf-af-xdp-wakeup'

Magnus Karlsson says:

====================
This patch set adds support for a new flag called need_wakeup in the
AF_XDP Tx and fill rings. When this flag is set by the driver, it
means that the application has to explicitly wake up the kernel Rx
(for the bit in the fill ring) or kernel Tx (for bit in the Tx ring)
processing by issuing a syscall. Poll() can wake up both and sendto()
will wake up Tx processing only.

The main reason for introducing this new flag is to be able to
efficiently support the case when application and driver is executing
on the same core. Previously, the driver was just busy-spinning on the
fill ring if it ran out of buffers in the HW and there were none to
get from the fill ring. This approach works when the application and
driver is running on different cores as the application can replenish
the fill ring while the driver is busy-spinning. Though, this is a
lousy approach if both of them are running on the same core as the
probability of the fill ring getting more entries when the driver is
busy-spinning is zero. With this new feature the driver now sets the
need_wakeup flag and returns to the application. The application can
then replenish the fill queue and then explicitly wake up the Rx
processing in the kernel using the syscall poll(). For Tx, the flag is
only set to one if the driver has no outstanding Tx completion
interrupts. If it has some, the flag is zero as it will be woken up by
a completion interrupt anyway. This flag can also be used in other
situations where the driver needs to be woken up explicitly.

As a nice side effect, this new flag also improves the Tx performance
of the case where application and driver are running on two different
cores as it reduces the number of syscalls to the kernel. The kernel
tells user space if it needs to be woken up by a syscall, and this
eliminates many of the syscalls. The Rx performance of the 2-core case
is on the other hand slightly worse, since there is a need to use a
syscall now to wake up the driver, instead of the driver
busy-spinning. It does waste less CPU cycles though, which might lead
to better overall system performance.

This new flag needs some simple driver support. If the driver does not
support it, the Rx flag is always zero and the Tx flag is always
one. This makes any application relying on this feature default to the
old behavior of not requiring any syscalls in the Rx path and always
having to call sendto() in the Tx path.

For backwards compatibility reasons, this feature has to be explicitly
turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
that you always turn it on as it has a large positive performance
impact for the one core case and does not degrade 2 core performance
and actually improves it for Tx heavy workloads.

Here are some performance numbers measured on my local,
non-performance optimized development system. That is why you are
seeing numbers lower than the ones from Björn and Jesper. 64 byte
packets at 40Gbit/s line rate. All results in Mpps. Cores == 1 means
that both application and driver is executing on the same core. Cores
== 2 that they are on different cores.

                              Applications
need_wakeup  cores    txpush    rxdrop      l2fwd
---------------------------------------------------------------
     n         1       0.07      0.06        0.03
     y         1       21.6      8.2         6.5
     n         2       32.3      11.7        8.7
     y         2       33.1      11.7        8.7

Overall, the need_wakeup flag provides the same or better performance
in all the micro-benchmarks. The reduction of sendto() calls in txpush
is large. Only a few per second is needed. For l2fwd, the drop is 50%
for the 1 core case and more than 99.9% for the 2 core case. Do not
know why I am not seeing the same drop for the 1 core case yet.

The name and inspiration of the flag has been taken from io_uring by
Jens Axboe. Details about this feature in io_uring can be found in
http://kernel.dk/io_uring.pdf, section 8.3. It also addresses most of
the denial of service and sendto() concerns raised by Maxim
Mikityanskiy in https://www.spinics.net/lists/netdev/msg554657.html.

The typical Tx part of an application will have to change from:

ret = sendto(fd,....)

to:

if (xsk_ring_prod__needs_wakeup(&xsk->tx))
       ret = sendto(fd,....)

and th Rx part from:

rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
if (!rcvd)
       return;

to:

rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
if (!rcvd) {
       if (xsk_ring_prod__needs_wakeup(&xsk->umem->fq))
              ret = poll(fd,.....);
       return;
}

v3 -> v4:
* Maxim found a possible race in the Tx part of the driver. The
  setting of the flag needs to happen before the sending, otherwise it
  might trigger this race. Fixed in ixgbe and i40e driver.
* Mellanox support contributed by Maxim
* Removed the XSK_DRV_CAN_SLEEP flag as it was not used
  anymore. Thanks to Sridhar for discovering this.
* For consistency the feature is now always called need_wakeup. There
  were some places where it was referred to as might_sleep, but they
  have been removed. Thanks to Sridhar for spotting.
* Fixed some typos in the commit messages

v2 -> v3:
* Converted the Mellanox driver to the new ndo in patch 1 as pointed
  out by Maxim
* Fixed the compatibility code of XDP_MMAP_OFFSETS so it now works.

v1 -> v2:
* Fixed bisectability problem pointed out by Jakub
* Added missing initiliztion of the Tx need_wakeup flag to 1

This patch has been applied against commit b753c5a7f99f ("Merge branch 'r8152-RX-improve'")

Structure of the patch set:

Patch 1: Replaces the ndo_xsk_async_xmit with ndo_xsk_wakeup to
         support waking up both Rx and Tx processing
Patch 2: Implements the need_wakeup functionality in common code
Patch 3-4: Add need_wakeup support to the i40e and ixgbe drivers
Patch 5: Add need_wakeup support to libbpf
Patch 6: Add need_wakeup support to the xdpsock sample application
Patch 7-8: Add need_wakeup support to the Mellanox mlx5 driver
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/mlx5e: Add AF_XDP need_wakeup support

This commit adds support for the new need_wakeup feature of AF_XDP. The
applications can opt-in by using the XDP_USE_NEED_WAKEUP bind() flag.
When this feature is enabled, some behavior changes:

RX side: If the Fill Ring is empty, instead of busy-polling, set the
flag to tell the application to kick the driver when it refills the Fill
Ring.

TX side: If there are pending completions or packets queued for
transmission, set the flag to tell the application that it can skip the
sendto() syscall and save time.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled:

       | without need_wakeup  | with need_wakeup     |
       |----------------------|----------------------|
       | one core | two cores | one core | two cores |
-------|----------|-----------|----------|-----------|
txonly | 20.1     | 33.5      | 29.0     | 34.2      |
rxdrop | 0.065    | 14.1      | 12.0     | 14.1      |
l2fwd  | 0.032    | 7.3       | 6.6      | 7.2       |

"One core" means the application and NAPI run on the same core. "Two
cores" means they are pinned to different cores.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net/mlx5e: Move the SW XSK code from NAPI poll to a separate function

Two XSK tasks are performed during NAPI polling, that are not bound to
hardware interrupts: TXing packets and polling for frames in the Fill
Ring. They are special in a way that the hardware doesn't know about
these tasks, so it doesn't trigger interrupts if there is still some
work to be done, it's our driver's responsibility to ensure NAPI will be
rescheduled if needed.

Create a new function to handle these tasks and move the corresponding
code from mlx5e_napi_poll to the new function to improve modularity and
prepare for the changes in the following patch.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

samples/bpf: add use of need_wakeup flag in xdpsock

This commit adds using the need_wakeup flag to the xdpsock sample
application. It is turned on by default as we think it is a feature
that seems to always produce a performance benefit, if the application
has been written taking advantage of it. It can be turned off in the
sample app by using the '-m' command line option.

The txpush and l2fwd sub applications have also been updated to
support poll() with multiple sockets.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: add support for need_wakeup flag in AF_XDP part

This commit adds support for the new need_wakeup flag in AF_XDP. The
xsk_socket__create function is updated to handle this and a new
function is introduced called xsk_ring_prod__needs_wakeup(). This
function can be used by the application to check if Rx and/or Tx
processing needs to be explicitly woken up.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

ixgbe: add support for AF_XDP need_wakeup feature

This patch adds support for the need_wakeup feature of AF_XDP. If the
application has told the kernel that it might sleep using the new bind
flag XDP_USE_NEED_WAKEUP, the driver will then set this flag if it has
no more buffers on the NIC Rx ring and yield to the application. For
Tx, it will set the flag if it has no outstanding Tx completion
interrupts and return to the application.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

i40e: add support for AF_XDP need_wakeup feature

This patch adds support for the need_wakeup feature of AF_XDP. If the
application has told the kernel that it might sleep using the new bind
flag XDP_USE_NEED_WAKEUP, the driver will then set this flag if it has
no more buffers on the NIC Rx ring and yield to the application. For
Tx, it will set the flag if it has no outstanding Tx completion
interrupts and return to the application.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: add support for need_wakeup flag in AF_XDP rings

This commit adds support for a new flag called need_wakeup in the
AF_XDP Tx and fill rings. When this flag is set, it means that the
application has to explicitly wake up the kernel Rx (for the bit in
the fill ring) or kernel Tx (for bit in the Tx ring) processing by
issuing a syscall. Poll() can wake up both depending on the flags
submitted and sendto() will wake up tx processing only.

The main reason for introducing this new flag is to be able to
efficiently support the case when application and driver is executing
on the same core. Previously, the driver was just busy-spinning on the
fill ring if it ran out of buffers in the HW and there were none on
the fill ring. This approach works when the application is running on
another core as it can replenish the fill ring while the driver is
busy-spinning. Though, this is a lousy approach if both of them are
running on the same core as the probability of the fill ring getting
more entries when the driver is busy-spinning is zero. With this new
feature the driver now sets the need_wakeup flag and returns to the
application. The application can then replenish the fill queue and
then explicitly wake up the Rx processing in the kernel using the
syscall poll(). For Tx, the flag is only set to one if the driver has
no outstanding Tx completion interrupts. If it has some, the flag is
zero as it will be woken up by a completion interrupt anyway.

As a nice side effect, this new flag also improves the performance of
the case where application and driver are running on two different
cores as it reduces the number of syscalls to the kernel. The kernel
tells user space if it needs to be woken up by a syscall, and this
eliminates many of the syscalls.

This flag needs some simple driver support. If the driver does not
support this, the Rx flag is always zero and the Tx flag is always
one. This makes any application relying on this feature default to the
old behaviour of not requiring any syscalls in the Rx path and always
having to call sendto() in the Tx path.

For backwards compatibility reasons, this feature has to be explicitly
turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
that you always turn it on as it so far always have had a positive
performance impact.

The name and inspiration of the flag has been taken from io_uring by
Jens Axboe. Details about this feature in io_uring can be found in
http://kernel.dk/io_uring.pdf, section 8.3.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: replace ndo_xsk_async_xmit with ndo_xsk_wakeup

This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. This new
ndo provides the same functionality as before but with the addition of
a new flags field that is used to specifiy if Rx, Tx or both should be
woken up. The previous ndo only woke up Tx, as implied by the
name. The i40e and ixgbe drivers (which are all the supported ones)
are updated with this new interface.

This new ndo will be used by the new need_wakeup functionality of XDP
sockets that need to be able to wake up both Rx and Tx driver
processing.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

btf: fix return value check in btf_vmlinux_init()

In case of error, the function kobject_create_and_add() returns NULL
pointer not ERR_PTR(). The IS_ERR() test in the return value check
should be replaced with NULL test.

Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fix-printf'

Quentin Monnet says:

====================
Because the "__printf()" attributes were used only where the functions are
implemented, and not in header files, the checks have not been enforced on
all the calls to printf()-like functions, and a number of errors slipped in
bpftool over time.

This set cleans up such errors, and then moves the "__printf()" attributes
to header files, so that the checks are performed at all locations.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: move "__printf()" attributes to header file

Some functions in bpftool have a "__printf()" format attributes to tell
the compiler they should expect printf()-like arguments. But because
these attributes are not used for the function prototypes in the header
files, the compiler does not run the checks everywhere the functions are
used, and some mistakes on format string and corresponding arguments
slipped in over time.

Let's move the __printf() attributes to the correct places.

Note: We add guards around the definition of GCC_VERSION in
tools/include/linux/compiler-gcc.h to prevent a conflict in jit_disasm.c
on GCC_VERSION from headers pulled via libbfd.

Fixes: c101189bc968 ("tools: bpftool: fix -Wmissing declaration warnings")
Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: fix format string for p_err() in detect_common_prefix()

There is one call to the p_err() function in detect_common_prefix()
where the message to print is passed directly as the first argument,
without using a format string. This is harmless, but may trigger
warnings if the "__printf()" attribute is used correctly for the p_err()
function. Let's fix it by using a "%s" format string.

Fixes: ba95c7452439 ("tools: bpftool: add "prog run" subcommand to test-run programs")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: fix format string for p_err() in query_flow_dissector()

The format string passed to one call to the p_err() function in
query_flow_dissector() does not match the value that should be printed,
resulting in some garbage integer being printed instead of
strerror(errno) if /proc/self/ns/net cannot be open. Let's fix the
format string.

Fixes: 7f0c57fec80f ("bpftool: show flow_dissector attachment status")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: fix argument for p_err() in BTF do_dump()

The last argument passed to one call to the p_err() function is not
correct, it should be "*argv" instead of "**argv". This may lead to a
segmentation fault error if BTF id cannot be parsed correctly. Let's fix
this.

Fixes: c93cc69004dt ("bpftool: add ability to dump BTF types")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: fix format strings and arguments for jsonw_printf()

There are some mismatches between format strings and arguments passed to
jsonw_printf() in the BTF dumper for bpftool, which seems harmless but
may result in warnings if the "__printf()" attribute is used correctly
for jsonw_printf(). Let's fix relevant format strings and type cast.

Fixes: b12d6ec09730 ("bpf: btf: add btf print functionality")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: fix arguments for p_err() in do_event_pipe()

The last argument passed to some calls to the p_err() functions is not
correct, it should be "*argv" instead of "**argv". This may lead to a
segmentation fault error if CPU IDs or indices from the command line
cannot be parsed correctly. Let's fix this.

Fixes: f412eed9dfde ("tools: bpftool: add simple perf event output reader")
Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: make libbpf.map source of truth for libbpf version

Currently libbpf version is specified in 2 places: libbpf.map and
Makefile. They easily get out of sync and it's very easy to update one,
but forget to update another one. In addition, Github projection of
libbpf has to maintain its own version which has to be remembered to be
kept in sync manually, which is very error-prone approach.

This patch makes libbpf.map a source of truth for libbpf version and
uses shell invocation to parse out correct full and major libbpf version
to use during build. Now we need to make sure that once new release
cycle starts, we need to add (initially) empty section to libbpf.map
with correct latest version.

This also will make it possible to keep Github projection consistent
with kernel sources version of libbpf by adopting similar parsing of
version from libbpf.map.

v2->v3:
- grep -o + sort -rV (Andrey);

v1->v2:
- eager version vars evaluation (Jakub);
- simplified version regex (Andrey);

Cc: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'bpftool-net-attach'

Daniel T. Lee says:

====================
Currently, bpftool net only supports dumping progs attached on the
interface. To attach XDP prog on interface, user must use other tool
(eg. iproute2). By this patch, with `bpftool net attach/detach`, user
can attach/detach XDP prog on interface.

    # bpftool prog
        16: xdp  name xdp_prog1  tag 539ec6ce11b52f98  gpl
        loaded_at 2019-08-07T08:30:17+0900  uid 0
        ...
        20: xdp  name xdp_fwd_prog  tag b9cb69f121e4a274  gpl
        loaded_at 2019-08-07T08:30:17+0900  uid 0

    # bpftool net attach xdpdrv id 16 dev enp6s0np0
    # bpftool net
    xdp:
        enp6s0np0(4) driver id 16

    # bpftool net attach xdpdrv id 20 dev enp6s0np0 overwrite
    # bpftool net
    xdp:
        enp6s0np0(4) driver id 20

    # bpftool net detach xdpdrv dev enp6s0np0
    # bpftool net
    xdp:

While this patch only contains support for XDP, through `net
attach/detach`, bpftool can further support other prog attach types.

XDP attach/detach tested on Mellanox ConnectX-4 and Netronome Agilio.

---
Changes in v5:
  - fix wrong error message, from errno to err with do_attach/detach

Changes in v4:
  - rename variable, attach/detach error message enhancement
  - bash-completion cleanup, doc update with brief description (attach
    types)

Changes in v3:
  - added 'overwrite' option for replacing previously attached XDP prog
  - command argument order has been changed ('ATTACH_TYPE' comes first)
  - add 'dev' keyword in front of <devname>
  - added bash-completion and documentation

Changes in v2:
  - command 'load/unload' changed to 'attach/detach' for the consistency
====================

Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: add documentation for net attach/detach

Since, new sub-command 'net attach/detach' has been added for
attaching XDP program on interface,
this commit documents usage and sample output of `net attach/detach`.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: add bash-completion for net attach/detach

This commit adds bash-completion for new "net attach/detach"
subcommand for attaching XDP program on interface.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: add net detach command to detach XDP on interface

By this commit, using `bpftool net detach`, the attached XDP prog can
be detached. Detaching the BPF prog will be done through libbpf
'bpf_set_link_xdp_fd' with the progfd set to -1.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: add net attach command to attach XDP on interface

By this commit, using `bpftool net attach`, user can attach XDP prog on
interface. New type of enum 'net_attach_type' has been made, as stat ted at
cover-letter, the meaning of 'attach' is, prog will be attached on interface.

With 'overwrite' option at argument, attached XDP program could be replaced.
Added new helper 'net_parse_dev' to parse the network device at argument.

BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.

Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

tools: bpftool: compile with $(EXTRA_WARNINGS)

Compile bpftool with $(EXTRA_WARNINGS), as defined in
scripts/Makefile.include, and fix the new warnings produced.

Simply leave -Wswitch-enum out of the warning list, as we have several
switch-case structures where it is not desirable to process all values
of an enum.

Remove -Wshadow from the warnings we manually add to CFLAGS, as it is
handled in $(EXTRA_WARNINGS).

Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Rename mss field to mss_option field in synproxy, from Fernando Mancera.

2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce.

3) More strict validation of IPVS sysctl values, from Junwei Hu.

4) Remove unnecessary spaces after on the right hand side of assignments,
   from yangxingwu.

5) Add offload support for bitwise operation.

6) Extend the nft_offload_reg structure to store immediate date.

7) Collapse several ip_set header files into ip_set.h, from
   Jeremy Sowden.

8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y,
   from Jeremy Sowden.

9) Fix several sparse warnings due to missing prototypes, from
   Valdis Kletnieks.

10) Use static lock initialiser to ensure connlabel spinlock is
    initialized on boot time to fix sched/act_ct.c, patch
    from Florian Westphal.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Merge branch 'r8152-RX-improve'

Hayes says:

====================
v2:
For patch #2, replace list_for_each_safe with list_for_each_entry_safe.
Remove unlikely in WARN_ON. Adjust the coding style.

For patch #4, replace list_for_each_safe with list_for_each_entry_safe.
Remove "else" after "continue".

For patch #5. replace sysfs with ethtool to modify rx_copybreak and
rx_pending.

v1:
The different chips use different rx buffer size.

Use skb_add_rx_frag() to reduce memory copy for RX.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

r8152: change rx_copybreak and rx_pending through ethtool

Let the rx_copybreak and rx_pending could be modified by
ethtool.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

r8152: support skb_add_rx_frag

Use skb_add_rx_frag() to reduce the memory copy for rx data.

Use a new list of rx_used to store the rx buffer which couldn't be
reused yet.

Besides, the total number of rx buffer may be increased or decreased
dynamically. And it is limited by RTL8152_MAX_RX_AGG.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

r8152: use alloc_pages for rx buffer

Replace kmalloc_node() with alloc_pages() for rx buffer.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

r8152: replace array with linking list for rx information

The original method uses an array to store the rx information. The
new one uses a list to link each rx structure. Then, it is possible
to increase/decrease the number of rx structure dynamically.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

r8152: separate the rx buffer size

The different chips may accept different rx buffer sizes. The RTL8152
supports 16K bytes, and RTL8153 support 32K bytes.

Signed-off-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Merge branch 'net-phy-let-phy_speed_down-up-support-speeds-1Gbps'

Heiner says:

====================
So far phy_speed_down/up can be used up to 1Gbps only. Remove this
restriction and add needed helpers to phy-core.c

v2:
- remove unused parameter in patch 1
- rename __phy_speed_down to phy_speed_down_core in patch 2
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

net: phy: let phy_speed_down/up support speeds >1Gbps

So far phy_speed_down/up can be used up to 1Gbps only. Remove this
restriction by using new helper __phy_speed_down. New member adv_old
in struct phy_device is used by phy_speed_up to restore the advertised
modes before calling phy_speed_down. Don't simply advertise what is
supported because a user may have intentionally removed modes from
advertisement.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

net: phy: add phy_speed_down_core and phy_resolve_min_speed

phy_speed_down_core provides most of the functionality for
phy_speed_down. It makes use of new helper phy_resolve_min_speed that is
based on the sorting of the settings[] array. In certain cases it may be
helpful to be able to exclude legacy half duplex modes, therefore
prepare phy_resolve_min_speed() for it.

v2:
- rename __phy_speed_down to phy_speed_down_core

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

net: phy: add __set_linkmode_max_speed

We will need the functionality of __set_linkmode_max_speed also for
linkmode bitmaps other than phydev->supported. Therefore split it.

v2:
- remove unused parameter from __set_linkmode_max_speed

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

net: devlink: remove redundant rtnl lock assert

It is enough for caller of devlink_compat_switch_id_get() to hold the net
device to guarantee that devlink port is not destroyed concurrently. Remove
rtnl lock assertion and modify comment to warn user that they must hold
either rtnl lock or reference to net device. This is necessary to
accommodate future implementation of rtnl-unlocked TC offloads driver
callbacks.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee131e96172f19e74b4f98fa3404efe8
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e119 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c0e ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

net: hns3: Make hclge_func_reset_sync_vf static

Fix sparse warning:

drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:3190:5:
warning: symbol 'hclge_func_reset_sync_vf' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

devlink: send notifications for deleted snapshots on region destroy

Currently the notifications for deleted snapshots are sent only in case
user deletes a snapshot manually. Send the notifications in case region
is destroyed too.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Merge branch 'bpf-libbpf-read-sysfs-btf'

Andrii Nakryiko says:

====================
Now that kernel's BTF is exposed through sysfs at well-known location, attempt
to load it first as a target BTF for the purpose of BPF CO-RE relocations.

Patch #1 is a follow-up patch to rename /sys/kernel/btf/kernel into
/sys/kernel/btf/vmlinux.

Patch #2 adds ability to load raw BTF contents from sysfs and expands the list
of locations libbpf attempts to load vmlinux BTF from.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

libbpf: attempt to load kernel BTF from sysfs first

Add support for loading kernel BTF from sysfs (/sys/kernel/btf/vmlinux)
as a target BTF. Also extend the list of on disk search paths for
vmlinux ELF image with entries that perf is searching for.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux

Expose kernel's BTF under the name vmlinux to be more uniform with using
kernel module names as file names in the future.

Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs")
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: fix race in flow dissector tests

Since the "last_dissection" map holds only the flow keys for the most
recent packet, there is a small race in the skb-less flow dissector
tests if a new packet comes between transmitting the test packet, and
reading its keys from the map. If this happens, the test packet keys
will be overwritten and the test will fail.

Changing the "last_dissection" map to a hash map, keyed on the
source/dest port pair resolves this issue. Additionally, let's clear the
last test results from the map between tests to prevent previous test
cases from interfering with the following test cases.

Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

tools: bpftool: add feature check for zlib

bpftool requires libelf, and zlib for decompressing /proc/config.gz.
zlib is a transitive dependency via libelf, and became mandatory since
elfutils 0.165 (Jan 2016). The feature check of libelf is already done
in the elfdep target of tools/lib/bpf/Makefile, pulled in by bpftool via
a dependency on libbpf.a. Add a similar feature check for zlib.

Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

btf: expose BTF info through sysfs

Make .BTF section allocated and expose its contents through sysfs.

/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/<module-name> file.

Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
   kallsyms) and generate .BTF section by converting DWARF info into
   BTF. This section is not allocated and not mapped to any segment,
   though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
   convert binary file into linkable object file with automatically
   generated symbols _binary__btf_kernel_bin_start and
   _binary__btf_kernel_bin_end, pointing to start and end, respectively,
   of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
   kallsyms, if necessary). sysfs_btf.c then creates
   /sys/kernel/btf/kernel file and exposes embedded BTF contents through
   it. This allows, e.g., libbpf and bpftool access BTF info at
   well-known location, without resorting to searching for vmlinux image
   on disk (location of which is not standardized and vmlinux image
   might not be even available in some scenarios, e.g., inside qemu
   during testing).

Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).

If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.

v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
  due to initially missing  __binary__btf_kernel_bin_{start/end} symbols;

v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

netfilter: connlabels: prefer static lock initialiser

seen during boot:
BUG: spinlock bad magic on CPU#2, swapper/0/1
lock: nf_connlabels_lock+0x0/0x60, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
Call Trace:
do_raw_spin_lock+0x14e/0x1b0
nf_connlabels_get+0x15/0x40
ct_init_net+0xc4/0x270
ops_init+0x56/0x1c0
register_pernet_operations+0x1c8/0x350
register_pernet_subsys+0x1f/0x40
tcf_register_action+0x7c/0x1a0
do_one_initcall+0x13d/0x2d9

Problem is that ct action init function can run before
connlabels_init(). Lock has not been initialised yet.

Fix it by using a static initialiser.

Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_nat_proto: make tables static

Sparse warns about two tables not being declared.

CHECK net/netfilter/nf_nat_proto.c
net/netfilter/nf_nat_proto.c:725:26: warning: symbol 'nf_nat_ipv4_ops' was not declared. Should it be static?
net/netfilter/nf_nat_proto.c:964:26: warning: symbol 'nf_nat_ipv6_ops' was not declared. Should it be static?

And in fact they can indeed be static.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add missing prototypes.

Sparse rightly complains about undeclared symbols.

  CHECK   net/netfilter/nft_set_hash.c
net/netfilter/nft_set_hash.c:647:21: warning: symbol 'nft_set_rhash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:670:21: warning: symbol 'nft_set_hash_type' was not declared. Should it be static?
net/netfilter/nft_set_hash.c:690:21: warning: symbol 'nft_set_hash_fast_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_bitmap.c
net/netfilter/nft_set_bitmap.c:296:21: warning: symbol 'nft_set_bitmap_type' was not declared. Should it be static?
  CHECK   net/netfilter/nft_set_rbtree.c
net/netfilter/nft_set_rbtree.c:470:21: warning: symbol 'nft_set_rbtree_type' was not declared. Should it be static?

Include nf_tables_core.h rather than nf_tables.h to pick up the additional definitions.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

kbuild: remove all netfilter headers from header-test blacklist.

All the blacklisted NF headers can now be compiled stand-alone, so
removed them from the blacklist.

Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove "#ifdef __KERNEL__" guards from some headers.

A number of non-UAPI Netfilter header-files contained superfluous
"#ifdef __KERNEL__" guards. Removed them.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add missing IS_ENABLED(CONFIG_NETFILTER) checks to some header-files.

linux/netfilter.h defines a number of struct and inline function
definitions which are only available is CONFIG_NETFILTER is enabled.
These structs and functions are used in declarations and definitions in
other header-files. Added preprocessor checks to make sure these
headers will compile if CONFIG_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add missing IS_ENABLED(CONFIG_NF_CONNTRACK) checks to some header-files.

struct nf_conn contains a "struct nf_conntrack ct_general" member and
struct net contains a "struct netns_ct ct" member which are both only
defined in CONFIG_NF_CONNTRACK is enabled. These members are used in a
number of inline functions defined in other header-files. Added
preprocessor checks to make sure the headers will compile if
CONFIG_NF_CONNTRACK is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add missing IS_ENABLED(CONFIG_NF_TABLES) check to header-file.

nf_tables.h defines an API comprising several inline functions and
macros that depend on the nft member of struct net. However, this is
only defined is CONFIG_NF_TABLES is enabled. Added preprocessor checks
to ensure that nf_tables.h will compile if CONFIG_NF_TABLES is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add missing IS_ENABLED(CONFIG_BRIDGE_NETFILTER) checks to header-file.

br_netfilter.h defines inline functions that use an enum constant and
struct member that are only defined if CONFIG_BRIDGE_NETFILTER is
enabled. Added preprocessor checks to ensure br_netfilter.h will
compile if CONFIG_BRIDGE_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: add missing includes to a number of header-files.

A number of netfilter header-files used declarations and definitions
from other headers without including them. Added include directives to
make those declarations and definitions available.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: inline four headers files into another one.

linux/netfilter/ipset/ip_set.h included four other header files:

  include/linux/netfilter/ipset/ip_set_comment.h
  include/linux/netfilter/ipset/ip_set_counter.h
  include/linux/netfilter/ipset/ip_set_skbinfo.h
  include/linux/netfilter/ipset/ip_set_timeout.h

Of these the first three were not included anywhere else.  The last,
ip_set_timeout.h, was included in a couple of other places, but defined
inline functions which call other inline functions defined in ip_set.h,
so ip_set.h had to be included before it.

Inlined all four into ip_set.h, and updated the other files that
included ip_set_timeout.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: store data in offload context registers

Store immediate data into offload context register. This allows follow
up instructions to take it from the corresponding source register.

This patch is required to support for payload mangling, although other
instructions that take data from source register will benefit from this
too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_bitwise: add offload support

Extract mask from bitwise operation and store it into the corresponding
context register so the cmp instruction can set the mask accordingly.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: remove unnecessary spaces

This patch removes extra spaces.

Signed-off-by: yangxingwu <xingwu.yang@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

tools: bpftool: fix reading from /proc/config.gz

/proc/config has never existed as far as I can see, but /proc/config.gz
is present on Arch Linux. Add support for decompressing config.gz using
zlib which is a mandatory dependency of libelf anyway. Replace existing
stdio functions with gzFile operations since the latter transparently
handles uncompressed and gzip-compressed files.

Cc: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Peter Wu <peter@lekensteyn.nl>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

caif: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Cc: Richard Fontana <rfontana@redhat.com>
Cc: Steve Winslow <swinslow@gmail.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

xen-netback: no need to check return value of debugfs_create functions

When calling debugfs functions, there is no need to ever check the
return value. The function can work or not, but the code logic should
never do something different based on this.

Cc: Wei Liu <wei.liu@kernel.org>
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: xen-devel@lists.xenproject.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-dsa-mv88e6xxx-prepare-Wait-Bit-operation'

Vivien Didelot says:

====================
net: dsa: mv88e6xxx: prepare Wait Bit operation

The Remote Management Interface has its own implementation of a Wait
Bit operation, which requires a bit number and a value to wait for.

In order to prepare the introduction of this implementation, rework the
code waiting for bits and masks in mv88e6xxx to match this signature.

This has the benefit to unify the implementation of wait routines while
removing obsolete wait and update functions and also reducing the code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add delay in direct SMI wait

The mv88e6xxx_smi_direct_wait routine is used to wait on indirect
registers access. It is of no exception and must delay between read
attempts, like other wait routines.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: fix SMI bit checking

The current mv88e6xxx_smi_direct_wait function is only used to check
the 16th bit of the (16-bit) SMI Command register. But the bit shift
operation is not enough if we eventually use this function to check
other bits, thus replace it with a mask.

Fixes: e7ba0fad9c53 ("net: dsa: mv88e6xxx: refine SMI support")
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: remove wait and update routines

Now that we have proper Wait Bit and Wait Mask routines, remove the
unused mv88e6xxx_wait routine and its Global 1 and Global 2 variants.

The indirect tables such as the Device Mapping Table or Priority
Override Table make use of an Update bit to distinguish reading (0)
from writing (1) operations. After a write operation occurs, the bit
self clears right away so there's no need to wait on it. Thus keep
things simple and remove the mv88e6xxx_update helper as well.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: wait for AVB Busy bit

The AVB is not an indirect table using an Update bit, but a unit using
a Busy bit. This means that we must ensure that this bit is cleared
before setting it and wait until it gets cleared again after writing
an operation. Reflect that.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: introduce wait bit routine

Many portions of the driver need to wait until a given bit is set
or cleared. Some busses even have a specific implementation for this
operation. In preparation for such variant, implement a generic Wait
Bit routine that can be used by the driver core functions.

This allows us to get rid of the custom implementations we may find
in the driver. Note that for the EEPROM bits, BUSY and RUNNING bits
are independent, thus it is more efficient to wait independently for
each bit instead of waiting for their mask.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: introduce wait mask routine

The current mv88e6xxx_wait routine is used to wait for a given mask
to be cleared to zero. However in some cases, the driver may have
to wait for a given mask to be of a certain non-zero value.

Thus provide a generic wait mask routine that will be used to implement
the current mv88e6xxx_wait function, and use it to wait for 88E6185
PPU states.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: wait for 88E6185 PPU disabled

The PPU state of 88E6185 can be either "Disabled at Reset" or
"Disabled after Initialization". Because we intentionally clear the
PPU Enabled bit before checking its state, it is safe to wait for the
MV88E6185_G1_STS_PPU_STATE_DISABLED state explicitly instead of waiting
for any state different than MV88E6185_G1_STS_PPU_STATE_POLLING.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: inline rtl8169_free_rx_databuff

rtl8169_free_rx_databuff is used in only one place, so let's inline it.
We can improve the loop because rtl8169_init_ring zero's RX_databuff
before calling rtl8169_rx_fill, and rtl8169_rx_fill fills
Rx_databuff starting from index 0.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'realtek-phy-next'

Heiner Kallweit says:

====================
net: phy: realtek: add support for integrated 2.5Gbps PHY in RTL8125

This series adds support for the integrated 2.5Gbps PHY in RTL8125.
First three patches add necessary functionality to phylib.

Changes in v2:
- added patch 1
- changed patch 4 to use a fake PHY ID that is injected by the
network driver. This allows to use a dedicated PHY driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: realtek: add support for the 2.5Gbps PHY in RTL8125

This adds support for the integrated 2.5Gbps PHY in Realtek RTL8125.
Advertisement of 2.5Gbps mode is done via a vendor-specific register.
Same applies to reading NBase-T link partner advertisement.
Unfortunately this 2.5Gbps PHY shares the PHY ID with the integrated
1Gbps PHY's in other Realtek network chips and so far no method is
known to differentiate them. As a workaround use a dedicated fake PHY ID
that is set by the network driver by intercepting the MDIO PHY ID read.

v2:
- Create dedicated PHY driver and use a fake PHY ID that is injected by
the network driver. Suggested by Andrew Lunn.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: add phy_modify_paged_changed

Add helper function phy_modify_paged_changed, behavios is the same
as for phy_modify_changed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: prepare phylib to deal with PHY's extending Clause 22

The integrated PHY in 2.5Gbps chip RTL8125 is the first (known to me)
PHY that uses standard Clause 22 for all modes up to 1Gbps and adds
2.5Gbps control using vendor-specific registers. To use phylib for
the standard part little extensions are needed:
- Move most of genphy_config_aneg to a new function
  __genphy_config_aneg that takes a parameter whether restarting
  auto-negotiation is needed (depending on whether content of
  vendor-specific advertisement register changed).
- Don't clear phydev->lp_advertising in genphy_read_status so that
  we can set non-C22 mode flags before.

Basically both changes mimic the behavior of the equivalent Clause 45
functions.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: simplify genphy_config_advert by using the linkmode_adv_to_xxx_t functions

Using linkmode_adv_to_mii_adv_t and linkmode_adv_to_mii_ctrl1000_t
allows to simplify the code. In addition avoiding the conversion to
the legacy u32 advertisement format allows to remove the warning.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevsim: register couple of devlink params

Register couple of devlink params, one generic, one driver-specific.
Make the values available over debugfs.

Example:
$ echo "111" > /sys/bus/netdevsim/new_device
$ devlink dev param
netdevsim/netdevsim111:
  name max_macs type generic
    values:
      cmode driverinit value 32
  name test1 type driver-specific
    values:
      cmode driverinit value true
$ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
32
$ cat /sys/kernel/debug/netdevsim/netdevsim111/test1
Y
$ devlink dev param set netdevsim/netdevsim111 name max_macs cmode driverinit value 16
$ devlink dev param set netdevsim/netdevsim111 name test1 cmode driverinit value false
$ devlink dev reload netdevsim/netdevsim111
$ cat /sys/kernel/debug/netdevsim/netdevsim111/max_macs
16
$ cat /sys/kernel/debug/netdevsim/netdevsim111/test1

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'drop_monitor-Capture-dropped-packets-and-metadata'

Ido Schimmel says:

====================
drop_monitor: Capture dropped packets and metadata

So far drop monitor supported only one mode of operation in which a
summary of recent packet drops is periodically sent to user space as a
netlink event. The event only includes the drop location (program
counter) and number of drops in the last interval.

While this mode of operation allows one to understand if the system is
dropping packets, it is not sufficient if a more detailed analysis is
required. Both the packet itself and related metadata are missing.

This patchset extends drop monitor with another mode of operation where
the packet - potentially truncated - and metadata (e.g., drop location,
timestamp, netdev) are sent to user space as a netlink event. Thanks to
the extensible nature of netlink, more metadata can be added in the
future.

To avoid performing expensive operations in the context in which
kfree_skb() is called, the dropped skbs are cloned and queued on per-CPU
skb drop list. The list is then processed in process context (using a
workqueue), where the netlink messages are allocated, prepared and
finally sent to user space.

A follow-up patchset will integrate drop monitor with devlink and allow
the latter to call into drop monitor to report hardware drops. In the
future, XDP drops can be added as well, thereby making drop monitor the
go-to netlink channel for diagnosing all packet drops.

Example usage with patched dropwatch [1] can be found here [2]. Example
dissection of drop monitor netlink events with patched wireshark [3] can
be found here [4]. I will submit both changes upstream after the kernel
changes are accepted. Another change worth making is adding a dropmon
pseudo interface to libpcap, similar to the nflog interface [5]. This
will allow users to specifically listen on dropmon traffic instead of
capturing all netlink packets via the nlmon netdev.

Patches #1-#5 prepare the code towards the actual changes in later
patches.

Patch #6 adds another mode of operation to drop monitor in which the
dropped packet itself is notified to user space along with metadata.

Patch #7 allows users to truncate reported packets to a specific length,
in case only the headers are of interest. The original length of the
packet is added as metadata to the netlink notification.

Patch #8 allows user to query the current configuration of drop monitor
(e.g., alert mode, truncation length).

Patches #9-#10 allow users to tune the length of the per-CPU skb drop
list according to their needs.

Changes since v1 [6]:
* Add skb protocol as metadata. This allows user space to correctly
  dissect the packet instead of blindly assuming it is an Ethernet
  packet

Changes since RFC [7]:
* Limit the length of the per-CPU skb drop list and make it configurable
* Do not use the hysteresis timer in packet alert mode
* Introduce alert mode operations in a separate patch and only then
  introduce the new alert mode
* Use 'skb->skb_iif' instead of 'skb->dev' because the latter is inside
  a union with 'dev_scratch' and therefore not guaranteed to point to a
  valid netdev
* Return '-EBUSY' instead of '-EOPNOTSUPP' when trying to configure drop
  monitor while it is monitoring
* Did not change schedule_work() in favor of schedule_work_on() as I did
  not observe a change in number of tail drops

[1] https://github.com/idosch/dropwatch/tree/packet-mode
[2] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile1-txt
[3] https://github.com/idosch/wireshark/tree/drop-monitor-v2
[4] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile2-txt
[5] https://github.com/the-tcpdump-group/libpcap/blob/master/pcap-netfilter-linux.c
[6] https://patchwork.ozlabs.org/cover/1143443/
[7] https://patchwork.ozlabs.org/cover/1135226/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Expose tail drop counter

Previous patch made the length of the per-CPU skb drop list
configurable. Expose a counter that shows how many packets could not be
enqueued to this list.

This allows users determine the desired queue length.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Make drop queue length configurable

In packet alert mode, each CPU holds a list of dropped skbs that need to
be processed in process context and sent to user space. To avoid
exhausting the system's memory the maximum length of this queue is
currently set to 1000.

Allow users to tune the length of this queue according to their needs.
The configured length is reported to user space when drop monitor
configuration is queried.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Add a command to query current configuration

Users should be able to query the current configuration of drop monitor
before they start using it. Add a command to query the existing
configuration which currently consists of alert mode and packet
truncation length.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Allow truncation of dropped packets

When sending dropped packets to user space it is not always necessary to
copy the entire packet as usually only the headers are of interest.

Allow user to specify the truncation length and add the original length
of the packet as additional metadata to the netlink message.

By default no truncation is performed.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>