Rahul Lakkireddy [Fri, 15 May 2020 17:11:04 +0000 (22:41 +0530)]
cxgb4: tune burst buffer size for TC-MQPRIO offload
For each traffic class, firmware handles up to 4 * MTU amount of data
per burst cycle. Under heavy load, this small buffer size is a
bottleneck when buffering large TSO packets in <= 1500 MTU case.
Increase the burst buffer size to 8 * MTU when supported.
Also, keep the driver's traffic class configuration API similar to
the firmware API counterpart.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rahul Lakkireddy [Fri, 15 May 2020 17:11:03 +0000 (22:41 +0530)]
cxgb4: improve credits recovery in TC-MQPRIO Tx path
Request credit update for every half credits consumed, including
the current request. Also, avoid re-trying to post packets when there
are no credits left. The credit update reply via interrupt will
eventually restore the credits and will invoke the Tx path again.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).
The main changes are:
1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.
2) bpftool can probe CONFIG_HZ, from Daniel.
3) CAP_BPF is introduced to isolate user processes that use BPF infra and
to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
in certain cases, from Alexei.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
DENG Qingfang [Fri, 15 May 2020 15:25:55 +0000 (23:25 +0800)]
net: dsa: mt7530: fix VLAN setup
Allow DSA to add VLAN entries even if VLAN filtering is disabled, so
enabling it will not block the traffic of existent ports in the bridge
Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Output rate of current upstream kernel TC filter dump implementation if
relatively low (~100k rules/sec depending on configuration). This
constraint impacts performance of software switch implementation that
rely on TC for their datapath implementation and periodically call TC
filter dump to update rules stats. Moreover, TC filter dump output a lot
of static data that don't change during the filter lifecycle (filter
key, specific action details, etc.) which constitutes significant
portion of payload on resulting netlink packets and increases amount of
syscalls necessary to dump all filters on particular Qdisc. In order to
significantly improve filter dump rate this patch sets implement new
mode of TC filter dump operation named "terse dump" mode. In this mode
only parameters necessary to identify the filter (handle, action cookie,
etc.) and data that can change during filter lifecycle (filter flags,
action stats, etc.) are preserved in dump output while everything else
is omitted.
Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
individual classifier support (new tcf_proto_ops->terse_dump()
callback). Support for action terse dump is implemented in act API and
don't require changing individual action implementations.
The following table provides performance comparison between regular
filter dump and new terse dump mode for two classifier-action profiles:
one minimal config with L2 flower classifier and single gact action and
another heavier config with L2+5tuple flower classifier with
tunnel_key+mirred actions.
Benchmark details: to measure the rate tc filter dump and terse dump
commands are invoked on ingress Qdisc that have one million filters
configured using following commands.
> time sudo tc -s filter show dev ens1f0 ingress >/dev/null
> time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null
Value in results table is calculated by dividing 1000000 total rules by
"real" time reported by time command.
Vlad Buslov [Fri, 15 May 2020 11:40:12 +0000 (14:40 +0300)]
net: sched: implement terse dump support in act
Extend tcf_action_dump() with boolean argument 'terse' that is used to
request terse-mode action dump. In terse mode only essential data needed to
identify particular action (action kind, cookie, etc.) and its stats is put
to resulting skb and everything else is omitted. Implement
tcf_exts_terse_dump() helper in cls API that is intended to be used to
request terse dump of all exts (actions) attached to the filter.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov [Fri, 15 May 2020 11:40:11 +0000 (14:40 +0300)]
net: sched: introduce terse dump flag
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.
Support of the options in specific classifiers and actions is
implemented in following patches in the series.
Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The assumption that a device node is associated either with the
netdev's device, or the parent of that device, does not hold for all
drivers. E.g. Freescale's DPAA has two layers of platform devices
above the netdev. Instead, recursively walk up the tree from the
netdev, allowing any parent to match against the sought after node.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 15 May 2020 15:29:42 +0000 (17:29 +0200)]
Merge branch 'bpf-cap'
Alexei Starovoitov says:
====================
v6->v7:
- permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski.
It's equivalent to SOCKET_FILTER which is unpriv.
v5->v6:
- split allow_ptr_leaks into four flags.
- retain bpf_jit_limit under cap_sys_admin.
- fixed few other issues spotted by Daniel.
v4->v5:
Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.
The user process has to have
- CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs.
Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN.
That could be relaxed in the future.
- CAP_BPF and CAP_PERFMON to load tracing programs.
- CAP_BPF and CAP_NET_ADMIN to load networking programs.
(or CAP_SYS_ADMIN for backward compatibility).
CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
More on this below. This is the major difference vs v4 set back from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
and making BPF infra more secure. Currently fuzzers run in unpriv.
They will be able to run with CAP_BPF.
The patchset is long overdue follow-up from the last plumbers conference.
Comparing to what was discussed at LPC the CAP* checks at attach time are gone.
For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was
no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
required at load time and CAP_NET_ADMIN at attach time, but there are several
ways to bypass CAP_NET_ADMIN:
- if networking prog is using tail_call writing FD into prog_array will
effectively attach it, but bpf_map_update_elem is an unprivileged operation.
- freplace prog with CAP_SYS_ADMIN can replace networking prog
Consolidating all CAP checks at load time makes security model similar to
open() syscall. Once the user got an FD it can do everything with it.
read/write/poll don't check permissions. The same way when bpf_prog_load
command returns an FD the user can do everything (including attaching,
detaching, and bpf_test_run).
The important design decision is to allow ID->FD transition for
CAP_SYS_ADMIN only. What it means that user processes can run
with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
other unless they pass FDs via scm_rights or via pinning in bpffs.
ID->FD is a mechanism for human override and introspection.
An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user
space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps,
links) that are owned by such processes from each other.
Another significant change from LPC is that the verifier checks are split into
four flags. The allow_ptr_leaks flag allows pointer manipulations. The
bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
BTF, bounded loops, dead code elimination, etc. All the goodness. The
bypass_spec_v1 flag enables indirect stack access from bpf programs and
disables speculative analysis and bpf array mitigations. The bypass_spec_v4
flag disables store sanitation. That allows networking progs with CAP_BPF +
CAP_NET_ADMIN enjoy modern verifier features while being more secure.
Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
but the verifier may get smarter in the future.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
env->allow_ptr_leaks = bpf_allow_ptr_leaks();
env->bypass_spec_v1 = bpf_bypass_spec_v1();
env->bypass_spec_v4 = bpf_bypass_spec_v4();
env->bpf_capable = bpf_capable();
The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.
'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.
That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.
Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.
The end result provides simple safety model for applications that use BPF:
- to load tracing program types
BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
use CAP_BPF and CAP_PERFMON
- to load networking program types
BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
use CAP_BPF and CAP_NET_ADMIN
There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
tracing mechanism, hence this helper needs additional CAP_PERFMON
if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.
CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
hence CAP_NET_ADMIN is required at detach time.
CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.
Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.
CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.
bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.
CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.
Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.
Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.
Daniel Borkmann [Wed, 13 May 2020 07:58:49 +0000 (09:58 +0200)]
bpf, bpftool: Allow probing for CONFIG_HZ from kernel config
In Cilium we've recently switched to make use of bpf_jiffies64() for
parts of our tc and XDP datapath since bpf_ktime_get_ns() is more
expensive and high-precision is not needed for our timeouts we have
anyway. Our agent has a probe manager which picks up the json of
bpftool's feature probe and we also use the macro output in our C
programs e.g. to have workarounds when helpers are not available on
older kernels.
Extend the kernel config info dump to also include the kernel's
CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a
macro dump such that CONFIG_HZ can be propagated to BPF C code as a
simple define if available via config. Latter allows to have _compile-
time_ resolution of jiffies <-> sec conversion in our code since all
are propagated as known constants.
Given we cannot generally assume availability of kconfig everywhere,
we also have a kernel hz probe [0] as a fallback. Potentially, bpftool
could have an integrated probe fallback as well, although to derive it,
we might need to place it under 'bpftool feature probe full' or similar
given it would slow down the probing process overall. Yet 'full' doesn't
fit either for us since we don't want to pollute the kernel log with
warning messages from bpf_probe_write_user() and bpf_trace_printk() on
agent startup; I've left it out for the time being.
====================
V4:
- Fixup checkpatch.pl issues
- Collected more ACKs
V3:
- Fix issue on virtio_net patch spotted by Jason Wang
- Adjust name for variable in mlx5 patch
- Collected more ACKs
V2:
- Fix bug in mlx5 for XDP_PASS case
- Collected nitpicks and ACKs from mailing list
V1:
- Fix bug in dpaa2
XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. This have caused the side-effect that
XDP frame data hard end is unknown. This have limited the BPF-helper
bpf_xdp_adjust_tail to only shrink the packet. This patchset address
this and add packet tail extend/grow.
The purpose of the patchset is ALSO to reserve a memory area that can be
used for storing extra information, specifically for extending XDP with
multi-buffer support. One proposal is to use same layout as
skb_shared_info, which is why this area is currently 320 bytes.
When converting xdp_frame to SKB (veth and cpumap), the full tailroom
area can now be used and SKB truesize is now correct. For most
drivers this result in a much larger tailroom in SKB "head" data
area. The network stack can now take advantage of this when doing SKB
coalescing. Thus, a good driver test is to use xdp_redirect_cpu from
samples/bpf/ and do some TCP stream testing.
Use-cases for tail grow/extend:
(1) IPsec / XFRM needs a tail extend[1][2].
(2) DNS-cache responses in XDP.
(3) HAProxy ALOHA would need it to convert to XDP.
(4) Add tail info e.g. timestamp and collect via tcpdump
Extend BPF selftest xdp_adjust_tail with grow tail tests, which is added
as subtest's. The first grow test stays in same form as original shrink
test. The second grow test use the newer bpf_prog_test_run_xattr() calls,
and does extra checking of data contents.
Update the memory requirements, when adding xdp.frame_sz in BPF test_run
function bpf_prog_test_run_xdp() which e.g. is used by XDP selftests.
Specifically add the expected reserved tailroom, but also allocated a
larger memory area to reflect that XDP frames usually comes in this
format. Limit the provided packet data size to 4096 minus headroom +
tailroom, as this also reflect a common 3520 bytes MTU limit with XDP.
Note that bpf_test_init already use a memory allocation method that clears
memory. Thus, this already guards against leaking uninit kernel memory.
Clearing memory of tail when grow happens, because it is too easy
to write a XDP_PASS program that extend the tail, which expose
this memory to users that can run tcpdump.
xdp: Allow bpf_xdp_adjust_tail() to grow packet size
Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.
Remember that helper/macro xdp_data_hard_end have reserved some
tailroom. Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.
V2: Remove one chicken check and use WARN_ONCE for other
mlx5: Rx queue setup time determine frame_sz for XDP
The mlx5 driver have multiple memory models, which are also changed
according to whether a XDP bpf_prog is attached.
The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.:
# ethtool --set-priv-flags mlx5p2 rx_striding_rq off
On the general case with 4K page_size and regular MTU packet, then
the frame_sz is 2048 and 4096 when XDP is enabled, in both modes.
The info on the given frame size is stored differently depending on the
RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe.
In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which
corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ).
In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored
in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is
what the XDP case cares about.
To reduce effect on fast-path, this patch determine the frame_sz at
setup time, to avoid determining the memory model runtime. Variable
is named frame0_sz to make it clear that this is only the frame
size of the first fragment.
This mlx5 driver does a DMA-sync on XDP_TX action, but grow is safe
as it have done a DMA-map on the entire PAGE_SIZE. The driver also
already does a XDP length check against sq->hw_mtu on the possible
XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe().
V3+4: Change variable name first_frame_sz to frame0_sz
V2: Fix that frag_size need to be recalc before creating SKB.
Intel drivers implement native AF_XDP zerocopy in separate C-files,
that have its own invocation of bpf_prog_run_xdp(). The setup of
xdp_buff is also handled in separately from normal code path.
This patch update XDP frame_sz for AF_XDP zerocopy drivers i40e, ice
and ixgbe, as the code changes needed are very similar. Introduce a
helper function xsk_umem_xdp_frame_sz() for calculating frame size.
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
This patch mirrors the changes to ixgbe in previous patch.
This VF driver doesn't support XDP_REDIRECT, but correct tailroom is
still necessary for BPF-helper xdp_adjust_tail. In legacy-mode +
larger PAGE_SIZE, due to lacking tailroom, we accept that
xdp_adjust_tail shrink doesn't work.
This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".
For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.
The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).
There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.
ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K
The ixgbe driver have another memory model when compiled on archs with
PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page in
two halves, but instead increment rx_buffer->page_offset by truesize of
packet (which include headroom and tailroom for skb_shared_info).
This is done correctly in ixgbe_build_skb(), but in ixgbe_rx_buffer_flip
which is currently only called on XDP_TX and XDP_REDIRECT, it forgets
to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT, for
veth and cpumap. Fix by adding size of skb_shared_info tailroom.
Maintainers notice: This fix have been queued to Jeff.
The virtio_net driver is running inside the guest-OS. There are two
XDP receive code-paths in virtio_net, namely receive_small() and
receive_mergeable(). The receive_big() function does not support XDP.
In receive_small() the frame size is available in buflen. The buffer
backing these frames are allocated in add_recvbuf_small() with same
size, except for the headroom, but tailroom have reserved room for
skb_shared_info. The headroom is encoded in ctx pointer as a value.
In receive_mergeable() the frame size is more dynamic. There are two
basic cases: (1) buffer size is based on a exponentially weighted
moving average (see DECLARE_EWMA) of packet length. Or (2) in case
virtnet_get_headroom() have any headroom then buffer size is
PAGE_SIZE. The ctx pointer is this time used for encoding two values;
the buffer len "truesize" and headroom. In case (1) if the rx buffer
size is underestimated, the packet will have been split over more
buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of
buffer area). If that happens the XDP path does a xdp_linearize_page
operation.
V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.
The code is really hard to follow, so some hints to reviewers.
The receive_mergeable() case gets frames that were allocated in
add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
and 'buf' ptr is advanced this headroom. The headroom can only
be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
simple:
As frame_sz is an offset size from xdp.data_hard_start, reviewers
should notice how this is calculated in receive_mergeable():
int offset = buf - page_address(page);
[...]
data = page_address(xdp_page) + offset;
xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
The calculated offset will always be VIRTIO_XDP_HEADROOM when
reaching this code. Thus, xdp.data_hard_start will be page-start
address plus vi->hdr_len. Given this xdp.frame_sz need to be
reduced with vi->hdr_len size.
IMHO a followup patch should cleanup this code to make it easier
to maintain and understand, but it is outside the scope of this
patchset.
In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff
have embedded a struct tun_xdp_hdr (located at xdp->data_hard_start)
which contains the buffer length 'buflen' (with tailroom for
skb_shared_info). Also storing this buflen in xdp->frame_sz, does not
obsolete struct tun_xdp_hdr, as it also contains a struct
virtio_net_hdr with other information.
The netronome nfp driver use PAGE_SIZE when xdp_prog is set, but
xdp.data_hard_start begins at offset NFP_NET_RX_BUF_HEADROOM.
Thus, adjust for this when setting xdp.frame_sz, as it counts
from data_hard_start.
When doing XDP_TX this driver is smart and instead of a full DMA-map
does a DMA-sync on with packet length. As xdp_adjust_tail can now
grow packet length, add checks to make sure that grow size is within
the DMA-mapped size.
The mlx4 drivers size of memory backing the RX packet is stored in
frag_stride. For XDP mode this will be PAGE_SIZE (normally 4096).
For normal mode frag_stride is 2048.
Also adjust MLX4_EN_MAX_XDP_MTU to take tailroom into account.
Frame size ENA_PAGE_SIZE is limited to 16K on systems with larger
PAGE_SIZE than 16K. Change ENA_XDP_MAX_MTU to also take into account
the reserved tailroom.
The driver qede uses a full page, when XDP is enabled. The drivers value
in rx_buf_seg_size (struct qede_rx_queue) will be PAGE_SIZE when an
XDP bpf_prog is attached.
The hyperv NIC driver does memory allocation and copy even without XDP.
In XDP mode it will allocate a new page for each packet and copy over
the payload, before invoking the XDP BPF-prog.
The positive thing it that its easy to determine the xdp.frame_sz.
The XDP implementation for hv_netvsc transparently passes xdp_prog
to the associated VF NIC. Many of the Azure VMs are using SRIOV, so
majority of the data are actually processed directly on the VF driver's XDP
path. So the overhead of the synthetic data path (hv_netvsc) is minimal.
Then XDP is enabled on this driver, XDP_PASS and XDP_TX will create the
SKB via build_skb (based on the newly allocated page). Now using XDP
frame_sz this will provide more skb_tailroom, which netstack can use for
SKB coalescing (e.g tcp_try_coalesce -> skb_try_coalesce).
V3: Adjust patch desc to be more positive.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Link: https://lore.kernel.org/bpf/158945339857.97035.10212138582505736163.stgit@firesoul
The dpaa2-eth driver reserve some headroom used for hardware and
software annotation area in RX/TX buffers. Thus, xdp.data_hard_start
doesn't start at page boundary.
When XDP is configured the area reserved via dpaa2_fd_get_offset(fd) is
448 bytes of which XDP have reserved 256 bytes. As frame_sz is
calculated as an offset from xdp_buff.data_hard_start, an adjust from
the full PAGE_SIZE == DPAA2_ETH_RX_BUF_RAW_SIZE.
When doing XDP_REDIRECT, the driver doesn't need this reserved headroom
any-longer and allows xdp_do_redirect() to use it. This is an advantage
for the drivers own ndo-xdp_xmit, as it uses part of this headroom for
itself. Patch also adjust frame_sz in this case.
The driver cannot support XDP data_meta, because it uses the headroom
just before xdp.data for struct dpaa2_eth_swa (DPAA2_ETH_SWA_SIZE=64),
when transmitting the packet. When transmitting a xdp_frame in
dpaa2_eth_xdp_xmit_frame (call via ndo_xdp_xmit) is uses this area to
store a pointer to xdp_frame and dma_size, which is used in TX
completion (free_tx_fd) to return frame via xdp_return_frame().
The veth driver can run XDP in "native" mode in it's own NAPI
handler, and since commit 9fc8d518d9d5 ("veth: Handle xdp_frames in
xdp napi ring") packets can come in two forms either xdp_frame or
skb, calling respectively veth_xdp_rcv_one() or veth_xdp_rcv_skb().
For packets to arrive in xdp_frame format, they will have been
redirected from an XDP native driver. In case of XDP_PASS or no
XDP-prog attached, the veth driver will allocate and create an SKB.
The current code in veth_xdp_rcv_one() xdp_frame case, had to guess
the frame truesize of the incoming xdp_frame, when using
veth_build_skb(). With xdp_frame->frame_sz this is not longer
necessary.
Calculating the frame_sz in veth_xdp_rcv_skb() skb case, is done
similar to the XDP-generic handling code in net/core/dev.c.
veth: Adjust hard_start offset on redirect XDP frames
When native XDP redirect into a veth device, the frame arrives in the
xdp_frame structure. It is then processed in veth_xdp_rcv_one(),
which can run a new XDP bpf_prog on the packet. Doing so requires
converting xdp_frame to xdp_buff, but the tricky part is that
xdp_frame memory area is located in the top (data_hard_start) memory
area that xdp_buff will point into.
The current code tried to protect the xdp_frame area, by assigning
xdp_buff.data_hard_start past this memory. This results in 32 bytes
less headroom to expand into via BPF-helper bpf_xdp_adjust_head().
This protect step is actually not needed, because BPF-helper
bpf_xdp_adjust_head() already reserve this area, and don't allow
BPF-prog to expand into it. Thus, it is safe to point data_hard_start
directly at xdp_frame memory area.
xdp: Cpumap redirect use frame_sz and increase skb_tailroom
Knowing the memory size backing the packet/xdp_frame data area, and
knowing it already have reserved room for skb_shared_info, simplifies
using build_skb significantly.
With this change we no-longer lie about the SKB truesize, but more
importantly a significant larger skb_tailroom is now provided, e.g. when
drivers uses a full PAGE_SIZE. This extra tailroom (in linear area) can be
used by the network stack when coalescing SKBs (e.g. in skb_try_coalesce,
see TCP cases where tcp_queue_rcv() can 'eat' skb).
xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame
Use hole in struct xdp_frame, when adding member frame_sz, which keeps
same sizeof struct (32 bytes)
Drivers ixgbe and sfc had bug cases where the necessary/expected
tailroom was not reserved. This can lead to some hard to catch memory
corruption issues. Having the drivers frame_sz this can be detected when
packet length/end via xdp->data_end exceed the xdp_data_hard_end
pointer, which accounts for the reserved the tailroom.
When detecting this driver issue, simply fail the conversion with NULL,
which results in feedback to driver (failing xdp_do_redirect()) causing
driver to drop packet. Given the lack of consistent XDP stats, this can
be hard to troubleshoot. And given this is a driver bug, we want to
generate some more noise in form of a WARN stack dump (to ID the driver
code that inlined convert_to_xdp_frame).
Inlining the WARN macro is problematic, because it adds an asm
instruction (on Intel CPUs ud2) what influence instruction cache
prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this
and at the same time make identifying the function and line of this
inlined function easier.
The SKB "head" pointer points to the data area that contains
skb_shared_info, that can be found via skb_end_pointer(). Given
xdp->data_hard_start have been established (basically pointing to
skb->head), frame size is between skb_end_pointer() and data_hard_start,
plus the size reserved to skb_shared_info.
Change the bpf_xdp_adjust_tail offset adjust of skb->len, to be a positive
offset number on grow, and negative number on shrink. As this seems more
natural when reading the code.
Ilias Apalodimas [Thu, 14 May 2020 10:49:23 +0000 (12:49 +0200)]
net: netsec: Add support for XDP frame size
This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Due to xdp_adjust_tail can grow the
area accessible to the by the CPU (can possibly write into), then max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.
For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool have DMA-mapped the entire
page size.
This marvell driver mvneta uses PAGE_SIZE frames, which makes it
really easy to convert. Driver updates rxq and now frame_sz
once per NAPI call.
This driver takes advantage of page_pool PP_FLAG_DMA_SYNC_DEV that
can help reduce the number of cache-lines that need to be flushed
when doing DMA sync for_device. Due to xdp_adjust_tail can grow the
area accessible to the by the CPU (can possibly write into), then max
sync length *after* bpf_prog_run_xdp() needs to be taken into account.
For XDP_TX action the driver is smart and does DMA-sync. When growing
tail this is still safe, because page_pool have DMA-mapped the entire
page size.
This driver uses RX page-split when possible. It was recently fixed
in commit 86e85bf6981c ("sfc: fix XDP-redirect in this driver") to
add needed tailroom for XDP-redirect.
After the fix efx->rx_page_buf_step is the frame size, with enough
head and tail-room for XDP-redirect.
This driver uses full PAGE_SIZE pages when XDP is enabled.
In case of XDP uses driver uses __bnxt_alloc_rx_page which does full
page DMA-map. Thus, xdp_adjust_tail grow is DMA compliant for XDP_TX
action that does DMA-sync.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Cc: Michael Chan <michael.chan@broadcom.com> Cc: Andy Gospodarek <andrew.gospodarek@broadcom.com> Link: https://lore.kernel.org/bpf/158945334769.97035.13437970179897613984.stgit@firesoul
XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. The frame size (frame_sz) member of
xdp_buff is introduced to know the real size of the memory the frame is
delivered in.
When introducing this also make it clear that some tailroom is
reserved/required when creating SKBs using build_skb().
It would also have been an option to introduce a pointer to
data_hard_end (with reserved offset). The advantage with frame_sz is
that (like rxq) drivers only need to setup/assign this value once per
NAPI cycle. Due to XDP-generic (and some drivers) it's not possible to
store frame_sz inside xdp_rxq_info, because it's varies per packet as it
can be based/depend on packet length.
====================
v2->v3:
- better documentation for bpf_sk_cgroup_id in uapi (Yonghong Song)
- save/restore errno in network helpers (Yonghong Song)
- cleanup leftover after switching selftest to skeleton (Yonghong Song)
- switch from map to skel->bss in selftest (Yonghong Song)
v1->v2:
- switch selftests to skeleton.
This patch set allows a bunch of existing sk lookup and skb cgroup id
helpers, and adds two new bpf_sk_{,ancestor_}cgroup_id helpers to be used
in cgroup skb programs.
It fills the gap to cover a use-case to apply intra-host cgroup-bpf network
policy based on a source cgroup a packet comes from.
For example, there can be multiple containers A, B, C running on a host.
Every such container runs in its own cgroup that can have multiple
sub-cgroups. But all these containers can share some IP addresses.
At the same time container A wants to have a policy for a server S running
in it so that only clients from this same container can connect to S, but
not from other containers (such as B, C). Source IP address can't be used
to decide whether to allow or deny a packet, but it looks reasonable to
filter by cgroup id.
The patch set allows to implement the following policy:
* when an ingress packet comes to container's cgroup, lookup peer (client)
socket this packet comes from;
* having peer socket, get its cgroup id;
* compare peer cgroup id with self cgroup id and allow packet only if they
match, i.e. it comes from same cgroup;
* the "sub-cgroup" part of the story can be addressed by getting not direct
cgroup id of the peer socket, but ancestor cgroup id on specified level,
similar to existing "ancestor" flavors of cgroup id helpers.
A newly introduced selftest implements such a policy in its basic form to
provide a better idea on the use-case.
Patch 1 allows existing sk lookup helpers in cgroup skb.
Patch 2 allows skb_ancestor_cgroup_id in cgrou skb.
Patch 3 introduces two new helpers to get cgroup id of socket.
Patch 4 extends network helpers to use them in the next patch.
Patch 5 adds selftest / example of use-case.
====================
Andrey Ignatov [Thu, 14 May 2020 20:03:49 +0000 (13:03 -0700)]
selftests/bpf: Test for sk helpers in cgroup skb
Test bpf_sk_lookup_tcp, bpf_sk_release, bpf_sk_cgroup_id and
bpf_sk_ancestor_cgroup_id helpers from cgroup skb program.
The test creates a testing cgroup, starts a TCPv6 server inside the
cgroup and creates two client sockets: one inside testing cgroup and one
outside.
Then it attaches cgroup skb program to the cgroup that checks all TCP
segments coming to the server and allows only those coming from the
cgroup of the server. If a segment comes from a peer outside of the
cgroup, it'll be dropped.
Finally the test checks that client from inside testing cgroup can
successfully connect to the server, but client outside the cgroup fails
to connect by timeout.
The main goal of the test is to check newly introduced
bpf_sk_{,ancestor_}cgroup_id helpers.
It also checks a couple of socket lookup helpers (tcp & release), but
lookup helpers were introduced much earlier and covered by other tests.
Here it's mostly checked that they can be called from cgroup skb.
Andrey Ignatov [Thu, 14 May 2020 20:03:48 +0000 (13:03 -0700)]
selftests/bpf: Add connect_fd_to_fd, connect_wait net helpers
Add two new network helpers.
connect_fd_to_fd connects an already created client socket fd to address
of server fd. Sometimes it's useful to separate client socket creation
and connecting this socket to a server, e.g. if client socket has to be
created in a cgroup different from that of server cgroup.
Additionally connect_to_fd is now implemented using connect_fd_to_fd,
both helpers don't treat EINPROGRESS as an error and let caller decide
how to proceed with it.
connect_wait is a helper to work with non-blocking client sockets so
that if connect_to_fd or connect_fd_to_fd returned -1 with errno ==
EINPROGRESS, caller can wait for connect to finish or for connection
timeout. The helper returns -1 on error, 0 on timeout (1sec,
hard-coded), and positive number on success.
With having ability to lookup sockets in cgroup skb programs it becomes
useful to access cgroup id of retrieved sockets so that policies can be
implemented based on origin cgroup of such socket.
For example, a container running in a cgroup can have cgroup skb ingress
program that can lookup peer socket that is sending packets to a process
inside the container and decide whether those packets should be allowed
or denied based on cgroup id of the peer.
More specifically such ingress program can implement intra-host policy
"allow incoming packets only from this same container and not from any
other container on same host" w/o relying on source IP addresses since
quite often it can be the case that containers share same IP address on
the host.
Introduce two new helpers for this use-case: bpf_sk_cgroup_id() and
bpf_sk_ancestor_cgroup_id().
These helpers are similar to existing bpf_skb_{,ancestor_}cgroup_id
helpers with the only difference that sk is used to get cgroup id
instead of skb, and share code with them.
Andrey Ignatov [Thu, 14 May 2020 20:03:46 +0000 (13:03 -0700)]
bpf: Allow skb_ancestor_cgroup_id helper in cgroup skb
cgroup skb programs already can use bpf_skb_cgroup_id. Allow
bpf_skb_ancestor_cgroup_id as well so that container policies can be
implemented for a container that can have sub-cgroups dynamically
created, but policies should still be implemented based on cgroup id of
container itself not on an id of a sub-cgroup.
Andrey Ignatov [Thu, 14 May 2020 20:03:45 +0000 (13:03 -0700)]
bpf: Allow sk lookup helpers in cgroup skb
Currently sk lookup helpers are allowed in tc, xdp, sk skb, and cgroup
sock_addr programs.
But they would be useful in cgroup skb as well so that for example
cgroup skb ingress program can lookup a peer socket a packet comes from
on same host and make a decision whether to allow or deny this packet
based on the properties of that socket, e.g. cgroup that peer socket
belongs to.
Allow the following sk lookup helpers in cgroup skb:
* bpf_sk_lookup_tcp;
* bpf_sk_lookup_udp;
* bpf_sk_release;
* bpf_skc_lookup_tcp.
Andrii Nakryiko [Thu, 14 May 2020 05:51:37 +0000 (22:51 -0700)]
bpf: Fix bpf_iter's task iterator logic
task_seq_get_next might stop prematurely if get_pid_task() fails to get
task_struct. Failure to do so doesn't mean that there are no more tasks with
higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
does a retry in such case. After this fix, instead of stopping prematurely
after about 300 tasks on my server, bpf_iter program now returns >4000, which
sounds much closer to reality.
Lorenzo Bianconi [Tue, 12 May 2020 16:30:40 +0000 (18:30 +0200)]
samples/bpf: xdp_redirect_cpu: Set MAX_CPUS according to NR_CPUS
xdp_redirect_cpu is currently failing in bpf_prog_load_xattr()
allocating cpu_map map if CONFIG_NR_CPUS is less than 64 since
cpu_map_alloc() requires max_entries to be less than NR_CPUS.
Set cpu_map max_entries according to NR_CPUS in xdp_redirect_cpu_kern.c
and get currently running cpus in xdp_redirect_cpu_user.c
Heiner Kallweit [Thu, 14 May 2020 21:44:07 +0000 (23:44 +0200)]
r8169: don't include linux/moduleparam.h
93882c6f210a ("r8169: switch from netif_xxx message functions to
netdev_xxx") removed the last module parameter from the driver,
therefore there's no need any longer to include linux/moduleparam.h.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Heiner Kallweit [Thu, 14 May 2020 21:39:34 +0000 (23:39 +0200)]
r8169: remove not needed checks in rtl8169_set_eee
After 9de5d235b60a ("net: phy: fix aneg restart in phy_ethtool_set_eee")
we don't need the check for aneg being enabled any longer, and as
discussed with Russell configuring the EEE advertisement should be
supported even if we're in a half-duplex mode currently.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Thu, 14 May 2020 18:33:02 +0000 (19:33 +0100)]
net: dsa: felix: fix incorrect clamp calculation for burst
Currently burst is clamping on rate and not burst, the assignment
of burst from the clamping discards the previous assignment of burst.
This looks like a cut-n-paste error from the previous clamping
calculation on ramp. Fix this by replacing ramp with burst.
Addresses-Coverity: ("Unused value") Fixes: 0fbabf875d18 ("net: dsa: felix: add support Credit Based Shaper(CBS) for hardware offload") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mdio-moxart doesn't use regulators in the driver code. We can remove
the regulator include.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 13 May 2020 17:28:22 +0000 (10:28 -0700)]
devlink: refactor end checks in devlink_nl_cmd_region_read_dumpit
Clean up after recent fixes, move address calculations
around and change the variable init, so that we can have
just one start_offset == end_offset check.
Make the check a little stricter to preserve the -EINVAL
error if requested start offset is larger than the region
itself.
Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
am65-cpsw: add taprio/EST offload support
AM65 CPSW h/w supports Enhanced Scheduled Traffic (EST – defined
in P802.1Qbv/D2.2 that later got included in IEEE 802.1Q-2018)
configuration. EST allows express queue traffic to be scheduled
(placed) on the wire at specific repeatable time intervals. In
Linux kernel, EST configuration is done through tc command and
the taprio scheduler in the net core implements a software only
scheduler (SCH_TAPRIO). If the NIC is capable of EST configuration,
user indicate "flag 2" in the command which is then parsed by
taprio scheduler in net core and indicate that the command is to
be offloaded to h/w. taprio then offloads the command to the
driver by calling ndo_setup_tc() ndo ops. This patch implements
ndo_setup_tc() as well as other changes required to offload EST
configuration to CPSW h/w
For more details please refer patch 2/2.
This series is based on original work done by Ivan Khoronzhuk
<ivan.khoronzhuk@linaro.org> to add taprio offload support to
AM65 CPSW 2G.
1. Example configuration 3 Gates
ifconfig eth0 down
ethtool -L eth0 tx 3
ethtool --set-priv-flags eth0 p0-rx-ptype-rrobin off
ethtool --set-priv-flags eth0 p0-rx-ptype-rrobin off
ifconfig eth0 192.168.2.20
tc qdisc replace dev eth0 parent root handle 100 taprio \
num_tc 8 \
map 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 0 \
queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0000 \
sched-entry S 80 125000 \
sched-entry S 40 125000 \
sched-entry S 20 125000 \
sched-entry S 10 125000 \
sched-entry S 08 125000 \
sched-entry S 04 125000 \
sched-entry S 02 125000 \
sched-entry S 01 125000 \
flags 2
Classify frames to particular priority using skbedit so that they land at
a specific queue in cpsw h/w which is Gated by the EST gate which opens based
on the sched-entry.
tc qdisc add dev eth0 clsact
In the below for example an iperf3 session with destination port 5007
will go through Q7.
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5007 0xffff action skbedit priority 7
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5006 0xffff action skbedit priority 6
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5005 0xffff action skbedit priority 5
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5004 0xffff action skbedit priority 4
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5003 0xffff action skbedit priority 3
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5002 0xffff action skbedit priority 2
tc filter add dev eth0 egress protocol ip prio 1 u32 match ip dport 5001 0xffff action skbedit priority 1
Testing was done by capturing frames at the PC using wireshark and checking for
the bust interval or cycle time of UDP frames with a specific port number.
Verified that the distance between first frame of a burst (cycle-time) is 1
milli second and burst duration is within 125 usec based on the received packet
timestamp shown in wireshark packet display.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Wed, 13 May 2020 13:26:15 +0000 (09:26 -0400)]
ethernet: ti: am65-cpsw-qos: add TAPRIO offload support
AM65 CPSW h/w supports Enhanced Scheduled Traffic (EST – defined
in P802.1Qbv/D2.2 that later got included in IEEE 802.1Q-2018)
configuration. EST allows express queue traffic to be scheduled
(placed) on the wire at specific repeatable time intervals. In
Linux kernel, EST configuration is done through tc command and
the taprio scheduler in the net core implements a software only
scheduler (SCH_TAPRIO). If the NIC is capable of EST configuration,
user indicate "flag 2" in the command which is then parsed by
taprio scheduler in net core and indicate that the command is to
be offloaded to h/w. taprio then offloads the command to the
driver by calling ndo_setup_tc() ndo ops. This patch implements
ndo_setup_tc() to offload EST configuration to CPSW h/w.
Currently driver supports only SetGateStates operation. EST
operates on a repeating time interval generated by the CPTS EST
function generator. Each Ethernet port has a global EST fetch
RAM that can be configured as 2 buffers, each of 64 locations
or one large buffer of 128 locations. In 2 buffer configuration,
a ping pong mechanism is used to hold the active schedule (oper)
in one buffer and new (admin) command in the other. Each 22-bit
fetch command consists of a 14-bit fetch count (14 MSB’s) and an
8-bit priority fetch allow (8 LSB’s) that will be applied for the
fetch count time in wireside clocks. Driver process each of the
sched-entry in the offload command and update the fetch RAM.
Driver configures duration in sched-entry into the fetch count
and Gate mask into the priority fetch bits of the RAM. Then
configures the CPTS EST function generator to activate the
schedule. Currently driver supports only 2 buffer configuration
which means driver supports a max cycle time of ~8 msec.
CPSW supports a configurable number of priority queues (up to 8)
and needs to be switched to this mode from the default round
robin mode before EST can be offloaded. User configures
these through ethtool commands (-L for changing number of
queues and --set-priv-flags to disable round robin mode).
Driver doesn't enable EST if pf_p0_rx_ptype_rrobin privat flag
is set. The flag is common for all ports, and so can't be just
overridden by taprio configuration w/o user involvement.
Command fails if pf_p0_rx_ptype_rrobin is already set in the
driver.
Scheds (commands) configuration depends on interface speed so
driver translates the duration to the fetch count based on
link speed. Each schedule can be constructed with several
command entries in fetch RAM depending on interval. For example
if each sched has timer interval < ~130us on 1000 Mb link then
each sched consumes one command and have 1:1 mapping. When
Ethernet link goes down, driver purge the configuration if link
is down for more than 1 second.
The patch allows to update the timer and scheds memory only if it's
really needed, and skip cases required the user to stop timer by
configuring only shceds memory.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Khoronzhuk [Wed, 13 May 2020 13:26:14 +0000 (09:26 -0400)]
ethernet: ti: am65-cpts: add routines to support taprio offload
TAPRIO/EST offload support in CPSW2G requires EST scheduler
function enabled in CPTS. So this patch add a function to
set cycle time for EST scheduler. It also add a function for
getting time in ns of PHC clock for taprio qdisc configuration.
Mostly to verify if timer update is needed or to get actual
state of oper/admin schedule.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
FastLinQ devices as a complex systems may observe various hardware
level error conditions, both severe and recoverable.
Driver is able to detect and report this, but so far it only did
trace/dmesg based reporting.
Here we implement an extended hw error detection, service task
handler captures a dump for the later analysis.
I also resubmit a patch from Denis Bolotin on tx timeout handler,
addressing David's comment regarding recovery procedure as an extra
reaction on this event.
v2:
Removing the patch with ethtool dump and udev magic. Its quite isolated,
I'm working on devlink based logic for this separately.
Igor Russkikh [Thu, 14 May 2020 09:57:27 +0000 (12:57 +0300)]
net: qed: fix bad formatting
On some adjacent code, fix bad code formatting
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
MCP may signal driver about generic critical failure.
Driver has to collect mdump information (get_retain),
it pushes that to logs and triggers generic notification on
"hardware attention" event.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:25 +0000 (12:57 +0300)]
net: qed: introduce critical fan failure handler
Fan failure is sent by firmware, driver reacts on this error with
newly introduced notification path. It will collect dump and shut down
the device to prevent physical breakage
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Denis Bolotin [Thu, 14 May 2020 09:57:24 +0000 (12:57 +0300)]
net: qede: Implement ndo_tx_timeout
Upon tx timeout detection we do disable carrier and print TX queue
info on TX timeout. We then raise hw error condition and trigger
service task to handle this.
This handler will capture extra debug info and then optionally
trigger recovery procedure to try restore function.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:23 +0000 (12:57 +0300)]
net: qede: optional hw recovery procedure
Driver has an ability to initiate a recovery process as a reaction to
detected errors. But the codepath (recovery_process) was disabled and
never active.
Here we add ethtool private flag to allow user have the recovery
procedure activated.
We still do not enable this by default though, since in some configurations
this is not desirable. E.g. this may impact other PFs/VFs.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:22 +0000 (12:57 +0300)]
net: qed: attention clearing properties
On different hardware events we have to respond differently,
on some of hardware indications hw attention (error condition)
should be cleared by the driver to continue normal functioning.
Here we introduce attention clear flags, and put them on some
important events (in aeu_descs).
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:21 +0000 (12:57 +0300)]
net: qed: cleanup debug related declarations
Thats probably a legacy code had double declaration of some fields.
Cleanup this, removing copy and fixing references.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:20 +0000 (12:57 +0300)]
net: qed: critical err reporting to management firmware
On various critical errors, notification handler should also report
the err information into the management firmware.
MFW can interact with server/motherboard backend agents - these are
used by server manufacturers to monitor server HW health.
Thus, it is important for driver to report on any faulty conditions
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:19 +0000 (12:57 +0300)]
net: qed: invoke err notify on critical areas
In a number of critical places not only debug trace should be printed,
but the appropriate hw error condition should be raised and error
handling/recovery should start.
Introduce our new qed_hw_err_notify invocation in these places to
record and indicate critical error conditions in hardware.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:18 +0000 (12:57 +0300)]
net: qede: add hw err scheduled handler
qede (ethernet level driver) registers a callback handler.
This handler maintains eth dev state flags/bits to track error processing.
It implements in place processing part for nonsleeping context (WARN_ON
trigger), and a deferred (delayed work) part which triggers recovery
process for recoverable errors.
In later patches this atomic handler will come with more meat.
We introduce err_flags on ethdevice structure, its being used to record
error handling properties.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 14 May 2020 09:57:17 +0000 (12:57 +0300)]
net: qed: adding hw_err states and handling
Here we introduce qed device error tracking flags and error types.
qed_hw_err_notify is an entrace point to report errors.
It'll notify higher level drivers (qede/qedr/etc) to handle and recover
the error.
List of posible errors comes from hardware interfaces, but could be
extended in future.
Signed-off-by: Ariel Elior <ariel.elior@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 14 May 2020 12:41:26 +0000 (20:41 +0800)]
net: hns3: remove unnecessary frag list checking in hns3_nic_net_xmit()
The skb_has_frag_list() in hns3_nic_net_xmit() is redundant, since
skb_walk_frags() includes this checking implicitly.
Reported-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 14 May 2020 12:41:25 +0000 (20:41 +0800)]
net: hns3: remove some unused macros
There are some macros defined in hns3_enet.h, but not used in
anywhere.
Reported-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 14 May 2020 12:41:24 +0000 (20:41 +0800)]
net: hns3: modify an incorrect error log in hclge_mbx_handler()
When handling HCLGE_MBX_GET_LINK_STATUS, PF will return the link
status to the VF, so the error log of hclge_get_link_info() is
incorrect.
Reported-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 14 May 2020 12:41:23 +0000 (20:41 +0800)]
net: hns3: remove a duplicated printing in hclge_configure()
Since hclge_get_cfg() already has error print, so hclge_configure()
should not print error when calling hclge_get_cfg() fail.
Reported-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Thu, 14 May 2020 12:41:22 +0000 (20:41 +0800)]
net: hns3: modify some incorrect spelling
This patch modifies some incorrect spelling.
Reported-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thierry Reding [Thu, 14 May 2020 12:38:48 +0000 (14:38 +0200)]
r8152: Use MAC address from device tree if available
If a MAC address was passed via the device tree node for the r8152
device, use it and fall back to reading from EEPROM otherwise. This is
useful for devices where the r8152 EEPROM was not programmed with a
valid MAC address, or if users want to explicitly set a MAC address in
the bootloader and pass that to the kernel.
Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Buslov [Thu, 14 May 2020 06:35:52 +0000 (09:35 +0300)]
selftests: fix flower parent qdisc
Flower tests used to create ingress filter with specified parent qdisc
"parent ffff:" but dump them on "ingress". With recent commit that fixed
tcm_parent handling in dump those are not considered same parent anymore,
which causes iproute2 tc to emit additional "parent ffff:" in first line of
filter dump output. The change in output causes filter match in tests to
fail.
Prevent parent qdisc output when dumping filters in flower tests by always
correctly specifying "ingress" parent both when creating and dumping
filters.
Fixes: a7df4870d79b ("net_sched: fix tcm_parent in tc filter dump") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
DENG Qingfang [Wed, 13 May 2020 15:37:17 +0000 (23:37 +0800)]
net: dsa: mt7530: set CPU port to fallback mode
Currently, setting a bridge's self PVID to other value and deleting
the default VID 1 renders untagged ports of that VLAN unable to talk to
the CPU port:
bridge vlan add dev br0 vid 2 pvid untagged self
bridge vlan del dev br0 vid 1 self
bridge vlan add dev sw0p0 vid 2 pvid untagged
bridge vlan del dev sw0p0 vid 1
# br0 cannot send untagged frames out of sw0p0 anymore
That is because the CPU port is set to security mode and its PVID is
still 1, and untagged frames are dropped due to VLAN member violation.
Set the CPU port to fallback mode so untagged frames can pass through.
Fixes: 83163f7dca56 ("net: dsa: mediatek: add VLAN support for MT7530") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: speed down the PHY, if WoL used, to save energy
Some PHYs connected to this ethernet hardware support the WoL feature.
But when WoL is enabled and the machine is powered off, the PHY remains
waiting for a magic packet at max speed (i.e. 1Gbps), which is a waste of
energy.
Slow down the PHY speed before stopping the ethernet if WoL is enabled,
and save some energy while the machine is powered off or sleeping.
Tested using an Armada 370 based board (LS421DE) equipped with a Marvell 88E1518 PHY.
Signed-off-by: Daniel González Cabanelas <dgcbueu@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 12 May 2020 17:13:55 +0000 (18:13 +0100)]
sfc: fix dereference of table before it is null checked
Currently pointer table is being dereferenced on a null check of
table->must_restore_filters before it is being null checked, leading
to a potential null pointer dereference issue. Fix this by null
checking table before dereferencing it when checking for a null
table->must_restore_filters.
Addresses-Coverity: ("Dereference before null check") Fixes: e4fe938cff04 ("sfc: move 'must restore' flags out of ef10-specific nic_data") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Walle [Wed, 13 May 2020 20:38:07 +0000 (22:38 +0200)]
net: phy: at803x: add cable diagnostics support
The AR8031/AR8033 and the AR8035 support cable diagnostics. Adding
driver support is straightforward, so lets add it.
The PHY just do one pair at a time, so we have to start the test four
times. The cable_test_get_status() can block and therefore we can just
busy poll the test completion and continue with the next pair until we
are done.
The time delta counter seems to run at 125MHz which just gives us a
resolution of about 82.4cm per tick.
100m cable, A/B/C/D open:
Cable test started for device eth0.
Cable test completed for device eth0.
Pair: Pair A, result: Open Circuit
Pair: Pair A, fault length: 107.94m
Pair: Pair B, result: Open Circuit
Pair: Pair B, fault length: 104.64m
Pair: Pair C, result: Open Circuit
Pair: Pair C, fault length: 105.47m
Pair: Pair D, result: Open Circuit
Pair: Pair D, fault length: 107.94m
1m cable, A/B connected, C shorted, D open:
Cable test started for device eth0.
Cable test completed for device eth0.
Pair: Pair A, result: OK
Pair: Pair B, result: OK
Pair: Pair C, result: Short within Pair
Pair: Pair C, fault length: 0.82m
Pair: Pair D, result: Open Circuit
Pair: Pair D, fault length: 0.82m
Signed-off-by: Michael Walle <michael@walle.cc> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: set msg_control_is_user in do_ipv6_getsockopt
While do_ipv6_getsockopt does not call the high-level recvmsg helper,
the msghdr eventually ends up being passed to put_cmsg anyway, and thus
needs msg_control_is_user set to the proper value.
Fixes: 1f466e1f15cf ("net: cleanly handle kernel vs user buffers for ->msg_control") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: phy: broadcom: cable tester support
Add cable tester support for the Broadcom PHYs. Support for it was
developed on a BCM54140 Quad PHY which RDB register access.
If there is a link partner the results are not as good as with an open
cable. I guess we could retry if the measurement until all pairs had at
least one valid result.
changes since v1:
- added Reviewed-by: tags
- removed "div by 2" for cross shorts, just mention it in the commit
message. The results are inconclusive if the tests are repeated. So
just report the length as is for now.
- fixed typo in commit message
====================
Signed-off-by: David S. Miller <davem@davemloft.net>