Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/bpf/20201005090525.116689-1-bjorn.topel@gmail.com
bpf: Deref map in BPF_PROG_BIND_MAP when it's already used
We are missing a deref for the case when we are doing BPF_PROG_BIND_MAP
on a map that's being already held by the program.
There is 'if (ret) bpf_map_put(map)' below which doesn't trigger
because we don't consider this an error.
Let's add missing bpf_map_put() for this specific condition.
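The reference-count imbalance described above can be illustrated with a small userspace model. This is only a sketch: struct map, struct prog, map_get(), map_put() and prog_bind_map() are hypothetical stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_USED_MAPS 8

/* Hypothetical stand-ins for the kernel's map and program objects. */
struct map {
	int refcnt;
};

struct prog {
	struct map *used_maps[MAX_USED_MAPS];
	size_t used_map_cnt;
};

void map_get(struct map *m)
{
	m->refcnt++;
}

void map_put(struct map *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* Models the bind operation: looking the map up always takes a reference. */
int prog_bind_map(struct prog *p, struct map *m)
{
	size_t i;

	map_get(m);

	for (i = 0; i < p->used_map_cnt; i++) {
		if (p->used_maps[i] == m) {
			/* Already held by the program. This is not an error,
			 * but the reference taken above must still be dropped.
			 */
			map_put(m);
			return 0;
		}
	}

	if (p->used_map_cnt == MAX_USED_MAPS) {
		map_put(m);	/* the error path drops the reference */
		return -1;
	}

	/* On success the program keeps the reference. */
	p->used_maps[p->used_map_cnt++] = m;
	return 0;
}
```

Binding the same map twice leaves the count unchanged after the second call, which is the behaviour the fix restores.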
====================
This implements the helper skb_adjust_room() for BPF_SK_SKB_STREAM_VERDICT
programs so we can push/pop headers from the data on receive. One use
case is to pop TLS headers off kTLS packets.
The first patch implements the helper and the second updates test_sockmap
to use it removing some case handling we had to do earlier to account for
the TLS headers in the kTLS tests.
v1->v2:
Fix error path for TLS case (Daniel)
check mode input is 0 because we don't use it now (Daniel)
Remove incorrect/misleading comment (Lorenz)
Thanks,
John
Acked-by: Martin KaFai Lau <kafai@fb.com>
---
====================
John Fastabend [Fri, 2 Oct 2020 01:10:09 +0000 (18:10 -0700)]
bpf, sockmap: Update selftests to use skb_adjust_room
Instead of working around TLS headers in sockmap selftests use the
new skb_adjust_room helper. This allows us to avoid special casing
the receive side to skip headers.
John Fastabend [Fri, 2 Oct 2020 01:09:52 +0000 (18:09 -0700)]
bpf, sockmap: Add skb_adjust_room to pop bytes off ingress payload
This implements a new helper skb_adjust_room() so users can push/pop
extra bytes from a BPF_SK_SKB_STREAM_VERDICT program.
Some protocols may include headers and other information that we may
not want to include when doing a redirect from a BPF_SK_SKB_STREAM_VERDICT
program. One use case is to redirect TLS packets into a receive socket
that doesn't expect TLS data. In TLS case the first 13B or so contain the
protocol header. With KTLS the payload is decrypted so we should be able
to redirect this to a receiving socket, but the receiving socket may not
be expecting to receive a TLS header and discard the data. Using the
above helper we can pop the header off and put an appropriate header on
the payload. This allows for creating a proxy between protocols without
extra hops through the stack or userspace.
So in order to fix this case add skb_adjust_room() so users can strip the
header. After this the user can strip the header and an unmodified receiver
thread will work correctly when data is redirected into the ingress path
of a sock.
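The effect of popping a header off the front of a payload can be sketched in plain userspace C. This models only the data movement and is not the BPF helper itself; pop_header() and HDR_LEN are hypothetical names, and the 13-byte length is simply the approximate figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Header length used for illustration only ("13B or so" in the text). */
#define HDR_LEN 13

/* Remove the first hdr_len bytes of buf in place and return the new
 * payload length. If the buffer is shorter than the header, it is left
 * untouched and its original length is returned.
 */
size_t pop_header(unsigned char *buf, size_t len, size_t hdr_len)
{
	assert(buf != NULL);

	if (len < hdr_len)
		return len;

	memmove(buf, buf + hdr_len, len - hdr_len);
	return len - hdr_len;
}
```

After the call, a receiver that knows nothing about the stripped protocol sees only the payload bytes.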
====================
v3 -> v4:
- Rebasing
- Cast bpf_[per|this]_cpu_ptr's parameter to void __percpu * before
passing into per_cpu_ptr.
v2 -> v3:
- Rename functions and variables in verifier for better readability.
- Stick to logging message convention in libbpf.
- Move bpf_per_cpu_ptr and bpf_this_cpu_ptr from trace-specific
helper set to base helper set.
- More specific test in ksyms_btf.
- Fix return type cast in bpf_*_cpu_ptr.
- Fix btf leak in ksyms_btf selftest.
- Fix return error code for kallsyms_find().
v1 -> v2:
- Move check_pseudo_btf_id from check_ld_imm() to
replace_map_fd_with_map_ptr() and rename the latter.
- Add bpf_this_cpu_ptr().
- Use bpf_core_types_are_compat() in libbpf.c for checking type
compatibility.
- Rewrite typed ksym extern type in BTF with int to save space.
- Minor revision of bpf_per_cpu_ptr()'s comments.
- Avoid using long in tests that use skeleton.
- Refactored test_ksyms.c by moving kallsyms_find() to trace_helpers.c
- Fold the patches that sync include/linux/uapi and
tools/include/linux/uapi.
rfc -> v1:
- Encode VAR's btf_id for PSEUDO_BTF_ID.
- More checks in verifier. Checking the btf_id passed as
PSEUDO_BTF_ID is valid VAR, its name and type.
- Checks in libbpf on type compatibility of ksyms.
- Add bpf_per_cpu_ptr() to access kernel percpu vars. Introduced
new ARG and RET types for this helper.
This patch series extends the previously added __ksym externs with
btf support.
Right now the __ksym externs are treated as pure 64-bit scalar values.
Libbpf replaces the ld_imm64 insn of a __ksym with its kernel address at load
time. This patch series extends those externs with their btf info. Note
that btf support for __ksym must come with the kernel btf that has
VARs encoded to work properly. The corresponding changes in pahole
are available at [1] (with a fix at [2] for gcc 4.9+).
The first 3 patches in this series add support for general kernel
global variables, which include verifier checking (01/06), libbpf
support (02/06) and selftests for getting typed ksym extern's kernel
address (03/06).
The next 3 patches extend that capability further by introducing
helpers bpf_per_cpu_ptr() and bpf_this_cpu_ptr(), which allow accessing
kernel percpu variables correctly (04/06 and 05/06).
The tests of this feature were performed against pahole that is extended
with [1] and [2]. For kernel BTF that does not have VARs encoded, the
selftests will be skipped.
Hao Luo [Tue, 29 Sep 2020 23:50:49 +0000 (16:50 -0700)]
bpf/selftests: Test for bpf_per_cpu_ptr() and bpf_this_cpu_ptr()
Test bpf_per_cpu_ptr() and bpf_this_cpu_ptr(). Test two paths in the
kernel. If the base pointer points to a struct, the returned reg is
of type PTR_TO_BTF_ID. Direct pointer dereference can be applied on
the returned variable. If the base pointer isn't a struct, the
returned reg is of type PTR_TO_MEM, which also supports direct pointer
dereference.
Hao Luo [Tue, 29 Sep 2020 23:50:48 +0000 (16:50 -0700)]
bpf: Introducte bpf_this_cpu_ptr()
Add bpf_this_cpu_ptr() to help access percpu var on this cpu. This
helper always returns a valid pointer, therefore no need to check
returned value for NULL. Also note that all programs run with
preemption disabled, which means that the returned pointer is stable
during all the execution of the program.
Hao Luo [Tue, 29 Sep 2020 23:50:47 +0000 (16:50 -0700)]
bpf: Introduce bpf_per_cpu_ptr()
Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
bpf_per_cpu_ptr() has the same semantics as per_cpu_ptr() in the kernel
except that it may return NULL. This happens when the cpu parameter is
out of range. So the caller must check the returned value.
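The calling convention described above (check the result for NULL before dereferencing) can be sketched with a userspace model. The names model_per_cpu_ptr(), read_counter(), counter and NR_CPUS are hypothetical and only imitate the helper's contract.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

/* Hypothetical per-CPU variable: one slot per possible CPU. */
int counter[NR_CPUS];

/* Models the helper's contract: NULL when cpu is out of range. */
int *model_per_cpu_ptr(int *base, unsigned int cpu)
{
	assert(base != NULL);

	if (cpu >= NR_CPUS)
		return NULL;
	return &base[cpu];
}

/* Caller pattern: always check the result before dereferencing. */
int read_counter(unsigned int cpu, int *out)
{
	int *p = model_per_cpu_ptr(counter, cpu);

	if (!p)
		return -1;
	*out = *p;
	return 0;
}
```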
Hao Luo [Tue, 29 Sep 2020 23:50:46 +0000 (16:50 -0700)]
selftests/bpf: Ksyms_btf to test typed ksyms
Selftests for typed ksyms. Tests two types of ksyms: one is a struct,
the other is a plain int. This tests two paths in the kernel. Struct
ksyms will be converted into PTR_TO_BTF_ID by the verifier while int
typed ksyms will be converted into PTR_TO_MEM.
Hao Luo [Tue, 29 Sep 2020 23:50:45 +0000 (16:50 -0700)]
bpf/libbpf: BTF support for typed ksyms
If a ksym is defined with a type, libbpf will try to find the ksym's btf
information from kernel btf. If a valid btf entry for the ksym is found,
libbpf can pass in the found btf id to the verifier, which validates the
ksym's type and value.
Typeless ksyms (i.e. those defined as 'void') will not have such a btf_id,
but they do have the symbol's address (read from kallsyms), and their value is
treated as a raw pointer.
Hao Luo [Tue, 29 Sep 2020 23:50:44 +0000 (16:50 -0700)]
bpf: Introduce pseudo_btf_id
Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
ksym so that further dereferences on the ksym can use the BTF info
to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
the verifier reads the btf_id stored in the insn[0]'s imm field and
marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
and the mem_size is resolved to the size of the VAR's type.
From the VAR btf_id, the verifier can also read the address of the
ksym's corresponding kernel var from kallsyms and use that to fill
dst_reg.
Therefore, the proper functionality of pseudo_btf_id depends on (1)
kallsyms and (2) the encoding of kernel global VARs in pahole, which
should be available since pahole v1.18.
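The choice between the two register types can be sketched as follows. This is a simplified model and not verifier code: struct var_info, struct reg_state and mark_pseudo_btf_id() are hypothetical names, and only the struct/non-struct decision from the text above is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum reg_type { PTR_TO_BTF_ID, PTR_TO_MEM };

/* Hypothetical, simplified description of a kernel VAR found in BTF. */
struct var_info {
	bool is_struct;		/* does the VAR's type resolve to a struct? */
	size_t type_size;	/* size of the VAR's type in bytes */
	unsigned int type_btf_id;
};

struct reg_state {
	enum reg_type type;
	unsigned int btf_id;	/* meaningful for PTR_TO_BTF_ID */
	size_t mem_size;	/* meaningful for PTR_TO_MEM */
};

/* Mark the destination register according to the VAR's type. */
void mark_pseudo_btf_id(struct reg_state *dst, const struct var_info *var)
{
	assert(dst != NULL && var != NULL);

	if (var->is_struct) {
		dst->type = PTR_TO_BTF_ID;
		dst->btf_id = var->type_btf_id;
		dst->mem_size = 0;
	} else {
		/* Non-struct VARs become plain memory of a known size. */
		dst->type = PTR_TO_MEM;
		dst->btf_id = 0;
		dst->mem_size = var->type_size;
	}
}
```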
Merge branch 'Do not limit cb_flags when creating child sk'
Martin KaFai says:
====================
This set fixes an issue that the bpf_skops_init_child() unnecessarily
limited the child sk from inheriting all bpf_sock_ops_cb_flags
of the listen sk. It also adds a test to check that.
====================
Tested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: selftest: Ensure the child sk inherited all bpf_sock_ops_cb_flags
This patch adds a test to ensure the child sk inherited everything
from the bpf_sock_ops_cb_flags of the listen sk:
1. Sets one more cb_flags (BPF_SOCK_OPS_STATE_CB_FLAG) to the listen sk
in test_tcp_hdr_options.c
2. Saves the skops->bpf_sock_ops_cb_flags when handling the newly
established passive connection
3. CHECK() it is the same as the listen sk
This also covers the fastopen case as the existing test_tcp_hdr_options.c
does.
bpf: tcp: Do not limit cb_flags when creating child sk from listen sk
The commit 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option")
unnecessarily introduced bpf_skops_init_child() which limited the child
sk from inheriting all bpf_sock_ops_cb_flags of the listen sk. That
breaks existing user expectation.
This patch removes the bpf_skops_init_child() and just allows
sock_copy() to do its job to copy everything from listen sk to
the child sk.
Fixes: 0813a841566f ("bpf: tcp: Allow bpf prog to write and parse TCP header option") Reported-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201002013448.2542025-1-kafai@fb.com
selftests/bpf: Properly initialize linfo in sockmap_basic
When using -Werror=missing-braces, compiler complains about missing braces.
Let's use ={} initialization, which should do the job:
tools/testing/selftests/bpf/prog_tests/sockmap_basic.c: In function 'test_sockmap_iter':
tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:181:8: error: missing braces around initializer [-Werror=missing-braces]
union bpf_iter_link_info linfo = {0};
^
tools/testing/selftests/bpf/prog_tests/sockmap_basic.c:181:8: error: (near initialization for 'linfo.map') [-Werror=missing-braces]
tools/testing/selftests/bpf/prog_tests/sockmap_basic.c: At top level:
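The two ways of zero-initializing such a union can be sketched as below. union link_info is a hypothetical type that only mirrors the relevant shape (a union whose first member is a struct). Note that "= {}" is a GNU C extension before C23, so the memset() variant is shown as the strictly portable alternative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical union whose first member is itself a struct. */
union link_info {
	struct {
		unsigned int map_fd;
	} map;
};

unsigned int init_with_empty_braces(void)
{
	/* "= {0}" initializes the nested struct without its own braces,
	 * which is what -Wmissing-braces reports on some compilers.
	 * "= {}" zero-initializes without naming any member.
	 */
	union link_info linfo = {};

	return linfo.map.map_fd;
}

unsigned int init_with_memset(void)
{
	/* Portable alternative for strictly conforming pre-C23 compilers. */
	union link_info linfo;

	memset(&linfo, 0, sizeof(linfo));
	return linfo.map.map_fd;
}

/* Both forms must yield an all-zero object. */
void check_initializers(void)
{
	assert(init_with_empty_braces() == 0);
	assert(init_with_memset() == 0);
}
```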
selftests/bpf: Initialize duration in xdp_noinline.c
Fixes clang error:
tools/testing/selftests/bpf/prog_tests/xdp_noinline.c:35:6: error: variable 'duration' is uninitialized when used here [-Werror,-Wuninitialized]
if (CHECK(!skel, "skel_open_and_load", "failed\n"))
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Willy Liu [Wed, 30 Sep 2020 06:48:58 +0000 (14:48 +0800)]
net: phy: realtek: Modify 2.5G PHY name to RTL8226
Realtek single-chip Ethernet PHY solutions can be separated as below:
10M/100Mbps: RTL8201X
1Gbps: RTL8211X
2.5Gbps: RTL8226/RTL8221X
RTL8226 is Realtek's first 2.5Gbps-compatible single PHY.
Since RTL8226 is single-port only, Realtek changed its name to RTL8221B from
the second version.
PHY ID for RTL8226 is 0x001cc800 and RTL8226B/RTL8221B is 0x001cc840.
RTL8125 is not a single PHY solution, it integrates PHY/MAC/PCIE bus
controller and embedded memory.
Signed-off-by: Willy Liu <willy.liu@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
caif_virtio: Remove redundant initialization of variable err
After commit a8c7687bf216 ("caif_virtio: Check that vringh_config is not
null"), initializing the variable err with '-EINVAL' is
meaningless. So remove the initialization.
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ye Bin [Wed, 30 Sep 2020 01:08:38 +0000 (09:08 +0800)]
net-sysfs: Fix inconsistent of format with argument type in net-sysfs.c
Fix the following warnings:
[net/core/net-sysfs.c:1161]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'int'.
[net/core/net-sysfs.c:1162]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'int'.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ye Bin [Wed, 30 Sep 2020 01:08:37 +0000 (09:08 +0800)]
pktgen: Fix inconsistent of format with argument type in pktgen.c
Fix the following warnings:
[net/core/pktgen.c:925]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:942]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:962]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:984]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:1149]: (warning) %d in format string (no. 1)
requires 'int' but the argument type is 'unsigned int'.
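The rule behind these warnings is that each conversion specifier must match the signedness of its argument. A minimal sketch follows; format_counts() is a hypothetical helper, not code from pktgen.

```c
#include <assert.h>
#include <stdio.h>

/* Format one signed and one unsigned value with matching conversions.
 * Returns the number of characters snprintf() would have written.
 */
int format_counts(char *buf, size_t len, int signed_val,
		  unsigned int unsigned_val)
{
	assert(buf != NULL && len > 0);

	/* %d pairs with int, %u pairs with unsigned int. */
	return snprintf(buf, len, "%d %u", signed_val, unsigned_val);
}
```

Swapping the two specifiers would still compile, but a negative int printed with %u shows up as a large positive number.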
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xie He [Mon, 28 Sep 2020 12:56:43 +0000 (05:56 -0700)]
drivers/net/wan/hdlc_fr: Correctly handle special skb->protocol values
The fr_hard_header function is used to prepend the header to skbs before
transmission. It is used in 3 situations:
1) When a control packet is generated internally in this driver;
2) When a user sends an skb on an Ethernet-emulating PVC device;
3) When a user sends an skb on a normal PVC device.
These 3 situations need to be handled differently by fr_hard_header.
Different headers should be prepended to the skb in different situations.
Currently fr_hard_header distinguishes these 3 situations using
skb->protocol. For situation 1 and 2, a special skb->protocol value
will be assigned before calling fr_hard_header, so that it can recognize
these 2 situations. All skb->protocol values other than these special ones
are treated by fr_hard_header as situation 3.
However, it is possible that in situation 3, the user sends an skb with
one of the special skb->protocol values. In this case, fr_hard_header
would incorrectly treat it as situation 1 or 2.
This patch tries to solve this issue by using skb->dev instead of
skb->protocol to distinguish between these 3 situations. For situation
1, skb->dev would be NULL; for situation 2, skb->dev->type would be
ARPHRD_ETHER; and for situation 3, skb->dev->type would be ARPHRD_DLCI.
This way fr_hard_header would be able to distinguish these 3 situations
correctly regardless of what skb->protocol value the user tries to use in
situation 3.
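The classification by skb->dev can be sketched in userspace as follows. The struct definitions are minimal hypothetical stand-ins and fr_classify() is an illustrative name; the ARPHRD_* values are the ones defined in <linux/if_arp.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined in <linux/if_arp.h>. */
#define ARPHRD_ETHER 1
#define ARPHRD_DLCI 15

/* Minimal hypothetical stand-ins for the kernel structures. */
struct net_device {
	unsigned short type;
};

struct sk_buff {
	struct net_device *dev;
	unsigned short protocol;	/* deliberately ignored below */
};

enum fr_situation {
	FR_CONTROL,	/* 1) internally generated control packet */
	FR_ETH_PVC,	/* 2) Ethernet-emulating PVC device */
	FR_NORMAL_PVC,	/* 3) normal PVC device */
	FR_UNKNOWN
};

/* Decide which header to prepend based on skb->dev only. */
enum fr_situation fr_classify(const struct sk_buff *skb)
{
	assert(skb != NULL);

	if (!skb->dev)
		return FR_CONTROL;
	if (skb->dev->type == ARPHRD_ETHER)
		return FR_ETH_PVC;
	if (skb->dev->type == ARPHRD_DLCI)
		return FR_NORMAL_PVC;
	return FR_UNKNOWN;
}
```

Because skb->protocol is never consulted, a user-chosen protocol value cannot change the result.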
Cc: Krzysztof Halasa <khc@pm.waw.pl> Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net-next* tree.
We've added 90 non-merge commits during the last 8 day(s) which contain
a total of 103 files changed, 7662 insertions(+), 1894 deletions(-).
Note that once bpf(/net) tree gets merged into net-next, there will be a small
merge conflict in tools/lib/bpf/btf.c between commit 1245008122d7 ("libbpf: Fix
native endian assumption when parsing BTF") from the bpf tree and the commit 3289959b97ca ("libbpf: Support BTF loading and raw data output in both endianness")
from the bpf-next tree. Correct resolution would be to stick with bpf-next, it
should look like:
[...]
/* check BTF magic */
if (fread(&magic, 1, sizeof(magic), f) < sizeof(magic)) {
err = -EIO;
goto err_out;
}
if (magic != BTF_MAGIC && magic != bswap_16(BTF_MAGIC)) {
/* definitely not a raw BTF */
err = -EPROTO;
goto err_out;
}
/* get file size */
[...]
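The magic check in the resolution above accepts either byte order. A self-contained sketch of the same idea is shown here; btf_check_magic() and the enum are hypothetical, and a local swap16() stands in for bswap_16() so that the example needs no platform-specific header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F

enum btf_check { BTF_NATIVE, BTF_SWAPPED, BTF_TOO_SHORT, BTF_NOT_RAW };

/* Portable 16-bit byte swap. */
static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Classify a buffer by the 16-bit magic at its start. */
enum btf_check btf_check_magic(const unsigned char *data, size_t len)
{
	uint16_t magic;

	assert(data != NULL);

	if (len < sizeof(magic))
		return BTF_TOO_SHORT;

	memcpy(&magic, data, sizeof(magic));
	if (magic == BTF_MAGIC)
		return BTF_NATIVE;
	if (magic == swap16(BTF_MAGIC))
		return BTF_SWAPPED;	/* written on the other endianness */
	return BTF_NOT_RAW;		/* definitely not raw BTF */
}
```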
The main changes are:
1) Add bpf_snprintf_btf() and bpf_seq_printf_btf() helpers to support displaying
BTF-based kernel data structures out of BPF programs, from Alan Maguire.
2) Speed up RCU tasks trace grace periods by a factor of 50 & fix a few race
conditions exposed by it. It was discussed to take these via BPF and
networking tree to get better testing exposure, from Paul E. McKenney.
3) Support multi-attach for freplace programs, needed for incremental attachment
of multiple XDP progs using libxdp dispatcher model, from Toke Høiland-Jørgensen.
4) libbpf support for appending new BTF types at the end of BTF object, allowing
intrusive changes of prog's BTF (useful for future linking), from Andrii Nakryiko.
5) Several BPF helper improvements e.g. avoid atomic op in cookie generator and add
a redirect helper into neighboring subsys, from Daniel Borkmann.
6) Allow map updates on sockmaps from bpf_iter context in order to migrate sockmaps
from one to another, from Lorenz Bauer.
7) Fix 32 bit to 64 bit assignment from latest alu32 bounds tracking which caused
a verifier issue due to type downgrade to scalar, from John Fastabend.
8) Follow-up on tail-call support in BPF subprogs which optimizes x64 JIT prologue
and epilogue sections, from Maciej Fijalkowski.
9) Add an option to perf RB map to improve sharing of event entries by avoiding remove-
on-close behavior. Also, add BPF_PROG_TEST_RUN for raw_tracepoint, from Song Liu.
10) Fix a crash in AF_XDP's socket_release when memory allocation for UMEMs fails,
from Magnus Karlsson.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net/ravb: Add support for explicit internal clock delay configuration
Some Renesas EtherAVB variants support internal clock delay
configuration, which can add larger delays than the delays that are
typically supported by the PHY (using an "rgmii-*id" PHY mode, and/or
"[rt]xc-skew-ps" properties).
Historically, the EtherAVB driver configured these delays based on the
"rgmii-*id" PHY mode. This caused issues with PHY drivers that
implement PHY internal delays properly[1]. Hence a backwards-compatible
workaround was added by masking the PHY mode[2].
This patch series implements the next step of the plan outlined in [3],
and adds proper support for explicit configuration of the MAC internal
clock delays using new "[rt]x-internal-delay-ps" properties. If none of
these properties is present, the driver falls back to the old handling.
This can be considered the MAC counterpart of commit 9150069bf5fc0e86
("dt-bindings: net: Add tx and rx internal delays"), which applies to
the PHY. Note that unlike commit 92252eec913b2dd5 ("net: phy: Add a
helper to return the index for of the internal delay"), no helpers are
provided to parse the DT properties, as so far there is a single user
only, which supports only zero or a single fixed value. Of course such
helpers can be added later, when the need arises, or when deemed useful
otherwise.
This series consists of 3 parts:
1. DT binding updates documenting the new properties, for both the
generic ethernet-controller and the EtherAVB-specific bindings,
2. Conversion to json-schema of the Renesas EtherAVB DT bindings.
Technically, the conversion is independent of all of the above.
I included it in this series, as it shows how all sanity checks on
"[rt]x-internal-delay-ps" values are implemented as DT binding
checks,
3. EtherAVB driver update implementing support for the new properties.
Given Rob has provided his acks for the DT binding updates, all of this
can be merged through net-next.
Changes compared to v3[4]:
- Add Reviewed-by,
- Drop the DT updates, as they will be merged through renesas-devel and
arm-soc, and have a hard dependency on this series.
Impacted, tested:
- Salvator-X(S) with R-Car H3 ES1.0 and ES2.0, M3-W, and M3-N.
Not impacted, tested:
- Ebisu with R-Car E3.
Impacted, not tested:
- Salvator-X(S) with other SoC variants,
- ULCB with R-Car H3/M3-W/M3-N variants,
- V3MSK and Eagle with R-Car V3M,
- Draak with R-Car V3H,
- HiHope RZ/G2[MN] with RZ/G2M or RZ/G2N,
- Beacon EmbeddedWorks RZ/G2M Development Kit.
To ease testing, I have pushed this series and the DT updates to the
topic/ravb-internal-clock-delays-v4 branch of my renesas-drivers
repository at
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git.
Thanks for applying!
References:
[1] Commit bcf3440c6dd78bfe ("net: phy: micrel: add phy-mode support
for the KSZ9031 PHY")
[2] Commit 9b23203c32ee02cd ("ravb: Mask PHY mode to avoid inserting
delays twice").
https://lore.kernel.org/r/20200529122540.31368-1-geert+renesas@glider.be/
[3] https://lore.kernel.org/r/CAMuHMdU+MR-2tr3-pH55G0GqPG9HwH3XUd=8HZxprFDMGQeWUw@mail.gmail.com/
[4] https://lore.kernel.org/linux-devicetree/20200819134344.27813-1-geert+renesas@glider.be/
[5] https://lore.kernel.org/linux-devicetree/20200706143529.18306-1-geert+renesas@glider.be/
[6] https://lore.kernel.org/linux-devicetree/20200619191554.24942-1-geert+renesas@glider.be/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ravb: Add support for explicit internal clock delay configuration
Some EtherAVB variants support internal clock delay configuration, which
can add larger delays than the delays that are typically supported by
the PHY (using an "rgmii-*id" PHY mode, and/or "[rt]xc-skew-ps"
properties).
Historically, the EtherAVB driver configured these delays based on the
"rgmii-*id" PHY mode. This caused issues with PHY drivers that
implement PHY internal delays properly[1]. Hence a backwards-compatible
workaround was added by masking the PHY mode[2].
Add proper support for explicit configuration of the MAC internal clock
delays using the new "[rt]x-internal-delay-ps" properties.
Fall back to the old handling if none of these properties is present.
[1] Commit bcf3440c6dd78bfe ("net: phy: micrel: add phy-mode support for
the KSZ9031 PHY")
[2] Commit 9b23203c32ee02cd ("ravb: Mask PHY mode to avoid inserting
delays twice").
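The fallback logic can be sketched as a small decision function. This is a simplified model and not the driver code: all type and function names are hypothetical, the mapping of PHY modes to delays in the fallback branch is an assumption about the old handling, and any SoC-specific quirks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum phy_mode {
	PHY_MODE_RGMII,
	PHY_MODE_RGMII_ID,
	PHY_MODE_RGMII_RXID,
	PHY_MODE_RGMII_TXID
};

/* Hypothetical view of the "[rt]x-internal-delay-ps" DT properties. */
struct delay_props {
	bool has_rx;
	unsigned int rx_ps;
	bool has_tx;
	unsigned int tx_ps;
};

/* Parsed result, computed once and then applied on probe and resume. */
struct delay_cfg {
	bool rx_delay;
	bool tx_delay;
};

void parse_delay_mode(struct delay_cfg *cfg, const struct delay_props *props,
		      enum phy_mode mode)
{
	assert(cfg != NULL && props != NULL);

	if (props->has_rx || props->has_tx) {
		/* New DTBs: explicit properties, non-zero enables the delay. */
		cfg->rx_delay = props->has_rx && props->rx_ps != 0;
		cfg->tx_delay = props->has_tx && props->tx_ps != 0;
		return;
	}

	/* Old DTBs: fall back to deriving the delays from the PHY mode. */
	cfg->rx_delay = mode == PHY_MODE_RGMII_ID ||
			mode == PHY_MODE_RGMII_RXID;
	cfg->tx_delay = mode == PHY_MODE_RGMII_ID ||
			mode == PHY_MODE_RGMII_TXID;
}
```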
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ravb: Split delay handling in parsing and applying
Currently, full delay handling is done in both the probe and resume
paths. Split it in two parts, so the resume path doesn't have to redo
the parsing part over and over again.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
dt-bindings: net: renesas,etheravb: Convert to json-schema
Convert the Renesas Ethernet AVB (EthernetAVB-IF) Device Tree binding
documentation to json-schema.
Add missing properties.
Update the example to match reality.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Some EtherAVB variants support internal clock delay configuration, which
can add larger delays than the delays that are typically supported by
the PHY (using an "rgmii-*id" PHY mode, and/or "[rt]xc-skew-ps"
properties).
Add properties for configuring the internal MAC delays.
These properties are mandatory, even when specified as zero, to
distinguish between old and new DTBs.
Update the (bogus) example accordingly.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Internal Receive and Transmit Clock Delays are a common setting for
RGMII capable devices.
While these delays are typically applied by the PHY, some MACs support
configuring internal clock delay settings, too. Hence add standardized
properties to configure this.
This is the MAC counterpart of commit 9150069bf5fc0e86 ("dt-bindings:
net: Add tx and rx internal delays"), which applies to the PHY.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Rob Herring <robh@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
This set introduces BPF_F_PRESERVE_ELEMS to perf event array for better
sharing of perf event. By default, perf event array removes the perf event
when the map fd used to add the event is closed. With BPF_F_PRESERVE_ELEMS
set, however, the perf event will stay in the array until it is removed, or
the map is closed.
---
Changes v3 => v5:
1. Clean up in selftest. (Alexei)
Changes v1 => v2:
1. Rename the flag as BPF_F_PRESERVE_ELEMS. (Alexei, Daniel)
2. Simplify the code and selftest. (Daniel, Alexei)
====================
Song Liu [Wed, 30 Sep 2020 22:49:27 +0000 (15:49 -0700)]
selftests/bpf: Add tests for BPF_F_PRESERVE_ELEMS
Add tests for perf event array with and without BPF_F_PRESERVE_ELEMS.
Add a perf event to array via fd mfd. Without BPF_F_PRESERVE_ELEMS, the
perf event is removed when mfd is closed. With BPF_F_PRESERVE_ELEMS, the
perf event is removed when the map is freed.
Song Liu [Wed, 30 Sep 2020 22:49:26 +0000 (15:49 -0700)]
bpf: Introduce BPF_F_PRESERVE_ELEMS for perf event array
Currently, perf event in perf event array is removed from the array when
the map fd used to add the event is closed. This behavior makes it
difficult to share perf events with perf event array.
Introduce perf event map that keeps the perf event open with a new flag
BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not
removed when the original map fd is closed. Instead, the perf event will
stay in the map until 1) it is explicitly removed from the array; or 2)
the array is freed.
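The difference in lifetime can be modelled in userspace as below. This is a sketch of the semantics only: struct perf_array, its helpers and the preserve_elems field are hypothetical names standing in for a perf event array created with or without the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SLOTS 4

struct slot {
	bool used;
	int owner_fd;	/* map fd that was used to add the event */
};

struct perf_array {
	bool preserve_elems;	/* models the new flag being set */
	struct slot slots[ARRAY_SLOTS];
};

void array_add(struct perf_array *a, size_t idx, int map_fd)
{
	assert(idx < ARRAY_SLOTS);
	a->slots[idx].used = true;
	a->slots[idx].owner_fd = map_fd;
}

/* Called when one map fd is closed. */
void array_fd_closed(struct perf_array *a, int map_fd)
{
	size_t i;

	if (a->preserve_elems)
		return;	/* events stay until removed or the array is freed */

	for (i = 0; i < ARRAY_SLOTS; i++)
		if (a->slots[i].used && a->slots[i].owner_fd == map_fd)
			a->slots[i].used = false;
}

size_t array_count(const struct perf_array *a)
{
	size_t i, n = 0;

	for (i = 0; i < ARRAY_SLOTS; i++)
		n += a->slots[i].used;
	return n;
}
```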
Dan Carpenter [Mon, 28 Sep 2020 09:05:56 +0000 (12:05 +0300)]
net/mlx5e: Fix a use after free on error in mlx5_tc_ct_shared_counter_get()
This code frees "shared_counter" and then dereferences it on the next line
to get the error code.
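The general shape of such a fix is to read the error code before freeing the object that holds it. A userspace sketch, with hypothetical names and malloc()/free() standing in for the kernel allocators:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical object that records the result of creating a counter. */
struct shared_counter {
	int create_err;
};

/* Returns 0 and stores the object in *out, or a negative errno. */
int shared_counter_get(int simulated_err, struct shared_counter **out)
{
	struct shared_counter *sc;
	int err;

	assert(out != NULL);

	sc = malloc(sizeof(*sc));
	if (!sc)
		return -ENOMEM;

	/* Stands in for the result of creating the hardware counter. */
	sc->create_err = simulated_err;

	if (sc->create_err) {
		err = sc->create_err;	/* read the error code first ... */
		free(sc);		/* ... then free the object */
		return err;		/* sc must not be touched after this */
	}

	*out = sc;
	return 0;
}
```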
Fixes: 1edae2335adf ("net/mlx5e: CT: Use the same counter for both directions") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5: Fix dereference on pointer attr after null check
When removing a flow from the slow path fdb, a flow attr struct is
allocated for the rule removal process. If the allocation fails the
code prints a warning message but continues with the removal flow
which includes dereferencing a pointer which could be null.
Fix this by exiting the function in case the attr allocation failed.
Use the PCI device directly for DMA accesses, as a non-PCI device is
unlikely to support IOMMU and DMA mappings.
Introduce and use helper routine to access DMA device.
Parav Pandit [Mon, 31 Aug 2020 19:47:47 +0000 (22:47 +0300)]
net/mlx5: E-switch, Move devlink eswitch ports closer to eswitch
Currently devlink eswitch ports are registered and unregistered by the
representor layer.
However, it is better to register them at the eswitch layer so that in the
future user-initiated port add and delete commands can also
register/unregister devlink ports without depending on the representor layer.
Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently only 256 vports can be supported as only 8 bits are
reserved for them and 8 bits are reserved for vhca_ids in
metadata reg c0. To support more than 256 vports, replace
vhca_id with a unique shorter 4-bit PF number which covers
up to 16 PFs. Use the remaining 12 bits for vports ranging 1-4095.
This will continue to generate unique metadata even if
multiple PCI devices have same switch_id.
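The 4-bit/12-bit split can be sketched with plain bit operations. The macro and function names are hypothetical, and placing the PF number in the upper bits is an assumption made for illustration; only the field widths come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define PFNUM_BITS 4
#define VPORT_BITS 12
#define PFNUM_MASK ((1u << PFNUM_BITS) - 1)
#define VPORT_MASK ((1u << VPORT_BITS) - 1)

/* Pack a PF number and a vport number into 16 bits of metadata. */
uint16_t make_metadata(unsigned int pf_num, unsigned int vport)
{
	assert(pf_num <= PFNUM_MASK);
	assert(vport <= VPORT_MASK);

	return (uint16_t)((pf_num << VPORT_BITS) | vport);
}

unsigned int metadata_pf_num(uint16_t md)
{
	return md >> VPORT_BITS;
}

unsigned int metadata_vport(uint16_t md)
{
	return md & VPORT_MASK;
}
```

Two devices that share a switch_id but differ in PF number therefore still produce distinct metadata for the same vport number.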
Hamdan Igbaria [Mon, 15 Jun 2020 15:18:14 +0000 (18:18 +0300)]
net/mlx5: DR, Add support for rule creation with flow source hint
Skip installing the rule on one side according to the flow's arrival
source: skip RX when the source is a local port, and skip TX when the
source is the uplink. This information comes from the flow source hint
that upper layers pass when creating the rule.
This is needed because, for example, an FDB table has both a TX and an RX
table. When inserting a rule with an encap action, which is a TX-only
action, the rule will fail on RX, so we can rely on the flow source hint
and skip RX in such a case.
Until now we relied on the port mapping that the upper layer stored in
metadata regc_0, but the upper layer did not always use regc_0 for port
mapping. So add support for a flow source hint, which upper layers pass
to SW steering when creating a rule.
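The skip decision reduces to two comparisons. In this sketch the enum values and rule_sides_to_skip() are hypothetical names, not the driver's identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flow source hint passed down by upper layers. */
enum flow_source {
	FLOW_SOURCE_ANY,
	FLOW_SOURCE_UPLINK,
	FLOW_SOURCE_LOCAL_PORT
};

/* Decide on which side (RX and/or TX) the rule need not be installed. */
void rule_sides_to_skip(enum flow_source src, bool *skip_rx, bool *skip_tx)
{
	assert(skip_rx != NULL && skip_tx != NULL);

	/* Traffic from a local port never arrives on the RX side. */
	*skip_rx = (src == FLOW_SOURCE_LOCAL_PORT);
	/* Traffic from the uplink never arrives on the TX side. */
	*skip_tx = (src == FLOW_SOURCE_UPLINK);
}
```

With no hint (FLOW_SOURCE_ANY here), neither side is skipped and the rule is built for both.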
net/mlx5: DR, Call ste_builder directly with tag pointer
Instead of getting the tag in each function, call the builder
directly with the tag. This allows using the same function
for building the tag and the bitmask.
net/mlx5: DR, Remove unneeded vlan check from L2 builder
When we create a matcher we check that all fields are consumed.
There is no need for this specific check. This keeps the STE
builder functions simple and clean.
net/mlx5: DR, Remove unneeded check from source port builder
Mask validity for ste builders is checked by mlx5dr_ste_build_pre_check
during matcher creation.
It already checks the mask value of source_vport, so remove
this duplicated check.
Also, move the check of the source_eswitch_owner_vhca_id mask there.
net/mlx5: DR, Replace the check for valid STE entry
The validity check is done by reading the next lu_type from the STE;
this check can be replaced by checking the refcount.
This makes the check independent of the internal STE structure.
Fix a build failure on arm64, due to missing alignment information for
the .BTF_ids section:
resolve_btfids.test.o: in function `test_resolve_btfids':
tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140:(.text+0x29c): relocation truncated to fit: R_AARCH64_LDST32_ABS_LO12_NC against `.BTF_ids'
ld: tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140: warning: one possible cause of this error is that the symbol is being referenced in the indicated code as if it had a larger alignment than was declared where it was defined
In vmlinux, the .BTF_ids section is aligned to 4 bytes by vmlinux.lds.h.
In test_progs however, .BTF_ids doesn't have alignment constraints. The
arm64 linker expects the btf_id_set.cnt symbol, a u32, to be naturally
aligned but finds it misaligned and cannot apply the relocation. Enforce
alignment of .BTF_ids to 4 bytes.
====================
drop_monitor: Convert to use devlink tracepoint
Drop monitor is able to monitor both software and hardware originated
drops. Software drops are monitored by having drop monitor register its
probe on the 'kfree_skb' tracepoint. Hardware originated drops are
monitored by having devlink call into drop monitor whenever it receives
a dropped packet from the underlying hardware.
This patch set converts drop monitor to monitor both software and
hardware originated drops in the same way - by registering its probe on
the relevant tracepoint.
In addition to drop monitor being more consistent, it is now also
possible to build drop monitor as module instead of as a builtin and
still monitor hardware originated drops. Initially, CONFIG_NET_DEVLINK
implied CONFIG_NET_DROP_MONITOR, but after commit def2fbffe62c
("kconfig: allow symbols implied by y to become m") we can have
CONFIG_NET_DEVLINK=y and CONFIG_NET_DROP_MONITOR=m and hardware
originated drops will not be monitored.
Patch set overview:
Patch #1 adds a tracepoint in devlink for trap reports.
Patch #2 prepares probe functions in drop monitor for the new
tracepoint.
Patch #3 converts drop monitor to use the new tracepoint.
Patches #4-#6 perform cleanups after the conversion.
Patch #7 adds a test case for drop monitor. Both software originated
drops and hardware originated drops (using netdevsim) are tested.
Tested:
| CONFIG_NET_DEVLINK | CONFIG_NET_DROP_MONITOR | Build | SW drops | HW drops |
| -------------------|-------------------------|-------|----------|----------|
| y | y | v | v | v |
| y | m | v | v | v |
| y | n | v | x | x |
| n | y | v | v | x |
| n | m | v | v | x |
| n | n | v | x | x |
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
drop_monitor: Filter control packets in drop monitor
Previously, devlink called into drop monitor in order to report hardware
originated drops / exceptions. devlink intentionally filtered control
packets and did not pass them to drop monitor as they were not dropped
by the underlying hardware.
Now drop monitor registers its probe on a generic 'devlink_trap_report'
tracepoint and should therefore perform this filtering itself instead of
having devlink do that.
Add the trap type as metadata and have drop monitor ignore control
packets.
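A minimal userspace sketch of the filtering described above, assuming the probe receives the trap type as part of its metadata. All names here (`demo_trap_type`, `demo_trap_md`, `demo_trap_probe`) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical trap types, modelled on the drop / exception / control split. */
enum demo_trap_type {
	DEMO_TRAP_DROP,
	DEMO_TRAP_EXCEPTION,
	DEMO_TRAP_CONTROL,
};

/* Hypothetical metadata handed to the probe along with the packet. */
struct demo_trap_md {
	const char *trap_name;
	enum demo_trap_type trap_type;
};

static unsigned int demo_reported;

/* Probe: report drops and exceptions, ignore control packets,
 * since those were not dropped by the hardware. */
bool demo_trap_probe(const struct demo_trap_md *md)
{
	if (md->trap_type == DEMO_TRAP_CONTROL)
		return false;
	demo_reported++;
	return true;
}
```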
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The old probe functions that were invoked by drop monitor code are no
longer called and can thus be removed. They were replaced by actual
probe functions that are registered on the recently introduced
'devlink_trap_report' tracepoint.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert drop monitor to use the recently introduced
'devlink_trap_report' tracepoint instead of having devlink call into
drop monitor.
This is both consistent with software originated drops ('kfree_skb'
tracepoint) and also allows drop monitor to be built as a module and
still report hardware originated drops.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
drop_monitor: Prepare probe functions for devlink tracepoint
Drop monitor supports two alerting modes: Summary and packet. Prepare a
probe function for each, so that they could be later registered on the
devlink tracepoint by calling register_trace_devlink_trap_report(),
based on the configured alerting mode.
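The idea of one probe per alerting mode can be sketched in userspace as a function pointer chosen at registration time. Everything below is hypothetical (the names, the probe signature, and the `demo_register_probe()` stand-in for the real registration call); it only shows the selection logic.

```c
#include <stddef.h>

enum demo_alert_mode { DEMO_ALERT_SUMMARY, DEMO_ALERT_PACKET };

typedef void (*demo_probe_t)(unsigned int pkt_len);

static unsigned int demo_summary_hits; /* summary mode only counts events */
static unsigned int demo_packet_bytes; /* packet mode looks at the packet too */
static demo_probe_t demo_active_probe;

void demo_summary_probe(unsigned int pkt_len)
{
	(void)pkt_len;
	demo_summary_hits++;
}

void demo_packet_probe(unsigned int pkt_len)
{
	demo_packet_bytes += pkt_len;
}

/* Stand-in for registering a probe on the tracepoint, based on the mode. */
int demo_register_probe(enum demo_alert_mode mode)
{
	switch (mode) {
	case DEMO_ALERT_SUMMARY:
		demo_active_probe = demo_summary_probe;
		return 0;
	case DEMO_ALERT_PACKET:
		demo_active_probe = demo_packet_probe;
		return 0;
	}
	return -1; /* unknown mode: nothing registered */
}
```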
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add a tracepoint for trap reports so that drop monitor could register
its probe on it. Use trace_devlink_trap_report_enabled() to avoid
wasting cycles setting the trap metadata if the tracepoint is not
enabled.
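The "check enabled first" pattern looks roughly like the following userspace sketch. The flag `demo_tracepoint_enabled` stands in for the generated `trace_devlink_trap_report_enabled()` check, and the metadata struct and counters are hypothetical.

```c
#include <stdbool.h>

struct demo_report_md {
	const char *trap_name;
	unsigned int in_port;
};

static bool demo_tracepoint_enabled;      /* stand-in for the _enabled() check */
static unsigned int demo_metadata_builds; /* how often the costly path ran */

void demo_build_metadata(struct demo_report_md *md, unsigned int port)
{
	md->trap_name = "example_trap";
	md->in_port = port;
	demo_metadata_builds++;
}

/* Only pay for filling in the metadata when somebody is listening. */
void demo_report_trap(unsigned int port)
{
	struct demo_report_md md;

	if (!demo_tracepoint_enabled)
		return;

	demo_build_metadata(&md, port);
	/* The tracepoint would be fired here with &md. */
}
```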
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pull request of 13 patches for net-next.
The first 10 target the mcp25xxfd driver (which is renamed to mcp251xfd during
this series).
The first two patches are by Thomas Kopp; they add a reference to the recently
released errata and update the documentation and log messages.
Dan Carpenter's patch fixes a resource leak during ifdown.
A patch by me adds the missing initialization of a variable.
Oleksij Rempel updates the DT binding documentation as requested by Rob
Herring.
The next 5 patches are by Thomas Kopp and me. During review Geert Uytterhoeven
suggested to use "microchip,mcp251xfd" instead of "microchip,mcp25xxfd" as the
DT autodetection compatible to avoid clashes with future but incompatible
devices. We decided not only to rename the compatible but the whole driver from
"mcp25xxfd" to "mcp251xfd". This is done in several patches.
Joakim Zhang contributes three patches for the flexcan driver. The first one
adds support for the ECC feature, which is implemented on some modern IP cores,
by initializing the controller's memory during ifup. The next patch adds
support for the i.MX8MP (which supports ECC) and the last patch properly
disables the runtime PM if device registration fails.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 30 Sep 2020 22:11:09 +0000 (15:11 -0700)]
Merge branch 'ionic-watchdog-training'
Shannon Nelson says:
====================
ionic watchdog training
Our link watchdog displayed a couple of unfriendly behaviors in some recent
stress testing. These patches change the startup and stop timing in order
to be sure that expected structures are ready to be used by the watchdog.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 30 Sep 2020 17:48:28 +0000 (10:48 -0700)]
ionic: prevent early watchdog check
In one corner case scenario, the driver device lif setup can
get delayed such that the ionic_watchdog_cb() timer goes off
before the ionic->lif is set, thus causing a NULL pointer panic.
We catch the problem by checking for a NULL lif just a little
earlier in the callback.
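A minimal sketch of the early NULL check, in userspace C with hypothetical `demo_` types standing in for the driver structures. The point is only that the callback bails out before anything dereferences the not-yet-set pointer.

```c
#include <stddef.h>

struct demo_lif {
	unsigned int link_checks;
};

struct demo_dev {
	struct demo_lif *lif; /* may still be NULL while setup is delayed */
};

/* Returns 1 if the link check ran, 0 if the callback fired too early. */
int demo_watchdog_cb(struct demo_dev *dev)
{
	struct demo_lif *lif = dev->lif;

	if (!lif)
		return 0; /* timer went off before setup finished */

	lif->link_checks++;
	return 1;
}
```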
Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Wed, 30 Sep 2020 17:48:27 +0000 (10:48 -0700)]
ionic: stop watchdog timer earlier on remove
We need to be better at making sure we don't have a link check
watchdog go off while we're shutting things down, so let's stop
the timer as soon as we start the remove.
Meanwhile, since that was the only thing in
ionic_dev_teardown(), simplify and remove that function.
Signed-off-by: Shannon Nelson <snelson@pensando.io> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
tcp: exponential backoff in tcp_send_ack()
We had outages caused by repeated skb allocation failures in tcp_send_ack().
It is time to add exponential backoff to reduce the number of attempts.
Before doing so, the first patch removes icsk_ack.blocked to make
room for a new field (icsk_ack.retry).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 30 Sep 2020 12:54:57 +0000 (05:54 -0700)]
tcp: add exponential backoff in __tcp_send_ack()
Whenever the host is under very high memory pressure,
__tcp_send_ack() skb allocation fails, and we set up
a 200 ms (TCP_DELACK_MAX) timer before retrying.
On hosts with a high number of TCP sockets, we can spend
a considerable amount of CPU cycles in these attempts and
add high pressure on various spinlocks in the mm layer,
ultimately blocking threads attempting to free space
from making any progress.
This patch adds standard exponential backoff to avoid
adding fuel to the fire.
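The backoff can be sketched in userspace as a shift of the base delay by a retry counter. The 200 ms base comes from the text above; the 120 s cap, the struct, and the function names are assumptions made for this sketch only.

```c
#define DEMO_BASE_DELAY_MS 200u    /* base delay, as in the text above */
#define DEMO_MAX_DELAY_MS  120000u /* assumed cap for this sketch */

struct demo_ack_backoff {
	unsigned int retry; /* number of consecutive failed attempts */
};

/* Called when an allocation attempt fails: returns how long to wait
 * before the next attempt and doubles the delay for the one after. */
unsigned int demo_ack_retry_delay_ms(struct demo_ack_backoff *st)
{
	unsigned int delay = DEMO_BASE_DELAY_MS << st->retry;

	if (delay < DEMO_MAX_DELAY_MS)
		st->retry++;               /* keep doubling while below the cap */
	else
		delay = DEMO_MAX_DELAY_MS; /* clamp, stop growing the shift */

	return delay;
}

/* Called when an attempt finally succeeds. */
void demo_ack_backoff_reset(struct demo_ack_backoff *st)
{
	st->retry = 0;
}
```

Because the counter stops growing once the cap is reached, the shift can never overflow, and a single success returns the socket to the base delay.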
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 30 Sep 2020 12:54:56 +0000 (05:54 -0700)]
inet: remove icsk_ack.blocked
TCP has been using it to work around the possibility of tcp_delack_timer()
finding the socket owned by user.
After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
we added the TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
so we can get rid of icsk_ack.blocked.
This frees space that the following patch will reuse.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
struct macb_platform_data is only used by macb_pci to register the platform
device. Move its definition to cadence/macb.h and remove platform_data/macb.h.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 30 Sep 2020 21:06:54 +0000 (14:06 -0700)]
Merge branch 'mlxsw-PFC-and-headroom-selftests'
Petr Machata says:
====================
mlxsw: PFC and headroom selftests
Recent changes in the headroom management code made it clear that an
automated way of testing this functionality is needed. This patchset brings
two tests: a synthetic headroom behavior test, which verifies the mechanics of
headroom management, and a PFC test, which verifies whether this behavior
actually translates into a working lossless configuration.
Both of these tests rely on mlnx_qos[1], a tool that interfaces with Linux
DCB API. The tool was originally written to work with Mellanox NICs, but
does not actually rely on anything Mellanox-specific, and can be used for
mlxsw as well as for any other NIC-like driver. Unlike Open LLDP it does
support buffer commands and permits a fire-and-forget approach to
configuration, which makes it very handy for writing of selftests.
Patches #1-#3 extend the selftest devlink_lib.sh in various ways. Patch #4
then adds a helper wrapper for mlnx_qos to mlxsw's qos_lib.sh.
Patch #5 adds a test for management of port headroom.
Petr Machata [Wed, 30 Sep 2020 10:49:11 +0000 (12:49 +0200)]
selftests: mlxsw: Add headroom handling test
Add a test for headroom configuration. This covers projection of ETS
configuration to ingress, PFC, adjustments for MTU, the qdisc / TC
mode and the effect of egress SPAN session on buffer configuration.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 30 Sep 2020 10:49:10 +0000 (12:49 +0200)]
selftests: mlxsw: qos_lib: Add a wrapper for running mlnx_qos
mlnx_qos is a script for configuration of DCB. Despite the name it is not
actually Mellanox-specific in any way. It is currently the only ad-hoc tool
available (in contrast to a daemon that manages an interface on an ongoing
basis). However, it is very verbose and parsing out error messages is not
really possible. Add a wrapper that makes it easier to use the tool.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 30 Sep 2020 10:49:09 +0000 (12:49 +0200)]
selftests: forwarding: devlink_lib: Support port-less topologies
Some selftests may not need any actual ports. Technically those are not
forwarding selftests, but devlink_lib can still be handy. Fall back on
NETIF_NO_CABLE in those cases.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Wed, 30 Sep 2020 10:49:07 +0000 (12:49 +0200)]
selftests: forwarding: devlink_lib: Split devlink_..._set() into save & set
Changing pool type from static to dynamic causes reinterpretation of
threshold values. They therefore need to be saved before pool type is
changed, then the pool type can be changed, and then the new values need
to be set up.
For that reason, set cannot subsume save, because it would be saving the
wrong thing, with possibly a nonsensical value, and restore would then fail
to restore the nonsensical value.
Thus extract a _save() from each of the relevant _set()'s. This way it is
possible to save everything up front, then to tweak it, and then restore in
the required order.
Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
can: flexcan: initialize all flexcan memory for ECC function
One issue was reported in a bare-metal environment, which is used for
FPGA verification. "The first transfer will fail for extended ID
format (for both 2.0B and FD format), following frames can be transmitted
and received successfully for extended format, and standard format doesn't
have this issue. This issue occurred randomly with high possibility, when
it occurs, the transmitter will detect a BIT1 error, the receiver a CRC
error. According to the spec, a non-correctable error may cause this
transfer failure."
With the FLEXCAN_QUIRK_DISABLE_MECR quirk, the driver supports correctable
errors and disables the non-correctable errors interrupt and freeze mode. If a
platform has ECC hardware support but selects this quirk, this issue may not
come to light. Initialize all FlexCAN memory before accessing it; at the
least, this avoids non-correctable errors being detected due to uninitialized
memory.
The internal region can't be initialized when the hardware doesn't support
ECC.
According to IMX8MPRM, Rev. C, 04/2020, there is a NOTE in section
11.8.3.13 Detection and correction of memory errors:
"All FlexCAN memory must be initialized before starting its operation in
order to have the parity bits in memory properly updated. CTRL2[WRMFRZ]
grants write access to all memory positions that require initialization,
ranging from 0x080 to 0xADF and from 0xF28 to 0xFFF when the CAN FD feature
is enabled. The RXMGMASK, RX14MASK, RX15MASK, and RXFGMASK registers need to
be initialized as well. MCR[RFEN] must not be set during memory initialization."
The memory range from 0x080 to 0xADF contains reserved memory (unimplemented
by hardware, e.g. when only 64 MBs are configured); this memory can be
initialized or not. This patch initializes all flexcan memory, including the
reserved memory.
This patch also creates FLEXCAN_QUIRK_SUPPORT_ECC for platforms which have the
ECC feature. If you have an ECC platform at hand, please select this
quirk to initialize all flexcan memory first; then you can select
FLEXCAN_QUIRK_DISABLE_MECR to only enable correctable errors.
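The two address ranges from the quoted NOTE can be walked with a small table. The sketch below runs against a plain byte buffer rather than real registers; it assumes CAN FD is enabled (so both ranges apply), and it does not model CTRL2[WRMFRZ] or the mask registers. All `demo_` names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_REG_SPACE 0x1000u /* size of the simulated register block */

struct demo_mem_range {
	uint32_t first; /* first offset to clear, inclusive */
	uint32_t last;  /* last offset to clear, inclusive */
};

/* Ranges taken from the NOTE quoted above. */
static const struct demo_mem_range demo_ecc_ranges[] = {
	{ 0x080, 0xADF },
	{ 0xF28, 0xFFF },
};

/* Zero every byte in the listed ranges, including reserved holes. */
void demo_init_ram(uint8_t *base)
{
	size_t i;

	for (i = 0; i < sizeof(demo_ecc_ranges) / sizeof(demo_ecc_ranges[0]); i++) {
		const struct demo_mem_range *r = &demo_ecc_ranges[i];

		memset(base + r->first, 0, r->last - r->first + 1);
	}
}
```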
can: mcp251xfd: rename all remaining occurrence to mcp251xfd
In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.
In the previous patch the autodetect DT compatible has been renamed to
"microchip,mcp251xfd"; this patch changes all non-user-facing occurrences of
"mcp25xxfd" to "mcp251xfd" and "MCP25XXFD" to "MCP251XFD".
can: mcp251xfd: rename all user facing strings to mcp251xfd
In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.
In the previous patch the autodetect DT compatible has been renamed to
"microchip,mcp251xfd"; this patch changes all user-facing strings from
"mcp25xxfd" to "mcp251xfd" and "MCP25XXFD" to "MCP251XFD", including:
- kconfig symbols
- name of kernel module
- DT and SPI compatible
can: mcp251xfd: rename driver files and subdir to mcp251xfd
In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.
In the previous patch the autodetect DT compatible has been renamed to
"microchip,mcp251xfd"; this patch changes the name of the driver subdir and the
individual files accordingly.
Ensure that btf_dump can accommodate new BTF types being appended to a BTF
instance after struct btf_dump was created. This came up during an attempt to
use btf_dump for raw type dumping in selftests, but given the changes are not
excessive, it's good to not have any gotchas in API usage, so I decided to
support such a use case in general.
====================
This series adds two BPF helpers, that is, one for retrieving the classid
of an skb and another one to redirect via the neigh subsystem, and also
improves the cookie helpers by removing the atomic counter. I've also added
the bpf_tail_call_static() helper to the libbpf API that we've been using
in Cilium for a while now, and last but not least the series adds a few
selftests. For details, please check individual patches, thanks!
v3 -> v4:
- Removed out_rec error path (Martin)
- Integrate BPF_F_NEIGH flag into rejecting invalid flags (Martin)
- I think this way it's better to avoid bit overlaps given it's
right in the place that would need to be extended on new flags
v2 -> v3:
- Removed double skb->dev = dev assignment (David)
- Added headroom check for v6 path (David)
- Set flowi4_proto for ip_route_output_flow (David)
- Rebased onto latest bpf-next
v1 -> v2:
- Rework cookie generator to support nested contexts (Eric)
- Use ip_neigh_gw6() and container_of() (David)
- Rename __throw_build_bug() and improve comments (Andrii)
- Use bpf_tail_call_static() also in BPF samples (Maciej)
====================
Daniel Borkmann [Wed, 30 Sep 2020 15:18:18 +0000 (17:18 +0200)]
bpf, libbpf: Add bpf_tail_call_static helper for bpf programs
Port of tail_call_static() helper function from Cilium's BPF code base [0]
to libbpf, so others can easily consume it as well. We've been using this
in production code for some time now. The main idea is that we guarantee
that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
JITed BPF insns with direct jumps instead of having to fall back to using
expensive retpolines. By using inline asm, we guarantee that the compiler
won't merge the call from different paths with potentially different
content of r2/r3.
We're also using Cilium's __throw_build_bug() macro (here as: __bpf_unreachable())
in different places as a neat trick to trigger compilation errors when
compiler does not remove code at compilation time. This works for the BPF
back end as it does not implement the __builtin_trap().
Daniel Borkmann [Wed, 30 Sep 2020 15:18:17 +0000 (17:18 +0200)]
bpf: Add redirect_neigh helper as redirect drop-in
Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.
This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.
Given that skb->sk also does not get orphaned when crossing the netns
as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing
the skb.") the helper can then push the skbs directly to the phys
device where FQ scheduler can do its work and TCP stack gets proper
backpressure given we hold on to skb->sk as long as skb is still
residing in queues.
With the helper used in BPF data path to then push the skb to the
phys device, I observed a stable/consistent TCP_STREAM improvement
on veth devices for traffic going container -> host -> host ->
container from ~10Gbps to ~15Gbps for a single stream in my test
environment.
Daniel Borkmann [Wed, 30 Sep 2020 15:18:16 +0000 (17:18 +0200)]
bpf, net: Rework cookie generator as per-cpu one
With its use in BPF, the cookie generator can be called very frequently
in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
and attached to the root cgroup, for example, when used in v1/v2 mixed
environments. In particular, when there's a high churn on sockets in the
system there can be many parallel requests to the bpf_get_socket_cookie()
and bpf_get_netns_cookie() helpers which then cause contention on the
atomic counter.
As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
allocator"), add a small helper library that both can use for the 64 bit
counters. Given this can be called from different contexts, we also need
to deal with potential nested calls even though in practice they are
considered extremely rare. One idea as suggested by Eric Dumazet was
to use a reverse counter for this situation since we don't expect 64 bit
overflows anyways; that way, we can avoid bigger gaps in the 64 bit
counter space compared to just batch-wise increase. Even on machines
with small number of cores (e.g. 4) the cookie generation shrinks from
min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
in parallel from multiple CPUs.
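A userspace sketch of the scheme, under stated assumptions: per-CPU state is modelled as a struct passed in by the caller, C11 atomics stand in for the kernel's atomic64 operations, and the batch size of 4096 is chosen for illustration. The names are hypothetical. Each "CPU" takes a whole batch from the shared forward counter and hands out values locally; a nested call falls back to the shared reverse counter, which counts down from the top of the 64-bit space.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_COOKIE_BATCH 4096u /* assumed batch size for this sketch */

struct demo_cookie_gen {
	_Atomic uint64_t forward_last; /* end of the last batch handed out */
	_Atomic uint64_t reverse_last; /* counts down, used by nested callers */
};

struct demo_cookie_local {
	unsigned int nesting; /* > 1 means we interrupted ourselves */
	uint64_t last;        /* last cookie handed out from this batch */
};

void demo_cookie_gen_init(struct demo_cookie_gen *gc)
{
	atomic_init(&gc->forward_last, 0);
	atomic_init(&gc->reverse_last, 0);
}

uint64_t demo_cookie_next(struct demo_cookie_gen *gc,
			  struct demo_cookie_local *local)
{
	uint64_t val;

	if (++local->nesting == 1) {
		val = local->last;
		if ((val & (DEMO_COOKIE_BATCH - 1)) == 0) {
			/* Local batch used up: reserve the next one. */
			val = atomic_fetch_add(&gc->forward_last,
					       DEMO_COOKIE_BATCH);
		}
		local->last = ++val;
	} else {
		/* Nested context: take a value from the reverse counter. */
		val = atomic_fetch_sub(&gc->reverse_last, 1) - 1;
	}
	local->nesting--;
	return val;
}
```

The shared forward counter is touched once per batch instead of once per cookie, which is where the reduced contention comes from; the reverse counter keeps nested callers from leaving large gaps in the forward space.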
Daniel Borkmann [Wed, 30 Sep 2020 15:18:15 +0000 (17:18 +0200)]
bpf: Add classid helper only based on skb->sk
Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to 9c4c325252c5
("skbuff: preserve sock reference when scrubbing the skb."). This
is similar to bpf_skb_cgroup_id() which implements the same for v2.
Kubernetes ecosystem is still operating on v1 however, hence net_cls
needs to be used there until this can be dropped in with the v2
helper of bpf_skb_cgroup_id().
Thomas Kopp [Wed, 30 Sep 2020 09:14:22 +0000 (11:14 +0200)]
can: mcp25xxfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd"
The wildcard should be narrowed down to prevent existing and future devices
that are not compatible from matching. It is very unlikely that incompatible
devices will be released that do match the wildcard.
Thomas Kopp [Wed, 30 Sep 2020 09:14:23 +0000 (11:14 +0200)]
dt-binding: can: mcp251xfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd"
The wildcard should be narrowed down to prevent existing and future devices
that are not compatible from matching. It is very unlikely that incompatible
devices will be released that do match the wildcard.
Apply the following fixes:
- Use 'interrupts'. (interrupts-extended will automagically be supported
by the tools)
- *-supply is always a single item. So, drop maxItems=1
- add "additionalProperties: false" flag to detect unneeded properties.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/r/20200923125301.27200-1-o.rempel@pengutronix.de Reported-by: Rob Herring <robh@kernel.org> Reviewed-by: Rob Herring <robh@kernel.org> Fixes: 1b5a78e69c1f ("dt-binding: can: mcp25xxfd: document device tree bindings") Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Thomas Kopp [Fri, 25 Sep 2020 06:56:06 +0000 (08:56 +0200)]
can: mcp25xxfd: mcp25xxfd_handle_eccif(): add ECC related errata and update log messages
This patch adds a reference to the recently released MCP2517FD and MCP2518FD
errata sheets and pastes the explanation.
The single error correction does not always work, so always indicate that a
single error occurred. If the location of the ECC error is outside of the
TX-RAM always use netdev_notice() to log the problem. For ECC errors in the
TX-RAM, there is a recovery procedure.
====================
HW support for VCAP IS1 and ES0 in mscc_ocelot
The patches from RFC series "Offload tc-flower to mscc_ocelot switch
using VCAP chains" have been split into 2:
https://patchwork.ozlabs.org/project/netdev/list/?series=204810&state=*
This is the boring part, that deals with the prerequisites, and not with
tc-flower integration. Apart from the initialization of some hardware
blocks, which at this point still don't do anything, no new
functionality is introduced.
- Key and action field offsets are defined for the supported switches.
- VCAP properties are added to the driver for the new TCAM blocks. But
instead of adding them manually as was done for IS2, which is error
prone, the driver is refactored to read these parameters from
hardware, which is possible.
- Some improvements regarding the processing of struct ocelot_vcap_filter.
- Extending the code to be compatible with full and quarter keys.
This series was tested, along with other patches not yet submitted, on
the Felix and Seville switches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 29 Sep 2020 22:27:33 +0000 (01:27 +0300)]
net: mscc: ocelot: look up the filters in flower_stats() and flower_destroy()
Currently a new filter is created, containing just enough correct
information to be able to call ocelot_vcap_block_find_filter_by_index()
on it.
This will be limiting us in the future, when we'll have more metadata
associated with a filter, which will matter in the stats() and destroy()
callbacks, and which we can't make up on the spot. For example, we'll
start "offloading" some dummy tc filter entries for the TCAM skeleton,
but we won't actually be adding them to the hardware, or to block->rules.
So, it makes sense to avoid deleting those rules too. That's the kind of
thing which is difficult to determine unless we look up the real filter.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Tue, 29 Sep 2020 22:27:32 +0000 (01:27 +0300)]
net: mscc: ocelot: add a new ocelot_vcap_block_find_filter_by_id function
And rename the existing find to ocelot_vcap_block_find_filter_by_index.
The index is the position in the TCAM, and the id is the flow cookie
given by tc.
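The difference between the two lookups can be sketched as follows. This is a simplified userspace model with hypothetical `demo_` names, and it uses an array for brevity where the driver may well use another container: the index is the position of the rule in the block, while the id is the cookie stored in the rule itself.

```c
#include <stddef.h>

struct demo_filter {
	unsigned long id; /* flow cookie given by tc */
};

struct demo_block {
	struct demo_filter *rules; /* ordered by position in the TCAM */
	int count;
};

/* Lookup by position in the TCAM. */
struct demo_filter *demo_find_filter_by_index(struct demo_block *block,
					      int index)
{
	if (index < 0 || index >= block->count)
		return NULL;
	return &block->rules[index];
}

/* Lookup by the cookie that tc assigned to the flow. */
struct demo_filter *demo_find_filter_by_id(struct demo_block *block,
					   unsigned long id)
{
	int i;

	for (i = 0; i < block->count; i++)
		if (block->rules[i].id == id)
			return &block->rules[i];
	return NULL;
}
```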
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>