Huazhong Tan [Mon, 16 Jul 2018 15:36:22 +0000 (16:36 +0100)]
net: hns3: Correct reset event status register
According to hardware's description, driver should get reset event
from VECTOR0_PF_OTHER_INT_ST(0x20800) instead of
VECTOR0_PF_OTHER_INT_SRC(0x20700).
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 16 Jul 2018 15:36:21 +0000 (16:36 +0100)]
net: hns3: Prevent to request reset frequently
Netdevice reset should not be requested frequently, a new one
must wait a moment since there may be some work not completed.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 16 Jul 2018 15:36:20 +0000 (16:36 +0100)]
net: hns3: Reset net device with rtnl_lock
Since current locking was not covering certain code where
netdev was being accessed or manipulated, this patch fixes
it.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 16 Jul 2018 15:36:19 +0000 (16:36 +0100)]
net: hns3: Modify the order of initializing command queue register
According to hardware's description, the head pointer register should
be written before the tail pointer register while doing command queue
initialization.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following series provides TLS RX inline crypto offload.
v5->v4:
- Remove the Kconfig to mutually exclude both IPsec and TLS
v4->v3:
- Remove the iov revert for zero copy send flow
v2->v3:
- Fix typo
- Adjust cover letter
- Fix bug in zero copy flows
- Use network byte order for the record number in resync
- Adjust the sequence provided in resync
v1->v2:
- Fix bisectability problems due to variable name changes
- Fix potential uninitialized return value
This series completes the generic infrastructure to offload TLS crypto to
a network devices. It enables the kernel TLS socket to skip decryption and
authentication operations for SKBs marked as decrypted on the receive
side of the data path. Leaving those computationally expensive operations
to the NIC.
This infrastructure doesn't require a TCP offload engine. Instead, the
NIC decrypts a packet's payload if the packet contains the expected TCP
sequence number. The TLS record authentication tag remains unmodified
regardless of decryption. If the packet is decrypted successfully and it
contains an authentication tag, then the authentication check has passed.
Otherwise, if the authentication fails, then the packet is provided
unmodified and the KTLS layer is responsible for handling it.
Out-Of-Order TCP packets are provided unmodified. As a result,
in the slow path some of the SKBs are decrypted while others remain as
ciphertext.
The GRO and TCP layers must not coalesce decrypted and non-decrypted SKBs.
At the worst case a received TLS record consists of both plaintext
and ciphertext packets. These partially decrypted records must be
reencrypted, only to be decrypted.
The notable differences between SW KTLS and NIC offloaded TLS
implementations are as follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW whenever it lost track of the TLS record framing in
the TCP stream.
The infrastructure should be extendable to support various NIC offload
implementations. However it is currently written with the
implementation below in mind:
The NIC identifies packets that should be offloaded according to
the 5-tuple and the TCP sequence number. If these match and the
packet is decrypted and authenticated successfully, then a syndrome
is provided to software. Otherwise, the packet is unmodified.
Decrypted and non-decrypted packets aren't coalesced by the network stack,
and the KTLS layer decrypts and authenticates partially decrypted records.
The NIC provides an indication whenever a resync is required. The resync
operation is triggered by the KTLS layer while parsing TLS record headers.
Finally, we measure the performance obtained by running single stream
iperf with two Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz machines connected
back-to-back with Innova TLS (40Gbps) NICs. We compare TCP (upper bound)
and KTLS-Offload running both in Tx and Rx. The results show that the
performance of offload is comparable to TCP.
Boris Pismenny [Fri, 13 Jul 2018 11:33:48 +0000 (14:33 +0300)]
net/mlx5e: TLS, add Innova TLS rx data path
Implement the TLS rx offload data path according to the
requirements of the TLS generic NIC offload infrastructure.
Special metadata ethertype is used to pass information to
the hardware.
When hardware loses synchronization a special resync request
metadata message is used to request resync.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:47 +0000 (14:33 +0300)]
net/mlx5e: TLS, add innova rx support
Add the mlx5 implementation of the TLS Rx routines to add/del TLS
contexts, also add the tls_dev_resync_rx routine
to work with the TLS inline Rx crypto offload infrastructure.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:46 +0000 (14:33 +0300)]
net/mlx5: Accel, add TLS rx offload routines
In Innova TLS, TLS contexts are added or deleted
via a command message over the SBU connection.
The HW then sends a response message over the same connection.
Complete the implementation for Innova TLS (FPGA-based) hardware by
adding support for rx inline crypto offload.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:45 +0000 (14:33 +0300)]
net/mlx5e: TLS, refactor variable names
For symmetry, we rename mlx5e_tls_offload_context to
mlx5e_tls_offload_context_tx before we add mlx5e_tls_offload_context_rx.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Reviewed-by: Aviad Yehezkel <aviadye@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:44 +0000 (14:33 +0300)]
tls: Fix zerocopy_from_iter iov handling
zerocopy_from_iter iterates over the message, but it doesn't revert the
updates made by the iov iteration. This patch fixes it. Now, the iov can
be used after calling zerocopy_from_iter.
Fixes: 3c4d75591 ("tls: kernel TLS support") Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:43 +0000 (14:33 +0300)]
tls: Add rx inline crypto offload
This patch completes the generic infrastructure to offload TLS crypto to a
network device. It enables the kernel to skip decryption and
authentication of some skbs marked as decrypted by the NIC. In the fast
path, all packets received are decrypted by the NIC and the performance
is comparable to plain TCP.
This infrastructure doesn't require a TCP offload engine. Instead, the
NIC only decrypts packets that contain the expected TCP sequence number.
Out-Of-Order TCP packets are provided unmodified. As a result, at the
worst case a received TLS record consists of both plaintext and ciphertext
packets. These partially decrypted records must be reencrypted,
only to be decrypted.
The notable differences between SW KTLS Rx and this offload are as
follows:
1. Partial decryption - Software must handle the case of a TLS record
that was only partially decrypted by HW. This can happen due to packet
reordering.
2. Resynchronization - tls_read_size calls the device driver to
resynchronize HW after HW lost track of TLS record framing in
the TCP stream.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:41 +0000 (14:33 +0300)]
tls: Split tls_sw_release_resources_rx
This patch splits tls_sw_release_resources_rx into two functions one
which releases all inner software tls structures and another that also
frees the containing structure.
In TLS_DEVICE we will need to release the software structures without
freeeing the containing structure, which contains other information.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:40 +0000 (14:33 +0300)]
tls: Split decrypt_skb to two functions
Previously, decrypt_skb also updated the TLS context.
Now, decrypt_skb only decrypts the payload using the current context,
while decrypt_skb_update also updates the state.
Later, in the tls_device Rx flow, we will use decrypt_skb directly.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:38 +0000 (14:33 +0300)]
tcp: Don't coalesce decrypted and encrypted SKBs
Prevent coalescing of decrypted and encrypted SKBs in GRO
and TCP layer.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a netdev feature to configure TLS RX inline crypto offload.
Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Boris Pismenny [Fri, 13 Jul 2018 11:33:35 +0000 (14:33 +0300)]
net: Add decrypted field to skb
The decrypted bit is propogated to cloned/copied skbs.
This will be used later by the inline crypto receive side offload
of tls.
Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The PPv2 Header Parser and Classifier are not straightforward to debug,
having easy access to some of the many lookup tables configuration is
helpful during development and debug.
This series adds a basic debugfs interface, allowing to read data from
the Header Parser and some of the Classifier tables.
For now, the interface is read-only, and contains only some basic info.
This was actually used during RSS development, and might be useful to
troubleshoot some issues we might find.
The first patch of the series converts the mvpp2 files to SPDX, which
eases adding the new debugfs dedicated file.
The second patch adds the interface, and exposes basic Header Parser data.
The 3rd patch adds a hit counter for the Header Parser TCAM.
The 4th patch exposes classifier info.
The 5th patch adds some hit counters for some of the classifier engines.
Changes since V1:
- Rebased on the lastest net-next
- Made cls_flow_get non static so that it can be used in mvpp2_debugfs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The classification operations that are used for RSS make use of several
lookup tables. Having hit counters for these tables is really helpful
to determine what flows were matched by ingress traffic, and see the
path of packets among all the classifier tables.
This commit adds hit counters for the 3 tables used at the moment :
- The decoding table (also called lookup_id table), that links flows
identified by the Header Parser to the flow table.
There's one entry per flow, located at :
.../mvpp2/<controller>/flows/XX/dec_hits
Note that there are 21 flows in the decoding table, whereas there are
52 flows in the Header Parser. That's because there are several kind
of traffic that will match a given flow. Reading the hit counter from
one sub-flow will clear all hit counter that have the same flow_id.
This also applies to the flow_hits.
- The flow table, that contains all the different lookups to be
performed by the classifier for each packet of a given flow. The match
is done on the first entry of the flow sequence.
- The C2 engine entries, that are used to assign the default rx queue,
and enable or disable RSS for a given port.
There's one entry per flow, located at:
.../mvpp2/<controller>/flows/XX/flow_hits
There is one C2 entry per port, so the c2 hit counter is located at :
.../mvpp2/<controller>/ethX/c2_hits
All hit counter values are 16-bits clear-on-read values.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: debugfs: add entries for classifier flows
The classifier configuration for RSS is quite complex, with several
lookup tables being used. This commit adds useful info in debugfs to
see how the different tables are configured :
Added 2 new entries in the per-port directory :
- .../eth0/default_rxq : The default rx queue on that port
- .../eth0/rss_enable : Indicates if RSS is enabled in the C2 entry
Added the 'flows' directory :
It contains one entry per sub-flow. a 'sub-flow' is a unique path from
Header Parser to the flow table. Multiple sub-flows can point to the
same 'flow' (each flow has an id from 8 to 29, which is its index in the
Lookup Id table) :
- .../flows/00/...
/01/...
...
/51/id : The flow id. There are 21 unique flows. There's one
flow per combination of the following parameters :
- L4 protocol (TCP, UDP, none)
- L3 protocol (IPv4, IPv6)
- L3 parameters (Fragmented or not)
- L2 parameters (Vlan tag presence or not)
.../type : The flow type. This is an even higher level flow,
that we manipulate with ethtool. It can be :
"udp4" "tcp4" "udp6" "tcp6" "ipv4" "ipv6" "other".
.../eth0/...
.../eth1/engine : The hash generation engine used for this
flow on the given port
.../hash_opts : The hash generation options indicating on
what data we base the hash (vlan tag, src
IP, src port, etc.)
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: debugfs: add hit counter stats for Header Parser entries
One helpful feature to help debug the Header Parser TCAM filter in PPv2
is to be able to see if the entries did match something when a packet
comes in. This can be done by using the built-in hit counter for TCAM
entries.
This commit implements reading the counter, and exposing its value on
debugfs for each filter entry.
The counter is a 16-bits clear-on-read value, located at:
.../mvpp2/<controller>/parser/XXX/hits
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: add a debugfs interface for the Header Parser
Marvell PPv2 Packer Header Parser has a TCAM based filter, that is not
trivial to configure and debug. Being able to dump TCAM entries from
userspace can be really helpful to help development of new features
and debug existing ones.
This commit adds a basic debugfs interface for the PPv2 driver, focusing
on TCAM related features.
Antoine Tenart [Sat, 14 Jul 2018 11:29:24 +0000 (13:29 +0200)]
net: mvpp2: switch to SPDX identifiers
Use the appropriate SPDX license identifiers and drop the license text.
This patch is only cosmetic.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Various different arm32 JIT improvements in order to optimize code emission
and make the JIT code itself more robust, from Russell.
2) Support simultaneous driver and offloaded XDP in order to allow for advanced
use-cases where some work is offloaded to the NIC and some to the host. Also
add ability for bpftool to load programs and maps beyond just the cgroup case,
from Jakub.
3) Add BPF JIT support in nfp for multiplication as well as division. For the
latter in particular, it uses the reciprocal algorithm to emulate it, from Jiong.
4) Add BTF pretty print functionality to bpftool in plain and JSON output
format, from Okash.
5) Add build and installation to the BPF helper man page into bpftool, from Quentin.
6) Add a TCP BPF callback for listening sockets which is triggered right after
the socket transitions to TCP_LISTEN state, from Andrey.
7) Add a new cgroup tree command to bpftool which iterates over the whole cgroup
tree and prints all attached programs, from Roman.
8) Improve xdp_redirect_cpu sample to support parsing of double VLAN tagged
packets, from Jesper.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB
Cover new TCP-BPF callback in test_tcpbpf: when listen() is called on
socket, set BPF_SOCK_OPS_STATE_CB_FLAG so that BPF_SOCK_OPS_STATE_CB
callback can be called on future state transition, and when such a
transition happens (TCP_LISTEN -> TCP_CLOSE), track it in the map and
verify it in user space later.
Reduce amount of copy/paste for debug info when result is verified in
the test and keep that info together with values being checked so that
they won't get out of sync.
It also improves debug experience: instead of checking manually what
doesn't match in debug output for all fields, only unexpected field is
printed.
Add new TCP-BPF callback that is called on listen(2) right after socket
transition to TCP_LISTEN state.
It fills the gap for listening sockets in TCP-BPF. For example BPF
program can set BPF_SOCK_OPS_STATE_CB_FLAG when socket becomes listening
and track later transition from TCP_LISTEN to TCP_CLOSE with
BPF_SOCK_OPS_STATE_CB callback.
Before there was no way to do it with TCP-BPF and other options were
much harder to work with. E.g. socket state tracking can be done with
tracepoints (either raw or regular) but they can't be attached to cgroup
and their lifetime has to be managed separately.
David S. Miller [Sat, 14 Jul 2018 18:23:26 +0000 (11:23 -0700)]
Merge branch 'mlxsw-VRRP'
Ido Schimmel says:
====================
mlxsw: Add VRRP support
When a router that is acting as the default gateway of a host stops
functioning, the host will encounter packet loss until the router starts
functioning again.
To increase the reliability of the default gateway without performing
reconfiguration on the host, a host can use a Virtual Router Redundancy
Protocol (VRRP) Router. This virtual router is composed from several
routers where only one is actually forwarding packets from the host (the
master router) while the other routers act as backup routers. The
election of the master router is determined by the VRRP protocol [1].
Packets addressed to the virtual router are always sent to the virtual
router MAC address (IPv4: 00-00-5E-00-01-XX, IPv6: 00-00-5E-00-02-XX).
Such packets can only be accepted by the master router and must be
discarded by the backup routers.
In Linux, VRRP is usually implemented by configuring a macvlan with the
virtual router MAC on top of the router interface that is connected to
the host / LAN. The macvlan on the master router is assigned the virtual
IP (VIP) that the host uses as its gateway.
In order to support VRRP in mlxsw, we first need to enable macvlan upper
devices on top of mlxsw netdevs and their uppers. This is done by the
first patch, which also takes care of sanitizing macvlan configurations
that are not currently supported by the driver.
The second patch directs packets with destination MAC addresses as the
macvlans to the router so that they will undergo an L3 lookup. This is
consistent with the kernel's behavior where the macvlan's Rx handler
will re-inject such packets to the Rx path so that they will be picked
up by the IPvX protocol handlers and undergo an L3 lookup. Note that the
driver prevents the macvlans from being enslaved to other devices, to
ensure the packets will be picked up by the protocol handler and not by
another Rx handler.
The third patch adds packet traps for VRRP control packets for both IPv4
and IPv6. Finally, the last patch optimizes the reception of VRRP MACs
by potentially skipping one L2 lookup for them.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Where VRID is the ID of the virtual router. Such packets are directed to
the router block in the ASIC by an FDB entry that was added in the
previous patch.
However, in certain cases it is possible to skip this FDB lookup and
send such packets directly to the router. This is accomplished by adding
these special MAC addresses to the RIF cache. If the cache is hit, the
packet will skip the L2 lookup and ingress the router with the RIF
specified in the cache entry.
mlxsw: spectrum_router: Direct macvlans' MACs to router
An IP packet received on a netdev with a macvlan upper whose MAC matches
the packet's destination MAC will be re-injected to the Rx path as if it
was received by the macvlan, and perform an L3 lookup.
Reflect this functionality to the ASIC by programming FDB entries that
will direct MACs of macvlan uppers to the router.
In a similar fashion to router interfaces (RIFs) that are programmed
upon the addition of the first IP address on an interface and destroyed
upon the removal of the last IP address, the FDB entries for the macvlan
are added and destroyed based on the addition of the first and removal
of the last IP address on the macvlan.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In order to allow more unicast MAC addresses (e.g., VRRP virtual MAC) to
be directed to the router we need to enable macvlan uppers on top of
mlxsw netdevs.
Allow macvlan upper devices on top of mlxsw netdevs and sanitize
configurations that can't work. For example, a macvlan can't be enslaved
to a bridge as without ACLs the device doesn't take the destination MAC
into account when classifying a packet to a bridge instance (i.e., a
FID).
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_rcv_nxt_update() is already executed in tcp_data_queue().
This line is redundant.
See bellow,
tcp_queue_rcv
tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq); <<<< redundant
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch augments the output of bpftool's map dump and map lookup
commands to print data along side btf info, if the correspondin btf
info is available. The outputs for each of map dump and map lookup
commands are augmented in two ways:
1. when neither of -j and -p are supplied, btf-ful map data is printed
whose aim is human readability. This means no commitments for json- or
backward- compatibility.
2. when either -j or -p are supplied, a new json object named
"formatted" is added for each key-value pair. This object contains the
same data as the key-value pair, but with btf info. "formatted" object
promises json- and backward- compatibility. Below is a sample output.
This patch calls btf_dumper introduced in previous patch to accomplish
the above. Indeed, btf-ful info is only displayed if btf data for the
given map is available. Otherwise existing output is displayed as-is.
Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This consumes functionality exported in the previous patch. It does the
main job of printing with BTF data. This is used in the following patch
to provide a more readable output of a map's dump. It relies on
json_writer to do json printing. Below is sample output where map keys
are ints and values are of type struct A:
typedef int int_type;
enum E {
E0,
E1,
};
struct B {
int x;
int y;
};
struct A {
int m;
unsigned long long n;
char o;
int p[8];
int q[4][8];
enum E r;
void *s;
struct B t;
const int u;
int_type v;
unsigned int w1: 3;
unsigned int w2: 3;
};
This patch uses json's {} and [] to imply struct/union and array. More
explicit information can be added later. For example, a command line
option can be introduced to print whether a key or value is struct
or union, name of a struct etc. This will however come at the expense
of duplicating info when, for example, printing an array of structs.
enums are printed as ints without their names.
Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
bpf: btf: export btf types and name by offset from lib
This patch introduces btf__resolve_type() function and exports two
existing functions from libbpf. btf__resolve_type follows modifier
types like const and typedef until it hits a type which actually takes
up memory, and then returns it. This function follows similar pattern
to btf__resolve_size but instead of computing size, it just returns
the type.
These functions will be used in the followig patch which parses
information inside array of `struct btf_type *`. btf_name_by_offset is
used for printing variable names.
Signed-off-by: Okash Khawaja <osk@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Sat, 14 Jul 2018 02:08:59 +0000 (19:08 -0700)]
tools: include reallocarray feature test in FEATURE_TESTS_BASIC
perf propagates its feature check results to libbpf. This means
features for which perf probes must be a superset of libbpf's
required features. perf depends on FEATURE_TESTS_BASIC for its list
of features.
commit 531b014e7a2f ("tools: bpf: make use of reallocarray") added
reallocarray use to libbpf, make perf also perform the reallocarray
feature check.
Fixes: 531b014e7a2f ("tools: bpf: make use of reallocarray") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Fixes: f9358e12a0af ("net: mvpp2: split ingress traffic into multiple flows") Signed-off-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is very helpful for connecting random ethernet ports
to e.g. DSA switches that typically reside on fixed links.
The phy-mode is still there as the fixes link in this case
is still an RGMII link.
Tested on the Cortina Gemini driver with the Vitesse DSA
router chip on a fixed 1Gbit link.
Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: refactor flower walk to iterate over idr
Extend struct tcf_walker with additional 'cookie' field. It is intended to
be used by classifier walk implementations to continue iteration directly
from particular filter, instead of iterating 'skip' number of times.
Change flower walk implementation to save filter handle in 'cookie'. Each
time flower walk is called, it looks up filter with saved handle directly
with idr, instead of iterating over filter linked list 'skip' number of
times. This change improves complexity of dumping flower classifier from
quadratic to linearithmic. (assuming idr lookup has logarithmic complexity)
Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reported-by: Simon Horman <simon.horman@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
samples/bpf: xdp_redirect_cpu handle parsing of double VLAN tagged packets
People noticed that the code match on IEEE 802.1ad (ETH_P_8021AD) ethertype,
and this implies Q-in-Q or double tagged VLANs. Thus, we better parse
the next VLAN header too. It is even marked as a TODO.
This is relevant for real world use-cases, as XDP cpumap redirect can be
used when the NIC RSS hashing is broken. E.g. the ixgbe driver HW cannot
handle double tagged VLAN packets, and places everything into a single
RX queue. Using cpumap redirect, users can redistribute traffic across
CPUs to solve this, which is faster than the network stacks RPS solution.
It is left as an exerise how to distribute the packets across CPUs. It
would be convenient to use the RX hash, but that is not _yet_ exposed
to XDP programs. For now, users can code their own hash, as I've demonstrated
in the Suricata code (where Q-in-Q is handled correctly).
Reported-by: Florian Maury <florian.maury-cv@x-cli.eu> Reported-by: Marek Majkowski <marek@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
net: ipmr: add support for passing full packet on wrong vif
This patch adds support for IGMPMSG_WRVIFWHOLE which is used to pass
full packet and real vif id when the incoming interface is wrong.
While the RP and FHR are setting up state we need to be sending the
registers encapsulated with all the data inside otherwise we lose it.
The RP then decapsulates it and forwards it to the interested parties.
Currently with WRONGVIF we can only be sending empty register packets
and will lose that data.
This behaviour can be enabled by using MRT_PIM with
val == IGMPMSG_WRVIFWHOLE. This doesn't prevent IGMPMSG_WRONGVIF from
happening, it happens in addition to it, also it is controlled by the same
throttling parameters as WRONGVIF (i.e. 1 packet per 3 seconds currently).
Both messages are generated to keep backwards compatibily and avoid
breaking someone who was enabling MRT_PIM with val == 4, since any
positive val is accepted and treated the same.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 13 Jul 2018 18:26:36 +0000 (20:26 +0200)]
Merge branch 'bpf-xdp-driver-and-hw'
Jakub Kicinski says:
====================
This set is adding support for loading driver and offload XDP
at the same time. This enables advanced use cases where some
of the work is offloaded to the NIC and some is done by the host.
Separate netlink attributes are added for each mode of operation.
Driver callbacks for offload are cleaned up a little, including
removal of .prog_attached flag.
====================
Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 12 Jul 2018 03:36:44 +0000 (20:36 -0700)]
nfp: add support for simultaneous driver and hw XDP
Split handling of offloaded and driver programs completely. Since
offloaded programs always come with XDP_FLAGS_HW_MODE set in reality
there could be no sharing, anyway, programs would only be installed
in driver or in hardware. Splitting the handling allows us to install
programs in HW and in driver at the same time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 12 Jul 2018 03:36:41 +0000 (20:36 -0700)]
xdp: support simultaneous driver and hw XDP attachment
Split the query of HW-attached program from the software one.
Introduce new .ndo_bpf command to query HW-attached program.
This will allow drivers to install different programs in HW
and SW at the same time. Netlink can now also carry multiple
programs on dump (in which case mode will be set to
XDP_ATTACHED_MULTI and user has to check per-attachment point
attributes, IFLA_XDP_PROG_ID will not be present). We reuse
IFLA_XDP_PROG_ID skb space for second mode, so rtnl_xdp_size()
doesn't need to be updated.
Note that the installation side is still not there, since all
drivers currently reject installing more than one program at
the time.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 12 Jul 2018 03:36:40 +0000 (20:36 -0700)]
xdp: factor out common program/flags handling from drivers
Basic operations drivers perform during xdp setup and query can
be moved to helpers in the core. Encapsulate program and flags
into a structure and add helpers. Note that the structure is
intended as the "main" program information source in the driver.
Most drivers will additionally place the program pointer in their
fast path or ring structures.
The helpers don't have a huge impact now, but they will
decrease the code duplication when programs can be installed
in HW and driver at the same time. Encapsulating the basic
operations in helpers will hopefully also reduce the number
of changes to drivers which adopt them.
Helpers could really be static inline, but they depend on
definition of struct netdev_bpf which means they'd have
to be placed in netdevice.h, an already 4500 line header.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 12 Jul 2018 03:36:39 +0000 (20:36 -0700)]
xdp: don't make drivers report attachment mode
prog_attached of struct netdev_bpf should have been superseded
by simply setting prog_id long time ago, but we kept it around
to allow offloading drivers to communicate attachment mode (drv
vs hw). Subsequently drivers were also allowed to report back
attachment flags (prog_flags), and since nowadays only programs
attached will XDP_FLAGS_HW_MODE can get offloaded, we can tell
the attachment mode from the flags driver reports. Remove
prog_attached member.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Jakub Kicinski [Thu, 12 Jul 2018 03:36:38 +0000 (20:36 -0700)]
xdp: add per mode attributes for attached programs
In preparation for support of simultaneous driver and hardware XDP
support add per-mode attributes. The catch-all IFLA_XDP_PROG_ID
will still be reported, but user space can now also access the
program ID in a new IFLA_XDP_<mode>_PROG_ID attribute.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
net: gemini: Indicate that we can handle jumboframes
The hardware supposedly handles frames up to 10236 bytes and
implements .ndo_change_mtu() so accept 10236 minus the ethernet
header for a VLAN tagged frame on the netdevices. Use
ETH_MIN_MTU as minimum MTU.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
The initialization sequence for the ethernet, setting up
interrupt routing and such things, need to be done after
both the ports are clocked and reset. Before this the
config will not "take". Move the initialization to the
port probe function and keep track of init status in
the state.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The code was not tested with two ports actually in use at
the same time. (I blame this on lack of actual hardware using
that feature.) Now after locating a system using both ports,
add necessary fix to make both ports come up.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Switch over to using a module parameter and debug prints
that can be controlled by this or ethtool like everyone
else. Depromote all other prints to debug messages.
The phy_print_status() was already in place, albeit never
really used because the debuglevel hiding it had to be
set up using ethtool.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
The code to calculate the hardware register enumerator
for the maximum L3 length isn't entirely simple to read.
Use the existing defines and rewrite the function into a
table look-up.
Acked-by: Michał Mirosław <mirq-linux@rere.qmqm.pl> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
devlink: Add support for region access
This is a proposal which will allow access to driver defined address
regions using devlink. Each device can create its supported address
regions and register them. A device which exposes a region will allow
access to it using devlink.
The suggested implementation will allow exposing regions to the user,
reading and dumping snapshots taken from different regions.
A snapshot represents a memory image of a region taken by the driver.
If a device collects a snapshot of an address region it can be later
exposed using devlink region read or dump commands.
This functionality allows for future analyses on the snapshots to be
done.
The major benefit of this support is not only to provide access to
internal address regions which were inaccessible to the user but also
to provide an additional way to debug complex error states using the
region snapshots.
Implemented commands:
$ devlink region help
$ devlink region show [ DEV/REGION ]
$ devlink region del DEV/REGION snapshot SNAPSHOT_ID
$ devlink region dump DEV/REGION [ snapshot SNAPSHOT_ID ]
$ devlink region read DEV/REGION [ snapshot SNAPSHOT_ID ]
address ADDRESS length length
Show all of the exposed regions with region sizes:
$ devlink region show
pci/0000:00:05.0/cr-space: size 1048576 snapshot [1 2]
pci/0000:00:05.0/fw-health: size 64 snapshot [1 2]
Delete a snapshot using:
$ devlink region del pci/0000:00:05.0/cr-space snapshot 1
Dump a snapshot:
$ devlink region dump pci/0000:00:05.0/fw-health snapshot 1 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8 0000000000000020 0016 0bb8 0016 1720 0000 0000 c00f 3ffc 0000000000000030 bada cce5 bada cce5 bada cce5 bada cce5
Read a specific part of a snapshot:
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0
length 16 0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
For more information you can check devlink-region.8 man page
Future:
There is a plan to extend the support to include a write command
as well as performing read and dump live region
v1->v2:
-Add a parameter to enable devlink region snapshot
-Allocate snapshot memory using kvmalloc
-Introduce destructor function devlink_snapshot_data_dest_t to avoid
double allocation
v2->v3:
-Fix incorrect comment in devlink.h for DEVLINK_ATTR_REGION_SIZE
from u32 to u64
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:18 +0000 (15:13 +0300)]
net/mlx4_core: Use devlink region_snapshot parameter
This parameter enables capturing region snapshot of the crspace
during critical errors. The default value of this parameter is
disabled, it can be enabled using devlink param commands.
It is possible to configure during runtime and also driver init.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:17 +0000 (15:13 +0300)]
devlink: Add generic parameters region_snapshot
region_snapshot - When set enables capturing region snapshots
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:16 +0000 (15:13 +0300)]
net/mlx4_core: Add Crdump FW snapshot support
Crdump allows the driver to create a snapshot of the FW PCI
crspace and health buffer during a critical FW issue.
In case of a FW command timeout, FW getting stuck or a non zero
value on the catastrophic buffer, a snapshot will be taken.
The snapshot is exposed using devlink, cr-space, fw-health
address regions are registered on init and snapshots are attached
once a new snapshot is collected by the driver.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:15 +0000 (15:13 +0300)]
net/mlx4_core: Add health buffer address capability
Health buffer address is a 32 bit PCI address offset provided by
the FW. This offset is used for reading FW health debug data
located on the shared CR space. Cr space is accessible in both
driver and FW and allows for different queries and configurations.
Health buffer size is always 64B of readable data followed by a
lock which is used to block volatile CR space access.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:14 +0000 (15:13 +0300)]
devlink: Add support for region snapshot read command
Add support for DEVLINK_CMD_REGION_READ_GET used for both reading
and dumping region data. Read allows reading from a region specific
address for given length. Dump allows reading the full region.
If only snapshot ID is provided a snapshot dump will be done.
If snapshot ID, Address and Length are provided a snapshot read
will done.
This is used for both snapshot access and will be used in the same
way to access current data on the region.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:13 +0000 (15:13 +0300)]
devlink: Add support for region snapshot delete command
Add support for DEVLINK_CMD_REGION_DEL used
for deleting a snapshot from a region. The snapshot ID is required.
Also added notification support for NEW and DEL of snapshots.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:12 +0000 (15:13 +0300)]
devlink: Extend the support querying for region snapshot IDs
Extend the support for DEVLINK_CMD_REGION_GET command to also
return the IDs of the snapshot currently present on the region.
Each reply will include a nested snapshots attribute that
can contain multiple snapshot attributes each with an ID.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:11 +0000 (15:13 +0300)]
devlink: Add support for region get command
Add support for DEVLINK_CMD_REGION_GET command which is used for
querying for the supported DEV/REGION values of devlink devices.
The support is both for doit and dumpit.
Alex Vesker [Thu, 12 Jul 2018 12:13:10 +0000 (15:13 +0300)]
devlink: Add support for creating region snapshots
Each device address region can store multiple snapshots,
each snapshot is identified using a different numerical ID.
This ID is used when deleting a snapshot or showing an address
region specific snapshot. This patch exposes a callback to add
a new snapshot to an address region.
The snapshot will be deleted using the destructor function
when destroying a region or when a snapshot delete command
from devlink user tool.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:09 +0000 (15:13 +0300)]
devlink: Add callback to query for snapshot id before snapshot create
To restrict the driver with the snapshot ID selection a new callback
is introduced for the driver to get the snapshot ID before creating
a new snapshot. This will also allow giving the same ID for multiple
snapshots taken of different regions on the same time.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Vesker [Thu, 12 Jul 2018 12:13:08 +0000 (15:13 +0300)]
devlink: Add support for creating and destroying regions
This allows a device to register its supported address regions.
Each address region can be accessed directly for example reading
the snapshots taken of this address space.
Drivers are not limited in the name selection for different regions.
An example of a region-name can be: pci cr-space, register-space.
Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 13 Jul 2018 00:30:49 +0000 (17:30 -0700)]
Merge branch 'mvpp2-add-RSS-support'
Maxime Chevallier says:
====================
net: mvpp2: add RSS support
This series adds support for RSS on PPv2. There already was some code to
handle the RSS tables, but the driver was missing all the classification
steps required to actually use these tables.
RSS is used through the classifier, using at least 2 lookups :
- One using the C2 engine, a TCAM engine that match the packet based on
some header extracted fields, assigns the default rx queue for that
packet and tag it for RSS
- One using the C3Hx engine, which computes the hash that's used to perform
the lookup in the RSS table.
Since RSS spreads the load across CPUs, we need to make sure that packets
from the same flow are always assigned the same rx queue, to prevent
re-ordering.
This series therefore adds a classification step based on the Header Parser,
that separate ingress traffic into 52 flows, based on some L2, L3 and L4
parameters.
Patches 1 and 2 fix some header issues, from the driver splitting
Patches 3 to 7 make sure the correct receive queue setup is used for RSS
Patches 8 to 14 deal with the way we handle the RSS tables
Patch 15 implement basic classifier configuration, by using it to assign the
default receive queue
Patch 16 implement the ingress traffic splitting into multiple flows
Patch 17 adds RSS support, by using the needed classification steps
Patch 18 adds the required ethtool ops to configure the flow hash parameters
This was tested on MacchiatoBin, giving some nice performance improvements
using ip forwarding (going from 5Gbps to 9.6Gbps total throughput).
RSS is disabled by default.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: allow setting RSS flow hash parameters with ethtool
This commit allows setting the RSS hash generation parameters from
ethtool. When setting parameters for a given flow type from ethtool
(e.g. tcp4), all the corresponding flows in the flow table are updated,
according to the supported hash parameters.
For example, when configuring TCP over IPv4 hash parameters to be
src/dst IP + src/dst port ("ethtool -N eth0 rx-flow-hash tcp4 sdfn"),
we only set the "src/dst port" hash parameters on the non-fragmented TCP
over IPv4 flows.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: add an RSS classification step for each flow
One of the classification action that can be performed is to compute a
hash of the packet header based on some header fields, and lookup a RSS
table based on this hash to determine the final RxQ.
This is done by adding one lookup entry per flow per port, so that we
can configure the hash generation parameters for each flow and each
port.
There are 2 possible engines that can be used for RSS hash generation :
- C3HA, that generates a hash based on up to 4 header-extracted fields
- C3HB, that does the same as c3HA, but also includes L4 info in the hash
There are a lot of fields that can be extracted from the header. For now,
we only use the ones that we can configure using ethtool :
- DST MAC address
- L3 info
- Source IP
- Destination IP
- Source port
- Destination port
The C3HB engine is selected when we use L4 fields (src/dst port).
net: mvpp2: split ingress traffic into multiple flows
The PPv2 classifier allows to perform classification operations on each
ingress packet, based on the flow the packet is assigned to.
The current code uses only 1 flow per port, and the only classification
action consists of assigning the rx queue to the packet, depending on the
port.
In preparation for adding RSS support, we have to split all incoming
traffic into different flows. Since RSS assigns a rx queue depending on
the hash of some header fields, we have to make sure that the hash is
generated in a consistent way for all packets in the same flow.
What we call a "flow" is actually a set of attributes attached to a
packet that depends on various L2/L3/L4 info.
This patch introduces 52 flows, wich are a combination of various L2, L3
and L4 attributes :
- Whether or not the packet has a VLAN tag
- Whether the packet is IPv4, IPv6 or something else
- Whether the packet is TCP, UDP or something else
- Whether or not the packet is fragmented at L3 level.
The flow is associated to a packet by the Header Parser. Each flow
corresponds to an entry in the decoding table. This entry then points to
the sequence of classification lookups to be performed by the
classifier, represented in the flow table.
For now, the only lookup we perform is a C2 lookup to set the default
rx queue.
net: mvpp2: use classifier to assign default rx queue
The PPv2 Controller has a classifier, that can perform multiple lookup
operations for each packet, using different engines.
One of these engines is the C2 engine, which performs TCAM based lookups
on data extracted from the packet header. When a packet matches an
entry, the engine sets various attributes, used to perform
classification operations.
One of these attributes is the rx queue in which the packet should be sent.
The current code uses the lookup_id table (also called decoding table)
to assign the rx queue. However, this only works if we use one entry per
port in the decoding table, which won't be the case once we add RSS
lookups.
This patch uses the C2 engine to assign the rx queue to each packet.
The C2 engine is used through the flow table, which dictates what
classification operations are done for a given flow.
Right now, we have one flow per port, which contains every ingress
packet for this port.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mvpp22_init_rss function configures the RSS parameters for each port, so
rename it accordingly. Since this function relies on classifier
configuration, move its call right after the classifier config.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: make sure we don't spread load on disabled CPUs
When filling the RSS table, we have to make sure that the rx queue is
attached to an online CPU.
This patch is not a full support for cpu_hotplug, but rather a way to
make sure that we don't break network on system booted with the maxcpus
parameter.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Thu, 12 Jul 2018 11:54:21 +0000 (13:54 +0200)]
net: mvpp2: improve the distribution of packets on CPUs when using RSS
This patch adds an extra indirection when setting the indirection table
into the RSS hardware table to improve the packets distribution across
CPUs. For example, if 2 queues are used on a multi-core system this new
indirection will choose two queues on two different CPUs instead of the
two first queues which are on the same first CPU.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Thu, 12 Jul 2018 11:54:20 +0000 (13:54 +0200)]
net: mvpp2: RSS indirection table support
This patch adds the RSS indirection table support, allowing to use the
ethtool -x and -X options to dump and set this table.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
[Maxime: Small warning fixes, use one table per port] Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
PPv2 Controller has 8 RSS Tables, of 32 entries each. A lookup in the
RXQ2RSS_TABLE is performed for each incoming packet, and the RSS Table
to be used is chosen according to the default rx queue that would be
used for the packet.
This default rx queue is set in the Lookup_id Table (also called
Decoding Table), and is equal to the port->first_rxq.
Since the Classifier itself isn't active at any time for the moment,
this doesn't have a direct effect, the default rx queue at the moment is
the one where all packets end-up into.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
There is no RSS_TABLE register in PPv2 Controller. The register 0x1510
which was specified is actually named "RSS_HASH_SEL", but isn't used by
this driver at all.
Based on how this register was used, it should have been the
RXQ2RSS_TABLE register, which allows to select the RSS table that will
be used for the incoming packet.
The RSS_TABLE_POINTER is actually a field of this RXQ2RSS_TABLE
register.
Since RSS tables are actually not used by the driver for now, this
commit does not fix a runtime bug.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Tenart [Thu, 12 Jul 2018 11:54:17 +0000 (13:54 +0200)]
net: mvpp2: fix a typo in the RSS code
Cosmetic patch fixing a typo in one of the RSS comments.
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: use only one rx queue per port per CPU
The number of receive queue per port is :
- MVPP2_DEFAULT_RXQ if in single queue mode
- MVPP2_DEFAULT_RXQ * num_possible_cpus if in multi queue mode
with MVPP2_DEFAULT_RXQ = 4.
However, we don't use the extra rx queues at the moment, we really only
need one per port per CPU, until some more advanced classification rules
are implemented.
Suggested-by: Stefan Chulski <stefanc@marvell.com> Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yan Markman [Thu, 12 Jul 2018 11:54:14 +0000 (13:54 +0200)]
net: mvpp2: use RSS only when using multi-queue mode
Since RSS only applies when we have per-cpu rx queues, it should only
be enabled when the driver is configured to make use of multi-queue
mode.
Signed-off-by: Yan Markman <ymarkman@marvell.com>
[Maxime: Commit message] Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvpp2: make sure we use single queue mode on PPv2.1
The PPv2 driver defines 2 "queue_modes" :
- QDIST_SINGLE_MODE, where each port share one rx queue vector
between all CPUs
- QDIST_MULTI_MODE, where each port has one rx queue vector per CPU.
Multi queue mode isn't available on PPv2.1, make sure we fallback to
single mode when running on this revision.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>