netfilter: move checksum_partial indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum_partial() because that
would result in autoloading the 'ipv6' module because of symbol
dependencies. Therefore, define checksum_partial indirection in
nf_ipv6_ops where this really belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: move checksum indirection to struct nf_ipv6_ops
We cannot make a direct call to nf_ip6_checksum() because that would
result in autoloading the 'ipv6' module because of symbol dependencies.
Therefore, define checksum indirection in nf_ipv6_ops where this really
belongs to.
For IPv4, we can indeed make a direct function call, which is faster,
given IPv4 is built-in in the networking code by default. Still,
CONFIG_INET=n and CONFIG_NETFILTER=y is possible, so define empty inline
stub for IPv4 in such case.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: connlimit: split xt_connlimit into front and backend
This allows to reuse xt_connlimit infrastructure from nf_tables.
The upcoming nf_tables frontend can just pass in an nftables register
as input key, this allows limiting by any nft-supported key, including
concatenations.
For xt_connlimit, pass in the zone and the ip/ipv6 address.
netfilter: nf_tables: remove multihook chains and families
Since NFPROTO_INET is handled from the core, we don't need to maintain
extra infrastructure in nf_tables to handle the double hook
registration, one for IPv4 and another for IPv6.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: explicit nft_set_pktinfo() call from hook path
Instead of calling this function from the family specific variant, this
reduces the code size in the fast path for the netdev, bridge and inet
families. After this change, we must call nft_set_pktinfo() upfront from
the chain hook indirection.
Before:
text data bss dec hex filename
2145 208 0 2353 931 net/netfilter/nf_tables_netdev.o
After:
text data bss dec hex filename
2125 208 0 2333 91d net/netfilter/nf_tables_netdev.o
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: core: only allow one nat hook per hook point
The netfilter NAT core cannot deal with more than one NAT hook per hook
location (prerouting, input ...), because the NAT hooks install a NAT null
binding in case the iptables nat table (iptable_nat hooks) or the
corresponding nftables chain (nft nat hooks) doesn't specify a nat
transformation.
Null bindings are needed to detect port collsisions between NAT-ed and
non-NAT-ed connections.
This causes nftables NAT rules to not work when iptable_nat module is
loaded, and vice versa because nat binding has already been attached
when the second nat hook is consulted.
The netfilter core is not really the correct location to handle this
(hooks are just hooks, the core has no notion of what kinds of side
effects a hook implements), but its the only place where we can check
for conflicts between both iptables hooks and nftables hooks without
adding dependencies.
So add nat annotation to hook_ops to describe those hooks that will
add NAT bindings and then make core reject if such a hook already exists.
The annotation fills a padding hole, in case further restrictions appar
we might change this to a 'u8 type' instead of bool.
iptables error if nft nat hook active:
iptables -t nat -A POSTROUTING -j MASQUERADE
iptables v1.4.21: can't initialize iptables table `nat': File exists
Perhaps iptables or your kernel needs to be upgraded.
nftables error if iptables nat table present:
nft -f /etc/nftables/ipv4-nat
/usr/etc/nftables/ipv4-nat:3:1-2: Error: Could not process rule: File exists
table nat {
^^
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: xtables: add and use xt_request_find_table_lock
currently we always return -ENOENT to userspace if we can't find
a particular table, or if the table initialization fails.
Followup patch will make nat table init fail in case nftables already
registered a nat hook so this change makes xt_find_table_lock return
an ERR_PTR to return the errno value reported from the table init
function.
Add xt_request_find_table_lock as try_then_request_module replacement
and use it where needed.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: don't allocate space for arp/bridge hooks unless needed
no need to define hook points if the family isn't supported.
Because we need these hooks for either nftables, arp/ebtables
or the 'call-iptables' hack we have in the bridge layer add two
new dependencies, NETFILTER_FAMILY_{ARP,BRIDGE}, and have the
users select them.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
which store the hook entry point locations for the various protocol
families and the hooks.
Using array results in compact c code when doing accesses, i.e.
x = rcu_dereference(net->nf.hooks[pf][hook]);
but its also wasting a lot of memory, as most families are
not used.
So split the array into those families that are used, which
are only 5 (instead of 13). In most cases, the 'pf' argument is
constant, i.e. gcc removes switch statement.
Florian Westphal [Thu, 30 Nov 2017 23:21:04 +0000 (00:21 +0100)]
netfilter: core: free hooks with call_rcu
Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."
Cost for this is high. With synchronize_net() removed:
"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."
This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation. Thus store this information right after the rcu head.
We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries. However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.
Reported-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 23:21:03 +0000 (00:21 +0100)]
netfilter: core: remove synchronize_net call if nfqueue is used
since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued. Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 23:21:02 +0000 (00:21 +0100)]
netfilter: core: make nf_unregister_net_hooks simple wrapper again
This reverts commit d3ad2c17b4047
("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
Nothing wrong with it. However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.
This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Thu, 30 Nov 2017 20:08:05 +0000 (21:08 +0100)]
netfilter: ipset: add resched points during set listing
When sets are extremely large we can get softlockup during ipset -L.
We could fix this by adding cond_resched_rcu() at the right location
during iteration, but this only works if RCU nesting depth is 1.
At this time entire variant->list() is called under under rcu_read_lock_bh.
This used to be a read_lock_bh() but as rcu doesn't really lock anything,
it does not appear to be needed, so remove it (ipset increments set
reference count before this, so a set deletion should not be possible).
Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Mon, 8 Jan 2018 02:29:41 +0000 (21:29 -0500)]
Merge branch 'ipv6-ipv4-nexthop-align'
Ido Schimmel says:
====================
ipv6: Align nexthop behaviour with IPv4
This set tries to eliminate some differences between IPv4's and IPv6's
treatment of nexthops. These differences are most likely a side effect
of IPv6's data structures (specifically 'rt6_info') that incorporate
both the route and the nexthop and the late addition of ECMP support in
commit 51ebd3181572 ("ipv6: add support of equal cost multipath
(ECMP)").
IPv4 and IPv6 do not react the same to certain netdev events. For
example, upon carrier change affected IPv4 nexthops are marked using the
RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly.
IPv6 on the other hand, does nothing which forces us to perform a
carrier check during route lookup and dump. This makes it difficult to
introduce features such as non-equal-cost multipath that are built on
top of this set [1].
In addition, when a netdev is put administratively down IPv4 nexthops
are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all
the routes using these nexthops. To be consistent with IPv4, multipath
routes should only be flushed when all nexthops in the group are
considered dead.
The first 12 patches introduce non-functional changes that store the
RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev
events, in a similar fashion to IPv4. This allows us to remove the
carrier check performed during route lookup and dump.
The next three patches make sure we only flush a multipath route when
all of its nexthops are dead.
Last three patches add test cases for IPv4/IPv6 FIB. These verify that
both address families react similarly to netdev events.
Finally, this series also serves as a good first step towards David
Ahern's goal of treating nexthops as standalone objects [2], as it makes
the code more in line with IPv4 where the nexthop and the nexthop group
are separate objects from the route itself.
Changes since RFC (feedback from David Ahern):
* Remove redundant declaration of rt6_ifdown() in patch 4 and adjust
comment referencing it accordingly
* Drop patch to flush multipath routes upon NETDEV_UNREGISTER. Reword
cover letter accordingly
* Use a temporary variable to make code more readable in patch 15
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:18 +0000 (12:45 +0200)]
selftests: fib_tests: Add test cases for netdev carrier change
Check that IPv4 and IPv6 react the same when the carrier of a netdev is
toggled. Local routes should not be affected by this, whereas unicast
routes should.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:15 +0000 (12:45 +0200)]
ipv6: Flush multipath routes when all siblings are dead
By default, IPv6 deletes nexthops from a multipath route when the
nexthop device is put administratively down. This differs from IPv4
where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A
multipath route is flushed when all of its nexthops become dead.
Align IPv6 with IPv4 and have it conform to the same guidelines.
In case the multipath route needs to be flushed, its siblings are
flushed one by one. Otherwise, the nexthops are marked with the
appropriate flags and the tree walker is instructed to skip all the
siblings.
As explained in previous patches, care is taken to update the sernum of
the affected tree nodes, so as to prevent the use of wrong dst entries.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:14 +0000 (12:45 +0200)]
ipv6: Take table lock outside of sernum update function
The next patch is going to allow dead routes to remain in the FIB tree
in certain situations.
When this happens we need to be sure to bump the sernum of the nodes
where these are stored so that potential copies cached in sockets are
invalidated.
The function that performs this update assumes the table lock is not
taken when it is invoked, but that will not be the case when it is
invoked by the tree walker.
Have the function assume the lock is taken and make the single caller
take the lock itself.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:13 +0000 (12:45 +0200)]
ipv6: Export sernum update function
We are going to allow dead routes to stay in the FIB tree (e.g., when
they are part of a multipath route, directly connected route with no
carrier) and revive them when their nexthop device gains carrier or when
it is put administratively up.
This is equivalent to the addition of the route to the FIB tree and we
should therefore take care of updating the sernum of all the parent
nodes of the node where the route is stored. Otherwise, we risk sockets
caching and using sub-optimal dst entries.
Export the function that performs the above, so that it could be invoked
from fib6_ifup() later on.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:12 +0000 (12:45 +0200)]
ipv6: Teach tree walker to skip multipath routes
As explained in previous patch, fib6_ifdown() needs to consider the
state of all the sibling routes when a multipath route is traversed.
This is done by evaluating all the siblings when the first sibling in a
multipath route is traversed. If the multipath route does not need to be
flushed (e.g., not all siblings are dead), then we should just skip the
multipath route as our work is done.
Have the tree walker jump to the last sibling when it is determined that
the multipath route needs to be skipped.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:11 +0000 (12:45 +0200)]
ipv6: Add explicit flush indication to routes
When routes that are a part of a multipath route are evaluated by
fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events
the state of their sibling routes is not considered.
This will change in subsequent patches in order to align IPv6 with
IPv4's behavior. For example, when the last sibling in a multipath route
becomes dead, the entire multipath route needs to be removed.
To prevent the tree walker from re-evaluating all the sibling routes
each time, we can simply evaluate them once - when the first sibling is
traversed.
If we determine the entire multipath route needs to be removed, then the
'should_flush' bit is set in all the siblings, which will cause the
walker to flush them when it traverses them.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:10 +0000 (12:45 +0200)]
ipv6: Report dead flag during route dump
Up until now the RTNH_F_DEAD flag was only reported in route dump when
the 'ignore_routes_with_linkdown' sysctl was set. This is expected as
dead routes were flushed otherwise.
The reliance on this sysctl is going to be removed, so we need to report
the flag regardless of the sysctl's value.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:08 +0000 (12:45 +0200)]
ipv6: Check nexthop flags in route dump instead of carrier
Similar to previous patch, there is no need to check for the carrier of
the nexthop device when dumping the route and we can instead check for
the presence of the RTNH_F_LINKDOWN flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:07 +0000 (12:45 +0200)]
ipv6: Check nexthop flags during route lookup instead of carrier
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the
need to dereference the nexthop device and check its carrier and instead
check for the presence of the flag.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:05 +0000 (12:45 +0200)]
ipv6: Set nexthop flags upon carrier change
Similar to IPv4, when the carrier of a netdev changes we should toggle
the 'linkdown' flag on all the nexthops using it as their nexthop
device.
This will later allow us to test for the presence of this flag during
route lookup and dump.
Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on
device with address") host and anycast routes used the loopback netdev
as their nexthop device and thus were not marked with the 'linkdown'
flag. The patch preserves this behavior and allows one to ping the local
address even when the nexthop device does not have a carrier and the
'ignore_routes_with_linkdown' sysctl is set.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:04 +0000 (12:45 +0200)]
ipv6: Prepare to handle multiple netdev events
To make IPv6 more in line with IPv4 we need to be able to respond
differently to different netdev events. For example, when a netdev is
unregistered all the routes using it as their nexthop device should be
flushed, whereas when the netdev's carrier changes only the 'linkdown'
flag should be toggled.
Currently, this is not possible, as the function that traverses the
routing tables is not aware of the triggering event.
Propagate the triggering event down, so that it could be used in later
patches.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:02 +0000 (12:45 +0200)]
ipv6: Mark dead nexthops with appropriate flags
When a netdev is put administratively down or unregistered all the
nexthops using it as their nexthop device should be marked with the
'dead' and 'linkdown' flags.
Currently, when a route is dumped its nexthop device is tested and the
flags are set accordingly. A similar check is performed during route
lookup.
Instead, we can simply mark the nexthops based on netdev events and
avoid checking the netdev's state during route dump and lookup.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Sun, 7 Jan 2018 10:45:01 +0000 (12:45 +0200)]
ipv6: Remove redundant route flushing during namespace dismantle
By the time fib6_net_exit() is executed all the netdevs in the namespace
have been either unregistered or pushed back to the default namespace.
That is because pernet subsys operations are always ordered before
pernet device operations and therefore invoked after them during
namespace dismantle.
Thus, all the routing tables in the namespace are empty by the time
fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed.
This allows us to simplify the condition in fib6_ifdown() as it's only
ever called with an actual netdev.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Add a start of a framework for extending struct xdp_buff without
having the overhead of populating every data at runtime. Idea
is to have a new per-queue struct xdp_rxq_info that holds read
mostly data (currently that is, queue number and a pointer to
the corresponding netdev) which is set up during rxqueue config
time. When a XDP program is invoked, struct xdp_buff holds a
pointer to struct xdp_rxq_info that the BPF program can then
walk. The user facing BPF program that uses struct xdp_md for
context can use these members directly, and the verifier rewrites
context access transparently by walking the xdp_rxq_info and
net_device pointers to load the data, from Jesper.
2) Redo the reporting of offload device information to user space
such that it works in combination with network namespaces. The
latter is reported through a device/inode tuple as similarly
done in other subsystems as well (e.g. perf) in order to identify
the namespace. For this to work, ns_get_path() has been generalized
such that the namespace can be retrieved not only from a specific
task (perf case), but also from a callback where we deduce the
netns (ns_common) from a netdevice. bpftool support using the new
uapi info and extensive test cases for test_offload.py in BPF
selftests have been added as well, from Jakub.
3) Add two bpftool improvements: i) properly report the bpftool
version such that it corresponds to the version from the kernel
source tree. So pick the right linux/version.h from the source
tree instead of the installed one. ii) fix bpftool and also
bpf_jit_disasm build with bintutils >= 2.9. The reason for the
build breakage is that binutils library changed the function
signature to select the disassembler. Given this is needed in
multiple tools, add a proper feature detection to the
tools/build/features infrastructure, from Roman.
4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the
stacktrace map. It is currently unimplemented, but there are
use cases where user space needs to walk all stacktrace map
entries e.g. for dumping or deleting map entries w/o having to
close and recreate the map. Add BPF selftests along with it,
from Yonghong.
5) Few follow-up cleanups for the bpftool cgroup code: i) rename
the cgroup 'list' command into 'show' as we have it for other
subcommands as well, ii) then alias the 'show' command such that
'list' is accepted which is also common practice in iproute2,
and iii) remove couple of newlines from error messages using
p_err(), from Jakub.
6) Two follow-up cleanups to sockmap code: i) remove the unused
bpf_compute_data_end_sk_skb() function and ii) only build the
sockmap infrastructure when CONFIG_INET is enabled since it's
only aware of TCP sockets at this time, from John.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
The patch set implements bpf syscall command BPF_MAP_GET_NEXT_KEY
for stacktrace map. Patch #1 is the core implementation
and Patch #2 implements a bpf test at tools/testing/selftests/bpf
directory. Please see individual patch comments for details.
Changelog:
v1 -> v2:
- For invalid key (key pointer is non-NULL), sets next_key to be the first valid key.
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 4 Jan 2018 21:55:04 +0000 (13:55 -0800)]
tools/bpf: add a bpf selftest for stacktrace
Added a bpf selftest in test_progs at tools directory for stacktrace.
The test will populate a hashtable map and a stacktrace map
at the same time with the same key, stackid.
The user space will compare both maps, using BPF_MAP_LOOKUP_ELEM
command and BPF_MAP_GET_NEXT_KEY command, to ensure that both have
the same set of keys.
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 4 Jan 2018 21:55:03 +0000 (13:55 -0800)]
bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not
supported for stacktrace map. However, there are use cases where
user space wants to enumerate all stacktrace map entries where
BPF_MAP_GET_NEXT_KEY command will be really helpful.
In addition, if user space wants to delete all map entries
in order to save memory and does not want to close the
map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve
performance if map entries are sparsely populated.
The implementation has similar behavior for
BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides
a NULL key pointer or an invalid key, the first key is returned.
Otherwise, the first valid key after the input parameter "key"
is returned, or -ENOENT if no valid key can be found.
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
====================
V4:
* Added reviewers/acks to patches
* Fix patch desc in i40e that got out-of-sync with code
* Add SPDX license headers for the two new files added in patch 14
V3:
* Fixed bug in virtio_net driver
* Removed export of xdp_rxq_info_init()
V2:
* Changed API exposed to drivers
- Removed invocation of "init" in drivers, and only call "reg"
(Suggested by Saeed)
- Allow "reg" to fail and handle this in drivers
(Suggested by David Ahern)
* Removed the SINKQ qtype, instead allow to register as "unused"
* Also fixed some drivers during testing on actual HW (noted in patches)
There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on. For both the XDP bpf-prog and kernel side.
Instead of extending struct xdp_buff each time new info is needed,
this patchset takes a different approach. Struct xdp_buff is only
extended with a pointer to a struct xdp_rxq_info (allowing for easier
extending this later). This xdp_rxq_info contains information related
to how the driver have setup the individual RX-queue's. This is
read-mostly information, and all xdp_buff frames (in drivers
napi_poll) point to the same xdp_rxq_info (per RX-queue).
We stress this data/cache-line is for read-mostly info. This is NOT
for dynamic per packet info, use the data_meta for such use-cases.
This patchset start out small, and only expose ingress_ifindex and the
RX-queue index to the XDP/BPF program. Access to tangible info like
the ingress ifindex and RX queue index, is fairly easy to comprehent.
The other future use-cases could allow XDP frames to be recycled back
to the originating device driver, by providing info on RX device and
queue number.
As XDP doesn't have driver feature flags, and eBPF code due to
bpf-tail-calls cannot determine that XDP driver invoke it, this
patchset have to update every driver that support XDP.
For driver developers (review individual driver patches!):
The xdp_rxq_info is tied to the drivers RX-ring(s). Whenever a RX-ring
modification require (temporary) stopping RX frames, then the
xdp_rxq_info should (likely) also be unregistred and re-registered,
especially if reallocating the pages in the ring. Make sure ethtool
set_channels does the right thing. When replacing XDP prog, if and
only if RX-ring need to be changed, then also re-register the
xdp_rxq_info.
I'm Cc'ing the individual driver patches to the registered maintainers.
Testing:
I've only tested the NIC drivers I have hardware for. The general
test procedure is to (DUT = Device Under Test):
(1) run pktgen script pktgen_sample04_many_flows.sh (against DUT)
(2) run samples/bpf program xdp_rxq_info --dev $DEV (on DUT)
(3) runtime modify number of NIC queues via ethtool -L (on DUT)
(4) runtime modify number of NIC ring-size via ethtool -G (on DUT)
Patch based on git tree bpf-next (at commit fb982666e380c1632a):
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
====================
samples/bpf: program demonstrating access to xdp_rxq_info
This sample program can be used for monitoring and reporting how many
packets per sec (pps) are received per NIC RX queue index and which
CPU processed the packet. In itself it is a useful tool for quickly
identifying RSS imbalance issues, see below.
The default XDP action is XDP_PASS in-order to provide a monitor
mode. For benchmarking purposes it is possible to specify other XDP
actions on the cmdline --action.
Output below shows an imbalance RSS case where most RXQ's deliver to
CPU-0 while CPU-2 only get packets from a single RXQ. Looking at
things from a CPU level the two CPUs are processing approx the same
amount, BUT looking at the rx_queue_index levels it is clear that
RXQ-2 receive much better service, than other RXQs which all share CPU-0.
Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
XDP stats CPU pps issue-pps
XDP-RX CPU 0 900,473 0
XDP-RX CPU 2 906,921 0
XDP-RX CPU total 1,807,395
bpf: finally expose xdp_rxq_info to XDP bpf-programs
Now all XDP driver have been updated to setup xdp_rxq_info and assign
this to xdp_buff->rxq. Thus, it is now safe to enable access to some
of the xdp_rxq_info struct members.
This patch extend xdp_md and expose UAPI to userspace for
ingress_ifindex and rx_queue_index. Access happens via bpf
instruction rewrite, that load data directly from struct xdp_rxq_info.
* ingress_ifindex map to xdp_rxq_info->dev->ifindex
* rx_queue_index map to xdp_rxq_info->queue_index
The net_device have some members (num_rx_queues + real_num_rx_queues)
and data-area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).
For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.
The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead reject XDP setup if all the
conditions for XDP is not meet. Thus, the xdp_rxq_info also remains
fairly static. This allow us to simply add the reg/unreg to
net_device open/close functions.
V3:
- bugfix, also setup xdp.rxq in receive_mergeable()
- Tested bpf-sample prog inside guest on a virtio_net device
Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
I've done some manual testing of this tun driver, but I would
appriciate good review and someone else running their use-case tests,
as I'm not 100% sure I understand the tfile->detached semantics.
V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.
Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This driver uses a bool scheme for "enable"/"disable" when setting up
different resources. Thus, the hook points for xdp_rxq_info is done
in the same function call nicvf_rcv_queue_config(). This is activated
through enable/disable via nicvf_config_data_transfer(), which is tied
into nicvf_stop()/nicvf_open().
Extending driver packet handler call-path nicvf_rcv_pkt_handler() with
a pointer to the given struct rcv_queue, in-order to access the
xdp_rxq_info data area (in nicvf_xdp_rx()).
V2: Driver have no proper error path for failed XDP RX-queue info reg,
as nicvf_rcv_queue_config is a void function.
xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
The driver code qede_free_fp_array() depend on kfree() can be called
with a NULL pointer. This stems from the qede_alloc_fp_array()
function which either (kz)alloc memory for fp->txq or fp->rxq.
This also simplifies error handling code in case of memory allocation
failures, but xdp_rxq_info_unreg need to know the difference.
Introduce xdp_rxq_info_is_reg() to handle if a memory allocation fails
and detect this is the failure path by seeing that xdp_rxq_info was
not registred yet, which first happens after successful alloaction in
qede_init_fp().
The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
a sideband channel for configuring/updating the flow director tables.
This (i40e_vsi_)type does not invoke XDP-ebpf code.
As suggested by Björn (V2): Instead of marking this I40E_VSI_FDIR RX-ring
a special case, reverse the logic and only select RX-rings of type
I40E_VSI_MAIN to register xdp_rxq_info's for.
The mlx5 driver have a special drop-RQ queue (one per interface) that
simply drops all incoming traffic. It helps driver keep other HW
objects (flow steering) alive upon down/up operations. It is
temporarily pointed by flow steering objects during the interface
setup, and when interface is down. It lacks many fields that are set
in a regular RQ (for example its state is never switched to
MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explanation).
The XDP RX-queue info for this drop-RQ marked as unused, which
allow us to use the same takedown/free code path as other RX-queues.
This patch only introduce the core data structures and API functions.
All XDP enabled drivers must use the API before this info can used.
There is a need for XDP to know more about the RX-queue a given XDP
frames have arrived on. For both the XDP bpf-prog and kernel side.
Instead of extending xdp_buff each time new info is needed, the patch
creates a separate read-mostly struct xdp_rxq_info, that contains this
info. We stress this data/cache-line is for read-only info. This is
NOT for dynamic per packet info, use the data_meta for such use-cases.
The performance advantage is this info can be setup at RX-ring init
time, instead of updating N-members in xdp_buff. A possible (driver
level) micro optimization is that xdp_buff->rxq assignment could be
done once per XDP/NAPI loop. The extra pointer deref only happens for
program needing access to this info (thus, no slowdown to existing
use-cases).
Jakub Kicinski [Thu, 4 Jan 2018 15:10:19 +0000 (16:10 +0100)]
nfp: add basic multicast filtering
We currently always pass all multicast traffic through.
Only set L2MC when actually needed. Since the driver
was not making use of the capability to filter out mcast
frames, some FW projects don't implement it any more.
Don't warn users if capability is not present (like we
do for promisc flag). The lack of L2MC capability is
assumed to mean all multicast traffic goes through.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
rds: use RCU between work-enqueue and connection teardown
This patchset follows up on the root-cause mentioned in
https://www.spinics.net/lists/netdev/msg472849.html
Patch1 implements some code refactoring that was suggeseted
as an enhancement in http://patchwork.ozlabs.org/patch/843157/
It replaces the c_destroy_in_prog bit in rds_connection with
an atomically managed flag in rds_conn_path.
Patch2 builds on Patch1 and uses RCU to make sure that
work is only enqueued if the connection destroy is not already
in progress: the test-flag-and-enqueue is done under rcu_read_lock,
while destroy first sets the flag, uses synchronize_rcu to
wait for existing reader threads to complete, and then starts
all the work-cancellation.
Since I have not been able to reproduce the original stack traces
reported by syszbot, and these are fixes for a race condition that
are based on code-inspection I am not marking these as reported-by
at this time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
rds: use RCU to synchronize work-enqueue with connection teardown
rds_sendmsg() can enqueue work on cp_send_w from process context, but
it should not enqueue this work if connection teardown has commenced
(else we risk enquing work after rds_conn_path_destroy() has assumed that
all work has been cancelled/flushed).
Similarly some other functions like rds_cong_queue_updates
and rds_tcp_data_ready are called in softirq context, and may end
up enqueuing work on rds_wq after rds_conn_path_destroy() has assumed
that all workqs are quiesced.
Check the RDS_DESTROY_PENDING bit and use rcu synchronization to avoid
all these races.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
rds: Use atomic flag to track connections being destroyed
Replace c_destroy_in_prog by using a bit in cp_flags that
can set/tested atomically.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 5 Jan 2018 18:37:04 +0000 (13:37 -0500)]
Merge branch 'tipc-two-small-cleanups'
Jon Maloy says:
====================
tipc: two small cleanups
These two commits are based on commit f9c935db8086 ("tipc: fix
problems with multipoint-to-point flow control") which has been
applied to 'net' but not yet to 'net-next'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 4 Jan 2018 14:20:45 +0000 (15:20 +0100)]
tipc: simplify small window members' sorting algorithm
We simplify the sorting algorithm in tipc_update_member(). We also make
the remaining conditional call to this function unconditional, since the
same condition now is tested for inside the said function.
Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Thu, 4 Jan 2018 14:20:44 +0000 (15:20 +0100)]
tipc: some clarifying name changes
We rename some functions and variables, to make their purpose clearer.
- tipc_group::congested -> tipc_group::small_win. Members in this list
are not necessarily (and typically) congested. Instead, they may
*potentially* be subject to congestion because their send window is
less than ADV_IDLE, and therefore need to be checked during message
transmission.
- tipc_group_is_receiver() -> tipc_group_is_sender(). This socket will
accept messages coming from members fulfilling this condition, i.e.,
they are senders from this member's viewpoint.
- tipc_group_is_enabled() -> tipc_group_is_receiver(). Members
fulfilling this condition will accept messages sent from the current
socket, i.e., they are receivers from its viewpoint.
There are no functional changes in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Thu, 4 Jan 2018 07:30:28 +0000 (07:30 +0000)]
net: dsa: lan9303: Fix error return code in lan9303_check_device()
Fix to return error code -ENODEV from the chip not found error handling
case instead of 0(ret have been overwritten to 0 by lan9303_read()), as
done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: dsa: Move padding into Broadcom tagger
This patch series moves the padding of short packets to where it belongs
within the DSA Broadcom tagger code, I just found myself doing this for
a third driver, which was a clear indication this was wrong and did not
scale.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Short packet padding added to the driver is only necessary when using
Broadcom tags, but since this is now taken care of net/dsa/tag_brcm.c,
we are guaranteed being given correctly padded packets.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of having the different master network device drivers
potentially used by DSA/Broadcom tags, move the padding necessary for
the switches to accept short packets where it makes most sense: within
tag_brcm.c. This avoids multiplying the number of similar commits to
e.g: bgmac, bcmsysport, etc.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Quentin Monnet [Thu, 4 Jan 2018 01:30:45 +0000 (17:30 -0800)]
net: sched: fix tcf_block_get_ext() in case CONFIG_NET_CLS is not set
The definition of functions tcf_block_get() and tcf_block_get_ext()
depends of CONFIG_NET_CLS being set. When those functions gained extack
support, only one version of the declaration of those functions was
updated. Function tcf_block_get() was later fixed with commit 3c1490913f3b ("net: sch: api: fix tcf_block_get").
Change arguments of tcf_block_get_ext() for the case when CONFIG_NET_CLS
is not set.
Fixes: 8d1a77f974ca ("net: sch: api: add extack support in tcf_block_get") Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: revert "Update RFS target at poll for tcp/udp"
On multi-threaded processes, one common architecture is to have
one (or a small number of) threads polling sockets, and a
considerably larger pool of threads reading form and writing to the
sockets. When we set RPS core on tcp_poll() or udp_poll() we essentially
steer all packets of all the polled FDs to one (or small number of)
cores, creaing a bottleneck and/or RPS misprediction.
Another common architecture is to shard FDs among threads pinned
to cores. In such a setting, setting RPS core in tcp_poll() and
udp_poll() is redundant because the RFS core is correctly
set in recvmsg and sendmsg.
Thus, revert the following commit: c3f1dbaf6e28 ("net: Update RFS target at poll for tcp/udp").
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We should only record RPS on normal reads and writes.
In single threaded processes, all calls record the same state. In
multi-threaded processes where a separate thread processes
errors, the RFS table mispredicts.
Note that, when CONFIG_RPS is disabled, sock_rps_record_flow
is a noop and no branch is added as a result of this patch.
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series removes all code to support a configurable offset in
transmitted l2tp packets. Code to handle this is incomplete and buggy
and has been this way for years. If anyone tried to configure an
offset, it would be ignored for L2TPv2 tunnels, or for L2TPv3 tunnels,
could result in L2TPv3 packets being transmitted which are not
compliant with L2TPv3 RFC3931. This patch series removes the support
for configurable offsets.
No known userspace l2tp daemon configures an offset. However,
iproute2's "ip l2tp" command has an offset parameter and if set, the
value is passed to the kernel. This is the most likely use case where
offsets might be configured, e.g.
ip l2tp add tunnel local 1.1.1.1 remote 1.1.1.2 tunnel_id 1 \
peer_tunnel_id 2 encap ip
ip l2tp add session name l2tp0 tunnel_id 1 session_id 1 \
peer_session_id 2 offset 8
The above would result in packets being transmitted to 1.1.1.2 with 8
bytes padding between the L2TPv3 header and the payload. The peer
would need to be configured with the same offset value. However, the
packets are not compliant with the L2TPv3 RFC, hence I think it's
unlikely that offset is being used. With this patch series applied,
the offset would not be configured. The peer would need to be modified to
remove its offset setting too.
iproute2 should be modified to remove or ignore the ip l2tp offset
parameter.
This issue was discovered when reviewing a patch series from
lorenzo.bianconi@redhat.com which adds another netlink attribute to
configure the expected offset in received L2TPv3 packets. This change
is reverted by this series because offsets do not exist in L2TPv3
packets. These commits are:
The L2TPv2 protocol supports a variable offset from the L2TPv2 header
to the payload to give the sender implementation some flexibility for
data alignment when adding L2TP headers on to payloads. The offset
value is indicated by an optional field in the L2TP header. Our L2TP
implementation already detects the presence of the optional offset in
received packets and skips those bytes when parsing packets. All
transmitted L2TPv2 packets are always transmitted with no offset.
L2TPv3 has no optional offset field in the L2TPv3 packet
header. Instead, L2TPv3 defines optional fields in a "Layer-2 Specific
Sublayer". At the time when the original L2TP code was written, there
was talk at IETF of offset being implemented in a new Layer-2 Specific
Sublayer. A L2TP_ATTR_OFFSET netlink attribute was added so that this
offset could be configured and the intention was to allow it to be
also used to set the tx offset for L2TPv2. However, no L2TPv3 offset
was ever specified and the L2TP_ATTR_OFFSET parameter was forgotten
about.
Setting L2TP_ATTR_OFFSET results in L2TPv3 packets being transmitted
with the specified number of bytes padding between L2TPv3 header and
payload. This is not compliant with L2TPv3 RFC3931. So this change
removes the configurable offset altogether while retaining
L2TP_ATTR_OFFSET in the API for backwards compatibility. If
L2TP_ATTR_OFFSET is given, its value is now silently ignored.
====================
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
James Chapman [Wed, 3 Jan 2018 22:48:06 +0000 (22:48 +0000)]
l2tp: remove configurable payload offset
If L2TP_ATTR_OFFSET is set to a non-zero value in L2TPv3 tunnels, it
results in L2TPv3 packets being transmitted which might not be
compliant with the L2TPv3 RFC. This patch has l2tp ignore the offset
setting and send all packets with no offset.
In more detail:
L2TPv2 supports a variable offset from the L2TPv2 header to the
payload. The offset value is indicated by an optional field in the
L2TP header. Our L2TP implementation already detects the presence of
the optional offset and skips that many bytes when handling data
received packets. All transmitted packets are always transmitted with
no offset.
L2TPv3 has no optional offset field in the L2TPv3 packet
header. Instead, L2TPv3 defines optional fields in a "Layer-2 Specific
Sublayer". At the time when the original L2TP code was written, there
was talk at IETF of offset being implemented in a new Layer-2 Specific
Sublayer. A L2TP_ATTR_OFFSET netlink attribute was added so that this
offset could be configured and the intention was to allow it to be
also used to set the tx offset for L2TPv2. However, no L2TPv3 offset
was ever specified and the L2TP_ATTR_OFFSET parameter was forgotten
about.
Setting L2TP_ATTR_OFFSET results in L2TPv3 packets being transmitted
with the specified number of bytes padding between L2TPv3 header and
payload. This is not compliant with L2TPv3 RFC3931. This change
removes the configurable offset altogether while retaining
L2TP_ATTR_OFFSET for backwards compatibility. Any L2TP_ATTR_OFFSET
value is ignored.
Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
James Chapman [Wed, 3 Jan 2018 22:48:04 +0000 (22:48 +0000)]
l2tp: revert "l2tp: add peer_offset parameter"
Revert commit f15bc54eeecd ("l2tp: add peer_offset parameter"). This
is removed because it is adding another configurable offset and
configurable offsets are being removed.
Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Wed, 3 Jan 2018 22:40:11 +0000 (23:40 +0100)]
net/mlx5e: hide an unused variable
The uplink_rpriv variable was added at the start of the function but
only used inside of an #ifdef:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c: In function 'mlx5e_route_lookup_ipv6':
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:1549:25: error: unused variable 'uplink_rpriv' [-Werror=unused-variable]
This moves the declaration into that #ifdef as well.
Fixes: 5ed99fb421d4 ("net/mlx5e: Move ethernet representors data into separate struct") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 4 Jan 2018 19:33:29 +0000 (14:33 -0500)]
Merge tag 'mac80211-next-for-davem-2018-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
We have things all over the place, no point listing them.
One thing is notable: I applied two patches and later
reverted them - we'll get back to that once all the driver
situation is sorted out.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Prashant Sreedharan <prashant.sreedharan@broadcom.com> Signed-off-by: Satish Baddipadige <satish.baddipadige@broadcom.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>