]> git.proxmox.com Git - mirror_ubuntu-bionic-kernel.git/log
mirror_ubuntu-bionic-kernel.git
6 years agorxrpc: Support service upgrade from a kernel service
David Howells [Wed, 18 Oct 2017 10:36:39 +0000 (11:36 +0100)]
rxrpc: Support service upgrade from a kernel service

Provide support for a kernel service to make use of the service upgrade
facility.  This involves:

 (1) Pass an upgrade request flag to rxrpc_kernel_begin_call().

 (2) Make rxrpc_kernel_recv_data() return the call's current service ID so
     that the caller can detect service upgrade and see what the service
     was upgraded to.

Signed-off-by: David Howells <dhowells@redhat.com>
6 years agonet: export netdev_txq_to_tc to allow sch_mqprio to compile as module
Henrik Austad [Tue, 17 Oct 2017 10:10:10 +0000 (12:10 +0200)]
net: export netdev_txq_to_tc to allow sch_mqprio to compile as module

In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
to find the correct tc instead of dev->tc_to_txq[]

However, when mqprio is compiled as a module, it cannot resolve the
symbol, leading to this error:

     ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!

This adds an EXPORT_SYMBOL() since the other user in the kernel
(netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
sysfs-callback.

Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Henrik Austad <haustad@cisco.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'mlxsw-GRE-Offload-decap-without-encap'
David S. Miller [Mon, 16 Oct 2017 20:30:33 +0000 (21:30 +0100)]
Merge branch 'mlxsw-GRE-Offload-decap-without-encap'

Jiri Pirko says:

====================
mlxsw: GRE: Offload decap without encap

Petr says:

The current code doesn't offload GRE decapsulation unless there's a
corresponding encapsulation route as well. While not strictly incorrect (when
encap route is absent, the decap route traps traffic to CPU and the kernel
handles it), it's a missed optimization opportunity.

With this patchset, IPIP entries are created as soon as offloadable tunneling
netdevice is created. This then leads to offloading of decap route, if one
exists, or is added afterwards, even when no encap route is present.

In Linux, when there is a decap route, matching IP-in-IP packets are always
decapsulated. However, with IPv4 overlays in particular, whether the inner
packet is then forwarded depends on setting of net.ipv4.conf.*.rp_filter. When
RP filtering is turned on, inner packets aren't forwarded unless there's a
matching encap route. The mlxsw driver doesn't reflect this behavior in other
router interfaces, and thus it's not implemented for tunnel types either. A
better support for this will be subject of follow-up work.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum: Drop refcounting of IPIP entries
Petr Machata [Mon, 16 Oct 2017 14:26:39 +0000 (16:26 +0200)]
mlxsw: spectrum: Drop refcounting of IPIP entries

Formerly, IPIP entries were created lazily by next hops that referenced
an offloadable IP-in-IP netdevice. However now that they are created
eagerly as a reaction to events on such netdevices, the reference
counting is useless. Hence drop it.

The routes whose next hops reference an offloaded IP-in-IP netdevice
actually linger around a bit after their device is unregistered.
However, mlxsw_sp_ipip_entry_destroy() also destroys the backing
loopback, and mlxsw_sp_rif_destroy() transitively (via
mlxsw_sp_nexthop_rif_gone_sync()) calls mlxsw_sp_nexthop_ipip_fini(),
which unlinks the IPIP entry from a next hop. Thus no dangling pointers
are left behind for the brief window after netdevice is gone, but routes
not yet.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum: Support IPIP overlay VRF migration
Petr Machata [Mon, 16 Oct 2017 14:26:38 +0000 (16:26 +0200)]
mlxsw: spectrum: Support IPIP overlay VRF migration

IPIP entries are created as soon as an offloadable device is created.
That means that when such a device is later moved to a different VRF,
the loopback device that backs the tunnel is wrong.

Thus when an offloadable encapsulating netdevice moves from one VRF to
another, make sure that the loopback is updated as necessary.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum: Support decap-only IP-in-IP tunnels
Petr Machata [Mon, 16 Oct 2017 14:26:37 +0000 (16:26 +0200)]
mlxsw: spectrum: Support decap-only IP-in-IP tunnels

Current code for offloading IP-in-IP tunneling assumes that there is no
decap without encap. But that's never true for IPv6 overlays, and is not
true for IPv4 ones either, if net.ipv4.conf.*.rp_filter is unset.

To support decap-only tunnels, an IPIP entry is now created as soon as
an offloadable tunneling device is created. When that netdevice is up'd,
a decap route is looked up and possibly offloaded. Thus decap is not
handled implicitly as part of mlxsw_sp_ipip_entry_get() call anymore,
but needs to be done explicitly after the get, if desired.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum_router: Move mlxsw_sp_netdev_ipip_type()
Petr Machata [Mon, 16 Oct 2017 14:26:36 +0000 (16:26 +0200)]
mlxsw: spectrum_router: Move mlxsw_sp_netdev_ipip_type()

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: spectrum: Move netdevice NB to struct mlxsw_sp
Petr Machata [Mon, 16 Oct 2017 14:26:35 +0000 (16:26 +0200)]
mlxsw: spectrum: Move netdevice NB to struct mlxsw_sp

So far, all netdevice notifications that the driver cared about were
related to its own ports, and mlxsw_sp could be retrieved from the
netdevice's private data. For IP-in-IP offloading however, the driver
cares about events on foreign netdevices, and getting at mlxsw_sp or
router data structures from the handler is inconvenient.

Therefore move the netdevice notifier blocks from global scope to struct
mlxsw_sp to allow retrieval from the notifier block pointer itself.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotipc: fix rebasing error
Jon Maloy [Mon, 16 Oct 2017 14:04:51 +0000 (16:04 +0200)]
tipc: fix rebasing error

In commit 2f487712b893 ("tipc: guarantee that group broadcast doesn't
bypass group unicast") there was introduced a last-minute rebasing
error that broke non-group communication.

We fix this here.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-core-rcuify-rtnl-af_ops'
David S. Miller [Mon, 16 Oct 2017 20:26:40 +0000 (21:26 +0100)]
Merge branch 'net-core-rcuify-rtnl-af_ops'

Florian Westphal says:

====================
net: core: rcuify rtnl af_ops

None of the rtnl af_ops callbacks sleep, so they can be called while
holding rcu read lock.

Switch handling of af_ops to rcu.

This would allow to later call af_ops functions without holding
the rtnl mutex anymore.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: core: rcu-ify rtnl af_ops
Florian Westphal [Mon, 16 Oct 2017 13:44:36 +0000 (15:44 +0200)]
net: core: rcu-ify rtnl af_ops

rtnl af_ops currently rely on rtnl mutex: unregister (called from module
exit functions) takes the rtnl mutex and all users that do af_ops lookup
also take the rtnl mutex. IOW, parallel rmmod will block until doit()
callback is done.

As none of the af_ops implementation sleep we can use rcu instead.

doit functions that need the af_ops can now use rcu instead of the
rtnl mutex provided the mutex isn't needed for other reasons.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agortnetlink: place link af dump into own helper
Florian Westphal [Mon, 16 Oct 2017 13:44:35 +0000 (15:44 +0200)]
rtnetlink: place link af dump into own helper

next patch will rcu-ify rtnl af_ops, i.e. allow af_ops
lookup and function calls with rcu read lock held instead
of rtnl mutex.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: cdg: make struct tcp_cdg static
Colin Ian King [Mon, 16 Oct 2017 13:33:21 +0000 (14:33 +0100)]
tcp: cdg: make struct tcp_cdg static

The structure tcp_cdg is local to the source and
does not need to be in global scope, so make it static.

Cleans up sparse warning:
symbol 'tcp_cdg' was not declared. Should it be static?

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: systemport: add NET_DSA dependency
Arnd Bergmann [Mon, 16 Oct 2017 11:32:36 +0000 (13:32 +0200)]
net: systemport: add NET_DSA dependency

The notifier cause a link error when NET_DSA is a loadable
module:

drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_remove':
bcmsysport.c:(.text+0x1582): undefined reference to `unregister_dsa_notifier'
drivers/net/ethernet/broadcom/bcmsysport.o: In function `bcm_sysport_probe':
bcmsysport.c:(.text+0x278d): undefined reference to `register_dsa_notifier'

This adds a dependency that forces the systemport driver to be
a loadable module as well when that happens, but otherwise
allows it to be built normally when DSA is either built-in or
completely disabled.

Fixes: d156576362c0 ("net: systemport: Establish lower/upper queue mapping")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohamradio: baycom_par: use new parport device model
Sudip Mukherjee [Sun, 15 Oct 2017 21:00:52 +0000 (22:00 +0100)]
hamradio: baycom_par: use new parport device model

Modify baycom driver to use the new parallel port device model.

Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dccp: mark expected switch fall-throughs
Gustavo A. R. Silva [Sun, 15 Oct 2017 18:22:10 +0000 (13:22 -0500)]
net: dccp: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Notice that for options.c file, I placed the "fall through" comment
on its own line, which is what GCC is expecting to find.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agopch_gbe: Switch to new PCI IRQ allocation API
Andy Shevchenko [Sat, 14 Oct 2017 14:04:40 +0000 (17:04 +0300)]
pch_gbe: Switch to new PCI IRQ allocation API

This removes custom flag handling.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotracing: bpf: Hide bpf trace events when they are not used
Steven Rostedt (VMware) [Thu, 12 Oct 2017 22:40:02 +0000 (18:40 -0400)]
tracing: bpf: Hide bpf trace events when they are not used

All the trace events defined in include/trace/events/bpf.h are only
used when CONFIG_BPF_SYSCALL is defined. But this file gets included by
include/linux/bpf_trace.h which is included by the networking code with
CREATE_TRACE_POINTS defined.

If a trace event is created but not used it still has data structures
and functions created for its use, even though nothing is using them.
To not waste space, do not define the BPF trace events in bpf.h unless
CONFIG_BPF_SYSCALL is defined.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoipv6: only update __use and lastusetime once per jiffy at most
Wei Wang [Fri, 13 Oct 2017 22:08:07 +0000 (15:08 -0700)]
ipv6: only update __use and lastusetime once per jiffy at most

In order to not dirty the cacheline too often, we try to only update
dst->__use and dst->lastusetime at most once per jiffy.
As dst->lastusetime is only used by ipv6 garbage collector, it should
be good enough time resolution.
And __use is only used in ipv6_route_seq_show() to show how many times a
dst has been used. And as __use is not atomic_t right now, it does not
show the precise number of usage times anyway. So we think it should be
OK to only update it at most once per jiffy.

According to my latest syn flood test on a machine with intel Xeon 6th
gen processor and 2 10G mlx nics bonded together, each with 8 rx queues
on 2 NUMA nodes:
With this patch, the packet process rate increases from ~3.49Mpps to
~3.75Mpps with a 7% increase rate.

Note: dst_use() is being renamed to dst_hold_and_use() to better specify
the purpose of the function.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@googl.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoipv6: check fn before doing FIB6_SUBTREE(fn)
Wei Wang [Fri, 13 Oct 2017 22:01:08 +0000 (15:01 -0700)]
ipv6: check fn before doing FIB6_SUBTREE(fn)

In fib6_locate(), we need to first make sure fn is not NULL before doing
FIB6_SUBTREE(fn) to avoid crash.

This fixes the following static checker warning:
net/ipv6/ip6_fib.c:1462 fib6_locate()
         warn: variable dereferenced before check 'fn' (see line 1459)

net/ipv6/ip6_fib.c
  1458          if (src_len) {
  1459                  struct fib6_node *subtree = FIB6_SUBTREE(fn);
                                                    ^^^^^^^^^^^^^^^^
We shifted this dereference

  1460
  1461                  WARN_ON(saddr == NULL);
  1462                  if (fn && subtree)
                            ^^
before the check for NULL.

  1463                          fn = fib6_locate_1(subtree, saddr, src_len,
  1464                                             offsetof(struct rt6_info, rt6i_src)

Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in fib6_table")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobpf: Add -target to clang switch while cross compiling.
Abhijit Ayarekar [Fri, 13 Oct 2017 19:24:06 +0000 (12:24 -0700)]
bpf: Add -target to clang switch while cross compiling.

Update to llvm excludes assembly instructions.
llvm git revision is below

commit 65fad7c26569 ("bpf: add inline-asm support")

This change will be part of llvm  release 6.0

__ASM_SYSREG_H define is not required for native compile.
-target switch includes appropriate target specific files
while cross compiling

Tested on x86 and arm64.

Signed-off-by: Abhijit Ayarekar <abhijit.ayarekar@caviumnetworks.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'sched-tp_q-remove'
David S. Miller [Mon, 16 Oct 2017 20:00:41 +0000 (21:00 +0100)]
Merge branch 'sched-tp_q-remove'

Jiri Pirko <jiri@mellanox.com>

====================
net: sched: remove some tp->q usage

In order to prepare for block sharing, tcf_proto instances need to be
independent on particular qdisc instances. This patchset takes care of
removal of couple occurrences of tp->q usage.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: propagate q and parent from caller down to tcf_fill_node
Jiri Pirko [Fri, 13 Oct 2017 12:01:05 +0000 (14:01 +0200)]
net: sched: propagate q and parent from caller down to tcf_fill_node

The callers have this info, they will pass it down to tcf_fill_node.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: use tcf_block_q helper to get q pointer for sch_tree_lock
Jiri Pirko [Fri, 13 Oct 2017 12:01:04 +0000 (14:01 +0200)]
net: sched: use tcf_block_q helper to get q pointer for sch_tree_lock

Use tcf_block_q helper to get q pointer to be used for direct call of
sch_tree_lock/unlock instead of tcf_tree_lock/unlock.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc
Jiri Pirko [Fri, 13 Oct 2017 12:01:03 +0000 (14:01 +0200)]
net: sched: tcindex, fw, flow: use tcf_block_q helper to get struct Qdisc

Use helper to get q pointer per block.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: cls_u32: use block instead of q in tc_u_common
Jiri Pirko [Fri, 13 Oct 2017 12:01:02 +0000 (14:01 +0200)]
net: sched: cls_u32: use block instead of q in tc_u_common

tc_u_common is now per-q. With blocks, it has to be converted to be
per-block.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: ematch: obtain net pointer from blocks
Jiri Pirko [Fri, 13 Oct 2017 12:01:01 +0000 (14:01 +0200)]
net: sched: ematch: obtain net pointer from blocks

Instead of using tp->q, use block to get the net pointer.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: teach tcf_bind/unbind_filter to use block->q
Jiri Pirko [Fri, 13 Oct 2017 12:01:00 +0000 (14:01 +0200)]
net: sched: teach tcf_bind/unbind_filter to use block->q

Whenever the block->q is set, it can be used instead of tp->q as it
contains the same value. When it is not set, which can't happen now but
it might happen with the follow-up shared blocks introduction, the class
is not set in the result. That would lead to a class lookup instead
of direct class pointer use for classful qdiscs. However, it is not
planned to support classful qdisqs sharing filter blocks, so that may
never happen.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: introduce tcf_block_q and tcf_block_dev helpers
Jiri Pirko [Fri, 13 Oct 2017 12:00:59 +0000 (14:00 +0200)]
net: sched: introduce tcf_block_q and tcf_block_dev helpers

These helpers allows to get a q and netdev pointers
for given block easily.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: store net pointer in block and introduce qdisc_net helper
Jiri Pirko [Fri, 13 Oct 2017 12:00:58 +0000 (14:00 +0200)]
net: sched: store net pointer in block and introduce qdisc_net helper

Store net pointer in the block structure. Along the way, introduce
qdisc_net helper which allows to easily obtain net pointer for
qdisc instance.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: store Qdisc pointer in struct block
Jiri Pirko [Fri, 13 Oct 2017 12:00:57 +0000 (14:00 +0200)]
net: sched: store Qdisc pointer in struct block

Prepare for removal of tp->q and store Qdisc pointer in the block
structure.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomqprio: Reserve last 32 classid values for HW traffic classes and misc IDs
Alexander Duyck [Thu, 12 Oct 2017 18:38:45 +0000 (11:38 -0700)]
mqprio: Reserve last 32 classid values for HW traffic classes and misc IDs

This patch makes a slight tweak to mqprio in order to bring the
classid values used back in line with what is used for mq. The general idea
is to reserve values :ffe0 - :ffef to identify hardware traffic classes
normally reported via dev->num_tc. By doing this we can maintain a
consistent behavior with mq for classid where :1 - :ffdf will represent a
physical qdisc mapped onto a Tx queue represented by classid - 1, and the
traffic classes will be mapped onto a known subset of classid values
reserved for our virtual qdiscs.

Note I reserved the range from :fff0 - :ffff since this way we might be
able to reuse these classid values with clsact and ingress which would mean
that for mq, mqprio, ingress, and clsact we should be able to maintain a
similar classid layout.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'mlx5-updates-2017-10-11' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Mon, 16 Oct 2017 04:42:41 +0000 (05:42 +0100)]
Merge tag 'mlx5-updates-2017-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Saeed Mahameed says:

====================
mlx5-updates-2017-10-11: IPoIB Multi Pkey support

This series provides the support for IPoIB Multi Pkey.
InfiniBand Pkeys are the equivalent of Ethernet vlans.
Currently IPoIB device driver supports only default Pkey and IPoIB Pkey child
interfaces are not supported with IPoIB offloads mode, this series will add
the support for that by allowing creating mlx5 multiple IPoIB netdevices with
a non-default Pkey.

mlx5 IPoIB Pkey child interface is smaller version of mlx5i IPoIB interfaces and shares
most of its resources with the parent IPoIB interface, namely RX steering and ring
queue resources.

The only mlx5 resources a child Pkey interface will be creating are the TX rings,
since they should be assigned to a specific Pkey.

mlx5i Pkey netdev is implemented via new mlx5e netdev profile implemented in
mlx5/core/ipoib/ipoib_vlan.c.

The series starts with a refactoring of mlx5e PTP and mlx5 clock implementation
to move the code to be part of mlx5 core rather than mlx5e netdevice, in order to
make mlx5 clock and PTP registration part of the core to be shared with mlx5e
master Ethernet netdev/IPoIB parent netdev and mlx5_ib in the near future.

Add the support for attaching multiple underlay QPs for the different Pkeys
in mlx5 core RX steering.

Add Pkey index to rdma_netdev to add the ability to set PKEY index to lower
IPoIB offload netdev.

Use hash-table to map between DQPN (Destination QP number) to child netdev
for the IPoIB parent netdev to forward RX packets to the corresponding
child Pkey netdev, since the RX rings are shared.

The reset of the series adds the ipoib child Pkey: mlx5e netdev profile,
netdev nods implementation and minimal set of ethtool callbacks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Sun, 15 Oct 2017 01:49:42 +0000 (18:49 -0700)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2017-10-13

This series contains updates to mqprio and i40e.

Amritha introduces a new hardware offload mode in tc/mqprio where the TCs,
the queue configurations and bandwidth rate limits are offloaded to the
hardware. The existing mqprio framework is extended to configure the queue
counts and layout and also added support for rate limiting. This is
achieved through new netlink attributes for the 'mode' option which takes
values such as 'dcb' (default) and 'channel' and a 'shaper' option for
QoS attributes such as bandwidth rate limits in hw mode 1.  Legacy devices
can fall back to the existing setup supporting hw mode 1 without these
additional options where only the TCs are offloaded and then the 'mode'
and 'shaper' options defaults to DCB support.  The i40e driver enables the
new mqprio hardware offload mechanism factoring the TCs, queue
configuration and bandwidth rates by creating HW channel VSIs.
In this new mode, the priority to traffic class mapping and the user
specified queue ranges are used to configure the traffic class when the
'mode' option is set to 'channel'. This is achieved by creating HW
channels(VSI). A new channel is created for each of the traffic class
configuration offloaded via mqprio framework except for the first TC (TC0)
which is for the main VSI. TC0 for the main VSI is also reconfigured as
per user provided queue parameters. Finally, bandwidth rate limits are set
on these traffic classes through the shaper attribute by sending these
rates in addition to the number of TCs and the queue configurations.

Colin Ian King makes an array of constant values "constant".

Alan fixes and issue where on some firmware versions, we were failing to
actually fill out the phy_types which caused ethtool to not report any
link types.  Also hardened against a potentially malicious VF by not
letting the VF to reset itself after requesting to change the number of
queues (via ethtool), let the PF reset the VF to institute the requested
changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'tc-testing-updates'
David S. Miller [Sun, 15 Oct 2017 01:47:53 +0000 (18:47 -0700)]
Merge branch 'tc-testing-updates'

Lucas Bates says:

====================
tc-testing: Test suite updates

This patch series is a roundup of changes to the tc-testing
suite:

 - Add test cases for police and mirred modules and some coverage
   in already-submitted test categories
 - Break the test case files down into more user-friendly sizes
 - Bug fix to the tdc.py script's handling of the -l argument

v2: fix the lack of final newlines in two new files (thanks David)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: fix the -l argument bug in tdc.py
Lucas Bates [Fri, 13 Oct 2017 21:51:25 +0000 (17:51 -0400)]
tc-testing: fix the -l argument bug in tdc.py

This patch fixes a bug in the tdc script, where executing tdc
with the -l argument would cause the tests to start running
as opposed to listing all the known test cases.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: Add test cases for police and skbmod
Lucas Bates [Fri, 13 Oct 2017 21:51:24 +0000 (17:51 -0400)]
tc-testing: Add test cases for police and skbmod

Add basic unit tests for police and skbmod actions in tc.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: Split test case files into smaller chunks
Lucas Bates [Fri, 13 Oct 2017 21:51:23 +0000 (17:51 -0400)]
tc-testing: Split test case files into smaller chunks

The original submission had the test cases stored in one
monolithic file. This can be unwieldy to edit, especially as more
test cases are added. This patch removes the original tests.json
file in favour of individual ones broken down by category.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotc-testing: Add test cases for flushing actions
Lucas Bates [Fri, 13 Oct 2017 21:51:22 +0000 (17:51 -0400)]
tc-testing: Add test cases for flushing actions

Tests for flushing gact and mirred were missing. This patch
adds test cases to explicitly test the flush of any installed
gact/mirred actions.

Signed-off-by: Lucas Bates <lucasb@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'macvlan-cleanups'
David S. Miller [Sun, 15 Oct 2017 01:46:36 +0000 (18:46 -0700)]
Merge branch 'macvlan-cleanups'

Alexander Duyck says:

====================
net: Minor macvlan source mode cleanups

So this patch series is just a few minor cleanups for macvlan source mode.
The first patch addresses double receives when a packet is being routed to
the macvlan destination address, and the other addresses the pkt_type being
updated in cases where it most likely should not be.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomacvlan: Only update pkt_type if destination MAC address matches
Alexander Duyck [Fri, 13 Oct 2017 20:40:31 +0000 (13:40 -0700)]
macvlan: Only update pkt_type if destination MAC address matches

This patch updates the pkt_type to PACKET_HOST only if the destination MAC
address matches on the on the source based macvlan. It didn't make sense to
be updating broadcast, multicast, and non-local destined frames with
PACKET_HOST.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomacvlan: Only deliver one copy of the frame to the macvlan interface
Alexander Duyck [Fri, 13 Oct 2017 20:40:24 +0000 (13:40 -0700)]
macvlan: Only deliver one copy of the frame to the macvlan interface

This patch intoduces a slight adjustment for macvlan to address the fact
that in source mode I was seeing two copies of any packet addressed to the
macvlan interface being delivered where there should have been only one.

The issue appears to be that one copy was delivered based on the source MAC
address and then the second copy was being delivered based on the
destination MAC address. To fix it I am just treating a unicast address
match as though it is not a match since source based macvlan isn't supposed
to be matching based on the destination MAC anyway.

Fixes: 79cf79abce71 ("macvlan: add source mode")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: add a tracepoint for tcp retransmission
Cong Wang [Fri, 13 Oct 2017 20:03:16 +0000 (13:03 -0700)]
tcp: add a tracepoint for tcp retransmission

We need a real-time notification for tcp retransmission
for monitoring.

Of course we could use ftrace to dynamically instrument this
kernel function too, however we can't retrieve the connection
information at the same time, for example perf-tools [1] reads
/proc/net/tcp for socket details, which is slow when we have
a lots of connections.

Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
and exposes src/dst IP addresses and ports of the connection.
This also makes it easier to integrate into perf.

Note, I expose both IPv4 and IPv6 addresses at the same time:
for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
Also, add sk and skb pointers as they are useful for BPF.

1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Brendan Gregg <bgregg@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet_sched: fix a compile warning in act_ife
Cong Wang [Fri, 13 Oct 2017 19:58:13 +0000 (12:58 -0700)]
net_sched: fix a compile warning in act_ife

Apparently ife_meta_id2name() is only called when
CONFIG_MODULES is defined.

This fixes:

net/sched/act_ife.c:251:20: warning: â€˜ife_meta_id2name’ defined but not used [-Wunused-function]
 static const char *ife_meta_id2name(u32 metaid)
                    ^~~~~~~~~~~~~~~~

Fixes: d3f24ba895f0 ("net sched actions: fix module auto-loading")
Cc: Roman Mashak <mrv@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'hv_netvsc-Add-init-of-send-table-and-var-renames'
David S. Miller [Sun, 15 Oct 2017 01:42:56 +0000 (18:42 -0700)]
Merge branch 'hv_netvsc-Add-init-of-send-table-and-var-renames'

Haiyang Zhang says:

====================
hv_netvsc: Add init of send table and var renames

Add initialization of send indirection table. Otherwise it may contain
old info of previous device with different number of channels.

Also, did some variable renaming for easier reading.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Add initialization of tx_table in netvsc_device_add()
Haiyang Zhang [Fri, 13 Oct 2017 19:28:05 +0000 (12:28 -0700)]
hv_netvsc: Add initialization of tx_table in netvsc_device_add()

tx_table is part of the private data of kernel net_device. It is only
zero-ed out when allocating net_device.

We may recreate netvsc_device w/o recreating net_device, so the private
netdev data, including tx_table, are not zeroed. It may contain channel
numbers for the older netvsc_device.

This patch adds initialization of tx_table each time we recreate
netvsc_device.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Rename tx_send_table to tx_table
Haiyang Zhang [Fri, 13 Oct 2017 19:28:04 +0000 (12:28 -0700)]
hv_netvsc: Rename tx_send_table to tx_table

Simplify the variable name: tx_send_table

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Rename ind_table to rx_table
Haiyang Zhang [Fri, 13 Oct 2017 19:28:03 +0000 (12:28 -0700)]
hv_netvsc: Rename ind_table to rx_table

Rename this variable because it is the Receive indirection
table.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: fix missing break in switch and indent return statements
Colin Ian King [Fri, 13 Oct 2017 16:29:00 +0000 (17:29 +0100)]
cxgb4: fix missing break in switch and indent return statements

The break statement for the Macronix case is missing and will
fall through to the Winbond case and re-assign the size setting.
Fix this by adding the missing break statement.  Also correctly
indent the return statements.

Detected by CoverityScan, CID#1458020 ("Missing break in switch")

Fixes: 96ac18f14a5a ("cxgb4: Add support for new flash parts")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'mac80211-next-for-davem-2017-10-13' of git://git.kernel.org/pub/scm/linux...
David S. Miller [Sun, 15 Oct 2017 01:36:46 +0000 (18:36 -0700)]
Merge tag 'mac80211-next-for-davem-2017-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Three fixes for the recently added new code:
 * make "make -s" silent for the certs file (Arnd)
 * fix missing CONFIG_ in extra certs symbol (Arnd)
 * use crypto_aead_authsize() to use the proper API
and two other changes:
 * remove a set-but-unused variable
 * don't track HT *capability* changes, capabilities
   are supposed to be constant (HT operation changes)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'cxgb4-hw-debug-logs'
David S. Miller [Sun, 15 Oct 2017 01:35:15 +0000 (18:35 -0700)]
Merge branch 'cxgb4-hw-debug-logs'

Rahul Lakkireddy says:

====================
cxgb4: add support to get hardware debug logs via ethtool

This series of patches add support to collect hardware debug logs
via ethtool --get-dump facility.

Currently supports:

Memory dumps - Collects on-chip EDC0 and EDC1 dumps.
Hardware dumps - Collects firmware and hardware dumps.

Patch 1 adds ethtool set/get dump data.  It also adds template header
that precedes dump data.  This template header gives additional
information needed for extracting and decoding the collected dump
data.

Patch 2 adds base to collect dumps.  Also collects regdump.

Patch 3 collects on-chip EDC0 and EDC1 memory dumps.

Patch 4 collects firmware mbox log and device log.

Patch 5 updates base API for accessing TP indirect registers.

Patch 6 collects hardware TP module dump.

Patch 7 collects hardware SGE, PCIE, PM, UP CIM, MA, and HMA
module dumps.

Patch 8 collects hardware IBQ and OBQ dump.

Thanks,
Rahul

---
v2:
- Prefix symbols that pollute global namespace in files starting
  with cxgb4_* with "cxgb4_"
- Prefix symbols that pollute global namespace in files starting
  with cudbg_* with "cudbg_"
- Make cudbg_collect_mem_info() static.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect IBQ and OBQ dumps
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:20 +0000 (18:48 +0530)]
cxgb4: collect IBQ and OBQ dumps

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect hardware module dumps
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:19 +0000 (18:48 +0530)]
cxgb4: collect hardware module dumps

Collect SGE, PCIE, PM, UP CIM, MA and HMA dumps.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect TP dump
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:18 +0000 (18:48 +0530)]
cxgb4: collect TP dump

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: update API for TP indirect register access
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:17 +0000 (18:48 +0530)]
cxgb4: update API for TP indirect register access

Try to access TP indirect registers via firmware first.  If this fails,
fallback and access them directly.  This ensures that driver and
firmware do not conflict each other while accessing the TP indirect
registers.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect firmware mbox and device log dump
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:16 +0000 (18:48 +0530)]
cxgb4: collect firmware mbox and device log dump

Collect firmware mbox and device logs before collecting the rest of
the hardware dumps to snap the firmware state before the mailbox logs
are updated by other hardware dumps.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect on-chip memory dump
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:15 +0000 (18:48 +0530)]
cxgb4: collect on-chip memory dump

Collect EDC0 and EDC1 memory dump.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: collect register dump
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:14 +0000 (18:48 +0530)]
cxgb4: collect register dump

Add base to collect dump entities.  Collect register dump and
update template header accordingly.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agocxgb4: implement ethtool dump data operations
Rahul Lakkireddy [Fri, 13 Oct 2017 13:18:13 +0000 (18:48 +0530)]
cxgb4: implement ethtool dump data operations

Implement operations to set/get dump data via ethtool.  Also add
template header that precedes dump data, which helps in decoding
and extracting the dump data.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: Explicitly include linux/bug.h
Mark Brown [Fri, 13 Oct 2017 02:50:35 +0000 (03:50 +0100)]
nfp: Explicitly include linux/bug.h

Today's -next build encountered an error due to a missing definition of
WARN_ON(), caused by some header reorganization removing an implicit
inclusion of linux/bug.h.  Fix this with an explicit inclusion.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'dsa-remove-set_addr'
David S. Miller [Sun, 15 Oct 2017 01:30:07 +0000 (18:30 -0700)]
Merge branch 'dsa-remove-set_addr'

Vivien Didelot says:

====================
net: dsa: remove .set_addr

An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.

If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch.

This won't make sense anymore in a multi-CPU ports system, because there
won't be a unique master device assigned to a switch tree.

Moreover this operation is confusing because it makes the user think
that it could be used to program the switch with the MAC address of the
CPU/management port such that MAC address learning can be disabled on
said port, but in fact, that's not how it is currently used.

To fix this, assign a random MAC address at setup time in the mv88e6060
and mv88e6xxx drivers before removing .set_addr completely from DSA.

Changes in v3:
  - include fix for mv88e6060 switch MAC address setter.

Changes in v2:
  - remove .set_addr implementation from drivers and use a random MAC.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: remove .set_addr
Vivien Didelot [Fri, 13 Oct 2017 18:18:09 +0000 (14:18 -0400)]
net: dsa: remove .set_addr

Now that there is no user for the .set_addr function, remove it from
DSA. If a switch supports this feature (like mv88e6xxx), the
implementation can be done in the driver setup.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: dsa_loop: remove .set_addr
Vivien Didelot [Fri, 13 Oct 2017 18:18:08 +0000 (14:18 -0400)]
net: dsa: dsa_loop: remove .set_addr

The .set_addr function does nothing, remove the dsa_loop implementation
before getting rid of it completely in DSA.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: mv88e6060: setup random mac address
Vivien Didelot [Fri, 13 Oct 2017 18:18:07 +0000 (14:18 -0400)]
net: dsa: mv88e6060: setup random mac address

As for mv88e6xxx, setup the switch from within the mv88e6060 driver with
a random MAC address, and remove the .set_addr implementation.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: mv88e6060: fix switch MAC address
Vivien Didelot [Fri, 13 Oct 2017 18:18:06 +0000 (14:18 -0400)]
net: dsa: mv88e6060: fix switch MAC address

The 88E6060 Ethernet switch always transmits the multicast bit of the
switch MAC address as a zero. It re-uses the corresponding bit 8 of the
register "Switch MAC Address Register Bytes 0 & 1" for "DiffAddr".

If the "DiffAddr" bit is 0, then all ports transmit the same source
address. If it is set to 1, then bit 2:0 are used for the port number.

The mv88e6060 driver is currently wrongly shifting the MAC address byte
0 by 9. To fix this, shift it by 8 as usual and clear its bit 0.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: mv88e6xxx: setup random mac address
Vivien Didelot [Fri, 13 Oct 2017 18:18:05 +0000 (14:18 -0400)]
net: dsa: mv88e6xxx: setup random mac address

An Ethernet switch may support having a MAC address, which can be used
as the switch's source address in transmitted full-duplex Pause frames.

If a DSA switch supports the related .set_addr operation, the DSA core
sets the master's MAC address on the switch. This won't make sense
anymore in a multi-CPU ports system, because there won't be a unique
master device assigned to a switch tree.

Instead, setup the switch from within the Marvell driver with a random
MAC address, and remove the .set_addr implementation.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoatm: fore200e: mark expected switch fall-throughs
Gustavo A. R. Silva [Thu, 12 Oct 2017 21:11:32 +0000 (16:11 -0500)]
atm: fore200e: mark expected switch fall-throughs

In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/mlx5e: IPoIB, Modify rdma netdev allocate and free to support PKEY
Alex Vesker [Thu, 14 Sep 2017 15:22:50 +0000 (18:22 +0300)]
net/mlx5e: IPoIB, Modify rdma netdev allocate and free to support PKEY

Resources such as FT, QPN HT and mdev resources should be allocated
only by parent netdev. Shared resources are allocated and freed by the
parent interface since the parent is always present and created
before the IPoIB PKEY sub-interface.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agonet/mlx5e: IPoIB, Add PKEY child interface ethtool ops
Alex Vesker [Thu, 14 Sep 2017 15:02:31 +0000 (18:02 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface ethtool ops

Similar to VLAN interfaces child interfaces have limited ethtool
support. In current code the main limitation that does not
allow child interface ethtool configuration is due to shared
resources which are managed by the parent.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agonet/mlx5e: IPoIB, Add PKEY child interface ndos
Alex Vesker [Thu, 14 Sep 2017 13:33:35 +0000 (16:33 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface ndos

Child interface ndos will be called to support child interface
specific behaviour.

ndo_init flow:
-Acquire shared QPN to net-device HT from parent
-Continue with the same flow as parent interface

ndo_open flow:
-Initialize child underlay QP and connect to shared FT
-Create child send TIS
-Open child send channels

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agonet/mlx5e: IPoIB, Add PKEY child interface nic profile
Alex Vesker [Thu, 14 Sep 2017 11:08:39 +0000 (14:08 +0300)]
net/mlx5e: IPoIB, Add PKEY child interface nic profile

Child interface profile will be called to support child interface
specific behaviour. The child code is sparse compared to the parent
since the RX channels are shared between the interfaces.
Creating a septate profile for child and parent will make a smother
code with a better ability for future expansion.
The profile stuct is exposed to the parent using a getter function.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agonet/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev
Alex Vesker [Thu, 14 Sep 2017 07:27:25 +0000 (10:27 +0300)]
net/mlx5e: IPoIB, Use hash-table to map between QPN to child netdev

This change is needed for PKEY support, since the RQs are shared
between the child interface and the parent. The parent is responsible
for NAPI and the precessing of RX completions. Using the dqpn in the
completion descriptor we set the corresponding child IPoIB netdevice
on the SKB.
The mapping between the dqpn and the netdevice is done using a HT,
each mlx5 IPoIB interface registers its mapping on creation.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agonet/mlx5e: IPoIB, Support for setting PKEY index to underlay QP
Alex Vesker [Wed, 13 Sep 2017 09:17:50 +0000 (12:17 +0300)]
net/mlx5e: IPoIB, Support for setting PKEY index to underlay QP

Added a function to set PKEY index to IPoIB device driver using the
already present set_id function. PKEY index is attached to the QP
during state modification.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agoIB/ipoib: Add ability to set PKEY index to lower device driver
Alex Vesker [Wed, 14 Jun 2017 19:18:48 +0000 (22:18 +0300)]
IB/ipoib: Add ability to set PKEY index to lower device driver

To support passing child interfaces to the lower device a new
rdma_netdev function was used, set_id. This will allow us to
attach the PKEY index lower device resources such as TIS/QP.
For devices that do not support offloads in IPoIB same logic
will be used, setting the PKEY index to priv struct.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
6 years agoIB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop
Alex Vesker [Tue, 10 Oct 2017 07:36:41 +0000 (10:36 +0300)]
IB/ipoib: Grab rtnl lock on heavy flush when calling ndo_open/stop

When ndo_open and ndo_stop are called RTNL lock should be held.
In this specific case ipoib_ib_dev_open calls the offloaded ndo_open
which re-sets the number of TX queue assuming RTNL lock is held.
Since RTNL lock is not held, RTNL assert will fail.

Signed-off-by: Alex Vesker <valex@mellanox.com>
6 years agonet/mlx5: Support for attaching multiple underlay QPs to root flow table
Alex Vesker [Wed, 13 Sep 2017 08:37:02 +0000 (11:37 +0300)]
net/mlx5: Support for attaching multiple underlay QPs to root flow table

Previous support allowed connecting only a single QPN to the FT.
Now using a linked list multiple QPNs can be attached to the same FT.

Supporting attaching multiple underlay QPs is required for PKEY
support in which child and parent share the same FT.

The actual attaching/detaching FW commands will be called inside the
function symmetrically.

This change requires a change in IPoIB open and close functions, the
attaching/detaching to/from the FT is done each time we open/close.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
6 years agonet/mlx5e: IPoIB, Move underlay QP init/uninit to separate functions
Alex Vesker [Tue, 12 Sep 2017 11:11:29 +0000 (14:11 +0300)]
net/mlx5e: IPoIB, Move underlay QP init/uninit to separate functions

During the creation of the underlay QP the PKEY index is unknown, the
PKEY index is known only when calling ndo_open.
PKEY index attached to the QP during state modification.

Splitting the functions will also make the code symmetric and more
readable. This split is also required for later PKEY support to be
called with the PKEY index during ndo_open.

Signed-off-by: Alex Vesker <valex@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
6 years agonet/mlx5: PTP code migration to driver core section
Feras Daoud [Tue, 15 Aug 2017 10:46:04 +0000 (13:46 +0300)]
net/mlx5: PTP code migration to driver core section

PTP code is moved to core section of mlx5 driver in order to share
it between ethernet and infiniband. This movement involves the following
changes:
- Change mlx5e_ prefix to be mlx5_
- Add clock structs to Core
- Add clock object to mlx5_core_dev
- Call Init/Uninit clock from core init/cleanup
- Rename mlx5e_tstamp to be mlx5_clock

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agonet/mlx5: File renaming towards ptp core implementation
Feras Daoud [Mon, 14 Aug 2017 08:23:27 +0000 (11:23 +0300)]
net/mlx5: File renaming towards ptp core implementation

en_clock.c renamed clock.c and moved to lib/ as first step
towards relocating code to core part of the driver to allow
sharing between Ethernet and Infiniband.

Signed-off-by: Feras Daoud <ferasda@mellanox.com>
Signed-off-by: Eitan Rabin <rabin@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
6 years agoMerge branch 'nfp-bpf-support-direct-packet-access'
David S. Miller [Sat, 14 Oct 2017 18:13:29 +0000 (11:13 -0700)]
Merge branch 'nfp-bpf-support-direct-packet-access'

Jakub Kicinski says:

====================
nfp: bpf: support direct packet access

The core of this series is direct packet access support.  With a
small change to the verifier, the offloaded code can now make
use of DPA.  We need to be careful to use kernel (after initial
translation) offsets in our JIT.  Direct packet access also brings
us to the problem of eBPF endianness.  After considering the
changes necessary we decided to not support translation on both
BE and LE hosts, for now.

This series contains two fixes - one for compare instructions and
one for ineffective jne optimization.  I chose to include fixes
in this set because the code in -net works only with unreleased
PoC FW (ABI version 1) and therefore nobody outside of Netronome
can exercise it anyway.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: support direct packet access in TC
Jakub Kicinski [Thu, 12 Oct 2017 17:34:18 +0000 (10:34 -0700)]
nfp: bpf: support direct packet access in TC

Add support for direct packet access in TC, note that because
writing the packet will cause the verifier to generate a csum
fixup prologue we won't be able to offload packet writes from
TC, just yet, only the reads will work.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: direct packet access - write
Jakub Kicinski [Thu, 12 Oct 2017 17:34:17 +0000 (10:34 -0700)]
nfp: bpf: direct packet access - write

This patch adds ability to write packet contents using pre-validated
packet pointers (direct packet access).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: add support for direct packet access - read
Jakub Kicinski [Thu, 12 Oct 2017 17:34:16 +0000 (10:34 -0700)]
nfp: bpf: add support for direct packet access - read

In direct packet access bound checks are already done, we can
simply dereference the packet pointer.

Verifier/parser logic needs to record pointer type.  Note that
although verifier does protect us from CTX vs other pointer
changes we will also want to differentiate between PACKET vs
MAP_VALUE or STACK, so we can add the check already.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: separate I/O from checks for legacy data load
Jakub Kicinski [Thu, 12 Oct 2017 17:34:15 +0000 (10:34 -0700)]
nfp: bpf: separate I/O from checks for legacy data load

Move data load into a separate function and separate it from
packet length checks of legacy I/O.  This makes the code more
readable and easier to reuse.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: fix context accesses
Jakub Kicinski [Thu, 12 Oct 2017 17:34:14 +0000 (10:34 -0700)]
nfp: bpf: fix context accesses

Sizes of fields in struct xdp_md/xdp_buff and some in sk_buff depend
on target architecture.  Take that into account and use struct xdp_buff,
not struct xdp_md.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: support BPF offload only on little endian
Jakub Kicinski [Thu, 12 Oct 2017 17:34:13 +0000 (10:34 -0700)]
nfp: bpf: support BPF offload only on little endian

eBPF is host-endian specific.  Translating both BE and LE eBPF
to the NFP is feasible, but would require quite a bit of indirection.
The fact that I don't have access to any BE hosts that would fit
a 25G/40G/100G NIC is also limiting my ability to test big endian.

For now restrict the offload to little endian hosts only.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: implement byte swap instruction
Jakub Kicinski [Thu, 12 Oct 2017 17:34:12 +0000 (10:34 -0700)]
nfp: bpf: implement byte swap instruction

Implement byte swaps with rotations, shifts and byte loads.
Remember to clear upper parts of the 64 bit registers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: add mov helper
Jakub Kicinski [Thu, 12 Oct 2017 17:34:11 +0000 (10:34 -0700)]
nfp: bpf: add mov helper

Register move operation is encoded as alu no op.  This means
that one has to specify number of unused/none parameters to
the emit_alu().  Add a helper.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: fix compare instructions
Jakub Kicinski [Thu, 12 Oct 2017 17:34:10 +0000 (10:34 -0700)]
nfp: bpf: fix compare instructions

Now that we have BPF assemebler support in LLVM 6 we can easily
test all compare instructions (LLVM 4 didn't generate most of them
from C).  Fix the compare to immediates and refactor the order
of compare to regs to make sure they both follow the same pattern.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: add missing return in jne_imm optimization
Jakub Kicinski [Thu, 12 Oct 2017 17:34:09 +0000 (10:34 -0700)]
nfp: bpf: add missing return in jne_imm optimization

We optimize comparisons to immediate 0 as if (reg.lo | reg.hi).
The early return statement was missing, however, which means we
would generate two comparisons - optimized one followed by a
normal 2x 32 bit compare.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: bpf: reorder arguments to emit_ld_field_any()
Jakub Kicinski [Thu, 12 Oct 2017 17:34:08 +0000 (10:34 -0700)]
nfp: bpf: reorder arguments to emit_ld_field_any()

ld_field instruction has the following format in NFP assembler:

  ld_field[dst, 1000, src, <<24]

reoder parameters to emit_ld_field_any() to make it closer to
the familiar assembler order.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobpf: verifier: set reg_type on context accesses in second pass
Jakub Kicinski [Thu, 12 Oct 2017 17:34:07 +0000 (10:34 -0700)]
bpf: verifier: set reg_type on context accesses in second pass

Use a simplified is_valid_access() callback when verifier
is used for program analysis by non-host JITs.  This allows
us to teach the verifier about packet start and packet end
offsets for direct packet access.

We can extend the callback as needed but for most packet
processing needs there isn't much more the offloads may
require.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'stmmac-Improvements-for-multi-queuing-and-for-AVB'
David S. Miller [Sat, 14 Oct 2017 18:12:08 +0000 (11:12 -0700)]
Merge branch 'stmmac-Improvements-for-multi-queuing-and-for-AVB'

Jose Abreu says:

====================
net: stmmac: Improvements for multi-queuing and for AVB

Two improvements for stmmac: First one corrects the available fifo
size per queue, second one corrects enabling of AVB queues. More info
in commit log.

Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Changes from v1:
- Fix typo in second patch
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
6 years agonet: stmmac: Disable flow ctrl for RX AVB queues and really enable TX AVB queues
Jose Abreu [Fri, 13 Oct 2017 09:58:37 +0000 (10:58 +0100)]
net: stmmac: Disable flow ctrl for RX AVB queues and really enable TX AVB queues

Flow control must be disabled for AVB enabled queues and TX
AVB queues must be enabled by setting BIT(2) of TXQEN.

Correct this by passing the queue mode to DMA callbacks
and by checking in these functions wether we are in AVB
performing the necessary adjustments.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: stmmac: Use correct values in TQS/RQS fields
Jose Abreu [Fri, 13 Oct 2017 09:58:36 +0000 (10:58 +0100)]
net: stmmac: Use correct values in TQS/RQS fields

Currently we are using all the available fifo size in RQS and
TQS fields. This will not work correctly in multi-queues IP's
because total fifo size must be splitted to the enabled queues.

Correct this by computing the available fifo size per queue and
setting the right value in TQS and RQS fields.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoicmp: don't fail on fragment reassembly time exceeded
Matteo Croce [Thu, 12 Oct 2017 14:12:37 +0000 (16:12 +0200)]
icmp: don't fail on fragment reassembly time exceeded

The ICMP implementation currently replies to an ICMP time exceeded message
(type 11) with an ICMP host unreachable message (type 3, code 1).

However, time exceeded messages can either represent "time to live exceeded
in transit" (code 0) or "fragment reassembly time exceeded" (code 1).

Unconditionally replying to "fragment reassembly time exceeded" with
host unreachable messages might cause unjustified connection resets
which are now easily triggered as UFO has been removed, because, in turn,
sending large buffers triggers IP fragmentation.

The issue can be easily reproduced by running a lot of UDP streams
which is likely to trigger IP fragmentation:

  # start netserver in the test namespace
  ip netns add test
  ip netns exec test netserver

  # create a VETH pair
  ip link add name veth0 type veth peer name veth0 netns test
  ip link set veth0 up
  ip -n test link set veth0 up

  for i in $(seq 20 29); do
      # assign addresses to both ends
      ip addr add dev veth0 192.168.$i.1/24
      ip -n test addr add dev veth0 192.168.$i.2/24

      # start the traffic
      netperf -L 192.168.$i.1 -H 192.168.$i.2 -t UDP_STREAM -l 0 &
  done

  # wait
  send_data: data send error: No route to host (errno 113)
  netperf: send_omni: send_data failed: No route to host

We need to differentiate instead: if fragment reassembly time exceeded
is reported, we need to silently drop the packet,
if time to live exceeded is reported, maintain the current behaviour.
In both cases increment the related error count "icmpInTimeExcds".

While at it, fix a typo in a comment, and convert the if statement
into a switch to mate it more readable.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoi40e/i40evf: don't trust VF to reset itself
Alan Brady [Wed, 11 Oct 2017 21:49:43 +0000 (14:49 -0700)]
i40e/i40evf: don't trust VF to reset itself

When using 'ethtool -L' on a VF to change number of requested queues
from PF, we shouldn't trust the VF to reset itself after making the
request.  Doing it that way opens the door for a potentially malicious
VF to do nasty things to the PF which should never be the case.

This makes it such that after VF makes a successful request, PF will
then reset the VF to institute required changes.  Only if the request
fails will PF send a message back to VF letting it know the request was
unsuccessful.

Testing-hints:
There should be no real functional changes.  This is simply hardening
against a potentially malicious VF.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
6 years agoi40e: fix link reporting
Alan Brady [Wed, 11 Oct 2017 21:49:42 +0000 (14:49 -0700)]
i40e: fix link reporting

When querying the NVM for supported phy_types, on some firmware
versions, we were failing to actually fill out the phy_types which means
ethtool wouldn't report any link types.

Testing-hints:
Check 'ethtool <iface>' if you have the right (wrong?) firmware.
Without this patch, no link modes will be reported.

Signed-off-by: Alan Brady <alan.brady@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
6 years agoi40e: make const array patterns static, reduces object code size
Colin Ian King [Fri, 22 Sep 2017 14:11:38 +0000 (15:11 +0100)]
i40e: make const array patterns static, reduces object code size

Don't populate const array patterns on the stack, instead make it
static. Makes the object code smaller by over 60 bytes:

Before:
   text    data     bss     dec     hex filename
   1953     496       0    2449     991 i40e_diag.o

After:
   text    data     bss     dec     hex filename
   1798     584       0    2382     94e i40e_diag.o

(gcc 6.3.0, x86-64)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
6 years agoi40e: Add support setting TC max bandwidth rates
Amritha Nambiar [Thu, 7 Sep 2017 11:00:32 +0000 (04:00 -0700)]
i40e: Add support setting TC max bandwidth rates

This patch enables setting up maximum Tx rates for the traffic
classes in i40e. The maximum rate is offloaded to the hardware through
the mqprio framework by specifying the mode option as 'channel' and
shaper option as 'bw_rlimit' and is configured for the VSI. Configuring
minimum Tx rate limit is not supported in the device. The minimum
usable value for Tx rate is 50Mbps.

Example:
# tc qdisc add dev eth0 root mqprio num_tc 2  map 0 0 0 0 1 1 1 1\
  queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
  max_rate 4Gbit 5Gbit

To dump the bandwidth rates:
# tc qdisc show dev eth0

qdisc mqprio 804a: root  tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
             queues:(0:3) (4:7)
             mode:channel
             shaper:bw_rlimit   max_rate:4Gbit 5Gbit

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>