git.proxmox.com Git - mirror_ubuntu-bionic-kernel.git/log

net: dsa: mv88e6xxx: add PPU operations

Some Marvell chips can enable/disable the PPU on demand. This is needed
to access the PHY registers when there is no indirection mechanism.

Add two new ppu_enable and ppu_disable ops to describe this and finally
get rid of the MV88E6XXX_FLAG_PPU* flags.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add a soft reset operation

Marvell chips have different way to issue a software reset.

Old chips (such as 88E6060) have a reset bit in an ATU control register.

Newer chips moved this bit in a Global control register. Chips with
controllable PPU should reset the PPU when resetting the switch.

Add a new reset operation to implement these differences and introduce a
mv88e6xxx_software_reset() helper to wrap it conveniently.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add helper to hardware reset

Add an helper to toggle the eventual GPIO connected to the reset pin.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: add helper to disable ports

Before resetting a switch, the ports should be set to the Disabled state
and the transmit queues should be drained.

Add an helper to explicit that.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Alacritech-SLIC-driver'

Lino Sanfilippo says:

====================
Gigabit ethernet driver for Alacritechs SLIC devices (v4)

this is the forth version of the slicoss gigabit ethernet driver (which is a
rework of the driver from Alacritech which can currently be found under
drivers/staging/slicoss). The driver is supposed to support Mojave, Oasis and
Kalahari cards, for both copper and fiber.

If this code is accepted the staging version can be removed.

The driver has been tested on a SEN2104ET adapter (4 Port PCIe copper).

v4:
- fix wrong driver name in Kconfig file (reported by Rami Rosen)
- remove unused variable from driver struct (reported by Rami Rosen)
- return "err" instead of 0 in slic_load_rcvseq_firmware() (reported by Rami Rosen)
- Fix typos in constants, comments and error message (reported by Markus Böhme)
- fix various warnings concerning signedness (reported by Markus Böhme)
- improve line formatting (reported by Markus Böhme)
- add comment describing the need for SLIC_MAX_TX_COMPLETIONS (suggested by Florian Fainelli)
- do not zero out complete rx descriptor (suggested by Florian Fainelli)
- add missing write barrier (reported by Florian Fainelli)
- remove unneeded assignment of net_device to skb (reported by Florian Fainelli)
- use napi_complete_done() instead of napi_complete (suggested by Florian Fainelli)
- use napi_schedule_irqoff() instead of napi_schedule (suggested by Florian Fainelli)
- do not map error returned by slic_init() to -ENOMEM
- do proper dma syncs before and after rx descriptor status is set to 0
- if after dma sync for CPU rx descriptor is not used return it to HW by means of dma sync for device

v3:
- dont add defines to pci_ids.h but instead put it into the drivers header file
(requested by Greg Kroah-Hartman)

v2:
- remove unusual padding in statistic strings (suggested by Andrew Lunn)
- for mdio register and bit names use defines from mii.h instead of own ones
(suggested by Andrew Lunn)
- remove unused defines
- ensure PCI flush at two more places
- use mmiowb before lock to prevent mmio writes leaking out of lock
- fix some typos in comments
- add copyright and GPL header
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

MAINTAINERS: add entry for slicoss ethernet driver

Add myself as maintainer for the slicoss ethernet driver.

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: slicoss: add slicoss gigabit ethernet driver

Add driver for Alacritech gigabit ethernet cards with SLIC (session-layer
interface control) technology. The driver provides basic support without
SLIC for the following devices:

- Mojave cards (single port PCI Gigabit) both copper and fiber
- Oasis cards (single and dual port PCI-x Gigabit) copper and fiber
- Kalahari cards (dual and quad port PCI-e Gigabit) copper and fiber

Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/udp: do not touch skb->peeked unless really needed

In UDP recvmsg() path we currently access 3 cache lines from an skb
while holding receive queue lock, plus another one if packet is
dequeued, since we need to change skb->next->prev

1st cache line (contains ->next/prev pointers, offsets 0x00 and 0x08)
2nd cache line (skb->len & skb->peeked, offsets 0x80 and 0x8e)
3rd cache line (skb->truesize/users, offsets 0xe0 and 0xe4)

skb->peeked is only needed to make sure 0-length packets are properly
handled while MSG_PEEK is operated.

I had first the intent to remove skb->peeked but the "MSG_PEEK at
non-zero offset" support added by Sam Kumar makes this not possible.

This patch avoids one cache line miss during the locked section, when
skb->len and skb->peeked do not have to be read.

It also avoids the skb_set_peeked() cost for non empty UDP datagrams.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hix5hd2_gmac-txsg-reset-clock-control'

Dongpo Li says:

====================
net: hix5hd2_gmac: add tx sg feature and reset/clock control signals

The "hix5hd2" is SoC name, add the generic ethernet driver compatible string.
The "hisi-gemac-v1" is the basic version and "hisi-gemac-v2" adds
the SG/TXCSUM/TSO/UFO features.
This patch set only adds the SG(scatter-gather) driver for transmitting,
the drivers of other features will be submitted later.

Add the MAC reset control signals and clock signals.
We make these signals optional to be backward compatible with
the hix5hd2 SoC.

Changes in v2:
- Make the compatible string changes be a separate patch and
the most specific string come first than the generic string
as advised by Rob.
- Make the MAC reset control signals and clock signals optional
to be backward compatible with the hix5hd2 SoC.
- Change the compatible string and give the clock a specific name
in hix5hd2 dts file.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: dts: hix5hd2: add gmac generic compatible and clock names

Add gmac generic compatible and clock names.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hix5hd2_gmac: add reset control and clock signals

Add three reset control signals, "mac_core_rst", "mac_ifc_rst" and
"phy_rst".
The following diagram explained how the reset signals work.

                        SoC
|-----------------------------------------------------
|                               ------                |
|                               | cpu |               |
|                               ------                |
|                                  |                  |
|                              ------------ AMBA bus  |
|                         GMAC     |                  |
|                            ----------------------   |
| ------------- mac_core_rst | --------------      |  |
| |clock and   |-------------->|   mac core  |     |  |
| |reset       |             | --------------      |  |
| |generator   |----         |       |             |  |
| -------------     |        | ----------------    |  |
|          |        ---------->| mac interface |   |  |
|          |     mac_ifc_rst | ----------------    |  |
|          |                 |       |             |  |
|          |                 | ------------------  |  |
|          |phy_rst          | | RGMII interface | |  |
|          |                 | ------------------  |  |
|          |                 ----------------------   |
|----------|------------------------------------------|
           |                          |
           |                      ----------
           |--------------------- |PHY chip |
                                  ----------

The "mac_core_rst" represents "mac core reset signal", it resets
the mac core including packet processing unit, descriptor processing unit,
tx engine, rx engine, control unit.
The "mac_ifc_rst" represents "mac interface reset signal", it resets
the mac interface. The mac interface unit connects mac core and
data interface like MII/RMII/RGMII. After we set a new value of
interface mode, we must reset mac interface to reload the new mode value.
The "mac_core_rst" and "mac_ifc_rst" are both optional to be
backward compatible with the hix5hd2 SoC.
The "phy_rst" represents "phy reset signal", it does a hardware reset
on the PHY chip. This reset signal is optional if the PHY can work well
without the hardware reset.

Add one more clock signal, the existing is MAC core clock,
and the new one is MAC interface clock.
The MAC interface clock is optional to be backward compatible with
the hix5hd2 SoC.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hix5hd2_gmac: add tx scatter-gather feature

"hisi-gemac-v2" adds the SG/TXCSUM/TSO/UFO features.
This patch only adds the SG(scatter-gather) driver for transmitting,
the drivers of other features will be submitted later.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hix5hd2_gmac: add generic compatible string

The "hix5hd2" is SoC name, add the generic ethernet driver name.
The "hisi-gemac-v1" is the basic version and "hisi-gemac-v2" adds
the SG/TXCSUM/TSO/UFO features.

Signed-off-by: Dongpo Li <lidongpo@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Use EDSA on mv88e6097

Use DSA_TAG_PROTO_EDSA as tag_protocol for the mv88e6097. The
initialisation was missing before.

Fixes: a1f482aa8c33 ("net: dsa: mv88e6xxx: Move the tagging protocol into info")
Signed-off-by: Stefan Eichenberger <stefan.eichenberger@netmodule.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: add additional verifier tests for BPF_PROG_TYPE_LWT_*

- direct packet read is allowed for LWT_*
- direct packet write for LWT_IN/LWT_OUT is prohibited
- direct packet write for LWT_XMIT is allowed
- access to skb->tc_classid is prohibited for LWT_*

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

tools: hv: Enable network manager for bonding scripts on RHEL

We found network manager is necessary on RHEL to make the synthetic
NIC, VF NIC bonding operations handled automatically. So, enabling
network manager here.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: calxeda: xgmac: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bpf-prog-digest'

Daniel Borkmann says:

====================
Minor BPF cleanups and digest

First two patches are minor cleanups, and the third one adds
a prog digest. For details, please see individual patches.
After this one, I have a set with tracepoint support that makes
use of this facility as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: add prog_digest and expose it via fdinfo/netlink

When loading a BPF program via bpf(2), calculate the digest over
the program's instruction stream and store it in struct bpf_prog's
digest member. This is done at a point in time before any instructions
are rewritten by the verifier. Any unstable map file descriptor
number part of the imm field will be zeroed for the hash.

fdinfo example output for progs:

  # cat /proc/1590/fdinfo/5
  pos:          0
  flags:        02000002
  mnt_id:       11
  prog_type:    1
  prog_jited:   1
  prog_digest:  b27e8b06da22707513aa97363dfb11c7c3675d28
  memlock:      4096

When programs are pinned and retrieved by an ELF loader, the loader
can check the program's digest through fdinfo and compare it against
one that was generated over the ELF file's program section to see
if the program needs to be reloaded. Furthermore, this can also be
exposed through other means such as netlink in case of a tc cls/act
dump (or xdp in future), but also through tracepoints or other
facilities to identify the program. Other than that, the digest can
also serve as a base name for the work in progress kallsyms support
of programs. The digest doesn't depend/select the crypto layer, since
we need to keep dependencies to a minimum. iproute2 will get support
for this facility.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf, cls: consolidate prog deletion path

Commit 18cdb37ebf4c ("net: sched: do not use tcf_proto 'tp' argument from
call_rcu") removed the last usage of tp from cls_bpf_delete_prog(), so also
remove it from the function as argument to not give a wrong impression. tp
is illegal to access from this callback, since it could already have been
freed.

Refactor the deletion code a bit, so that cls_bpf_destroy() can call into
the same code for prog deletion as cls_bpf_delete() op, instead of having
it unnecessarily duplicated.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: remove type arg from __is_valid_{,xdp_}access

Commit d691f9e8d440 ("bpf: allow programs to write to certain skb
fields") pushed access type check outside of __is_valid_access()
to have different restrictions for socket filters and tc programs.
type is thus not used anymore within __is_valid_access() and should
be removed as a function argument. Same for __is_valid_xdp_access()
introduced by 6a773a15a1e8 ("bpf: add XDP prog type for early driver
filter").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ethoc-next'

Florian Fainelli says:

====================
net: ethoc: Misc improvements

This patch series fixes/improves a few things:

- implement a proper PHYLIB adjust_link callback to set the duplex mode
accordingly
- do not open code the fetching of a MAC address in OF/DT environments
- demote an error message that occurs more frequently than expected in low
CPU/memory/bandwidth environments

Tested on a Cirrus Logic EP93xx / TS7300 board.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethoc: Demote packet dropped error message to debug

Spamming the console with: net eth1: packet dropped can happen
fairly frequently if the adapter is busy transmitting, demote the
message to a debug print.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethoc: Utilize of_get_mac_address()

Do not open code getting the MAC address exclusively from the
"local-mac-address" property, but instead use of_get_mac_address() which
looks up the MAC address using the 3 typical property names.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethoc: Account for duplex changes

ethoc_mdio_poll() which is our PHYLIB adjust_link callback does nothing,
we should at least react to duplex changes and change MODER accordingly.
Speed changes is not a problem, since the OpenCores Ethernet core seems
to be reacting okay without us telling it.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Thierry Reding <thierry.reding@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: gen_estimator: complete rewrite of rate estimators

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
  gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: cls_flower: Set the filter Hardware device for all use-cases

Check if the returned device from tcf_exts_get_dev function supports tc
offload and in case the rule can't be offloaded, set the filter hw_dev
parameter to the original device given by the user.

The filter hw_device parameter should always be set by fl_hw_replace_filter
function, since this pointer is used by dump stats and destroy
filter for each flower rule (offloaded or not).

Fixes: 7091d8c7055d ('net/sched: cls_flower: Add offload support using egress Hardware device')
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Allow IPv4-mapped address as next-hop

Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.

It is possible to configure IP interfaces with IPv4-mapped addresses, and
one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
to this fix the kernel returned an EINVAL when attempting to add an IPv6
route with an IPv4-mapped address as a nexthop/gateway.

RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
thus in order to support that type of address configuration the kernel
needs to allow IPv4-mapped addresses as nexthops.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: Preserve const register type on const OR alu ops

Occasionally, clang (e.g. version 3.8.1) translates a sum between two
constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
currently not handling this scenario, and the destination register type
becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
the destination register cannot be used as argument to a helper function
expecting a ARG_CONST_STACK_*, limiting some use cases.

Modify the verifier to handle this case, and add a few tests to make sure
all combinations are supported, and stack boundaries are still verified
even with BPF_OR.

Signed-off-by: Gianluca Borello <g.borello@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: Add support for restarting auto-negotiation

Implement ethtooll::nway_restart by utilizing mii_nway_restart.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2016-12-03

Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10
kernel):

- Fix for a potential NULL deref in the ieee802154 netlink code
- Fix for the ED values of the at86rf2xx driver
- Documentation updates to ieee802154
- Cleanups to u8 vs __u8 usage
- Timer API usage cleanups in HCI drivers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-tsq-perf'

Eric Dumazet says:

====================
tcp: tsq: performance series

Under very high TX stress, CPU handling NIC TX completions can spend
considerable amount of cycles handling TSQ (TCP Small Queues) logic.

This patch series avoids some atomic operations, but most notable
patch is the 3rd one, allowing other cpus processing ACK packets and
calling tcp_write_xmit() to grab TCP_TSQ_DEFERRED so that
tcp_tasklet_func() can skip already processed sockets.

This avoid lots of lock acquisitions and cache lines accesses,
particularly under load.

In v2, I added :

- tcp_small_queue_check() change to allow 1st and 2nd packets
  in write queue to be sent, even in the case TX completion of
    already acknowledged packets did not happen yet.
      This helps when TX completion coalescing parameters are set
        even to insane values, and/or busy polling is used.

- A reorganization of struct sock fields to
  lower false sharing and increase data locality.

- Then I moved tsq_flags from tcp_sock to struct sock also
  to reduce cache line misses during TX completions.

I measured an overall throughput gain of 22 % for heavy TCP use
over a single TX queue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: move tsq_flags close to sk_wmem_alloc

tsq_flags being in the same cache line than sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: reorganize struct sock for better data locality

Group fields used in TX path, and keep some cache lines mostly read
to permit sharing among cpus.

Gained two 4 bytes holes on 64bit arches.

Added a place holder for tcp tsq_flags, next to sk_wmem_alloc
to speed up tcp_wfree() in the following patch.

I have not added ____cacheline_aligned_in_smp, this might be done later.
I prefer doing this once inet and tcp/udp sockets reorg is also done.

Tested with both TCP and UDP.

UDP receiver performance under flood increased by ~20 % :
Accessing sk_filter/sk_wq/sk_napi_id no longer stalls because sk_drops
was moved away from a critical cache line, now mostly read and shared.

/* --- cacheline 4 boundary (256 bytes) --- */
unsigned int               sk_napi_id;           /* 0x100   0x4 */
int                        sk_rcvbuf;            /* 0x104   0x4 */
struct sk_filter *         sk_filter;            /* 0x108   0x8 */
union {
struct socket_wq * sk_wq;                /*         0x8 */
struct socket_wq * sk_wq_raw;            /*         0x8 */
};                                               /* 0x110   0x8 */
struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
int                        sk_sndbuf;            /* 0x13c   0x4 */
/* --- cacheline 5 boundary (320 bytes) --- */
int                        sk_wmem_queued;       /* 0x140   0x4 */
atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
__s32                      sk_peek_off;          /* 0x170   0x4 */
int                        sk_write_pending;     /* 0x174   0x4 */
long int                   sk_sndtimeo;          /* 0x178   0x8 */

Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tcp_mtu_probe() is likely to exit early

Adding a likely() in tcp_mtu_probe() moves its code which used to
be inlined in front of tcp_write_xmit()

We still have a cache line miss to access icsk->icsk_mtup.enabled,
we will probably have to reorganize fields to help data locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add a shortcut in tcp_small_queue_check()

Always allow the two first skbs in write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: avoid one atomic in tcp_wfree()

Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We can schedule it only if our per cpu list was empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add shortcut in tcp_tasklet_func()

Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give a chance
for other cpus/threads entering tcp_write_xmit() to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give to ACK processing an increased
budget to reduce even more tcp_tasklet_func() amount of work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: remove one locked operation in tcp_wfree()

Instead of atomically clear TSQ_THROTTLED and atomically set TSQ_QUEUED
bits, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: tsq: add tsq_flags / tsq_enum

This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnxt_en-dcbnl'

Michael Chan says:

====================
bnxt_en: Add DCBNL support.

This series adds DCBNL operations to support host-based IEEE DCBX.

v2: Updated to the latest firmware interface spec.

David, please consider this series for net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Add PFC statistics.

Report PFC statistics to ethtool -S and DCBNL.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Implement DCBNL to support host-based DCBX.

Support only IEEE DCBX initially. Add IEEE DCBNL ops and functions to
get and set the hardware DCBX parameters. The DCB code is conditional on
Kconfig CONFIG_BNXT_DCB.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Update firmware header file to latest 1.6.0.

Latest interface has the latest DCB command structs. Get and store the
max number of lossless TCs the hardware can support.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnxt_en: Re-factor bnxt_setup_tc().

Add a new function bnxt_setup_mq_tc() to handle MQPRIO. This new function
will be called during ETS setup when we add DCBNL in the next patch.

Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: dp83848: Support ethernet pause frames

According to the documentation, the PHYs supported by this driver
can also support pause frames. Announce this to be so.
Tested with a TI83822I.

Acked-by: Andrew F. Davis <afd@ti.com>
Signed-off-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6 addrconf: Implemented enhanced DAD (RFC7527)

Implemented RFC7527 Enhanced DAD.
IPv6 duplicate address detection can fail if there is some temporary
loopback of Ethernet frames. RFC7527 solves this by including a random
nonce in the NS messages used for DAD, and if an NS is received with the
same nonce it is assumed to be a looped back DAD probe and is ignored.
RFC7527 is enabled by default. Can be disabled by setting both of
conf/{all,interface}/enhanced_dad to zero.

Signed-off-by: Erik Nordmark <nordmark@arista.com>
Signed-off-by: Bob Gilligan <gilligan@arista.com>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6390-batch-three'

Andrew Lunn says:

====================
mv88e6390 batch 3

More patches to support the MV88e6390. This is mostly refactoring
existing code and adding implementations for the mv88e6390. This
patchset set which reserved frames are sent to the cpu, the size of
jumbo frames that will be accepted, turn off egress rate limiting, and
configuration of pause frames.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Implement mv88e6390 pause control

The mv88e6390 has a number flow control registers accessed via the
Flow Control register. Use these to set the pause control.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor pause configuration

The mv88e6390 has a different mechanism for configuring pause.
Refactor the code into an ops function, and for the moment, don't add
any mv88e6390 code yet.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor egress rate limiting

There are two different rate limiting configurations, depending on the
switch generation. Refactor this into ops.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor setting of jumbo frames

Some switches support jumbo frames. Refactor this code into operations
in the ops structure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Reserved Management frames to CPU

Older devices have a couple of registers in global2. The mv88e6390
family has a single register in global1 behind which hides similar
configuration. Implement and op for this.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6390-batch-two'

Andrew Lunn says:

====================
MV88E6390 batch two

This is the second batch of patches adding support for the
MV88e6390. They are not sufficient to make it work properly.

The mv88e6390 has a much expanded set of priority maps. Refactor the
existing code, and implement basic support for the new device.

Similarly, the monitor control register has been reworked.

The mv88e6390 has something odd in its EDSA tagging implementation,
which means it is not possible to use it. So we need to use DSA
tagging. This is the first device with EDSA support where we need to
use DSA, and the code does not support this. So two patches refactor
the existing code. The two different register definitions are
separated out, and using DSA on an EDSA capable device is added.

v2:
Add port prefix
Add helper function for 6390
Add _IEEE_ into #defines
Split monitor_ctrl into a number of separate ops.
Remove 6390 code which is management, used in a later patch
s/EGREES/EGRESS/.
Broke up setup_port_dsa() and set_port_dsa() into a number of ops

v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Refactor CPU and DSA port setup

Older chips only support DSA tagging. Newer chips have both DSA and
EDSA tagging. Refactor the code by adding port functions for setting the
frame mode, egress mode, and if to forward unknown frames.

This results in the helper mv88e6xxx_6065_family() becoming unused, so
remove it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
v3:
Verify mandatory ops for port setup
Don't set ether type for DSA port.
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Move the tagging protocol into info

Older chips support a single tagging protocol, DSA. New chips support
both DSA and EDSA, an enhanced version. Having both as an option
changes the register layouts. Up until now, it has been assumed that
if EDSA is supported, it will be used. Hence the register layout has
been determined by which protocol should be used. However, mv88e6390
has a different implementation of EDSA, which requires we need to use
the DSA tagging. Hence separate the selection of the protocol from the
register layout.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Monitor and Management tables

The mv88e6390 changes the monitor control register into the Monitor
and Management control, which is an indirection register to various
registers.

Add ops to set the CPU port and the ingress/egress port for both
register layouts, to global1

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Implement mv88e6390 tag remap

The mv88e6390 does not have the two registers to set the frame
priority map. Instead it has an indirection registers for setting a
number of different priority maps. Refactor the old code into an
function, implement the mv88e6390 version, and use an op to call the
right one.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fib-notifier-event-replay'

Jiri Pirko says:

====================
ipv4: fib: Replay events when registering FIB notifier

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.

The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case dump was inconsistent.

---
v3->v4:
- Register the notification block after the dump and protect it using
  the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
  the sysctl to set maximum number of retries and instead set it to a
  fixed number. Lets see if it's really a problem before adding something
  we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.

v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Replay events when registering FIB notifier

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (f.e., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends a ENTRY_ADD
notification following ENTRY_DEL generated by another process holding
RTNL.

Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.

The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
In case current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Allow for consistent FIB dumping

The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Convert FIB notification chain to be atomic

In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Register FIB notifier before creating ports

We can miss FIB notifications sent between the time the ports were
created and the FIB notification block registered.

Instead of receiving these notifications only when they are replayed for
the FIB notification block during registration, just register the
notification block before the ports are created.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Implement FIB offload in deferred work

Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous commits.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: Create an ordered workqueue for FIB offload

As explained in the previous commits, we need to process FIB entries
addition / deletion events in FIFO order or otherwise we can have a
mismatch between the kernel's FIB table and the device's.

Create an ordered workqueue for rocker to which these work items will be
submitted to.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Implement FIB offload in deferred work

FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.

Instead, defer the operation to process context using deferred work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.

Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.

The work items are submitted to an ordered workqueue to prevent a
mismatch between the kernel's FIB table and the device's.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: core: Create an ordered workqueue for FIB offload

We're going to start processing FIB entries addition / deletion events
in deferred work. These work items must be processed in the order they
were submitted or otherwise we can have differences between the kernel's
FIB table and the device's.

Solve this by creating an ordered workqueue to which these work items
will be submitted to. Note that we can't simply convert the current
workqueue to be ordered, as EMADs re-transmissions are also processed in
deferred work.

Later on, we can migrate other work items to this workqueue, such as FDB
notification processing and nexthop resolution, since they all take the
same lock anyway.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Add fib_info_hold() helper

As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().

Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: fib: Export free_fib_info()

The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

act_mirred: fix a typo in get_dev

Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
Cc: Hadar Hen Zion <hadarh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-12-02

This series contains updates to i40e and i40evf only.

Alex provides changes so that we are much more robust about defining what
we can and cannot offload in i40e and i40evf by doing additional checks
other than L4 tunnel header length.

Jake provides several fixes/changes, first cleaning up a label that is
unnecessary, as well as cleaned up the use of a "magic number".  Clarified
the code by separating the global private flags and the regular private
flags per interface into two arrays, so that future additions will not
produce duplication and buggy code.  Adds additional checks to protect
against NULL values for msix_entries and q_vectors pointers.

Michal adds Clause22 method for accessing registers for some external
PHYs.

Piotr adds additional protocol support for the admin queue discover
capabilities function.

Tushar Dave fixes a panic seen on SPARC, where writel() should not be
used to write directly to a memory address but only to a memory mapped
I/O address otherwise it causes data access exceptions.

Joe Perches separates out a section of code into its own function, to
help reduce i40evf_reset_task() a bit.

Alan fixes an issue by checking for NULL before dereferencing msix_entries
and returning early in the case where it is NULL within the i40evf_close()
code path.

Henry provides code cleanup to remove unreachable and redundant sections
of code.  Fixed up an issue where new NICs were not identifying "unknown
PHYs" correctly.

Harshitha fixes a issue where the ethtool "Supported Link" modes list
backplane interfaces on X722 devices for 10 GbE with SFP+ and Cortina
retimer, where these interfaces should not be visible to the user since
they cannot use them.

Carolyn changes an X722 informational message so that it only appears
when extra messages are desired.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS

The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
new header for avr32, causing build to break. The patch fixes it.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: be less conservative with sock rmem accounting

Before commit 850cbaddb52d ("udp: use it's own memory accounting
schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
the rcvbuf by the whole current packet's truesize. After said commit
we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
is empty. As reported by Jesper this cause a performance regression
for some (small) values of rcvbuf.

This commit is intended to fix the regression restoring the old
handling of the rcvbuf limit.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: gen_estimator: account for timer drifts

Under heavy stress, timer used in estimators tend to slowly be delayed
by a few jiffies, leading to inaccuracies.

Lets remember what was the last scheduled jiffies so that we get more
precise estimations, without having to add a multiply/divide in the loop
to account for the drifts.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead

Logically, EFX_BUG_ON_PARANOID can never be correct. For, BUG_ON should
only be used if it is not possible to continue without potential harm;
and since the non-DEBUG driver will continue regardless (as the BUG_ON is
compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
or the newly defined EFX_WARN_ON_ONCE_PARANOID.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'samples-bpf-automated-cgroup-tests'

Sargun Dhillon says:

====================
samples, bpf: Refactor; Add automated tests for cgroups

These two patches are around refactoring out some old, reusable code from the
existing test_current_task_under_cgroup_user test, and adding a new, automated
test.

There is some generic cgroupsv2 setup & cleanup code, given that most
environment still don't have it setup by default. With this code, we're able
to pretty easily add an automated test for future cgroupsv2 functionality.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

samples, bpf: Add automated test for cgroup filter attachments

This patch adds the sample program test_cgrp2_attach2. This program is
similar to test_cgrp2_attach, but it performs automated testing of the
cgroupv2 BPF attached filters. It runs the following checks:
* Simple filter attachment
* Application of filters to child cgroups
* Overriding filters on child cgroups
* Checking that this still works when the parent filter is removed

The filters that are used here are simply allow all / deny all filters, so
it isn't checking the actual functionality of the filters, but rather
the behaviour around detachment / attachment. If net_cls is enabled,
this test will fail.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples, bpf: Refactor test_current_task_under_cgroup - separate out helpers

This patch modifies test_current_task_under_cgroup_user. The test has
several helpers around creating a temporary environment for cgroup
testing, and moving the current task around cgroups. This set of
helpers can then be used in other tests.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

samples/bpf: silence compiler warnings

silence some of the clang compiler warnings like:
include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension
since they add too much noise to samples/bpf/ build.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: fix net_generic() "id - 1" bloat

net_generic() function is both a) inline and b) used ~600 times.

It has the following code inside

...
ptr = ng->ptr[id - 1];
...

"id" is never compile time constant so compiler is forced to subtract 1.
And those decrements or LEA [r32 - 1] instructions add up.

We also start id'ing from 1 to catch bugs where pernet sybsystem id
is not initialized and 0. This is quite pointless idea (nothing will
work or immediate interference with first registered subsystem) in
general but it hints what needs to be done for code size reduction.

Namely, overlaying allocation of pointer array and fixed part of
structure in the beginning and using usual base-0 addressing.

Ids are just cookies, their exact values do not matter, so lets start
with 3 on x86_64.

Code size savings (oh boy): -4.2 KB

As usual, ignore the initial compiler stupidity part of the table.

add/remove: 0/0 grow/shrink: 12/670 up/down: 89/-4297 (-4208)
function                                     old     new   delta
tipc_nametbl_insert_publ                    1250    1270     +20
nlmclnt_lookup_host                          686     703     +17
nfsd4_encode_fattr                          5930    5941     +11
nfs_get_client                              1050    1061     +11
register_pernet_operations                   333     342      +9
tcf_mirred_init                              843     849      +6
tcf_bpf_init                                1143    1149      +6
gss_setup_upcall                             990     994      +4
idmap_name_to_id                             432     434      +2
ops_init                                     274     275      +1
nfsd_inject_forget_client                    259     260      +1
nfs4_alloc_client                            612     613      +1
tunnel_key_walker                            164     163      -1

...

tipc_bcbase_select_primary                   392     360     -32
mac80211_hwsim_new_radio                    2808    2767     -41
ipip6_tunnel_ioctl                          2228    2186     -42
tipc_bcast_rcv                               715     672     -43
tipc_link_build_proto_msg                   1140    1089     -51
nfsd4_lock                                  3851    3796     -55
tipc_mon_rcv                                1012     956     -56
Total: Before=156643951, After=156639743, chg -0.00%

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: add dummy struct inside "struct net_generic"

This is precursor to fixing "[id - 1]" bloat inside net_generic().

Name "s" is chosen to complement name "u" often used for dummy unions.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: publish net_generic correctly

Publishing net_generic pointer is done with silly mistake: new array is
published BEFORE setting freshly acquired pernet subsystem pointer.

memcpy
rcu_assign_pointer
kfree_rcu
ng->ptr[id - 1] = data;

This bug was introduced with commit dec827d174d7f76c457238800183ca864a639365
("[NETNS]: The generic per-net pointers.") in the glorious days of
chopping networking stack into containers proper 8.5 years ago (whee...)

How it didn't trigger for so long?
Well, you need quite specific set of conditions:

*) race window opens once per pernet subsystem addition
   (read: modprobe or boot)

*) not every pernet subsystem is eligible (need ->id and ->size)

*) not every pernet subsystem is vulnerable (need incorrect or absense
   of ordering of register_pernet_sybsys() and actually using net_generic())

*) to hide the bug even more, default is to preallocate 13 pointers which
   is actually quite a lot. You need IPv6, netfilter, bridging etc together
   loaded to trigger reallocation in the first place. Trimmed down
   config are OK.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: 2-clause nla_ok()

nla_ok() consists of 3 clauses:

1) int rem >= (int)sizeof(struct nlattr)

2) u16 nla_len >= sizeof(struct nlattr)

3) u16 nla_len <= int rem

The statement is that clause (1) is redundant.

What it does is ensuring that "rem" is a positive number,
so that in clause (3) positive number will be compared to positive number
with no problems.

However, "u16" fully fits into "int" and integers do not change value
when upcasting even to signed type. Negative integers will be rejected
by clause (3) just fine. Small positive integers will be rejected
by transitivity of comparison operator.

NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
u32(!), so 3 clauses AND A CAST TO INT are necessary.

Obligatory space savings report: -1.6 KB

$ ./scripts/bloat-o-meter ../vmlinux-000* ../vmlinux-001*
add/remove: 0/0 grow/shrink: 3/63 up/down: 35/-1692 (-1657)
function                                     old     new   delta
validate_scan_freqs                          142     155     +13
tcf_em_tree_validate                         867     879     +12
dcbnl_ieee_del                               328     338     +10
netlbl_cipsov4_add_common.isra               218     215      -3
...
ovs_nla_put_actions                          888     806     -82
netlbl_cipsov4_add_std                      1648    1566     -82
nl80211_parse_sched_scan                    2889    2780    -109
ip_tun_from_nlattr                          3086    2945    -141

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

staging: wilc1000: use reset to set mac header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

iwlwifi: use reset to set transport header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlx4: use reset to set mac header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: use reset to set network header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qede: use reset to set network header

Since offset is zero, it's not necessary to use set function. Reset
function is straightforward, and will remove the unnecessary add
operation in set function.

Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'xgene-jumbo-and-pause-frame'

Iyappan Subramanian says:

====================
drivers: net: xgene: Add Jumbo and Pause frame support

This patch set adds,

1. Jumbo frame support
2. Pause frame based flow control

and fixes RSS for non-TCP/UDP packets.
====================

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>

drivers: net: xgene: ethtool: Add get/set_pauseparam

This patch adds get_pauseparam and set_pauseparam functions.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add flow control initialization

This patch adds flow control/pause frame initialization and
advertising capabilities.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add flow control configuration

This patch adds functions to configure mac, when flow control
and pause frame settings change.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: fix: RSS for non-TCP/UDP

This patch fixes RSS feature, for non-TCP/UDP packets.

Signed-off-by: Khuong Dinh <kdinh@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add change_mtu function

This patch implements ndo_change_mtu() callback function that
enables mtu change.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add support for Jumbo frame

This patch adds support for jumbo frame, by allocating
additional buffer (page) pool and configuring the hardware.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Configure classifier with pagepool

This patch configures classifier with the pagepool information.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers: net: xgene: Add helper function

This is a prepartion patch and adds xgene_enet_get_fpsel() helper
function to get buffer pool number.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: davinci_cpdma: add missing EXPORTs

As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
ARM allmodconfig builds would fail modpost with:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

Since these weren't declared as static, it is assumed they were
meant to be shared outside the file, and that modular build testing
was simply overlooked.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-4.10-20161201' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2016-12-01

this is a pull request of 4 patches for net-next/master.

There are two patches by Chris Paterson for the rcar_can and rcar_canfd
device tree binding documentation. And a patch by Geert Uytterhoeven
that corrects the order of interrupt specifiers.

The fourth patch by Colin Ian King fixes a spelling error in the
kvaser_usb driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: unify mdio functions

stmmac_mdio_{read|write} and stmmac_mdio_{read|write}_gmac4 are not
enought different for being split.
The only differences between thoses two functions are shift/mask for
addr/reg/clk_csr.

This patch introduce a per platform set of variable for setting thoses
shift/mask and unify mdio read and write functions.

Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>