git.proxmox.com Git - mirror_ubuntu-bionic-kernel.git/log

siphash: implement HalfSipHash1-3 for hash tables

HalfSipHash, or hsiphash, is a shortened version of SipHash, which
generates 32-bit outputs using a weaker 64-bit key. It has *much* lower
security margins, and shouldn't be used for anything too sensitive, but
it could be used as a hashtable key function replacement, if the output
is never exposed, and if the security requirement is not too high.

The goal is to make this something that performance-critical jhash users
would be willing to use.

On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so we alias
SipHash1-3 to HalfSipHash1-3 on those systems.

64-bit x86_64:
[    0.509409] test_siphash:     SipHash2-4 cycles: 4049181
[    0.510650] test_siphash:     SipHash1-3 cycles: 2512884
[    0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
[    0.512904] test_siphash:    JenkinsHash cycles:  978267
So, we map hsiphash() -> SipHash1-3

32-bit x86:
[    0.509868] test_siphash:     SipHash2-4 cycles: 14812892
[    0.513601] test_siphash:     SipHash1-3 cycles:  9510710
[    0.515263] test_siphash: HalfSipHash1-3 cycles:  3856157
[    0.515952] test_siphash:    JenkinsHash cycles:  1148567
So, we map hsiphash() -> HalfSipHash1-3

hsiphash() is roughly 3 times slower than jhash(), but comes with a
considerable security improvement.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

siphash: add cryptographically secure PRF

SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.

For the first usage:

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then preceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash usage
poses a real security risk.

For the second usage:

A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster and more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
obvious and straight-forward, and so is submitted along with this patch
series. There shouldn't be much of a debate over its efficacy.

Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch-up.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: remove disable of bottom half in inet_rtm_getroute

Nothing about the route lookup requires bottom half to be disabled.
Remove the local_bh_disable ... local_bh_enable around ip_route_input.
This appears to be a vestige of days gone by as it has been there
since the beginning of git time.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: intel: e100: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ibmvnic: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ibmveth: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: emac: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ehea: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: change init_inodecache() return void

sock_init() call it but not check it's return value,
so change it to void return and add an internal BUG_ON() check.

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tc-skb-diet'

Willem de Bruijn says:

====================
convert tc_verd to integer bitfields

The skb tc_verd field takes up two bytes but uses far fewer bits.
Convert the remaining use cases to bitfields that fit in existing
holes (depending on config options) and potentially save the two
bytes in struct sk_buff.

This patchset is based on an earlier set by Florian Westphal and its
discussion (http://www.spinics.net/lists/netdev/msg329181.html).

Patches 1 and 2 are low hanging fruit: removing the last traces of
  data that are no longer stored in tc_verd.

Patches 3 and 4 convert tc_verd to individual bitfields (5 bits).

Patch 5 reduces TC_AT to a single bitfield,
  as AT_STACK is not valid here (unlike in the case of TC_FROM).

Patch 6 changes TC_FROM to two bitfields with clearly defined purpose.

It may be possible to reduce storage further after this initial round.
If tc_skip_classify is set only by IFB, testing skb_iif may suffice.
The L2 header pushing/popping logic can perhaps be shared with
AF_PACKET, which currently not pkt_type for the same purpose.

Changes:
  RFC -> v1
    - (patch 3): remove no longer needed label in tfc_action_exec
    - (patch 5): set tc_at_ingress at the same points as existing
                 SET_TC_AT calls

Tested ingress mirred + netem + ifb:

  ip link set dev ifb0 up
  tc qdisc add dev eth0 ingress
  tc filter add dev eth0 parent ffff: \
    u32 match ip dport 8000 0xffff \
    action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root netem delay 1000ms
  nc -u -l 8000 &
  ssh $otherhost nc -u $host 8000

Tested egress mirred:

  ip link add veth1 type veth peer name veth2
  ip link set dev veth1 up
  ip link set dev veth2 up
  tcpdump -n -i veth2 udp and dst port 8000 &

  tc qdisc add dev eth0 root handle 1: prio
  tc filter add dev eth0 parent 1:0 \
    u32 match ip dport 8000 0xffff \
    action mirred egress redirect dev veth1
  tc qdisc add dev veth1 root netem delay 1000ms
  nc -u $otherhost 8000

Tested ingress mirred:

  ip link add veth1 type veth peer name veth2
  ip link add veth3 type veth peer name veth4

  ip netns add ns0
  ip netns add ns1

  for i in 1 2 3 4; do \
    NS=ns$((${i}%2)); \
    ip link set dev veth${i} netns ${NS}; \
    ip netns exec ${NS} \
      ip addr add dev veth${i} 192.168.1.${i}/24; \
    ip netns exec ${NS} \
      ip link set dev veth${i} up; \
  done

  ip netns exec ns0 tc qdisc add dev veth2 ingress
  ip netns exec ns0 \
    tc filter add dev veth2 parent ffff: \
      u32 match ip dport 8000 0xffff \
      action mirred ingress redirect dev veth4

  ip netns exec ns0 \
    tcpdump -n -i veth4 udp and dst port 8000 &
  ip netns exec ns1 \
    nc -u 192.168.1.2 8000
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_from to tc_from_ingress and tc_redirected

The tc_from field fulfills two roles. It encodes whether a packet was
redirected by an act_mirred device and, if so, whether act_mirred was
called on ingress or egress. Split it into separate fields.

The information is needed by the special IFB loop, where packets are
taken out of the normal path by act_mirred, forwarded to IFB, then
reinjected at their original location (ingress or egress) by IFB.

The IFB device cannot use skb->tc_at_ingress, because that may have
been overwritten as the packet travels from act_mirred to ifb_xmit,
when it passes through tc_classify on the IFB egress path. Cache this
value in skb->tc_from_ingress.

That field is valid only if a packet arriving at ifb_xmit came from
act_mirred. Other packets can be crafted to reach ifb_xmit. These
must be dropped. Set tc_redirected on redirection and drop all packets
that do not have this bit set.

Both fields are set only on cloned skbs in tc actions, so original
packet sources do not have to clear the bit when reusing packets
(notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_at to tc_at_ingress

Field tc_at is used only within tc actions to distinguish ingress from
egress processing. A single bit is sufficient for this purpose.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_verd to integer bitfields

Extract the remaining two fields from tc_verd and remove the __u16
completely. TC_AT and TC_FROM are converted to equivalent two-bit
integer fields tc_at and tc_from. Where possible, use existing
helper skb_at_tc_ingress when reading tc_at. Introduce helper
skb_reset_tc to clear fields.

Not documenting tc_from and tc_at, because they will be replaced
with single bit fields in follow-on patches.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: extract skip classify bit from tc_verd

Packets sent by the IFB device skip subsequent tc classification.
A single bit governs this state. Move it out of tc_verd in
anticipation of removing that __u16 completely.

The new bitfield tc_skip_classify temporarily uses one bit of a
hole, until tc_verd is removed completely in a follow-up patch.

Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
With that many options, little value in documenting it.

Introduce a helper function to deduplicate the logic in the two
sites that check this bit.

The field tc_skip_classify is set only in IFB on skbs cloned in
act_mirred, so original packet sources do not have to clear the
bit when reusing packets (notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: make MAX_RECLASSIFY_LOOP local

This field is no longer kept in tc_verd. Remove it from the global
definition of that struct.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: remove unused tc_verd fields

Remove the last reference to tc_verd's munge and redirect ttl bits.
These fields are no longer used.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mdio: Demote print from info to debug in mdio_device_register

While it is useful to know which MDIO device is being registered, demote
the dev_info() to a dev_dbg().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove useless memset's in drivers get_stats64

In dev_get_stats() the statistic structure storage has already been
zeroed. Therefore network drivers do not need to call memset() again.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: make ndo_get_stats64 a void function

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2017-01-08

This series contains updates to fm10k only.

Ngai-Mint changes the driver to use the MAC pointer in the fm10k_mac_info
structure for fm10k_get_host_state_generic().  Fixed a race condition
where the mailbox interrupt request bits can be cleared before being
handled causing certain mailbox messages from the PF to be untreated
and the PF will enter in some inactive state.

Jake removes the typecast of u8 to char, and the extra variable that was
created for the typecast.  Bumps the driver version.  Added back the
receive descriptor timestamp value so that applications built on top
of the IES API can function properly.  Cleaned up the debug statistics
flag, since debug statistics were removed and the flag was missed in
the removal.

Scott limits the DMA sync for CPU to the actual length of the packet,
instead of the entire buffer, since the DMA sync occurs every time a
packet is received.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Remove flow arg from ip_mkroute_input

fl4 arg is not used; remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipmr: Remove nowait arg to ipmr_get_route

ipmr_get_route has 1 caller and the nowait arg is 0. Remove the arg and
simplify ipmr_get_route accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: simplify octeon_flush_iq()

Because every call to octeon_flush_iq() has a hardcoded 1 for the
pending_thresh argument, simplify that function by removing that argument.
This avoids one atomic read as well.

Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fm10k: remove FM10K_FLAG_DEBUG_STATS

The debug statistics were removed due to complications with the ethtool
statistics API which are not possible to resolve without a new
statistics interface. The flag was left behind, but we no longer need
it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: report the receive timestamp in FM10K_CB(skb)->tstamp

This was accidentally removed when we defeatured the full 1588 Clock
support. We need to report the Rx descriptor timestamp value so that
applications built on top of the IES API can function properly.

Additionally, remove the FM10K_FLAG_RX_TS_ENABLED, as it is not used now
that 1588 functionality has been removed.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: Limit dma sync of RX buffers to actual packet size

On packet RX, we perform a dma sync for cpu before passing the
packet up. Here we limit that sync to the actual length of the
incoming packet, rather than always syncing the entire buffer.

Signed-off-by: Scott Peterson <scott.d.peterson@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: bump version number

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: do not clear global mailbox interrupt bits

Partially revert commit 5e93cbadd3e9 ("fm10k: Reset mailbox global
interrupts", 2016-06-07)

The register bits related to this commit are now solely being handled by
the IES API. Recent changes in the IES API will allow an automatic
recovery from improper handling of these bits.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: request reset when mbx->state changes

Multiple IES API resets can cause a race condition where the mailbox
interrupt request bits can be cleared before being handled. This can
leave certain mailbox messages from the PF to be untreated and the PF
will enter in some inactive state. If this situation occurs, the IES API
will initiate a mailbox version reset which, then, trigger a mailbox
state change. Once this mailbox transition occurs (from OPEN to CONNECT
state), a request for reset will be returned.

This ensures that PF will undergo a reset whenever IES API encounters an
unknown global mailbox interrupt event or whenever the IES API
terminates.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: remove extraneous variable definition in fm10k_ethtool.c

We don't need to typecast a u8 * into a char *, so just remove the extra
variable.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k-shared: use mac-> instead of hw->mac.

Since a pointer "mac" to fm10k_mac_info structure exists, use it to
access the contents of its members.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

net: dsa: move HWMON support to its own file

Isolate the HWMON support in DSA in its own file. Currently only the
legacy DSA code is concerned.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'netcp-next'

Murali Karicheri says:

====================
netcp: enhancements and minor fixes

This series is for net-next. This propagates enhancements and minor
bug fixes from internal version of the driver to keep the upstream
in sync. Please review and apply if this looks good.

Tested on all of K2HK/E/L boards with nfs rootfs.
Test logs below
K2HK-EVM: http://pastebin.ubuntu.com/23754106/
k2L-EVM: http://pastebin.ubuntu.com/23754143/
K2E-EVM: http://pastebin.ubuntu.com/23754159/

History:
  v1 - dropped 1/10 amd 2/10 of v0 based on comments from Rob as
       it needs more work before submission
  v0 - Initial version
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: add proper ale entry mask bits for netcp switch ALE

For NetCP NU Switch ALE, some of the mask bits are different than
defaults used in the driver. Add a new macro DEFINE_ALE_FIELD1 that use
a configurable mask bits and use it in the driver. These bits are set to
correct values by using the new variables added to cpsw_ale structure
and re-used in the macros. The parameter nu_switch_ale is configured by
the caller driver to indicate the ALE is for that switch and is used in
the ALE driver to do customization as needed.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: use ale_status to size the ale table

ALE h/w on newer version of NetCP (K2E/L/G) does provide a ALE_STATUS
register for the size of the ALE Table implemented in h/w. Currently
for example we set ALE Table size to 1024 for NetCP ALE on
K2E even though the ALE Status/Documentation shows it has 8192 entries.
So take advantage of this register to read the size of ALE table supported
and use that value in the driver for the newer version of NetCP ALE.
For NetCP lite, ALE Table size is much less (64) and indicated by a size
of zero in ALE_STATUS. So use that as a default for now. While at it,
also fix the ale table size on 10G switch to 2048 per User guide
http://www.ti.com/lit/ug/spruhj5/spruhj5.pdf

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: update to support unknown vlan controls for NU switch

In NU Ethernet switch used on some of the Keystone SoCs, there is
separate UNKNOWNVLAN register for membership, unreg mcast flood, reg
mcast flood and force untag egress bits in ALE. So control for these
fields require different address offset, shift and size of field.
As this ALE has the same version number as ALE in CPSW found on other
SoCs, customization based on version number is not possible. So
use a configuration parameter, nu_switch_ale, to identify the ALE
ALE found in NU Switch. Different treatment is needed for NU Switch
ALE due to difference in the ale table bits, separate unknown vlan
registers etc. The register information available in ale_controls,
needs to be updated to support the netcp NU switch h/w. So it is not
constant array any more since it needs to be updated based
on ALE type. The header of the file is also updated to indicate it
supports N port switch ALE, not just 3 port. The version mask is
3 bits in NU Switch ALE vs 8 bits on other ALE types.

While at it, change the debug print to info print so that ALE
version gets displayed in boot log.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: use hw capability to remove FCS word from rx packets

Some of the newer Ethernet switch hw (such as that on k2e/l/g) can
strip the Etherenet FCS from packet at the port 0 egress of the switch.
So use this capability instead of doing it in software.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ethss: get phy-handle only if link interface is MAC-to-PHY

Currently to parse phy-handle, driver doesn't check if the interface is
MAC to PHY. This patch add this check for all MAC to PHY interface types
supported by the driver.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: store network statistics in 64 bits

Previously the network statistics were stored in 32 bit variable
which can cause some stats to roll over after several minutes of
high traffic. This implements 64 bit storage so larger numbers
can be stored.

Signed-off-by: Michael Scherban <m-scherban@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: remove the redundant memmov()

The psdata is populated with command data by netcp modules
to the tail of the buffer and set_words() copy the same
to the front of the psdata. So remove the redundant memmov
function call.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: extract eflag from desc for rx_hook handling

Extract the eflag bits from the received desc and pass it down
the rx_hook chain to be available for netcp modules. Also the
psdata and epib data has to be inspected by the netcp modules.
So the desc can be freed only after returning from the rx_hook.
So move knav_pool_desc_put() after the rx_hook processing.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cpsw-cpdma-DDR'

Grygorii Strashko says:

====================
net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR

This series intended to add support for placing CPDMA descriptors into
DDR by introducing new module parameter "descs_pool_size" to specify
size of descriptor's pool. The "descs_pool_size" defines total number
of CPDMA CPPI descriptors to be used for both ingress/egress packets
processing. If not specified - the default value 256 will be used
which will allow to place descriptor's pool into the internal CPPI
RAM.

In addition, added ability to re-split CPDMA pool of descriptors
between RX and TX path via ethtool '-G' command wich will allow to
configure and fix number of descriptors used by RX and TX path, which,
then, will be split between RX/TX channels proportionally depending on
number of RX/TX channels and its weight.

This allows significantly to reduce UDP packets drop rate for
bandwidth >301 Mbits/sec (am57x).

Before enabling this feature, the am437x SoC has to be fixed as it's
proved that it's not working when CPDMA descriptors placed in DDR.
So, the patch 1 fixes this issue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: DT: net: cpsw: remove no_bd_ram property

Even if no_bd_ram property is described in TI CPSW bindings the support for
it has never been introduced in CPSW driver, so there are no real users of
it. Hence, remove no_bd_ram property from documentation and DT files.

Cc: 'Rob Herring <robh+dt@kernel.org>'
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: add support for ringparam configuration

The CPDMA uses one pool of descriptors for both RX and TX which by default
split between all channels proportionally depending on total number of
CPDMA channels and number of TX and RX channels. As result, more
descriptors will be consumed by TX path if there are more TX channels and
there is no way now to dedicate more descriptors for RX path.

So, add the ability to re-split CPDMA pool of descriptors between RX and TX
path via ethtool '-G' command wich will allow to configure and fix number
of descriptors used by RX and TX path, which, then, will be split between
RX/TX channels proportionally depending on RX/TX channels number and
weight. ethtool '-G' command will accept only number of RX entries and rest
of descriptors will be arranged for TX automatically.

Command:
  ethtool -G <devname> rx <number of descriptors>

defaults and limitations:
- minimum number of rx descriptors is 10% of total number of descriptors in
  CPDMA pool
- maximum number of rx descriptors is 90% of total number of descriptors in
  CPDMA pool
- by default, descriptors will be split equally between RX/TX path
- any values passed in "tx" parameter will be ignored

Usage:

# ethtool -g eth0
Pre-set maximums:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             0
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096

# ethtool -G eth0 rx 7372
# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             0
Current hardware settings:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             820

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: add support for descs pool size configuration

The CPSW CPDMA can process buffer descriptors placed as in internal
CPPI RAM as in DDR. This patch adds support in CPSW and CPDMA for
descs_pool_size mudule parameter, which defines total number of CPDMA CPPI
descriptors to be used for both ingress/egress packets processing:
- memory size, required for CPDMA descriptor pool, is calculated basing
on number of descriptors specified by user in descs_pool_size and
CPDMA descriptor size and allocated from coherent memory (CMA area);
- CPDMA descriptor pool will be allocated in DDR if pool memory size >
internal CPPI RAM or use internal CPPI RAM otherwise;
- if descs_pool_size not specified in DT - the default value 256 will
be used which will allow to place CPDMA descriptors pool into the
internal CPPI RAM (current default behaviour);
- CPDMA will ignore descs_pool_size if descs_pool_size = 0 for
backward comaptiobility with davinci_emac.

descs_pool_size is boot time setting and can't be changed once
CPSW/CPDMA is initialized.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: use devm_ioremap

Use devm_ioremap() and simplify the code.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()

Update cpdma_desc_pool_create/destroy() to accept only one parameter
struct cpdma_ctlr*, as this structure contains all required
information for pool creation/destruction.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: fix desc re-queuing

The currently processing cpdma descriptor with EOQ flag set may
contain two values in Next Descriptor Pointer field:
- valid pointer: means CPDMA missed addition of new desc in queue;
- null: no more descriptors in queue.
In the later case, it's not required to write to HDP register, but now
CPDMA does it.

Hence, add additional check for Next Descriptor Pointer != null in
cpdma_chan_process() function before writing in HDP register.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr

It's observed that cpsw/cpdma is not working properly when CPPI
descriptors are placed in DDR instead of internal CPPI RAM on am437x
SoC:
- rx/tx silently stops processing packets;
- or - after boot it's working for sometime, but stuck once Network
load is increased (ping is working, but iperf is not).
(The same issue has not been reproduced on am335x and am57xx).

It seems that write to HDP register processed faster by interconnect
than writing of descriptor memory buffer in DDR, which is probably
caused by store buffer / write buffer differences as these functions
are implemented differently across devices. So, to fix this i come up
with two minimal, required changes:

1) all accesses to the channel register HDP/CP/RXFREE registers should
be done using sync IO accessors readl()/writel(), because all previous
memory writes writes have to be completed before starting channel
(write to HDP) or completing desc processing.

2) the change 1 only doesn't work on am437x and additional reading of
desc's field is required right after the new descriptor was filled
with data and before pointer on it will be stored in
prev_desc->hw_next field or HDP register.

In addition, to above changes this patch eliminates all relaxed ordering
I/O accessors in this driver as suggested by David Miller to avoid such
kind of issues in the future, but with one exception - relaxed IO accessors
will still be used to fill desc in cpdma_chan_submit(), which is safe as
there is read barrier at the end of write sequence, and because sync IO
accessors usage here will affect on net performance.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tp-cleanup-socket-lookup-code'

Guillaume Nault says:

====================
l2tp: cleanup socket lookup code in l2tp_ip and l2tp_ip6

First three patches remove redundant tests and add missing "const"
qualifiers.

Fourth patch splits the conditionals found in __l2tp_ip*_bind_lookup(),
to make these functions easier to review. In the process, I found that
some corner cases were still not handled properly. So I've added the
missing tests in this patch too, because they're pretty simple and the
whole "if" statements are modified anyway.

I expect it to be easier to review this way. If not, I can split up
patch #4, post the missing tests separately to -net, and later repost
this series as pure cleanup. Just let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: rework socket comparison in __l2tp_ip*_bind_lookup()

Split conditions, so that each test becomes clearer.

Also, for l2tp_ip, check if "laddr" is 0. This prevents a socket from
binding to the unspecified address when other sockets are already bound
using the same device (if any), connection ID and namespace.

Same thing for l2tp_ip6: add ipv6_addr_any(laddr) and
ipv6_addr_any(raddr) tests to ensure that an IPv6 unspecified address
passed as parameter is properly treated a wildcard.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove useless NULL check in __l2tp_ip*_bind_lookup()

If "l2tp" was NULL, that'd mean "sk" is NULL too. This can't happen
since "sk" is returned by sk_for_each_bound().

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: make __l2tp_ip*_bind_lookup() parameters 'const'

Add const qualifier wherever possible for __l2tp_ip_bind_lookup() and
__l2tp_ip6_bind_lookup().

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove redundant addr_len check in l2tp_ip_bind()

addr_len's value has already been verified at this point.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

RDS: validate the requested traces user input against max supported

Larger than supported value can lead to array read/write overflow.

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rxrpc-rewrite-20170106' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
afs: Implement bulk read

This pair of patches implements bulk data reading from an AFS server.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: prepare asoc stream for stream reconf

sctp stream reconf, described in RFC 6525, needs a structure to
save per stream information in assoc, like stream state.

In the future, sctp stream scheduler also needs it to save some
stream scheduler params and queues.

This patchset is to prepare the stream array in assoc for stream
reconf. It defines sctp_stream that includes stream arrays inside
to replace ssnmap.

Note that we use different structures for IN and OUT streams, as
the members in per OUT stream will get more and more different
from per IN stream.

v1->v2:
  - put these patches into a smaller group.
v2->v3:
  - define sctp_stream to contain stream arrays, and create stream.c
    to put stream-related functions.
  - merge 3 patches into 1, as new sctp_stream has the same name
    with before.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: inuse checks can quit early for reuseport

UDP lib inuse checks will walk the entire hash bucket to check if the
portaddr is in use. In the case of reuseport we can stop searching when
we find a matching reuseport.

On a 16-core VM a test program that spawns 16 threads that each bind to
1024 sockets (one per 10ms) takes 1m45s. With this change it takes 11s.

Also add a cond_resched() when the port is not specified.

Signed-off-by: Eric Garver <e@erig.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Add port description for new cards.

Add port description for 25G and 100G cards, and also
change few port descriptions in compliance with the new
naming convention.

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4/cxgb4vf: Display 25G and 100G link speed

Add support to report 25G and 100G links, which was missed
as part of commit "eb97ad99f9ed".

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2017-01-06

This series contains updates/fixes to igb and e1000e.

Joe fixes indentation and improper line wrapping in igb.

David Singleton fixes an issue in e1000e where in systemd, where things
are done in parallel and can create a condition where e1000_shutdown is
called after e1000_close, hitting BUG_ON assert in free_msi_irqs.

Cao Jin fixes a code comment on the wakeup status register.  Also fixes
a possible NULL pointer dereference by using igb_adapter->io_addr
instead of e1000_hw->hw_addr in igb_configure_tx_ring().

Chris Arges works around a firmware issue, which can cause probe of i210
NIC to fail, so zero the page select register during igb_get_phy_id() to
workaround the issue.  Aaron Sierra adds also a check for this issue
during the initialization of PHY parameters to ensure that this same
issue happens after probe.

Todd fixes a possible race condition in close/suspend by extending
the rtnl_lock() to protect the call to netif_device_detach() and
igb_clear_interrupt_scheme().  Also adds i211 to a known i210/i211
workaround.

Hannu Lounento fixes inverted logic on a debug statement.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: make fib_select_default static

fib_select_default has a single caller within the same file.
Make it static.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: Simplify rt_fill_info

rt_fill_info has only 1 caller and both of the last 2 args -- nowait
and flags -- are hardcoded to 0. Given that remove them as input arguments
and simplify rt_fill_info accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Synchronize access to mailbox

The issue comes when there are multiple threads attempting to use
the mailbox facility at the same time.
When DCB operations and interface up/down is run in a loop for every
0.1 sec, we observed mailbox collisions. And out of the two commands
one would fail with the present code, since we don't queue the second
command.

To overcome the above issue, added a queue to access the mailbox.
Whenever a mailbox command is issued add it to the queue. If its at
the head issue the mailbox command, else wait for the existing command
to complete. Usually command takes less than a milli-second to
complete.

Also timeout from the loop, if the command under execution takes
long time to run.

In reality, the number of mailbox access collisions is going to be
very rare since no one runs such abusive script.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Utilize common helpers for u64/MAC

Utilize the two functions recently introduced: u64_to_ether() and
ether_to_u64() instead of our own versions.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: remove version string

The dsa_driver_version string is irrelevant and has not been bumped
since its introduction about 9 years ago. Kill it.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: fix wrong information about channels reported to ethtool

Information reported to ethtool about channels is sometimes wrong for PF,
and always wrong for VF. Fix them by getting the information from the
right fields from the right structs.

Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: do not send RTM_DELADDR for tentative addresses

RTM_NEWADDR notification is sent when IFA_F_TENTATIVE is cleared from
the address. So if the address is added and deleted before DAD probes
completes, the RTM_DELADDR will be sent for which there was no
RTM_NEWADDR causing asymmetry in notification. However if the same
logic is used while sending RTM_DELADDR notification, this asymmetry
can be avoided.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio VF: fix incorrect struct being used

The VF driver is using the wrong struct when sending commands to the NIC
firmware, sometimes causing adverse effects in the firmware. The right
struct is the one that the PF is using, so make the VF use that as well.

Signed-off-by: Prasad Kanneganti <prasad.kanneganti@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

afs: Make afs_readpages() fetch data in bulk

Make afs_readpages() use afs_vnode_fetch_data()'s new ability to take a
list of pages and do a bulk fetch.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Make afs_fs_fetch_data() take a list of pages

Make afs_fs_fetch_data() take a list of pages for bulk data transfer. This
will allow afs_readpages() to be made more efficient.

Signed-off-by: David Howells <dhowells@redhat.com>

igb: Fix hw_dbg logging in igb_update_flash_i210

Fix an if statement with hw_dbg lines where the logic was inverted with
regards to the corresponding return value used in the if statement.

Signed-off-by: Hannu Lounento <hannu.lounento@ge.com>
Signed-off-by: Peter Senna Tschudin <peter.senna@collabora.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: add i211 to i210 PHY workaround

i210 and i211 share the same PHY but have different PCI IDs. Don't
forget i211 for any i210 workarounds.

Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: close/suspend race in netif_device_detach

Similar to ixgbe, when an interface is part of a namespace it is
possible that igb_close() may be called while __igb_shutdown() is
running which ends up in a double free WARN and/or a BUG in
free_msi_irqs().

Extend the rtnl_lock() to protect the call to netif_device_detach() and
igb_clear_interrupt_scheme() in __igb_shutdown() and check for
netif_device_present() to avoid calling igb_clear_interrupt_scheme() a
second time in igb_close().

Also extend the rtnl lock in igb_resume() to netif_device_attach().

Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: re-assign hw address pointer on reset after PCI error

Whenever the igb driver detects the result of a read operation returns
a value composed only by F's (like 0xFFFFFFFF), it will detach the
net_device, clear the hw_addr pointer and warn to the user that adapter's
link is lost - those steps happen on igb_rd32().

In case a PCI error happens on Power architecture, there's a recovery
mechanism called EEH, that will reset the PCI slot and call driver's
handlers to reset the adapter and network functionality as well.

We observed that once hw_addr is NULL after the error is detected on
igb_rd32(), it's never assigned back, so in the process of resetting
the network functionality we got a NULL pointer dereference in both
igb_configure_tx_ring() and igb_configure_rx_ring(). In order to avoid
such bug, this patch re-assigns the hw_addr value in the slot_reset
handler.

Reported-by: Anthony H Thai <ahthai@us.ibm.com>
Reported-by: Harsha Thyagaraja <hathyaga@in.ibm.com>
Signed-off-by: Guilherme G Piccoli <gpiccoli@linux.vnet.ibm.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: reset the PHY before reading the PHY ID

Several people have reported firmware leaving the I210/I211 PHY's page
select register set to something other than the default of zero. This
causes the first accesses, PHY_IDx register reads, to access something
else, resulting in device probe failure:

    igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
    igb: Copyright (c) 2007-2014 Intel Corporation.
    igb: probe of 0000:01:00.0 failed with error -2

This problem began for them after a previous patch I submitted was
applied:

    commit 2a3cdead8b408351fa1e3079b220fa331480ffbc
    Author: Aaron Sierra <asierra@xes-inc.com>
    Date:   Tue Nov 3 12:37:09 2015 -0600

        igb: Remove GS40G specific defines/functions

I personally experienced this problem after attempting to PXE boot from
I210 devices using this firmware:

    Intel(R) Boot Agent GE v1.5.78
    Copyright (C) 1997-2014, Intel Corporation

Resetting the PHY before reading from it, ensures the page select
register is in its default state and doesn't make assumptions about
the PHY's register set before the PHY has been probed.

Cc: Matwey V. Kornilov <matwey@sai.msu.ru>
Cc: Chris Arges <carges@vectranetworks.com>
Cc: Jochen Henneberg <jh@henneberg-systemdesign.com>
Signed-off-by: Aaron Sierra <asierra@xes-inc.com>
Tested-by: Matwey V. Kornilov <matwey@sai.msu.ru>
Tested-by: Chris J Arges <christopherarges@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: use igb_adapter->io_addr instead of e1000_hw->hw_addr

When running as guest, under certain condition, it will oops as following.
writel() in igb_configure_tx_ring() results in oops, because hw->hw_addr
is NULL. While other register access won't oops kernel because they use
wr32/rd32 which have a defense against NULL pointer.

    [  141.225449] pcieport 0000:00:1c.0: AER: Multiple Uncorrected (Fatal)
    error received: id=0101
    [  141.225523] igb 0000:01:00.1: PCIe Bus Error:
    severity=Uncorrected (Fatal), type=Unaccessible,
    id=0101(Unregistered Agent ID)
    [  141.299442] igb 0000:01:00.1: broadcast error_detected message
    [  141.300539] igb 0000:01:00.0 enp1s0f0: PCIe link lost, device now
    detached
    [  141.351019] igb 0000:01:00.1 enp1s0f1: PCIe link lost, device now
    detached
    [  143.465904] pcieport 0000:00:1c.0: Root Port link has been reset
    [  143.465994] igb 0000:01:00.1: broadcast slot_reset message
    [  143.466039] igb 0000:01:00.0: enabling device (0000 -> 0002)
    [  144.389078] igb 0000:01:00.1: enabling device (0000 -> 0002)
    [  145.312078] igb 0000:01:00.1: broadcast resume message
    [  145.322211] BUG: unable to handle kernel paging request at
    0000000000003818
    [  145.361275] IP: [<ffffffffa02fd38d>]
    igb_configure_tx_ring+0x14d/0x280 [igb]
    [  145.400048] PGD 0
    [  145.438007] Oops: 0002 [#1] SMP

A similar issue & solution could be found at:
    http://patchwork.ozlabs.org/patch/689592/

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Workaround for igb i210 firmware issue

Sometimes firmware may not properly initialize I347AT4_PAGE_SELECT causing
the probe of an igb i210 NIC to fail. This patch adds an addition zeroing
of this register during igb_get_phy_id to workaround this issue.

Thanks for Jochen Henneberg for the idea and original patch.

Signed-off-by: Chris J Arges <christopherarges@gmail.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: correct register comments

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

e1000e: driver trying to free already-free irq

During systemd reboot sequence network driver interface is shutdown
by e1000_close. The PCI driver interface is shut by e1000_shutdown.
The e1000_shutdown checks for netif_running status, if still up it
brings down driver. But it disables msi outside of this if statement,
regardless of netif status. All this is OK when e1000_close happens
after shutdown. However, by default, everything in systemd is done
in parallel. This creates a conditions where e1000_shutdown is called
after e1000_close, therefore hitting BUG_ON assert in free_msi_irqs.

CC: xe-kernel@external.cisco.com
Signed-off-by: khalidm <khalidm@cisco.com>
Signed-off-by: David Singleton <davsingl@cisco.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Realign bad indentation

Statements should start on tabstops.

Use a single statement and test instead of multiple tests.

Signed-off-by: Joe Perches <joe@perches.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

tools: psock_tpacket: block Rx until socket filter has been added and socket has been bound to loopback.

Packets from any/all interfaces may be queued up on the PF_PACKET socket
before it is bound to the loopback interface by psock_tpacket, and
when these are passed up by the kernel, they could interfere
with the Rx tests.

Avoid interference from spurious packet by blocking Rx until the
socket filter has been set up, and the packet has been bound to the
desired (lo) interface. The effective sequence is
socket(PF_PACKET, SOCK_RAW, 0);
set up ring
Invoke SO_ATTACH_FILTER
bind to sll_protocol set to ETH_P_ALL, sll_ifindex for lo
After this sequence, the only packets that will be passed up are
those received on loopback that pass the attached filter.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: provide timestamps for partial writes

For TCP sockets, TX timestamps are only captured when the user data
is successfully and fully written to the socket. In many cases,
however, TCP writes can be partial for which no timestamp is
collected.

Collect timestamps whenever any user data is (fully or partially)
copied into the socket. Pass tcp_write_queue_tail to tcp_tx_timestamp
instead of the local skb pointer since it can be set to NULL on
the error path.

Note that tcp_write_queue_tail can be NULL, even if bytes have been
copied to the socket. This is because acknowledgements are being
processed in tcp_sendmsg(), and by the time tcp_tx_timestamp is
called tcp_write_queue_tail can be NULL. For such cases, this patch
does not collect any timestamps (i.e., it is best-effort).

This patch is written with suggestions from Willem de Bruijn and
Eric Dumazet.

Change-log V1 -> V2:
- Use sockc.tsflags instead of sk->sk_tsflags.
- Use the same code path for normal writes and errors.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rxrpc-rewrite-20170105' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Update tracing and proc interfaces

This set of patches fixes and extends tracing:

(1) Fix the handling of enum-to-string translations so that external
     tracing tools can make use of it by using TRACE_DEFINE_ENUM.

(2) Extend a couple of tracepoints to export some extra available
     information and add three new tracepoints to allow monitoring of
     received DATA packets, call disconnection and improper/implicit call
     termination.

and adds a bit more procfs-exported information:

(3) Show a call's hard-ACK cursors in /proc/net/rxrpc_calls.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net:dsa: check for EPROBE_DEFER from dsa_dst_parse()

Since there can be multiple dsa switches stacked together but
not all of devicetree nodes available at the time of calling
dsa_dst_parse(), EPROBE_DEFER can be returned by it. When this
happens, only the last dsa switch has to be deleted by
dsa_dst_del_ds(), but not the whole list, because next time linux
cames back to this function it will try to add only the last dsa
switch which returned EPROBE_DEFER.

Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net:mv88e6xxx: use g2 interrupt for 6097 chip

This chip needs MV88E6XXX_FLAG_G2_INT

Signed-off-by: Volodymyr Bendiuga <volodymyr.bendiuga@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: xilinx: emaclite: Remove xemaclite_remove_ndev()

xemaclite_remove_ndev() is a simple wrapper around free_netdev()
checking for NULL before the call. All possible paths calling
it are guaranteed to pass a non-NULL argument, so rather call
free_netdev() directly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethoc: Remove unused members from struct ethoc

The io_region_size and dma_alloc members of struct ethoc are only
written but never read, so they might as well be removed.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

rxrpc: Show a call's hard-ACK cursors in /proc/net/rxrpc_calls

Show a call's hard-ACK cursors in /proc/net/rxrpc_calls so that a call's
progress can be more easily monitored.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Add some more tracing

Add the following extra tracing information:

(1) Modify the rxrpc_transmit tracepoint to record the Tx window size as
     this is varied by the slow-start algorithm.

(2) Modify the rxrpc_rx_ack tracepoint to record more information from
     received ACK packets.

(3) Add an rxrpc_rx_data tracepoint to record the information in DATA
     packets.

(4) Add an rxrpc_disconnect_call tracepoint to record call disconnection,
     including the reason the call was disconnected.

(5) Add an rxrpc_improper_term tracepoint to record implicit termination
     of a call by a client either by starting a new call on a particular
     connection channel without first transmitting the final ACK for the
     previous call.

Signed-off-by: David Howells <dhowells@redhat.com>

rxrpc: Fix handling of enums-to-string translation in tracing

Fix the way enum values are translated into strings in AF_RXRPC
tracepoints. The problem with just doing a lookup in a normal flat array
of strings or chars is that external tracing infrastructure can't find it.
Rather, TRACE_DEFINE_ENUM must be used.

Also sort the enums and string tables to make it easier to keep them in
order so that a future patch to __print_symbolic() can be optimised to try
a direct lookup into the table first before iterating over it.

A couple of _proto() macro calls are removed because they refered to tables
that got moved to the tracing infrastructure. The relevant data can be
found by way of tracing.

Signed-off-by: David Howells <dhowells@redhat.com>

packet: fix panic in __packet_set_timestamp on tpacket_v3 in tx mode

When TX timestamping is in use with TPACKET_V3's TX ring, then we'll
hit the BUG() in __packet_set_timestamp() when ring buffer slot is
returned to user space via tpacket_destruct_skb(). This is due to v3
being assumed as unreachable here, but since 7f953ab2ba46 ("af_packet:
TX_RING support for TPACKET_V3") it's not anymore. Fix it by filling
the timestamp back into the ring slot.

Fixes: 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

- fixes for crashes and double-cleanup errors

- XFS maintainership handover

- fix to prevent absurdly large block reservations

- fix broken sysfs getter/setters

* tag 'xfs-for-linus-4.10-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix max_retries _show and _store functions
  xfs: update MAINTAINERS
  xfs: fix crash and data corruption due to removal of busy COW extents
  xfs: use the actual AG length when reserving blocks
  xfs: fix double-cleanup when CUI recovery fails

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) stmmac_drv_probe() can race with stmmac_open() because we register
    the netdevice too early. Fix from Florian Fainelli.

2) UFO handling in __ip6_append_data() and ip6_finish_output() use
    different tests for deciding whether a frame will be fragmented or
    not, put them in sync. Fix from Zheng Li.

3) The rtnetlink getstats handlers need to validate that the netlink
    request is large enough, fix from Mathias Krause.

4) Use after free in mlx4 driver, from Jack Morgenstein.

5) Fix setting of garbage UID value in sockets during setattr() calls,
    from Eric Biggers.

6) Packet drop_monitor doesn't format the netlink messages properly
    such that nlmsg_next fails to work, fix from Reiter Wolfgang.

7) Fix handling of wildcard addresses in l2tp lookups, from Guillaume
    Nault.

8) __skb_flow_dissect() can crash on pptp packets, from Ian Kumlien.

9) IGMP code doesn't reset group query timers properly, from Michal
    Tesar.

10) Fix overzealous MAIN/LOCAL route table combining in ipv4, from
    Alexander Duyck.

11) vxlan offload check needs to be more strict in be2net driver, from
    Sabrina Dubroca.

12) Moving l3mdev to packet hooks lost RX stat counters unintentionally,
    fix from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits)
  sh_eth: enable RX descriptor word 0 shift on SH7734
  sfc: don't report RX hash keys to ethtool when RSS wasn't enabled
  dpaa_eth: Initialize CGR structure before init
  dpaa_eth: cleanup after init_phy() failure
  net: systemport: Pad packet before inserting TSB
  net: systemport: Utilize skb_put_padto()
  LiquidIO VF: s/select/imply/ for PTP_1588_CLOCK
  libcxgb: fix error check for ip6_route_output()
  net: usb: asix_devices: add .reset_resume for USB PM
  net: vrf: Add missing Rx counters
  drop_monitor: consider inserted data in genlmsg_end
  benet: stricter vxlan offloading check in be_features_check
  ipv4: Do not allow MAIN to be alias for new LOCAL w/ custom rules
  net: macb: Updated resource allocation function calls to new version of API.
  net: stmmac: dwmac-oxnas: use generic pm implementation
  net: stmmac: dwmac-oxnas: fix fixed-link-phydev leaks
  net: stmmac: dwmac-oxnas: fix of-node leak
  Documentation/networking: fix typo in mpls-sysctl
  igmp: Make igmp group member RFC 3376 compliant
  flow_dissector: Update pptp handling to avoid null pointer deref.
  ...

dsa: mv88e6xxx: Optimise atu_get

Lookup in the ATU can be performed starting from a given MAC
address. This is faster than starting with the first possible MAC
address and iterating all entries.

Entries are returned in numeric order. So if the MAC address returned
is bigger than what we are searching for, we know it is not in the
ATU.

Using the benchmark provided by Volodymyr Bendiuga
<volodymyr.bendiuga@gmail.com>,

https://www.spinics.net/lists/netdev/msg411550.html

on an Marvell Armada 370 RD, the test to add a number of static fdb
entries went from 1.616531 seconds to 0.312052 seconds.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: enable RX descriptor word 0 shift on SH7734

The RX descriptor word 0 on SH7734 has the RFS[9:0] field in bits 16-25
(bits 0-15 usually used for that are occupied by the packet checksum).
Thus we need to set the 'shift_rd0' field in the SH7734 SoC data...

Fixes: f0e81fecd4f8 ("net: sh_eth: Add support SH7734")
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sfc: don't report RX hash keys to ethtool when RSS wasn't enabled

If we failed to set up RSS on EF10 (e.g. because firmware declared
RX_RSS_LIMITED), ethtool --show-nfc $dev rx-flow-hash ... should report
no fields, rather than confusingly reporting what fields we _would_ be
hashing on if RSS was working.

Fixes: dcb4123cbec0 ("sfc: disable RSS when unsupported")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: Support compressed error vector for T6

t6fw-1.15.15.0 enabled compressed error vector in cpl_rx_pkt for T6.
Updating driver to take care of these changes.

Signed-off-by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Arjun V <arjun@chelsio.com>
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sh_eth-intrs-cleanup'

Sergei Shtylyov says:

====================
sh_eth: E-MAC interrupt handler cleanups

Here's a set of 3 patches against DaveM's 'net-next.git' repo. I'm cleaning
up the E-MAC interrupt handling with the main goal of factoring out the E-MAC
interrupt handler into a separate function.

[1/3] sh_eth: handle only enabled E-MAC interrupts
[2/3] sh_eth: no need for *else* after *goto*
[3/3] sh_eth: factor out sh_eth_emac_interrupt()
====================

Signed-off-by: David S. Miller <davem@davemloft.net>