git.proxmox.com Git - mirror_ubuntu-bionic-kernel.git/log

afs: Implement shared-writeable mmap

Implement shared-writeable mmap for AFS.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Get rid of the afs_writeback record

Get rid of the afs_writeback record that kAFS is using to match keys with
writes made by that key.

Instead, keep a list of keys that have a file open for writing and/or
sync'ing and iterate through those.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Introduce a file-private data record

Introduce a file-private data record for kAFS and put the key into it
rather than storing the key in file->private_data.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Use a dynamic port if 7001 is in use

It is not required that the afs client operate on port 7001.
The port could be in use because another kernel or userspace
client has already bound to it.

If the port is in use, just fallback to using a dynamic port.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>

afs: Fix directory read/modify race

Because parsing of the directory wasn't being done under any sort of lock,
the pages holding the directory content can get invalidated whilst the
parsing is ongoing.

Further, the directory page check function gets called outside of the page
lock, so if the page gets cleared or updated, this may return reports of
bad magic numbers in the directory page.

Also, the directory may change size whilst checking and parsing are
ongoing, so more care needs to be taken here.

Fix this by:

(1) Perform the page check from the page filling function before we set
PageUptodate and drop the page lock.

(2) Check for the file having shrunk and the page having been abandoned
before checking the page contents.

(3) Lock the page whilst parsing it for the directory iterator.

Whilst we're at it, add a tracepoint to report check failure.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Trace the sending of pages

Add a pair of tracepoints to log the sending of pages for an FS.StoreData
or FS.StoreData64 operation.

Tracepoint afs_send_pages notes each set of pages added to the operation.
There may be several of these per operation as we get up at most 8
contiguous pages in one go because the bvec we're using is on the stack.

Tracepoint afs_sent_pages notes the end of adding data from a whole run of
pages to the operation and the completion of the request phase.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Trace the initiation and completion of client calls

Add tracepoints to trace the initiation and completion of client calls
within the kafs filesystem.

The afs_make_vl_call tracepoint watches calls to the volume location
database server.

The afs_make_fs_call tracepoint watches calls to the file server.

The afs_call_done tracepoint watches for call completion.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Fix documentation on # vs % prefix in mount source specification

The documentation that describes the #-prefix and the %-prefix used when
specifying the source to mount is has the descriptions the wrong way
round. Switch them over.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>

afs: Fix total-length calculation for multiple-page send

Fix the total-length calculation in afs_make_call() when the operation
being dispatched has data from a series of pages attached.

Despite the patched code looking like that it should reduce mathematically
to the current code, it doesn't because the 32-bit unsigned arithmetic
being used to calculate the page-offset-difference doesn't correctly extend
to a 64-bit value when the result is effectively negative.

Without this, some FS.StoreData operations that span multiple pages fail,
reporting too little or too much data.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Only progress call state at end of Tx phase from rxrpc callback

Only progress the AFS call state at the end of Tx phase from the callback
passed to rxrpc_kernel_send_data() rather than setting it before the last
data send call.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Make use of the YFS service upgrade to fully support IPv6

YFS VL servers offer an upgraded Volume Location service that can return
IPv6 addresses to fileservers and volume servers in addition to IPv4
addresses using the YFSVL.GetEndpoints operation which we should use if
it's available.

To this end:

(1) Make rxrpc_kernel_recv_data() return the call's current service ID so
     that the caller can detect service upgrade and see what the service
     was upgraded to.

(2) When we see a VL server address we haven't seen before, send a
     VL.GetCapabilities operation to it with the service upgrade bit set.

     If we get an upgrade to the YFS VL service, change the service ID in
     the address list for that address to use the upgraded service and set
     a flag to note that this appears to be a YFS-compatible server.

(3) If, when a server's addresses are being looked up, we note that we
     previously detected a YFS-compatible server, then send the
     YFSVL.GetEndpoints operation rather than VL.GetAddrsU.

(4) Build a fileserver address list from the reply of YFSVL.GetEndpoints,
     including both IPv4 and IPv6 addresses.  Volume server addresses are
     discarded.

(5) The address list is sorted by address and port now, instead of just
     address.  This allows multiple servers on the same host sitting on
     different ports.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Overhaul volume and server record caching and fileserver rotation

The current code assumes that volumes and servers are per-cell and are
never shared, but this is not enforced, and, indeed, public cells do exist
that are aliases of each other.  Further, an organisation can, say, set up
a public cell and a private cell with overlapping, but not identical, sets
of servers.  The difference is purely in the database attached to the VL
servers.

The current code will malfunction if it sees a server in two cells as it
assumes global address -> server record mappings and that each server is in
just one cell.

Further, each server may have multiple addresses - and may have addresses
of different families (IPv4 and IPv6, say).

To this end, the following structural changes are made:

(1) Server record management is overhauled:

     (a) Server records are made independent of cell.  The namespace keeps
      track of them, volume records have lists of them and each vnode
      has a server on which its callback interest currently resides.

     (b) The cell record no longer keeps a list of servers known to be in
      that cell.

     (c) The server records are now kept in a flat list because there's no
      single address to sort on.

     (d) Server records are now keyed by their UUID within the namespace.

     (e) The addresses for a server are obtained with the VL.GetAddrsU
      rather than with VL.GetEntryByName, using the server's UUID as a
      parameter.

     (f) Cached server records are garbage collected after a period of
      non-use and are counted out of existence before purging is allowed
      to complete.  This protects the work functions against rmmod.

     (g) The servers list is now in /proc/fs/afs/servers.

(2) Volume record management is overhauled:

     (a) An RCU-replaceable server list is introduced.  This tracks both
      servers and their coresponding callback interests.

     (b) The superblock is now keyed on cell record and numeric volume ID.

     (c) The volume record is now tied to the superblock which mounts it,
      and is activated when mounted and deactivated when unmounted.
      This makes it easier to handle the cache cookie without causing a
      double-use in fscache.

     (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
      to get the server UUID list.

     (e) The volume name is updated if it is seen to have changed when the
      volume is updated (the update is keyed on the volume ID).

(3) The vlocation record is got rid of and VLDB records are no longer
     cached.  Sufficient information is stored in the volume record, though
     an update to a volume record is now no longer shared between related
     volumes (volumes come in bundles of three: R/W, R/O and backup).

and the following procedural changes are made:

(1) The fileserver cursor introduced previously is now fleshed out and
     used to iterate over fileservers and their addresses.

(2) Volume status is checked during iteration, and the server list is
     replaced if a change is detected.

(3) Server status is checked during iteration, and the address list is
     replaced if a change is detected.

(4) The abort code is saved into the address list cursor and -ECONNABORTED
     returned in afs_make_call() if a remote abort happened rather than
     translating the abort into an error message.  This allows actions to
     be taken depending on the abort code more easily.

     (a) If a VMOVED abort is seen then this is handled by rechecking the
      volume and restarting the iteration.

     (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
         handled by sleeping for a short period and retrying and/or trying
         other servers that might serve that volume.  A message is also
         displayed once until the condition has cleared.

     (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
      moment.

     (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
      see if it has been deleted; if not, the fileserver is probably
      indicating that the volume couldn't be attached and needs
      salvaging.

     (e) If statfs() sees one of these aborts, it does not sleep, but
      rather returns an error, so as not to block the umount program.

(5) The fileserver iteration functions in vnode.c are now merged into
     their callers and more heavily macroised around the cursor.  vnode.c
     is removed.

(6) Operations on a particular vnode are serialised on that vnode because
     the server will lock that vnode whilst it operates on it, so a second
     op sent will just have to wait.

(7) Fileservers are probed with FS.GetCapabilities before being used.
     This is where service upgrade will be done.

(8) A callback interest on a fileserver is set up before an FS operation
     is performed and passed through to afs_make_call() so that it can be
     set on the vnode if the operation returns a callback.  The callback
     interest is passed through to afs_iget() also so that it can be set
     there too.

In general, record updating is done on an as-needed basis when we try to
access servers, volumes or vnodes rather than offloading it to work items
and special threads.

Notes:

(1) Pre AFS-3.4 servers are no longer supported, though this can be added
     back if necessary (AFS-3.4 was released in 1998).

(2) VBUSY is retried forever for the moment at intervals of 1s.

(3) /proc/fs/afs/<cell>/servers no longer exists.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Move server rotation code into its own file

Move server rotation code into its own file.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Add an address list concept

Add an RCU replaceable address list structure to hold a list of server
addresses.  The list also holds the

To this end:

(1) A cell's VL server address list can be loaded directly via insmod or
     echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
     or SRV records.

(2) Anyone wanting to use a cell's VL server address must wait until the
     cell record comes online and has tried to obtain some addresses.

(3) An FS server's address list, for the moment, has a single entry that
     is the key to the server list.  This will change in the future when a
     server is instead keyed on its UUID and the VL.GetAddrsU operation is
     used.

(4) An 'address cursor' concept is introduced to handle iteration through
     the address list.  This is passed to the afs_make_call() as, in the
     future, stuff (such as abort code) that doesn't outlast the call will
     be returned in it.

In the future, we might want to annotate the list with information about
how each address fares.  We might then want to propagate such annotations
over address list replacement.

Whilst we're at it, we allow IPv6 addresses to be specified in
colon-delimited lists by enclosing them in square brackets.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Overhaul cell database management

Overhaul the way that the in-kernel AFS client keeps track of cells in the
following manner:

(1) Cells are now held in an rbtree to make walking them quicker and RCU
     managed (though this is probably overkill).

(2) Cells now have a manager work item that:

     (A) Looks after fetching and refreshing the VL server list.

     (B) Manages cell record lifetime, including initialising and
      destruction.

     (B) Manages cell record caching whereby threads are kept around for a
      certain time after last use and then destroyed.

     (C) Manages the FS-Cache index cookie for a cell.  It is not permitted
      for a cookie to be in use twice, so we have to be careful to not
      allow a new cell record to exist at the same time as an old record
      of the same name.

(3) Each AFS network namespace is given a manager work item that manages
     the cells within it, maintaining a single timer to prod cells into
     updating their DNS records.

     This uses the reduce_timer() facility to make the timer expire at the
     soonest timed event that needs happening.

(4) When a module is being unloaded, cells and cell managers are now
     counted out using dec_after_work() to make sure the module text is
     pinned until after the data structures have been cleaned up.

(5) Each cell's VL server list is now protected by a seqlock rather than a
     semaphore.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Overhaul permit caching

Overhaul permit caching in AFS by making it per-vnode and sharing permit
lists where possible.

When most of the fileserver operations are called, they return a status
structure indicating the (revised) details of the vnode or vnodes involved
in the operation.  This includes the access mark derived from the ACL
(named CallerAccess in the protocol definition file).  This is cacheable
and if the ACL changes, the server will tell us that it is breaking the
callback promise, at which point we can discard the currently cached
permits.

With this patch, the afs_permits structure has, at the end, an array of
{ key, CallerAccess } elements, sorted by key pointer.  This is then cached
in a hash table so that it can be shared between vnodes with the same
access permits.

Permit lists can only be shared if they contain the exact same set of
key->CallerAccess mappings.

Note that that table is global rather than being per-net_ns.  If the keys
in a permit list cross net_ns boundaries, there is no problem sharing the
cached permits, since the permits are just integer masks.

Since permit lists pin keys, the permit cache also makes it easier for a
future patch to find all occurrences of a key and remove them by means of
setting the afs_permits::invalidated flag and then clearing the appropriate
key pointer.  In such an event, memory barriers will need adding.

Lastly, the permit caching is skipped if the server has sent either a
vnode-specific or an entire-server callback since the start of the
operation.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Overhaul the callback handling

Overhaul the AFS callback handling by the following means:

(1) Don't give up callback promises on vnodes that we are no longer using,
     rather let them just expire on the server or let the server break
     them.  This is actually more efficient for the server as the callback
     lookup is expensive if there are lots of extant callbacks.

(2) Only give up the callback promises we have from a server when the
     server record is destroyed.  Then we can just give up *all* the
     callback promises on it in one go.

(3) Servers can end up being shared between cells if cells are aliased, so
     don't add all the vnodes being backed by a particular server into a
     big FID-indexed tree on that server as there may be duplicates.

     Instead have each volume instance (~= superblock) register an interest
     in a server as it starts to make use of it and use this to allow the
     processor for callbacks from the server to find the superblock and
     thence the inode corresponding to the FID being broken by means of
     ilookup_nowait().

(4) Rather than iterating over the entire callback list when a mass-break
     comes in from the server, maintain a counter of mass-breaks in
     afs_server (cb_seq) and make afs_validate() check it against the copy
     in afs_vnode.

     It would be nice not to have to take a read_lock whilst doing this,
     but that's tricky without using RCU.

(5) Save a ref on the fileserver we're using for a call in the afs_call
     struct so that we can access its cb_s_break during call decoding.

(6) Write-lock around callback and status storage in a vnode and read-lock
     around getattr so that we don't see the status mid-update.

This has the following consequences:

(1) Data invalidation isn't seen until someone calls afs_validate() on a
     vnode.  Unfortunately, we need to use a key to query the server, but
     getting one from a background thread is tricky without caching loads
     of keys all over the place.

(2) Mass invalidation isn't seen until someone calls afs_validate().

(3) Callback breaking is going to hit the inode_hash_lock quite a bit.
     Could this be replaced with rcu_read_lock() since inodes are destroyed
     under RCU conditions.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Rename struct afs_call server member to cm_server

Rename the server member of struct afs_call to cm_server as we're only
going to be using it for incoming calls for the Cache Manager service.
This makes it easier to differentiate from the pointer to the target server
for the client, which will point to a different structure to allow for
callback handling.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Fix the afs_uuid struct to make the char-sized fields signed

In AFS's encoding of a UUID, the eight 'char' fields are all signed, so
represent them with __s8 rather than __u8. This makes the compiler
sign-extend them correctly when XDR-encoding them.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Connect up the CB.ProbeUuid

The handler for the CB.ProbeUuid operation in the cache manager is
implemented, but isn't listed in the switch-statement of operation
selection, so won't be used. Fix this by adding it.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Potentially return call->reply[0] from afs_make_call()

If call->ret_reply0 is set, return call->reply[0] on success. Change the
return type of afs_make_call() to long so that this can be passed back
without bit loss and then cast to a pointer if required.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Condense afs_call's reply{,2,3,4} into an array

Condense struct afs_call's reply anchor members - reply{,2,3,4} - into an
array.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Consolidate abort_to_error translators

The AFS abort code space is shared across all services, so there's no need
for separate abort_to_error translators for each service.

Consolidate them into a single function and remove the function pointers
for them.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Allow IPv6 address specification of VL servers

Allow VL server specifications to be given IPv6 addresses as well as IPv4
addresses, for example as:

echo add foo.org 1111:2222:3333:0:4444:5555:6666:7777 >/proc/fs/afs/cells

Note that ':' is the expected separator for separating IPv4 addresses, but
if a ',' is detected or no '.' is detected in the string, the delimiter is
switched to ','.

This also works with DNS AFSDB or SRV record strings fetched by upcall from
userspace.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Keep and pass sockaddr_rxrpc addresses rather than in_addr

Keep and pass sockaddr_rxrpc addresses around rather than keeping and
passing in_addr addresses to allow for the use of IPv6 and non-standard
port numbers in future.

This also allows the port and service_id fields to be removed from the
afs_call struct.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Update the cache index structure

Update the cache index structure in the following ways:

(1) Don't use the volume name followed by the volume type as levels in the
     cache index.  Volumes can be renamed.  Use the volume ID instead.

(2) Don't store the VLDB data for a volume in the tree.  If the volume
     database should be cached locally, then it should be done in a separate
     tree.

(3) Expand the volume ID stored in the cache to 64 bits.

(4) Expand the file/vnode ID stored in the cache to 96 bits.

(5) Increment the cache structure version number to 1.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Add some protocol defs

Add some protocol definitions, including max field lengths, flag defs, an
XDR-encoded UUID def, more VL operation IDs and more fileserver abort
codes.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Push the net ns pointer to more places

Push the network namespace pointer to more places in AFS, including the
afs_server structure (which doesn't hold a ref on the netns).

In particular, afs_put_cell() now takes requires a net ns parameter so that
it can safely alter the netns after decrementing the cell usage count - the
cell will be deallocated by a background thread after being cached for a
period, which means that it's not safe to access it after reducing its
usage count.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Note the cell in the superblock info also

Keep a reference to the cell in the superblock info structure in addition
to the volume and net pointers. This will make it easier to clean up in a
future patch in which afs_put_volume() will need the cell pointer.

Whilst we're at it, make the cell and volume getting functions return a
pointer to the object got to make the call sites look neater.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Fix server reaping

Fix server reaping and make sure it's all done before we start trying to
purge cells, given that servers currently pin cells.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Close the rxrpc socket only after purging the servers

Close the rxrpc socket only after we've purged the server records (and also
cell and volume records which might refer to servers) so that we can give
up the callbacks on each server.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Lay the groundwork for supporting network namespaces

Lay the groundwork for supporting network namespaces (netns) to the AFS
filesystem by moving various global features to a network-namespace struct
(afs_net) and providing an instance of this as a temporary global variable
that everything uses via accessor functions for the moment.

The following changes have been made:

(1) Store the netns in the superblock info.  This will be obtained from
     the mounter's nsproxy on a manual mount and inherited from the parent
     superblock on an automount.

(2) The cell list is made per-netns.  It can be viewed through
     /proc/net/afs/cells and also be modified by writing commands to that
     file.

(3) The local workstation cell is set per-ns in /proc/net/afs/rootcell.
     This is unset by default.

(4) The 'rootcell' module parameter, which sets a cell and VL server list
     modifies the init net namespace, thereby allowing an AFS root fs to be
     theoretically used.

(5) The volume location lists and the file lock manager are made
     per-netns.

(6) The AF_RXRPC socket and associated I/O bits are made per-ns.

The various workqueues remain global for the moment.

Changes still to be made:

(1) /proc/fs/afs/ should be moved to /proc/net/afs/ and a symlink emplaced
     from the old name.

(2) A per-netns subsys needs to be registered for AFS into which it can
     store its per-netns data.

(3) Rather than the AF_RXRPC socket being opened on module init, it needs
     to be opened on the creation of a superblock in that netns.

(4) The socket needs to be closed when the last superblock using it is
     destroyed and all outstanding client calls on it have been completed.
     This prevents a reference loop on the namespace.

(5) It is possible that several namespaces will want to use AFS, in which
     case each one will need its own UDP port.  These can either be set
     through /proc/net/afs/cm_port or the kernel can pick one at random.
     The init_ns gets 7001 by default.

Other issues that need resolving:

(1) The DNS keyring needs net-namespacing.

(2) Where do upcalls go (eg. DNS request-key upcall)?

(3) Need something like open_socket_in_file_ns() syscall so that AFS
     command line tools attempting to operate on an AFS file/volume have
     their RPC calls go to the right place.

Signed-off-by: David Howells <dhowells@redhat.com>

Pass mode to wait_on_atomic_t() action funcs and provide default actions

Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
extra argument and make it 'unsigned int throughout.

Also, consolidate a bunch of identical action functions into a default
function that can do the appropriate thing for the mode.

Also, change the argument name in the bit_wait*() function declarations to
reflect the fact that it's the mode and not the bit number.

[Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
should be done differently, though he's not immediately sure as to how]

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
cc: Ingo Molnar <mingo@kernel.org>

Merge remote-tracking branch 'tip/timers/core' into afs-next

These AFS patches need the timer_reduce() patch from timers/core.

Signed-off-by: David Howells <dhowells@redhat.com>

Merge branch 'net-improve-the-process-of-redirect-and-toobig-for-ipv6-tunnels'

Xin Long says:

====================
net: improve the process of redirect and toobig for ipv6 tunnels

Now let's say there are 3 kinds of icmp packets to process for tunnels,
toobig(needfrag), redirect, others, their process should be:

- toobig(needfrag)
   update the lower dst's pmtu by route cache, also update sk dst's pmtu
   if possible, or it will be fine if sk dst pmtu will get updated on tx
   path.

- redirect
   update the lower dst's gw by route cache and return, no need to send
   this redirect packet to user sk.

- others
   send the packet to user's sk, or it will also be fine to use err_count
   to count it and report fail link on tx path.

All ipv4 tunnels basically follow this while some of ipv6 tunnels are
doing in different ways, like ip6gre and ip6_tunnels update tnl dev's
mtu instead of updating lower dst pmtu, no redirect process on their
err_handlers, which doesn't make any sense and even causes performance
problems.

This patchset is to improve the process of redirect and toobig for ip6gre
ip4ip6, ip6ip6 tunnels, as in ipv4 tunnels.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_tunnel: clean up ip4ip6 and ip6ip6's err_handlers

This patch is to remove some useless codes of redirect and fix some
indents on ip4ip6 and ip6ip6's err_handlers.

Note that redirect icmp packet is already processed in ip6_tnl_err,
the old redirect codes in ip4ip6_err actually never worked even
before this patch. Besides, there's no need to send redirect to
user's sk, it's for lower dst, so just remove it in this patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_tunnel: process toobig in a better way

The same improvement in "ip6_gre: process toobig in a better way"
is needed by ip4ip6 and ip6ip6 as well.

Note that ip4ip6 and ip6ip6 will also update sk dst pmtu in their
err_handlers. Like I said before, gre6 could not do this as it's
inner proto is not certain. But for all of them, sk dst pmtu will
be updated in tx path if in need.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_tunnel: add the process for redirect in ip6_tnl_err

The same process for redirect in "ip6_gre: add the process for redirect
in ip6gre_err" is needed by ip4ip6 and ip6ip6 as well.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_gre: process toobig in a better way

Now ip6gre processes toobig icmp packet by setting gre dev's mtu in
ip6gre_err, which would cause few things not good:

  - It couldn't set mtu with dev_set_mtu due to it's not in user context,
    which causes route cache and idev->cnf.mtu6 not to be updated.

  - It has to update sk dst pmtu in tx path according to gredev->mtu for
    ip6gre, while it updates pmtu again according to lower dst pmtu in
    ip6_tnl_xmit.

  - To change dev->mtu by toobig icmp packet is not a good idea, it should
    only work on pmtu.

This patch is to process toobig by updating the lower dst's pmtu, as later
sk dst pmtu will be updated in ip6_tnl_xmit, the same way as in ip4gre.

Note that gre dev's mtu will not be updated any more, it doesn't make any
sense to change dev's mtu after receiving a toobig packet.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ip6_gre: add the process for redirect in ip6gre_err

This patch is to add redirect icmp packet process for ip6gre by
calling ip6_redirect() in ip6gre_err(), as in vti6_err.

Prior to this patch, there's even no route cache generated after
receiving redirect.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

forcedeth: remove redudant assignments in xmit

In xmit process, the variables are set many times. In fact,
it is enough for these variables to be set once.
After a long time test, the throughput performance is better
than before.

CC: Srinivas Eeda <srinivas.eeda@oracle.com>
CC: Joe Jin <joe.jin@oracle.com>
CC: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'nfc-next-4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next

Samuel Ortiz says:

====================
NFC 4.15 pull request

This is the NFC pull request for 4.15. We have:

- A new netlink command for explicitly deactivating NFC targets
- i2c constification for all NFC drivers
- One NFC device allocation error path fix
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Openvswitch-meter-action'

Andy Zhou says:

====================
Openvswitch meter action

This patch series is the first attempt to add openvswitch
meter support. We have previously experimented with adding
metering support in nftables. However 1) It was not clear
how to expose a named nftables object cleanly, and 2)
the logic that implements metering is quite small, < 100 lines
of code.

With those two observations, it seems cleaner to add meter
support in the openvswitch module directly.

---

    v1(RFC)->v2:  remove unused code improve locking
  and other review comments
    v2 -> v3:     rebase
    v3 -> v4:     fix undefined "__udivdi3" references on 32 bit builds.
                  use div_u64() instead.
    v4 -> v5:     rebase
====================

Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Add meter action support

Implements OVS kernel meter action support.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Add meter infrastructure

OVS kernel datapath so far does not support Openflow meter action.
This is the first stab at adding kernel datapath meter support.
This implementation supports only drop band type.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: export get_dp() API.

Later patches will invoke get_dp() outside of datapath.c. Export it.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

openvswitch: Add meter netlink definitions

Meter has its own netlink family. Define netlink messages and attributes
for communicating with the user space programs.

Signed-off-by: Andy Zhou <azhou@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-b53-Support-prepended-Broadcom-tags'

Florian Fainelli says:

====================
net: dsa: b53: Support prepended Broadcom tags

This patch series adds support for prepended 4-bytes Broadcom tags that we
already support. This type of tag will typically be used when interfaced to
a SoC like BCM58xx (NorthStar Plus) which supports a Flow Accelerator (WIP).
In that case, we need to support a slightly different tagging format.

The first patch does a bit of re-factoring and passes a port index to
the get_tag_protocol() function since at least two different drivers need
that type of information (mt7530, b53) to support tagging or not.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Support prepended Broadcom tags

On BCM58xx devices (Northstar Plus), there is an accelerator attached to
port 8 which would only work if we use prepended Broadcom tags. Resolve
that difference in our get_tag_protocol() function by setting the
appropriate tagging protocol in that case. We need to change
b53_brcm_hdr_setup() a little bit now since we can deal with two types
of Broadcom tags.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Support prepended Broadcom tag

Add a new type: DSA_TAG_PROTO_PREPEND which allows us to support for the
4-bytes Broadcom tag that we already support, but in a format where it
is pre-pended to the packet instead of located between the MAC SA and
the Ethertyper (DSA_TAG_PROTO_BRCM).

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: tag_brcm: Prepare for supporting prepended tag

In preparation for supporting the same Broadcom tag format, but instead
of inserted between the MAC SA and EtherType, prepended to the Ethernet
frame, restructure the code a little bit to make that possible and take
an offset parameter.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Pass a port to get_tag_protocol()

A number of drivers want to check whether the configured CPU port is a
possible configuration for enabling tagging, pass down the CPU port
number so they verify that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched/sch_red.c: work around gcc-4.4.4 anon union initializer issue

gcc-4.4.4 (at lest) has issues with initializers and anonymous unions:

net/sched/sch_red.c: In function 'red_dump_offload':
net/sched/sch_red.c:282: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:282: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c:283: error: unknown field 'stats' specified in initializer
net/sched/sch_red.c:283: warning: initialization makes integer from pointer without a cast
net/sched/sch_red.c: In function 'red_dump_stats':
net/sched/sch_red.c:352: error: unknown field 'xstats' specified in initializer
net/sched/sch_red.c:352: warning: initialization makes integer from pointer without a cast

Work around this.

Fixes: 602f3baf2218 ("net_sch: red: Add offload ability to RED qdisc")
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Simon Horman <simon.horman@netronome.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4: Use Kconfig flag to remove support of old gen2 Mellanox devices

Since Mellanox focus is on newer adapters, we would like to have the
ability to disable the support for old gen2 adapters.

This can be done by turning off the MLX4_CORE_GEN2 Kconfig flag.
We keep it turned on by default.

Signed-off-by: Slava Shwartsman <slavash@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af_netlink: ensure that NLMSG_DONE never fails in dumps

The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.

This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'netem-add-nsec-scheduling-and-slot-feature'

Dave Taht says:

====================
netem: add nsec scheduling and slot feature

This patch series converts netem away from the old "ticks" interface and
userspace API, and adds support for a new "slot" feature intended to
emulate bursty macs such as WiFi and LTE better.

Changes since v2:
Use u64 for packet_len_sched_time()
Use simpler max(time_to_send,q->slot.slot_next)

Changes since v1:
Always pass new nanosecond APIs to userspace
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

netem: support delivering packets in delayed time slots

Slotting is a crude approximation of the behaviors of shared media such
as cable, wifi, and LTE, which gather up a bunch of packets within a
varying delay window and deliver them, relative to that, nearly all at
once.

It works within the existing loss, duplication, jitter and delay
parameters of netem. Some amount of inherent latency must be specified,
regardless.

The new "slot" parameter specifies a minimum and maximum delay between
transmission attempts.

The "bytes" and "packets" parameters can be used to limit the amount of
information transferred per slot.

Examples of use:

tc qdisc add dev eth0 root netem delay 200us \
         slot 800us 10ms bytes 64k packets 42

A more correct example, using stacked netem instances and a packet limit
to emulate a tail drop wifi queue with slots and variable packet
delivery, with a 200Mbit isochronous underlying rate, and 20ms path
delay:

tc qdisc add dev eth0 root handle 1: netem delay 20ms rate 200mbit \
         limit 10000
tc qdisc add dev eth0 parent 1:1 handle 10:1 netem delay 200us \
         slot 800us 10ms bytes 64k packets 42 limit 512

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netem: add uapi to express delay and jitter in nanoseconds

netem userspace has long relied on a horrible /proc/net/psched hack
to translate the current notion of "ticks" to nanoseconds.

Expressing latency and jitter instead, in well defined nanoseconds,
increases the dynamic range of emulated delays and jitter in netem.

It will also ease a transition where reducing a tick to nsec
equivalence would constrain the max delay in prior versions of
netem to only 4.3 seconds.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netem: convert to qdisc_watchdog_schedule_ns

Upgrade the internal netem scheduler to use nanoseconds rather than
ticks throughout.

Convert to and from the std "ticks" userspace api automatically,
while allowing for finer grained scheduling to take place.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: try not to take rtnl_lock in ip6mr_sk_done

Avoid traversing the list of mr6_tables (which requires the
rtnl_lock) in ip6mr_sk_done(), when we know in advance that
a match will not be found.
This can happen when rawv6_close()/ip6mr_sk_done() is invoked
on non-mroute6 sockets.
This patch helps reduce rtnl_lock contention when destroying
a large number of net namespaces, each having a non-mroute6
raw socket.

v2: same patch, only fixed subject line and expanded comment.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: realtek: r8169: remove redundant assignment to giga_ctrl

The variable giga_ctrl is being assigned to zero however this is
never read and hence the assignment is redundant, so remove it.
Cleans up clang warning:

drivers/net/ethernet/realtek/r8169.c:1978:3: warning: Value stored
to 'giga_ctrl' is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lan9303: Documentation: Add missing word "Mbps"

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lan9303: Fix lan9303_alr_del_port()

Fix embarrassing bug in lan9303_alr_del_port(): Instead of zeroing
entr->mac_addr, I destroyed the next cache entry. Affected .port_fdb_del and
.port_mdb_del.

Fixes: 0620427ea0d6 ("net: dsa: lan9303: Add fdb/mdb manipulation")
Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

timers: Add a function to start/reduce a timer

Add a function, similar to mod_timer(), that will start a timer if it isn't
running and will modify it if it is running and has an expiry time longer
than the new time.  If the timer is running with an expiry time that's the
same or sooner, no change is made.

The function looks like:

int timer_reduce(struct timer_list *timer, unsigned long expires);

This can be used by code such as networking code to make it easier to share
a timer for multiple timeouts.  For instance, in upcoming AF_RXRPC code,
the rxrpc_call struct will maintain a number of timeouts:

unsigned long ack_at;
unsigned long resend_at;
unsigned long ping_at;
unsigned long expect_rx_by;
unsigned long expect_req_by;
unsigned long expect_term_by;

each of which is set independently of the others.  With timer reduction
available, when the code needs to set one of the timeouts, it only needs to
look at that timeout and then call timer_reduce() to modify the timer,
starting it or bringing it forward if necessary.  There is no need to refer
to the other timeouts to see which is earliest and no need to take any lock
other than, potentially, the timer lock inside timer_reduce().

Note, that this does not protect against concurrent invocations of any of
the timer functions.

As an example, the expect_rx_by timeout above, which terminates a call if
we don't get a packet from the server within a certain time window, would
be set something like this:

unsigned long now = jiffies;
unsigned long expect_rx_by = now + packet_receive_timeout;
WRITE_ONCE(call->expect_rx_by, expect_rx_by);
timer_reduce(&call->timer, expect_rx_by);

The timer service code (which might, say, be in a work function) would then
check all the timeouts to see which, if any, had triggered, deal with
those:

t = READ_ONCE(call->ack_at);
if (time_after_eq(now, t)) {
cmpxchg(&call->ack_at, t, now + MAX_JIFFY_OFFSET);
set_bit(RXRPC_CALL_EV_ACK, &call->events);
}

and then restart the timer if necessary by finding the soonest timeout that
hasn't yet passed and then calling timer_reduce().

The disadvantage of doing things this way rather than comparing the timers
each time and calling mod_timer() is that you *will* take timer events
unless you can finish what you're doing and delete the timer in time.

The advantage of doing things this way is that you don't need to use a lock
to work out when the next timer should be set, other than the timer's own
lock - which you might not have to take.

[ tglx: Fixed weird formatting and adopted it to pending changes ]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keyrings@vger.kernel.org
Cc: linux-afs@lists.infradead.org
Link: https://lkml.kernel.org/r/151023090769.23050.1801643667223880753.stgit@warthog.procyon.org.uk

pstore: Use ktime_get_real_fast_ns() instead of __getnstimeofday()

__getnstimeofday() is a rather odd interface, with a number of quirks:

- The caller may come from NMI context, but the implementation is not NMI safe,
  one way to get there from NMI is

      NMI handler:
        something bad
          panic()
            kmsg_dump()
              pstore_dump()
                 pstore_record_init()
                   __getnstimeofday()

- The calling conventions are different from any other timekeeping functions,
  to deal with returning an error code during suspended timekeeping.

Address the above issues by using a completely different method to get the
time: ktime_get_real_fast_ns() is NMI safe and has a reasonable behavior
when timekeeping is suspended: it returns the time at which it got
suspended. As Thomas Gleixner explained, this is safe, as
ktime_get_real_fast_ns() does not call into the clocksource driver that
might be suspended.

The result can easily be transformed into a timespec structure. Since
ktime_get_real_fast_ns() was not exported to modules, add the export.

The pstore behavior for the suspended case changes slightly, as it now
stores the timestamp at which timekeeping was suspended instead of storing
a zero timestamp.

This change is not addressing y2038-safety, that's subject to a more
complex follow up patch.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Colin Cross <ccross@android.com>
Link: https://lkml.kernel.org/r/20171110152530.1926955-1-arnd@arndb.de

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Use after free in vlan, from Cong Wang.

2) Handle NAPI poll with a zero budget properly in mlx5 driver, from
    Saeed Mahameed.

3) If DMA mapping fails in mlx5 driver, NULL out page, from Inbar
    Karmy.

4) Handle overrun in RX FIFO of sun4i CAN driver, from Gerhard
    Bertelsmann.

5) Missing return in mdb and vlan prepare phase of DSA layer, from
    Vivien Didelot.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  vlan: fix a use-after-free in vlan_device_event()
  net: dsa: return after vlan prepare phase
  net: dsa: return after mdb prepare phase
  can: ifi: Fix transmitter delay calculation
  tcp: fix tcp_fastretrans_alert warning
  tcp: gso: avoid refcount_t warning from tcp_gso_segment()
  can: peak: Add support for new PCIe/M2 CAN FD interfaces
  can: sun4i: handle overrun in RX FIFO
  can: c_can: don't indicate triple sampling support for D_CAN
  net/mlx5e: Increase Striding RQ minimum size limit to 4 multi-packet WQEs
  net/mlx5e: Set page to null in case dma mapping fails
  net/mlx5e: Fix napi poll with zero budget
  net/mlx5: Cancel health poll before sending panic teardown command
  net/mlx5: Loop over temp list to release delay events
  rds: ib: Fix NULL pointer dereference in debug code

Merge tag 'wireless-drivers-next-for-davem-2017-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 4.15

Last minute patches before the merge window. Not really anything
special standing out, mostly fixes or cleanup and some minor new
features.

Major changes:

iwlwifi

* some new PCI IDs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: Remove unused skb_shared_info member

ip6_frag_id was only used by UFO, which has been removed.
ipv6_proxy_select_ident() only existed to set ip6_frag_id and has no
in-tree callers.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tp-avoid-aliasing-tunnels-socket-pointer'

Guillaume Nault says:

====================
l2tp: avoid aliasing tunnels socket pointer

We don't need to copy the tunnel's socket pointer in the pseudo-wire
specific session structures. This uselessly complicates the code
and hampers evolution.

This series was part of an effort to protect tunnels socket pointer
with RCU. But since it provides nice cleanup, I submit it separately.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove the .tunnel_sock field from struct pppol2tp_session

The last user of .tunnel_sock is pppol2tp_connect() which defensively
uses it to verify internal data consistency.

This check isn't necessary: l2tp_session_get() guarantees that the
returned session belongs to the tunnel passed as parameter. And
.tunnel_sock is never updated, so checking that it still points to
the parent tunnel socket is useless; that test can never fail.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: avoid using ->tunnel_sock for getting session's parent tunnel

Sessions don't need to use l2tp_sock_to_tunnel(xxx->tunnel_sock) for
accessing their parent tunnel. They have the .tunnel field in the
l2tp_session structure for that. Furthermore, in all these cases, the
session is registered, so we're guaranteed that .tunnel isn't NULL and
that the session properly holds a reference on the tunnel.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: remove .tunnel_sock from struct l2tp_eth

This field has never been used.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-b53-Turn-on-Broadcom-tags'

Florian Fainelli says:

====================
net: dsa: b53: Turn on Broadcom tags

This was long overdue, with this patch series, the b53 driver now
turns on Broadcom tags except for 5325 and 5365 which use an older
format that we do not support yet (TBD).

First patch is necessary in order for bgmac, used on BCM5301X and Northstar
Plus to work correctly and successfully send ARP packets back to the requsester.

Second patch is actually a bug fix, but because net/master and net-next/master
diverge in that area, I am targeting net-next/master here.

Finally, the last patch enables Broadcom tags after checking that the CPU port
selected is either, 5, 7 or 8, since those are the only valid combinations
given currently supported HW.

Changes in v3:

- guarded padding with netdev_uses_dsa() to let the non-DSA use cases
not have a performance hit for smaller packets

- added missing select NET_DSA_TAG_BRCM to drivers/net/dsa/b53/Kconfig

Changes in v2:

- moved a hunk between patch 2 and patch 3 to avoid a bisectability issue
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Turn on Broadcom tags

Enable Broadcom tags for b53 devices, except 5325 and 5365 which use a
different Broadcom tag format not yet supported by net/dsa/tag_brcm.c.

We also make sure that we can turn on Broadcom tags on a CPU port number
that is capable of that: 5, 7 or 8.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Stop using dev->cpu_port incorrectly

dev->cpu_port is the driver local information that should only be used
to look up register offsets for a particular port, when they differ
(e.g: IMP port override), but it should certainly not be used in place
of the DSA configured CPU port.

Since the DSA switch layer calls port_vlan_{add,del}() on the CPU port
as well, we can remove the specific setting of the CPU port within
port_vlan_{add,del}.

Fixes: ff39c2d68679 ("net: dsa: b53: Add bridge support")
Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bgmac: Pad packets to a minimum size

In preparation for enabling Broadcom tags with b53, pad packets to a
minimum size of 64 bytes (sans FCS) in order for the Broadcom switch to
accept ingressing frames. Without this, we would typically be able to
DHCP, but not resolve with ARP because packets are too small and get
rejected by the switch.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-fixes-for-4.14-20171110' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2017-11-10

this is a pull request for net/master.

The first patch by Richard Schütz for the c_can driver removes the false
indication to support triple sampling for d_can. Gerhard Bertelsmann's
patch for the sun4i driver improves the RX overrun handling. The patch
by Stephane Grosjean for the peak_canfd driver adds the PCI ids for
various new PCIe/M2 interfaces. Marek Vasut's patch for the ifi driver
fix transmitter delay calculation.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-lan9303-IGMP-handling'

Egil Hjelmeland says:

====================
net: dsa: lan9303: IGMP handling

Set up the HW switch to trap IGMP packets to CPU port.
And make sure skb->offload_fwd_mark is cleared for incoming IGMP packets.

skb->offload_fwd_mark calculation is a candidate for consolidation into the
DSA core. The calculation can probably be more polished when done at a point
where DSA has updated skb.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lan9303: Clear offload_fwd_mark for IGMP

Now that IGMP packets no longer is flooded in HW, we want the SW bridge to
forward packets based on bridge configuration. To make that happen,
IGMP packets must have skb->offload_fwd_mark = 0.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: lan9303: Set up trapping of IGMP to CPU port

IGMP packets should be trapped to the CPU port. The SW bridge knows
whether to forward to other ports.

With "IGMP snooping for local traffic" merged, IGMP trapping is also
required for stable IGMPv2 operation.

LAN9303 does not trap IGMP packets by default.

Enable IGMP trapping in lan9303_setup.

Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: collect vpd info directly from hardware

Collect vpd information directly from hardware instead of software
adapter context. Move EEPROM physical address to virtual address
translation logic to t4_hw.c and update relevant files.

Fixes: 6f92a6544f1a ("cxgb4: collect hardware misc dumps")
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-fixes-2017-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
Mellanox, mlx5 fixes 2017-11-08

The following series includes some fixes for mlx5 core and etherent
driver.

Sorry for the late submission but as you can see i have some very
critical fixes below that i would like them merged into this RC.

Please pull and let me know if there is any problem.

For -stable:
('net/mlx5e: Set page to null in case dma mapping fails') kernels >= 4.13
('net/mlx5: FPGA, return -EINVAL if size is zero') kernels >= 4.13
('net/mlx5: Cancel health poll before sending panic teardown command') kernels >= 4.13

V1->V2:
- Fix Reviewed-by tag of the 2nd patch.
- Drop the FPGA 0 size fix, it needs some more change log info.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

vlan: fix a use-after-free in vlan_device_event()

After refcnt reaches zero, vlan_vid_del() could free
dev->vlan_info via RCU:

RCU_INIT_POINTER(dev->vlan_info, NULL);
call_rcu(&vlan_info->rcu, vlan_info_rcu_free);

However, the pointer 'grp' still points to that memory
since it is set before vlan_vid_del():

        vlan_info = rtnl_dereference(dev->vlan_info);
        if (!vlan_info)
                goto out;
        grp = &vlan_info->grp;

Depends on when that RCU callback is scheduled, we could
trigger a use-after-free in vlan_group_for_each_dev()
right following this vlan_vid_del().

Fix it by moving vlan_vid_del() before setting grp. This
is also symmetric to the vlan_vid_add() we call in
vlan_device_event().

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct")
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Fix stats histogram mode

The statistics histogram mode was not being explicitly initialized on
devices other than the 6390 family. Clearing the statistics then
overwrote the default setting, setting the histogram to a reserved
mode.

Explicitly set the histogram mode for all devices. Change the
statistics clear into a read/modify/write, and since it is now more
complex, move it into global1.c.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6xxx-broadcast-flooding-in-hardware'

Andrew Lunn says:

====================
mv88e6xxx broadcast flooding in hardware

This patchset makes the mv88e6xxx driver perform flooding in hardware,
rather than let the software bridge perform the flooding. This is a
prerequisite for IGMP snooping on the bridge interface.

In order to make hardware broadcasting work, a few other issues need
fixing or improving. SWITCHDEV_ATTR_ID_PORT_PARENT_ID is broken, which
is apparent when testing on the ZII devel board with multiple
switches.

Some of these patches are taken from a previous RFC patchset of IGMP
support.

Rebased onto net-next, with fixup for Vivien's refactoring.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Flood broadcast frames in hardware

By default, the switch does not flood broadcast frames. Instead the
broadcast address is unknown in the ATU, so the frame gets forwarded
out the cpu port. The software bridge then floods it back to the
individual switch ports which are members of the bridge.

Add an ATU entry in the switch so that it floods broadcast frames out
ports, rather than have the software bridge do it. Also, send a copy
out the cpu port and any dsa ports. Rely on the port vectors to
prevent broadcast frames leaking between bridges, and separated ports.

Additionally, when a VLAN is added, a new FID is allocated. This
represents a new table of ATU entries. A broadcast entry is added to
the new FID.

With offload_fwd_mark being set, the software bridge will not flood
the frames it receives back to the switch.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Move mv88e6xxx_port_db_load_purge()

This function is going to be needed by a soon to be added new
function. Move it earlier so we can avoid a forward declaration.
No functional changes.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Print offending port when vlan check fails

When testing if a VLAN is one more than one bridge, we print an error
message that the VLAN is already in use somewhere else. Print both the
new port which would like the VLAN, and the port which already has it,
to aid debugging.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Fixed port netdev check for VLANs

Having the same VLAN on multiple bridges is currently unsupported as
an offload. mv88e6xxx_port_check_hw_vlan() is used to ensure that a
VLAN is not on multiple bridges when adding a VLAN range to a port. It
loops the ports and checks to see if there are ports in a different
bridge with the same VLAN.

While walking all switch ports, the code was checking if the new port
has a netdev slave attached to it. If not, skip checking the port
being walked. This seems like a typ0. If the new port does not have a
slave, how has a VLAN been added to it in the first place, requiring
this check be performed at all? More likely, we should be checking if
the port being walked has a slave. Without the port having a slave, it
cannot have a VLAN on it, so there is no need to check further for
that particular port.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: {e}dsa: set offload_fwd_mark on received packets

The software bridge needs to know if a packet has already been bridged
by hardware offload to ports in the same hardware offload, in order
that it does not re-flood them, causing duplicates. This is
particularly true for broadcast and multicast traffic which the host
has requested.

By setting offload_fwd_mark in the skb the bridge will only flood to
ports in other offloads and other netifs. Set this flag in the DSA and
EDSA tag driver.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

SWITCHDEV_ATTR_ID_PORT_PARENT_ID is used by the software bridge when
determining which ports to flood a packet out. If the packet
originated from a switch, it assumes the switch has already flooded
the packet out the switches ports, so the bridge should not flood the
packet itself out switch ports. Ports on the same switch are expected
to return the same parent ID when SWITCHDEV_ATTR_ID_PORT_PARENT_ID is
called.

DSA gets this wrong with clusters of switches. As far as the software
bridge is concerned, the cluster is all one switch. A packet from any
switch in the cluster can be assumed to have been flooded as needed
out of all ports of the cluster, not just the switch it originated
from. Hence all ports of a cluster should return the same parent. The
old implementation did not, each switch in the cluster had its own ID.

Also wrong was that the ID was not unique if multiple DSA instances
are in operation.

Use the tree ID as the parent ID, which is the same for all switches
in a cluster and unique across switch clusters.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ieee802154-for-davem-2017-11-09' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next

Stefan Schmidt says:

====================
pull-request: net-next: ieee802154 2017-11-09

A small update on ieee802154 patches for net-next. Nothing dramatic, but simply
housekeeping this time around.
A fix for the correct mask to be applied in the mrf24j40 driver by Gustavo A. R. Silva
Removal of a non existing email user for the ca8210 driver by Harry Morris
A bunch of checkpatch cleanups across the subsystem from myself
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

bindings: net: stmmac: correctify note about LPI interrupt

There are two different combined signal for various interrupt events:
In EQOS-CORE and EQOS-MTL configurations, mci_intr_o is the interrupt
signal.
In EQOS-DMA, EQOS-AHB and EQOS-AXI configurations, these interrupt events
are combined with the events in the DMA on the sbd_intr_o signal.

Depending on configuration, the device tree irq "macirq" will refer to
either mci_intr_o or sbd_intr_o.

The databook states:
"The MAC generates the LPI interrupt when the Tx or Rx side enters or exits
the LPI state. The interrupt mci_intr_o (sbd_intr_o in certain
configurations) is asserted when the LPI interrupt status is set.

When the MAC exits the Rx LPI state, then in addition to the mci_intr_o
(sbd_intr_o in certain configurations), the sideband signal lpi_intr_o is
asserted.

If you do not want to gate-off the application clock during the Rx LPI
state, you can leave the lpi_intr_o signal unconnected and use the
mci_intr_o (sbd_intr_o in certain configurations) signal to detect Rx LPI
exit."

Since the "macirq" is always raised when Tx or Rx enters/exits the LPI
state, "eth_lpi" must therefore refer to lpi_intr_o, which is only raised
when Rx exits the LPI state. Update the DT binding description to reflect
reality.

Signed-off-by: Niklas Cassel <niklas.cassel@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipvlan: fix ipv6 outbound device

When process the outbound packet of ipv6, we should assign the master
device to output device other than input device.

Signed-off-by: Keefe Liu <liuqifa@huawei.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: fix double free error

This patch fixes an error in memory allocation/freeing in
ThunderX PF driver.

I moved the allocation to the probe() function and made it managed.

>From the Colin's email:

While running static analysis on linux-next with CoverityScan I found 3
double free errors in the Cavium thunder driver.

The issue occurs on the err_disable_device: label of function nic_probe
when nic_free_lmacmem(nic) is called and a double free occurs on
nic->duplex, nic->link and nic->speed.  This occurs when nic_init_hw()
fails:

        /* Initialize hardware */
        err = nic_init_hw(nic);
        if (err)
                goto err_release_regions;

nic_init_hw() calls nic_get_hw_info() and this calls nic_free_lmacmem()
if any of the allocations fail. This free'ing occurs again by the call
to nic_free_lmacmem() on the err_release_regions exit path in nic_probe().

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderbolt: Clear finished Tx frame bus address in tbnet_tx_callback()

When Thunderbolt network interface is disabled or when the cable is
unplugged the driver releases all allocated buffers by calling
tbnet_free_buffers() for each ring. This function then calls
dma_unmap_page() for each buffer it finds where bus address is non-zero.
Now, we only clear this bus address when the Tx buffer is sent to the
hardware so it is possible that the function finds an entry that has
already been unmapped.

Enabling DMA-API debugging catches this as well:

thunderbolt 0000:06:00.0: DMA-API: device driver tries to free DMA
memory it has not allocated [device address=0x0000000068321000] [size=4096 bytes]

Fix this by clearing the bus address of a Tx frame right after we have
unmapped the buffer.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: Remove the global prot_inuse counter.

The per-cpu counter for init_net is prepared in core_initcall.
The patch 7d720c3e ("percpu: add __percpu sparse annotations to net")
and d6d9ca0fe ("net: this_cpu_xxx conversions") optimize the
routines. Then remove the old counter.

Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sfc: remove redundant variable start

Variable start is assigned but never read hence it is redundant
and can be removed. Cleans up clang warning:

drivers/net/ethernet/sfc/ptp.c:655:2: warning: Value stored to 'start'
is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Bert Kenward <bkenward@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

qlge: remove duplicated assignment to mbcp

The assignment to mbcp is identical to the initiatialized value assigned
to mbcp at declaration time a few lines earlier, hence we can remove the
second redundant assignment. Cleans up clang warning:

drivers/net/ethernet/qlogic/qlge/qlge_mpi.c:209:22: warning:
Value stored to 'mbcp' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>