git.proxmox.com Git - mirror_ubuntu-hirsute-kernel.git/log
mirror_ubuntu-hirsute-kernel.git
7 years ago net: phy: Add rockchip PHY driver support
David Wu [Thu, 10 Aug 2017 13:56:40 +0000 (21:56 +0800)]
net: phy: Add rockchip PHY driver support

Currently this supports the integrated Ethernet PHY.

Signed-off-by: David Wu <david.wu@rock-chips.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago forcedeth: replace init_timer_deferrable with setup_deferrable_timer
Zhu Yanjun [Thu, 10 Aug 2017 08:13:12 +0000 (04:13 -0400)]
forcedeth: replace init_timer_deferrable with setup_deferrable_timer

Replace init_timer_deferrable with setup_deferrable_timer to simplify
the source code.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago drivers: net: davinci_mdio: print bus frequency
Maxim Uvarov [Thu, 10 Aug 2017 07:47:47 +0000 (10:47 +0300)]
drivers: net: davinci_mdio: print bus frequency

The frequency can be adjusted in DT, so it makes sense to
print the value currently in use on driver init.

Signed-off-by: Max Uvarov <muvarov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago drivers: net: davinci_mdio: remove busy loop on wait user access
Maxim Uvarov [Thu, 10 Aug 2017 07:47:46 +0000 (10:47 +0300)]
drivers: net: davinci_mdio: remove busy loop on wait user access

Polling 14 MDIO devices on a single MDIO bus eats 30% of a 1 GHz CPU's
time due to the busy loop in wait(). Add a small delay to relax the CPU.

Signed-off-by: Max Uvarov <muvarov@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
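
The effect of adding a delay to a polling loop can be modelled outside the kernel. The following is a minimal Python sketch, not the driver's code: `wait_for_idle`, `make_fake_bus`, and the timeout and delay values are hypothetical names and numbers chosen for illustration. It only shows that sleeping between polls bounds the number of status reads instead of spinning.

```python
import time


def wait_for_idle(read_status, timeout_s=0.1, delay_s=0.001):
    """Poll read_status() until it returns True or the timeout expires.

    Returns (ok, polls) so the caller can see how many reads were needed.
    """
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        if read_status():
            return True, polls
        if time.monotonic() >= deadline:
            return False, polls
        # The small delay: give the CPU back instead of re-reading at once.
        time.sleep(delay_s)


def make_fake_bus(busy_reads):
    """Hypothetical stand-in for the hardware: busy for the first N reads."""
    reads = 0

    def read_status():
        nonlocal reads
        reads += 1
        return reads > busy_reads

    return read_status
```

With a delay of `delay_s` seconds, a wait that lasts `t` seconds costs roughly `t / delay_s` reads, where a busy loop would issue as many reads as the CPU can manage in that time.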
7 years ago Merge branch 'netvsc-minor-fixes-and-improvements'
David S. Miller [Fri, 11 Aug 2017 21:00:14 +0000 (14:00 -0700)]
Merge branch 'netvsc-minor-fixes-and-improvements'

Stephen Hemminger says:

====================
netvsc: minor fixes and improvements

These are non-critical bug fixes, related to functionality now in net-next.
 1. delaying the automatic bring up of VF device to allow udev to change name.
 2. performance improvement
 3. handle MAC address change with VF; mostly propagate the error that VF gives.
 4. minor cleanups
 5. allow setting send/receive buffer size with ethtool.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: keep track of some non-fatal overload conditions
stephen hemminger [Thu, 10 Aug 2017 00:46:12 +0000 (17:46 -0700)]
netvsc: keep track of some non-fatal overload conditions

Add ethtool statistics for the case where the send chimney buffer is
exhausted and the driver has to fall back to doing a scatter/gather
send. Also add a statistic for the case where the ring buffer is full
and receive completions are delayed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: allow controlling send/recv buffer size
stephen hemminger [Thu, 10 Aug 2017 00:46:11 +0000 (17:46 -0700)]
netvsc: allow controlling send/recv buffer size

Control the size of the buffer areas via ethtool ring settings.
They aren't really traditional hardware rings, but the host API breaks
the receive and send buffers into chunks. The final size of the chunks
is controlled by the host.

The default value of send and receive buffer area for host DMA
is much larger than it needs to be. Experimentation shows that
4M receive and 1M send is sufficient.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: remove unnecessary check for NULL hdr
stephen hemminger [Thu, 10 Aug 2017 00:46:10 +0000 (17:46 -0700)]
netvsc: remove unnecessary check for NULL hdr

The function init_page_array is always called with a valid pointer
to RNDIS header. No check for NULL is needed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: remove unnecessary cast of void pointer
stephen hemminger [Thu, 10 Aug 2017 00:46:09 +0000 (17:46 -0700)]
netvsc: remove unnecessary cast of void pointer

Assignment to a typed pointer is sufficient in C.
No cast is needed.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: whitespace cleanup
stephen hemminger [Thu, 10 Aug 2017 00:46:08 +0000 (17:46 -0700)]
netvsc: whitespace cleanup

Fix some minor indentation issues.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: no need to allocate send/receive on numa node
stephen hemminger [Thu, 10 Aug 2017 00:46:07 +0000 (17:46 -0700)]
netvsc: no need to allocate send/receive on numa node

The send and receive buffers are both per-device (not per-channel).
The associated NUMA node is a property of the CPU, which is per-channel,
so it makes no sense to force the receive/send buffer to be
allocated on a particular node (since it is a shared resource).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: check error return when restoring channels and mtu
stephen hemminger [Thu, 10 Aug 2017 00:46:06 +0000 (17:46 -0700)]
netvsc: check error return when restoring channels and mtu

If setting new values fails and the attempt to restore the original
settings also fails, then log an error and leave the device down.
This should never happen, but if it does, don't go down in flames.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: propagate MAC address change to VF slave
stephen hemminger [Thu, 10 Aug 2017 00:46:05 +0000 (17:46 -0700)]
netvsc: propagate MAC address change to VF slave

If VF is slaved to synthetic device, then any change to netvsc
MAC address should be propagated to the slave device.

If slave device doesn't support MAC address change then it
should also be an error to attempt to change synthetic NIC MAC
address.

It also fixes the error unwind in the original code.
If given a bad address, the old code would change the device
MAC address anyway.

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
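
The propagate-then-unwind behaviour described above can be sketched as follows. This is a Python model under stated assumptions, not the netvsc code: the `Dev` class, `set_mac_with_vf`, and the order in which the two devices are updated are all hypothetical. The point is only that a failure on either device leaves both addresses as they were.

```python
class Dev:
    """Hypothetical device model: holds a MAC and may refuse changes."""

    def __init__(self, mac, allow_change=True):
        self.mac = mac
        self.allow_change = allow_change

    def set_mac(self, mac):
        if not self.allow_change:
            raise OSError("MAC address change not supported")
        self.mac = mac


def set_mac_with_vf(synthetic, vf, new_mac):
    """Change the VF first, then the synthetic NIC; undo the VF on failure."""
    old_vf_mac = vf.mac if vf is not None else None
    if vf is not None:
        # If the VF refuses, the error propagates and nothing has changed.
        vf.set_mac(new_mac)
    try:
        synthetic.set_mac(new_mac)
    except OSError:
        if vf is not None:
            # Error unwind: put the VF back to its previous address.
            vf.mac = old_vf_mac
        raise
```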
7 years ago netvsc: don't signal host twice if empty
stephen hemminger [Thu, 10 Aug 2017 00:46:04 +0000 (17:46 -0700)]
netvsc: don't signal host twice if empty

When hv_pkt_iter_next() returns NULL, it has already called
hv_pkt_iter_close(). Calling it twice can lead to an extra host signal.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago netvsc: delay setup of VF device
stephen hemminger [Thu, 10 Aug 2017 00:46:03 +0000 (17:46 -0700)]
netvsc: delay setup of VF device

When a VF device is discovered, delay bringing it up automatically in
order to allow userspace to make some simple changes (like renaming).

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago phylink: Fix an uninitialized variable bug
Dan Carpenter [Wed, 9 Aug 2017 21:35:50 +0000 (00:35 +0300)]
phylink: Fix an uninitialized variable bug

"ret" isn't necessarily initialized here.

Fixes: 9525ae83959b ("phylink: add phylink infrastructure")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago liquidio: removed check for queue size alignment
Intiyaz Basha [Wed, 9 Aug 2017 20:28:04 +0000 (13:28 -0700)]
liquidio: removed check for queue size alignment

There is no restriction on queue size alignment.  Hence removing check for
valid queue size.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago liquidio: rx/tx queue cleanup
Intiyaz Basha [Wed, 9 Aug 2017 19:07:08 +0000 (12:07 -0700)]
liquidio: rx/tx queue cleanup

When deleting a queue, clear its corresponding bit in the qmask, vfree its
memory, clear out the pointer that's pointing to it, and decrement the
queue count.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <fmanlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'net-sched-let-the-offloader-decide-what-to-offload'
David S. Miller [Fri, 11 Aug 2017 20:47:01 +0000 (13:47 -0700)]
Merge branch 'net-sched-let-the-offloader-decide-what-to-offload'

Jiri Pirko says:

====================
net: sched: let the offloader decide what to offload

Currently there is a Qdisc_class_ops->tcf_cl_offload callback
that is called to find out whether cls would offload a rule or not.
This is only supported by sch_ingress and sch_clsact,
so the qdiscs get to decide. However, the driver knows what it is
able to offload, so move the decision making to the drivers completely.
Just pass the classid there and provide a set of helpers to allow
identification of the qdisc.

As a side effect, this actually allows clsact egress rules
offload in mlxsw.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: remove cops->tcf_cl_offload
Jiri Pirko [Wed, 9 Aug 2017 12:30:35 +0000 (14:30 +0200)]
net: sched: remove cops->tcf_cl_offload

cops->tcf_cl_offload is no longer needed, as the drivers check what they
can and cannot offload using the classid identification helpers. So remove this.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: remove handle propagation down to the drivers
Jiri Pirko [Wed, 9 Aug 2017 12:30:34 +0000 (14:30 +0200)]
net: sched: remove handle propagation down to the drivers

There is no longer any need to use the handle in drivers, so remove it
from the tc_cls_common_offload struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: use newly added classid identity helpers
Jiri Pirko [Wed, 9 Aug 2017 12:30:33 +0000 (14:30 +0200)]
net: sched: use newly added classid identity helpers

Instead of checking the handle, which does not carry the inner class
information and leads drivers to wrongly treat clsact->egress as ingress,
use the newly introduced classid identification helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: propagate classid down to offload drivers
Jiri Pirko [Wed, 9 Aug 2017 12:30:32 +0000 (14:30 +0200)]
net: sched: propagate classid down to offload drivers

Drivers need the classid to decide whether they support this specific
qdisc+class or not. So propagate it down via the tc_cls_common_offload struct.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: Add helpers to identify classids
Jiri Pirko [Wed, 9 Aug 2017 12:30:31 +0000 (14:30 +0200)]
net: sched: Add helpers to identify classids

Offloading drivers need to understand what qdisc class a filter is added
to. Currently they only need to identify ingress, clsact->ingress and
clsact->egress. So provide these helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
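
To see why the classid distinguishes what the handle cannot, the bit layout can be modelled in a few lines. The sketch below is in Python rather than kernel C. The constant values are the ones from include/uapi/linux/pkt_sched.h; the function names are hypothetical, and the two predicates are one plausible formulation of such helpers rather than a copy of the ones this commit adds.

```python
# Constants as defined in include/uapi/linux/pkt_sched.h.
TC_H_MAJ_MASK = 0xFFFF0000
TC_H_MIN_MASK = 0x0000FFFF
TC_H_INGRESS = 0xFFFFFFF1
TC_H_CLSACT = TC_H_INGRESS
TC_H_MIN_INGRESS = 0xFFF2
TC_H_MIN_EGRESS = 0xFFF3


def tc_h_maj(h):
    """Major part of a handle or classid (upper 16 bits, kept in place)."""
    return h & TC_H_MAJ_MASK


def tc_h_min(h):
    """Minor part of a handle or classid (lower 16 bits)."""
    return h & TC_H_MIN_MASK


def tc_h_make(maj, minor):
    """Combine a major and a minor part into one 32-bit classid."""
    return (maj & TC_H_MAJ_MASK) | (minor & TC_H_MIN_MASK)


# Hypothetical helper names; the logic is an assumption for illustration.
def classid_is_clsact_egress(classid):
    return (tc_h_maj(classid) == tc_h_maj(TC_H_CLSACT)
            and tc_h_min(classid) == TC_H_MIN_EGRESS)


def classid_is_ingress_side(classid):
    # True for clsact ingress and for the plain ingress qdisc.
    return (tc_h_maj(classid) == tc_h_maj(TC_H_CLSACT)
            and tc_h_min(classid) != TC_H_MIN_EGRESS)
```

Both directions of clsact share the major part 0xFFFF0000, so a check on the major part alone cannot tell them apart; only the minor part of the classid (0xFFF2 versus 0xFFF3) does.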
7 years ago geneve: use netlink_ext_ack for error reporting in rtnl operations
Girish Moodalbail [Wed, 9 Aug 2017 08:09:28 +0000 (01:09 -0700)]
geneve: use netlink_ext_ack for error reporting in rtnl operations

Add extack error messages for failure paths while creating/modifying
geneve devices. Once extack support is added to iproute2, more
meaningful and helpful error messages will be displayed making it easy
for users to discern what went wrong.

Before:
=======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
  remote 192.168.13.2
RTNETLINK answers: Invalid argument

After:
======
$ ip link add gen1 address 0:1:2:3:4:5:6 type geneve id 200 \
  remote 192.168.13.2
Error: Provided link layer address is not Ethernet

Also, netdev_dbg() calls used to log errors associated with Netlink
request have been removed.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
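
The difference between the two outputs above comes down to returning a message alongside the error code. The following Python sketch models that idea only; `parse_lladdr`, `validate_old`, and `validate_new` are hypothetical names, and the length check is an assumption about why the seven-octet address in the example is rejected.

```python
import errno

ETH_ALEN = 6  # an Ethernet MAC address is six octets long


def parse_lladdr(text):
    """Parse colon-separated hex octets, e.g. "0:1:2:3:4:5", into bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def validate_old(addr):
    """Old style: only an error code, so userspace prints a generic text."""
    if len(addr) != ETH_ALEN:
        return -errno.EINVAL, None
    return 0, None


def validate_new(addr):
    """New style: the same error code plus a human-readable reason."""
    if len(addr) != ETH_ALEN:
        return -errno.EINVAL, "Provided link layer address is not Ethernet"
    return 0, None
```

The address `0:1:2:3:4:5:6` from the example has seven octets, so both variants fail with EINVAL, but only the second can tell the user why.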
7 years ago Merge branch 'sctp-remove-typedefs-from-structures-part-6'
David S. Miller [Fri, 11 Aug 2017 17:02:45 +0000 (10:02 -0700)]
Merge branch 'sctp-remove-typedefs-from-structures-part-6'

Xin Long says:

====================
sctp: remove typedefs from structures part 6

As we know, using typedefs is discouraged in the kernel, and checkpatch.pl
even warns about them. sctp currently uses typedefs for many structures.

All of these typedef uses should be removed. This patchset is part 6 of
the effort to remove all typedefs in include/net/sctp/structs.h, command.h
and sm.h.

Just as in parts 1-5, these patches do not change any code logic; they
are cleanups only.

Note that this is the last part of the typedef cleanup. After this
patchset, there are no more inappropriate typedefs in sctp. Some code is
also tidied while removing them, such as fixing many indents and
reordering some local params, especially in the last 2 patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: fix some indents in sm_make_chunk.c
Xin Long [Fri, 11 Aug 2017 02:23:58 +0000 (10:23 +0800)]
sctp: fix some indents in sm_make_chunk.c

There are some badly indented function definitions in sm_make_chunk.c.
They have been there since the beginning, probably because the typedef
sctp_chunk_t was replaced with struct sctp_chunk.

This patchset is a good time to fix them, along with some bad indents
in other function definitions in sm_make_chunk.c.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_disposition_t
Xin Long [Fri, 11 Aug 2017 02:23:57 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_disposition_t

This patch removes the typedef sctp_disposition_t and replaces it
with enum sctp_disposition wherever the typedef is used.

It also fixes the indentation of many function definitions.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_sm_table_entry_t
Xin Long [Fri, 11 Aug 2017 02:23:56 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_sm_table_entry_t

This patch removes the typedef sctp_sm_table_entry_t and replaces it
with struct sctp_sm_table_entry wherever the typedef is used.

It also fixes some indents.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the unused typedef sctp_sm_command_t
Xin Long [Fri, 11 Aug 2017 02:23:55 +0000 (10:23 +0800)]
sctp: remove the unused typedef sctp_sm_command_t

Remove this typedef together with the struct; nothing uses it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_verb_t
Xin Long [Fri, 11 Aug 2017 02:23:54 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_verb_t

This patch removes the typedef sctp_verb_t and replaces it
with enum sctp_verb wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_arg_t
Xin Long [Fri, 11 Aug 2017 02:23:53 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_arg_t

This patch removes the typedef sctp_arg_t and replaces it
with union sctp_arg wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_cmd_seq_t
Xin Long [Fri, 11 Aug 2017 02:23:52 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmd_seq_t

This patch removes the typedef sctp_cmd_seq_t and replaces it
with struct sctp_cmd_seq wherever the typedef is used.

Note that it leaves many indents unfixed although they should be fixed,
as the removal of sctp_disposition_t would mess them up again.
It is better to fix them when removing sctp_disposition_t in
the later patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_cmd_t
Xin Long [Fri, 11 Aug 2017 02:23:51 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmd_t

This patch removes the typedef sctp_cmd_t and replaces it
with enum sctp_cmd wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_socket_type_t
Xin Long [Fri, 11 Aug 2017 02:23:50 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_socket_type_t

This patch removes the typedef sctp_socket_type_t and replaces it
with enum sctp_socket_type wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_dbg_objcnt_entry_t
Xin Long [Fri, 11 Aug 2017 02:23:49 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_dbg_objcnt_entry_t

This patch removes the typedef sctp_dbg_objcnt_entry_t and replaces it
with struct sctp_dbg_objcnt_entry wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_cmsgs_t
Xin Long [Fri, 11 Aug 2017 02:23:48 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_cmsgs_t

This patch removes the typedef sctp_cmsgs_t and replaces it
with struct sctp_cmsgs wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_endpoint_type_t
Xin Long [Fri, 11 Aug 2017 02:23:47 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_endpoint_type_t

This patch removes the typedef sctp_endpoint_type_t and replaces it
with enum sctp_endpoint_type wherever the typedef is used.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the typedef sctp_sender_hb_info_t
Xin Long [Fri, 11 Aug 2017 02:23:46 +0000 (10:23 +0800)]
sctp: remove the typedef sctp_sender_hb_info_t

This patch removes the typedef sctp_sender_hb_info_t and replaces it
with struct sctp_sender_hb_info wherever the typedef is used.

It also uses sizeof(variable) instead of sizeof(type).

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sctp: remove the unused typedef sctp_packet_phandler_t
Xin Long [Fri, 11 Aug 2017 02:23:45 +0000 (10:23 +0800)]
sctp: remove the unused typedef sctp_packet_phandler_t

Remove this function typedef; nothing uses it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Thu, 10 Aug 2017 19:11:16 +0000 (12:11 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Mainline had UFO fixes, but UFO is removed in net-next so we
take the HEAD hunks.

Minor context conflict in bcmsysport statistics bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Thu, 10 Aug 2017 17:30:29 +0000 (10:30 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix handling of initial STATE message in TIPC, from Jon Paul Maloy.

 2) Fix stats handling in bcm_sysport_get_stats(), from Florian
    Fainelli.

 3) Reject 16777215 VNI value in geneve_validate(), from Girish
    Moodalbail.

 4) Fix initial IGMP sysctl setting regression, from Nikolay Borisov.

 5) Once a UFO fragmented frame is treated as UFO, we should continue
    doing so. Likewise once a frame has been segmented, we should
    continue doing that and not try to convert it to a UFO frame. From
    Willem de Bruijn.

 6) Test the AF_PACKET RX/TX ring pg_vec state under the socket lock to
    prevent races. From Willem de Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  packet: fix tp_reserve race in packet_set_ring
  udp: consistently apply ufo or fragmentation
  net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
  igmp: Fix regression caused by igmp sysctl namespace code.
  geneve: maximum value of VNI cannot be used
  net: systemport: Fix software statistics for SYSTEMPORT Lite
  tipc: remove premature ESTABLISH FSM event at link synchronization

7 years ago packet: fix tp_reserve race in packet_set_ring
Willem de Bruijn [Thu, 10 Aug 2017 16:41:58 +0000 (12:41 -0400)]
packet: fix tp_reserve race in packet_set_ring

Updates to tp_reserve can race with reads of the field in
packet_set_ring. Avoid this by holding the socket lock during
updates in setsockopt PACKET_RESERVE.

This bug was discovered by syzkaller.

Fixes: 8913336a7e8d ("packet: add PACKET_RESERVE sockopt")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago udp: consistently apply ufo or fragmentation
Willem de Bruijn [Thu, 10 Aug 2017 16:29:19 +0000 (12:29 -0400)]
udp: consistently apply ufo or fragmentation

When iteratively building a UDP datagram with MSG_MORE and that
datagram exceeds MTU, consistently choose UFO or fragmentation.

Once skb_is_gso, always apply ufo. Conversely, once a datagram is
split across multiple skbs, do not consider ufo.

Sendpage already maintains the first invariant, only add the second.
IPv6 does not have a sendpage implementation to modify.

A gso skb must have a partial checksum, do not follow sk_no_check_tx
in udp_send_skb.

Found by syzkaller.

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'rtnetlink-fix-initial-rtnl-pushdown-fallout'
David S. Miller [Thu, 10 Aug 2017 16:50:22 +0000 (09:50 -0700)]
Merge branch 'rtnetlink-fix-initial-rtnl-pushdown-fallout'

Florian Westphal says:

====================
rtnetlink: fix initial rtnl pushdown fallout

This series fixes various bugs and splats reported since the
allow-handler-to-run-with-no-rtnl series went in.

Last patch adds a script that can be used to add further
tests in case more bugs are reported.
In case you prefer reverting the original series instead of
fixing fallout I can resend this patch on its own.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago selftests: add rtnetlink test script
Florian Westphal [Thu, 10 Aug 2017 14:53:02 +0000 (16:53 +0200)]
selftests: add rtnetlink test script

Add a simple script to exercise some rtnetlink call paths, so KASAN,
lockdep etc. can yell at the developer before patches are sent upstream.

This can be extended to also cover bond, team, vrf and the like.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: fallback to UNSPEC if current family has no doit callback
Florian Westphal [Thu, 10 Aug 2017 14:53:01 +0000 (16:53 +0200)]
rtnetlink: fallback to UNSPEC if current family has no doit callback

We need to use PF_UNSPEC in case the requested family has no doit
callback, otherwise this now fails with EOPNOTSUPP instead of running the
unspec doit callback, as before.

Fixes: 6853dd488119 ("rtnetlink: protect handler table with rcu")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
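
The fallback described above amounts to a two-step table lookup. The Python sketch below is a model, not the rtnetlink code: `find_doit` and the nested-dict handler table are hypothetical, and only the PF_UNSPEC and AF_INET values (0 and 2) are real constants.

```python
PF_UNSPEC = 0
AF_INET = 2


def find_doit(handlers, family, msgtype):
    """Look up the doit callback for (family, msgtype).

    handlers maps family -> {msgtype: callback}. If the requested family
    has no callback for this message type, fall back to PF_UNSPEC.
    Returns None when neither has one (the EOPNOTSUPP case).
    """
    doit = handlers.get(family, {}).get(msgtype)
    if doit is None:
        doit = handlers.get(PF_UNSPEC, {}).get(msgtype)
    return doit
```

Without the second lookup, a request for a family that registered no doit callback would fail even though a generic handler exists.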
7 years ago rtnetlink: init handler refcounts to 1
Florian Westphal [Thu, 10 Aug 2017 14:53:00 +0000 (16:53 +0200)]
rtnetlink: init handler refcounts to 1

When using CONFIG_REFCOUNT_FULL=y we get the following splat:
 refcount_t: increment on 0; use-after-free.
WARNING: CPU: 0 PID: 304 at lib/refcount.c:152 refcount_inc+0x47/0x50
Call Trace:
 rtnetlink_rcv_msg+0x191/0x260
 ...

This warning is harmless (0 is "no callback running", not "memory
was freed").

Use '1' as the new 'no handler is running' base instead of 0 to avoid
this.

Fixes: 019a316992ee ("rtnetlink: add reference counting to prevent module unload while dump is in progress")
Reported-by: Sabrina Dubroca <sdubroca@redhat.com>
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
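
Why moving the base from 0 to 1 silences the warning can be shown with a toy counter. This Python sketch is a simplified model of a checked reference count, not the kernel's refcount_t: the `CheckedRefcount` class and `handler_is_idle` are hypothetical, and the only behaviour modelled is that an increment from 0 is refused and reported.

```python
class CheckedRefcount:
    """Toy counter that treats 0 as "object already freed"."""

    def __init__(self, value):
        self.value = value
        self.warnings = []

    def inc(self):
        if self.value == 0:
            # A checked refcount cannot tell "idle" from "freed" at 0.
            self.warnings.append("increment on 0; use-after-free.")
            return
        self.value += 1

    def dec(self):
        self.value -= 1


def handler_is_idle(ref, base):
    """No callback is running when the count is back at its base value."""
    return ref.value == base
```

With a base of 0 the very first `inc()` trips the check; with a base of 1 the count moves between 1 (idle) and 2 or more (in use), and 0 is never reached during normal operation.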
7 years ago rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu
Florian Westphal [Thu, 10 Aug 2017 14:52:59 +0000 (16:52 +0200)]
rtnetlink: switch rtnl_link_get_slave_info_data_size to rcu

David Ahern reports the following splat:
 RTNL: assertion failed at net/core/dev.c (5717)
 netdev_master_upper_dev_get+0x5f/0x70
 if_nlmsg_size+0x158/0x240
 rtnl_calcit.isra.26+0xa3/0xf0

rtnl_link_get_slave_info_data_size currently assumes RTNL protection, but
there appears to be no hard requirement for this, so use rcu instead.

At the time of this writing, there are three 'get_slave_size' callbacks
(now invoked under rcu): bond_get_slave_size, vrf_get_slave_size and
br_port_get_slave_size. All of them only return a constant (i.e. they don't sleep).

Fixes: 6853dd488119 ("rtnetlink: protect handler table with rcu")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: do not use RTM_GETLINK directly
Florian Westphal [Thu, 10 Aug 2017 14:52:58 +0000 (16:52 +0200)]
rtnetlink: do not use RTM_GETLINK directly

Userspace sends the RTM_GETLINK type, but the kernel subtracts
RTM_BASE from this, i.e. 'type' doesn't contain RTM_GETLINK
anymore but instead RTM_GETLINK - RTM_BASE.

This caused the calcit callback to not be invoked when it
should have been (and vice versa).

While at it, also fix an off-by-one when checking the family index
against the handler array size.

Fixes: e1fa6d216dd ("rtnetlink: call rtnl_calcit directly")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
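
Both bugs described above are index arithmetic mistakes and can be reproduced with plain integers. The Python sketch below uses the real values RTM_BASE = 16 and RTM_GETLINK = 18 from the rtnetlink uapi header; the function names and the table size used in the example are hypothetical.

```python
RTM_BASE = 16
RTM_GETLINK = 18


def wants_calcit_buggy(nlmsg_type):
    """Compare the already-shifted index with the unshifted constant."""
    msg_index = nlmsg_type - RTM_BASE
    return msg_index == RTM_GETLINK


def wants_calcit_fixed(nlmsg_type):
    """Compare like with like: both sides have RTM_BASE subtracted."""
    msg_index = nlmsg_type - RTM_BASE
    return msg_index == RTM_GETLINK - RTM_BASE


def family_in_range(family, table_size):
    """A table with table_size entries has valid indexes 0..table_size-1."""
    return 0 <= family < table_size
```

The buggy comparison misses a real RTM_GETLINK request (18 - 16 = 2, which is not 18) and fires for message type 34 instead, which is the "and vice versa" part of the description.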
7 years ago rtnetlink: use rcu_dereference_raw to silence rcu splat
Florian Westphal [Thu, 10 Aug 2017 14:52:57 +0000 (16:52 +0200)]
rtnetlink: use rcu_dereference_raw to silence rcu splat

Ido reports an rcu splat in __rtnl_register.
The splat is correct, as rtnl_register doesn't grab any locks
and doesn't use rcu locks either. It has always been like this.
Handler families are not registered in parallel so there are no
races wrt. the kmalloc ordering.

The only reason to use rcu_dereference in the first place was to
keep sparse from complaining about this.

Thus this switches to _raw() to not have rcu checks here.

The alternative is to add rtnl locking to register/unregister,
however, I don't see a compelling reason to do so as this has been
lockless for the past twenty years or so.

Fixes: 6853dd4881 ("rtnetlink: protect handler table with rcu")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
Linus Torvalds [Thu, 10 Aug 2017 16:36:06 +0000 (09:36 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc updates from David Miller:

 1) Recognize M8 cpus, just basic chip ID matching, from Allen Pais.

 2) Prevent crashes when bringing up sunvdc virtual block devices in
    some environments. From Jim Quigley.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain
  sparc64: Increase max_phys_bits to 51 and VA bits to 53 for M8.
  sparc64: recognize and support sparc M8 cpu type
  sparc64: properly name the cpu constants

7 years ago net: core: fix compile error inside flow_dissector due to new dsa callback
John Crispin [Thu, 10 Aug 2017 08:09:03 +0000 (10:09 +0200)]
net: core: fix compile error inside flow_dissector due to new dsa callback

The following error was introduced by
commit 43e665287f93 ("net-next: dsa: fix flow dissection")
due to a missing #if guard

net/core/flow_dissector.c: In function '__skb_flow_dissect':
net/core/flow_dissector.c:448:18: error: 'struct net_device' has no member named 'dsa_ptr'
ops = skb->dev->dsa_ptr->tag_ops;
                ^
make[3]: *** [net/core/flow_dissector.o] Error 1

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'dsa-flow-dissection'
David S. Miller [Thu, 10 Aug 2017 05:51:47 +0000 (22:51 -0700)]
Merge branch 'dsa-flow-dissection'

John Crispin says:

====================
net-next: dsa: fix flow dissection

RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol, an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.

Changes since RFC:
* use a callback instead of static values
* add cover letter
====================

Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net-next: dsa: fix flow dissection
John Crispin [Wed, 9 Aug 2017 12:41:19 +0000 (14:41 +0200)]
net-next: dsa: fix flow dissection

RPS and probably other kernel features are currently broken on some if not
all DSA devices. The root cause of this is that skb_hash will call the
flow_dissector. At this point the skb still contains the magic switch
header and the skb->protocol field is not set up to the correct 802.3
value yet. By the time the tag specific code is called, removing the header
and properly setting the protocol, an invalid hash is already set. In the
case of the mt7530 this will result in all flows always having the same
hash.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net-next: tag_mtk: add flow_dissect callback to the ops struct
John Crispin [Wed, 9 Aug 2017 12:41:18 +0000 (14:41 +0200)]
net-next: tag_mtk: add flow_dissect callback to the ops struct

The MT7530 inserts its 4-byte magic header in between the 802.3 address and
protocol fields. The patch implements the callback that can be called by
the flow dissector to figure out the real protocol and offset of the
network header. With this patch applied we can properly parse the packet
and thus make hashing work properly.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
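
The frame layout described above can be illustrated by parsing a hand-built frame. The Python sketch below is a model of the idea, not the tag_mtk callback: the function names are hypothetical, the tag is assumed to be 4 bytes wide, and the tag contents in the example are placeholder bytes rather than a real MT7530 tag.

```python
import struct

ETH_ALEN = 6      # length of one MAC address
TAG_LEN = 4       # assumed width of the switch tag
ETH_P_IP = 0x0800


def dissect_naive(frame):
    """Read the protocol where an untagged frame would carry it."""
    (proto,) = struct.unpack_from("!H", frame, 2 * ETH_ALEN)
    return proto, 2 * ETH_ALEN + 2


def dissect_tagged(frame):
    """Skip the tag that sits between the MAC addresses and the protocol."""
    proto_offset = 2 * ETH_ALEN + TAG_LEN
    (proto,) = struct.unpack_from("!H", frame, proto_offset)
    return proto, proto_offset + 2


def build_tagged_frame(tag, proto, payload):
    """Build dst MAC, src MAC, tag, protocol, payload (all placeholders)."""
    dst = bytes(ETH_ALEN)
    src = bytes([0x02, 0, 0, 0, 0, 0x01])
    return dst + src + tag + struct.pack("!H", proto) + payload
```

The naive parser reads the first two tag bytes as the protocol, so every flow appears to carry the same unknown protocol and hashes identically; the tag-aware parser recovers the real protocol and the offset of the network header.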
7 years ago net-next: dsa: add flow_dissect callback to struct dsa_device_ops
John Crispin [Wed, 9 Aug 2017 12:41:17 +0000 (14:41 +0200)]
net-next: dsa: add flow_dissect callback to struct dsa_device_ops

When the flow dissector first sees packets coming in on a DSA device, the
802.3 header won't be located where the code expects it to be, as the tag
is still present. Adding this new callback allows a DSA device to provide a
new function that the flow_dissector can use to get the correct protocol
and offset of the network header.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net-next: dsa: move struct dsa_device_ops to the global header file
John Crispin [Wed, 9 Aug 2017 12:41:16 +0000 (14:41 +0200)]
net-next: dsa: move struct dsa_device_ops to the global header file

We need to access this struct from within the flow_dissector to fix
dissection for packets coming in on DSA devices.

Signed-off-by: Muciri Gatimu <muciri@openmesh.com>
Signed-off-by: Shashidhar Lakkavalli <shashidhar.lakkavalli@openmesh.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target
Xin Long [Wed, 9 Aug 2017 10:15:19 +0000 (18:15 +0800)]
net: sched: set xt_tgchk_param par.nft_compat as 0 in ipt_init_target

Commit 55917a21d0cc ("netfilter: x_tables: add context to know if
extension runs from nft_compat") introduced a member nft_compat to
xt_tgchk_param structure.

But it didn't set its value for ipt_init_target. With an unexpected
value in par.nft_compat, it may return an unexpected result in some
target's checkentry.

This patch sets all its fields to 0 and initializes only the
non-zero fields in ipt_init_target.

v1->v2:
  Following Wang Cong's suggestion, fix it by setting all its fields to
  0 and only initializing the non-zero fields.

Fixes: 55917a21d0cc ("netfilter: x_tables: add context to know if extension runs from nft_compat")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago igmp: Fix regression caused by igmp sysctl namespace code.
Nikolay Borisov [Wed, 9 Aug 2017 11:38:04 +0000 (14:38 +0300)]
igmp: Fix regression caused by igmp sysctl namespace code.

Commit dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
moved the igmp sysctls initialization from tcp_sk_init to igmp_net_init. This
function is called as part of per-namespace initialization, but only if
CONFIG_IP_MULTICAST is defined. Otherwise the igmp_mc_init() call in ip_init is
compiled out, causing the igmp pernet ops to not be registered and those sysctls
to be left initialized with 0. However, there are certain functions, such as
ip_mc_join_group which are always compiled and make use of some of those
sysctls. Let's do a partial revert of the aforementioned commit and move the
sysctl initialization into inet_init_net, that way they will always have
sane values.

Fixes: dcd87999d415 ("igmp: net: Move igmp namespace init to correct file")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=196595
Reported-by: Gerardo Exequiel Pozzi <vmlinuz386@gmail.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'mediatek-bring-up-QDMA-RX-ring-0'
David S. Miller [Thu, 10 Aug 2017 05:45:36 +0000 (22:45 -0700)]
Merge branch 'mediatek-bring-up-QDMA-RX-ring-0'

John Crispin says:

====================
net-next: mediatek: bring up QDMA RX ring 0

The MT7623 has several DMA rings. Inside the SW path, the core will use
the PDMA when receiving traffic. While bringing up the HW path we noticed
that the PPE requires the QDMA RX to also be brought up as it uses this
ring internally for its flow scheduling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net-next: mediatek: bring up QDMA RX ring 0
John Crispin [Wed, 9 Aug 2017 10:09:32 +0000 (12:09 +0200)]
net-next: mediatek: bring up QDMA RX ring 0

This patch is in preparation for adding HW flow and QoS offloading. For
those features to work, the driver needs to bring up the first QDMA RX
ring. This ring is used by the PPE offloading HW.

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net-next: mediatek: fix typos inside the header file
John Crispin [Wed, 9 Aug 2017 10:09:31 +0000 (12:09 +0200)]
net-next: mediatek: fix typos inside the header file

Trivial patch fixing 2 typos.

Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: atm: make atmdev_ops const
Bhumika Goyal [Wed, 9 Aug 2017 09:32:08 +0000 (15:02 +0530)]
net: atm: make atmdev_ops const

Make these const as they are only stored in the ops field of an atm_dev
structure, which is const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago atm: make atmdev_ops const
Bhumika Goyal [Wed, 9 Aug 2017 09:19:15 +0000 (14:49 +0530)]
atm: make atmdev_ops const

Make these structures const as they are either passed to the function
atm_dev_register having the corresponding argument as const or stored in
the ops field of an atm_dev structure, which is also const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: dsa: make dsa_switch_ops const
Bhumika Goyal [Wed, 9 Aug 2017 05:04:15 +0000 (10:34 +0530)]
net: dsa: make dsa_switch_ops const

Make these structures const as they are only stored in the ops field of
a dsa_switch structure, which is const.
Done using Coccinelle.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago liquidio: napi cleanup
Intiyaz Basha [Wed, 9 Aug 2017 02:34:28 +0000 (19:34 -0700)]
liquidio: napi cleanup

Disable napi when interface is going down.
Delete napi when destroying the interface.

Signed-off-by: Intiyaz Basha <intiyaz.basha@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago geneve: maximum value of VNI cannot be used
Girish Moodalbail [Wed, 9 Aug 2017 00:26:24 +0000 (17:26 -0700)]
geneve: maximum value of VNI cannot be used

Geneve's Virtual Network Identifier (VNI) is 24 bits long, so the range
of values for it would be from 0 to 16777215 (2^24 - 1).  However, one
cannot create a geneve device with VNI set to 16777215. This patch fixes
this issue.
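
The commit message does not show the faulty comparison, so the following is only a sketch of the intended inclusive range. The names `GENEVE_MAX_VNI`, `vni_is_valid` and `vni_to_bytes` are hypothetical and chosen for this example.

```python
GENEVE_VNI_BITS = 24
GENEVE_MAX_VNI = (1 << GENEVE_VNI_BITS) - 1  # 16777215, the largest 24-bit value


def vni_is_valid(vni: int) -> bool:
    """Accept every value that fits in 24 bits, including the maximum."""
    return 0 <= vni <= GENEVE_MAX_VNI


def vni_to_bytes(vni: int) -> bytes:
    """Encode a VNI as the 3 bytes it occupies in the header."""
    if not vni_is_valid(vni):
        raise ValueError("VNI out of range: %d" % vni)
    return vni.to_bytes(3, "big")
```

An exclusive upper bound written against the maximum value itself (for example `vni < GENEVE_MAX_VNI`) would reject 16777215, which matches the symptom described.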

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: ipv6: lower ndisc notifier priority below addrconf
David Ahern [Tue, 8 Aug 2017 21:51:02 +0000 (15:51 -0600)]
net: ipv6: lower ndisc notifier priority below addrconf

ndisc_notify is used to send unsolicited neighbor advertisements
(e.g., on a link up). Currently, the ndisc notifier is run before the
addrconf notifier, which means NAs are not sent for link-local addresses
which are added by the addrconf notifier.

Fix by lowering the priority of the ndisc notifier. Setting the priority
to ADDRCONF_NOTIFY_PRIORITY - 5 means it runs after addrconf and before
the route notifier which is ADDRCONF_NOTIFY_PRIORITY - 10.
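
The ordering can be modelled with a small sketch in which notifiers are called in descending priority order. The base value chosen for `ADDRCONF_NOTIFY_PRIORITY` and the `run_order` helper are assumptions for illustration only; just the relative offsets (-5 and -10) come from the text above.

```python
ADDRCONF_NOTIFY_PRIORITY = 0  # assumed base value, only the offsets matter here


def run_order(notifiers):
    """Return notifier names in the order a priority-sorted chain calls them.

    Higher priority runs first; ties keep their registration order.
    """
    ordered = sorted(notifiers, key=lambda entry: entry[1], reverse=True)
    return [name for name, _prio in ordered]


chain = [
    ("route", ADDRCONF_NOTIFY_PRIORITY - 10),
    ("ndisc", ADDRCONF_NOTIFY_PRIORITY - 5),
    ("addrconf", ADDRCONF_NOTIFY_PRIORITY),
]
order = run_order(chain)
```

With these priorities the addrconf notifier adds the link-local address first, ndisc then sends the advertisement, and the route notifier runs last.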

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: systemport: Fix software statistics for SYSTEMPORT Lite
Florian Fainelli [Tue, 8 Aug 2017 21:45:09 +0000 (14:45 -0700)]
net: systemport: Fix software statistics for SYSTEMPORT Lite

With SYSTEMPORT Lite we have holes in our statistics layout that make us
skip over the hardware MIB counters. bcm_sysport_get_stats() was not
taking that into account, resulting in reporting 0 for all SW-maintained
statistics. Fix this by skipping accordingly.

Fixes: 44a4524c54af ("net: systemport: Add support for SYSTEMPORT Lite")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago tipc: remove premature ESTABLISH FSM event at link synchronization
Jon Paul Maloy [Tue, 8 Aug 2017 20:23:56 +0000 (22:23 +0200)]
tipc: remove premature ESTABLISH FSM event at link synchronization

When a link between two nodes comes up, both endpoints will initially
send out a STATE message to the peer, to increase the probability that
the peer endpoint also is up when the first traffic message arrives.
Thereafter, if the establishing link is the second link between two
nodes, this first "traffic" message is a TUNNEL_PROTOCOL/SYNCH message,
helping the peer to perform initial synchronization between the two
links.

However, the initial STATE message may be lost, in which case the SYNCH
message will be the first one arriving at the peer. This should also
work, as the SYNCH message itself will be used to take up the link
endpoint before initializing synchronization.

Unfortunately the code for this case is broken. Currently, the link is
brought up through a tipc_link_fsm_evt(ESTABLISHED) when a SYNCH
arrives, whereupon __tipc_node_link_up() is called to distribute the
link slots and take the link into traffic. But __tipc_node_link_up()
itself starts with a test for whether the link is up and, if true,
returns without action. Clearly, the tipc_link_fsm_evt(ESTABLISHED) call
is unnecessary, since tipc_node_link_up() is itself issuing such an
event, but also harmful, since it prevents tipc_node_link_up() from
performing the rest of its tasks, and the link endpoint in question hence
is never taken into traffic.

This problem has been exposed when we set up dual links between pre-
and post-4.4 kernels, because the former ones don't send out the
initial STATE message described above.

We fix this by removing the unnecessary event call.

Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago ibmvnic: Correct 'unused variable' warning in build.
Nathan Fontenot [Tue, 8 Aug 2017 20:26:18 +0000 (15:26 -0500)]
ibmvnic: Correct 'unused variable' warning in build.

Commit a248878d7a1d ("ibmvnic: Check for transport event on driver resume")
removed the loop to kick irqs on driver resume but didn't remove the now
unused loop variable 'i'.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago ibmvnic: Add netdev_dbg output for debugging
Nathan Fontenot [Tue, 8 Aug 2017 20:24:05 +0000 (15:24 -0500)]
ibmvnic: Add netdev_dbg output for debugging

To ease debugging of the ibmvnic driver add a series of netdev_dbg()
statements to track driver status, especially during initialization,
removal, and resetting of the driver.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago ibmvnic: Clean up resources on probe failure
Nathan Fontenot [Tue, 8 Aug 2017 19:28:45 +0000 (14:28 -0500)]
ibmvnic: Clean up resources on probe failure

Ensure that any resources allocated during probe are released if the
probe of the driver fails.

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain
Jim Quigley [Fri, 21 Jul 2017 13:20:15 +0000 (09:20 -0400)]
sunvdc: prevent sunvdc panic when mpgroup disk added to guest domain

Using mpgroup to define multiple paths for a virtual disk causes multiple
virtual-device-port ports to be created for that virtual device.
Each virtual-device-port port then gets a vdisk created for it by the Linux
sunvdc driver. As mpgroup is not supported by the Linux sunvdc driver it
cannot handle multiple ports for a single vdisk, leading to a kernel panic
at startup.

This fix prevents more than one vdisk per virtual-device-port being created
until full virtual disk multipathing (mpgroup) support is implemented.

Signed-off-by: Jim Quigley <Jim.Quigley@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'rtnetlink-allow-selected-handlers-to-run-without-rtnl'
David S. Miller [Wed, 9 Aug 2017 23:57:38 +0000 (16:57 -0700)]
Merge branch 'rtnetlink-allow-selected-handlers-to-run-without-rtnl'

Florian Westphal says:

====================
rtnetlink: allow selected handlers to run without rtnl

Changes since v1:
 In patch 6, don't make ipv6 route handlers lockless, they all have
 assumptions on rtnl being held.  Other patches are unchanged.

The RTNL mutex is used to serialize both rtnetlink calls and
dump requests.
It's also used to protect other things such as the list of current
net namespaces.

Unfortunately RTNL mutex is a performance issue, e.g. a cpu adding an
ip address prevents other cpus from seemingly unrelated tasks such as
dumping tc classifiers or doing rtnetlink route lookups.

This patch set adds basic infrastructure to start pushing the rtnl lock
down to those places that need it, or even elide it entirely in some cases.

Subsystems can now indicate that their doit() callback can run without
RTNL mutex; such callbacks can then run in parallel.

This will obviously need a lot of followup work; all current
users need to be audited/changed to benefit from this.
Initial no-rtnl spot is netns new/getid.

We have various 'get' handlers that are also a tempting target.
However, several of these depend on rtnl mutex to prevent information
from changing while objects are being read by rtnl handlers, although
it doesn't appear impossible to change this.

Dumps are another problem entirely, see
commit 2907c35ff64708065 ("net: hold rtnl again in dump callbacks"),
this patchset doesn't touch dump requests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago net: call newid/getid without rtnl mutex held
Florian Westphal [Wed, 9 Aug 2017 18:41:53 +0000 (20:41 +0200)]
net: call newid/getid without rtnl mutex held

Both functions take nsid_lock and don't rely on rtnl lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED
Florian Westphal [Wed, 9 Aug 2017 18:41:52 +0000 (20:41 +0200)]
rtnetlink: add RTNL_FLAG_DOIT_UNLOCKED

Allow callers to tell rtnetlink core that its doit callback
should be invoked without holding rtnl mutex.
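
A simplified model of such a dispatch, offered only as a sketch: the handler table, the `rtnl_register` signature, the numeric flag value and the `rtnetlink_rcv_msg` body below are assumptions for illustration and do not mirror the kernel's actual data structures.

```python
import threading

RTNL_FLAG_DOIT_UNLOCKED = 1  # illustrative flag value

rtnl_mutex = threading.Lock()   # stands in for the rtnl mutex
handlers = {}                   # msgtype -> (doit, flags)


def rtnl_register(msgtype, doit, flags=0):
    """Record a doit callback together with its flags."""
    handlers[msgtype] = (doit, flags)


def rtnetlink_rcv_msg(msgtype, payload):
    """Invoke the handler, taking the mutex only when the flag is absent."""
    doit, flags = handlers[msgtype]
    if flags & RTNL_FLAG_DOIT_UNLOCKED:
        return doit(payload)
    with rtnl_mutex:
        return doit(payload)


# Each handler simply reports whether the mutex is held while it runs.
rtnl_register("GETNSID", lambda _p: rtnl_mutex.locked(), RTNL_FLAG_DOIT_UNLOCKED)
rtnl_register("NEWADDR", lambda _p: rtnl_mutex.locked())
```

Handlers registered with the flag must bring their own protection, as the newid/getid handlers do by taking nsid_lock.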

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: protect handler table with rcu
Florian Westphal [Wed, 9 Aug 2017 18:41:51 +0000 (20:41 +0200)]
rtnetlink: protect handler table with rcu

Note that netlink dumps still acquire rtnl mutex via the netlink
dump infrastructure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: small rtnl lock pushdown
Florian Westphal [Wed, 9 Aug 2017 18:41:50 +0000 (20:41 +0200)]
rtnetlink: small rtnl lock pushdown

Instead of rtnl lock/unlock at the top level, push it down
to the called function.

This is just an intermediate step, next commit switches protection
of the rtnl_link ops table to rcu, in which case (for dumps) the
rtnl lock is acquired only by the netlink dumper infrastructure
(current lock/unlock/dump/lock/unlock rtnl sequence becomes
 rcu lock/rcu unlock/dump).

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: add reference counting to prevent module unload while dump is in progress
Florian Westphal [Wed, 9 Aug 2017 18:41:49 +0000 (20:41 +0200)]
rtnetlink: add reference counting to prevent module unload while dump is in progress

I don't see what prevents rmmod (unregister_all is called) while a dump
is active.

Even if we'd add rtnl lock/unlock pair to unregister_all (as done here),
that's not enough either as rtnl_lock is released right before the dump
process starts.

So this adds a refcount:
 * acquire rtnl mutex
 * bump refcount
 * release mutex
 * start the dump

... and make unregister_all remove the callbacks (no new dumps possible)
and then wait until refcount is 0.
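
The sequence above can be sketched as a small user-space model. This is not the kernel implementation: the `DumpRegistry` class is hypothetical, and waiting on a condition variable is a substitute chosen for this example, since the message does not say how the wait is done.

```python
import threading


class DumpRegistry:
    """Toy model: callbacks guarded by a mutex plus an in-flight dump count."""

    def __init__(self):
        self.lock = threading.Lock()              # stands in for the rtnl mutex
        self.cond = threading.Condition(self.lock)
        self.callbacks = {}
        self.refcount = 0

    def register(self, name, dumpit):
        with self.lock:
            self.callbacks[name] = dumpit

    def dump(self, name):
        with self.lock:                           # acquire mutex
            dumpit = self.callbacks.get(name)
            if dumpit is None:
                return None                       # handler already removed
            self.refcount += 1                    # bump refcount
        # mutex released here; the dump itself runs without it
        try:
            return dumpit()
        finally:
            with self.cond:
                self.refcount -= 1
                self.cond.notify_all()

    def unregister_all(self):
        with self.cond:
            self.callbacks.clear()                # no new dumps possible
            while self.refcount > 0:              # wait for running dumps
                self.cond.wait()
```

Because the callbacks are removed before waiting, the count can only go down once `unregister_all` has started, so the wait terminates when the last running dump finishes.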

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: make rtnl_register accept a flags parameter
Florian Westphal [Wed, 9 Aug 2017 18:41:48 +0000 (20:41 +0200)]
rtnetlink: make rtnl_register accept a flags parameter

This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago rtnetlink: call rtnl_calcit directly
Florian Westphal [Wed, 9 Aug 2017 18:41:47 +0000 (20:41 +0200)]
rtnetlink: call rtnl_calcit directly

There is only a single place in the kernel that registers the "calcit"
callback (to determine min allocation for dumps).

This is in rtnetlink.c for PF_UNSPEC RTM_GETLINK.
The function that checks for calcit presence at run time will first check
the requested family (which will always fail for !PF_UNSPEC as no callsite
registers this), then falls back to checking PF_UNSPEC.

Therefore we can just check if type is RTM_GETLINK and then do a direct
call.  Because of fallback to PF_UNSPEC all RTM_GETLINK types used this
regardless of family.

This has the advantage that we don't need to allocate space for
the function pointer for all the other families.

A followup patch will drop the calcit function pointer from the
rtnl_link callback structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'bpf-new-branches'
David S. Miller [Wed, 9 Aug 2017 23:53:57 +0000 (16:53 -0700)]
Merge branch 'bpf-new-branches'

Daniel Borkmann says:

====================
bpf: Add BPF_J{LT,LE,SLT,SLE} instructions

This set adds BPF_J{LT,LE,SLT,SLE} instructions to the BPF
insn set, interpreter, JIT hardening code and all JITs are
also updated to support the new instructions. Basic idea is
to reduce register pressure by avoiding BPF_J{GT,GE,SGT,SGE}
rewrites. Removing the workaround for the rewrites in LLVM,
this can result in shorter BPF programs, less stack usage
and less verification complexity. First patch provides some
more details on rationale and integration.

Thanks a lot!

v1 -> v2:
  - Reworded commit msg in patch 1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions
Daniel Borkmann [Wed, 9 Aug 2017 23:40:03 +0000 (01:40 +0200)]
bpf: add test cases for new BPF_J{LT, LE, SLT, SLE} instructions

Add test cases to the verifier selftest suite in order to verify that
i) direct packet access, and ii) dynamic map value access is working
with the changes related to the new instructions.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier
Daniel Borkmann [Wed, 9 Aug 2017 23:40:02 +0000 (01:40 +0200)]
bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier

Enable the newly added jump opcodes, main parts are in two
different areas, namely direct packet access and dynamic map
value access. For the direct packet access, we now allow for
the following two new patterns to match in order to trigger
markings with find_good_pkt_pointers():

Variant 1 (access ok when taking the branch):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (ad) if r0 < r3 goto pc+2
  R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
  R3=pkt_end R10=fp
  5: (b7) r0 = 0
  6: (95) exit

  from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
               R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
  7: (71) r0 = *(u8 *)(r2 +0)
  8: (05) goto pc-4
  5: (b7) r0 = 0
  6: (95) exit
  processed 11 insns, stack depth 0

Variant 2 (access ok on fall-through):

  0: (61) r2 = *(u32 *)(r1 +76)
  1: (61) r3 = *(u32 *)(r1 +80)
  2: (bf) r0 = r2
  3: (07) r0 += 8
  4: (bd) if r3 <= r0 goto pc+1
  R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
  R3=pkt_end R10=fp
  5: (71) r0 = *(u8 *)(r2 +0)
  6: (b7) r0 = 1
  7: (95) exit

  from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
               R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
  6: (b7) r0 = 1
  7: (95) exit
  processed 10 insns, stack depth 0

The above two basically just swap the branches where we need
to handle an exception and allow packet access compared to the
two already existing variants for find_good_pkt_pointers().
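
The branch selection can be sanity-checked with a brute-force model. The table below is a sketch: the two entries marked "existing" are an assumption about what the already existing variants look like, and `SAFE_BRANCH` and `safe_branch_is_sound` are names made up for this example.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# (dst, op, src) -> branch on which the bytes [pkt, pkt + off) are in bounds.
# "pkt_off" stands for a register holding pkt + off.
SAFE_BRANCH = {
    ("pkt_off", ">", "pkt_end"): "fall-through",   # existing (assumed)
    ("pkt_end", ">=", "pkt_off"): "taken",         # existing (assumed)
    ("pkt_off", "<", "pkt_end"): "taken",          # new, variant 1
    ("pkt_end", "<=", "pkt_off"): "fall-through",  # new, variant 2
}


def safe_branch_is_sound(dst, op, src, off=8, max_len=32):
    """Check that pkt + off <= pkt_end holds whenever the safe branch runs.

    The packet starts at address 0, and every pkt_end below max_len is tried.
    """
    safe = SAFE_BRANCH[(dst, op, src)]
    for pkt_end in range(max_len):
        values = {"pkt_off": off, "pkt_end": pkt_end}
        taken = OPS[op](values[dst], values[src])
        branch = "taken" if taken else "fall-through"
        if branch == safe and off > pkt_end:
            return False
    return True
```

For every pattern the access is only marked valid on the branch where the comparison proves pkt + off does not exceed pkt_end, which is why the new opcodes just swap which side handles the exception.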

For the dynamic map value access, we add the new instructions
to reg_set_min_max() and reg_set_min_max_inv() in order to
learn bounds. Verifier test cases for both are added in a
follow-up patch.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, nfp: implement jiting of BPF_J{LT,LE}
Daniel Borkmann [Wed, 9 Aug 2017 23:40:01 +0000 (01:40 +0200)]
bpf, nfp: implement jiting of BPF_J{LT,LE}

This work implements jiting of BPF_J{LT,LE} instructions with
BPF_X/BPF_K variants for the nfp eBPF JIT. The two BPF_J{SLT,SLE}
instructions have not been added yet given BPF_J{SGT,SGE} are
not supported yet either.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, ppc64: implement jiting of BPF_J{LT, LE, SLT, SLE}
Daniel Borkmann [Wed, 9 Aug 2017 23:40:00 +0000 (01:40 +0200)]
bpf, ppc64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the ppc64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Tested-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, s390x: implement jiting of BPF_J{LT, LE, SLT, SLE}
Daniel Borkmann [Wed, 9 Aug 2017 23:39:59 +0000 (01:39 +0200)]
bpf, s390x: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the s390x eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}
Daniel Borkmann [Wed, 9 Aug 2017 23:39:58 +0000 (01:39 +0200)]
bpf, sparc64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the sparc64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, arm64: implement jiting of BPF_J{LT, LE, SLT, SLE}
Daniel Borkmann [Wed, 9 Aug 2017 23:39:57 +0000 (01:39 +0200)]
bpf, arm64: implement jiting of BPF_J{LT, LE, SLT, SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the arm64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}
Daniel Borkmann [Wed, 9 Aug 2017 23:39:56 +0000 (01:39 +0200)]
bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE}

This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions
with BPF_X/BPF_K variants for the x86_64 eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago bpf: add BPF_J{LT,LE,SLT,SLE} instructions
Daniel Borkmann [Wed, 9 Aug 2017 23:39:55 +0000 (01:39 +0200)]
bpf: add BPF_J{LT,LE,SLT,SLE} instructions

Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
BPF_JSGT (s>), BPF_JSGE (s>=) instructions, this means that
particularly *JLT/*JLE counterparts involving immediates need
to be rewritten from e.g. X < [IMM] by swapping arguments into
[IMM] > X, meaning the immediate first is required to be loaded
into a register Y := [IMM], such that then we can compare with
Y > X. Note that the destination operand is always required to
be a register.
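
The rewrite can be illustrated with a toy lowering. The instruction tuples, the scratch register name and both helper functions are hypothetical and only model the argument swap described above, using non-negative values so that unsigned comparison is plain integer comparison.

```python
def lower_less_than_imm(x_reg, imm, have_jlt, scratch_reg="r9"):
    """Emit a toy instruction list for 'if X < IMM goto ...'."""
    if have_jlt:
        return [("jlt_imm", x_reg, imm)]        # X < IMM, one instruction
    return [
        ("mov_imm", scratch_reg, imm),          # Y := IMM, costs a register
        ("jgt_reg", scratch_reg, x_reg),        # Y > X
    ]


def branch_taken(insns, regs):
    """Interpret the toy instructions and report whether the jump is taken."""
    regs = dict(regs)
    for op, a, b in insns:
        if op == "mov_imm":
            regs[a] = b
        elif op == "jlt_imm":
            return regs[a] < b
        elif op == "jgt_reg":
            return regs[a] > regs[b]
    raise ValueError("no jump instruction")


old = lower_less_than_imm("r1", 10, have_jlt=False)
new = lower_less_than_imm("r1", 10, have_jlt=True)
```

Both sequences decide the branch identically, but the old form needs one more instruction and one more live register.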

This has the downside of having unnecessarily increased register
pressure, meaning complex programs would need to spill other
registers temporarily to stack in order to obtain an unused
register for the [IMM]. Loading to registers will thus also
affect state pruning since we need to account for that register
use and potentially those registers that had to be spilled/filled
again. As a consequence slightly more stack space might have
been used due to spilling, and BPF programs are a bit longer
due to extra code involving the register load and potentially
required spill/fills.

Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
counterparts to the eBPF instruction set. Modifying LLVM to
remove the NegateCC() workaround in a PoC patch at [1] and
allowing it to also emit the new instructions resulted in
cilium's BPF programs that are injected into the fast-path to
have a reduced program length in the range of 2-3% (e.g.
accumulated main and tail call sections from one of the object
files reduced from 4864 to 4729 insns), reduced complexity in
the range of 10-30% (e.g. accumulated sections reduced in one
of the cases from 116432 to 88428 insns), and reduced stack
usage in the range of 1-5% (e.g. accumulated sections from one
of the object files reduced from 824 to 784b).

The modification for LLVM will be incorporated in a backwards
compatible way. Plan is for LLVM to have i) a target specific
option to offer a possibility to explicitly enable the extension
by the user (as we have with -m target specific extensions today
for various CPU insns), and ii) have the kernel checked for
presence of the extensions and enable them transparently when
the user is selecting more aggressive options such as -march=native
in a bpf target context. (Other frontends generating BPF byte
code, e.g. ply can probe the kernel directly for its code
generation.)

  [1] https://github.com/borkmann/llvm/tree/bpf-insns

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch 'net-zerocopy-fixes'
David S. Miller [Wed, 9 Aug 2017 23:49:17 +0000 (16:49 -0700)]
Merge branch 'net-zerocopy-fixes'

Willem de Bruijn says:

====================
net: zerocopy fixes

Fix two issues introduced in the msg_zerocopy patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sock: fix zerocopy_success regression with msg_zerocopy
Willem de Bruijn [Wed, 9 Aug 2017 23:09:44 +0000 (19:09 -0400)]
sock: fix zerocopy_success regression with msg_zerocopy

Do not use uarg->zerocopy outside msg_zerocopy. In other paths the
field is not explicitly initialized and aliases another field.

Those paths have only one reference so do not need this intermediate
variable. Call uarg->callback directly.

Fixes: 1f8b977ab32d ("sock: enable MSG_ZEROCOPY")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago sock: fix zerocopy panic in mem accounting
Willem de Bruijn [Wed, 9 Aug 2017 23:09:43 +0000 (19:09 -0400)]
sock: fix zerocopy panic in mem accounting

Only call mm_unaccount_pinned_pages when releasing a struct ubuf_info
that has initialized its field uarg->mmp.

Before this patch, a vhost-net with experimental_zcopytx can crash in

  mm_unaccount_pinned_pages
  sock_zerocopy_put
  skb_zcopy_clear
  skb_release_data

Only sock_zerocopy_alloc initializes this field. Move the unaccount
call from generic sock_zerocopy_put to its specific callback
sock_zerocopy_callback.

Fixes: a91dbff551a6 ("sock: ulimit on MSG_ZEROCOPY pages")
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Wed, 9 Aug 2017 23:47:19 +0000 (16:47 -0700)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
1GbE Intel Wired LAN Driver Updates 2017-08-08

This series contains updates to e1000e and igb/igbvf.

Gangfeng Huang fixes an issue with receive network flow classification,
where igb_nfc_filter_exit() was not being called in igb_down() which
would cause the filter tables to "fill up" if a user were to change
the adapter settings (such as speed) which requires a reset of the
adapter.

Cliff Spradlin fixes a timestamping issue, where igb was allowing requests
for hardware timestamping even if it was not configured for hardware
transmit timestamping.

Corinna Vinschen removes the error message that there was an "unexpected
SYS WRAP", when it is actually expected.  So remove the message to not
confuse users.

Greg Edwards provides several patches for the mailbox interface between
the PF and VF drivers.  Added a mailbox unlock method to be used to unlock
the PF/VF mailbox by the PF.  Added a lock around the VF mailbox ops to
prevent the VF from sending another message while the PF is still
processing the previous message.  Fixed a "scheduling while atomic" issue
by changing msleep() to mdelay().

Sasha adds support for the next LOM generations i219 (v8 & v9) which
will be available in the next Intel client platform IceLake.

John Linville adds support for a Broadcom PHY to the igb driver, since
there are designs out in the world which use the igb MAC and a third
party PHY.  This allows the driver to load and function as expected on
these designs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Wed, 9 Aug 2017 23:28:45 +0000 (16:28 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago Merge tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 9 Aug 2017 21:30:34 +0000 (14:30 -0700)]
Merge tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
 "These are the pin control fixes I have gathered since the return from
  my vacation. They boiled in -next a while so let's get them in.

  Apart from the documentation build it is purely driver fixes. Which is
  nice. The Intel fixes seem kind of important.

   - Fix the documentation build as the docs were moved

   - Correct the UART pin list on the Intel Merrifield

   - Fix pin assignment and number of pins on the Marvell Armada 37xx
     pin controller

   - Cover the Setzer models in the Chromebook DMI quirk in the Intel
     cherryview driver so they start working

   - Add the missing "sim" function to the sunxi driver

   - Fix USB pin definitions on Uniphier Pro4

   - Smatch fix for invalid reference in the zx pin control driver"

* tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: generic: update references to Documentation/pinctrl.txt
  pinctrl: intel: merrifield: Correct UART pin lists
  pinctrl: armada-37xx: Fix number of pin in south bridge
  pinctrl: armada-37xx: Fix the pin 23 on south bridge
  pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk
  pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver
  pinctrl: uniphier: fix USB3 pin assignment for Pro4
  pinctrl: zte: fix dereference of 'data' in zx_set_mux()

Mel Gorman [Wed, 9 Aug 2017 07:27:11 +0000 (08:27 +0100)]
futex: Remove unnecessary warning from get_futex_key

Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in
get_futex_key()") removed an unnecessary lock_page() with the
side-effect that page->mapping needed to be treated very carefully.

Two defensive warnings were added in case any assumption was missed and
the first warning assumed a correct application would not alter a
mapping backing a futex key.  Since merging, it has not triggered for
any unexpected case but Mark Rutland reported the following bug
triggering due to the first warning.

  kernel BUG at kernel/futex.c:679!
  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
  Hardware name: linux,dummy-virt (DT)
  task: ffff80001e271780 task.stack: ffff000010908000
  PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
  LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
  pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145

The fact that it's a bug instead of a warning was due to an unrelated
arm64 problem, but the warning itself triggered because the underlying
mapping changed.

This is an application issue but from a kernel perspective it's a
recoverable situation and the warning is unnecessary so this patch
removes the warning.  The warning may potentially be triggered with the
following test program from Mark although it may be necessary to adjust
NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
system.

    #include <linux/futex.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define NR_FUTEX_THREADS 16
    pthread_t threads[NR_FUTEX_THREADS];

    void *mem;

    #define MEM_PROT  (PROT_READ | PROT_WRITE)
    #define MEM_SIZE  65536

    static int futex_wrapper(int *uaddr, int op, int val,
                             const struct timespec *timeout,
                             int *uaddr2, int val3)
    {
        return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    void *poll_futex(void *unused)
    {
        for (;;) {
            futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
        }
    }

    int main(int argc, char *argv[])
    {
        int i;

        mem = mmap(NULL, MEM_SIZE, MEM_PROT,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        printf("Mapping @ %p\n", mem);

        printf("Creating futex threads...\n");

        for (i = 0; i < NR_FUTEX_THREADS; i++)
            pthread_create(&threads[i], NULL, poll_futex, NULL);

        printf("Flipping mapping...\n");
        for (;;) {
            mmap(mem, MEM_SIZE, MEM_PROT,
                 MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        }

        return 0;
    }

Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.7+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>