git.proxmox.com Git - mirror_ubuntu-zesty-kernel.git/log

Input: ALPS - fix V8+ protocol handling (73 03 28)

BugLink: http://bugs.launchpad.net/bugs/1677589
commit e7348396c6d51b57c95c6646c390cd078e038e19 upstream.

Devices identified as E7="73 03 28" use slightly modified version of V8
protocol, with lower count per electrode, different offsets, and different
feature bits in OTP data.

Fixes: aeaa881f9b17 ("Input: ALPS - set DualPoint flag for 74 03 28 devices")
Signed-off-by: Masaki Ota <masaki.ota@jp.alps.com>
Acked-by: Pali Rohar <pali.rohar@gmail.com>
Tested-by: Paul Donohue <linux-kernel@PaulSD.com>
Tested-by: Nick Fletcher <nick.m.fletcher@gmail.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

HID: sony: Fix input device leak when connecting a DS4 twice using USB/BT

BugLink: http://bugs.launchpad.net/bugs/1677589
commit a687c5765b5ae19fe559e14615ddc87ebb46d409 upstream.

When a user connects a DS4 twice using USB and BT, we reject the
second device connection after the setup work. We then perform
a cleanup, but during cleanup we are not removing the touchpad
device. This leads to leakage of an input device, which we would
never remove. It can likely result into a kernel oops as well
when the touchpad evdev node is accessed and the underlaying HID
device has been removed from the system.

[jkosina@suse.cz: added stable annotation]
Fixes: ac797b95f532 ("HID: sony: Make the DS4 touchpad a separate device")
Signed-off-by: Roderick Colenbrander <roderick.colenbrander@sony.com>
Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: solve a NAPI race

BugLink: http://bugs.launchpad.net/bugs/1677589
commit 39e6c8208d7b6fb9d2047850fb3327db567b564b upstream.

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
available.
                                          device hard irq

                                          // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

amd-xgbe: Fix the ECC-related bit position definitions

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit f43feef4e6acde10857fcbfdede790d6b3f2c71d ]

The ECC bit positions that describe whether the ECC interrupt is for
Tx, Rx or descriptor memory and whether the it is a single correctable
or double detected error were defined in incorrectly (reversed order).
Fix the bit position definitions for these settings so that the proper
ECC handling is performed.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

tcp: initialize icsk_ack.lrcvtime at session start time

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 15bb7745e94a665caf42bfaabf0ce062845b533b ]

icsk_ack.lrcvtime has a 0 value at socket creation time.

tcpi_last_data_recv can have bogus value if no payload is ever received.

This patch initializes icsk_ack.lrcvtime for active sessions
in tcp_finish_connect(), and for passive sessions in
tcp_create_openreq_child()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

genetlink: fix counting regression on ctrl_dumpfamily()

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 1d2a6a5e4bf2921531071fcff8538623dce74efa ]

Commit 2ae0f17df1cd ("genetlink: use idr to track families") replaced

if (++n < fams_to_skip)
continue;
into:

if (n++ < fams_to_skip)
continue;

This subtle change cause that on retry ctrl_dumpfamily() call we omit
one family that failed to do ctrl_fill_info() on previous call, because
cb->args[0] = n number counts also family that failed to do
ctrl_fill_info().

Patch fixes the problem and avoid confusion in the future just decrease
n counter when ctrl_fill_info() fail.

User visible problem caused by this bug is failure to get access to
some genetlink family i.e. nl80211. However problem is reproducible
only if number of registered genetlink families is big enough to
cause second call of ctrl_dumpfamily().

Cc: Xose Vazquez Perez <xose.vazquez@gmail.com>
Cc: Larry Finger <Larry.Finger@lwfinger.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Fixes: 2ae0f17df1cd ("genetlink: use idr to track families")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

socket, bpf: fix sk_filter use after free in sk_clone_lock

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit a97e50cc4cb67e1e7bff56f6b41cda62ca832336 ]

In sk_clone_lock(), we create a new socket and inherit most of the
parent's members via sock_copy() which memcpy()'s various sections.
Now, in case the parent socket had a BPF socket filter attached,
then newsk->sk_filter points to the same instance as the original
sk->sk_filter.

sk_filter_charge() is then called on the newsk->sk_filter to take a
reference and should that fail due to hitting max optmem, we bail
out and release the newsk instance.

The issue is that commit 278571baca2a ("net: filter: simplify socket
charging") wrongly combined the dismantle path with the failure path
of xfrm_sk_clone_policy(). This means, even when charging failed, we
call sk_free_unlock_clone() on the newsk, which then still points to
the same sk_filter as the original sk.

Thus, sk_free_unlock_clone() calls into __sk_destruct() eventually
where it tests for present sk_filter and calls sk_filter_uncharge()
on it, which potentially lets sk_omem_alloc wrap around and releases
the eBPF prog and sk_filter structure from the (still intact) parent.

Fix it by making sure that when sk_filter_charge() failed, we reset
newsk->sk_filter back to NULL before passing to sk_free_unlock_clone(),
so that we don't mess with the parents sk_filter.

Only if xfrm_sk_clone_policy() fails, we did reach the point where
either the parent's filter was NULL and as a result newsk's as well
or where we previously had a successful sk_filter_charge(), thus for
that case, we do need sk_filter_uncharge() to release the prior taken
reference on sk_filter.

Fixes: 278571baca2a ("net: filter: simplify socket charging")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

ipv4: provide stronger user input validation in nl_fib_input()

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit c64c0b3cac4c5b8cb093727d2c19743ea3965c0b ]

Alexander reported a KMSAN splat caused by reads of uninitialized
field (tb_id_in) from user provided struct fib_result_nl

It turns out nl_fib_input() sanity tests on user input is a bit
wrong :

User can pretend nlh->nlmsg_len is big enough, but provide
at sendmsg() time a too small buffer.

Reported-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: bcmgenet: remove bcmgenet_internal_phy_setup()

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 31739eae738ccbe8b9d627c3f2251017ca03f4d2 ]

Commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset")
removed the bcmgenet_mii_reset() function from bcmgenet_power_up() and
bcmgenet_internal_phy_setup() functions.  In so doing it broke the reset
of the internal PHY devices used by the GENETv1-GENETv3 which required
this reset before the UniMAC was enabled.  It also broke the internal
GPHY devices used by the GENETv4 because the config_init that installed
the AFE workaround was no longer occurring after the reset of the GPHY
performed by bcmgenet_phy_power_set() in bcmgenet_internal_phy_setup().
In addition the code in bcmgenet_internal_phy_setup() related to the
"enable APD" comment goes with the bcmgenet_mii_reset() so it should
have also been removed.

Commit bd4060a6108b ("net: bcmgenet: Power on integrated GPHY in
bcmgenet_power_up()") moved the bcmgenet_phy_power_set() call to the
bcmgenet_power_up() function, but failed to remove it from the
bcmgenet_internal_phy_setup() function.  Had it done so, the
bcmgenet_internal_phy_setup() function would have been empty and could
have been removed at that time.

Commit 5dbebbb44a6a ("net: bcmgenet: Software reset EPHY after power on")
was submitted to correct the functional problems introduced by
commit 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset"). It
was included in v4.4 and made available on 4.3-stable. Unfortunately,
it didn't fully revert the commit because this bcmgenet_mii_reset()
doesn't apply the soft reset to the internal GPHY used by GENETv4 like
the previous one did. This prevents the restoration of the AFE work-
arounds for internal GPHY devices after the bcmgenet_phy_power_set() in
bcmgenet_internal_phy_setup().

This commit takes the alternate approach of removing the unnecessary
bcmgenet_internal_phy_setup() function which shouldn't have been in v4.3
so that when bcmgenet_mii_reset() was restored it should have only gone
into bcmgenet_power_up().  This will avoid the problems while also
removing the redundancy (and hopefully some of the confusion).

Fixes: 6ac3ce8295e6 ("net: bcmgenet: Remove excessive PHY reset")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

ipv6: make sure to initialize sockc.tsflags before first use

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit d515684d78148884d5fc425ba904c50f03844020 ]

In the case udp_sk(sk)->pending is AF_INET6, udpv6_sendmsg() would
jump to do_append_data, skipping the initialization of sockc.tsflags.
Fix the problem by moving sockc.tsflags initialization earlier.

The bug was detected with KMSAN.

Fixes: c14ac9451c34 ("sock: enable timestamping using control messages")
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/mlx5e: Count LRO packets correctly

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 8ab7e2ae15d84ba758b2c8c6f4075722e9bd2a08 ]

RX packets statistics ('rx_packets' counter) used to count LRO packets
as one, even though it contains multiple segments.
This patch will increment the counter by the number of segments, and
align the driver with the behavior of other drivers in the stack.

Note that no information is lost in this patch due to 'rx_lro_packets'
counter existence.

Before, ethtool showed:
$ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets"
     rx_packets: 435277
     rx_lro_packets: 35847
     rx_packets_phy: 1935066

Now, we will see the more logical statistics:
$ ethtool -S ens6 | egrep "rx_packets|rx_lro_packets"
     rx_packets: 1935066
     rx_lro_packets: 35847
     rx_packets_phy: 1935066

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/mlx5e: Count GSO packets correctly

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit d3a4e4da54c7adb420d5f48e89be913b14bdeff1 ]

TX packets statistics ('tx_packets' counter) used to count GSO packets
as one, even though it contains multiple segments.
This patch will increment the counter by the number of segments, and
align the driver with the behavior of other drivers in the stack.

Note that no information is lost in this patch due to 'tx_tso_packets'
counter existence.

Before, ethtool showed:
$ ethtool -S ens6 | egrep "tx_packets|tx_tso_packets"
     tx_packets: 61340
     tx_tso_packets: 60954
     tx_packets_phy: 2451115

Now, we will see the more logical statistics:
$ ethtool -S ens6 | egrep "tx_packets|tx_tso_packets"
     tx_packets: 2451115
     tx_tso_packets: 60954
     tx_packets_phy: 2451115

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Signed-off-by: Gal Pressman <galp@mellanox.com>
Cc: kernel-team@fb.com
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/mlx5: Increase number of max QPs in default profile

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 5f40b4ed975c26016cf41953b7510fe90718e21c ]

With ConnectX-4 sharing SRQs from the same space as QPs, we hit a
limit preventing some applications to allocate needed QPs amount.
Double the size to 256K.

Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB adapters')
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/mlx5e: Use the proper UAPI values when offloading TC vlan actions

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 09c91ddf2cd33489c2c14edfef43ae38d412888e ]

Currently we use the non UAPI values and we miss erring on
the modify action which is not supported, fix that.

Fixes: 8b32580df1cb ('net/mlx5e: Add TC vlan action for SRIOV offloads')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/mlx5: Add missing entries for set/query rate limit commands

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 1f30a86c58093046dc3e49c23d2618894e098f7a ]

The switch cases for the rate limit set and query commands were
missing, which could get us wrong under fw error or driver reset
flow, fix that.

Fixes: 1466cc5b23d1 ('net/mlx5: Rate limit tables support')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Hadar Hen Zion <hadarh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: vrf: Reset rt6i_idev in local dst after put

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 3dc857f0e8fc22610a59cbb346ba62c6e921863f ]

The VRF driver takes a reference to the inet6_dev on the VRF device for
its rt6_local dst when handling local traffic through the VRF device as
a loopback. When the device is deleted the driver does a put on the idev
but does not reset rt6i_idev in the rt6_info struct. When the dst is
destroyed, dst_destroy calls ip6_dst_destroy which does a second put for
what is essentially the same reference causing it to be prematurely freed.
Reset rt6i_idev after the put in the vrf driver.

Fixes: b4869aa2f881e ("net: vrf: ipv6 support for local traffic to
local addresses")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

qmi_wwan: add Dell DW5811e

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 6bd845d1cf98b45c634baacb8381436dad3c2dd0 ]

This is a Dell branded Sierra Wireless EM7455. It is operating in
MBIM mode by default, but can be configured to provide two QMI/RMNET
functions.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: unix: properly re-increment inflight counter of GC discarded candidates

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 7df9c24625b9981779afb8fcdbe2bb4765e61147 ]

Dmitry has reported that a BUG_ON() condition in unix_notinflight()
may be triggered by a simple code that forwards unix socket in an
SCM_RIGHTS message.
That is caused by incorrect unix socket GC implementation in unix_gc().

The GC first collects list of candidates, then (a) decrements their
"children's" inflight counter, (b) checks which inflight counters are
now 0, and then (c) increments all inflight counters back.
(a) and (c) are done by calling scan_children() with inc_inflight or
dec_inflight as the second argument.

Commit 6209344f5a37 ("net: unix: fix inflight counting bug in garbage
collector") changed scan_children() such that it no longer considers
sockets that do not have UNIX_GC_CANDIDATE flag. It also added a block
of code that that unsets this flag _before_ invoking
scan_children(, dec_iflight, ). This may lead to incorrect inflight
counters for some sockets.

This change fixes this bug by changing order of operations:
UNIX_GC_CANDIDATE is now unset only after all inflight counters are
restored to the original state.

  kernel BUG at net/unix/garbage.c:149!
  RIP: 0010:[<ffffffff8717ebf4>]  [<ffffffff8717ebf4>]
  unix_notinflight+0x3b4/0x490 net/unix/garbage.c:149
  Call Trace:
   [<ffffffff8716cfbf>] unix_detach_fds.isra.19+0xff/0x170 net/unix/af_unix.c:1487
   [<ffffffff8716f6a9>] unix_destruct_scm+0xf9/0x210 net/unix/af_unix.c:1496
   [<ffffffff86a90a01>] skb_release_head_state+0x101/0x200 net/core/skbuff.c:655
   [<ffffffff86a9808a>] skb_release_all+0x1a/0x60 net/core/skbuff.c:668
   [<ffffffff86a980ea>] __kfree_skb+0x1a/0x30 net/core/skbuff.c:684
   [<ffffffff86a98284>] kfree_skb+0x184/0x570 net/core/skbuff.c:705
   [<ffffffff871789d5>] unix_release_sock+0x5b5/0xbd0 net/unix/af_unix.c:559
   [<ffffffff87179039>] unix_release+0x49/0x90 net/unix/af_unix.c:836
   [<ffffffff86a694b2>] sock_release+0x92/0x1f0 net/socket.c:570
   [<ffffffff86a6962b>] sock_close+0x1b/0x20 net/socket.c:1017
   [<ffffffff81a76b8e>] __fput+0x34e/0x910 fs/file_table.c:208
   [<ffffffff81a771da>] ____fput+0x1a/0x20 fs/file_table.c:244
   [<ffffffff81483ab0>] task_work_run+0x1a0/0x280 kernel/task_work.c:116
   [<     inline     >] exit_task_work include/linux/task_work.h:21
   [<ffffffff8141287a>] do_exit+0x183a/0x2640 kernel/exit.c:828
   [<ffffffff8141383e>] do_group_exit+0x14e/0x420 kernel/exit.c:931
   [<ffffffff814429d3>] get_signal+0x663/0x1880 kernel/signal.c:2307
   [<ffffffff81239b45>] do_signal+0xc5/0x2190 arch/x86/kernel/signal.c:807
   [<ffffffff8100666a>] exit_to_usermode_loop+0x1ea/0x2d0
  arch/x86/entry/common.c:156
   [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
   [<ffffffff81009693>] syscall_return_slowpath+0x4d3/0x570
  arch/x86/entry/common.c:259
   [<ffffffff881478e6>] entry_SYSCALL_64_fastpath+0xc4/0xc6

Link: https://lkml.org/lkml/2017/3/6/252
Signed-off-by: Andrey Ulanov <andreyu@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 6209344 ("net: unix: fix inflight counting bug in garbage collector")
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

openvswitch: Add missing case OVS_TUNNEL_KEY_ATTR_PAD

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 8f3dbfd79ed9ef9770305a7cc4e13dfd31ad2cd0 ]

Added a case for OVS_TUNNEL_KEY_ATTR_PAD to the switch statement
in ip_tun_from_nlattr in order to prevent the default case
returning an error.

Fixes: b46f6ded906e ("libnl: nla_put_be64(): align on a 64-bit area")
Signed-off-by: Kris Murphy <kriskend@linux.vnet.ibm.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

amd-xgbe: Fix jumbo MTU processing on newer hardware

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 622c36f143fc9566ba49d7cec994c2da1182d9e2 ]

Newer hardware does not provide a cumulative payload length when multiple
descriptors are needed to handle the data. Once the MTU increases beyond
the size that can be handled by a single descriptor, the SKB does not get
built properly by the driver.

The driver will now calculate the size of the data buffers used by the
hardware. The first buffer of the first descriptor is for packet headers
or packet headers and data when the headers can't be split. Subsequent
descriptors in a multi-descriptor chain will not use the first buffer. The
second buffer is used by all the descriptors in the chain for payload data.
Based on whether the driver is processing the first, intermediate, or last
descriptor it can calculate the buffer usage and build the SKB properly.

Tested and verified on both old and new hardware.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: properly release sk_frag.page

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 22a0e18eac7a9e986fec76c60fa4a2926d1291e2 ]

I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()

TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
sk_common_release() at close time, thus leaking one (order-3) page.

iSCSI is using such sockets.

Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net: bcmgenet: Do not suspend PHY if Wake-on-LAN is enabled

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 5371bbf4b295eea334ed453efa286afa2c3ccff3 ]

Suspending the PHY would be putting it in a low power state where it
may no longer allow us to do Wake-on-LAN.

Fixes: cc013fb48898 ("net: bcmgenet: correctly suspend and resume PHY device")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

net/openvswitch: Set the ipv6 source tunnel key address attribute correctly

BugLink: http://bugs.launchpad.net/bugs/1677589
[ Upstream commit 3d20f1f7bd575d147ffa75621fa560eea0aec690 ]

When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.

Fixes: 6b26ba3a7d95 ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

UBUNTU: SAUCE: (no-up) net/mlx5: Avoid dereferencing uninitialized pointer

BugLink: http://bugs.launchpad.net/bugs/1676786
In NETDEV_CHANGEUPPER event the upper_info field is valid
only when linking is true. Otherwise it should be ignored.

Fixes: 7907f23adc18 (net/mlx5: Implement RoCE LAG feature)
Signed-off-by: Talat Batheesh <talatb@mellanox.com>
Reviewed-by: Aviv Heller <avivh@mellanox.com>
Reviewed-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

drivers: hv: Turn off write permission on the hypercall page

BugLink: http://bugs.launchpad.net/bugs/1676635
The hypercall page only needs to be executable but currently it is setup to
be writable as well. Fix the issue.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Tested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 372b1e91343e657a7cc5e2e2bcecd5140ac28119)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

UBUNTU: SAUCE: (no-up) hv: Supply vendor ID and package ABI

BugLink: http://bugs.launchpad.net/bugs/1193172
BugLink: http://bugs.launchpad.net/bugs/1676635
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv_utils: implement Hyper-V PTP source

BugLink: http://bugs.launchpad.net/bugs/1676635
With TimeSync version 4 protocol support we started updating system time
continuously through the whole lifetime of Hyper-V guests. Every 5 seconds
there is a time sample from the host which triggers do_settimeofday[64]().
While the time from the host is very accurate such adjustments may cause
issues:
- Time is jumping forward and backward, some applications may misbehave.
- In case an NTP server runs in parallel and uses something else for time
sync (network, PTP,...) system time will never converge.
- Systemd starts annoying you by printing "Time has been changed" every 5
seconds to the system log.

Instead of doing in-kernel time adjustments offload the work to an
NTP client by exposing TimeSync messages as a PTP device. Users may now
decide what they want to use as a source.

I tested the solution with chrony, the config was:

refclock PHC /dev/ptp0 poll 3 dpoll -2 offset 0

The result I'm seeing is accurate enough, the time delta between the guest
and the host is almost always within [-10us, +10us], the in-kernel solution
was giving us comparable results.

I also tried implementing PPS device instead of PTP by using not currently
used Hyper-V synthetic timers (we use only one of four for clockevent) but
with PPS source only chrony wasn't able to give me the required accuracy,
the delta often more that 100us.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3716a49a81ba19dda7202633a68b28564ba95eb5)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: export current Hyper-V clocksource

BugLink: http://bugs.launchpad.net/bugs/1676635
As a preparation to implementing Hyper-V PTP device supporting
.getcrosststamp we need to export a reference to the current Hyper-V
clocksource in use (MSR or TSC page).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dee863b571b0a76e9c549ee99e8782bb4bc6502b)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: Fix the bug in generating the guest ID

BugLink: http://bugs.launchpad.net/bugs/1676635
Fix the bug in the generation of the guest ID. Without this fix
the host side telemetry code is broken.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Fixes: 352c9624242d ("Drivers: hv: vmbus: Move the definition of generate_guest_id()")
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9b06e1018abc65585b07c75c5b3f406dbabe7005)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: Log the negotiated IC versions.

BugLink: http://bugs.launchpad.net/bugs/1676635
Log the negotiated IC versions.

Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1274a690f6b2bd2b37447c47e3062afa8aa43f93)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Use all supported IC versions to negotiate

BugLink: http://bugs.launchpad.net/bugs/1676635
Previously, we were assuming that each IC protocol version was tied to a
specific host version. For example, some Windows 10 preview hosts only
support v3 TimeSync even though driver assumes v4 is supported by all
Windows 10 hosts.

The guest will stop trying to negotiate even though older supported
versions may still be offered by the host.

Make IC version negotiation more robust by going through all versions
that are supported by the guest.

Fixes: 3da0401b4d0e ("Drivers: hv: utils: Fix the mapping between host
version and protocol to use")

Reported-by: Rolf Neugebauer <rolf.neugebauer@docker.com>
Signed-off-by: Alex Ng <alexng@messages.microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a1656454131880980bc3a5313c8bf66ef5990c91)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: balloon: add a fall through comment to hv_memory_notifier()

BugLink: http://bugs.launchpad.net/bugs/1676635
Coverity scan gives a warning when there is fall through in a switch
without a comment. This fall through is intentional as ol_waitevent needs
to be completed to unblock hv_mem_hot_add() allowing it to process next
requests regardless of the result of if we were able to online this block.

Reported-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ad6d41253bf91eabb41626683c35a712ba27a20c)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: restore TSC page cleanup before kexec

BugLink: http://bugs.launchpad.net/bugs/1676635
We need to cleanup the TSC page before doing kexec/kdump or the new kernel
may crash if it tries to use it.

Fixes: 63ed4e0c67df ("Drivers: hv: vmbus: Consolidate all Hyper-V specific clocksource code")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5647dbf8f0807a35421bd0232247b02413ef2cab)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: restore hypervcall page cleanup before kexec

BugLink: http://bugs.launchpad.net/bugs/1676635
We need to cleanup the hypercall page before doing kexec/kdump or the new
kernel may crash if it tries to use it. Reuse the now-empty hv_cleanup
function renaming it to hyperv_cleanup and moving to the arch specific
code.

Fixes: 8730046c1498 ("Drivers: hv vmbus: Move Hypercall page setup out of common code")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d6f3609d2b4c6d0eec01f398cb685e50da3e6013)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv_util: switch to using timespec64

BugLink: http://bugs.launchpad.net/bugs/1676635
do_settimeofday() is deprecated, use do_settimeofday64() instead.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 17244623a4c0f68d3f02c9c74d9b6ae259425826)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Cleanup hyperv_vmbus.h

BugLink: http://bugs.launchpad.net/bugs/1676635
Get rid of all unused definitions.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8e27a236312c4ab6dc8dbd303552b771d3569cf1)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Define an APIs to manage interrupt state

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of cleaning up architecture specific code, define APIs
to manage interrupt state.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 37e11d5c7052a5ca55ef807731c75218ea341b4c)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Define an API to retrieve virtual processor index

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of cleaning up architecture specific code, define an API
to retrieve the virtual procesor index.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7297ff0ca9db7e2d830841035b95d8b94b529142)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Define APIs to manipulate the synthetic interrupt controller

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of cleaning up architecture specific code, define APIs
to manipulate the interrupt controller state.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 06d1d98a839f196e94cb726008fb2118e430f356)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Define APIs to manipulate the event page

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of cleaning up architecture specific code, define APIs
to manipulate the event page.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8e307bf82d76ab02e95a00d132d926f04db6ccab)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Define APIs to manipulate the message page

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of cleaning up architecture specific code, define APIs
to manipulate the message page.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 155e4a2f28a59e5344dfa7c5d003161fe59a5bf2)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Get rid of an unsused variable

BugLink: http://bugs.launchpad.net/bugs/1676635
The version variable while it is initialized is not used;
get rid of it.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d383877db60bcc7fd02d1051a90e078d731dfb59)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: util: Use hv_get_current_tick() to get current tick

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to interact with Hyper-V in an instruction set
architecture independent way, use the new API to get the current
tick.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 305f7549c9298247723c255baddb7a54b4e63050)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Restructure the clockevents code

BugLink: http://bugs.launchpad.net/bugs/1676635
Move the relevant code that programs the hypervisor to an architecture
specific file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d5116b4091ecca271c249ede43a49c1245920558)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the code to signal end of message

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
code for signaling end of message.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e810e48c0c9a1a1ebb90cfe966bce6dc80ce08e7)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the check for hypercall page setup

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
check for detecting if the hypercall page is setup.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 73638cddaad861a5ebb2b119d8b318d4bded8f8d)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the crash notification function

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
crash notification function.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d058fa7e98ff01a4b4750a2210fc19906db3cbe1)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the extracting of Hypervisor version information

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code,
extract hypervisor version information in an architecture specific
file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8de8af7e0873c4fdac2205327dff922819e16657)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Consolidate all Hyper-V specific clocksource code

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code,
consolidate all Hyper-V specific clocksource code to an architecture
specific code.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 63ed4e0c67df332681ebfef6eca6852da28d6300)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move Hypercall invocation code out of common code

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
hypercall invocation code to an architecture specific file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6ab42a66d2cc10afefea9f9e5d9a5ad5a836d254)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv vmbus: Move Hypercall page setup out of common code

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
hypercall page setup to an architecture specific file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8730046c1498e8fb8c9a124789893944e8ce8220)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the definition of generate_guest_id()

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
definition of generate_guest_id() to x86 specific header file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 352c9624242d5836ad8a960826183011367871a4)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Move the definition of hv_x64_msr_hypercall_contents

BugLink: http://bugs.launchpad.net/bugs/1676635
As part of the effort to separate out architecture specific code, move the
definition of hv_x64_msr_hypercall_contents to x86 specific header file.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3f646ed70ccd1c4e5c1263d2922247d28c8e08f0)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: util: Backup: Fix a rescind processing issue

BugLink: http://bugs.launchpad.net/bugs/1676635
VSS may use a char device to support the communication between
the user level daemon and the driver. When the VSS channel is rescinded
we need to make sure that the char device is fully cleaned up before
we can process a new VSS offer from the host. Implement this logic.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d77044d142e960f7b5f814a91ecb8bcf86aa552c)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: util: Fcopy: Fix a rescind processing issue

BugLink: http://bugs.launchpad.net/bugs/1676635
Fcopy may use a char device to support the communication between
the user level daemon and the driver. When the Fcopy channel is rescinded
we need to make sure that the char device is fully cleaned up before
we can process a new Fcopy offer from the host. Implement this logic.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 20951c7535b5e6af46bc37b7142105f716df739c)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: util: kvp: Fix a rescind processing issue

BugLink: http://bugs.launchpad.net/bugs/1676635
KVP may use a char device to support the communication between
the user level daemon and the driver. When the KVP channel is rescinded
we need to make sure that the char device is fully cleaned up before
we can process a new KVP offer from the host. Implement this logic.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5a66fecbf6aa528e375cbebccb1061cc58d80c84)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Fix a rescind handling bug

BugLink: http://bugs.launchpad.net/bugs/1676635
The host can rescind a channel that has been offered to the
guest and once the channel is rescinded, the host does not
respond to any requests on that channel. Deal with the case where
the guest may be blocked waiting for a response from the host.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ccb61f8a99e6c29df4fb96a65dad4fad740d5be9)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: make CPU offlining prevention fine-grained

BugLink: http://bugs.launchpad.net/bugs/1676635
Since commit e513229b4c38 ("Drivers: hv: vmbus: prevent cpu offlining on
newer hypervisors") cpu offlining was disabled. It is still true that we
can't offline CPUs which have VMBus channels bound to them but we may have
'free' CPUs (e.v. we booted with maxcpus= parameter and onlined CPUs after
VMBus was initialized), these CPUs may be disabled without issues.

In future, we may even allow closing CPUs which have only sub-channels
assinged to them by closing these sub-channels. All devices will continue
to work.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 523b94087078f7f5ac10b7d9cd04277927031c39)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: switch to cpuhp state machine for synic init/cleanup

BugLink: http://bugs.launchpad.net/bugs/1676635
To make it possible to online/offline CPUs switch to cpuhp infrastructure
for doing hv_synic_init()/hv_synic_cleanup().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 76d36ab79820430f73c584673aef10ba2446fced)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Prevent sending data on a rescinded channel

BugLink: http://bugs.launchpad.net/bugs/1676635
After the channel is rescinded, the host does not read from the rescinded channel.
Fail writes to a channel that has already been rescinded. If we permit writes on a
rescinded channel, since the host will not respond we will have situations where
we will be unable to unload vmbus drivers that cannot have any outstanding requests
to the host at the point they are unoaded.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <Stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e7e97dd8b77ee7366f2f8c70a033bf5fa05ec2e0)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: don't reset hv_context.tsc_page on crash

BugLink: http://bugs.launchpad.net/bugs/1676635
It may happen that secondary CPUs are still alive and resetting
hv_context.tsc_page will cause a consequent crash in read_hv_clock_tsc()
as we don't check for it being not NULL there. It is safe as we're not
freeing this page anyways.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 56ef6718a1d8d77745033c5291e025ce18504159)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: init percpu_list in hv_synic_alloc()

BugLink: http://bugs.launchpad.net/bugs/1676635
Initializing hv_context.percpu_list in hv_synic_alloc() helps to prevent a
crash in percpu_channel_enq() when not all CPUs were online during
initialization and it naturally belongs there.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3c7630d35009e6635e5b58d62de554fd5b6db5df)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

hv: allocate synic pages for all present CPUs

BugLink: http://bugs.launchpad.net/bugs/1676635
It may happen that not all CPUs are online when we do hv_synic_alloc() and
in case more CPUs come online later we may try accessing these allocated
structures.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 421b8f20d3c381b215f988b42428f56fc3b82405)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()

BugLink: http://bugs.launchpad.net/bugs/1676635
DoS protection conditions were altered in WS2016 and now it's easy to get
-EAGAIN returned from vmbus_post_msg() (e.g. when we try changing MTU on a
netvsc device in a loop). All vmbus_post_msg() callers don't retry the
operation and we usually end up with a non-functional device or crash.

While host's DoS protection conditions are unknown to me my tests show that
it can take up to 10 seconds before the message is sent so doing udelay()
is not an option, we really need to sleep. Almost all vmbus_post_msg()
callers are ready to sleep but there is one special case:
vmbus_initiate_unload() which can be called from interrupt/NMI context and
we can't sleep there. I'm also not sure about the lonely
vmbus_send_tl_connect_request() which has no in-tree users but its external
users are most likely waiting for the host to reply so sleeping there is
also appropriate.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c0bb03924f1a80e7f65900e36c8e6b3dc167c5f8)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "UBUNTU: SAUCE: (no-up) hv: Supply vendor ID and package ABI"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 2400c988c0b5da90b7035bfce63f1105e66b3423.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "drivers: hv: Turn off write permission on the hypercall page"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 71a5a0559d132a6bb20e63e8e9c62fbd22666137.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: util: Backup: Fix a rescind processing issue"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 8da12e10a191c62830c277f35a4fa5403eb1bcd2.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: util: Fcopy: Fix a rescind processing issue"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit c9d4b38c5c386cab269664832fdcd9d6b878f998.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: util: kvp: Fix a rescind processing issue"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit ca0b5897e11ebcd15770561849f45f2c7a980d85.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: vmbus: Fix a rescind handling bug"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 1b7d44c16f61522ee0c7b79d6f666a89c3244a5a.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: vmbus: Prevent sending data on a rescinded channel"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 81afb2c5dfd49aab0f6a3240c83d975416b53245.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "hv: init percpu_list in hv_synic_alloc()"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit db60c8d6cc34f9966be31c574ec20d577a6730a2.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "hv: allocate synic pages for all present CPUs"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 9b66ff22466f0f566fe688a0e77d03a4e7fb11de.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "Drivers: hv: vmbus: Raise retry/wait limits in vmbus_post_msg()"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit 816725f684dd5d018c4314f79797d0ea8eccdd9b.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

Revert "hv: don't reset hv_context.tsc_page on crash"

BugLink: http://bugs.launchpad.net/bugs/1676635
This reverts commit e7a2222fc8a0d23d0e6020f04cccc63ff545f9bf.

Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/64: Use optimized checksum routines on little-endian

BugLink: http://bugs.launchpad.net/bugs/1670247
Currently we have optimized hand-coded assembly checksum routines for
big-endian 64-bit systems, but for little-endian we use the generic C
routines. This modifies the optimized routines to work for
little-endian. With this, we no longer need to enable
CONFIG_GENERIC_CSUM. This also fixes a couple of comments in
checksum_64.S so they accurately reflect what the associated instruction
does.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Use the more common __BIG_ENDIAN__]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit d4fde568a34a93897dfb9ae64cfe9dda9d5c908c)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/64: Fix checksum folding in csum_tcpudp_nofold and ip_fast_csum_nofold

BugLink: http://bugs.launchpad.net/bugs/1670247
These functions compute an IP checksum by computing a 64-bit sum and
folding it to 32 bits (the "nofold" in their names refers to folding
down to 16 bits). However, doing (u32) (s + (s >> 32)) is not
sufficient to fold a 64-bit sum to 32 bits correctly. The addition
can produce a carry out from bit 31, which needs to be added in to
the sum to produce the correct result.

To fix this, we copy the from64to32() function from lib/checksum.c
and use that.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b492f7e4e07a28e706db26cf4943bb0911435426)
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

scsi: storvsc: Workaround for virtual DVD SCSI version

BugLink: http://bugs.launchpad.net/bugs/1674635
Hyper-V host emulation of SCSI for virtual DVD device reports SCSI
version 0 (UNKNOWN) but is still capable of supporting REPORTLUN.

Without this patch, a GEN2 Linux guest on Hyper-V will not boot 4.11
successfully with virtual DVD ROM device. What happens is that the SCSI
scan process falls back to doing sequential probing by INQUIRY. But the
storvsc driver has a previous workaround that masks/blocks all errors
reports from INQUIRY (or MODE_SENSE) commands. This workaround causes
the scan to then populate a full set of bogus LUN's on the target and
then sends kernel spinning off into a death spiral doing block reads on
the non-existent LUNs.

By setting the correct blacklist flags, the target with the DVD device
is scanned with REPORTLUN and that works correctly.

Patch needs to go in current 4.11, it is safe but not necessary in older
kernels.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit f1c635b439a5c01776fe3a25b1e2dc546ea82e6f)
Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/powernv: Remove separate entry for OPAL real mode calls

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
All entry points already read the MSR so they can easily do
the right thing.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit ab9bad0ead9ab179ace09988a3f1cfca122eb7c2)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/powernv: Initialise nest mmu

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
POWER9 contains an off core mmu called the nest mmu (NMMU). This is
used by other hardware units on the chip to translate virtual
addresses into real addresses. The unit attempting an address
translation provides the majority of the context required for the
translation request except for the base address of the partition table
(ie. the PTCR) which needs to be programmed into the NMMU.

This patch adds a call to OPAL to set the PTCR for the nest mmu in
opal_init().

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 1d0761d2557d1540727723e4f05395d53321d555)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book 3S: XICS: Don't lock twice when checking for resend

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This patch improves the code that takes lock twice to check the resend flag
and do the actual resending, by checking the resend flag locklessly, and
add a boolean parameter check_resend to icp_[rm_]deliver_irq(), so the
resend flag can be checked in the lock when doing the delivery.

We need make sure when we clear the ics's bit in the icp's resend_map, we
don't miss the resend flag of the irqs that set the bit. It could be
ordered through the barrier in test_and_clear_bit(), and a newly added
wmb between setting irq's resend flag, and icp's resend_map.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
(cherry picked from commit 21acd0e4df04f02176e773468658c3cebff096bb)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc: Update to new option-vector-5 format for CAS

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
On POWER9 the ibm,client-architecture-support (CAS) negotiation process
has been updated to change how the host to guest negotiation is done for
the new hash/radix mmu as well as the nest mmu, process tables and guest
translation shootdown (GTSE).

This is documented in the unreleased PAPR ACR "CAS option vector
additions for P9".

The host tells the guest which options it supports in
ibm,arch-vec-5-platform-support. The guest then chooses a subset of these
to request in the CAS call and these are agreed to in the
ibm,architecture-vec-5 property of the chosen node.

Thus we read ibm,arch-vec-5-platform-support and make our selection before
calling CAS. We then parse the ibm,architecture-vec-5 property of the
chosen node to check whether we should run as hash or radix.

ibm,arch-vec-5-platform-support format:

index value pairs: <index, val> ... <index, val>

index: Option vector 5 byte number
val: Some representation of supported values

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
[mpe: Don't print about unknown options, be consistent with OV5_FEAT]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 014d02cbf16b3106dc8e93281d2a9c189751ed5e)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/64: Invalidate process table caching after setting process table

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
The POWER9 MMU reads and caches entries from the process table.
When we kexec from one kernel to another, the second kernel sets
its process table pointer but doesn't currently do anything to
make the CPU invalidate any cached entries from the old process table.
This adds a tlbie (TLB invalidate entry) instruction with parameters
to invalidate caching of the process table after the new process
table is installed.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 7a70d7288c926ae88e0c773fbb506aa374e99c2d)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book 3S: Fix error return in kvm_vm_ioctl_create_spapr_tce()

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
Fix to return error code -ENOMEM from the memory alloc error handling
case instead of 0, as done elsewhere in this function.

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
(cherry picked from commit 5982f0849e08fe4e4e7df5e345c4539ce9780b1b)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Don't try to signal cpu -1

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
-1 as the cpu argument.  Although this is not meaningful, in the past,
before commit 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
to other cores on POWER9", 2016-11-18), it was harmless because CPU
-1 is not in the same core as any real CPU thread.  On a POWER9,
however, we don't do the "same core" check, so we were trying to
do a msgsnd to thread -1, which is invalid.  To avoid this, we add
a check to see that vcpu->arch.thread_cpu is >= 0 before calling
kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_vcpu can change
asynchronously, we use READ_ONCE to ensure that the value we check is
the same value that we use as the argument to kvmppc_ipi_thread().

Fixes: 1704a81ccebc ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
(cherry picked from commit 3deda5e50c893be38c1b6b3a73f8f8fb5560baa4)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
In HPT mode on POWER9, the ASDR register is supposed to record
segment information for hypervisor page faults.  It turns out that
POWER9 DD1 does not record the page size information in the ASDR
for faults in guest real mode.  We have the necessary information
in memory already, so by moving the checks for real mode that already
existed, we can use the in-memory copy.  Since a load is likely to
be faster than reading an SPR, we do this unconditionally (not just
for POWER9 DD1).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
(cherry picked from commit 4e5acdc23a3dcbd6ad6dc93a9783dd9c838987c8)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Fix software walk of guest process page tables

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This fixes some bugs in the code that walks the guest's page tables.
These bugs cause MMIO emulation to fail whenever the guest is in
virtial mode (MMU on), leading to the guest hanging if it tried to
access a virtio device.

The first bug was that when reading the guest's process table, we were
using the whole of arch->process_table, not just the field that contains
the process table base address. The second bug was that the mask used
when reading the process table entry to get the radix tree base address,
RPDB_MASK, had the wrong value.

Fixes: 9e04ba69beec ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests")
Fixes: e99833448c5f ("powerpc/mm/radix: Add partition table format & callback")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
(cherry picked from commit 70cd4c10b290dd77fff6dc702a9a2c8c679df121)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

powerpc/64: CONFIG_RELOCATABLE support for hmi interrupts

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
The branch from hmi_exception_early to hmi_exception_realmode must use
a "relocatable-style" branch, because it is branching from unrelocated
exception code to beyond __end_interrupts.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 2337d207288f163e10bd8d4d7eeb0c1c75046a0c)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Enable radix guest support

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adds a few last pieces of the support for radix guests:

* Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
  KVM_PPC_GET_RMMU_INFO ioctls for radix guests

* On POWER9, allow secondary threads to be on/off-lined while guests
  are running.

* Set up LPCR and the partition table entry for radix guests.

* Don't allocate the rmap array in the kvm_memory_slot structure
  on radix.

* Don't try to initialize the HPT for radix guests, since they don't
  have an HPT.

* Take out the code that prevents the HV KVM module from
  initializing on radix hosts.

At this stage, we only support radix guests if the host is running
in radix mode, and only support HPT guests if the host is running in
HPT mode.  Thus a guest cannot switch from one mode to the other,
which enables some simplifications.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8cf4ecc0ca9bd9bdc9b4ca0a99f7445a1e74afed)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Invalidate ERAT on guest entry/exit for POWER9 DD1

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
On POWER9 DD1, we need to invalidate the ERAT (effective to real
address translation cache) when changing the PIDR register, which
we do as part of guest entry and exit.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit f11f6f79b606fb54bb388d0ea652ed889b2fdf86)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Allow guest exit path to have MMU on

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
If we allow LPCR[AIL] to be set for radix guests, then interrupts from
the guest to the host can be delivered by the hardware with relocation
on, and thus the code path starting at kvmppc_interrupt_hv can be
executed in virtual mode (MMU on) for radix guests (previously it was
only ever executed in real mode).

Most of the code is indifferent to whether the MMU is on or off, but
the calls to OPAL that use the real-mode OPAL entry code need to
be switched to use the virtual-mode code instead.  The affected
calls are the calls to the OPAL XICS emulation functions in
kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
bit to detect whether we are in real or virtual mode, and call the
opal_rm_* or opal_* function as appropriate.

The other place that depends on the MMU being off is the optimization
where the guest exit code jumps to the external interrupt vector or
hypervisor doorbell interrupt vector, or returns to its caller (which
is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
the caller, then we don't need to use an rfid instruction since the
MMU is already on; a simple blr suffices.  If there is an external
or hypervisor doorbell interrupt to handle, we branch to the
relocation-on version of the interrupt vector.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 53af3ba2e8195f504d6a3a0667ccb5e7d4c57599)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions.  Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu.  However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another.  Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly.  The usual symptom of this
is random segfaults in userspace programs in the guest.

To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed.  If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest.  This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush.  The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:

CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)

VCPU 0 runs task X      VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel

(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries

That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.

Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core.  To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it.  This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit a29ebeaf5575d03eef178bb87c425a1e46cae1ca)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Make HPT-specific hypercalls return error in radix mode

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
If the guest is in radix mode, then it doesn't have a hashed page
table (HPT), so all of the hypercalls that manipulate the HPT can't
work and should return an error. This adds checks to make them
return H_FUNCTION ("function not supported").

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 65dae5403a162fe6ef7cd8b2835de9d23c303891)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Implement dirty page logging for radix guests

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adds code to keep track of dirty pages when requested (that is,
when memslot->dirty_bitmap is non-NULL) for radix guests. We use the
dirty bits in the PTEs in the second-level (partition-scoped) page
tables, together with a bitmap of pages that were dirty when their
PTE was invalidated (e.g., when the page was paged out). This bitmap
is stored in the first half of the memslot->dirty_bitmap area, and
kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
bitmap that gets returned to userspace.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8f7b79b8379a85fb8dd0c3f42d9f452ec5552161)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: MMU notifier callbacks for radix guests

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adapts our implementations of the MMU notifier callbacks
(unmap_hva, unmap_hva_range, age_hva, test_age_hva, set_spte_hva)
to call radix functions when the guest is using radix. These
implementations are much simpler than for HPT guests because we
have only one PTE to deal with, so we don't need to traverse
rmap chains.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 01756099e0a5f431bbada9693d566269acfb51f9)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Page table construction and page faults for radix guests

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adds the code to construct the second-level ("partition-scoped" in
architecturese) page tables for guests using the radix MMU.  Apart from
the PGD level, which is allocated when the guest is created, the rest
of the tree is all constructed in response to hypervisor page faults.

As well as hypervisor page faults for missing pages, we also get faults
for reference/change (RC) bits needing to be set, as well as various
other error conditions.  For now, we only set the R or C bit in the
guest page table if the same bit is set in the host PTE for the
backing page.

This code can take advantage of the guest being backed with either
transparent or ordinary 2MB huge pages, and insert 2MB page entries
into the guest page tables.  There is no support for 1GB huge pages
yet.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 5a319350a46572d073042a3194676099dd2c135d)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adds code to branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.

Since the host is now using radix, we need to save and restore the
host value for the PID register.

On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit f4c51f841d2ac7d36cacb84efbc383190861f87c)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Add basic infrastructure for radix guests

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 9e04ba69beec372ddf857c700ff922e95f50b0d0)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>

KVM: PPC: Book3S HV: Use ASDR for HPT guests on POWER9

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1675806
POWER9 adds a register called ASDR (Access Segment Descriptor
Register), which is set by hypervisor data/instruction storage
interrupts to contain the segment descriptor for the address
being accessed, assuming the guest is using HPT translation.
(For radix guests, it contains the guest real address of the
access.)

Thus, for HPT guests on POWER9, we can use this register rather
than looking up the SLB with the slbfee. instruction.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit ef8c640cb9cc865a461827b698fcc55b0ecaa600)
Signed-off-by: Breno Leitao <breno.leitao@gmail.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>