page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
slab pages, because it depends on PgHead AND PgSlab flags to be set to
determine the memory cgroup from the kmem_cache. It's correct for
compound pages, but not for generic small pages. Those don't have PgHead
set, so it ends up returning zero.
Fix this by replacing the condition to PageSlab() && !PageTail().
Before this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 38 0 _______S___________________________________ slab
After this patch:
[root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
0x0000000000000080 147 0 _______S___________________________________ slab
Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
to filter error injection events based on memcg. So if
page_cgroup_ino() fails to return memcg pointer, we just fail to inject
memory error. Considering that hwpoison filter is for testing, affected
users are limited and the impact should be marginal.
[n-horiguchi@ah.jp.nec.com: changelog additions] Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
In the current code, we use the atomic_cmpxchg() to serialize the output
of the dump_stack(), but this implementation suffers the thundering herd
problem. We have observed such kind of livelock on a Marvell cn96xx
board(24 cpus) when heavily using the dump_stack() in a kprobe handler.
Actually we can let the competitors to wait for the releasing of the
lock before jumping to atomic_cmpxchg(). This will definitely mitigate
the thundering herd problem. Thanks Linus for the suggestion.
[akpm@linux-foundation.org: fix comment] Link: http://lkml.kernel.org/r/20191030031637.6025-1-haokexin@gmail.com Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()") Signed-off-by: Kevin Hao <haokexin@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
/proc/pagetypeinfo is a debugging tool to examine internal page
allocator state wrt to fragmentation. It is not very useful for any
other use so normal users really do not need to read this file.
Waiman Long has noticed that reading this file can have negative side
effects because zone->lock is necessary for gathering data and that a)
interferes with the page allocator and its users and b) can lead to hard
lockups on large machines which have very long free_list.
Reduce both issues by simply not exporting the file to regular users.
Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org Fixes: 467c996c1e19 ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Waiman Long <longman@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Rafael Aquini <aquini@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Jann Horn <jannh@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
We have a usecase to use tmpfs as QEMU memory backend and we would like
to take the advantage of THP as well. But, our test shows the EPT is
not PMD mapped even though the underlying THP are PMD mapped on host.
The number showed by /sys/kernel/debug/kvm/largepage is much less than
the number of PMD mapped shmem pages as the below:
And some benchmarks do worse than with anonymous THPs.
By digging into the code we figured out that commit 127393fbe597 ("mm:
thp: kvm: fix memory corruption in KVM with THP enabled") checks if
there is a single PTE mapping on the page for anonymous THP when setting
up EPT map. But the _mapcount < 0 check doesn't work for page cache THP
since every subpage of page cache THP would get _mapcount inc'ed once it
is PMD mapped, so PageTransCompoundMap() always returns false for page
cache THP. This would prevent KVM from setting up PMD mapped EPT entry.
So we need handle page cache THP correctly. However, when page cache
THP's PMD gets split, kernel just remove the map instead of setting up
PTE map like what anonymous THP does. Before KVM calls get_user_pages()
the subpages may get PTE mapped even though it is still a THP since the
page cache THP may be mapped by other processes at the mean time.
Checking its _mapcount and whether the THP has PTE mapped or not.
Although this may report some false negative cases (PTE mapped by other
processes), it looks not trivial to make this accurate.
With this fix /sys/kernel/debug/kvm/largepage would show reasonable
pages are PMD mapped by EPT as the below:
Deferred memory initialisation updates zone->managed_pages during the
initialisation phase but before that finishes, the per-cpu page
allocator (pcpu) calculates the number of pages allocated/freed in
batches as well as the maximum number of pages allowed on a per-cpu
list. As zone->managed_pages is not up to date yet, the pcpu
initialisation calculates inappropriately low batch and high values.
This increases zone lock contention quite severely in some cases with
the degree of severity depending on how many CPUs share a local zone and
the size of the zone. A private report indicated that kernel build
times were excessive with extremely high system CPU usage. A perf
profile indicated that a large chunk of time was lost on zone->lock
contention.
This patch recalculates the pcpu batch and high values after deferred
initialisation completes for every populated zone in the system. It was
tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
workload -- allmodconfig and all available CPUs.
mmtests configuration: config-workload-kernbench-max Configuration was
modified to build on a fresh XFS partition.
accompanied by an increase in machines going completely radio silent
under memory pressure.
One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
sock objects to kmemcg"), which made these slab caches subject to cgroup
memory accounting and control.
The problem with that is that cgroups, unlike the page allocator, do not
maintain dedicated atomic reserves. As a cgroup's usage hovers at its
limit, atomic allocations - such as done during network rx - can fail
consistently for extended periods of time. The kernel is not able to
operate under these conditions.
We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.
We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accomodate a consumer that is impossible to
predict - we'd be wasting memory and get into configuration headaches,
not unlike what we have going with min_free_kbytes. We do this for
physical mem because we have to, but cgroups are an accounting game.
Instead, account these privileged allocations to the cgroup, but let
them bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure on
the rest of the cgroup, but like with the page allocator, we shift the
burden of reclaimining on behalf of atomic allocations onto the regular
allocations that can block.
Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
__mem_cgroup_free() can be called on the failure path in
mem_cgroup_alloc(). However memcg_flush_percpu_vmstats() and
memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
access the fields of memcg which can potentially be null if called from
failure path from mem_cgroup_alloc(). Indeed syzbot has reported the
following crash:
The unsolicited event handler for the headphone jack on CA0132 codec
driver tries to reschedule the another delayed work with
cancel_delayed_work_sync(). It's no good idea, unfortunately,
especially after we changed the work queue to the standard global
one; this may lead to a stall because both works are using the same
global queue.
Fix it by dropping the _sync but does call cancel_delayed_work()
instead.
For Focusrite Saffire Pro i/o, the lowest 8 bits of register represents
configured source of sampling clock. The next lowest 8 bits represents
whether the configured source is actually detected or not just after
the register is changed for the source.
Current implementation evaluates whole the register to detect configured
source. This results in failure due to the next lowest 8 bits when the
source is connected in advance.
The clean up commit 41672c0c24a6 ("ALSA: timer: Simplify error path in
snd_timer_open()") unified the error handling code paths with the
standard goto, but it introduced a subtle bug: the timer instance is
stored in snd_timer_open() incorrectly even if it returns an error.
This may eventually lead to UAF, as spotted by fuzzer.
The culprit is the snd_timer_open() code checks the
SNDRV_TIMER_IFLG_EXCLUSIVE flag with the common variable timeri.
This variable is supposed to be the newly created instance, but we
(ab-)used it for a temporary check before the actual creation of a
timer instance. After that point, there is another check for the max
number of instances, and it bails out if over the threshold. Before
the refactoring above, it worked fine because the code returned
directly from that point. After the refactoring, however, it jumps to
the unified error path that stores the timeri variable in return --
even if it returns an error. Unfortunately this stored value is kept
in the caller side (snd_timer_user_tselect()) in tu->timeri. This
causes inconsistency later, as if the timer was successfully
assigned.
In this patch, we fix it by not re-using timeri variable but a
temporary variable for testing the exclusive connection, so timeri
remains NULL at that point.
Functions like phy_modify_paged() read the current page, on Realtek
PHY's this means reading the value of register 0x1f. Add special
handling for reading this register, similar to what we do already
in r8168g_mdio_write(). Currently we read a random value that by
chance seems to be 0 always.
Fixes: a2928d28643e ("r8169: use paged versions of phylib MDIO access functions") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The "42f5cda5eaf4" commit rightly set SOCK_DONE on peer shutdown,
but there is an issue if we receive the SHUTDOWN(RDWR) while the
virtio_transport_close_timeout() is scheduled.
In this case, when the timeout fires, the SOCK_DONE is already
set and the virtio_transport_close_timeout() will not call
virtio_transport_reset() and virtio_transport_do_close().
This causes that both sockets remain open and will never be released,
preventing the unloading of [virtio|vhost]_transport modules.
This patch fixes this issue, calling virtio_transport_reset() and
virtio_transport_do_close() when we receive the SHUTDOWN(RDWR)
and there is nothing left to read.
Fixes: 42f5cda5eaf4 ("vsock/virtio: set SOCK_DONE on peer shutdown") Cc: Stephen Barber <smbarber@chromium.org> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
If a pnet table entry is to be added mentioning a valid ethernet
interface, but an invalid infiniband or ISM device, the dev_put()
operation for the ethernet interface is called twice, resulting
in a negative refcount for the ethernet interface, which disables
removal of such a network interface.
This patch removes one of the dev_put() calls.
Fixes: 890a2cb4a966 ("net/smc: rework pnet table") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
When a new filter is added to cls_api, the function
tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
determine if the tcf_proto is duplicated in the chain's hashtable. It then
creates a new entry or continues with an existing one. In cls_flower, this
allows the function fl_ht_insert_unque to determine if a filter is a
duplicate and reject appropriately, meaning that the duplicate will not be
passed to drivers via the offload hooks. However, when a tcf_proto is
destroyed it is removed from its chain before a hardware remove hook is
hit. This can lead to a race whereby the driver has not received the
remove message but duplicate flows can be accepted. This, in turn, can
lead to the offload driver receiving incorrect duplicate flows and out of
order add/delete messages.
Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
hash table per block stores each unique chain/protocol/prio being
destroyed. This entry is only removed when the full destroy (and hardware
offload) has completed. If a new flow is being added with the same
identiers as a tc_proto being detroyed, then the add request is replayed
until the destroy is complete.
Fixes: 8b64678e0af8 ("net: sched: refactor tp insert/delete for concurrent execution") Signed-off-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Reported-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
This patch fixes the problem of the spin locks, originally
meant for the netpoll path of hns driver, causing deadlock in
the normal NAPI poll path. The issue happened due to the presence
of the stray leftover spin lock code related to the netpoll,
whose support was earlier removed from the HNS[1], got activated
due to enabling of NET_POLL_CONTROLLER switch.
Earlier background:
The netpoll handling code originally had this bug(as identified
by Marc Zyngier[2]) of wrong spin lock API being used which did
not disable the interrupts and hence could cause locking issues.
i.e. if the lock were first acquired in context to thread like
'ip' util and this lock if ever got later acquired again in
context to the interrupt context like TX/RX (Interrupts could
always pre-empt the lock holding task and acquire the lock again)
and hence could cause deadlock.
Proposed Solution:
1. If the netpoll was enabled in the HNS driver, which is not
right now, we could have simply used spin_[un]lock_irqsave()
2. But as netpoll is disabled, therefore, it is best to get rid
of the existing locks and stray code for now. This should
solve the problem reported by Marc.
Fixes: 4bd2c03be707 ("net: hns: remove ndo_poll_controller") Cc: lipeng <lipeng321@huawei.com> Cc: Yisen Zhuang <yisen.zhuang@huawei.com> Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Reported-by: Marc Zyngier <maz@kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> Tested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
While looking at a syzbot KCSAN report [1], I found multiple
issues in this code :
1) fib6_nh->last_probe has an initial value of 0.
While probably okay on 64bit kernels, this causes an issue
on 32bit kernels since the time_after(jiffies, 0 + interval)
might be false ~24 days after boot (for HZ=1000)
2) The data-race found by KCSAN
I could use READ_ONCE() and WRITE_ONCE(), but we also can
take the opportunity of not piling-up too many rt6_probe_deferred()
works by using instead cmpxchg() so that only one cpu wins the race.
[1]
BUG: KCSAN: data-race in find_match / find_match
read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
rt6_probe net/ipv6/route.c:657 [inline]
find_match net/ipv6/route.c:757 [inline]
find_match+0x521/0x790 net/ipv6/route.c:733
__find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
find_rr_leaf net/ipv6/route.c:852 [inline]
rt6_select net/ipv6/route.c:896 [inline]
fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
__tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh") Fixes: f547fac624be ("ipv6: rate-limit probes for neighbourless routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Add a test which spawns 16 threads and performs concurrent
send and recv calls on the same socket.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.
TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.
This has two problems:
- writers don't wake other threads up when they leave the kernel;
meaning that this scheme works for single extra thread (second
application thread or delayed work) because memory becoming
available will send a wake up request, but as Mallesham and
Pooja report with larger number of threads it leads to threads
being put to sleep indefinitely;
- the delayed work does not get _scheduled_ but it may _run_ when
other writers are present leading to crashes as writers don't
expect state to change under their feet (same records get pushed
and freed multiple times); it's hard to reliably bail from the
work, however, because the mere presence of a writer does not
guarantee that the writer will push pending records before exiting.
Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.
The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Reported-by: Mallesham Jatharakonda <mallesh537@gmail.com> Reported-by: Pooja Trivedi <poojatrivedi@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
sk_write_pending being not zero does not guarantee that partial
record will be pushed. If the thread waiting for memory times out
the pending record may get stuck.
In case of tls_device there is no path where parial record is
set and writer present in the first place. Partial record is
set only in tls_push_sg() and tls_push_sg() will return an
error immediately. All tls_device callers of tls_push_sg()
will return (and not wait for memory) if it failed.
Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The check that the event is actually for this device should be moved
from the "port" handler to the net device handler.
Otherwise the port handler will deny bonding configuration for other
net devices in the same system (like enetc in the LS1028A) that don't
have the lag_upper_info->tx_type restriction that ocelot has.
Fixes: dc96ee3730fc ("net: mscc: ocelot: add bonding support") Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
While rebooting the system with SR-IOV vfs enabled leads
to below crash due to recurrence of __qede_remove() on the VF
devices (first from .shutdown() flow of the VF itself and
another from PF's .shutdown() flow executing pci_disable_sriov())
This patch adds a safeguard in __qede_remove() flow to fix this,
so that driver doesn't attempt to remove "already removed" devices.
The variable nfcid_skb is not changed in the callee nfc_hci_get_param()
if error occurs. Consequently, the freed variable nfcid_skb will be
freed again, resulting in a double free bug. Set nfcid_skb to NULL after
releasing it to fix the bug.
Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The function nfc_put_device(dev) is called twice to drop the reference
to dev when there is no associated local llcp. Remove one of them to fix
the bug.
Fixes: 52feb444a903 ("NFC: Extend netlink interface for LTO, RW, and MIUX parameters support") Fixes: d9b8d8e19b07 ("NFC: llcp: Service Name Lookup netlink interface") Signed-off-by: Pan Bian <bianpan2016@163.com> Reviewed-by: Johan Hovold <johan@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
The address of fw_vsc_cfg is on stack. Releasing it with devm_kfree() is
incorrect, which may result in a system crash or other security impacts.
The expected object to free is *fw_vsc_cfg.
Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Aleksander Morgado <aleksander@aleksander.es> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
sk_msg_trim() tries to only update curr pointer if it falls into
the trimmed region. The logic, however, does not take into the
account pointer wrapping that sk_msg_iter_var_prev() does nor
(as John points out) the fact that msg->sg is a ring buffer.
This means that when the message was trimmed completely, the new
curr pointer would have the value of MAX_MSG_FRAGS - 1, which is
neither smaller than any other value, nor would it actually be
correct.
Special case the trimming to 0 length a little bit and rework
the comparison between curr and end to take into account wrapping.
This bug caused the TLS code to not copy all of the message, if
zero copy filled in fewer sg entries than memcopy would need.
Big thanks to Alexander Potapenko for the non-KMSAN reproducer.
v2:
- take into account that msg->sg is a ring buffer (John).
Link: https://lore.kernel.org/netdev/20191030160542.30295-1-jakub.kicinski@netronome.com/ Fixes: d829e9c4112b ("tls: convert to generic sk_msg interface") Reported-by: syzbot+f8495bff23a879a6d0bd@syzkaller.appspotmail.com Reported-by: syzbot+6f50c99e8f6194bf363f@syzkaller.appspotmail.com Co-developed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
During the exit/unregistration process of the RmNet driver, the function
rmnet_unregister_real_device() is called to handle freeing the driver's
internal state and removing the RX handler on the underlying physical
device. However, the order of operations this function performs is wrong
and can lead to a use after free of the rmnet_port structure.
Before calling netdev_rx_handler_unregister(), this port structure is
freed with kfree(). If packets are received on any RmNet devices before
synchronize_net() completes, they will attempt to use this already-freed
port structure when processing the packet. As such, before cleaning up any
other internal state, the RX handler must be unregistered in order to
guarantee that no further packets will arrive on the device.
Fixes: ceed73a2cf4a ("drivers: net: ethernet: qualcomm: rmnet: Initial implementation") Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Octeon's input ring-buffer entry has 14 bits-wide size field, so to account
for second possible VLAN header max_mtu must be further reduced.
Fixes: 109cc16526c6d ("ethernet/cavium: use core min/max MTU checking") Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Hendrik reported routes in the main table using source address are not
removed when the address is removed. The problem is that fib_sync_down_addr
does not account for devices in the default VRF which are associated
with the main table. Fix by updating the table id reference.
Fixes: 5a56a0b3a45d ("net: Don't delete routes in different VRFs") Reported-by: Hendrik Donner <hd@os-cillation.de> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
A malicious device may give half an answer when asked
for its MTU. The driver will proceed after this with
a garbage MTU. Anything but a complete answer must be treated
as an error.
V2: used sizeof as request by Alexander
Reported-and-tested-by: syzbot+0631d878823ce2411636@syzkaller.appspotmail.com Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Since de77ecd4ef02 ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.
Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding. This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.
The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL. On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN. The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.
Resolve this by combining the two variables into one.
Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Reported-by: Sha Zhang <zhangsha.zhang@huawei.com> Cc: Mahesh Bandewar <maheshb@google.com> Fixes: de77ecd4ef02 ("bonding: improve link-status update in mii-monitoring") Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
AceLan Kao [Thu, 7 Nov 2019 06:36:00 +0000 (07:36 +0100)]
UBUNTU: SAUCE: platform/x86: dell-uart-backlight: add quirk for old platforms
BugLink: https://bugs.launchpad.net/bugs/1813877
Old platforms do not support DELL_UART_GET_SCALAR command and the
behavior of DELL_UART_GET_FIRMWARE_VER command is different as the new
firmware, so the new way to check if the backlight is controlled by
scalar IC doesn't work on old platforms. We now add them into a list and
use the old way to do the check.
Signed-off-by: AceLan Kao <acelan.kao@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
AceLan Kao [Thu, 7 Nov 2019 06:36:00 +0000 (07:36 +0100)]
UBUNTU: SAUCE: platform/x86: dell-uart-backlight: add force parameter
BugLink: https://bugs.launchpad.net/bugs/1813877
Add force parameter to force load the driver if the platform doesn't
provide a working scalar status command.
Signed-off-by: AceLan Kao <acelan.kao@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
AceLan Kao [Thu, 7 Nov 2019 06:36:00 +0000 (07:36 +0100)]
UBUNTU: SAUCE: platform/x86: dell-uart-backlight: add missing status command
BugLink: https://bugs.launchpad.net/bugs/1813877
DELL_UART_GET_SCALAR has been declared in
drivers/platform/x86/dell-uart-backlight.h, but its definition is
missing. It won't lead to issues on old AIO platforms, since this
command is newly introduced and is not supported by all old AIOs.
Signed-off-by: AceLan Kao <acelan.kao@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Slawomir Laba [Wed, 13 Nov 2019 13:39:17 +0000 (10:39 -0300)]
i40e: Fix crash caused by stress setting of VF MAC addresses
BugLink: https://bugs.launchpad.net/bugs/1852432
Add update to the VSI pointer passed to the i40e_set_vf_mac function.
If VF is in reset state the driver waits in i40e_set_vf_mac function
for the reset to be complete, yet after reset the vsi pointer
that was passed into this function is no longer valid.
The patch updates local VSI pointer directly from pf->vsi array,
by using the id stored in VF pointer (lan_vsi_idx).
Without this commit the driver might occasionally invoke general
protection fault in kernel and disable the OS entirely.
Signed-off-by: Slawomir Laba <slawomirx.laba@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 9889707b06acfe9bb37a6edcaae627d4a5eacc72) Signed-off-by: Heitor Alves de Siqueira <halves@canonical.com> Acked-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
When deleting VMs with VFs created by i40e, a general protection
fault occurs in i40e_config_vf_promiscuous_mode due to race
condition for vsi->mac_filter_hash
And it also happens when deleteing pod with VFs
This issue was reported in e1000-devel mailling list
https://sourceforge.net/p/e1000/mailman/message/36766306/
Suggested-by: Billy McFall <bmcfall@redhat.com> Signed-off-by: Gerald Yang <gerald.yang@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Sultan Alsawaf <sultan.alsawaf@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Seth Forshee [Mon, 18 Nov 2019 15:39:00 +0000 (16:39 +0100)]
UBUNTU: [Packaging] Fix module signing with older modinfo
BugLink: https://bugs.launchpad.net/bugs/1852581
Not all versions of modinfo support the signer field;
specifically, the version in boinic does not. This leaves all
modules unsigned in hwe kernels based on eoan and later. Change
the check to look for the magic string at the end of the module,
which does not rely on any external tools being aware of module
signatures.
Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Connor Kuehl [Wed, 13 Nov 2019 22:35:25 +0000 (14:35 -0800)]
UBUNTU: [Config] CONFIG_CAVIUM_TX2_ERRATUM_219=y
BugLink: https://bugs.launchpad.net/bugs/1850456
Ubuntu commit ff47386da551 ("arm64: Allow CAVIUM_TX2_ERRATUM_219 to be
selected") received this config option via an upstream stable update.
Accept the default suggested value of 'y' to avoid spurious data aborts.
BugLink: https://bugs.launchpad.net/bugs/1852111
Ubuntu commit from an upstream stable sync 2d4c93a1a8b2 ("ASoC: SOF:
Intel: hda: Disable DMI L1 entry during capture") brings in a config
option to aid in troubleshooting power optimizations for HDaudio
platforms. Accept the default value of 'n'.
Andy Whitcroft [Fri, 8 Nov 2019 10:25:00 +0000 (11:25 +0100)]
efi: efi_get_memory_map -- increase map headroom
We are seeing some EFI based machines failing to boot hard in the EFI
stub:
exit_boot() failed!
efi_main() failed!
This seems to occur when the bootloader (grub2 in this case) has had
to manipulate some additional files due to a change in the way MAAS
boots the machines. We tracked this down to the memory map dance
efi_get_memory_map(). Basically we attempt to close boot services and
it informs us it cannot do so because it failed to record the updated
memory map. This occurs when there is insufficient space in the passed
memory map buffer to record changes during the operation. At the point
when this occurs we are unable to call the allocation functions to
reallocate the buffer so we panic.
To avoid this we allocate some additional entries in the buffer to cover
any additional entries. This headroom is currently insufficient for
these machines under this use case. Increase EFI_MMAP_NR_SLACK_SLOTS to
provide space for more memory map modifications.
BugLink: https://bugs.launchpad.net/bugs/1851810 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
wget tends to be excessivly chatty. Request a less verbose output (-nv)
while retaining files downloaded or failure for the same.
BugLink: https://bugs.launchpad.net/bugs/1850958 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Andy Whitcroft [Fri, 1 Nov 2019 21:48:00 +0000 (22:48 +0100)]
UBUNTU: [Packaging] dkms -- try launchpad librarian for pool downloads
When we build a kernel we capture the current dkms package versions.
This allows us to keep the versions syncronised in an entire SRU cycle
between the master and derivative kernels. It also allows us to maintain
the same versions on respin. This suffers from the issue that they may
expire from the archive pool if the dkms packages are revved.
The same packages are available in the launchpad librarian and indeed
(now) via well known names. So when are attempting to download from the
archive pool first attempt to get the package from the primary archive
librarian first. If this fails continue to do the pool scan as normal.
This increases the chance we will find an appropriate package regardless
of the currently published packages, whilst retaining the ability to build
in a more restricted environment (such as when we have no internet access).
BugLink: https://bugs.launchpad.net/bugs/1850958 Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4208966f65f520d7f392dbaa62e39a8fa88ffb95) Signed-off-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1850175
There is a race condition that can occur when calling ena_down().
The ena_clean_tx_irq() - which is a part of the napi handler -
function might wake up the tx queue when the queue is supposed
to be down (during recovery or changing the size of the queues
for example) This causes the ena_start_xmit() function to trigger
and possibly try to access the destroyed queues.
After these flows the tx queue is active and ena_start_xmit() accesses
the destroyed queue which leads to a kernel panic.
fixes: 1738cd3ed342 (net: ena: Add a driver for Amazon Elastic Network Adapters (ENA))
Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a53651ec93a8d7ab5b26c5390e0c389048b4b4b6) Signed-off-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Khaled Elmously <khalid.elmously@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Matt Roper [Thu, 17 Oct 2019 12:19:00 +0000 (14:19 +0200)]
drm/i915/cml: Add second PCH ID for CMP
BugLink: http://bugs.launchpad.net/bugs/1848491
The CMP PCH ID we have in the driver is correct for the CML-U machines we have
in our CI system, but the CML-S and CML-H CI machines appear to use a
different PCH ID, leading our driver to detect no PCH for them.
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Anusha Srivatsa <anusha.srivatsa@intel.com>
References: 729ae330a0f2e2 ("drm/i915/cml: Introduce Comet Lake PCH")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111461 Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Acked-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190916233251.387-1-matthew.d.roper@intel.com
(backported from commit 8698ba53cd7173c32320ebbef4d389d41ebb5780) Signed-off-by: Timo Aaltonen <timo.aaltonen@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Sasha Neftin [Thu, 17 Oct 2019 17:48:35 +0000 (01:48 +0800)]
UBUNTU: SAUCE: e1000e: Add support for Comet Lake
BugLink: https://bugs.launchpad.net/bugs/1848555
Add devices ID's for the next LOM generations that will be
available on the next Intel Client platform (Comet Lake)
This patch provides the initial support for these devices
UBUNTU: SAUCE: seccomp: fix SECCOMP_USER_NOTIF_FLAG_CONTINUE test
The ifndef for SECCOMP_USER_NOTIF_FLAG_CONTINUE was placed under the
ifndef for the SECCOMP_FILTER_FLAG_NEW_LISTENER feature. This will not
work on systems that do support SECCOMP_FILTER_FLAG_NEW_LISTENER but do not
support SECCOMP_USER_NOTIF_FLAG_CONTINUE. So move the latter ifndef out of
the former ifndef's scope.
2019-10-20 11:14:01 make run_tests -C seccomp
make: Entering directory '/usr/src/perf_selftests-x86_64-rhel-7.6-0eebfed2954f152259cae0ad57b91d3ea92968e8/tools/testing/selftests/seccomp'
gcc -Wl,-no-as-needed -Wall seccomp_bpf.c -lpthread -o seccomp_bpf
seccomp_bpf.c: In function ‘user_notification_continue’:
seccomp_bpf.c:3562:15: error: ‘SECCOMP_USER_NOTIF_FLAG_CONTINUE’ undeclared (first use in this function)
resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
seccomp_bpf.c:3562:15: note: each undeclared identifier is reported only once for each function it appears in
Makefile:12: recipe for target 'seccomp_bpf' failed
make: *** [seccomp_bpf] Error 1
make: Leaving directory '/usr/src/perf_selftests-x86_64-rhel-7.6-0eebfed2954f152259cae0ad57b91d3ea92968e8/tools/testing/selftests/seccomp'
BugLink: https://bugs.launchpad.net/bugs/1849281 Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: 0eebfed2954f ("seccomp: test SECCOMP_USER_NOTIF_FLAG_CONTINUE") Cc: linux-kselftest@vger.kernel.org Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Signed-off-by: Manoj Iyer <manoj.iyer@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
UBUNTU: SAUCE: seccomp: rework define for SECCOMP_USER_NOTIF_FLAG_CONTINUE
Switch from BIT(0) to (1UL << 0).
First, there are already two different forms used in the header, so there's
no need to add a third. Second, the BIT() macros is kernel internal and
afaict not actually exposed to userspace. Maybe there's some magic there
I'm missing but it definitely causes issues when compiling a program that
tries to use SECCOMP_USER_NOTIF_FLAG_CONTINUE. It currently fails in the
following way:
# github.com/lxc/lxd/lxd
/usr/bin/ld: $WORK/b001/_x003.o: in function
`__do_user_notification_continue':
lxd/main_checkfeature.go:240: undefined reference to `BIT'
collect2: error: ld returned 1 exit status
Switching to (1UL << 0) should prevent that and is more in line what is
already done in the rest of the header.
Kai-Heng Feng [Wed, 30 Oct 2019 14:01:49 +0000 (22:01 +0800)]
UBUNTU: SAUCE: x86/intel: Disable HPET on Intel Ice Lake platforms
BugLink: https://bugs.launchpad.net/bugs/1840239
Ice Lake platform have similar behavior as Coffee Lake, have skewed HPET
timer once the SoCs entered PC10 so let's disable HPET on Ice Lake.
as result.
Kai-Heng Feng [Wed, 30 Oct 2019 14:01:48 +0000 (22:01 +0800)]
UBUNTU: SAUCE: x86/intel: Disable HPET on Intel Coffe Lake platforms
BugLink: https://bugs.launchpad.net/bugs/1840239
Some Coffee Lake platforms have skewed HPET timer once the SoCs entered
PC10, and marked TSC as unstable clocksource as result.
Harry Pan identified it's a firmware bug [1].
To prevent creating a circular dependency between HPET and TSC, let's
disable HPET on affected platforms.
Nobuto Murata [Tue, 5 Nov 2019 01:56:47 +0000 (10:56 +0900)]
UBUNTU: [Packaging] include iavf/i40evf in generic
BugLink: https://bugs.launchpad.net/bugs/1848481
Intel's VF drivers such as igbvf and ixgbevf are already available in
generic. It makes sense to add iavf/i40evf especially for SR-IOV enabled
clouds with Intel X710/XXV710/XL710.
If a gadget driver is in the pending drivers list, a UDC
becomes available and udc_bind_to_driver() fails, then it
gets deleted from the pending list.
i.e. list_del(&driver->pending) in check_pending_gadget_drivers().
Then if that gadget driver is unregistered,
usb_gadget_unregister_driver() does a list_del(&driver->pending)
again thus causing a page fault as that list entry has been poisoned
by the previous list_del().
Fix this by using list_del_init() instead of list_del() in
check_pending_gadget_drivers().
Test case:
- Make sure no UDC is available
- modprobe g_mass_storage file=wrongfile
- Load UDC driver so it becomes available
lun0: unable to open backing file: wrongfile
- modprobe -r g_mass_storage
Newer versions of GCC (>= 9) demand that the size of the string to be
copied must be explicitly smaller than the size of the destination.
Thus, the NULL char has to be taken into account on strncpy.
This will avoid the following compiling error:
tlbie_test.c: In function 'main':
tlbie_test.c:639:4: error: 'strncpy' specified bound 100 equals destination size
strncpy(logdir, optarg, LOGDIR_NAME_SIZE);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Desnes A. Nunes do Rosario <desnesn@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191003211010.9711-1-desnesn@linux.ibm.com Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
[mpe: Some minor fixes to make it build] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190924035254.24612-4-aneesh.kumar@linux.ibm.com Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
When the client hits reconnect it iterates over the mid
pending queue marking entries for retry and moving them
to a temporary list to issue callbacks later without holding
GlobalMid_Lock. In the same time there is no guarantee that
mids can't be removed from the temporary list or even
freed completely by another thread. It may cause a temporary
list corruption:
Fix this by obtaining extra references for mids being retried
and marking them as MID_DELETED which indicates that such a mid
has been dequeued from the pending list.
Also move mid cleanup logic from DeleteMidQEntry to
_cifs_mid_q_entry_release which is called when the last reference
to a particular mid is put. This allows to avoid any use-after-free
of response buffers.
The patch needs to be backported to stable kernels. A stable tag
is not mentioned below because the patch doesn't apply cleanly
to any actively maintained stable kernel.
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-and-tested-by: David Wysochanski <dwysocha@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The SIMATIC IPC227E uses the PMC clock for on-board components and gets
stuck during boot if the clock is disabled. Therefore, add this device
to the critical systems list.
Fixes: 648e921888ad ("clk: x86: Stop marking clocks as CLK_IS_CRITICAL") Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
UDP IPv6 packets auto flowlabels are using a 32bit secret
(static u32 hashrnd in net/core/flow_dissector.c) and
apply jhash() over fields known by the receivers.
Attackers can easily infer the 32bit secret and use this information
to identify a device and/or user, since this 32bit secret is only
set at boot time.
Really, using jhash() to generate cookies sent on the wire
is a serious security concern.
Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
a dead end. Trying to periodically change the secret (like in sch_sfq.c)
could change paths taken in the network for long lived flows.
Let's switch to siphash, as we did in commit df453700e8d8
("inet: switch IP ID generator to siphash")
Using a cryptographically strong pseudo random function will solve this
privacy issue and more generally remove other weak points in the stack.
Packet schedulers using skb_get_hash_perturb() benefit from this change.
Fixes: b56774163f99 ("ipv6: Enable auto flow labels by default") Fixes: 42240901f7c4 ("ipv6: Implement different admin modes for automatic flow labels") Fixes: 67800f9b1f4e ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel") Fixes: cb1ce2ef387b ("ipv6: Implement automatic flow label generation on transmit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Berger <jonathann1@walla.com> Reported-by: Amit Klein <aksecurity@gmail.com> Reported-by: Benny Pinkas <benny@pinkas.net> Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The EPHY integrated into the 40nm Set-Top Box devices can falsely
detect energy when connected to a disabled peer interface. When the
peer interface is enabled the EPHY will detect and report the link
as active, but on occasion may get into a state where it is not
able to exchange data with the connected GENET MAC. This issue has
not been observed when the link parameters are auto-negotiated;
however, it has been observed with a manually configured link.
It has been empirically determined that issuing a soft reset to the
EPHY when energy is detected prevents it from getting into this bad
state.
Fixes: 1c1008c793fa ("net: bcmgenet: add main driver file") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The internal 40nm EPHYs use a "Workaround for putting the PHY in
IDDQ mode." These PHYs require a soft reset to restore functionality
after they are powered back up.
This commit defines the soft_reset function to use genphy_soft_reset
during phy_init_hw to accommodate this.
Fixes: 6e2d85ec0559 ("net: phy: Stop with excessive soft reset") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
If packet corruption failed we jump to finish_segs and return
NET_XMIT_SUCCESS. Seeing success will make the parent qdisc
increment its backlog, that's incorrect - we need to return
NET_XMIT_DROP.
Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
This device is sold as 'ThinkPad USB-C Dock Gen 2 (40AS)'.
Chipset is RTL8153 and works with r8152.
Without this, the generic cdc_ether grabs the device, and the device jam
connected networks up when the machine suspends.
Signed-off-by: Kazutoshi Noguchi <noguchi.kazutosi@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
As soon as the netdev is registers, the kernel can start using the
interface. If the driver connects the MAC to the PHY after the netdev
is registered, there is a race condition where the interface can be
opened without having the PHY connected.
Change the order to close this race condition.
Fixes: 92571a1aae40 ("lan78xx: Connect phy early") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.
Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)
So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.
We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.
We move in the first cache line fields that can be
dirtied often.
We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.
Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
To corrupt a GSO frame we first perform segmentation. We then
proceed using the first segment instead of the full GSO skb and
requeue the rest of the segments as separate packets.
If there are any issues with processing the first segment we
still want to process the rest, therefore we jump to the
finish_segs label.
Commit 177b8007463c ("net: netem: fix backlog accounting for
corrupted GSO frames") started using the pointer to the first
segment in the "rest of segments processing", but as mentioned
above the first segment may had already been freed at this point.
Backlog corrections for parent qdiscs have to be adjusted.
Fixes: 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Currently, the num_msi_left means the vector numbers of NIC,
but if the PF supported RoCE, it contains the vector numbers
of NIC and RoCE(Not expected).
This may cause interrupts lost in some case, because of the
NIC module used the vector resources which belongs to RoCE.
This patch adds a new variable num_nic_msi to store the vector
numbers of NIC, and adjust the default TQP numbers and rss_size
according to the value of num_nic_msi.
Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Thomas found that some forwarded packets would be stuck
in FQ packet scheduler because their skb->tstamp contained
timestamps far in the future.
We thought we addressed this point in commit 8203e2d844d3
("net: clear skb->tstamp in forwarding paths") but there
is still an issue when/if a packet needs to be fragmented.
In order to meet EDT requirements, we have to make sure all
fragments get the original skb->tstamp.
Note that this original skb->tstamp should be zero in
forwarding path, but might have a non zero value in
output path if user decided so.
Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Thomas Bartschies <Thomas.Bartschies@cvk.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Clearing the existing bitmask of mirrored ports essentially prevents us
from capturing more than one port at any given time. This is clearly
wrong, do not clear the bitmask prior to setting up the new port.
Reported-by: Hubert Feurstein <h.feurstein@gmail.com> Fixes: ed3af5fd08eb ("net: dsa: b53: Add support for port mirroring") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
When commit 28b2e0d2cd13 ("net: phy: remove parameter new_link from
phy_mac_interrupt()") removed the new_link parameter it set the
phydev->link state from the MAC before invoking phy_mac_interrupt().
However, once commit 88d6272acaaa ("net: phy: avoid unneeded MDIO
reads in genphy_read_status") was added this initialization prevents
the proper determination of the connection parameters by the function
genphy_read_status().
This commit removes that initialization to restore the proper
functionality.
Fixes: 88d6272acaaa ("net: phy: avoid unneeded MDIO reads in genphy_read_status") Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
This patch removes the iph field from the state structure, which is not
properly initialized. Instead, add a new field to make the "do we want
to set DF" be the state bit and move the code to set the DF flag from
ip_frag_next().
Joint work with Pablo and Linus.
Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators") Reported-by: Patrick Schönthaler <patrick@notvads.ovh> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
When CQE compression is enabled, compressed CQEs use the following
structure: a title is followed by one or many blocks, each containing 8
mini CQEs (except the last, which may contain fewer mini CQEs).
Due to NAPI budget restriction, a complete structure is not always
parsed in one NAPI run, and some blocks with mini CQEs may be deferred
to the next NAPI poll call - we have the mlx5e_decompress_cqes_cont call
in the beginning of mlx5e_poll_rx_cq. However, if the budget is
extremely low, some blocks may be left even after that, but the code
that follows the mlx5e_decompress_cqes_cont call doesn't check it and
assumes that a new CQE begins, which may not be the case. In such cases,
random memory corruptions occur.
An extremely low NAPI budget of 8 is used when busy_poll or busy_read is
active.
This commit adds a check to make sure that the previous compressed CQE
has been completely parsed after mlx5e_decompress_cqes_cont, otherwise
it prevents a new CQE from being fetched in the middle of a compressed
CQE.
This commit fixes random crashes in __build_skb, __page_pool_put_page
and other not-related-directly places, that used to happen when both CQE
compression and busy_poll/busy_read were enabled.
Ethtool self test contains a test for link speed. This test reads the
PTYS register and determines whether the current speed is valid or not.
Change current implementation to use the function mlx5e_port_linkspeed()
that does the same check and fails when speed is invalid. This code
redundancy lead to a bug when mlx5e_port_linkspeed() was updated with
expended speeds and the self test was not.
Fixes: 2c81bfd5ae56 ("net/mlx5e: Move port speed code from en_ethtool.c to en/port.c") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
As reported in [0] at least one RTL8168dp version has problems
establishing a link. This chip version has an integrated RTL8211b PHY,
however the chip seems to report a wrong PHY ID, resulting in a wrong
PHY driver (for Generic Realtek PHY) being loaded.
Work around this issue by adding a hook to r8168dp_2_mdio_read()
for returning the correct PHY ID.
The devlink parameter "acl_region_rehash_interval" is a runtime
parameter whose value is stored in a dynamically allocated memory. While
reloading the driver, this memory is freed and then allocated again. A
use-after-free might happen if during this time frame someone tries to
retrieve its value.
Since commit 070c63f20f6c ("net: devlink: allow to change namespaces
during reload") the use-after-free can be reliably triggered when
reloading the driver into a namespace, as after freeing the memory (via
reload_down() callback) all the parameters are notified.
Fix this by unpublishing and then re-publishing the parameters during
reload.
Fixes: 98bbf70c1c41 ("mlxsw: spectrum: add "acl_region_rehash_interval" devlink param") Fixes: 7c62cfb8c574 ("devlink: publish params only after driver init is done") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
If a nonblocking socket is immediately closed after connect(),
the connect worker may not have started. This results in a refcount
problem, since sock_hold() is called from the connect worker.
This patch moves the sock_hold in front of the connect worker
scheduling.
The union should contain the extended dest and counter list.
Remove the resevered 0x40 bits which is redundant.
This change doesn't break any functionally.
Everything works today because the code in fs_cmd.c is using
the correct structs if extended dest or the basic dest.
Termination tables are used for vlan push actions on uplink ports.
To support RoCE dual port the source port value was placed in a register.
Fix the code to use an API method returning the source port according to
the FW capabilities.
Fixes: 10caabdaad5a ("net/mlx5e: Use termination table for VLAN push actions") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
When rxrpc_recvmsg_data() sets the return value to 1 because it's drained
all the data for the last packet, it checks the last-packet flag on the
whole packet - but this is wrong, since the last-packet flag is only set on
the final subpacket of the last jumbo packet. This means that a call that
receives its last packet in a jumbo packet won't complete properly.
Fix this by having rxrpc_locate_data() determine the last-packet state of
the subpacket it's looking at and passing that back to the caller rather
than having the caller look in the packet header. The caller then needs to
cache this in the rxrpc_call struct as rxrpc_locate_data() isn't then
called again for this packet.
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code") Fixes: e2de6c404898 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The phylink_dbg() macro does not follow dynamic debug or defined(DEBUG)
and as a result, it spams the kernel log since a PR_DEBUG level is
currently used. Fix it to be defined appropriately whether
CONFIG_DYNAMIC_DEBUG or defined(DEBUG) are set.