David Howells [Wed, 20 Oct 2021 13:06:34 +0000 (14:06 +0100)]
fscache: Provide a function to resize a cookie
Provide a function to change the size of the storage attached to a cookie,
to match the size of the file being cached when it's changed by truncate or
fallocate:
David Howells [Wed, 20 Oct 2021 13:06:34 +0000 (14:06 +0100)]
fscache: Provide a function to note the release of a page
Provide a function to be called from a network filesystem's releasepage
method to indicate that a page has been released that might have been a
reflection of data upon the server - and now that data must be reloaded
from the server or the cache.
This is used to end an optimisation for empty files, in particular files
that have just been created locally, whereby we know there cannot yet be
any data that we would need to read from the server or the cache.
David Howells [Wed, 20 Oct 2021 22:50:01 +0000 (23:50 +0100)]
vfs, fscache: Implement pinning of cache usage for writeback
Cachefiles has a problem in that it needs to keep the backing file for a
cookie open whilst there are local modifications pending that need to be
written to it. However, we don't want to keep the file open indefinitely,
as that causes EMFILE/ENFILE/ENOMEM problems.
Reopening the cache file, however, is a problem if this is being done due
to writeback triggered by exit(). Some filesystems will oops if we try to
open a file in that context because they want to access current->fs or
other resources that have already been dismantled.
To get around this, I added the following:
(1) An inode flag, I_PINNING_FSCACHE_WB, to be set on a network filesystem
inode to indicate that we have a usage count on the cookie caching
that inode.
(2) A flag in struct writeback_control, unpinned_fscache_wb, that is set
when __writeback_single_inode() clears the last dirty page from
i_pages - at which point it clears I_PINNING_FSCACHE_WB and sets this
flag.
This has to be done here so that clearing I_PINNING_FSCACHE_WB can be
done atomically with the check of PAGECACHE_TAG_DIRTY that clears
I_DIRTY_PAGES.
(3) A function, fscache_set_page_dirty(), which if it is not set, sets
I_PINNING_FSCACHE_WB and calls fscache_use_cookie() to pin the cache
resources.
(4) A function, fscache_unpin_writeback(), to be called by ->write_inode()
to unuse the cookie.
(5) A function, fscache_clear_inode_writeback(), to be called when the
inode is evicted, before clear_inode() is called. This cleans up any
lingering I_PINNING_FSCACHE_WB.
The network filesystem can then use these tools to make sure that
fscache_write_to_cache() can write locally modified data to the cache as
well as to the server.
For the future, I'm working on write helpers for netfs lib that should
allow this facility to be removed by keeping track of the dirty regions
separately - but that's incomplete at the moment and is also going to be
affected by folios, one way or another, since it deals with pages
Provide a higher-level function than fscache_write() to perform a write
from an inode's pagecache to the cache, whilst fending off concurrent
writes by means of the PG_fscache mark on a page:
If caching is false, this function does nothing except call (*term_func)()
if given. It assumes that, in such a case, PG_fscache will not have been
set on the pages.
Otherwise, if caching is true, this function requires the source pages to
have had PG_fscache set on them before calling. start and len define the
region of the file to be modified and i_size indicates the new file size.
The source pages are extracted from the mapping.
term_func and term_func_priv work as for fscache_write(). The PG_fscache
marks will be cleared at the end of the operation, before term_func is
called or the function otherwise returns.
There is an additonal helper function to clear the PG_fscache bits from a
range of pages:
If caching is true, the pages to be managed are expected to be located on
mapping in the range defined by start and len. If caching is false, it
does nothing.
David Howells [Wed, 20 Oct 2021 13:06:34 +0000 (14:06 +0100)]
fscache: Implement raw I/O interface
Provide a pair of functions to perform raw I/O on the cache. The first
function allows an arbitrary asynchronous direct-IO read to be made against
a cache object, though the read should be aligned and sized appropriately
for the backing device:
The cache resources must have been previously initialised by
fscache_begin_read_operation(). A read operation is sent to the backing
filesystem, starting at start_pos within the file. The size of the read is
specified by the iterator, as is the location of the output buffer.
If there is a hole in the data it can be ignored and left to the backing
filesystem to deal with (NETFS_READ_HOLE_IGNORE), a hole at the beginning
can be skipped over and the buffer padded with zeros
(NETFS_READ_HOLE_CLEAR) or -ENODATA can be given (NETFS_READ_HOLE_FAIL).
If term_func is not NULL, the operation may be performed asynchronously.
Upon completion, successful or otherwise, (*term_func)() will be called and
passed term_func_priv, along with an error or the amount of data
transferred. If the op is run asynchronously, fscache_read() will return
-EIOCBQUEUED.
The second function allows an arbitrary asynchronous direct-IO write to be
made against a cache object, though the write should be aligned and sized
appropriately for the backing device:
This will update the auxiliary data and/or the size of the object attached
to a cookie if either pointer is not-NULL and flag that the disk needs to
be updated.
Note that fscache_unuse_cookie() also allows this to be done.
David Howells [Thu, 11 Nov 2021 23:14:29 +0000 (23:14 +0000)]
fscache: Count data storage objects in a cache
Count the data storage objects that are currently allocated in a cache.
This is used to pin certain cache structures until cache withdrawal is
complete.
Three helpers are provided to manage and make use of the count:
This should be called by the backend to note that an object has been
destroyed. This sends a wakeup event that allows cache withdrawal to
proceed if it was waiting for that object.
David Howells [Wed, 20 Oct 2021 13:06:34 +0000 (14:06 +0100)]
fscache: Provide a means to begin an operation
Provide a function to begin a read operation:
int fscache_begin_read_operation(
struct netfs_cache_resources *cres,
struct fscache_cookie *cookie)
This is primarily intended to be called by network filesystems on behalf of
netfslib, but may also be called to use the I/O access functions directly.
It attaches the resources required by the cache to cres struct from the
supplied cookie.
This holds access to the cache behind the cookie for the duration of the
operation and forces cache withdrawal and cookie invalidation to perform
synchronisation on the operation. cres->inval_counter is set from the
cookie at this point so that it can be compared at the end of the
operation.
Note that this does not guarantee that the cache state is fully set up and
able to perform I/O immediately; looking up and creation may be left in
progress in the background. The operations intended to be called by the
network filesystem, such as reading and writing, are expected to wait for
the cookie to move to the correct state.
This will, however, potentially sleep, waiting for a certain minimum state
to be set or for operations such as invalidate to advance far enough that
I/O can resume.
Also provide a function for the cache to call to wait for the cache object
to get to a state where it can be used for certain things:
This looks at the cache resources provided by the begin function and waits
for them to get to an appropriate stage. There's a choice of wanting just
some parameters (FSCACHE_WANT_PARAM) or the ability to do I/O
(FSCACHE_WANT_READ or FSCACHE_WANT_WRITE).
This causes any cached data for the specified cookie to be discarded. If
the cookie is marked as being in use, a new cache object will be created if
possible and future I/O will use that instead. In-flight I/O should be
abandoned (writes) or reconsidered (reads). Each time it is called
cookie->inval_counter is incremented and this can be used to detect
invalidation at the end of an I/O operation.
The coherency data attached to the cookie can be updated and the cookie
size should be reset. One flag is available, FSCACHE_INVAL_DIO_WRITE,
which should be used to indicate invalidation due to a DIO write on a
file. This will temporarily disable caching for this cookie.
Changes
=======
ver #2:
- Should only change to inval state if can get access to cache.
David Howells [Wed, 20 Oct 2021 14:53:34 +0000 (15:53 +0100)]
fscache: Implement cookie user counting and resource pinning
Provide a pair of functions to count the number of users of a cookie (open
files, writeback, invalidation, resizing, reads, writes), to obtain and pin
resources for the cookie and to prevent culling for the whilst there are
users.
The first function marks a cookie as being in use:
The caller should indicate the cookie to use and whether or not the caller
is in a context that may modify the cookie (e.g. a file open O_RDWR).
If the cookie is not already resourced, fscache will ask the cache backend
in the background to do whatever it needs to look up, create or otherwise
obtain the resources necessary to access data. This is pinned to the
cookie and may not be culled, though it may be withdrawn if the cache as a
whole is withdrawn.
The second function removes the in-use mark from a cookie and, optionally,
updates the coherency data:
If non-NULL, the aux_data buffer and/or the object_size will be saved into
the cookie and will be set on the backing store when the object is
committed.
If this removes the last usage on a cookie, the cookie is placed onto an
LRU list from which it will be removed and closed after a couple of seconds
if it doesn't get reused. This prevents resource overload in the cache -
in particular it prevents it from holding too many files open.
Changes
=======
ver #2:
- Fix fscache_unuse_cookie() to use atomic_dec_and_lock() to avoid a
potential race if the cookie gets reused before it completes the
unusement.
- Added missing transition to LRU_DISCARDING state.
David Howells [Wed, 20 Oct 2021 14:53:34 +0000 (15:53 +0100)]
fscache: Implement simple cookie state machine
Implement a very simple cookie state machine to handle lookup,
invalidation, withdrawal, relinquishment and, to be added later, commit on
LRU discard.
Three cache methods are provided: ->lookup_cookie() to look up and, if
necessary, create a data storage object; ->withdraw_cookie() to free the
resources associated with that object and potentially delete it; and
->prepare_to_write(), to do prepare for changes to the cached data to be
modified locally.
Changes
=======
ver #3:
- Fix a race between LRU discard and relinquishment whereby the former
would override the latter and thus the latter would never happen[1].
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
David Howells [Wed, 20 Oct 2021 14:26:17 +0000 (15:26 +0100)]
fscache: Provide and use cache methods to lookup/create/free a volume
Add cache methods to lookup, create and remove a volume.
Looking up or creating the volume requires the cache pinning for access;
freeing the volume requires the volume pinning for access. The
->acquire_volume() method is used to ask the cache backend to lookup and,
if necessary, create a volume; the ->free_volume() method is used to free
the resources for a volume.
David Howells [Wed, 20 Oct 2021 14:53:34 +0000 (15:53 +0100)]
fscache: Implement cookie-level access helpers
Add a number of helper functions to manage access to a cookie, pinning the
cache object in place for the duration to prevent cache withdrawal from
removing it:
This function initialises the access count when a cache binds to a
cookie. An extra ref is taken on the access count to prevent wakeups
while the cache is active. We're only interested in the wakeup when a
cookie is being withdrawn and we're waiting for it to quiesce - at
which point the counter will be decremented before the wait.
The FSCACHE_COOKIE_NACC_ELEVATED flag is set on the cookie to keep
track of the extra ref in order to handle a race between
relinquishment and withdrawal both trying to drop the extra ref.
This function attempts to begin access upon a cookie, pinning it in
place if it's cached. If successful, it returns true and leaves a the
access count incremented.
This function drops the access count obtained by (2), permitting
object withdrawal to take place when it reaches zero.
A tracepoint is provided to track changes to the access counter on a
cookie.
Changes
=======
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
David Howells [Wed, 20 Oct 2021 14:26:17 +0000 (15:26 +0100)]
fscache: Implement volume-level access helpers
Add a pair of helper functions to manage access to a volume, pinning the
volume in place for the duration to prevent cache withdrawal from removing
it:
The way the access gate on the volume works/will work is:
(1) If the cache tests as not live (state is not FSCACHE_CACHE_IS_ACTIVE),
then we return false to indicate access was not permitted.
(2) If the cache tests as live, then we increment the volume's n_accesses
count and then recheck the cache liveness, ending the access if it
ceased to be live.
(3) When we end the access, we decrement the volume's n_accesses and wake
up the any waiters if it reaches 0.
(4) Whilst the cache is caching, the volume's n_accesses is kept
artificially incremented to prevent wakeups from happening.
(5) When the cache is taken offline, the state is changed to prevent new
accesses, the volume's n_accesses is decremented and we wait for it to
become 0.
David Howells [Wed, 20 Oct 2021 14:53:34 +0000 (15:53 +0100)]
fscache: Implement cookie registration
Add functions to the fscache API to allow data file cookies to be acquired
and relinquished by the network filesystem. It is intended that the
filesystem will create such cookies per-inode under a volume.
The filesystem must first have created a volume cookie, which is passed in
here. If it passes in NULL then the function will just return a NULL
cookie.
A binary key should be passed in index_key and is of size index_key_len.
This is saved in the cookie and is used to locate the associated data in
the cache.
A coherency data buffer of size aux_data_len will be allocated and
initialised from the buffer pointed to by aux_data. This is used to
validate cache objects when they're opened and is stored on disk with them
when they're committed. The data is stored in the cookie and will be
updateable by various functions in later patches.
The object_size must also be given. This is also used to perform a
coherency check and to size the backing storage appropriately.
This function disallows a cookie from being acquired twice in parallel,
though it will cause the second user to wait if the first is busy
relinquishing its cookie.
When a network filesystem has finished with a cookie, it should call:
If retire is true, any backing data will be discarded immediately.
Changes
=======
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[1].
- Add a check to see if the cookie is still hashed at the point of
freeing.
ver #2:
- Don't hold n_accesses elevated whilst cache is bound to a cookie, but
rather add a flag that prevents the state machine from being queued when
n_accesses reaches 0.
- Remove the unused cookie pointer field from the fscache_acquire
tracepoint.
David Howells [Wed, 20 Oct 2021 14:26:17 +0000 (15:26 +0100)]
fscache: Implement volume registration
Add functions to the fscache API to allow volumes to be acquired and
relinquished by the network filesystem. A volume is an index of data
storage cache objects. A volume is represented by a volume cookie in the
API. A filesystem would typically create a volume for a superblock and
then create per-inode cookies within it.
The volume_key is a printable string used to match the volume in the cache.
It should not contain any '/' characters. For AFS, for example, this would
be "afs,<cellname>,<volume_id>", e.g. "afs,example.com,523001".
The cache_name can be NULL, but if not it should be a string indicating the
name of the cache to use if there's more than one available.
The coherency data, if given, is an arbitrarily-sized blob that's attached
to the volume and is compared when the volume is looked up. If it doesn't
match, the old volume is judged to be out of date and it and everything
within it is discarded.
Acquiring a volume twice concurrently is disallowed, though the function
will wait if an old volume cookie is being relinquishing.
When a network filesystem has finished with a volume, it should return the
volume cookie by calling:
If invalidate is true, the entire volume will be discarded; if false, the
volume will be synced and the coherency data will be updated.
Changes
=======
ver #4:
- Removed an extraneous param from kdoc on fscache_relinquish_volume()[3].
ver #3:
- fscache_hash()'s size parameter is now in bytes. Use __le32 as the unit
to round up to.
- When comparing cookies, simply see if the attributes are the same rather
than subtracting them to produce a strcmp-style return[2].
- Make the coherency data an arbitrary blob rather than a u64, but don't
store it for the moment.
ver #2:
- Fix error check[1].
- Make a fscache_acquire_volume() return errors, including EBUSY if a
conflicting volume cookie already exists. No error is printed now -
that's left to the netfs.
This gets the cache cookie for a cache of the specified name and moves
it to the preparation state. If a nameless cache cookie exists, that
will be given this name and used.
David Howells [Wed, 20 Oct 2021 14:45:28 +0000 (15:45 +0100)]
fscache: Implement a hash function
Implement a function to generate hashes. It needs to be stable over time
and endianness-independent as the hashes will appear on disk in future
patches. It can assume that its input is a multiple of four bytes in size
and alignment.
This is borrowed from the VFS and simplified. le32_to_cpu() is added to
make it endianness-independent.
Changes
=======
ver #3:
- Read the data being hashed in an endianness-independent way[1].
- Change the size parameter to be in bytes rather than words.
David Howells [Wed, 20 Oct 2021 13:30:37 +0000 (14:30 +0100)]
netfs: Pass a flag to ->prepare_write() to say if there's no alloc'd space
Pass a flag to ->prepare_write() to indicate if there's definitely no
space allocated in the cache yet (for instance if we've already checked as
we were asked to do a read).
David Howells [Mon, 25 Oct 2021 20:53:44 +0000 (21:53 +0100)]
fscache: Remove the contents of the fscache driver, pending rewrite
Remove the code that comprises the fscache driver as it's going to be
substantially rewritten, with the majority of the code being erased in the
rewrite.
A small piece of linux/fscache.h is left as that is #included by a bunch of
network filesystems.
Jeffle Xu [Tue, 7 Dec 2021 03:14:49 +0000 (11:14 +0800)]
netfs: fix parameter of cleanup()
The order of these two parameters is just reversed. gcc didn't warn on
that, probably because 'void *' can be converted from or to other
pointer types without warning.
Cc: stable@vger.kernel.org Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers") Fixes: e1b1240c1ff5 ("netfs: Add write_begin helper") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Link: https://lore.kernel.org/r/20211207031449.100510-1-jefflexu@linux.alibaba.com/
David Howells [Tue, 7 Dec 2021 09:53:24 +0000 (09:53 +0000)]
netfs: Fix lockdep warning from taking sb_writers whilst holding mmap_lock
Taking sb_writers whilst holding mmap_lock isn't allowed and will result in
a lockdep warning like that below. The problem comes from cachefiles
needing to take the sb_writers lock in order to do a write to the cache,
but being asked to do this by netfslib called from readpage, readahead or
write_begin[1].
Fix this by always offloading the write to the cache off to a worker
thread. The main thread doesn't need to wait for it, so deadlock can be
avoided.
This can be tested by running the quick xfstests on something like afs or
ceph with lockdep enabled.
WARNING: possible circular locking dependency detected
5.15.0-rc1-build2+ #292 Not tainted
------------------------------------------------------
holetest/65517 is trying to acquire lock: ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5
but task is already holding lock: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
Linus Torvalds [Sun, 5 Dec 2021 20:58:18 +0000 (12:58 -0800)]
Merge tag 'for-5.16/parisc-6' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Some bug and warning fixes:
- Fix "make install" to use debians "installkernel" script which is
now in /usr/sbin
- Fix the bindeb-pkg make target by giving the correct KBUILD_IMAGE
file name
- Fix compiler warnings by annotating parisc agp init functions with
__init
- Fix timekeeping on SMP machines with dual-core CPUs
- Enable some more config options in the 64-bit defconfig"
* tag 'for-5.16/parisc-6' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Mark cr16 CPU clocksource unstable on all SMP machines
parisc: Fix "make install" on newer debian releases
parisc/agp: Annotate parisc agp init functions with __init
parisc: Enable sata sil, audit and usb support on 64-bit defconfig
parisc: Fix KBUILD_IMAGE for self-extracting kernel
Linus Torvalds [Sun, 5 Dec 2021 17:34:57 +0000 (09:34 -0800)]
Merge tag 'usb-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB fixes for a few reported issues. Included in
here are:
- xhci fix for a _much_ reported regression. I don't think there's a
community distro that has not reported this problem yet :(
- new USB quirk addition
- cdns3 minor fixes
- typec regression fix.
All of these have been in linux-next with no reported problems, and
the xhci fix has been reported by many to resolve their reported
problem"
* tag 'usb-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: cdnsp: Fix a NULL pointer dereference in cdnsp_endpoint_init()
usb: cdns3: gadget: fix new urb never complete if ep cancel previous requests
usb: typec: tcpm: Wait in SNK_DEBOUNCED until disconnect
USB: NO_LPM quirk Lenovo Powered USB-C Travel Hub
xhci: Fix commad ring abort, write all 64 bits to CRCR register.
Linus Torvalds [Sun, 5 Dec 2021 17:13:20 +0000 (09:13 -0800)]
Merge tag 'tty-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some small TTY and Serial driver fixes for 5.16-rc4 to
resolve a number of reported problems.
They include:
- liteuart serial driver fixes
- 8250_pci serial driver fixes for pericom devices
- 8250 RTS line control fix while in RS-485 mode
- tegra serial driver fix
- msm_serial driver fix
- pl011 serial driver new id
- fsl_lpuart revert of broken change
- 8250_bcm7271 serial driver fix
- MAINTAINERS file update for rpmsg tty driver that came in 5.16-rc1
- vgacon fix for reported problem
All of these, except for the 8250_bcm7271 fix have been in linux-next
with no reported problem. The 8250_bcm7271 fix was added to the tree
on Friday so no chance to be linux-next yet. But it should be fine as
the affected developers submitted it"
* tag 'tty-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250_bcm7271: UART errors after resuming from S2
serial: 8250_pci: rewrite pericom_do_set_divisor()
serial: 8250_pci: Fix ACCES entries in pci_serial_quirks array
serial: 8250: Fix RTS modem control while in rs485 mode
Revert "tty: serial: fsl_lpuart: drop earlycon entry for i.MX8QXP"
serial: tegra: Change lower tolerance baud rate limit for tegra20 and tegra30
serial: liteuart: relax compile-test dependencies
serial: liteuart: fix minor-number leak on probe errors
serial: liteuart: fix use-after-free and memleak on unbind
serial: liteuart: Fix NULL pointer dereference in ->remove()
vgacon: Propagate console boot parameters before calling `vc_resize'
tty: serial: msm_serial: Deactivate RX DMA for polling support
serial: pl011: Add ACPI SBSA UART match id
serial: core: fix transmit-buffer reset and memleak
MAINTAINERS: Add rpmsg tty driver maintainer
Linus Torvalds [Sun, 5 Dec 2021 16:58:52 +0000 (08:58 -0800)]
Merge tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Prevent a tick storm when a dedicated timekeeper CPU in nohz_full
mode runs for prolonged periods with interrupts disabled and ends up
programming the next tick in the past, leading to that storm
* tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/nohz: Last resort update jiffies on nohz_full IRQ entry
Linus Torvalds [Sun, 5 Dec 2021 16:53:31 +0000 (08:53 -0800)]
Merge tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Properly init uclamp_flags of a runqueue, on first enqueuing
- Fix preempt= callback return values
- Correct utime/stime resource usage reporting on nohz_full to return
the proper times instead of shorter ones
* tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/uclamp: Fix rq->uclamp_max not set on first enqueue
preempt/dynamic: Fix setup_preempt_mode() return value
sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
Linus Torvalds [Sun, 5 Dec 2021 16:43:35 +0000 (08:43 -0800)]
Merge tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix a couple of SWAPGS fencing issues in the x86 entry code
- Use the proper operand types in __{get,put}_user() to prevent
truncation in SEV-ES string io
- Make sure the kernel mappings are present in trampoline_pgd in order
to prevent any potential accesses to unmapped memory after switching
to it
- Fix a trivial list corruption in objtool's pv_ops validation
- Disable the clocksource watchdog for TSC on platforms which claim
that the TSC is constant, doesn't stop in sleep states, CPU has TSC
adjust and the number of sockets of the platform are max 2, to
prevent erroneous markings of the TSC as unstable.
- Make sure TSC adjust is always checked not only when going idle
- Prevent a stack leak by initializing struct _fpx_sw_bytes properly in
the FPU code
- Fix INTEL_FAM6_RAPTORLAKE define naming to adhere to the convention
* tag 'x86_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Add xenpv_restore_regs_and_return_to_usermode()
x86/entry: Use the correct fence macro after swapgs in kernel CR3
x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
x86/64/mm: Map all kernel memory into trampoline_pgd
objtool: Fix pv_ops noinstr validation
x86/tsc: Disable clocksource watchdog for TSC on qualified platorms
x86/tsc: Add a timer to make sure TSC_adjust is always checked
x86/fpu/signal: Initialize sw_bytes in save_xstate_epilog()
x86/cpu: Drop spurious underscore from RAPTOR_LAKE #define
Linus Torvalds [Sun, 5 Dec 2021 16:25:33 +0000 (08:25 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more kvm fixes from Paolo Bonzini:
- Static analysis fix
- New SEV-ES protocol for communicating invalid VMGEXIT requests
- Ensure APICv is considered inactive if there is no APIC
- Fix reserved bits for AMD PerfEvtSeln register
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
KVM: VMX: Set failure code in prepare_vmcs02()
KVM: ensure APICv is considered inactive if there is no APIC
KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
Tom Lendacky [Thu, 2 Dec 2021 18:52:05 +0000 (12:52 -0600)]
KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.
The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.
The new error codes are documented in the lasest update to the GHCB
specification.
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
that KVM falls back to __vmalloc() if physically contiguous memory isn't
available. The buffer is purely a KVM software construct, i.e. there's
no need for it to be physically contiguous.
Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return appropriate error codes if setting up the GHCB scratch area for an
SEV-ES guest fails. In particular, returning -EINVAL instead of -ENOMEM
when allocating the kernel buffer could be confusing as userspace would
likely suspect a guest issue.
Fixes: 8f423a80d299 ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Sat, 4 Dec 2021 21:43:52 +0000 (13:43 -0800)]
Merge tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Three SMB3 multichannel/fscache fixes and a DFS fix.
In testing multichannel reconnect scenarios recently various problems
with the cifs.ko implementation of fscache were found (e.g. incorrect
initialization of fscache cookies in some cases)"
* tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: avoid use of dstaddr as key for fscache client cookie
cifs: add server conn_id to fscache client cookie
cifs: wait for tcon resource_id before getting fscache super
cifs: fix missed refcounting of ipc tcon
Helge Deller [Sat, 4 Dec 2021 20:21:46 +0000 (21:21 +0100)]
parisc: Mark cr16 CPU clocksource unstable on all SMP machines
In commit c8c3735997a3 ("parisc: Enhance detection of synchronous cr16
clocksources") I assumed that CPUs on the same physical core are syncronous.
While booting up the kernel on two different C8000 machines, one with a
dual-core PA8800 and one with a dual-core PA8900 CPU, this turned out to be
wrong. The symptom was that I saw a jump in the internal clocks printed to the
syslog and strange overall behaviour. On machines which have 4 cores (2
dual-cores) the problem isn't visible, because the current logic already marked
the cr16 clocksource unstable in this case.
This patch now marks the cr16 interval timers unstable if we have more than one
CPU in the system, and it fixes this issue.
Helge Deller [Sat, 4 Dec 2021 20:14:40 +0000 (21:14 +0100)]
parisc: Fix "make install" on newer debian releases
On newer debian releases the debian-provided "installkernel" script is
installed in /usr/sbin. Fix the kernel install.sh script to look for the
script in this directory as well.
Linus Torvalds [Sat, 4 Dec 2021 16:34:59 +0000 (08:34 -0800)]
Merge tag 'io_uring-5.16-2021-12-03' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Just a single fix preventing repeated retries of task_work based io-wq
thread creation, fixing a regression from when io-wq was made more (a
bit too much) resilient against signals"
* tag 'io_uring-5.16-2021-12-03' of git://git.kernel.dk/linux-block:
io-wq: don't retry task_work creation failure on fatal conditions
Linus Torvalds [Sat, 4 Dec 2021 16:28:42 +0000 (08:28 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two patches, both in drivers.
One is a fix to FC recovery (lpfc) and the other is an enhancement to
support the Intel Alder Motherboard with the UFS driver which comes
under the -rc exception process for hardware enabling"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: ufs-pci: Add support for Intel ADL
scsi: lpfc: Fix non-recovery of remote ports following an unsolicited LOGO
Linus Torvalds [Sat, 4 Dec 2021 16:13:20 +0000 (08:13 -0800)]
Merge tag 'gfs2-v5.16-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- Since commit 486408d690e1 ("gfs2: Cancel remote delete work
asynchronously"), inode create and lookup-by-number can overlap more
easily and we can end up with temporary duplicate inodes. Fix the
code to prevent that.
- Fix a BUG demoting weak glock holders from a remote node.
* tag 'gfs2-v5.16-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: gfs2_create_inode rework
gfs2: gfs2_inode_lookup rework
gfs2: gfs2_inode_lookup cleanup
gfs2: Fix remote demote of weak glock holders
Qais Yousef [Thu, 2 Dec 2021 11:20:33 +0000 (11:20 +0000)]
sched/uclamp: Fix rq->uclamp_max not set on first enqueue
Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.
The code was relying on rq->uclamp_max initialized to zero, so on first
enqueue
if (uc_se->value > READ_ONCE(uc_rq->value))
WRITE_ONCE(uc_rq->value, uc_se->value);
}
was actually resetting it. But since commit d81ae8aac85c changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither above code path nor uclamp_idle_reset()
update the rq->uclamp_max on first wake up from idle.
This is only visible from first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024 since only then its uclamp_max will be
effectively ignored.
Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.
Andrew Halaney [Fri, 3 Dec 2021 23:32:03 +0000 (17:32 -0600)]
preempt/dynamic: Fix setup_preempt_mode() return value
__setup() callbacks expect 1 for success and 0 for failure. Correct the
usage here to reflect that.
Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option") Reported-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Halaney <ahalaney@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
Linus Torvalds [Fri, 3 Dec 2021 20:22:56 +0000 (12:22 -0800)]
Merge tag 'pm-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a CPU hot-add issue in the cpufreq core, fix a comment in
the cpufreq core code and update its documentation, and disable the
DTPM (Dynamic Thermal Power Management) code for the time being to
prevent it from causing issues to appear.
Specifics:
- Disable DTPM for this cycle to prevent it from causing issues to
appear on otherwise functional systems (Daniel Lezcano)
- Fix cpufreq sysfs interface failure related to physical CPU hot-add
(Xiongfeng Wang)
- Fix comment in cpufreq core and update its documentation (Tang
Yizhou)"
* tag 'pm-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
powercap: DTPM: Drop unused local variable from init_dtpm()
cpufreq: docs: Update core.rst
cpufreq: Fix a comment in cpufreq_policy_free
powercap/drivers/dtpm: Disable DTPM at boot time
cpufreq: Fix get_cpu_device() failure in add_cpu_dev_symlink()
Linus Torvalds [Fri, 3 Dec 2021 19:46:20 +0000 (11:46 -0800)]
Merge tag 's390-5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Fix potential overlap of pseudo-MMIO addresses with MIO addresses
- Fix stack unwinder test case inline assembly compile error that
happens with LLVM's integrated assembler
- Update defconfigs
* tag 's390-5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: update defconfigs
s390/pci: move pseudo-MMIO to prevent MIO overlap
s390/test_unwind: use raw opcode instead of invalid instruction
Linus Torvalds [Fri, 3 Dec 2021 18:50:14 +0000 (10:50 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Three arm64 fixes for -rc4.
One of them is just a trivial documentation fix, whereas the other two
address a warning in the kexec code and a crash in ftrace on systems
implementing BTI.
The latter patch has a couple of ugly ifdefs which Mark plans to clean
up separately, but as-is the patch is straightforward for backporting
to stable kernels.
Summary:
- Add missing BTI landing instructions to the ftrace*_caller
trampolines
- Fix kexec() WARN when DEBUG_VIRTUAL is enabled
- Fix PAC documentation by removing stale references to compiler
flags"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: ftrace: add missing BTIs
arm64: kexec: use __pa_symbol(empty_zero_page)
arm64: update PAC description for kernel
Linus Torvalds [Fri, 3 Dec 2021 18:44:16 +0000 (10:44 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C has another set of driver bugfixes, mostly for the stm32f7 driver"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: rk3x: Handle a spurious start completion interrupt flag
i2c: stm32f7: use proper DMAENGINE API for termination
i2c: stm32f7: stop dma transfer in case of NACK
i2c: stm32f7: recover the bus on access timeout
i2c: stm32f7: flush TX FIFO upon transfer errors
i2c: cbus-gpio: set atomic transfer callback
Linus Torvalds [Fri, 3 Dec 2021 18:38:45 +0000 (10:38 -0800)]
Merge tag 'libata-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull libata fixes from Damien Le Moal:
"Two sparse warning fixes and a couple of patches to fix an issue with
sata_fsl driver module removal:
- A couple of patches to avoid sparse warnings in libata-sata and in
the pata_falcon driver (from Yang and Finn).
- A couple of sata_fsl driver patches fixing IRQ free and proc
unregister on module removal (from Baokun)"
* tag 'libata-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: replace snprintf in show functions with sysfs_emit
sata_fsl: fix warning in remove_proc_entry when rmmod sata_fsl
sata_fsl: fix UAF in sata_fsl_port_stop when rmmod sata_fsl
pata_falcon: Avoid type warnings from sparse
Shyam Prasad N [Thu, 2 Dec 2021 07:46:54 +0000 (07:46 +0000)]
cifs: avoid use of dstaddr as key for fscache client cookie
server->dstaddr can change when the DNS mapping for the
server hostname changes. But conn_id is a u64 counter
that is incremented each time a new TCP connection
is setup. So use only that as a key.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Thu, 2 Dec 2021 07:30:00 +0000 (07:30 +0000)]
cifs: add server conn_id to fscache client cookie
The fscache client cookie uses the server address
(and port) as the cookie key. This is a problem when
nosharesock is used. Two different connections will
use duplicate cookies. Avoid this by adding
server->conn_id to the key, so that it's guaranteed
that cookie will not be duplicated.
Also, for secondary channels of a session, copy the
fscache pointer from the primary channel. The primary
channel is guaranteed not to go away as long as secondary
channels are in use. Also addresses minor problem found
by kernel test robot.
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Thu, 2 Dec 2021 07:14:42 +0000 (07:14 +0000)]
cifs: wait for tcon resource_id before getting fscache super
The logic for initializing tcon->resource_id is done inside
cifs_root_iget. fscache super cookie relies on this for aux
data. So we need to push the fscache initialization to this
later point during mount.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Paulo Alcantara [Thu, 2 Dec 2021 18:29:35 +0000 (15:29 -0300)]
cifs: fix missed refcounting of ipc tcon
Fix missed refcounting of IPC tcon used for getting domain-based DFS
root referrals. We want to keep it alive as long as mount is active
and can be refreshed. For standalone DFS root referrals it wouldn't
be a problem as the client ends up having an IPC tcon for both mount
and cache.
Fixes: c88f7dcd6d64 ("cifs: support nested dfs links over reconnect") Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Enzo Matsumiya <ematsumiya@suse.de> Signed-off-by: Steve French <stfrench@microsoft.com>
In the native case, PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is the
trampoline stack. But XEN pv doesn't use trampoline stack, so
PER_CPU_VAR(cpu_tss_rw + TSS_sp0) is also the kernel stack.
In that case, source and destination stacks are identical, which means
that reusing swapgs_restore_regs_and_return_to_usermode() in XEN pv
would cause %rsp to move up to the top of the kernel stack and leave the
IRET frame below %rsp.
This is dangerous as it can be corrupted if #NMI / #MC hit as either of
these events occurring in the middle of the stack pushing would clobber
data on the (original) stack.
And, with XEN pv, swapgs_restore_regs_and_return_to_usermode() pushing
the IRET frame on to the original address is useless and error-prone
when there is any future attempt to modify the code.
[ bp: Massage commit message. ]
Fixes: 7f2590a110b8 ("x86/entry/64: Use a per-CPU trampoline stack for IDT entries") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lkml.kernel.org/r/20211126101209.8613-4-jiangshanlai@gmail.com
removed a CR3 write in the faulting path of load_gs_index().
But the path's FENCE_SWAPGS_USER_ENTRY has no fence operation if PTI is
enabled, see spectre_v1_select_mitigation().
Rather, it depended on the serializing CR3 write of SWITCH_TO_KERNEL_CR3
and since it got removed, add a FENCE_SWAPGS_KERNEL_ENTRY call to make
sure speculation is blocked.
Linus Torvalds [Wed, 1 Dec 2021 18:06:14 +0000 (10:06 -0800)]
fget: check that the fd still exists after getting a ref to it
Jann Horn points out that there is another possible race wrt Unix domain
socket garbage collection, somewhat reminiscent of the one fixed in
commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK").
See the extended comment about the garbage collection requirements added
to unix_peek_fds() by that commit for details.
The race comes from how we can locklessly look up a file descriptor just
as it is in the process of being closed, and with the right artificial
timing (Jann added a few strategic 'mdelay(500)' calls to do that), the
Unix domain socket garbage collector could see the reference count
decrement of the close() happen before fget() took its reference to the
file and the file was attached onto a new file descriptor.
This is all (intentionally) correct on the 'struct file *' side, with
RCU lookups and lockless reference counting very much part of the
design. Getting that reference count out of order isn't a problem per
se.
But the garbage collector can get confused by seeing this situation of
having seen a file not having any remaining external references and then
seeing it being attached to an fd.
In commit cbcf01128d0a ("af_unix: fix garbage collect vs MSG_PEEK") the
fix was to serialize the file descriptor install with the garbage
collector by taking and releasing the unix_gc_lock.
That's not really an option here, but since this all happens when we are
in the process of looking up a file descriptor, we can instead simply
just re-check that the file hasn't been closed in the meantime, and just
re-do the lookup if we raced with a concurrent close() of the same file
descriptor.
Lai Jiangshan [Fri, 26 Nov 2021 10:11:21 +0000 (18:11 +0800)]
x86/entry: Add a fence for kernel entry SWAPGS in paranoid_entry()
Commit
18ec54fdd6d18 ("x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations")
added FENCE_SWAPGS_{KERNEL|USER}_ENTRY for conditional SWAPGS. In
paranoid_entry(), it uses only FENCE_SWAPGS_KERNEL_ENTRY for both
branches. This is because the fence is required for both cases since the
CR3 write is conditional even when PTI is enabled.
But
96b2371413e8f ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
changed the order of SWAPGS and the CR3 write. And it missed the needed
FENCE_SWAPGS_KERNEL_ENTRY for the user gsbase case.
Add it back by changing the branches so that FENCE_SWAPGS_KERNEL_ENTRY
can cover both branches.
[ bp: Massage, fix typos, remove obsolete comment while at it. ]
Michael Sterritt [Fri, 19 Nov 2021 23:27:57 +0000 (15:27 -0800)]
x86/sev: Fix SEV-ES INS/OUTS instructions for word, dword, and qword
Properly type the operands being passed to __put_user()/__get_user().
Otherwise, these routines truncate data for dependent instructions
(e.g., INSW) and only read/write one byte.
This has been tested by sending a string with REP OUTSW to a port and
then reading it back in with REP INSW on the same port.
Previous behavior was to only send and receive the first char of the
size. For example, word operations for "abcd" would only read/write
"ac". With change, the full string is now written and read back.
Fixes: f980f9c31a923 (x86/sev-es: Compile early handler code into kernel image) Signed-off-by: Michael Sterritt <sterritt@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Joerg Roedel <jroedel@suse.de> Link: https://lkml.kernel.org/r/20211119232757.176201-1-sterritt@google.com
powercap: DTPM: Drop unused local variable from init_dtpm()
The dtpm_descr variable in init_dtpm() is not used after commit f751db8adaea ("powercap/drivers/dtpm: Disable DTPM at boot time"),
so drop it.
Fixes: f751db8adaea ("powercap/drivers/dtpm: Disable DTPM at boot time") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Jens Axboe [Fri, 3 Dec 2021 02:40:15 +0000 (19:40 -0700)]
io-wq: don't retry task_work creation failure on fatal conditions
We don't want to be retrying task_work creation failure if there's
an actual signal pending for the parent task. If we do, then we can
enter an infinite loop of perpetually retrying and each retry failing
with -ERESTARTNOINTR because a signal is pending.
Al Cooper [Wed, 1 Dec 2021 20:14:02 +0000 (15:14 -0500)]
serial: 8250_bcm7271: UART errors after resuming from S2
There is a small window in time during resume where the hardware
flow control signal RTS can be asserted (which allows a sender to
resume sending data to the UART) but the baud rate has not yet
been restored. This will cause corrupted data and FRAMING, OVERRUN
and BREAK errors. This is happening because the MCTRL register is
shadowed in uart_port struct and is later used during resume to set
the MCTRL register during both serial8250_do_startup() and
uart_resume_port(). Unfortunately, serial8250_do_startup()
happens before the UART baud rate is restored. The fix is to clear
the shadowed mctrl value at the end of suspend and restore it at the
end of resume.
Fixes: 41a469482de2 ("serial: 8250: Add new 8250-core based Broadcom STB driver") Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Al Cooper <alcooperx@gmail.com> Link: https://lore.kernel.org/r/20211201201402.47446-1-alcooperx@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Zhou Qingyang [Tue, 30 Nov 2021 17:27:00 +0000 (01:27 +0800)]
usb: cdnsp: Fix a NULL pointer dereference in cdnsp_endpoint_init()
In cdnsp_endpoint_init(), cdnsp_ring_alloc() is assigned to pep->ring
and there is a dereference of it in cdnsp_endpoint_init(), which could
lead to a NULL pointer dereference on failure of cdnsp_ring_alloc().
Fix this bug by adding a check of pep->ring.
This bug was found by a static analyzer. The analysis employs
differential checking to identify inconsistent security operations
(e.g., checks or kfrees) between two code paths and confirms that the
inconsistent operations are not recovered in the current function or
the callers, so they constitute bugs.
Note that, as a bug found by static analysis, it can be a false
positive or hard to trigger. Multiple researchers have cross-reviewed
the bug.
Builds with CONFIG_USB_CDNSP_GADGET=y show no new warnings,
and our static analyzer no longer warns about this code.
Fixes: 3d82904559f4 ("usb: cdnsp: cdns3 Add main part of Cadence USBSSP DRD Driver") Cc: stable <stable@vger.kernel.org> Acked-by: Pawel Laszczak <pawell@cadence.com> Acked-by: Peter Chen <peter.chen@kernel.org> Signed-off-by: Zhou Qingyang <zhou1615@umn.edu> Link: https://lore.kernel.org/r/20211130172700.206650-1-zhou1615@umn.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Frank Li [Tue, 30 Nov 2021 15:42:39 +0000 (09:42 -0600)]
usb: cdns3: gadget: fix new urb never complete if ep cancel previous requests
This issue was found at android12 MTP.
1. MTP submit many out urb request.
2. Cancel left requests (>20) when enough data get from host
3. Send ACK by IN endpoint.
4. MTP submit new out urb request.
5. 4's urb never complete.
Actually DMA pos already bigger than previous submit request afbccb7d's TRB (184-184). The reason of (not handled) is that deq position is wrong.
The TRB link is below when irq happen.
DEQ LINK LINK LINK LINK LINK .... TRB(afbccb7d):START DMA(EP_TRADDR).
Original code check LINK TRB, but DEQ just move one step.
LINK DEQ LINK LINK LINK LINK .... TRB(afbccb7d):START DMA(EP_TRADDR).
This patch skip all LINK TRB and sync DEQ to trb's start.
LINK LINK LINK LINK LINK .... DEQ = TRB(afbccb7d):START DMA(EP_TRADDR).
Acked-by: Peter Chen <peter.chen@kernel.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Jun Li <jun.li@nxp.com> Link: https://lore.kernel.org/r/20211130154239.8029-1-Frank.Li@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
usb: typec: tcpm: Wait in SNK_DEBOUNCED until disconnect
Stub from the spec:
"4.5.2.2.4.2 Exiting from AttachWait.SNK State
A Sink shall transition to Unattached.SNK when the state of both
the CC1 and CC2 pins is SNK.Open for at least tPDDebounce.
A DRP shall transition to Unattached.SRC when the state of both
the CC1 and CC2 pins is SNK.Open for at least tPDDebounce."
This change makes TCPM to wait in SNK_DEBOUNCED state until
CC1 and CC2 pins is SNK.Open for at least tPDDebounce. Previously,
TCPM resets the port if vbus is not present in PD_T_PS_SOURCE_ON.
This causes TCPM to loop continuously when connected to a
faulty power source that does not present vbus. Waiting in
SNK_DEBOUNCED also ensures that TCPM is adherant to
"4.5.2.2.4.2 Exiting from AttachWait.SNK State" requirements.
Mathias Nyman [Fri, 26 Nov 2021 12:23:40 +0000 (14:23 +0200)]
xhci: Fix commad ring abort, write all 64 bits to CRCR register.
Turns out some xHC controllers require all 64 bits in the CRCR register
to be written to execute a command abort.
The lower 32 bits containing the command abort bit is written first.
In case the command ring stops before we write the upper 32 bits then
hardware may use these upper bits to set the commnd ring dequeue pointer.
Solve this by making sure the upper 32 bits contain a valid command
ring dequeue pointer.
The original patch that only wrote the first 32 to stop the ring went
to stable, so this fix should go there as well.
Fixes: ff0e50d3564f ("xhci: Fix command ring pointer corruption while aborting a command") Cc: stable@vger.kernel.org Tested-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20211126122340.1193239-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Joerg Roedel [Thu, 2 Dec 2021 15:32:26 +0000 (16:32 +0100)]
x86/64/mm: Map all kernel memory into trampoline_pgd
The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff
range of kernel memory (with 4-level paging). This range contains the
kernel's text+data+bss mappings and the module mapping space but not the
direct mapping and the vmalloc area.
This is enough to get the application processors out of real-mode, but
for code that switches back to real-mode the trampoline_pgd is missing
important parts of the address space. For example, consider this code
from arch/x86/kernel/reboot.c, function machine_real_restart() for a
64-bit kernel:
The code switches to the trampoline_pgd, which unmaps the direct mapping
and also the kernel stack. The call to cr4_clear_bits() will find no
stack and crash the machine. The real_mode_header pointer below points
into the direct mapping, and dereferencing it also causes a crash.
The reason this does not crash always is only that kernel mappings are
global and the CR3 switch does not flush those mappings. But if theses
mappings are not in the TLB already, the above code will crash before it
can jump to the real-mode stub.
Extend the trampoline_pgd to contain all kernel mappings to prevent
these crashes and to make code which runs on this page-table more
robust.
Peter Zijlstra [Thu, 2 Dec 2021 20:45:34 +0000 (21:45 +0100)]
objtool: Fix pv_ops noinstr validation
Boris reported that in one of his randconfig builds, objtool got
infinitely stuck. Turns out there's trivial list corruption in the
pv_ops tracking when a function is both in a static table and in a code
assignment.
Avoid re-adding function to the pv_ops[] lists when they're already on
it.
Fixes: db2b0c5d7b6f ("objtool: Support pv_opsindirect calls for noinstr") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/20211202204534.GA16608@worktop.programming.kicks-ass.net
Linus Torvalds [Thu, 2 Dec 2021 22:38:54 +0000 (14:38 -0800)]
Merge tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Bit of an uptick in patch count this week, though it's all relatively
small overall.
I suspect msm has been queuing up a few fixes to skew it here.
Otherwise amdgpu has a scattered bunch of small fixes, and then some
vc4, i915.
virtio-gpu changes an rc1 introduced uAPI mistake, and makes it
operate more like other drivers. This should be fine as no userspace
relies on the behaviour yet.
* tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm: (41 commits)
Revert "drm/i915: Implement Wa_1508744258"
drm/amdkfd: process_info lock not needed for svm
drm/amdgpu: adjust the kfd reset sequence in reset sriov function
drm/amd/display: add connector type check for CRC source set
drm/amdkfd: fix double free mem structure
drm/amdkfd: set "r = 0" explicitly before goto
drm/amd/display: Add work around for tunneled MST.
drm/amd/display: Fix for the no Audio bug with Tiled Displays
drm/amd/display: Clear DPCD lane settings after repeater training
drm/amd/display: Allow DSC on supported MST branch devices
drm/amdgpu: Don't halt RLC on GFX suspend
drm/amdgpu: fix the missed handling for SDMA2 and SDMA3
drm/amdgpu: check atomic flag to differeniate with legacy path
drm/amdgpu: cancel the correct hrtimer on exit
drm/amdgpu/sriov/vcn: add new vcn ip revision check case for SIENNA_CICHLID
drm/i915/dp: Perform 30ms delay after source OUI write
dma-buf: system_heap: Use 'for_each_sgtable_sg' in pages free flow
drm/i915: Add support for panels with VESA backlights with PWM enable/disable
drm/vc4: kms: Fix previous HVS commit wait
drm/vc4: kms: Don't duplicate pending commit
...
Dave Airlie [Thu, 2 Dec 2021 19:59:26 +0000 (05:59 +1000)]
Merge tag 'drm-intel-fixes-2021-12-02' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Fixing a regression where the backlight brightness control stopped working.
- Fix the Intel HDR backlight support detection.
- Reverting a w/a to fix a gpu Hang in TGL. The w/a itself was also
for a hang, but in a much rarer scenario. The proper solution need
to be done with help from user space and it will be addressed later.
- rds: correct socket tunable error in rds_tcp_tune()
- ipv6: fix memory leak in fib6_rule_suppress
- wireguard: reset peer src endpoint when netns exits
- wireguard: improve resilience to DoS around incoming handshakes
- tcp: fix page frag corruption on page fault which involves TCP
- mpls: fix missing attributes in delete notifications
- mt7915: fix NULL pointer dereference with ad-hoc mode
Misc:
- rt2x00: be more lenient about EPROTO errors during start
- mlx4_en: update reported link modes for 1/10G"
* tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
net: dsa: b53: Add SPI ID table
gro: Fix inconsistent indenting
selftests: net: Correct case name
net/rds: correct socket tunable error in rds_tcp_tune()
mctp: Don't let RTM_DELROUTE delete local routes
net/smc: Keep smc_close_final rc during active close
ibmvnic: drop bad optimization in reuse_tx_pools()
ibmvnic: drop bad optimization in reuse_rx_pools()
net/smc: fix wrong list_del in smc_lgr_cleanup_early
Fix Comment of ETH_P_802_3_MIN
ethernet: aquantia: Try MAC address from device tree
ipv4: convert fib_num_tclassid_users to atomic_t
net: avoid uninit-value from tcp_conn_request
net: annotate data-races on txq->xmit_lock_owner
octeontx2-af: Fix a memleak bug in rvu_mbox_init()
net/mlx4_en: Fix an use-after-free bug in mlx4_en_try_alloc_resources()
vrf: Reset IPCB/IP6CB when processing outbound pkts in vrf dev xmit
net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings()
net: dsa: mv88e6xxx: Link in pcs_get_state() if AN is bypassed
net: dsa: mv88e6xxx: Fix inband AN for 2500base-x on 88E6393X family
...
Linus Torvalds [Thu, 2 Dec 2021 19:07:41 +0000 (11:07 -0800)]
Merge tag 'trace-v5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Allow compares of strings when using signed and unsigned characters
- Fix kmemleak false positive for histogram entries
- Handle negative numbers for user defined kretprobe data sizes"
* tag 'trace-v5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kprobes: Limit max data_size of the kretprobe instances
tracing: Fix a kmemleak false positive in tracing_map
tracing/histograms: String compares should not care about signed values
Linus Torvalds [Thu, 2 Dec 2021 18:56:16 +0000 (10:56 -0800)]
Merge tag 'for-linus-5.16-2' of git://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"Some changes that went in 5.16 had issues. When working on the design
a piece was redesigned and things got missed. And the message type was
not being initialized when it was allocated, resulting in crashes.
In addition, the IPMI driver has had a shutdown issue where it could
still have an item in a system workqueue after it had been shutdown.
Move to a private workqueue to avoid that problem"
* tag 'for-linus-5.16-2' of git://github.com/cminyard/linux-ipmi:
ipmi:ipmb: Fix unknown command response
ipmi: fix IPMI_SMI_MSG_TYPE_IPMB_DIRECT response length checking
ipmi: fix oob access due to uninit smi_msg type
ipmi: msghandler: Make symbol 'remove_work_wq' static
ipmi: Move remove_work to dedicated workqueue
sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.
task_cputime_adjusted() snapshots utime and stime and then adjust their
sum to match the scheduler maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.
To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.
timers/nohz: Last resort update jiffies on nohz_full IRQ entry
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.
Meanwhile on some rare case, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.
If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in an tick storm.
Here is a scenario where it matters:
0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
1) A stop machine callback is queued to execute somewhere.
2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
can still take IRQs.
3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
last_jiffies_update as a base. But last_jiffies_update hasn't been
updated for 2 jiffies since the timekeeper has interrupts disabled.
5) clockevents_program_event(), which relies on ktime_get(), observes
that the expiration is in the past and therefore programs the min
delta event on the clock.
6) The tick fires immediately, goto 3)
7) Tick storm, the nohz_full CPU is drown and takes ages to reach
MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
Solve this with unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the unconditional call to ktime_get() to
actually matter.
Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
Currently autoloading for SPI devices does not use the DT ID table, it
uses SPI modalises. Supporting OF modalises is going to be difficult if
not impractical, an attempt was made but has been reverted, so ensure
that module autoloading works for this driver by adding an id_table
listing the SPI IDs for everything.
Fixes: 96c8395e2166 ("spi: Revert modalias changes") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Zhijian [Thu, 2 Dec 2021 02:28:41 +0000 (10:28 +0800)]
selftests: net: Correct case name
ipv6_addr_bind/ipv4_addr_bind are function names. Previously, bind test
would not be run by default due to the wrong case names
Fixes: 34d0302ab861 ("selftests: Add ipv6 address bind tests to fcnal-test") Fixes: 75b2b2b3db4c ("selftests: Add ipv4 address bind tests to fcnal-test") Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/rds: correct socket tunable error in rds_tcp_tune()
Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would
instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched.
Fixes: c6a58ffed536 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket") Signed-off-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Wed, 1 Dec 2021 08:07:42 +0000 (16:07 +0800)]
mctp: Don't let RTM_DELROUTE delete local routes
We need to test against the existing route type, not
the rtm_type in the netlink request.
Fixes: 83f0a0b7285b ("mctp: Specify route types, require rtm_type in RTM_*ROUTE messages") Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lu [Wed, 1 Dec 2021 06:42:16 +0000 (14:42 +0800)]
net/smc: Keep smc_close_final rc during active close
When smc_close_final() returns error, the return code overwrites by
kernel_sock_shutdown() in smc_close_active(). The return code of
smc_close_final() is more important than kernel_sock_shutdown(), and it
will pass to userspace directly.
Fix it by keeping both return codes, if smc_close_final() raises an
error, return it or kernel_sock_shutdown()'s.
Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: 606a63c9783a ("net/smc: Ensure the active closing peer first closes clcsock") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic: drop bad optimization in reuse_tx_pools()
When trying to decide whether or not reuse existing rx/tx pools
we tried to allow a range of values for the pool parameters rather
than exact matches. This was intended to reuse the resources for
instance when switching between two VIO servers with different
default parameters.
But this optimization is incomplete and breaks when we try to
change the number of queues for instance. The optimization needs
to be updated, so drop it for now and simplify the code.
Fixes: bbd809305bc7 ("ibmvnic: Reuse tx pools when possible") Reported-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic: drop bad optimization in reuse_rx_pools()
When trying to decide whether or not reuse existing rx/tx pools
we tried to allow a range of values for the pool parameters rather
than exact matches. This was intended to reuse the resources for
instance when switching between two VIO servers with different
default parameters.
But this optimization is incomplete and breaks when we try to
change the number of queues for instance. The optimization needs
to be updated, so drop it for now and simplify the code.
Fixes: 489de956e7a2 ("ibmvnic: Reuse rx pools when possible") Reported-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Wed, 1 Dec 2021 03:02:30 +0000 (11:02 +0800)]
net/smc: fix wrong list_del in smc_lgr_cleanup_early
smc_lgr_cleanup_early() meant to delete the link
group from the link group list, but it deleted
the list head by mistake.
This may cause memory corruption since we didn't
remove the real link group from the list and later
memseted the link group structure.
We got a list corruption panic when testing:
Fixes: a0a62ee15a829 ("net/smc: separate locks for SMCD and SMCR link group lists") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xiayu Zhang [Wed, 1 Dec 2021 02:57:13 +0000 (10:57 +0800)]
Fix Comment of ETH_P_802_3_MIN
The description of ETH_P_802_3_MIN is misleading.
The value of EthernetType in Ethernet II frame is more than 0x0600,
the value of Length in 802.3 frame is less than 0x0600.
Signed-off-by: Xiayu Zhang <Xiayu.Zhang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tianhao Chai [Wed, 1 Dec 2021 02:57:06 +0000 (20:57 -0600)]
ethernet: aquantia: Try MAC address from device tree
Apple M1 Mac minis (2020) with 10GE NICs do not have MAC address in the
card, but instead need to obtain MAC addresses from the device tree. In
this case the hardware will report an invalid MAC.
Currently atlantic driver does not query the DT for MAC address and will
randomly assign a MAC if the NIC doesn't have a permanent MAC burnt in.
This patch causes the driver to perfer a valid MAC address from OF (if
present) over HW self-reported MAC and only fall back to a random MAC
address when neither of them is valid.
Signed-off-by: Tianhao Chai <cth451@gmail.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Reviewed-by: Hector Martin <marcan@marcan.st> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 2 Dec 2021 02:26:35 +0000 (18:26 -0800)]
ipv4: convert fib_num_tclassid_users to atomic_t
Before commit faa041a40b9f ("ipv4: Create cleanup helper for fib_nh")
changes to net->ipv4.fib_num_tclassid_users were protected by RTNL.
After the change, this is no longer the case, as free_fib_info_rcu()
runs after rcu grace period, without rtnl being held.
Fixes: faa041a40b9f ("ipv4: Create cleanup helper for fib_nh") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>