Use system call functions instead of open-coding svc inline
assemblies. This is mostly to get rid of even more register asm
constructs.
Besides that, it makes the code also a bit easier to understand.
The generated code is identical to what is was before.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390/syscall: provide generic system call functions
Provide generic system call functions which should be used whenever a
system call needs to be done from user space. The only in-kernel code
is vdso, which will be converted with a follow on patch.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Heiko Carstens [Fri, 25 Jun 2021 14:42:55 +0000 (16:42 +0200)]
s390/cpacf: get rid of register asm
Using register asm statements has been proven to be very error prone,
especially when using code instrumentation where gcc may add function
calls, which clobbers register contents in an unexpected way.
Therefore get rid of register asm statements in cpacf code, and make
sure this bug class cannot happen.
Reviewed-by: Patrick Steuer <patrick.steuer@de.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
s390/uv: de-duplicate checks for Protected Host Virtualization
De-duplicate checks for Protected Host Virtualization in decompressor and
kernel.
Set prot_virt_host=0 in the decompressor in *any* of the following cases
and hand it over to the decompressed kernel:
* No explicit prot_virt=1 is given on the kernel command-line
* Protected Guest Virtualization is enabled
* Hardware support not present
* kdump or stand-alone dump
The decompressed kernel needs to use only is_prot_virt_host() instead of
performing again all checks done by the decompressor.
s390/boot: move uv function declarations to boot/uv.h
The functions adjust_to_uv_max() and uv_query_info() are used only
in the decompressor. Therefore, move the function declarations from
the global arch/s390/include/asm/uv.h to arch/s390/boot/uv.h.
s390/jump_label: print real address in a case of a jump label bug
In case of a jump label print the real address of the piece of code
where a mismatch was detected. This is right before the system panics,
so there is nothing revealed.
s390/mm: don't print hashed values for pte_ERROR() & friends
Print the real pte, pmd, etc. values instead of some hashed
value. Otherwise debugging would be even more difficult.
This also matches what most other architectures are doing.
s390/sclp: use only one sclp early buffer to send commands
A buffer that can be used for communication with SCLP is required
to lie below 2GB memory address. Therefore, both sclp_info_sccb
and sclp_early_sccb must fulfill this requirement if passed directly
to the sclp_early_cmd() function. Instead, use only sclp_early_sccb
for communication with SCLP. This allows the buffer sclp_info_sccb
to be placed anywhere in the memory address space and, therefore,
simplifies the process of making the decompressor relocatable later on,
one thing less to relocate. And make sure that the length of the new unified
early SCLP buffer is no less than the length of the removed sclp_info_sccb
buffer which might be larger than the length of the sclp_early_sccb buffer.
s390/cio: remove unused include linux/spinlock.h from cio.h
* The linux/spinlock.h header was included indirectly by the decompressor
and brought unnecessary build dependencies.
* Use proper includes in files which either directly or indirectly included
cio.h and were hidden until now by the included linux/spinlock.h, e.g.
linux/string.h for memcpy() or asm/page.h for PAGE_SIZE.
s390/boot: make stacks part of the decompressor's image
Instead of using constant addresses for the normal and dump-info stacks,
allocate both stacks in the decompressor's image and load the stack register
in a position-independent manner.
This will allow loading and entering the decompressor at an arbitrary
memory address without corrupting the content at the fixed addresses
used until now for both stacks. This is one of the prerequisites
for being able to kexec the decompressor from its load address without
relocating it first.
smpboot: fix duplicate and misplaced inlining directive
gcc doesn't care, but clang quite reasonably pointed out that the recent
commit e9ba16e68cce ("smpboot: Mark idle_init() as __always_inlined to
work around aggressive compiler un-inlining") did some really odd
things:
which not only has that duplicate inlining specifier, but the new
__always_inline was put in the wrong place of the function definition.
We put the storage class specifiers (ie things like "static" and
"extern") first, and the type information after that. And while the
compiler may not care, we put the inline specifier before the types.
So it should be just
static __always_inline void idle_init(unsigned int cpu)
Merge tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of timer related fixes:
- Plug a race between rearm and process tick in the posix CPU timers
code
- Make the optimization to avoid recalculation of the next timer
interrupt work correctly when there are no timers pending"
* tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix get_next_timer_interrupt() with no timers pending
posix-cpu-timers: Fix rearm racing against process tick
Merge tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 jump label fix from Thomas Gleixner:
"A single fix for jump labels to prevent the compiler from agressive
un-inlining which results in a section mismatch"
* tag 'locking-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
jump_labels: Mark __jump_label_transform() as __always_inlined to work around aggressive compiler un-inlining
Merge tag 'efi-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Thomas Gleixner:
"A set of EFI fixes:
- Prevent memblock and I/O reserved resources to get out of sync when
EFI memreserve is in use.
- Don't claim a non-existing table is invalid
- Don't warn when firmware memory is already reserved correctly"
* tag 'efi-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/mokvar: Reserve the table only if it is in boot services data
efi/libstub: Fix the efi_load_initrd function description
firmware/efi: Tell memblock about EFI iomem reservations
efi/tpm: Differentiate missing and invalid final event log table.
Merge tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fix from Thomas Gleixner:
"A single update for the boot code to prevent aggressive un-inlining
which causes a section mismatch"
* tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining
Merge tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five cifs/smb3 fixes, including a DFS failover fix, two fallocate
fixes, and two trivial coverity cleanups"
* tag '5.14-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix fallocate when trying to allocate a hole.
CIFS: Clarify SMB1 code for POSIX delete file
CIFS: Clarify SMB1 code for POSIX Create
cifs: support share failover when remounting
cifs: only write 64kb at a time when fallocating a small region of a file
Merge tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- properly set the memory size, which fixes 32-bit systems
- allow initrd to load anywhere in memory, rather that restricting it
to the first 256MiB
- fix the 'mem=' parameter on 64-bit systems to properly account for
the maximum supported memory now that the kernel is outside the
linear map
- avoid installing mappings into the last 4KiB of memory, which
conflicts with error values
- avoid the stack from being freed while it is being walked
- a handful of fixes to the new copy to/from user routines
* tag 'riscv-for-linus-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: __asm_copy_to-from_user: Fix: Typos in comments
riscv: __asm_copy_to-from_user: Remove unnecessary size check
riscv: __asm_copy_to-from_user: Fix: fail on RV32
riscv: __asm_copy_to-from_user: Fix: overrun copy
riscv: stacktrace: pin the task's stack in get_wchan
riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE
riscv: Make sure the linear mapping does not use the kernel mapping
riscv: Fix memory_limit for 64-bit kernel
RISC-V: load initrd wherever it fits into memory
riscv: Fix 32-bit RISC-V boot failure
Commit 71f642833284 ("ACPI: utils: Fix reference counting in
for_each_acpi_dev_match()") started doing "acpi_dev_put()" on a pointer
that was possibly NULL. That fails miserably, because that helper
inline function is not set up to handle that case.
Just make acpi_dev_put() silently accept a NULL pointer, rather than
calling down to put_device() with an invalid offset off that NULL
pointer.
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four fixes, all in drivers, all of which can lead to user visible
problems in certain situations"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: Fix NULL dereference on XCOPY completion
scsi: mpt3sas: Transition IOC to Ready state during shutdown
scsi: target: Fix protect handling in WRITE SAME(32)
scsi: iscsi: Fix iface sysfs attr detection
Merge tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Fix a memory leak due to a race condition in io_init_wq_offload
(Yang)
- Poll error handling fixes (Pavel)
- Fix early fdput() regression (me)
- Don't reissue iopoll requests off release path (me)
- Add a safety check for io-wq queue off wrong path (me)
* tag 'io_uring-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
io_uring: explicitly catch any illegal async queue attempt
io_uring: never attempt iopoll reissue from release path
io_uring: fix early fdput() of file
io_uring: fix memleak in io_init_wq_offload()
io_uring: remove double poll entry on arm failure
io_uring: explicitly count entries for poll reqs
Merge tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request (Christoph):
- tracing fix (Keith Busch)
- fix multipath head refcounting (Hannes Reinecke)
- Write Zeroes vs PI fix (me)
- drop a bogus WARN_ON (Zhihao Cheng)
- Increase max blk-cgroup policy size, now that mq-deadline
uses it too (Oleksandr)
* tag 'block-5.14-2021-07-24' of git://git.kernel.dk/linux-block:
nvme: set the PRACT bit when using Write Zeroes with T10 PI
nvme: fix nvme_setup_command metadata trace event
nvme: fix refcounting imbalance when all paths are down
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
block: increase BLKCG_MAX_POLS
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two bugfixes for the I2C subsystem"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mpc: Poll for MCF
misc: eeprom: at24: Always append device id even if label property is set.
Merge misc mm fixes from Andrew Morton:
"15 patches.
VM subsystems affected by this patch series: userfaultfd, kfence,
highmem, pagealloc, memblock, pagecache, secretmem, pagemap, and
hugetlbfs"
* akpm:
hugetlbfs: fix mount mode command line processing
mm: fix the deadlock in finish_fault()
mm: mmap_lock: fix disabling preemption directly
mm/secretmem: wire up ->set_page_dirty
writeback, cgroup: do not reparent dax inodes
writeback, cgroup: remove wb from offline list before releasing refcnt
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction
mm: use kmap_local_page in memzero_page
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
kfence: skip all GFP_ZONEMASK allocations
kfence: move the size check to the beginning of __kfence_alloc()
kfence: defer kfence_test_init to ensure that kunit debugfs is created
selftest: use mmap instead of posix_memalign to allocate memory
userfaultfd: do not untag user pointers
Had a bug when converting bytes to bits when the cpu was rv32.
The a3 contains the number of bytes and multiple of 8
would be the bits. The LGREG is holding 2 for RV32 and 3 for
RV32, so to achieve multiple of 8 it must always be constant 3.
The 2 was mistakenly used for rv32.
Mike Kravetz [Fri, 23 Jul 2021 22:50:44 +0000 (15:50 -0700)]
hugetlbfs: fix mount mode command line processing
In commit 32021982a324 ("hugetlbfs: Convert to fs_context") processing
of the mount mode string was changed from match_octal() to fsparam_u32.
This changed existing behavior as match_octal does not require octal
values to have a '0' prefix, but fsparam_u32 does.
Use fsparam_u32oct which provides the same behavior as match_octal.
Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 63f3655f9501 ("mm, memcg: fix reclaim deadlock with writeback")
fix the following ABBA deadlock by pre-allocating the pte page table
without holding the page lock.
Commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") reworked the relevant code but ignored this race. This will
cause the deadlock above to appear again, so fix it.
Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Fri, 23 Jul 2021 22:50:35 +0000 (15:50 -0700)]
mm/secretmem: wire up ->set_page_dirty
Make secretmem up to date with the changes done in commit 0af573780b0b
("mm: require ->set_page_dirty to be explicitly wired up") so that
unconditional call to this method won't cause crashes.
Link: https://lkml.kernel.org/r/20210716063933.31633-1-rppt@kernel.org Fixes: 0af573780b0b ("mm: require ->set_page_dirty to be explicitly wired up") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Fri, 23 Jul 2021 22:50:32 +0000 (15:50 -0700)]
writeback, cgroup: do not reparent dax inodes
The inode switching code is not suited for dax inodes. An attempt to
switch a dax inode to a parent writeback structure (as a part of a
writeback cleanup procedure) results in a panic like this:
The crash happens on an attempt to iterate over attached pagecache pages
and check the dirty flag: a dax inode's xarray contains pfn's instead of
generic struct page pointers.
This happens for DAX and not for other kinds of non-page entries in the
inodes because it's a tagged iteration, and shadow/swap entries are
never tagged; only DAX entries get tagged.
Fix the problem by bailing out (with the false return value) of
inode_prepare_sbs_switch() if a dax inode is passed.
[willy@infradead.org: changelog addition]
Link: https://lkml.kernel.org/r/20210719171350.3876830-1-guro@fb.com Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Murphy Zhou <jencce.kernel@gmail.com> Reported-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Murphy Zhou <jencce.kernel@gmail.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <dchinner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Fri, 23 Jul 2021 22:50:29 +0000 (15:50 -0700)]
writeback, cgroup: remove wb from offline list before releasing refcnt
Boyang reported that the commit c22d70a162d3 ("writeback, cgroup:
release dying cgwbs by switching attached inodes") causes the kernel to
crash while running xfstests generic/256 on ext4 on aarch64 and ppc64le.
The problem happens when cgwb_release_workfn() races with
cleanup_offline_cgwbs_workfn(): wb_tryget() in
cleanup_offline_cgwbs_workfn() can be called after percpu_ref_exit() is
cgwb_release_workfn(), which is basically a use-after-free error.
Fix the problem by making removing the writeback structure from the
offline list before releasing the percpu reference counter. It will
guarantee that cleanup_offline_cgwbs_workfn() will not see and not
access writeback structures which are about to be released.
Link: https://lkml.kernel.org/r/20210716201039.3762203-1-guro@fb.com Fixes: c22d70a162d3 ("writeback, cgroup: release dying cgwbs by switching attached inodes") Signed-off-by: Roman Gushchin <guro@fb.com> Reported-by: Boyang Xue <bxue@redhat.com> Suggested-by: Jan Kara <jack@suse.cz> Tested-by: Darrick J. Wong <djwong@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Murphy Zhou <jencce.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Fri, 23 Jul 2021 22:50:26 +0000 (15:50 -0700)]
memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions
Commit b10d6bca8720 ("arch, drivers: replace for_each_membock() with
for_each_mem_range()") didn't take into account that when there is
movable_node parameter in the kernel command line, for_each_mem_range()
would skip ranges marked with MEMBLOCK_HOTPLUG.
The page table setup code in POWER uses for_each_mem_range() to create
the linear mapping of the physical memory and since the regions marked
as MEMORY_HOTPLUG are skipped, they never make it to the linear map.
A later access to the memory in those ranges will fail:
Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches")
init_on_alloc never enabled static branch by default. It could only be
enabed explicitly by init_mem_debugging_and_hardening().
But after commit 51cba1ebc60d, a static branch could already be enabled
by default. There was no code to ever disable it. That caused
page_poison=1 / init_on_free=1 conflict.
This change extends init_mem_debugging_and_hardening() to also disable
static branch disabling.
The commit message introducing the global memzero_page explicitly
mentions switching to kmap_local_page in the commit log but doesn't
actually do that.
Link: https://lkml.kernel.org/r/20210713055231.137602-3-hch@lst.de Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: call flush_dcache_page() in memcpy_to_page() and memzero_page()
memcpy_to_page and memzero_page can write to arbitrary pages, which
could be in the page cache or in high memory, so call
flush_kernel_dcache_pages to flush the dcache.
This is a problem when using these helpers on dcache challeneged
architectures. Right now there are just a few users, chances are no one
used the PC floppy driver, the aha1542 driver for an ISA SCSI HBA, and a
few advanced and optional btrfs and ext4 features on those platforms yet
since the conversion.
Link: https://lkml.kernel.org/r/20210713055231.137602-2-hch@lst.de Fixes: bb90d4bc7b6a ("mm/highmem: Lift memcpy_[to|from]_page to core") Fixes: 28961998f858 ("iov_iter: lift memzero_page() to highmem.h") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
different from the requested one.
Because callers of kmem_cache_alloc() may actually rely on the
allocation to reside in the requested zone (e.g. memory allocations
done with __GFP_DMA must be DMAable), skip all allocations done with
GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
SLAB_CACHE_DMA32).
kfence: defer kfence_test_init to ensure that kunit debugfs is created
kfence_test_init and kunit_init both use the same level late_initcall,
which means if kfence_test_init linked ahead of kunit_init,
kfence_test_init will get a NULL debugfs_rootdir as parent dentry, then
kfence_test_init and kfence_debugfs_init both create a debugfs node
named "kfence" under debugfs_mount->mnt_root, and it will throw out
"debugfs: Directory 'kfence' with parent '/' already present!" with
EEXIST. So kfence_test_init should be deferred.
Link: https://lkml.kernel.org/r/20210714113140.2949995-1-o451686892@gmail.com Signed-off-by: Weizhao Ouyang <o451686892@gmail.com> Tested-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
selftest: use mmap instead of posix_memalign to allocate memory
This test passes pointers obtained from anon_allocate_area to the
userfaultfd and mremap APIs. This causes a problem if the system
allocator returns tagged pointers because with the tagged address ABI
the kernel rejects tagged addresses passed to these APIs, which would
end up causing the test to fail. To make this test compatible with such
system allocators, stop using the system allocator to allocate memory in
anon_allocate_area, and instead just use mmap.
Patch series "userfaultfd: do not untag user pointers", v5.
If a user program uses userfaultfd on ranges of heap memory, it may end
up passing a tagged pointer to the kernel in the range.start field of
the UFFDIO_REGISTER ioctl. This can happen when using an MTE-capable
allocator, or on Android if using the Tagged Pointers feature for MTE
readiness [1].
When a fault subsequently occurs, the tag is stripped from the fault
address returned to the application in the fault.address field of struct
uffd_msg. However, from the application's perspective, the tagged
address *is* the memory address, so if the application is unaware of
memory tags, it may get confused by receiving an address that is, from
its point of view, outside of the bounds of the allocation. We observed
this behavior in the kselftest for userfaultfd [2] but other
applications could have the same problem.
Address this by not untagging pointers passed to the userfaultfd ioctls.
Instead, let the system call fail. Also change the kselftest to use
mmap so that it doesn't encounter this problem.
Do not untag pointers passed to the userfaultfd ioctls. Instead, let
the system call fail. This will provide an early indication of problems
with tag-unaware userspace code instead of letting the code get confused
later, and is consistent with how we decided to handle brk/mmap/mremap
in commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
brk()/mmap()/mremap()"), as well as being consistent with the existing
tagged address ABI documentation relating to how ioctl arguments are
handled.
The code change is a revert of commit 7d0325749a6c ("userfaultfd: untag
user pointers") plus some fixups to some additional calls to
validate_range that have appeared since then.
io_uring: explicitly catch any illegal async queue attempt
Catch an illegal case to queue async from an unrelated task that got
the ring fd passed to it. This should not be possible to hit, but
better be proactive and catch it explicitly. io-wq is extended to
check for early IO_WQ_WORK_CANCEL being set on a work item as well,
so it can run the request through the normal cancelation path.
io_uring: never attempt iopoll reissue from release path
There are two reasons why this shouldn't be done:
1) Ring is exiting, and we're canceling requests anyway. Any request
should be canceled anyway. In theory, this could iterate for a
number of times if someone else is also driving the target block
queue into request starvation, however the likelihood of this
happening is miniscule.
2) If the original task decided to pass the ring to another task, then
we don't want to be reissuing from this context as it may be an
unrelated task or context. No assumptions should be made about
the context in which ->release() is run. This can only happen for pure
read/write, and we'll get -EFAULT on them anyway.
Merge tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and one patch to help some block layer API cleanups:
- skip missing device when running fstrim
- fix unpersisted i_size on fsync after expanding truncate
- fix lock inversion problem when doing qgroup extent tracing
- replace bdgrab/bdput usage, replace gendisk by block_device"
* tag 'for-5.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: store a block_device in struct btrfs_ordered_extent
btrfs: fix lock inversion problem when doing qgroup extent tracing
btrfs: check for missing device in btrfs_trim_fs
btrfs: fix unpersisted i_size on fsync after expanding truncate
Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A subtle deadlock on lock_rwsem (marked for stable) and rbd fixes for
a -rc1 regression.
Also included a rare WARN condition tweak"
* tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client:
rbd: resurrect setting of disk->private_data in rbd_init_disk()
ceph: don't WARN if we're still opening a session to an MDS
rbd: don't hold lock_rwsem while running_list is being drained
rbd: always kick acquire on "acquired" and "released" notifications
Merge tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix deadloop in ring buffer because of using stale "read" variable
- Fix synthetic event use of field_pos as boolean and not an index
- Fixed histogram special var "cpu" overriding event fields called
"cpu"
- Cleaned up error prone logic in alloc_synth_event()
- Removed call to synchronize_rcu_tasks_rude() when not needed
- Removed redundant initialization of a local variable "ret"
- Fixed kernel crash when updating tracepoint callbacks of different
priorities.
* tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoints: Update static_call before tp_funcs when adding a tracepoint
ftrace: Remove redundant initialization of variable ret
ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
tracing: Clean up alloc_synth_event()
tracing/histogram: Rename "cpu" to "common_cpu"
tracing: Synthetic event field_pos is an index not a boolean
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
Merge tag 'acpi-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix a recently broken Kconfig dependency and ACPI device
reference counting in an iterator macro.
Specifics:
- Fix recently broken Kconfig dependency for the ACPI table override
via built-in initrd (Robert Richter)
- Fix ACPI device reference counting in the for_each_acpi_dev_match()
helper macro to avoid use-after-free (Andy Shevchenko)"
* tag 'acpi-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
ACPI: Kconfig: Fix table override from built-in initrd
Merge tag 'driver-core-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two small driver core fixes to resolve some reported problems
for 5.14-rc3. They include:
- aux bus memory leak fix
- unneeded warning message removed when removing a device link.
Both have been in linux-next with no reported problems"
* tag 'driver-core-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: Prevent warning when removing a device link from unregistered consumer
driver core: auxiliary bus: Fix memory leak when driver_register() fail
Merge tag 'char-misc-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are some small char/misc driver fixes for 5.14-rc3.
Included in here are:
- MAINTAINERS file updates for two changes in different driver
subsystems
- mhi bus bugfixes
- nds32 bugfix that resolves a reported problem
All have been in linux-next with no reported problems"
* tag 'char-misc-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
nds32: fix up stack guard gap
MAINTAINERS: Change ACRN HSM driver maintainer
MAINTAINERS: Update for VMCI driver
bus: mhi: pci_generic: Fix inbound IPCR channel
bus: mhi: core: Validate channel ID when processing command completions
bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean
Merge tag 'usb-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes for 5.14-rc3 to resolve a bunch of tiny
problems reported. Included in here are:
- dtsi revert to resolve a problem which broke android systems that
relied on the dts name to find the USB controller device.
People are still working out the "real" solution for this, but for
now the revert is needed.
- core USB fix for pipe calculation found by syzbot
- typec fixes
- gadget driver fixes
- new usb-serial device ids
- new USB quirks
- xhci fixes
- usb hub fixes for power management issues reported
- other tiny fixes
All have been in linux-next with no reported problems"
* tag 'usb-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (27 commits)
USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
Revert "USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem"
usb: cdc-wdm: fix build error when CONFIG_WWAN_CORE is not set
Revert "arm64: dts: qcom: Harmonize DWC USB3 DT nodes name"
usb: dwc2: gadget: Fix sending zero length packet in DDMA mode.
usb: dwc2: Skip clock gating on Samsung SoCs
usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop()
usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode.
usb: phy: Fix page fault from usb_phy_uevent
usb: xhci: avoid renesas_usb_fw.mem when it's unusable
usb: gadget: u_serial: remove WARN_ON on null port
usb: dwc3: avoid NULL access of usb_gadget_driver
usb: max-3421: Prevent corruption of freed memory
usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe
MAINTAINERS: repair reference in USB IP DRIVER FOR HISILICON KIRIN 970
usb: typec: stusb160x: Don't block probing of consumer of "connector" nodes
usb: typec: stusb160x: register role switch before interrupt registration
USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS
usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI
usb: hub: Disable USB 3 device initiated lpm if exit latency is too high
...
Merge tag 'sound-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes, mostly covering device-specific
regressions and bugs over ASoC, HD-audio and USB-audio, while
the ALSA PCM core received a few additional fixes for the
possible (new and old) regressions"
* tag 'sound-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (29 commits)
ALSA: usb-audio: Add registration quirk for JBL Quantum headsets
ALSA: hda/hdmi: Add quirk to force pin connectivity on NUC10
ALSA: pcm: Fix mmap without buffer preallocation
ALSA: pcm: Fix mmap capability check
ALSA: hda: intel-dsp-cfg: add missing ElkhartLake PCI ID
ASoC: ti: j721e-evm: Check for not initialized parent_clk_id
ASoC: ti: j721e-evm: Fix unbalanced domain activity tracking during startup
ALSA: hda/realtek: Fix pop noise and 2 Front Mic issues on a machine
ALSA: hdmi: Expose all pins on MSI MS-7C94 board
ALSA: sb: Fix potential ABBA deadlock in CSP driver
ASoC: rt5682: Fix the issue of garbled recording after powerd_dbus_suspend
ASoC: amd: reverse stop sequence for stoneyridge platform
ASoC: soc-pcm: add a flag to reverse the stop sequence
ASoC: codecs: wcd938x: setup irq during component bind
ASoC: dt-bindings: renesas: rsnd: Fix incorrect 'port' regex schema
ALSA: usb-audio: Add missing proc text entry for BESPOKEN type
ASoC: codecs: wcd938x: make sdw dependency explicit in Kconfig
ASoC: SOF: Intel: Update ADL descriptor to use ACPI power states
ASoC: rt5631: Fix regcache sync errors on resume
ALSA: pcm: Call substream ack() method upon compat mmap commit
...
tracepoints: Update static_call before tp_funcs when adding a tracepoint
Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.
In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:
If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.
The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.
static_call = callback1()
funcs[] = {callback1,data1};
callback2 has higher priority than callback1
CPU 1 CPU 2
----- -----
new_funcs = {callback2,data2},
{callback1,data1}
rcu_assign_pointer(tp->funcs, new_funcs);
/*
* Now tp->funcs has the new array
* but the static_call still calls callback1
*/
/* Now callback1 is called with
* callback2's data */
[ KERNEL PANIC ]
update_static_call(iterator);
To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.
To trigger this bug:
In one terminal:
while :; do hackbench 50; done
In another terminal
echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
while :; do
echo 1 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
echo 0 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
done
And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.
Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/ Cc: stable@vger.kernel.org Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()") Reported-by: Stefan Metzmacher <metze@samba.org>
tested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
very disrupting (hence the name). So avoid calling it when 'old_hash'
doesn't need to be freed.
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.
The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.
For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:
Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.
Now we can even do:
># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
># cat events/workqueue/workqueue_queue_work/hist
# event histogram
#
# trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
#
Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.
I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".
Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: 8b7622bf94a44 ("tracing: Add cpu field for hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The reason is that the dynamic events array keeps track of the field
position of the fields array, via the field_pos variable in the
synth_field structure. Unfortunately, that field is a boolean for some
reason, which means any field_pos greater than 1 will be a bug (in this
case it was 2).
Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Fixes: bd82631d7ccdc ("tracing: Add support for dynamic strings to synthetic events") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Nicholas Piggin [Thu, 8 Jul 2021 11:26:22 +0000 (21:26 +1000)]
KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state
The H_ENTER_NESTED hypercall is handled by the L0, and it is a request
by the L1 to switch the context of the vCPU over to that of its L2
guest, and return with an interrupt indication. The L1 is responsible
for switching some registers to guest context, and the L0 switches
others (including all the hypervisor privileged state).
If the L2 MSR has TM active, then the L1 is responsible for
recheckpointing the L2 TM state. Then the L1 exits to L0 via the
H_ENTER_NESTED hcall, and the L0 saves the TM state as part of the exit,
and then it recheckpoints the TM state as part of the nested entry and
finally HRFIDs into the L2 with TM active MSR. Not efficient, but about
the simplest approach for something that's horrendously complicated.
Problems arise if the L1 exits to the L0 with a TM state which does not
match the L2 TM state being requested. For example if the L1 is
transactional but the L2 MSR is non-transactional, or vice versa. The
L0's HRFID can take a TM Bad Thing interrupt and crash.
Fix this by disallowing H_ENTER_NESTED in TM[T] state entirely, and then
ensuring that if the L1 is suspended then the L2 must have TM active,
and if the L1 is not suspended then the L2 must not have TM active.
Fixes: 360cae313702 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall") Cc: stable@vger.kernel.org # v4.20+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Nicholas Piggin [Tue, 20 Jul 2021 10:43:09 +0000 (20:43 +1000)]
KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow
The kvmppc_rtas_hcall() sets the host rtas_args.rets pointer based on
the rtas_args.nargs that was provided by the guest. That guest nargs
value is not range checked, so the guest can cause the host rets pointer
to be pointed outside the args array. The individual rtas function
handlers check the nargs and nrets values to ensure they are correct,
but if they are not, the handlers store a -3 (0xfffffffd) failure
indication in rets[0] which corrupts host memory.
Fix this by testing up front whether the guest supplied nargs and nret
would exceed the array size, and fail the hcall directly without storing
a failure indication to rets[0].
Also expand on a comment about why we kill the guest and try not to
return errors directly if we have a valid rets[0] pointer.
Fixes: 8e591cb72047 ("KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls") Cc: stable@vger.kernel.org # v3.10+ Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Alexandre Ghiti [Tue, 29 Jun 2021 09:13:48 +0000 (11:13 +0200)]
riscv: Make sure the kernel mapping does not overlap with IS_ERR_VALUE
The check that is done in setup_bootmem currently only works for 32-bit
kernel since the kernel mapping has been moved outside of the linear
mapping for 64-bit kernel. So make sure that for 64-bit kernel, the kernel
mapping does not overlap with the last 4K of the addressable memory.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Fixes: 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Alexandre Ghiti [Tue, 29 Jun 2021 09:13:47 +0000 (11:13 +0200)]
riscv: Make sure the linear mapping does not use the kernel mapping
For 64-bit kernel, the end of the address space is occupied by the
kernel mapping and currently, the functions to populate the kernel page
tables (i.e. create_p*d_mapping) do not override existing mapping so we
must make sure the linear mapping does not map memory in the kernel mapping
by clipping the memory above the memory limit.
Merge tag 'drm-fixes-2021-07-23' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Regular fixes - a bunch of amdgpu fixes are the main thing mostly for
the new gpus. There is also some i915 reverts for older changes that
were having some unwanted side effects. One nouveau fix for a report
regressions, and otherwise just some misc fixes.
* tag 'drm-fixes-2021-07-23' of git://anongit.freedesktop.org/drm/drm: (34 commits)
drm/panel: raspberrypi-touchscreen: Prevent double-free
drm/amdgpu - Corrected the video codecs array name for yellow carp
drm/amd/display: Fix ASSR regression on embedded panels
drm/amdgpu: add yellow carp pci id (v2)
drm/amdgpu: update yellow carp external rev_id handling
drm/amd/pm: Support board calibration on aldebaran
drm/amd/display: change zstate allow msg condition
drm/amd/display: Populate dtbclk entries for dcn3.02/3.03
drm/amd/display: Line Buffer changes
drm/amd/display: Remove MALL function from DCN3.1
drm/amd/display: Only set default brightness for OLED
drm/amd/display: Update bounding box for DCN3.1
drm/amd/display: Query VCO frequency from register for DCN3.1
drm/amd/display: Populate socclk entries for dcn3.02/3.03
drm/amd/display: Fix max vstartup calculation for modes with borders
drm/amd/display: implement workaround for riommu related hang
drm/amd/display: Fix comparison error in dcn21 DML
drm/i915: Correct the docs for intel_engine_cmd_parser
drm/ttm: add missing NULL checks
drm/ttm: Force re-init if ttm_global_init() fails
...
Alexandre Ghiti [Tue, 29 Jun 2021 09:13:46 +0000 (11:13 +0200)]
riscv: Fix memory_limit for 64-bit kernel
As described in Documentation/riscv/vm-layout.rst, the end of the
virtual address space for 64-bit kernel is occupied by the modules/BPF/
kernel mappings so this actually reduces the amount of memory we are able
to map and then use in the linear mapping. So make sure this limit is
correctly set.
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Fixes: 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Ronnie Sahlberg [Fri, 23 Jul 2021 01:21:24 +0000 (11:21 +1000)]
cifs: fix fallocate when trying to allocate a hole.
Remove the conditional checking for out_data_len and skipping the fallocate
if it is 0. This is wrong will actually change any legitimate the fallocate
where the entire region is unallocated into a no-op.
Additionally, before allocating the range, if FALLOC_FL_KEEP_SIZE is set then
we need to clamp the length of the fallocate region as to not extend the size of the file.
Fixes: 966a3cb7c7db ("cifs: improve fallocate emulation") Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'fallthrough-fixes-clang-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fix from Gustavo Silva:
"Fix a fall-through warning when building with -Wimplicit-fallthrough
on PowerPC"
* tag 'fallthrough-fixes-clang-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
powerpc/pasemi: Fix fall-through warning for Clang
Dave Airlie [Fri, 23 Jul 2021 00:43:37 +0000 (10:43 +1000)]
Merge tag 'drm-intel-fixes-2021-07-22' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
Couple reverts from Jason getting rid of asynchronous command parsing
and fence error propagation and a GVT fix of shadow ppgtt invalidation
with proper D3 state tracking from Colin.
A previous commit shuffled some code around, and inadvertently used
struct file after fdput() had been called on it. As we can't touch
the file post fdput() dropping our reference, move the fdput() to
after that has been done.
Merge tag 'array-bounds-fixes-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull array bounds warning fix from Gustavo Silva:
"Fix a couple of out-of-bounds warnings in the media subsystem.
This is part of the ongoing efforts to globally enable -Warray-bounds"
* tag 'array-bounds-fixes-5.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf()
Merge tag 'nvme-5.14-2021-07-22' of git://git.infradead.org/nvme into block-5.14
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.14:
- tracing fix (Keith Busch)
- fix multipath head refcounting (Hannes Reinecke)
- Write Zeroes vs PI fix (me)
- drop a bogus WARN_ON (Zhihao Cheng)"
* tag 'nvme-5.14-2021-07-22' of git://git.infradead.org/nvme:
nvme: set the PRACT bit when using Write Zeroes with T10 PI
nvme: fix nvme_setup_command metadata trace event
nvme: fix refcounting imbalance when all paths are down
nvme-pci: don't WARN_ON in nvme_reset_work if ctrl.state is not RESETTING
Steve French [Thu, 22 Jul 2021 19:35:15 +0000 (14:35 -0500)]
CIFS: Clarify SMB1 code for POSIX delete file
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 CIFSPOSIXDelFile. This changeset
doesn't change the address but makes it slightly clearer.
Addresses-Coverity: 711519 ("Out of bounds write") Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'usb-serial-5.14-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus
Johan writes:
USB-serial fixes for 5.14-rc3
Here are some new device ids and a device-id comment fix.
All have been in linux-next with no reported issues.
* tag 'usb-serial-5.14-rc3' of https://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick
USB: serial: cp210x: fix comments for GE CS1000
USB: serial: option: add support for u-blox LARA-R6 family
Steve French [Thu, 22 Jul 2021 18:50:41 +0000 (13:50 -0500)]
CIFS: Clarify SMB1 code for POSIX Create
Coverity also complains about the way we calculate the offset
(starting from the address of a 4 byte array within the
header structure rather than from the beginning of the struct
plus 4 bytes) for SMB1 CIFSPOSIXCreate. This changeset
doesn't change the address but makes it slightly clearer.
Addresses-Coverity: 711518 ("Out of bounds write") Signed-off-by: Steve French <stfrench@microsoft.com>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"A pair of arm64 fixes for -rc3. The straightforward one is a fix to
our firmware calling stub, which accidentally started corrupting the
link register on machines with SVE. Since these machines don't really
exist yet, it wasn't spotted in -next.
The other fix is a revert-and-a-bit of a patch originally intended to
allow PTE-level huge mappings for the VMAP area on 32-bit PPC 8xx. A
side-effect of this change was that our pXd_set_huge() implementations
could be replaced with generic dummy functions depending on the levels
of page-table being used, which in turn broke the boot if we fail to
create the linear mapping as a result of using these functions to
operate on the pgd. Huge thanks to Michael Ellerman for modifying the
revert so as not to regress PPC 8xx in terms of functionality.
Anyway, that's the background and it's also available in the commit
message along with Link tags pointing at all of the fun.
Summary:
- Fix hang when issuing SMC on SVE-capable system due to
clobbered LR
- Fix boot failure due to missing block mappings with folded
page-table"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
Revert "mm/pgtable: add stubs for {pmd/pub}_{set/clear}_huge"
arm64: smccc: Save lr before calling __arm_smccc_sve_check()
Merge tag 'hyperv-fixes-signed-20210722' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- bug fix from Haiyang for vmbus CPU assignment
- revert of a bogus patch that went into 5.14-rc1
* tag 'hyperv-fixes-signed-20210722' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
Revert "x86/hyperv: fix logical processor creation"
Drivers: hv: vmbus: Fix duplicate CPU assignments within a device
1) Fix type of bind option flag in af_xdp, from Baruch Siach.
2) Fix use after free in bpf_xdp_link_release(), from Xuan Zhao.
3) PM refcnt imbakance in r8152, from Takashi Iwai.
4) Sign extension ug in liquidio, from Colin Ian King.
5) Mising range check in s390 bpf jit, from Colin Ian King.
6) Uninit value in caif_seqpkt_sendmsg(), from Ziyong Xuan.
7) Fix skb page recycling race, from Ilias Apalodimas.
8) Fix memory leak in tcindex_partial_destroy_work, from Pave Skripkin.
9) netrom timer sk refcnt issues, from Nguyen Dinh Phi.
10) Fix data races aroun tcp's tfo_active_disable_stamp, from Eric
Dumazet.
11) act_skbmod should only operate on ethernet packets, from Peilin Ye.
12) Fix slab out-of-bpunds in fib6_nh_flush_exceptions(),, from Psolo
Abeni.
13) Fix sparx5 dependencies, from Yajun Deng.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
dpaa2-switch: seed the buffer pool after allocating the swp
net: sched: cls_api: Fix the the wrong parameter
net: sparx5: fix unmet dependencies warning
net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
net: dsa: ensure linearized SKBs in case of tail taggers
ravb: Remove extra TAB
ravb: Fix a typo in comment
net: dsa: sja1105: make VID 4095 a bridge VLAN too
tcp: disable TFO blackhole logic by default
sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set
net: ixp46x: fix ptp build failure
ibmvnic: Remove the proper scrq flush
selftests: net: add ESP-in-UDP PMTU test
udp: check encap socket in __udp_lib_err
sctp: update active_key for asoc when old key is being replaced
r8169: Avoid duplicate sysfs entry creation error
ixgbe: Fix packet corruption due to missing DMA sync
Revert "qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()"
ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
fsl/fman: Add fibre support
...
Merge tag 'mmc-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
- Use kref to fix KASAN splats triggered during card removal
- Don't allocate IDA for OF aliases
* tag 'mmc-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: Don't allocate IDA for OF aliases
mmc: core: Use kref in place of struct mmc_blk_data::usage
Paulo Alcantara [Fri, 16 Jul 2021 06:26:41 +0000 (03:26 -0300)]
cifs: support share failover when remounting
When remouting a DFS share, force a new DFS referral of the path and
if the currently cached targets do not match any of the new targets or
there was no cached targets, then mark it for reconnect.
For example:
$ mount //dom/dfs/link /mnt -o username=foo,password=bar
$ ls /mnt
oldfile.txt
change target share of 'link' in server settings
$ mount /mnt -o remount,username=foo,password=bar
$ ls /mnt
newfile.txt
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Haoran Luo [Wed, 21 Jul 2021 14:12:07 +0000 (14:12 +0000)]
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.
An error scenario could be constructed as followed (kernel perspective):
1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.
3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.
4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.
When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.
I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:
Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1a089fb593ef6a23d4b70903526fe0c), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.
Filipe Manana [Wed, 21 Jul 2021 16:31:48 +0000 (17:31 +0100)]
btrfs: fix lock inversion problem when doing qgroup extent tracing
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
NULL value as the transaction handle argument, which makes that function
take the commit_root_sem semaphore, which is necessary when we don't hold
a transaction handle or any other mechanism to prevent a transaction
commit from wiping out commit roots.
However btrfs_qgroup_trace_extent_post() can be called in a context where
we are holding a write lock on an extent buffer from a subvolume tree,
namely from btrfs_truncate_inode_items(), called either during truncate
or unlink operations. In this case we end up with a lock inversion problem
because the commit_root_sem is a higher level lock, always supposed to be
acquired before locking any extent buffer.
Lockdep detects this lock inversion problem since we switched the extent
buffer locks from custom locks to semaphores, and when running btrfs/158
from fstests, it reported the following trace:
[ 9057.626435] ======================================================
[ 9057.627541] WARNING: possible circular locking dependency detected
[ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
[ 9057.628961] ------------------------------------------------------
[ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
[ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.632542]
but task is already holding lock:
[ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.635255]
which lock already depends on the new lock.
Fix this by making btrfs_find_all_roots() never attempt to lock the
commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
from btrfs_qgroup_trace_extent_post(), because that would make backref
lookup not use commit roots and acquire read locks on extent buffers, and
therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
from the btrfs_truncate_inode_items() code path which has acquired a write
lock on an extent buffer of the subvolume btree.
CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
$ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
$ mount /dev/loop0 /btrfs
$ umount /btrfs
$ btrfs dev scan --forget
$ mount -o degraded /dev/loop0 /btrfs
$ fstrim /btrfs
The reason is we call btrfs_trim_free_extents() for the missing device,
which uses device->bdev (NULL for missing device) to find if the device
supports discard.
Fix is to check if the device is missing before calling
btrfs_trim_free_extents().
CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 6 Jul 2021 14:41:15 +0000 (15:41 +0100)]
btrfs: fix unpersisted i_size on fsync after expanding truncate
If we have an inode that does not have the full sync flag set, was changed
in the current transaction, then it is logged while logging some other
inode (like its parent directory for example), its i_size is increased by
a truncate operation, the log is synced through an fsync of some other
inode and then finally we explicitly call fsync on our inode, the new
i_size is not persisted.
The following example shows how to trigger it, with comments explaining
how and why the issue happens:
# Fsync bar, this will be a noop since the file has not yet been
# modified in the current transaction. The goal here is to clear
# BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
$ xfs_io -c "fsync" /mnt/bar
# Now rename both files, without changing their parent directory.
$ mv /mnt/bar /mnt/bar2
$ mv /mnt/foo /mnt/foo2
# Increase the size of bar2 with a truncate operation.
$ xfs_io -c "truncate 2M" /mnt/bar2
# Now fsync foo2, this results in logging its parent inode (the root
# directory), and logging the parent results in logging the inode of
# file bar2 (its inode item and the new name). The inode of file bar2
# is logged with an i_size of 0 bytes since it's logged in
# LOG_INODE_EXISTS mode, meaning we are only logging its names (and
# xattrs if it had any) and the i_size of the inode will not be changed
# when the log is replayed.
$ xfs_io -c "fsync" /mnt/foo2
# Now explicitly fsync bar2. This resulted in doing nothing, not
# logging the inode with the new i_size of 2M and the hole from file
# offset 1M to 2M. Because the inode did not have the flag
# BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
# fsync of file foo2, its last_log_commit field was updated,
# resulting in this explicit of file bar2 not doing anything.
$ xfs_io -c "fsync" /mnt/bar2
# File bar2 content and size before a power failure.
$ od -A d -t x1 /mnt/bar2 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
* 1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
* 2097152
<power failure>
# Mount the filesystem to replay the log.
$ mount /dev/sdc /mnt
# Read the file again, should have the same content and size as before
# the power failure happened, but it doesn't, i_size is still at 1M.
$ od -A d -t x1 /mnt/bar2 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
* 1048576
This started to happen after commit 209ecbb8585bf6 ("btrfs: remove stale
comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
no longer checks if the inode's list of modified extents is not empty.
However, checking that list is not the right way to address this case
and the check was added long time ago in commit 125c4cf9f37c98
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
for a different purpose, to address consecutive ranged fsyncs.
The reason that checking for the list emptiness makes this test pass is
because during an expanding truncate we create an extent map to represent
a hole from the old i_size to the new i_size, and add that extent map to
the list of modified extents in the inode. However if we are low on
available memory and we can not allocate a new extent map, then we don't
treat it as an error and just set the full sync flag on the inode, so that
the next fsync does not rely on the list of modified extents - so checking
for the emptiness of the list to decide if the inode needs to be logged is
not reliable, and results in not logging the inode if it was not possible
to allocate the extent map for the hole.
Fix this by ensuring that if we are only logging that an inode exists
(inode item, names/references and xattrs), we don't update the inode's
last_log_commit even if it does not have the full sync runtime flag set.
A test case for fstests follows soon.
CC: stable@vger.kernel.org # 5.13+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
dpaa2-switch: seed the buffer pool after allocating the swp
Any interraction with the buffer pool (seeding a buffer, acquire one) is
made through a software portal (SWP, a DPIO object).
There are circumstances where the dpaa2-switch driver probes on a DPSW
before any DPIO devices have been probed. In this case, seeding of the
buffer pool will lead to a panic since no SWPs are initialized.
To fix this, seed the buffer pool after making sure that the software
portals have been probed and are ready to be used.
Fixes: 0b1b71370458 ("staging: dpaa2-switch: handle Rx path on control interface") Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yajun Deng [Thu, 22 Jul 2021 03:23:43 +0000 (11:23 +0800)]
net: sched: cls_api: Fix the the wrong parameter
The 4th parameter in tc_chain_notify() should be flags rather than seq.
Let's change it back correctly.
Fixes: 32a4f5ecd738 ("net: sched: introduce chain object to uapi") Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Wed, 21 Jul 2021 22:33:36 +0000 (15:33 -0700)]
net: sparx5: fix unmet dependencies warning
WARNING: unmet direct dependencies detected for PHY_SPARX5_SERDES
Depends on [n]: (ARCH_SPARX5 || COMPILE_TEST [=n]) && OF [=y] && HAS_IOMEM [=y]
Selected by [y]:
- SPARX5_SWITCH [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_MICROCHIP [=y] && NET_SWITCHDEV [=y] && HAS_IOMEM [=y] && OF [=y]
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Lars Povlsen <lars.povlsen@microchip.com> Cc: Steen Hegelund <Steen.Hegelund@microchip.com> Cc: UNGLinuxDriver@microchip.com Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>