erofs: support parsing big pcluster compact indexes
Different from non-compact indexes, several lclusters are packed
as the compact form at once and an unique base blkaddr is stored for
each pack, so each lcluster index would take less space on avarage
(e.g. 2 bytes for COMPACT_2B.) btw, that is also why BIG_PCLUSTER
switch should be consistent for compact head0/1.
Prior to big pcluster, the size of all pclusters was 1 lcluster.
Therefore, when a new HEAD lcluster was scanned, blkaddr would be
bumped by 1 lcluster. However, that way doesn't work anymore for
big pcluster since we actually don't know the compressed size of
pclusters in advance (before reading CBLKCNT lcluster).
So, instead, let blkaddr of each pack be the first pcluster blkaddr
with a valid CBLKCNT, in detail,
1) if CBLKCNT starts at the pack, this first valid pcluster is
itself, e.g.
_____________________________________________________________
|_CBLKCNT0_|_NONHEAD_| .. |_HEAD_|_CBLKCNT1_| ... |_HEAD_| ...
^ = blkaddr base ^ += CBLKCNT0 ^ += CBLKCNT1
2) if CBLKCNT doesn't start at the pack, the first valid pcluster
is the next pcluster, e.g.
_________________________________________________________
| NONHEAD_| .. |_HEAD_|_CBLKCNT0_| ... |_HEAD_|_HEAD_| ...
^ = blkaddr base ^ += CBLKCNT0
^ += 1
When a CBLKCNT is found, blkaddr will be increased by CBLKCNT
lclusters, or a new HEAD is found immediately, bump blkaddr by 1
instead (see the picture above.)
Also noted if CBLKCNT is the end of the pack, instead of storing
delta1 (distance of the next HEAD lcluster) as normal NONHEADs,
it still uses the compressed block count (delta0) since delta1
can be calculated indirectly but the block count can't.
Adjust decoding logic to fit big pcluster compact indexes as well.
erofs: support parsing big pcluster compress indexes
When INCOMPAT_BIG_PCLUSTER sb feature is enabled, legacy compress indexes
will also have the same on-disk header compact indexes to keep per-file
configurations instead of leaving it zeroed.
If ADVISE_BIG_PCLUSTER is set for a file, CBLKCNT will be loaded for each
pcluster in this file by parsing 1st non-head lcluster.
erofs: adjust per-CPU buffers according to max_pclusterblks
Adjust per-CPU buffers on demand since big pcluster definition is
available. Also, bail out unsupported pcluster size according to
Z_EROFS_PCLUSTER_MAX_SIZE.
Big pcluster indicates the size of compressed data for each physical
pcluster is no longer fixed as block size, but could be more than 1
block (more accurately, 1 logical pcluster)
When big pcluster feature is enabled for head0/1, delta0 of the 1st
non-head lcluster index will keep block count of this pcluster in
lcluster size instead of 1. Or, the compressed size of pcluster
should be 1 lcluster if pcluster has no non-head lcluster index.
Also note that BIG_PCLUSTER feature reuses COMPR_CFGS feature since
it depends on COMPR_CFGS and will be released together.
erofs: fix up inplace I/O pointer for big pcluster
When picking up inplace I/O pages, it should be traversed in reverse
order in aligned with the traversal order of file-backed online pages.
Also, index should be updated together when preloading compressed pages.
Previously, only page-sized pclustersize was supported so no problem
at all. Also rename `compressedpages' to `icpage_ptr' to reflect its
functionality.
Since multiple pcluster sizes could be used at once, the number of
compressed pages will become a variable factor. It's necessary to
introduce slab pools rather than a single slab cache now.
This limits the pclustersize to 1M (Z_EROFS_PCLUSTER_MAX_SIZE), and
get rid of the obsolete EROFS_FS_CLUSTER_PAGE_LIMIT, which has no
use now.
To deal the with the cases which inplace decompression is infeasible
for some inplace I/O. Per-CPU buffers was introduced to get rid of page
allocation latency and thrash for low-latency decompression algorithms
such as lz4.
For the big pcluster feature, introduce multipage per-CPU buffers to
keep such inplace I/O pclusters temporarily as well but note that
per-CPU pages are just consecutive virtually.
When a new big pcluster fs is mounted, its max pclustersize will be
read and per-CPU buffers can be growed if needed. Shrinking adjustable
per-CPU buffers is more complex (because we don't know if such size
is still be used), so currently just release them all when unloading.
Formal big pcluster design is actually more powerful / flexable than
the previous thought whose pclustersize was fixed as power-of-2 blocks,
which was obviously inefficient and space-wasting. Instead, pclustersize
can now be set independently for each pcluster, so various pcluster
sizes can also be used together in one file if mkfs wants (for example,
according to data type and/or compression ratio).
Let's get rid of previous physical_clusterbits[] setting (also notice
that corresponding on-disk fields are still 0 for now). Therefore,
head1/2 can be used for at most 2 different algorithms in one file and
again pclustersize is now independent of these.
Gao Xiang [Mon, 29 Mar 2021 10:00:12 +0000 (18:00 +0800)]
erofs: add on-disk compression configurations
Add a bitmap for available compression algorithms and a variable-sized
on-disk table for compression options in preparation for upcoming big
pcluster and LZMA algorithm, which follows the end of super block.
To parse the compression options, the bitmap is scanned one by one.
For each available algorithm, there is data followed by 2-byte `length'
correspondingly (it's enough for most cases, or entire fs blocks should
be used.)
With such available algorithm bitmap, kernel itself can also refuse to
mount such filesystem if any unsupported compression algorithm exists.
Note that COMPR_CFGS feature will be enabled with BIG_PCLUSTER.
Huang Jianan [Mon, 29 Mar 2021 01:23:06 +0000 (09:23 +0800)]
erofs: support adjust lz4 history window size
lz4 uses LZ4_DISTANCE_MAX to record history preservation. When
using rolling decompression, a block with a higher compression
ratio will cause a larger memory allocation (up to 64k). It may
cause a large resource burden in extreme cases on devices with
small memory and a large number of concurrent IOs. So appropriately
reducing this value can improve performance.
Decreasing this value will reduce the compression ratio (except
when input_size <LZ4_DISTANCE_MAX). But considering that erofs
currently only supports 4k output, reducing this value will not
significantly reduce the compression benefits.
The maximum value of LZ4_DISTANCE_MAX defined by lz4 is 64k, and
we can only reduce this value. For the old kernel, it just can't
reduce the memory allocation during rolling decompression without
affecting the decompression result.
Yue Hu [Thu, 25 Mar 2021 07:10:08 +0000 (15:10 +0800)]
erofs: don't use erofs_map_blocks() any more
Currently, erofs_map_blocks() will be called only from
erofs_{bmap, read_raw_page} which are all for uncompressed files.
So, the compression branch in erofs_map_blocks() is pointless. Let's
remove it and use erofs_map_blocks_flatmode() directly. Also update
related comments.
Gao Xiang [Sun, 21 Mar 2021 18:32:27 +0000 (02:32 +0800)]
erofs: complete a missing case for inplace I/O
Add a missing case which could cause unnecessary page allocation but
not directly use inplace I/O instead, which increases runtime extra
memory footprint.
The detail is, considering an online file-backed page, the right half
of the page is chosen to be cached (e.g. the end page of a readahead
request) and some of its data doesn't exist in managed cache, so the
pcluster will be definitely kept in the submission chain. (IOWs, it
cannot be decompressed without I/O, e.g., due to the bypass queue).
Currently, DELAYEDALLOC/TRYALLOC cases can be downgraded as NOINPLACE,
and stop online pages from inplace I/O. After this patch, unneeded page
allocations won't be observed in pickup_page_for_submission() then.
Huang Jianan [Wed, 17 Mar 2021 03:54:48 +0000 (11:54 +0800)]
erofs: use sync decompression for atomic contexts only
Sync decompression was introduced to get rid of additional kworker
scheduling overhead. But there is no such overhead in non-atomic
contexts. Therefore, it should be better to turn off sync decompression
to avoid the current thread waiting in z_erofs_runqueue.
Huang Jianan [Wed, 17 Mar 2021 03:54:47 +0000 (11:54 +0800)]
erofs: use workqueue decompression for atomic contexts only
z_erofs_decompressqueue_endio may not be executed in the atomic
context, for example, when dm-verity is turned on. In this scenario,
data can be decompressed directly to get rid of additional kworker
scheduling overhead.
Huang Jianan [Tue, 16 Mar 2021 03:15:14 +0000 (11:15 +0800)]
erofs: avoid memory allocation failure during rolling decompression
Currently, err would be treated as io error. Therefore, it'd be
better to ensure memory allocation during rolling decompression
to avoid such io error.
In the long term, we might consider adding another !Uptodate case
for such case.
Linus Torvalds [Sun, 28 Mar 2021 20:22:54 +0000 (13:22 -0700)]
Merge tag 'perf-tools-fixes-for-v5.12-2020-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tooling fixes from Arnaldo Carvalho de Melo:
- Avoid write of uninitialized memory when generating PERF_RECORD_MMAP*
records.
- Fix 'perf top' BPF support related crash with perf_event_paranoid=3 +
kptr_restrict.
- Validate raw event with sysfs exported format bits.
- Fix waipid on SIGCHLD delivery bugs in 'perf daemon'.
- Change to use bash for daemon test on Debian, where the default is
dash and thus fails for use of bashisms in this test.
- Fix memory leak in vDSO found using ASAN.
- Remove now useless (due to the fact that BPF now supports static
vars) failing sub test "BPF relocation checker".
- Fix auxtrace queue conflict.
- Sync linux/kvm.h with the kernel sources.
* tag 'perf-tools-fixes-for-v5.12-2020-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf test: Change to use bash for daemon test
perf record: Fix memory leak in vDSO found using ASAN
perf test: Remove now useless failing sub test "BPF relocation checker"
perf daemon: Return from kill functions
perf daemon: Force waipid for all session on SIGCHLD delivery
perf top: Fix BPF support related crash with perf_event_paranoid=3 + kptr_restrict
perf pmu: Validate raw event with sysfs exported format bits
perf synthetic events: Avoid write of uninitialized memory when generating PERF_RECORD_MMAP* records
tools headers UAPI: Sync linux/kvm.h with the kernel sources
perf synthetic-events: Fix uninitialized 'kernel_thread' variable
perf auxtrace: Fix auxtrace queue conflict
Linus Torvalds [Sun, 28 Mar 2021 19:19:16 +0000 (12:19 -0700)]
Merge tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Two fixes:
- Fix build failure on Ubuntu with new GCC packages that turn
on -fcf-protection
- Fix SME memory encryption PTE encoding bug - AFAICT the code
worked on 4K page sizes (level 1) but had the wrong shift at
higher page level orders (level 2 and higher)"
* tag 'x86-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/build: Turn off -fcf-protection for realmode targets
x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()
Linus Torvalds [Sun, 28 Mar 2021 19:12:22 +0000 (12:12 -0700)]
Merge tag 'locking-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix the non-debug mutex_lock_io_nested() method to map to
mutex_lock_io() instead of mutex_lock().
Right now nothing uses this API explicitly, but this is an
accident waiting to happen"
* tag 'locking-urgent-2021-03-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/mutex: Fix non debug version of mutex_lock_io_nested()
Linus Torvalds [Sun, 28 Mar 2021 19:06:21 +0000 (12:06 -0700)]
Merge tag '5.12-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Five cifs/smb3 fixes, two for stable.
Includes an important fix for encryption and an ACL fix, as well as a
fix for possible reflink data corruption"
* tag '5.12-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix cached file size problems in duplicate extents (reflink)
cifs: Silently ignore unknown oplock break handle
cifs: revalidate mapping when we open files for SMB1 POSIX
cifs: Fix chmod with modefromsid when an older ACE already exists.
cifs: Adjust key sizes and key generation routines for AES256 encryption
Linus Torvalds [Sun, 28 Mar 2021 18:42:05 +0000 (11:42 -0700)]
Merge tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Use thread info versions of flag testing, as discussed last week.
- The series enabling PF_IO_WORKER to just take signals, instead of
needing to special case that they do not in a bunch of places. Ends
up being pretty trivial to do, and then we can revert all the special
casing we're currently doing.
- Kill dead pointer assignment
- Fix hashed part of async work queue trace
- Fix sign extension issue for IORING_OP_PROVIDE_BUFFERS
- Fix a link completion ordering regression in this merge window
- Cancellation fixes
* tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block:
io_uring: remove unsued assignment to pointer io
io_uring: don't cancel extra on files match
io_uring: don't cancel-track common timeouts
io_uring: do post-completion chore on t-out cancel
io_uring: fix timeout cancel return code
Revert "signal: don't allow STOP on PF_IO_WORKER threads"
Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"
Revert "kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals"
Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"
kernel: stop masking signals in create_io_thread()
io_uring: handle signals for IO threads like a normal thread
kernel: don't call do_exit() for PF_IO_WORKER threads
io_uring: maintain CQE order of a failed link
io-wq: fix race around pending work on teardown
io_uring: do ctx sqd ejection in a clear context
io_uring: fix provide_buffers sign extension
io_uring: don't skip file_end_write() on reissue
io_uring: correct io_queue_async_work() traces
io_uring: don't use {test,clear}_tsk_thread_flag() for current
Linus Torvalds [Sun, 28 Mar 2021 18:37:42 +0000 (11:37 -0700)]
Merge tag 'block-5.12-2021-03-27' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- Fix regression from this merge window with the xarray partition
change, which allowed partition counts that overflow the u8 that
holds the partition number (Ming)
- Fix zone append warning (Johannes)
- Segmentation count fix for multipage bvecs (David)
- Partition scan fix (Chris)
* tag 'block-5.12-2021-03-27' of git://git.kernel.dk/linux-block:
block: don't create too many partitions
block: support zone append bvecs
block: recalculate segment count for multi-segment discards correctly
block: clear GD_NEED_PART_SCAN later in bdev_disk_changed
Linus Torvalds [Sun, 28 Mar 2021 18:34:47 +0000 (11:34 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Seven fixes, all in drivers (qla2xxx, mkt3sas, qedi, target,
ibmvscsi).
The most serious are the target pscsi oom and the qla2xxx revert which
can otherwise cause a use after free"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: target: pscsi: Clean up after failure in pscsi_map_sg()
scsi: target: pscsi: Avoid OOM in pscsi_map_sg()
scsi: mpt3sas: Fix error return code of mpt3sas_base_attach()
scsi: qedi: Fix error return code of qedi_alloc_global_queues()
scsi: Revert "qla2xxx: Make sure that aborted commands are freed"
scsi: ibmvfc: Make ibmvfc_wait_for_ops() MQ aware
scsi: ibmvfc: Fix potential race in ibmvfc_wait_for_ops()
Pavel Begunkov [Thu, 25 Mar 2021 18:32:45 +0000 (18:32 +0000)]
io_uring: don't cancel extra on files match
As tasks always wait and kill their io-wq on exec/exit, files are of no
more concern to us, so we don't need to specifically cancel them by hand
in those cases. Moreover we should not, because io_match_task() looks at
req->task->files now, which is always true and so leads to extra
cancellations, that wasn't a case before per-task io-wq.
Pavel Begunkov [Thu, 25 Mar 2021 18:32:44 +0000 (18:32 +0000)]
io_uring: don't cancel-track common timeouts
Don't account usual timeouts (i.e. not linked) as REQ_F_INFLIGHT but
keep behaviour prior to dd59a3d595cc1 ("io_uring: reliably cancel linked
timeouts").
Pavel Begunkov [Thu, 25 Mar 2021 18:32:43 +0000 (18:32 +0000)]
io_uring: do post-completion chore on t-out cancel
Don't forget about io_commit_cqring() + io_cqring_ev_posted() after
exit/exec cancelling timeouts. Both functions declared only after
io_kill_timeouts(), so to avoid tons of forward declarations move
it down.
Before IO threads accepted signals, the freezer using take signals to wake
up an IO thread would cause them to loop without any way to clear the
pending signal. That is no longer the case, so stop special casing
PF_IO_WORKER in the freezer.
The IO threads do allow signals now, including SIGSTOP, and we can allow
ptrace attach. Attaching won't reveal anything interesting for the IO
threads, but it will allow eg gdb to attach to a task with io_urings
and IO threads without complaining. And once attached, it will allow
the usual introspection into regular threads.
Jens Axboe [Fri, 26 Mar 2021 15:05:22 +0000 (09:05 -0600)]
kernel: stop masking signals in create_io_thread()
This is racy - move the blocking into when the task is created and
we're marking it as PF_IO_WORKER anyway. The IO threads are now
prepared to handle signals like SIGSTOP as well, so clear that from
the mask to allow proper stopping of IO threads.
Jens Axboe [Fri, 26 Mar 2021 00:16:06 +0000 (18:16 -0600)]
io_uring: handle signals for IO threads like a normal thread
We go through various hoops to disallow signals for the IO threads, but
there's really no reason why we cannot just allow them. The IO threads
never return to userspace like a normal thread, and hence don't go through
normal signal processing. Instead, just check for a pending signal as part
of the work loop, and call get_signal() to handle it for us if anything
is pending.
With that, we can support receiving signals, including special ones like
SIGSTOP.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 27 Mar 2021 07:13:09 +0000 (15:13 +0800)]
block: don't create too many partitions
Commit a33df75c6328 ("block: use an xarray for disk->part_tbl") drops the
check on max supported number of partitionsr, and allows partition with
bigger partition numbers to be added. However, ->bd_partno is defined as
u8, so partition index of xarray table may not match with ->bd_partno.
Then delete_partition() may delete one unmatched partition, and caused
use-after-free.
Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: syzbot+8fede7e30c7cee0de139@syzkaller.appspotmail.com Fixes: a33df75c6328 ("block: use an xarray for disk->part_tbl") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Steve French [Fri, 26 Mar 2021 23:41:55 +0000 (18:41 -0500)]
smb3: fix cached file size problems in duplicate extents (reflink)
There were two problems (one of which could cause data corruption)
that were noticed with duplicate extents (ie reflink)
when debugging why various xfstests were being incorrectly skipped
(e.g. generic/138, generic/140, generic/142). First, we were not
updating the file size locally in the cache when extending a
file due to reflink (it would refresh after actimeo expires)
but xfstest was checking the size immediately which was still
0 so caused the test to be skipped. Second, we were setting
the target file size (which could shrink the file) in all cases
to the end of the reflinked range rather than only setting the
target file size when reflink would extend the file.
CC: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
Make SMB2 not print out an error when an oplock break is received for an
unknown handle, similar to SMB1. The debug message which is printed for
these unknown handles may also be misleading, so fix that too.
The SMB2 lease break path is not affected by this patch.
Without this, a program which writes to a file from one thread, and
opens, reads, and writes the same file from another thread triggers the
below errors several times a minute when run against a Samba server
configured with "smb2 leases = no".
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Reviewed-by: Tom Talpey <tom@talpey.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Under SMB1 + POSIX, if an inode is reused on a server after we have read and
cached a part of a file, when we then open the new file with the
re-cycled inode there is a chance that we may serve the old data out of cache
to the application.
This only happens for SMB1 (deprecated) and when posix are used.
The simplest solution to avoid this race is to force a revalidate
on smb1-posix open.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Shyam Prasad N [Fri, 26 Mar 2021 10:28:16 +0000 (10:28 +0000)]
cifs: Fix chmod with modefromsid when an older ACE already exists.
My recent fixes to cifsacl to maintain inherited ACEs had
regressed modefromsid when an older ACL already exists.
Found testing xfstest 495 with modefromsid mount option
Fixes: f5065508897a ("cifs: Retain old ACEs when converting between mode bits and ACL") Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
Jens Axboe [Fri, 26 Mar 2021 14:57:10 +0000 (08:57 -0600)]
kernel: don't call do_exit() for PF_IO_WORKER threads
Right now we're never calling get_signal() from PF_IO_WORKER threads, but
in preparation for doing so, don't handle a fatal signal for them. The
workers have state they need to cleanup when exiting, so just return
instead of calling do_exit() on their behalf. The threads themselves will
detect a fatal signal and do proper shutdown.
- Fix DM core's zoned model and zone sectors checks.
- Fix spurious "detected capacity change" pr_info() when creating new
DM device.
- Fix DM ioctl out of bounds array access in handling of
DM_LIST_DEVICES_CMD when no devices exist.
* tag 'for-5.12/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm ioctl: fix out of bounds array access when no devices
dm: don't report "detected capacity change" on device creation
dm table: Fix zoned model check and zone sectors check
dm verity: fix DM_VERITY_OPTS_MAX value
Mikulas Patocka [Fri, 26 Mar 2021 18:32:32 +0000 (14:32 -0400)]
dm ioctl: fix out of bounds array access when no devices
If there are not any dm devices, we need to zero the "dev" argument in
the first structure dm_name_list. However, this can cause out of
bounds write, because the "needed" variable is zero and len may be
less than eight.
Fix this bug by reporting DM_BUFFER_FULL_FLAG if the result buffer is
too small to hold the "nl->dev" value.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Linus Torvalds [Fri, 26 Mar 2021 18:33:39 +0000 (11:33 -0700)]
Merge tag 'acpi-5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix a memory management regression in ACPICA, repair an ACPI
blacklist entry damaged inadvertently during the 5.11 cycle and fix
the bookkeeping of devices with the same primary device ID in the ACPI
core.
Specifics:
- Make ACPICA use the same object cache consistently when allocating
and freeing objects (Vegard Nossum)
- Add a callback pointer removed inadvertently during the 5.11 cycle
to the ACPI backlight blacklist entry for Sony VPCEH3U1E (Chris
Chiu)
- Make the ACPI device enumeration core use IDA for creating names of
ACPI device objects with the same primary device ID to avoid using
duplicate device object names in some cases (Andy Shevchenko)"
* tag 'acpi-5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Always create namespace nodes using acpi_ns_create_node()
ACPI: scan: Use unique number for instance_no
ACPI: video: Add missing callback back for Sony VPCEH3U1E
Linus Torvalds [Fri, 26 Mar 2021 18:29:36 +0000 (11:29 -0700)]
Merge tag 'pm-5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix an issue related to device links in the runtime PM framework
and debugfs usage in the Energy Model code.
Specifics:
- Modify the runtime PM device suspend to avoid suspending supplier
devices before the consumer device's status changes to
RPM_SUSPENDED (Rafael Wysocki)
- Change the Energy Model code to prevent it from attempting to
create its main debugfs directory too early (Lukasz Luba)"
* tag 'pm-5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: EM: postpone creating the debugfs dir till fs_initcall
PM: runtime: Defer suspending suppliers
Linus Torvalds [Fri, 26 Mar 2021 18:19:38 +0000 (11:19 -0700)]
Merge tag 'soc-fixes-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Too many fixes have accumulated in the soc tree, so this is a fairly
large set. As usual, most of the fixes are for devicetree files, but
there are also notable code changes for imx and omap regressions as
well as some maintainer file updates.
imx:
- Fix an Ethernet issue on imx6ul-14x14-evk board that is caused by
independent PHY reset.
- Add missing `dma-coherent` property for LayerScape device trees to
fix a kernel BUG report.
- Use IRQCHIP_DECLARE for AVIC driver to fix a boot issue on i.MX25
with fw_devlink=on.
- Add missing I2C pinctrl entry for imx8mp-phyboard-pollux-rdk board
to fix the broken I2C GPIO recovery support.
- Add `fsl,use-minimum-ecc` property for imx6ull-myir-mys-6ulx-eval
device tree to fix UBI filesystem mount failure.
at91:
- wrong phy address that blocks Ethernet use on boards with sama5d27
SoM1
- restrictive pin possibilities for sam9x60
omap:
- Fix ocp interconnect bus access error reporting for omap_l3_noc by
setting IRQF_NO_THREAD
- Fix changed mmc slot order regression by adding mmc aliases for
am335x
- Fix smartreflex init regression caused by dropped legacy data
- Fix ti-sysc driver warning on unbind if reset is not deasserted
- Fix flakey reset deassert for dra7 iva
stm32:
- MAINTAINER file updates
broadcom:
- brcmstb SoC ID build fix
- MAINTAINER file updates"
* tag 'soc-fixes-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: Add Alain Volmat as STM32 I2C/SMBUS maintainer
MAINTAINERS: Remove Vincent Abriou for STM/STI DRM drivers.
MAINTAINERS: Update some st.com email addresses to foss.st.com
ARM: dts: imx6ull: fix ubi filesystem mount failed
ARM: imx6ul-14x14-evk: Do not reset the Ethernet PHYs independently
arm64: dts: imx8mp-phyboard-pollux-rdk: Add missing pinctrl entry
arm64: dts: ls1012a: mark crypto engine dma coherent
arm64: dts: ls1043a: mark crypto engine dma coherent
arm64: dts: ls1046a: mark crypto engine dma coherent
ARM: imx: avic: Convert to using IRQCHIP_DECLARE
ARM: dts: at91: sam9x60: fix mux-mask to match product's datasheet
ARM: dts: at91: sam9x60: fix mux-mask for PA7 so it can be set to A, B and C
ARM: dts: at91-sama5d27_som1: fix phy address to 7
soc: ti: omap-prm: Fix occasional abort on reset deassert for dra7 iva
bus: ti-sysc: Fix warning on unbind if reset is not deasserted
ARM: OMAP2+: Fix smartreflex init regression after dropping legacy data
soc: ti: omap-prm: Fix reboot issue with invalid pcie reset map for dra7
MAINTAINERS: rectify BROADCOM PMB (POWER MANAGEMENT BUS) DRIVER
ARM: dts: am33xx: add aliases for mmc interfaces
bus: omap_l3_noc: mark l3 irqs as IRQF_NO_THREAD
Linus Torvalds [Fri, 26 Mar 2021 18:15:25 +0000 (11:15 -0700)]
Merge tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains a small series with a more elegant fix of a problem
which was originally fixed in rc2"
* tag 'for-linus-5.12b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
xen/x86: make XEN_BALLOON_MEMORY_HOTPLUG_LIMIT depend on MEMORY_HOTPLUG
Linus Torvalds [Fri, 26 Mar 2021 18:05:18 +0000 (11:05 -0700)]
Merge tag 'drm-fixes-2021-03-26' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"As expected last week things were overly quiet so this week things
seem to have caught up. It still isn't too major.
msm and amdgpu lead the size here, the msm fixes are pretty varied
across the driver, the amdgpu one is mostly the S0ix fixes with some
other minor ones. Otherwise there are a few i915 fixes and one each
for nouveau, etnaviv and rcar-du.
i915:
- DisplayPort LTTPR fixes around link training and limiting it
according to supported spec version.
- Fix enabled_planes bitmask to really represent only logically
enabled planes.
- Fix DSS CTL registers for ICL DSI transcoders
- Fix the GT fence revocation runtime PM logic.
nouveau:
- cursor size regression fix
amdgpu:
- S0ix fixes
- Add PCI ID
- Polaris PCIe DPM fix
- Display fix for high refresh rate monitors"
* tag 'drm-fixes-2021-03-26' of git://anongit.freedesktop.org/drm/drm: (37 commits)
drm/nouveau/kms/nve4-nv108: Limit cursors to 128x128
drm/i915: Fix the GT fence revocation runtime PM logic
drm/amdgpu/display: restore AUX_DPHY_TX_CONTROL for DCN2.x
drm/amdgpu: Add additional Sienna Cichlid PCI ID
drm/amd/pm: workaround for audio noise issue
drm/i915/dsc: fix DSS CTL register usage for ICL DSI transcoders
drm/i915: Fix enabled_planes bitmask
drm/i915: Disable LTTPR support when the LTTPR rev < 1.4
drm/i915: Disable LTTPR support when the DPCD rev < 1.4
drm/i915/ilk-glk: Fix link training on links with LTTPRs
drm/msm/disp/dpu1: icc path needs to be set before dpu runtime resume
drm/amdgpu: skip kfd suspend/resume for S0ix
drm/amdgpu: drop S0ix checks around CG/PG in suspend
drm/amdgpu: skip CG/PG for gfx during S0ix
drm/amdgpu: update comments about s0ix suspend/resume
drm/amdgpu/swsmu: skip gfx cgpg on s0ix suspend
drm/amdgpu: re-enable suspend phase 2 for S0ix
drm/amdgpu: move s0ix check into amdgpu_device_ip_suspend_phase2 (v3)
drm/amdgpu: clean up non-DC suspend/resume handling
drm/amdgpu: don't evict vram on APUs for suspend to ram (v4)
...
Shyam Prasad N [Thu, 25 Mar 2021 12:34:54 +0000 (12:34 +0000)]
cifs: Adjust key sizes and key generation routines for AES256 encryption
For AES256 encryption (GCM and CCM), we need to adjust the size of a few
fields to 32 bytes instead of 16 to accommodate the larger keys.
Also, the L value supplied to the key generator needs to be changed from
to 256 when these algorithms are used.
Keeping the ioctl struct for dumping keys of the same size for now.
Will send out a different patch for that one.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: <stable@vger.kernel.org> # v5.10+ Signed-off-by: Steve French <stfrench@microsoft.com>
Leo Yan [Sat, 20 Mar 2021 10:45:54 +0000 (18:45 +0800)]
perf test: Change to use bash for daemon test
When executing the daemon test on Arm64 and x86 with Debian (Buster)
distro, both skip the test case with the log:
# ./perf test -v 76
76: daemon operations :
--- start ---
test child forked, pid 11687
test daemon list
trap: SIGINT: bad trap
./tests/shell/daemon.sh: 173: local: cpu-clock: bad variable name
test child finished with -2
---- end ----
daemon operations: Skip
So the error happens for the variable expansion when use local variable
in the shell script. Since Debian Buster uses dash but not bash as
non-interactive shell, when execute the daemon testing, it hits a known
issue for dash which was reported [1].
To resolve this issue, one option is to add double quotes for all local
variables assignment, so need to change the code from:
local line=`perf daemon --config ${config} -x: | head -2 | tail -1`
... to:
local line="`perf daemon --config ${config} -x: | head -2 | tail -1`"
But the testing script has bunch of local variables, this leads to big
changes for whole script.
On the other hand, the testing script asks to use the "local" feature
which is bash-specific, so this patch explicitly uses "#!/bin/bash" to
ensure running the script with bash.
After:
# ./perf test -v 76
76: daemon operations :
--- start ---
test child forked, pid 11329
test daemon list
test daemon reconfig
test daemon stop
test daemon signal
signal 12 sent to session 'test [11596]'
signal 12 sent to session 'test [11596]'
test daemon ping
test daemon lock
test child finished with 0
---- end ----
daemon operations: Ok
Linus Torvalds [Thu, 25 Mar 2021 23:46:43 +0000 (16:46 -0700)]
Merge tag 'integrity-v5.12-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity fix from Mimi Zohar:
"Just one patch to address a NULL ptr dereferencing when there is a
mismatch between the user enabled LSMs and IMA/EVM"
* tag 'integrity-v5.12-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
integrity: double check iint_cache was initialized
Linus Torvalds [Thu, 25 Mar 2021 22:38:22 +0000 (15:38 -0700)]
Merge tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes for issues that have some user visibility and are simple enough
for this time of development cycle:
- a few fixes for rescue= mount option, adding more checks for
missing trees
- fix sleeping in atomic context on qgroup deletion
- fix subvolume deletion on mount
- fix build with M= syntax
- fix checksum mismatch error message for direct io"
* tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix check_data_csum() error message for direct I/O
btrfs: fix sleep while in non-sleep context during qgroup removal
btrfs: fix subvolume/snapshot deletion not triggered on mount
btrfs: fix build when using M=fs/btrfs
btrfs: do not initialize dev replace for bad dev root
btrfs: initialize device::fs_info always
btrfs: do not initialize dev stats if we have no dev_root
btrfs: zoned: remove outdated WARN_ON in direct IO
Dave Airlie [Thu, 25 Mar 2021 20:21:18 +0000 (06:21 +1000)]
Merge tag 'drm-intel-fixes-2021-03-25-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- DisplayPort LTTPR fixes around link training and limiting it
according to supported spec version. (Imre)
- Fix enabled_planes bitmask to really represent only logically
enabled planes (Ville).
- Fix DSS CTL registers for ICL DSI transcoders (Jani)
- Fix the GT fence revocation runtime PM logic. (Imre)
Pavel Begunkov [Thu, 25 Mar 2021 19:05:14 +0000 (19:05 +0000)]
io_uring: maintain CQE order of a failed link
Arguably we want CQEs of linked requests be in a strict order of
submission as it always was. Now if init of a request fails its CQE may
be posted before all prior linked requests including the head of the
link. Fix it by failing it last.
Linus Torvalds [Thu, 25 Mar 2021 18:43:43 +0000 (11:43 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: mm (hugetlb, kasan, gup,
selftests, z3fold, kfence, memblock, and highmem), squashfs, ia64,
gcov, and mailmap"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mailmap: update Andrey Konovalov's email address
mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm: memblock: fix section mismatch warning again
kfence: make compatible with kmemleak
gcov: fix clang-11+ support
ia64: fix format strings for err_inject
ia64: mca: allocate early mca with GFP_ATOMIC
squashfs: fix xattr id and id lookup sanity checks
squashfs: fix inode lookup sanity checks
z3fold: prevent reclaim/free race for headless pages
selftests/vm: fix out-of-tree build
mm/mmu_notifiers: ensure range_end() is paired with range_start()
kasan: fix per-page tags for non-page_alloc pages
hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
Linus Torvalds [Thu, 25 Mar 2021 18:23:35 +0000 (11:23 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Not much going on, just some small bug fixes:
- Typo causing a regression in mlx5 devx
- Regression in the recent hns rework causing the HW to get out of
sync
- Long-standing cxgb4 adaptor crash when destroying cm ids"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
RDMA/hns: Fix bug during CMDQ initialization
RDMA/mlx5: Fix typo in destroy_mkey inbox
Linus Torvalds [Thu, 25 Mar 2021 18:07:40 +0000 (11:07 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Minor fixes all over, ranging from typos to tests to errata
workarounds:
- Fix possible memory hotplug failure with KASLR
- Fix FFR value in SVE kselftest
- Fix backtraces reported in /proc/$pid/stack
- Disable broken CnP implementation on NVIDIA Carmel
- Typo fixes and ACPI documentation clarification
- Fix some W=1 warnings"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kernel: disable CNP on Carmel
arm64/process.c: fix Wmissing-prototypes build warnings
kselftest/arm64: sve: Do not use non-canonical FFR register value
arm64: mm: correct the inside linear map range during hotplug check
arm64: kdump: update ppos when reading elfcorehdr
arm64: cpuinfo: Fix a typo
Documentation: arm64/acpi : clarify arm64 support of IBFT
arm64: stacktrace: don't trace arch_stack_walk()
arm64: csum: cast to the proper type
Ira Weiny [Thu, 25 Mar 2021 04:37:53 +0000 (21:37 -0700)]
mm/highmem: fix CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
The kernel test robot found that __kmap_local_sched_out() was not
correctly skipping the guard pages when DEBUG_KMAP_LOCAL_FORCE_MAP was
set.[1] This was due to DEBUG_HIGHMEM check being used.
Mike Rapoport [Thu, 25 Mar 2021 04:37:50 +0000 (21:37 -0700)]
mm: memblock: fix section mismatch warning again
Commit 34dc2efb39a2 ("memblock: fix section mismatch warning") marked
memblock_bottom_up() and memblock_set_bottom_up() as __init, but they
could be referenced from non-init functions like
memblock_find_in_range_node() on architectures that enable
CONFIG_ARCH_KEEP_MEMBLOCK.
For such builds kernel test robot reports:
WARNING: modpost: vmlinux.o(.text+0x74fea4): Section mismatch in reference from the function memblock_find_in_range_node() to the function .init.text:memblock_bottom_up()
The function memblock_find_in_range_node() references the function __init memblock_bottom_up().
This is often because memblock_find_in_range_node lacks a __init annotation or the annotation of memblock_bottom_up is wrong.
Replace __init annotations with __init_memblock annotations so that the
appropriate section will be selected depending on
CONFIG_ARCH_KEEP_MEMBLOCK.
Link: https://lore.kernel.org/lkml/202103160133.UzhgY0wt-lkp@intel.com Link: https://lkml.kernel.org/r/20210316171347.14084-1-rppt@kernel.org Fixes: 34dc2efb39a2 ("memblock: fix section mismatch warning") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco Elver [Thu, 25 Mar 2021 04:37:47 +0000 (21:37 -0700)]
kfence: make compatible with kmemleak
Because memblock allocations are registered with kmemleak, the KFENCE
pool was seen by kmemleak as one large object. Later allocations
through kfence_alloc() that were registered with kmemleak via
slab_post_alloc_hook() would then overlap and trigger a warning.
Therefore, once the pool is initialized, we can remove (free) it from
kmemleak again, since it should be treated as allocator-internal and be
seen as "free memory".
The second problem is that kmemleak is passed the rounded size, and not
the originally requested size, which is also the size of KFENCE objects.
To avoid kmemleak scanning past the end of an object and trigger a
KFENCE out-of-bounds error, fix the size if it is a KFENCE object.
For simplicity, to avoid a call to kfence_ksize() in
slab_post_alloc_hook() (and avoid new IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
guard), just call kfence_ksize() in mm/kmemleak.c:create_object().
Link: https://lkml.kernel.org/r/20210317084740.3099921-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Reported-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Luis Henriques <lhenriques@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Desaulniers [Thu, 25 Mar 2021 04:37:44 +0000 (21:37 -0700)]
gcov: fix clang-11+ support
LLVM changed the expected function signatures for llvm_gcda_start_file()
and llvm_gcda_emit_function() in the clang-11 release. Users of
clang-11 or newer may have noticed their kernels failing to boot due to
a panic when enabling CONFIG_GCOV_KERNEL=y +CONFIG_GCOV_PROFILE_ALL=y.
Fix up the function signatures so calling these functions doesn't panic
the kernel.
arch/ia64/kernel/err_inject.c: In function 'show_resources':
arch/ia64/kernel/err_inject.c:62:22: warning:
format '%lx' expects argument of type 'long unsigned int',
but argument 3 has type 'u64' {aka 'long long unsigned int'}
62 | return sprintf(buf, "%lx", name[cpu]); \
| ^~~~~~~
The sleep warning happens at early boot right at secondary CPU
activation bootup:
smp: Bringing up secondary CPUs ...
BUG: sleeping function called from invalid context at mm/page_alloc.c:4942
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.12.0-rc2-00007-g79e228d0b611-dirty #99
..
Call Trace:
show_stack+0x90/0xc0
dump_stack+0x150/0x1c0
___might_sleep+0x1c0/0x2a0
__might_sleep+0xa0/0x160
__alloc_pages_nodemask+0x1a0/0x600
alloc_page_interleave+0x30/0x1c0
alloc_pages_current+0x2c0/0x340
__get_free_pages+0x30/0xa0
ia64_mca_cpu_init+0x2d0/0x3a0
cpu_init+0x8b0/0x1440
start_secondary+0x60/0x700
start_ap+0x750/0x780
Fixed BSP b0 value from CPU 1
As I understand interrupts are not enabled yet and system has a lot of
memory. There is little chance to sleep and switch to GFP_ATOMIC should
be a no-op.
Thomas Hebb [Thu, 25 Mar 2021 04:37:29 +0000 (21:37 -0700)]
z3fold: prevent reclaim/free race for headless pages
Commit ca0246bb97c2 ("z3fold: fix possible reclaim races") introduced
the PAGE_CLAIMED flag "to avoid racing on a z3fold 'headless' page
release." By atomically testing and setting the bit in each of
z3fold_free() and z3fold_reclaim_page(), a double-free was avoided.
However, commit dcf5aedb24f8 ("z3fold: stricter locking and more careful
reclaim") appears to have unintentionally broken this behavior by moving
the PAGE_CLAIMED check in z3fold_reclaim_page() to after the page lock
gets taken, which only happens for non-headless pages. For headless
pages, the check is now skipped entirely and races can occur again.
mm/mmu_notifiers: ensure range_end() is paired with range_start()
If one or more notifiers fails .invalidate_range_start(), invoke
.invalidate_range_end() for "all" notifiers. If there are multiple
notifiers, those that did not fail are expecting _start() and _end() to
be paired, e.g. KVM's mmu_notifier_count would become imbalanced.
Disallow notifiers that can fail _start() from implementing _end() so
that it's unnecessary to either track which notifiers rejected _start(),
or had already succeeded prior to a failed _start().
Note, the existing behavior of calling _start() on all notifiers even
after a previous notifier failed _start() was an unintented "feature".
Make it canon now that the behavior is depended on for correctness.
As of today, the bug is likely benign:
1. The only caller of the non-blocking notifier is OOM kill.
2. The only notifiers that can fail _start() are the i915 and Nouveau
drivers.
3. The only notifiers that utilize _end() are the SGI UV GRU driver
and KVM.
4. The GRU driver will never coincide with the i195/Nouveau drivers.
5. An imbalanced kvm->mmu_notifier_count only causes soft lockup in the
_guest_, and the guest is already doomed due to being an OOM victim.
Fix the bug now to play nice with future usage, e.g. KVM has a
potential use case for blocking memslot updates in KVM while an
invalidation is in-progress, and failure to unblock would result in said
updates being blocked indefinitely and hanging.
Found by inspection. Verified by adding a second notifier in KVM that
periodically returns -EAGAIN on non-blockable ranges, triggering OOM,
and observing that KVM exits with an elevated notifier count.
Andrey Konovalov [Thu, 25 Mar 2021 04:37:20 +0000 (21:37 -0700)]
kasan: fix per-page tags for non-page_alloc pages
To allow performing tag checks on page_alloc addresses obtained via
page_address(), tag-based KASAN modes store tags for page_alloc
allocations in page->flags.
Currently, the default tag value stored in page->flags is 0x00.
Therefore, page_address() returns a 0x00ffff... address for pages that
were not allocated via page_alloc.
This might cause problems. A particular case we encountered is a
conflict with KFENCE. If a KFENCE-allocated slab object is being freed
via kfree(page_address(page) + offset), the address passed to kfree()
will get tagged with 0x00 (as slab pages keep the default per-page
tags). This leads to is_kfence_address() check failing, and a KFENCE
object ending up in normal slab freelist, which causes memory
corruptions.
This patch changes the way KASAN stores tag in page-flags: they are now
stored xor'ed with 0xff. This way, KASAN doesn't need to initialize
per-page flags for every created page, which might be slow.
With this change, page_address() returns natively-tagged (with 0xff)
pointers for pages that didn't have tags set explicitly.
This patch fixes the encountered conflict with KFENCE and prevents more
similar issues that can occur in the future.
Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Thu, 25 Mar 2021 04:37:17 +0000 (21:37 -0700)]
hugetlb_cgroup: fix imbalanced css_get and css_put pair for shared mappings
The current implementation of hugetlb_cgroup for shared mappings could
have different behavior. Consider the following two scenarios:
1.Assume initial css reference count of hugetlb_cgroup is 1:
1.1 Call hugetlb_reserve_pages with from = 1, to = 2. So css reference
count is 2 associated with 1 file_region.
1.2 Call hugetlb_reserve_pages with from = 2, to = 3. So css reference
count is 3 associated with 2 file_region.
1.3 coalesce_file_region will coalesce these two file_regions into
one. So css reference count is 3 associated with 1 file_region
now.
2.Assume initial css reference count of hugetlb_cgroup is 1 again:
2.1 Call hugetlb_reserve_pages with from = 1, to = 3. So css reference
count is 2 associated with 1 file_region.
Therefore, we might have one file_region while holding one or more css
reference counts. This inconsistency could lead to imbalanced css_get()
and css_put() pair. If we do css_put one by one (i.g. hole punch case),
scenario 2 would put one more css reference. If we do css_put all
together (i.g. truncate case), scenario 1 will leak one css reference.
The imbalanced css_get() and css_put() pair would result in a non-zero
reference when we try to destroy the hugetlb cgroup. The hugetlb cgroup
directory is removed __but__ associated resource is not freed. This
might result in OOM or can not create a new hugetlb cgroup in a busy
workload ultimately.
In order to fix this, we have to make sure that one file_region must
hold exactly one css reference. So in coalesce_file_region case, we
should release one css reference before coalescence. Also only put css
reference when the entire file_region is removed.
The last thing to note is that the caller of region_add() will only hold
one reference to h_cg->css for the whole contiguous reservation region.
But this area might be scattered when there are already some
file_regions reside in it. As a result, many file_regions may share only
one h_cg->css reference. In order to ensure that one file_region must
hold exactly one css reference, we should do css_get() for each
file_region and release the reference held by caller when they are done.
[linmiaohe@huawei.com: fix imbalanced css_get and css_put pair for shared mappings] Link: https://lkml.kernel.org/r/20210316023002.53921-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210301120540.37076-1-linmiaohe@huawei.com Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings") Reported-by: kernel test robot <lkp@intel.com> (auto build test ERROR) Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
which shouldn't happen, but seems to be possible due to a race on whether
or not the io-wq manager sees a fatal signal first, or whether the io-wq
workers do. If we race with queueing work and then send a fatal signal to
the owning task, and the io-wq worker sees that before the manager sets
IO_WQ_BIT_EXIT, then it's possible to have the worker exit and leave work
behind.
Just turn the WARN_ON_ONCE() into a cancelation condition instead.
RDMA/cxgb4: Fix adapter LE hash errors while destroying ipv6 listening server
Not setting the ipv6 bit while destroying ipv6 listening servers may
result in potential fatal adapter errors due to lookup engine memory hash
errors. Therefore always set ipv6 field while destroying ipv6 listening
servers.
Fixes: 830662f6f032 ("RDMA/cxgb4: Add support for active and passive open connection with IPv6 address") Link: https://lore.kernel.org/r/20210324190453.8171-1-bharat@chelsio.com Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Rich Wiley [Wed, 24 Mar 2021 00:28:09 +0000 (17:28 -0700)]
arm64: kernel: disable CNP on Carmel
On NVIDIA Carmel cores, CNP behaves differently than it does on standard
ARM cores. On Carmel, if two cores have CNP enabled and share an L2 TLB
entry created by core0 for a specific ASID, a non-shareable TLBI from
core1 may still see the shared entry. On standard ARM cores, that TLBI
will invalidate the shared entry as well.
This causes issues with patchsets that attempt to do local TLBIs based
on cpumasks instead of broadcast TLBIs. Avoid these issues by disabling
CNP support for NVIDIA Carmel cores.
Martin Wilck [Tue, 23 Mar 2021 21:24:31 +0000 (22:24 +0100)]
scsi: target: pscsi: Clean up after failure in pscsi_map_sg()
If pscsi_map_sg() fails, make sure to drop references to already allocated
bios.
Link: https://lore.kernel.org/r/20210323212431.15306-2-mwilck@suse.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Martin Wilck [Tue, 23 Mar 2021 21:24:30 +0000 (22:24 +0100)]
scsi: target: pscsi: Avoid OOM in pscsi_map_sg()
pscsi_map_sg() uses the variable nr_pages as a hint for bio_kmalloc() how
many vector elements to allocate. If nr_pages is < BIO_MAX_PAGES, it will
be reset to 0 after successful allocation of the bio.
If bio_add_pc_page() fails later for whatever reason, pscsi_map_sg() tries
to allocate another bio, passing nr_vecs = 0. This causes bio_add_pc_page()
to fail immediately in the next call. pci_map_sg() continues to allocate
zero-length bios until memory is exhausted and the kernel crashes with
OOM. This can be easily observed by exporting a SATA DVD drive via pscsi.
The target crashes as soon as the client tries to access the DVD LUN. In
the case I analyzed, bio_add_pc_page() would fail because the DVD device's
max_sectors_kb (128) was exceeded.
Avoid this by simply not resetting nr_pages to 0 after allocating the
bio. This way, the client receives an I/O error when it tries to send
requests exceeding the devices max_sectors_kb, and eventually gets it
right. The client must still limit max_sectors_kb e.g. by an udev rule if
(like in my case) the driver doesn't report valid block limits, otherwise
it encounters I/O errors.
Link: https://lore.kernel.org/r/20210323212431.15306-1-mwilck@suse.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Jia-Ju Bai [Mon, 8 Mar 2021 03:30:24 +0000 (19:30 -0800)]
scsi: qedi: Fix error return code of qedi_alloc_global_queues()
When kzalloc() returns NULL to qedi->global_queues[i], no error return code
of qedi_alloc_global_queues() is assigned. To fix this bug, status is
assigned with -ENOMEM in this case.
Link: https://lore.kernel.org/r/20210308033024.27147-1-baijiaju1990@gmail.com Fixes: ace7f46ba5fd ("scsi: qedi: Add QLogic FastLinQ offload iSCSI driver framework.") Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Acked-by: Manish Rangankar <mrangankar@marvell.com> Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Bart Van Assche [Sat, 20 Mar 2021 23:23:53 +0000 (16:23 -0700)]
scsi: Revert "qla2xxx: Make sure that aborted commands are freed"
Calling vha->hw->tgt.tgt_ops->free_cmd() from qlt_xmit_response() is wrong
since the command for which a response is sent must remain valid until the
SCSI target core calls .release_cmd(). It has been observed that the
following scenario triggers a kernel crash:
- transport_handle_queue_full() tries to retransmit the response
Fix this crash by reverting the patch that introduced it.
Link: https://lore.kernel.org/r/20210320232359.941-2-bvanassche@acm.org Fixes: 0dcec41acb85 ("scsi: qla2xxx: Make sure that aborted commands are freed") Cc: Quinn Tran <qutran@marvell.com> Cc: Mike Christie <michael.christie@oracle.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tyrel Datwyler [Fri, 19 Mar 2021 20:50:29 +0000 (14:50 -0600)]
scsi: ibmvfc: Make ibmvfc_wait_for_ops() MQ aware
During MQ enablement of the ibmvfc driver ibmvfc_wait_for_ops() was
missed. This function is responsible for waiting on commands to complete
that match a certain criteria such as LUN or cancel key. The implementation
as is only scans the CRQ for events ignoring any sub-queues and as a result
will exit successfully without doing anything when operating in MQ
channelized mode.
Check the MQ and channel use flags to determine which queues are
applicable, and scan each queue accordingly. Note in MQ mode SCSI commands
are only issued down sub-queues and the CRQ is only used for driver
specific management commands. As such the CRQ events are ignored when
operating in MQ mode with channels.
Link: https://lore.kernel.org/r/20210319205029.312969-3-tyreld@linux.ibm.com Fixes: 9000cb998bcf ("scsi: ibmvfc: Enable MQ and set reasonable defaults") Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Tyrel Datwyler [Fri, 19 Mar 2021 20:50:28 +0000 (14:50 -0600)]
scsi: ibmvfc: Fix potential race in ibmvfc_wait_for_ops()
For various EH activities the ibmvfc driver uses ibmvfc_wait_for_ops() to
wait for the completion of commands that match a given criteria be it
cancel key, or specific LUN. With recent changes commands are completed
outside the lock in bulk by removing them from the sent list and adding
them to a private completion list. This introduces a potential race in
ibmvfc_wait_for_ops() since the criteria for a command to be outstanding is
no longer simply being on the sent list, but instead not being on the free
list.
Avoid this race by scanning the entire command event pool and checking that
any matching command that ibmvfc needs to wait on is not already on the
free list.
Link: https://lore.kernel.org/r/20210319205029.312969-2-tyreld@linux.ibm.com Fixes: 1f4a4a19508d ("scsi: ibmvfc: Complete commands outside the host/queue lock") Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Pull networking fixes from David Miller:
"Various fixes, all over:
1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu.
2) Always store the rx queue mapping in veth, from Maciej
Fijalkowski.
3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov.
4) Fix memory leak in octeontx2-af from Colin Ian King.
5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from
Yonghong Song.
6) Fix tx ptp stats in mlx5, from Aya Levin.
7) Check correct ip version in tun decap, fropm Roi Dayan.
8) Fix rate calculation in mlx5 E-Switch code, from arav Pandit.
9) Work item memork leak in mlx5, from Shay Drory.
10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann.
11) Lack of preemptrion awareness in macvlan, from Eric Dumazet.
12) Fix data race in pxa168_eth, from Pavel Andrianov.
13) Range validate stab in red_check_params(), from Eric Dumazet.
14) Inherit vlan filtering setting properly in b53 driver, from
Florian Fainelli.
15) Fix rtnl locking in igc driver, from Sasha Neftin.
16) Pause handling fixes in igc driver, from Muhammad Husaini
Zulkifli.
17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits.
18) Use after free in qlcnic, from Lv Yunlong.
19) fix crash in fritzpci mISDN, from Tong Zhang.
20) Premature rx buffer reuse in igb, from Li RongQing.
21) Missing termination of ip[a driver message handler arrays, from
Alex Elder.
22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25
driver, from Xie He.
23) Use after free in c_can_pci_remove(), from Tong Zhang.
24) Uninitialized variable use in nl80211, from Jarod Wilson.
25) Off by one size calc in bpf verifier, from Piotr Krysiuk.
26) Use delayed work instead of deferrable for flowtable GC, from
Yinjun Zhang.
27) Fix infinite loop in NPC unmap of octeontx2 driver, from
Hariprasad Kelam.
28) Fix being unable to change MTU of dwmac-sun8i devices due to lack
of fifo sizes, from Corentin Labbe.
29) DMA use after free in r8169 with WoL, fom Heiner Kallweit.
30) Mismatched prototypes in isdn-capi, from Arnd Bergmann.
31) Fix psample UAPI breakage, from Ido Schimmel"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits)
psample: Fix user API breakage
math: Export mul_u64_u64_div_u64
ch_ktls: fix enum-conversion warning
octeontx2-af: Fix memory leak of object buf
ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation
net: bridge: don't notify switchdev for local FDB addresses
net/sched: act_ct: clear post_ct if doing ct_clear
net: dsa: don't assign an error value to tag_ops
isdn: capi: fix mismatched prototypes
net/mlx5: SF, do not use ecpu bit for vhca state processing
net/mlx5e: Fix division by 0 in mlx5e_select_queue
net/mlx5e: Fix error path for ethtool set-priv-flag
net/mlx5e: Offload tuple rewrite for non-CT flows
net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
net/mlx5: Add back multicast stats for uplink representor
net: ipconfig: ic_dev can be NULL in ic_close_devs
MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one
docs: networking: Fix a typo
r8169: fix DMA being used after buffer free if WoL is enabled
net: ipa: fix init header command validation
...
Lyude Paul [Fri, 5 Mar 2021 01:52:41 +0000 (20:52 -0500)]
drm/nouveau/kms/nve4-nv108: Limit cursors to 128x128
While Kepler does technically support 256x256 cursors, it turns out that
Kepler actually has some additional requirements for scanout surfaces that
we're not enforcing correctly, which aren't present on Maxwell and later.
Cursor surfaces must always use small pages (4K), and overlay surfaces must
always use large pages (128K).
Fixing this correctly though will take a bit more work: as we'll need to
add some code in prepare_fb() to move cursor FBs in large pages to small
pages, and vice-versa for overlay FBs. So until we have the time to do
that, just limit cursor surfaces to 128x128 - a size small enough to always
default to small pages.
This means small ovlys are still broken on Kepler, but it is extremely
unlikely anyone cares about those anyway :).
Signed-off-by: Lyude Paul <lyude@redhat.com> Fixes: d3b2f0f7921c ("drm/nouveau/kms/nv50-: Report max cursor size to userspace") Cc: <stable@vger.kernel.org> # v5.11+ Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Ido Schimmel [Wed, 24 Mar 2021 19:43:32 +0000 (21:43 +0200)]
psample: Fix user API breakage
Cited commit added a new attribute before the existing group reference
count attribute, thereby changing its value and breaking existing
applications on new kernels.
Before:
# psample -l
libpsample ERROR psample_group_foreach: failed to recv message: Operation not supported
After:
# psample -l
Group Num Refcount Group Seq
1 1 0
Fix by restoring the value of the old attribute and remove the
misleading comments from the enumerator to avoid future bugs.
Cc: stable@vger.kernel.org Fixes: d8bed686ab96 ("net: psample: Add tunnel support") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reported-by: Adiel Bidani <adielb@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Roger Pau Monne [Wed, 24 Mar 2021 12:24:24 +0000 (13:24 +0100)]
Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
This partially reverts commit 882213990d32 ("xen: fix p2m size in dom0
for disabled memory hotplug case")
There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.
Note that XEN_UNPOPULATED_ALLOC depends on ZONE_DEVICE which pulls in
MEMORY_HOTPLUG.
Leave the check added to __set_phys_to_machine and the adjusted
comment about EXTRA_MEM_RATIO.