The Intel fixed counters use a special table to override the JSON
information.
During this override the period information from the JSON file got
dropped, which results in inst_retired.any and similar running with
frequency mode instead of a period.
When the LBR data and the instructions in a binary do not match the loop
printing instructions could get confused and print a long stream of
bogus <bad> instructions.
The problem was that if the instruction decoder cannot decode an
instruction it ilen wasn't initialized, so the loop going through the
basic block would continue with the previous value.
Harden the code to avoid such problems:
- Make sure ilen is always freshly initialized and is 0 for bad
instructions.
- Do not overrun the code buffer while printing instructions
- Print a warning message if the final jump is not on an instruction
boundary.
Whenever an mmap/mmap2 event occurs, the map tree must be updated to add a new
entry. If a new map overlaps a previous map, the overlapped section of the
previous map is effectively unmapped, but the non-overlapping sections are
still valid.
maps__fixup_overlappings() is responsible for creating any new map entries from
the previously overlapped map. It optionally creates a before and an after map.
When creating the after map the existing code failed to adjust the map.pgoff.
This meant the new after map would incorrectly calculate the file offset
for the ip. This results in incorrect symbol name resolution for any ip in the
after region.
Make maps__fixup_overlappings() correctly populate map.pgoff.
Add an assert that new mapping matches old mapping at the beginning of
the after map.
Committer-testing:
Validated correct parsing of libcoreclr.so symbols from .NET Core 3.0 preview9
(which didn't strip symbols).
Preparation:
~/dotnet3.0-preview9/dotnet new webapi -o perfSymbol
cd perfSymbol
~/dotnet3.0-preview9/dotnet publish
perf record ~/dotnet3.0-preview9/dotnet \
bin/Debug/netcoreapp3.0/publish/perfSymbol.dll
^C
Signed-off-by: Steve MacLean <Steve.MacLean@Microsoft.com> Tested-by: Brian Robbins <brianrob@microsoft.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Cc: John Keeping <john@metanate.com> Cc: John Salem <josalem@microsoft.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: Tom McDonald <thomas.mcdonald@microsoft.com> Link: http://lore.kernel.org/lkml/BN8PR21MB136270949F22A6A02335C238F7800@BN8PR21MB1362.namprd21.prod.outlook.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
make -C tools/perf CLANG=1 CC=clang EXTRA_CFLAGS="-O3
will turn the dereference operation into a ud2 instruction, raising a
SIGILL rather than a SIGSEGV. Use raise(..) for correctness and clarity.
Similar issues were addressed in Numfor Mbiziwo-Tiapo's patch:
https://lkml.org/lkml/2019/7/8/1234
Committer testing:
Before:
[root@quaco ~]# perf test hooks
55: perf hooks : Ok
[root@quaco ~]# perf test -v hooks
55: perf hooks :
--- start ---
test child forked, pid 17092
SIGSEGV is observed as expected, try to recover.
Fatal error (SEGFAULT) in perf hook 'test'
test child finished with 0
---- end ----
perf hooks: Ok
[root@quaco ~]#
After:
[root@quaco ~]# perf test hooks
55: perf hooks : Ok
[root@quaco ~]# perf test -v hooks
55: perf hooks :
--- start ---
test child forked, pid 17909
SIGSEGV is observed as expected, try to recover.
Fatal error (SEGFAULT) in perf hook 'test'
test child finished with 0
---- end ----
perf hooks: Ok
[root@quaco ~]#
Fixes: a074865e60ed ("perf tools: Introduce perf hooks") Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lore.kernel.org/lkml/20190925195924.152834-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
The center temperature of the supported devices stored in the constant
BMC150_ACCEL_TEMP_CENTER_VAL is not 24 degrees but 23 degrees.
It seems that some datasheets were inconsistent on this value leading
to the error. For most usecases will only make minor difference so
not queued for stable.
meson_saradc's irq handler uses priv->regmap so make sure that it is
allocated before the irq get enabled.
This also fixes crash when CONFIG_DEBUG_SHIRQ is enabled, as device
managed resources are freed in the inverted order they had been
allocated, priv->regmap was freed before the spurious fake irq that
CONFIG_DEBUG_SHIRQ adds called the handler.
Fixes: 3af109131b7eb8 ("iio: adc: meson-saradc: switch from polling to interrupt mode") Reported-by: Elie Roudninski <xademax@gmail.com> Signed-off-by: Remi Pommarel <repk@triplefau.lt> Reviewed-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Tested-by: Elie ROUDNINSKI <xademax@gmail.com> Reviewed-by: Kevin Hilman <khilman@baylibre.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.
[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */
/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */
/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().
The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e36a ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug
Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.
[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
If we failed to allocate the data extent(s) for the inode space cache, we
were bailing out without releasing the previously reserved metadata. This
was triggering the following warnings when unmounting a filesystem:
Several fstests test cases triggered this often, such as generic/083,
generic/102, generic/172, generic/269 and generic/300 at least, producing
stack traces like the following in dmesg/syslog:
Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
workqueue stalls") introduced a semaphore to limit the maximum number of
in-flight kcopyd (COW) jobs.
The implementation of this throttling mechanism is prone to a deadlock:
1. One or more threads write to the origin device causing COW, which is
performed by kcopyd.
2. At some point some of these threads might reach the s->cow_count
semaphore limit and block in down(&s->cow_count), holding a read lock
on _origins_lock.
3. Someone tries to acquire a write lock on _origins_lock, e.g.,
snapshot_ctr(), which blocks because the threads at step (2) already
hold a read lock on it.
4. A COW operation completes and kcopyd runs dm-snapshot's completion
callback, which ends up calling pending_complete().
pending_complete() tries to resubmit any deferred origin bios. This
requires acquiring a read lock on _origins_lock, which blocks.
This happens because the read-write semaphore implementation gives
priority to writers, meaning that as soon as a writer tries to enter
the critical section, no readers will be allowed in, until all
writers have completed their work.
So, pending_complete() waits for the writer at step (3) to acquire
and release the lock. This writer waits for the readers at step (2)
to release the read lock and those readers wait for
pending_complete() (the kcopyd thread) to signal the s->cow_count
semaphore: DEADLOCK.
The above was thoroughly analyzed and documented by Nikos Tsironis as
part of his initial proposal for fixing this deadlock, see:
https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
Fix this deadlock by reworking COW throttling so that it waits without
holding any locks. Add a variable 'in_progress' that counts how many
kcopyd jobs are running. A function wait_for_in_progress() will sleep if
'in_progress' is over the limit. It drops _origins_lock in order to
avoid the deadlock.
This simple refactoring moves code for modifying the semaphore cow_count
into separate functions to prepare for changes that will extend these
methods to provide for a more sophisticated mechanism for COW
throttling.
We've got two issues with the non-regular file handling for non-blocking
IO:
1) We don't want to re-do a short read in full for a non-regular file,
as we can't just read the data again.
2) For non-regular files that don't support non-blocking IO attempts,
we need to punt to async context even if the file is opened as
non-blocking. Otherwise the caller always gets -EAGAIN.
Add two new request flags to handle these cases. One is just a cache
of the inode S_ISREG() status, the other tells io_uring that we always
need to punt this request to async context, even if REQ_F_NOWAIT is set.
Kai-Heng Feng [Wed, 23 Oct 2019 14:25:26 +0000 (22:25 +0800)]
ALSA: hda: Allow HDA to be runtime suspended when dGPU is not bound to a driver
BugLink: https://bugs.launchpad.net/bugs/1840835
Nvidia proprietary driver doesn't support runtime power management, so
when a user only wants to use the integrated GPU, it's a common practice
to let dGPU not to bind any driver, and let its upstream port to be
runtime suspended. At the end of runtime suspension the port uses
platform power management to disable power through _OFF method of power
resource, which is listed by _PR3.
After commit b516ea586d71 ("PCI: Enable NVIDIA HDA controllers"), when
the dGPU comes with an HDA function, the HDA won't be suspended if the
dGPU is unbound, so the power resource can't be turned off by its
upstream port driver.
Commit 37a3a98ef601 ("ALSA: hda - Enable runtime PM only for
discrete GPU") only allows HDA to be runtime suspended once GPU is
bound, to keep APU's HDA working.
However, HDA on dGPU isn't that useful if dGPU is not bound to any
driver. So let's relax the runtime suspend requirement for dGPU's HDA
function, to disable the power source to save lots of power.
Kai-Heng Feng [Wed, 23 Oct 2019 14:25:25 +0000 (22:25 +0800)]
PCI: Add a helper to check Power Resource Requirements _PR3 existence
BugLink: https://bugs.launchpad.net/bugs/1840835
A driver may want to know the existence of _PR3, to choose different
runtime suspend behavior. A user will be add in next patch.
UBUNTU: SAUCE: shiftfs: drop CAP_SYS_RESOURCE from effective capabilities
BugLink: https://bugs.launchpad.net/bugs/1849483
Currently shiftfs allows to exceed project quota and reserved space on
e.g. ext2. See [1] and especially [2] for a bug report. This is very
much not what we want. Quotas and reserverd space settings set on the
host need to respected. The cause for this issue is overriding the
credentials with the superblock creator's credentials whenever we
perform operations such as fallocate() or writes while retaining
CAP_SYS_RESOURCE.
The fix is to drop CAP_SYS_RESOURCE from the effective capability set
after we have made a copy of the superblock creator's credential at
superblock creation time. This very likely gives us more security than
we had before and the regression potential seems limited. I would like
to try this apporach first before coming up with something potentially
more sophisticated. I don't see why CAP_SYS_RESOURCE should become a
limiting factor in most use-cases.
[1]: https://github.com/lxc/lxd/issues/6333
[2]: https://github.com/lxc/lxd/issues/6333#issuecomment-545154838 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
BugLink: https://bugs.launchpad.net/bugs/1849482
Set the s_maxbytes limit to MAX_LFS_FILESIZE.
Currently shiftfs limits the maximum size for fallocate() needlessly
causing calls such as fallocate --length 2GB ./file to fail. This
limitation is arbitrary since it's not caused by the underlay but
rather by shiftfs itself capping the s_maxbytes. This causes bugs such
as the one reported in [1].
[1]: https://github.com/lxc/lxd/issues/6333 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Dan Elkouby [Tue, 29 Oct 2019 16:25:55 +0000 (09:25 -0700)]
Bluetooth: hidp: Fix assumptions on the return value of hidp_send_message
BugLink: https://bugs.launchpad.net/bugs/1850443
hidp_send_message was changed to return non-zero values on success,
which some other bits did not expect. This caused spurious errors to be
propagated through the stack, breaking some drivers, such as hid-sony
for the Dualshock 4 in Bluetooth mode.
As pointed out by Dan Carpenter, hid-microsoft directly relied on that
assumption as well.
Fixes: 48d9cc9d85dd ("Bluetooth: hidp: Let hidp_send_message return number of queued bytes") Signed-off-by: Dan Elkouby <streetwalkermc@gmail.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
(cherry picked from commit 8bb3537095f107ed55ad51f6241165b397aaafac) Signed-off-by: Kamal Mostafa <kamal@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Aaron Ma [Wed, 30 Oct 2019 06:56:01 +0000 (14:56 +0800)]
UBUNTU: SAUCE: ALSA: hda/realtek - Fix 2 front mics of codec 0x623
BugLink: https://bugs.launchpad.net/bugs/1850599
These 2 ThinkCentres installed a new realtek codec ID 0x623,
it has 2 front mics with the same location on pin 0x18 and 0x19.
Apply fixup ALC283_FIXUP_HEADSET_MIC to change 1 front mic
location to right, then pulseaudio can handle them.
One "Front Mic" and one "Mic" will be shown, and audio output works
fine.
Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20191024114439.31522-1-aaron.ma@canonical.com Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 8a6c55d0f883e9a7e7c91841434f3b6bbf932bb2
https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git) Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Acked-by: AceLan Kao <acelan.kao@canonical.com> Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
On HP spectre x360 convertible the message:
hid-sensor-hub 001F:8087:0AC2.0002: hid_field_extract() called with n (192) > 32! (kworker/1:2)
is continually printed many times per second, crowding out all else.
Protect dmesg by printing the warning only one time.
The size of the hid report field data structure should probably be increased.
The data structure is treated as a u32 in Linux, but an unlimited number
of bits in the USB hid spec, so there is some rearchitecture needed now that
devices are sending more than 32 bits.
Signed-off-by: Joshua Clayton <stillcompiling@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
(cherry picked from commit 0af10eed9b7308187c7865024248b2a2a5aa382a) Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: AceLan Kao <acelan.kao@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Joshua Clayton [Wed, 30 Oct 2019 07:16:50 +0000 (15:16 +0800)]
HID: core: reformat and reduce hid_printk macros
BugLink: https://bugs.launchpad.net/bugs/1850600
Reformat hid_printk macros to use standard __VA_ARGS__ syntax.
Per Joe Perches hid_printk(), hid_emerg(), hid_crit(), and hid_alert() are
unlikely ever to be used. Remove them.
Signed-off-by: Joshua Clayton <stillcompiling@gmail.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
(cherry picked from commit 337c22ab1d4f316f37e7a7c78bdea3d768270542) Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Acked-by: Connor Kuehl <connor.kuehl@canonical.com> Acked-by: AceLan Kao <acelan.kao@canonical.com> Signed-off-by: Khalid Elmously <khalid.elmously@canonical.com>
Seth Forshee [Thu, 7 Nov 2019 16:08:25 +0000 (10:08 -0600)]
UBUNTU: SAUCE: ovl: Restore vm_file value when lower fs mmap fails
BugLink: https://bugs.launchpad.net/bugs/1850994
ovl_mmap() overwrites vma->vm_file before calling the lower
filesystem mmap but does not restore the original value on
failure. This means it is giving a pointer to the lower fs file
back to the caller with no reference, which is a bad practice.
However, it does not lead to any issues with upstream kernels as
no caller accesses vma->vm_file after call_mmap().
With the aufs patches applied the story is different. Whereas
mmap_region() previously fput a local variable containing the
file it assigned to vm_file, it now calls vma_fput() which will
fput vm_file, for which it has no reference, and the reference
for the original vm_file is not put.
Fix this by restoring vma->vm_file to the original value when the
mmap call into the lower fs fails.
Seth Forshee [Thu, 7 Nov 2019 16:08:24 +0000 (10:08 -0600)]
UBUNTU: SAUCE: shiftfs: Restore vm_file value when lower fs mmap fails
BugLink: https://bugs.launchpad.net/bugs/1850994
shiftfs_mmap() overwrites vma->vm_file before calling the lower
filesystem mmap but does not restore the original value on
failure. This means it is giving a pointer to the lower fs file
back to the caller with no reference, which is a bad practice.
However, it does not lead to any issues with upstream kernels as
no caller accesses vma->vm_file after call_mmap().
With the aufs patches applied the story is different. Whereas
mmap_region() previously fput a local variable containing the
file it assigned to vm_file, it now calls vma_fput() which will
fput vm_file, for which it has no reference, and the reference
for the original vm_file is not put.
Fix this by restoring vma->vm_file to the original value when the
mmap call into the lower fs fails.
Seth Forshee [Thu, 7 Nov 2019 17:21:53 +0000 (11:21 -0600)]
UBUNTU: SAUCE: fs: Move SB_I_NOSUID to the top of s_iflags
BugLink: https://bugs.launchpad.net/bugs/1851677
SB_I_NOSUID was added by a sauce patch, and over time it has come
to occpy the same bit in s_iflags as SB_I_USERNS_VISIBLE without
being noticed. overlayfs will set SB_I_NOSUID when any lower
mount is nosuid. When this happens for a user namespace mount,
mount_too_revealing() will perform additional, unnecessary checks
which may block mounting when it should be allowed.
Move SB_I_NOSUID to prevent this conflict, and move it to the top
of s_iflags to make future conflicts less likely.
Seth Forshee [Tue, 5 Nov 2019 20:35:05 +0000 (14:35 -0600)]
UBUNTU: SAUCE: (efi-lockdown) Really don't allow lifting lockdown from userspace
BugLink: https://bugs.launchpad.net/bugs/1851380
"UBUNTU: SAUCE: (efi-lockdown) Add a SysRq option to lift kernel
lockdown" adds a sysrq key to lift kernel lockdown, which is
meant to only allow a physically present user to lift lockdown
using a keyboard. However, the code has a bug which also allows
root to lift lockdown through /proc/sysrq-trigger. Fix this bug
to make this work as intended.
UBUNTU: [Debian]: do not skip tests for linux-hwe-edge
Ignore: yes
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by: Po-Hsu Lin <po-hsu.lin@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Nicolas pointed out that the cxgb4 driver is doing dma off of the stack,
which is generally considered a very bad thing. On some architectures it
could be a security problem, but odds are none of them actually run this
driver, so it's just a "normal" bug.
Resolve this by allocating the memory for a message off of the heap
instead of the stack. kmalloc() always will give us a proper memory
location that DMA will work correctly from.
Link: https://lore.kernel.org/r/20191001165611.GA3542072@kroah.com Reported-by: Nicolas Waisman <nico@semmle.com> Tested-by: Potnuri Bharat Teja <bharat@chelsio.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
rq_qos_del() incorrectly assigns the node being deleted to the head if
it was the first on the list in the !prev path. Fix it by iterating
with ** instead.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Commit d698a388146c ("of: reserved-memory: ignore disabled memory-region
nodes") added an early return in of_reserved_mem_device_init_by_idx(), but
didn't call of_node_put() on a device_node whose ref-count was incremented
in the call to of_parse_phandle() preceding the early exit.
_find_opp_of_np() doesn't traverse the list of OPP tables but instead
just the entries within an OPP table and so only requires to lock the
OPP table itself.
The lockdep_assert_held() was added there by mistake and isn't really
required.
There is an arbitrary difference between the system resume and
runtime resume code paths for PCI devices regarding the delay to
apply when switching the devices from D3cold to D0.
Namely, pci_restore_standard_config() used in the runtime resume
code path calls pci_set_power_state() which in turn invokes
__pci_start_power_transition() to power up the device through the
platform firmware and that function applies the transition delay
(as per PCI Express Base Specification Revision 2.0, Section 6.6.1).
However, pci_pm_default_resume_early() used in the system resume
code path calls pci_power_up() which doesn't apply the delay at
all and that causes issues to occur during resume from
suspend-to-idle on some systems where the delay is required.
Since there is no reason for that difference to exist, modify
pci_power_up() to follow pci_set_power_state() more closely and
invoke __pci_start_power_transition() from there to call the
platform firmware to power up the device (in case that's necessary).
Fixes: db288c9c5f9d ("PCI / PM: restore the original behavior of pci_set_power_state()") Reported-by: Daniel Drake <drake@endlessm.com> Tested-by: Daniel Drake <drake@endlessm.com> Link: https://lore.kernel.org/linux-pm/CAD8Lp44TYxrMgPLkHCqF9hv6smEurMXvmmvmtyFhZ6Q4SE+dig@mail.gmail.com/T/#m21be74af263c6a34f36e0fc5c77c5449d9406925 Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Cc: 3.10+ <stable@vger.kernel.org> # 3.10+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
xenvif_connect_data() calls module_put() in case of error. This is
wrong as there is no related module_get().
Remove the superfluous module_put().
Fixes: 279f438e36c0a7 ("xen-netback: Don't destroy the netdev until the vif is shut down") Cc: <stable@vger.kernel.org> # 3.12 Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
In the future, we're going to want to extend the ceph_reply_info_extra
for create replies. Currently though, the kernel code doesn't accept an
extra blob that is larger than the expected data.
Change the code to skip over any unrecognized fields at the end of the
extra blob, rather than returning -EIO.
It is incorrect to set the cpufreq syscore shutdown callback pointer
to cpufreq_suspend(), because that function cannot be run in the
syscore stage of system shutdown for two reasons: (a) it may attempt
to carry out actions depending on devices that have already been shut
down at that point and (b) the RCU synchronization carried out by it
may not be able to make progress then.
The latter issue has been present since commit 45975c7d21a1 ("rcu:
Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds"),
but the former one has been there since commit 90de2a4aa9f3 ("cpufreq:
suspend cpufreq governors on shutdown") regardless.
Fix that by dropping cpufreq_syscore_ops altogether and making
device_shutdown() call cpufreq_suspend() directly before shutting
down devices, which is along the lines of what system-wide power
management does.
Fixes: 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") Fixes: 90de2a4aa9f3 ("cpufreq: suspend cpufreq governors on shutdown") Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: 4.0+ <stable@vger.kernel.org> # 4.0+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Connecting a vCPU to a XIVE KVM device means establishing a 1:1
association between a vCPU id and the offset (VP id) of a VP
structure within a fixed size block of VPs. We currently try to
enforce the 1:1 relationship by checking that a vCPU with the
same id isn't already connected. This is good but unfortunately
not enough because we don't map VP ids to raw vCPU ids but to
packed vCPU ids, and the packing function kvmppc_pack_vcpu_id()
isn't bijective by design. We got away with it because QEMU passes
vCPU ids that fit well in the packing pattern. But nothing prevents
userspace to come up with a forged vCPU id resulting in a packed id
collision which causes the KVM device to associate two vCPUs to the
same VP. This greatly confuses the irq layer and ultimately crashes
the kernel, as shown below.
Example: a guest with 1 guest thread per core, a core stride of
8 and 300 vCPUs has vCPU ids 0,8,16...2392. If QEMU is patched to
inject at some point an invalid vCPU id 348, which is the packed
version of itself and 2392, we get:
This affects both XIVE and XICS-on-XIVE devices since the beginning.
Check the VP id instead of the vCPU id when a new vCPU is connected.
The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
is moved after the check to avoid the need for rollback.
Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
The @type can be completely garbage, as DATA type is not possible for
trace_qgroup_meta_reserve() trace event.
[CAUSE]
Ther are several problems related to qgroup trace events:
- Unassigned entry member
Member entry::type of trace_qgroup_update_reserve() and
trace_qgourp_meta_reserve() is not assigned
- Redundant entry member
Member entry::type is completely useless in
trace_qgroup_meta_convert()
Fixes: 4ee0d8832c2e ("btrfs: qgroup: Update trace events for metadata reservation") CC: stable@vger.kernel.org # 4.10+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
We were checking for the full fsync flag in the inode before locking the
inode, which is racy, since at that that time it might not be set but
after we acquire the inode lock some other task set it. One case where
this can happen is on a system low on memory and some concurrent task
failed to allocate an extent map and therefore set the full sync flag on
the inode, to force the next fsync to work in full mode.
A consequence of missing the full fsync flag set is hitting the problems
fixed by commit 0c713cbab620 ("Btrfs: fix race between ranged fsync and
writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
of weird inconsistencies after replaying a log due to file extents items
representing ranges that overlap.
So just move the check such that it's done after locking the inode and
before starting writeback again.
Fixes: 0c713cbab620 ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
If we fail to reserve metadata for delalloc operations we end up releasing
the previously reserved qgroup amount twice, once explicitly under the
'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
value of 'true' for its 'qgroup_free' argument, which results in
btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
a double free.
Also if we fail to reserve the necessary qgroup amount, we jump to the
label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
reserve any qgroup amount. So we freed some amount we never reserved.
So fix this by removing the call to btrfs_inode_rsv_release() in the
failure path, since it's not necessary at all as we haven't changed the
inode's block reserve in any way at this point.
Fixes: c8eaeac7b73434 ("btrfs: reserve delalloc metadata differently") CC: stable@vger.kernel.org # 5.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run
delayed refs asynchronously") removed the async delayed refs but the
thread has been created, without any use. Remove it to avoid resource
consumption.
Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.
Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
In btrfs_read_block_groups(), if we have an invalid block group which
has mixed type (DATA|METADATA) while the fs doesn't have MIXED_GROUPS
feature, we error out without freeing the block group cache.
This patch will add the missing btrfs_put_block_group() to prevent
memory leak.
Note for stable backports: the file to patch in versions <= 5.3 is
fs/btrfs/extent-tree.c
Fixes: 49303381f19a ("Btrfs: bail out if block group has different mixed flag") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
The configuration registers for the LED group have inverted
polarity, which puts the GPIO into open-drain state when used in
GPIO mode. Switch to '0' for GPIO and '1' for LED modes.
Fixes: 87466ccd9401 ("pinctrl: armada-37xx: Add pin controller support for Armada 37xx") Signed-off-by: Patrick Williams <alpawi@amazon.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20191001155154.99710-1-alpawi@amazon.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
The 37xx configuration registers are only 32 bits long, so
pins 32-35 spill over into the next register. The calculation
for the register address was done, but the bitmask was not, so
any configuration to pin 32 or above resulted in a bitmask that
overflowed and performed no action.
Fix the register / offset calculation to also adjust the offset.
e3f72b749da2 pinctrl: cherryview: fix Strago DMI workaround 86c5dd6860a6 pinctrl: cherryview: limit Strago DMI workarounds to version 1.0
because even with 1.1 versions of BIOS there are some pins that are
configured as interrupts but not claimed by any driver, and they
sometimes fire up and result in interrupt storms that cause touchpad
stop functioning and other issues.
Given that we are unlikely to qualify another firmware version for a
while it is better to keep the workaround active on all Strago boards.
Reported-by: Alex Levin <levinale@chromium.org> Fixes: 86c5dd6860a6 ("pinctrl: cherryview: limit Strago DMI workarounds to version 1.0") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Tested-by: Alex Levin <levinale@chromium.org> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Now that there's Hyper-V IOMMU driver, Linux can switch to x2apic mode
when supported by the vcpus.
However, the apic access functions for Hyper-V enlightened apic assume
xapic mode only.
As a result, Linux fails to bring up secondary cpus when run as a guest
in QEMU/KVM with both hv_apic and x2apic enabled.
According to Michael Kelley, when in x2apic mode, the Hyper-V synthetic
apic MSRs behave exactly the same as the corresponding architectural
x2apic MSRs, so there's no need to override the apic accessors. The
only exception is hv_apic_eoi_write, which benefits from lazy EOI when
available; however, its implementation works for both xapic and x2apic
modes.
Check that the per-cpu cluster mask pointer has been set prior to
clearing a dying cpu's bit. The per-cpu pointer is not set until the
target cpu reaches smp_callin() during CPUHP_BRINGUP_CPU, whereas the
teardown function, x2apic_dead_cpu(), is associated with the earlier
CPUHP_X2APIC_PREPARE. If an error occurs before the cpu is awakened,
e.g. if do_boot_cpu() itself fails, x2apic_dead_cpu() will dereference
the NULL pointer and cause a panic.
Our hardware (UV aka Superdome Flex) has address ranges marked
reserved by the BIOS. Access to these ranges is caught as an error,
causing the BIOS to halt the system.
Initial page tables mapped a large range of physical addresses that
were not checked against the list of BIOS reserved addresses, and
sometimes included reserved addresses in part of the mapped range.
Including the reserved range in the map allowed processor speculative
accesses to the reserved range, triggering a BIOS halt.
Used early in booting, the page table level2_kernel_pgt addresses 1
GiB divided into 2 MiB pages, and it was set up to linearly map a full
1 GiB of physical addresses that included the physical address range
of the kernel image, as chosen by KASLR. But this also included a
large range of unused addresses on either side of the kernel image.
And unlike the kernel image's physical address range, this extra
mapped space was not checked against the BIOS tables of usable RAM
addresses. So there were times when the addresses chosen by KASLR
would result in processor accessible mappings of BIOS reserved
physical addresses.
The kernel code did not directly access any of this extra mapped
space, but having it mapped allowed the processor to issue speculative
accesses into reserved memory, causing system halts.
This was encountered somewhat rarely on a normal system boot, and much
more often when starting the crash kernel if "crashkernel=512M,high"
was specified on the command line (this heavily restricts the physical
address of the crash kernel, in our case usually within 1 GiB of
reserved space).
The solution is to invalidate the pages of this table outside the kernel
image's space before the page table is activated. It fixes this problem
on our hardware.
The SiFive PLIC interrupt controller seems to have all the HW
features to support the fasteoi flow, but the driver seems to be
stuck in a distant past. Bring it into the 21st century.
Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Palmer Dabbelt <palmer@sifive.com> (QEMU Boot) Tested-by: Darius Rad <darius@bluespec.com> (on 2 HW PLIC implementations) Tested-by: Paul Walmsley <paul.walmsley@sifive.com> (HiFive Unleashed) Reviewed-by: Palmer Dabbelt <palmer@sifive.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/8636gxskmj.wl-maz@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
GFP_NOWAIT allocation can fail anytime - it doesn't wait for memory being
available and it fails if the mempool is exhausted and there is not enough
memory.
If we go down this path:
map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
we can see that map_bio() doesn't check the return value of mg_start(),
and the bio is leaked.
If we go down this path:
map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
mg_lock_writes -> mg_complete
the bio is ended with an error - it is unacceptable because it could
cause filesystem corruption if the machine ran out of memory
temporarily.
Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
wait until memory becomes available. mempool_alloc with GFP_NOIO can't
fail, so remove the code paths that deal with allocation failure.
Users reported a v5.3 performance regression and inability to establish
huge page mappings. A revised version of the ndctl "dax.sh" huge page
unit test identifies commit 23c84eb78375 "dax: Fix missed wakeup with
PMD faults" as the source.
Update get_unlocked_entry() to check for NULL entries before checking
the entry order, otherwise NULL is misinterpreted as a present pte
conflict. The 'order' check needs to happen before the locked check as
an unlocked entry at the wrong order must fallback to lookup the correct
order.
Reported-by: Jeff Smits <jeff.smits@intel.com> Reported-by: Doug Nelson <doug.nelson@intel.com> Cc: <stable@vger.kernel.org> Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults") Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Link: https://lore.kernel.org/r/157167532455.3945484.11971474077040503994.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Any subsequent call to perf_trace_event_reg() will observe total_ref_count > 0,
causing the perf_trace_buf to be always NULL. This can result in perf_trace_buf
getting accessed from perf_trace_buf_alloc() without being initialized. Acquiring
event_mutex in perf_kprobe_init() before calling perf_trace_event_init() should
fix this race.
The race caused the following bug:
Unable to handle kernel paging request at virtual address 0000003106f2003c
Mem abort info:
ESR = 0x96000045
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000045
CM = 0, WnR = 1
user pgtable: 4k pages, 39-bit VAs, pgdp = ffffffc034b9b000
[0000003106f2003c] pgd=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000045 [#1] PREEMPT SMP
Process syz-executor (pid: 18393, stack limit = 0xffffffc093190000)
pstate: 80400005 (Nzcv daif +PAN -UAO)
pc : __memset+0x20/0x1ac
lr : memset+0x3c/0x50
sp : ffffffc09319fc50
allows CAP_EXCLUSIVE events to be grouped with other events. Since all
of those also happen to be AUX events (which is not the case the other
way around, because arch/s390), this changes the rules for stopping the
output: the AUX event may not be on its PMU's context any more, if it's
grouped with a HW event, in which case it will be on that HW event's
context instead. If that's the case, munmap() of the AUX buffer can't
find and stop the AUX event, potentially leaving the last reference with
the atomic context, which will then end up freeing the AUX buffer. This
will then trip warnings:
Fix this by using the context's PMU context when looking for events
to stop, instead of the event's PMU context.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20191022073940.61814-1-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Currently the code assumes that if a file info entry belongs
to lists of open file handles of an inode and a tcon then
it has non-zero reference. The recent changes broke that
assumption when putting the last reference of the file info.
There may be a situation when a file is being deleted but
nothing prevents another thread to reference it again
and start using it. This happens because we do not hold
the inode list lock while checking the number of references
of the file info structure. Fix this by doing the proper
locking when doing the check.
Fixes: 487317c99477d ("cifs: add spinlock for the openFileList to cifsInodeInfo") Fixes: cb248819d209d ("cifs: use cifsInodeInfo->open_file_lock while iterating to avoid a panic") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
cifs_setattr_nounix has two paths which miss free operations
for xid and fullpath.
Use goto cifs_setattr_exit like other paths to fix them.
CC: Stable <stable@vger.kernel.org> Fixes: aa081859b10c ("cifs: flush before set-info if we have writeable handles") Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
According to MS-CIFS specification MID 0xFFFF should not be used by the
CIFS client, but we actually do. Besides, this has proven to cause races
leading to oops between SendReceive2/cifs_demultiplex_thread. On SMB1,
MID is a 2 byte value easy to reach in CurrentMid which may conflict with
an oplock break notification request coming from server
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
It appears that the only case where we need to apply the TX2_219_TVM
mitigation is when the core is in SMT mode. So let's condition the
enabling on detecting a CPU whose MPIDR_EL1.Aff0 is non-zero.
Cc: <stable@vger.kernel.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
As a PRFM instruction racing against a TTBR update can have undesirable
effects on TX2, NOP-out such PRFM on cores that are affected by
the TX2-219 erratum.
Cc: <stable@vger.kernel.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
In order to workaround the TX2-219 erratum, it is necessary to trap
TTBRx_EL1 accesses to EL2. This is done by setting HCR_EL2.TVM on
guest entry, which has the side effect of trapping all the other
VM-related sysregs as well.
To minimize the overhead, a fast path is used so that we don't
have to go all the way back to the main sysreg handling code,
unless the rest of the hypervisor expects to see these accesses.
Cc: <stable@vger.kernel.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
ghes_edac models a single logical memory controller, and uses a global
ghes_init variable to ensure only the first ghes_edac_register() will
do anything.
ghes_edac is registered the first time a GHES entry in the HEST is
probed. There may be multiple entries, so subsequent attempts to
register ghes_edac are silently ignored as the work has already been
done.
When a GHES entry is unregistered, it calls ghes_edac_unregister(),
which free()s the memory behind the global variables in ghes_edac.
But there may be multiple GHES entries, the next call to
ghes_edac_unregister() will dereference the free()d memory, and attempt
to free it a second time.
This may also be triggered on a platform with one GHES entry, if the
driver is unbound/re-bound and unbound. The re-bind step will do
nothing because of ghes_init, the second unbind will then do the same
work as the first.
Doing the unregister work on the first call is unsafe, as another
CPU may be processing a notification in ghes_edac_report_mem_error(),
using the memory we are about to free.
ghes_init is already half of the reference counting. We only need
to do the register work for the first call, and the unregister work
for the last. Add the unregister check.
This means we no longer free ghes_edac's memory while there are
GHES entries that may receive a notification.
This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE.
[ bp: merge into a single patch. ]
Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Sven noticed that calling ioremap() and iounmap() multiple times leads
to a vmap memory leak:
vmap allocation for size 4198400 failed:
use vmalloc=<size> to increase size
A recent commit removed the NULL pointer check from the clock_getres()
implementation causing a test case to fault.
POSIX requires an explicit NULL pointer check for clock_getres() aside of
the validity check of the clock_id argument for obscure reasons.
Add it back for both 32bit and 64bit.
Note, this is only a partial revert of the offending commit which does not
bring back the broken fallback invocation in the the 32bit compat
implementations of clock_getres() and clock_gettime().
Commit "bpf: Process in-kernel BTF" in linux-next introduced an undefined
__weak symbol, which results in an R_390_GLOB_DAT relocation type. That
is not yet handled by the KASLR relocation code, and the kernel stops with
the message "Unknown relocation type".
Add code to detect and handle R_390_GLOB_DAT relocation types and undefined
symbols.
If a process is interrupted while accessing the crypto device and the
global ap_perms_mutex is contented, release() could return early and
fail to free related resources.
Custom outs*/ins* implementations are long gone from the xtensa port,
remove matching EXPORT_SYMBOLs.
This fixes the following build warnings issued by modpost since commit 15bfc2348d54 ("modpost: check for static EXPORT_SYMBOL* functions"):
WARNING: "insb" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "insw" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "insl" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsb" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsw" [vmlinux] is a static EXPORT_SYMBOL
WARNING: "outsl" [vmlinux] is a static EXPORT_SYMBOL
Mmap /dev/dax more than once, then read the poison location using
address from one of the mappings. The other mappings due to not having
the page mapped in will cause SIGKILLs delivered to the process.
SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
handle the UE.
Although one may add MAP_POPULATE to mmap(2) to work around the issue,
MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
isn't always an option.
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
touched.
Let's make sure that we only consider online memory (managed by the
buddy) that has initialized memmaps. ZONE_DEVICE is not applicable.
page_zone() will call page_to_nid(), which will trigger
VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps. This
can be the case when an offline memory block (e.g., never onlined) is
spanned by a zone.
Note: As explained by Michal in [1], alloc_contig_range() will verify
the range. So it boils down to the wrong access in this function.
Until commit 92d12f9544b7 ("memblock: refactor internal allocation
functions") the maximal address for memblock allocations was forced to
memblock.current_limit only for the allocation functions returning
virtual address. The changes introduced by that commit moved the limit
enforcement into the allocation core and as a result the allocation
functions returning physical address also started to limit allocations
to memblock.current_limit.
Commit 1a61ab8038e72 ("mm: memcontrol: replace zone summing with
lruvec_page_state()") has made lruvec_page_state to use per-cpu counters
instead of calculating it directly from lru_zone_size with an idea that
this would be more effective.
Tim has reported that this is not really the case for their database
benchmark which is showing an opposite results where lruvec_page_state
is taking up a huge chunk of CPU cycles (about 25% of the system time
which is roughly 7% of total cpu cycles) on 5.3 kernels. The workload
is running on a larger machine (96cpus), it has many cgroups (500) and
it is heavily direct reclaim bound.
The likely culprit is the cache traffic the lruvec_page_state_local
generates. Dave Hansen says:
: I was thinking purely of the cache footprint. If it's reading
: pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
: bytes of cache *96 CPUs = 18k of data, mostly read-only. 1 cgroup would
: be 18k of data for the whole system and the caching would be pretty
: efficient and all 18k would probably survive a tight page fault loop in
: the L1. 500 cgroups would be ~90k of data per CPU thread which doesn't
: fit in the L1 and probably wouldn't survive a tight page fault loop if
: both logical threads were banging on different cgroups.
:
: It's just a theory, but it's why I noted the number of cgroups when I
: initially saw this show up in profiles
Fix the regression by partially reverting the said commit and calculate
the lru size explicitly.
Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com Fixes: 1a61ab8038e72 ("mm: memcontrol: replace zone summing with lruvec_page_state()") Signed-off-by: Honglei Wang <honglei.wang@oracle.com> Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Tested-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: <stable@vger.kernel.org> [5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Florian and Dave reported [1] a NULL pointer dereference in
__reset_isolation_pfn(). While the exact cause is unclear, staring at
the code revealed two bugs, which might be related.
One bug is that if zone starts in the middle of pageblock, block_page
might correspond to different pfn than block_pfn, and then the
pfn_valid_within() checks will check different pfn's than those accessed
via struct page. This might result in acessing an unitialized page in
CONFIG_HOLES_IN_ZONE configs.
The other bug is that end_page refers to the first page of next
pageblock and not last page of current pageblock. The online and valid
check is then wrong and with sections, the while (page < end_page) loop
might wander off actual struct page arrays.
The kernel panics on an attempt to dereference the NULL memcg pointer.
When shutdown_cache() is called from the kmem_cache_destroy() context, a
memcg kmem_cache might have empty slab pages in a partial list, which are
still charged to the memory cgroup.
These pages are released by free_partial() at the beginning of
shutdown_cache(): either directly or by scheduling a RCU-delayed work
(if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag). The latter case
is when the reported panic can happen: memcg_unlink_cache() is called
immediately after shrinking partial lists, without waiting for scheduled
RCU works. It sets the kmem_cache->memcg_params.memcg pointer to NULL,
and the following attempt to dereference it by __free_slab() from the
RCU work context causes the panic.
To fix the issue, let's postpone the release of the memcg pointer to
destroy_memcg_params(). It's called from a separate work context by
slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
This guarantees that all scheduled page release RCU works will complete
before the memcg pointer will be zeroed.
Big thanks for Karsten for the perfect report containing all necessary
information, his help with the analysis of the problem and testing of the
fix.
Patch series "mm/memory_hotplug: Shrink zones before removing memory",
v6.
This series fixes the access of uninitialized memmaps when shrinking
zones/nodes and when removing memory. Also, it contains all fixes for
crashes that can be triggered when removing certain namespace using
memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
more involved (we don't have SECTION_IS_ONLINE as an indicator), and
shrinking is only of limited use (set_zone_contiguous() cannot detect
the ZONE_DEVICE as contiguous).
We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
of code to a minimum. Shrinking is especially necessary to keep
zone->contiguous set where possible, especially, on memory unplug of
DIMMs at zone boundaries.
Zones are now properly shrunk when offlining memory blocks or when
onlining failed. This allows to properly shrink zones on memory unplug
even if the separate memory blocks of a DIMM were onlined to different
zones or re-onlined to a different zone after offlining.
With an altmap, the memmap falling into the reserved altmap space are not
initialized and, therefore, contain a garbage NID and a garbage zone.
Make sure to read the NID/zone from a memmap that was initialized.
This fixes a kernel crash that is observed when destroying a namespace:
kernel BUG at include/linux/mm.h:1107!
cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
pc: c0000000004b9728: memunmap_pages+0x238/0x340
lr: c0000000004b9724: memunmap_pages+0x234/0x340
...
pid = 3669, comm = ndctl
kernel BUG at include/linux/mm.h:1107!
devm_action_release+0x30/0x50
release_nodes+0x268/0x2d0
device_release_driver_internal+0x174/0x240
unbind_store+0x13c/0x190
drv_attr_store+0x44/0x60
sysfs_kf_write+0x70/0xa0
kernfs_fop_write+0x1ac/0x290
__vfs_write+0x3c/0x70
vfs_write+0xe4/0x200
ksys_write+0x7c/0x140
system_call+0x5c/0x68
The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f4833 ("mm,
devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
think we will never have driver reserved memory with
MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
[david@redhat.com: minimze code changes, rephrase description] Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com Fixes: 2c2a5af6fed2 ("mm, memory_hotplug: add nid parameter to arch_remove_memory") Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Damian Tometzki <damian.tometzki@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Halil Pasic <pasic@linux.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jun Yao <yaojun8558363@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Rich Felker <dalias@libc.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Uninitialized memmaps contain garbage and in the worst case trigger
kernel BUGs, especially with CONFIG_PAGE_POISONING. They should not get
touched.
For example, when not onlining a memory block that is spanned by a zone
and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
Please note that this change does not affect ZONE_DEVICE, because
pagetypeinfo_showmixedcount_print() is called from
mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
ZONE_DEVICE is never populated (zone->present_pages always 0).
[david@redhat.com: move check to outer loop, add comment, rephrase description] Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e86b319 Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Qian Cai <cai@lca.pw> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
A long time ago we fixed a similar deadlock in show_slab_objects() [1].
However, it is apparently due to the commits like 01fb58bcba63 ("slab:
remove synchronous synchronize_sched() from memcg cache deactivation
path") and 03afc0e25f7f ("slab: get_online_mems for
kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by
just reading files in /sys/kernel/slab which will generate a lockdep
splat below.
Since the "mem_hotplug_lock" here is only to obtain a stable online node
mask while racing with NUMA node hotplug, in the worst case, the results
may me miscalculated while doing NUMA node hotplug, but they shall be
corrected by later reads of the same files.
WARNING: possible circular locking dependency detected
------------------------------------------------------
cat/5224 is trying to acquire lock: ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
show_slab_objects+0x94/0x3a8
but task is already holding lock: b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
I think it is important to mention that this doesn't expose the
show_slab_objects to use-after-free. There is only a single path that
might really race here and that is the slab hotplug notifier callback
__kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
doesn't really destroy kmem_cache_node data structures.
According to the App note[1] detailing the tuning algorithm, for
temperatures < -20C, the initial tuning value should be min(largest value
in LPW - 24, ceil(13/16 ratio of LPW)). The largest value in LPW is
(max_window + 4 * (max_len - 1)) and not (max_window + 4 * max_len) itself.
Fix this implementation.
Since ceeeb99cd821 we no longer abuse the DMA_CTRL_ACK flag for custom
driver use and introduced the MXS_DMA_CTRL_WAIT4END instead. We have not
changed all users to this flag though. This patch fixes it for the
mxs-mmc driver.
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.
Reported-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.
The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().
Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.
There are three places where we access uninitialized memmaps, namely:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup
We have initialized memmaps either when the section is online or when the
page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain
garbage and in the worst case trigger kernel BUGs, especially with
CONFIG_PAGE_POISONING.
For example, not onlining a DIMM during boot and calling /proc/kpagecount
with CONFIG_PAGE_POISONING:
For now, let's drop support for ZONE_DEVICE from the three pseudo files
in order to fix this. To distinguish offline memory (with garbage
memmap) from ZONE_DEVICE memory with properly initialized memmaps, we
would have to check get_dev_pagemap() and pfn_zone_device_reserved()
right now. The usage of both (especially, special casing devmem) is
frowned upon and needs to be reworked.
The fundamental issue we have is:
if (pfn_to_online_page(pfn)) {
/* memmap initialized */
} else if (pfn_valid(pfn)) {
/*
* ???
* a) offline memory. memmap garbage.
* b) devmem: memmap initialized to ZONE_DEVICE.
* c) devmem: reserved for driver. memmap garbage.
* (d) devmem: memmap currently initializing - garbage)
*/
}
We'll leave the pfn_zone_device_reserved() check in stable_page_flags()
in place as that function is also used from memory failure. We now no
longer dump information about pages that are not in use anymore -
offline.
Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Pankaj gupta <pagupta@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Uninitialized memmaps contain garbage and in the worst case trigger kernel
BUGs, especially with CONFIG_PAGE_POISONING. They should not get touched.
Right now, when trying to soft-offline a PFN that resides on a memory
block that was never onlined, one gets a misleading error with
CONFIG_PAGE_POISONING:
But the actual result depends on the garbage in the memmap.
soft_offline_page() can only work with online pages, it returns -EIO in
case of ZONE_DEVICE. Make sure to only forward pages that are online
(iow, managed by the buddy) and, therefore, have an initialized memmap.
Add a check against pfn_to_online_page() and similarly return -EIO.
Link: http://lkml.kernel.org/r/20191010141200.8985-1-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
user_pages array should always be freed after validation regardless if
user pages are changed after bo is created because with HMM change parse
bo always allocate user pages array to get user pages for userptr bo.
v2: remove unused local variable and amend commit
v3: add back get user pages in gem_userptr_ioctl, to detect application
bug where an userptr VMA is not ananymous memory and reject it.
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Tested-by: Joe Barnett <thejoe@gmail.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org # 5.3 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
The first come first served apporoach to handling the VBT
child device AUX ch conflicts has backfired. We have machines
in the wild where the VBT specifies both port A eDP and
port E DP (in that order) with port E being the real one.
So let's try to flip the preference around and let the last
child device win once again.
Cc: stable@vger.kernel.org Cc: Jani Nikula <jani.nikula@intel.com> Tested-by: Masami Ichikawa <masami256@gmail.com> Tested-by: Torsten <freedesktop201910@liggy.de>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111966 Fixes: 36a0f92020dc ("drm/i915/bios: make child device order the priority order") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191011202030.8829-1-ville.syrjala@linux.intel.com Acked-by: Jani Nikula <jani.nikula@intel.com>
(cherry picked from commit 41e35ffb380bde1379e4030bb5b2ac824d5139cf) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Connor Kuehl <connor.kuehl@canonical.com> Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Daniel Vetter uncovered a nasty cycle in using the mmu-notifiers to
invalidate userptr objects which also happen to be pulled into GGTT
mmaps. That is when we unbind the userptr object (on mmu invalidation),
we revoke all CPU mmaps, which may then recurse into mmu invalidation.
We looked for ways of breaking the cycle, but the revocation on
invalidation is required and cannot be avoided. The only solution we
could see was to not allow such GGTT bindings of userptr objects in the
first place. In practice, no one really wants to use a GGTT mmapping of
a CPU pointer...
Just before Daniel's explosive lockdep patches land in v5.4-rc1, we got
a genuine blip from CI:
<4>[ 246.793958] ======================================================
<4>[ 246.793972] WARNING: possible circular locking dependency detected
<4>[ 246.793989] 5.3.0-gbd6c56f50d15-drmtip_372+ #1 Tainted: G U
<4>[ 246.794003] ------------------------------------------------------
<4>[ 246.794017] kswapd0/145 is trying to acquire lock:
<4>[ 246.794030] 000000003f565be6 (&dev->struct_mutex/1){+.+.}, at: userptr_mn_invalidate_range_start+0x18f/0x220 [i915]
<4>[ 246.794250]
but task is already holding lock:
<4>[ 246.794263] 000000001799cef9 (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe6/0x2a0
<4>[ 246.794291]
which lock already depends on the new lock.