Cornelia Huck [Thu, 13 Jun 2019 11:08:15 +0000 (13:08 +0200)]
s390/cio: introduce driver_override on the css bus
Sometimes, we want to control which of the matching drivers
binds to a subchannel device (e.g. for subchannels we want to
handle via vfio-ccw).
For pci devices, a mechanism to do so has been introduced in 782a985d7af2 ("PCI: Introduce new device binding path using
pci_dev.driver_override"). It makes sense to introduce the
driver_override attribute for subchannel devices as well, so
that we can easily extend the 'driverctl' tool (which makes
use of the driver_override attribute for pci).
Note that unlike pci we still require a driver override to
match the subchannel type; matching more than one subchannel
type is probably not useful anyway.
Signed-off-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Vasily Gorbik [Mon, 24 Jun 2019 15:02:28 +0000 (17:02 +0200)]
Merge tag 'vfio-ccw-20190621' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/vfio-ccw into features
Refactoring of the vfio-ccw cp handling, simplifying the
code and avoiding unneeded allocating/copying.
* tag 'vfio-ccw-20190621' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/vfio-ccw:
vfio-ccw: Remove copy_ccw_from_iova()
vfio-ccw: Factor out the ccw0-to-ccw1 transition
vfio-ccw: Copy CCW data outside length calculation
vfio-ccw: Skip second copy of guest cp to host
vfio-ccw: Move guest_cp storage into common struct
s390/cio: Combine direct and indirect CCW paths
vfio-ccw: Rearrange IDAL allocation in direct CCW
vfio-ccw: Remove pfn_array_table
vfio-ccw: Adjust the first IDAW outside of the nested loops
vfio-ccw: Rearrange pfn_array and pfn_array_table arrays
s390/cio: Use generalized CCW handler in cp_init()
s390/cio: Generalize the TIC handler
s390/cio: Refactor the routine that handles TIC CCWs
s390/cio: Squash cp_free() and cp_unpin_free()
Eric Farman [Tue, 18 Jun 2019 20:23:51 +0000 (22:23 +0200)]
vfio-ccw: Factor out the ccw0-to-ccw1 transition
This is a really useful function, but it's buried in the
copy_ccw_from_iova() routine so that ccwchain_calc_length()
can just work with Format-1 CCWs while doing its counting.
But it means we're translating a full 2K of "CCWs" to Format-1,
when in reality there's probably far fewer in that space.
Let's factor it out, so maybe we can do something with it later.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190618202352.39702-5-farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 18 Jun 2019 20:23:49 +0000 (22:23 +0200)]
vfio-ccw: Skip second copy of guest cp to host
We already pinned/copied/unpinned 2K (256 CCWs) of guest memory
to the host space anchored off vfio_ccw_private. There's no need
to do that again once we have the length calculated, when we could
just copy the section we need to the "permanent" space for the I/O.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190618202352.39702-3-farman@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 18 Jun 2019 20:23:48 +0000 (22:23 +0200)]
vfio-ccw: Move guest_cp storage into common struct
Rather than allocating/freeing a piece of memory every time
we try to figure out how long a CCW chain is, let's use a piece
of memory allocated for each device.
The io_mutex added with commit 4f76617378ee9 ("vfio-ccw: protect
the I/O region") is held for the duration of the VFIO_CCW_EVENT_IO_REQ
event that accesses/uses this space, so there should be no race
concerns with another CPU attempting an (unexpected) SSCH for the
same device.
Heiko Carstens [Mon, 17 Jun 2019 12:02:41 +0000 (14:02 +0200)]
s390: fix stfle zero padding
The stfle inline assembly returns the number of double words written
(condition code 0) or the double words it would have written
(condition code 3), if the memory array it got as parameter would have
been large enough.
The current stfle implementation assumes that the array is always
large enough and clears those parts of the array that have not been
written to with a subsequent memset call.
If however the array is not large enough memset will get a negative
length parameter, which means that memset clears memory until it gets
an exception and the kernel crashes.
To fix this simply limit the maximum length. Move also the inline
assembly to an extra function to avoid clobbering of register 0, which
might happen because of the added min_t invocation together with code
instrumentation.
The bug was introduced with commit 14375bc4eb8d ("[S390] cleanup
facility list handling") but was rather harmless, since it would only
write to a rather large array. It became a potential problem with
commit 3ab121ab1866 ("[S390] kernel: Add z/VM LGR detection"). Since
then it writes to an array with only four double words, while some
machines already deliver three double words. As soon as machines have
a facility bit within the fifth double a crash on IPL would happen.
Heiko Carstens [Mon, 17 Jun 2019 12:02:40 +0000 (14:02 +0200)]
s390: replace defconfig with performance_defconfig
Replace defconfig with performance_defconfig. defconfig had some more
or less random debug options enabled, where nobody knows why anymore.
Just remove the old defconfig and replace it with performance_defconfig,
which reduces the number of configs to maintain. A config with debugging
options enabled is debug_defconfig which is supposed to be rather close
to performance_defconfig except that is has debug options enabled.
Eric Farman [Thu, 6 Jun 2019 20:28:31 +0000 (22:28 +0200)]
s390/cio: Combine direct and indirect CCW paths
With both the direct-addressed and indirect-addressed CCW paths
simplified to this point, the amount of shared code between them is
(hopefully) more easily visible. Move the processing of IDA-specific
bits into the direct-addressed path, and add some useful commentary of
what the individual pieces are doing. This allows us to remove the
entire ccwchain_fetch_idal() routine and maintain a single function
for any non-TIC CCW.
Eric Farman [Thu, 6 Jun 2019 20:28:28 +0000 (22:28 +0200)]
vfio-ccw: Adjust the first IDAW outside of the nested loops
Now that pfn_array_table[] is always an array of 1, it seems silly to
check for the very first entry in an array in the middle of two nested
loops, since we know it'll only ever happen once.
Let's move this outside the loops to simplify things, even though
the "k" variable is still necessary.
Eric Farman [Thu, 6 Jun 2019 20:28:27 +0000 (22:28 +0200)]
vfio-ccw: Rearrange pfn_array and pfn_array_table arrays
While processing a channel program, we currently have two nested
arrays that carry a slightly different structure. The direct CCW
path creates this:
ccwchain->pfn_array_table[1]->pfn_array[#pages]
while an IDA CCW creates:
ccwchain->pfn_array_table[#idaws]->pfn_array[1]
The distinction appears to state that each pfn_array_table entry
points to an array of contiguous pages, represented by a pfn_array,
um, array. Since the direct-addressed scenario can ONLY represent
contiguous pages, it makes the intermediate array necessary but
difficult to recognize. Meanwhile, since an IDAL can contain
non-contiguous pages and there is no logic in vfio-ccw to detect
adjacent IDAWs, it is the second array that is necessary but appearing
to be superfluous.
I am not aware of any documentation that states the pfn_array[] needs
to be of contiguous pages; it is just what the code does today.
I don't see any reason for this either, let's just flip the IDA
codepath around so that it generates:
ch_pat->pfn_array_table[1]->pfn_array[#idaws]
This will bring it in line with the direct-addressed codepath,
so that we can understand the behavior of this memory regardless
of what type of CCW is being processed. And it means the casual
observer does not need to know/care whether the pfn_array[]
represents contiguous pages or not.
NB: The existing vfio-ccw code only supports 4K-block Format-2 IDAs,
so that "#pages" == "#idaws" in this area. This means that we will
have difficulty with this overlap in terminology if support for
Format-1 or 2K-block Format-2 IDAs is ever added. I don't think that
this patch changes our ability to make that distinction.
Eric Farman [Thu, 6 Jun 2019 20:28:25 +0000 (22:28 +0200)]
s390/cio: Generalize the TIC handler
Refactor ccwchain_handle_tic() into a routine that handles a channel
program address (which itself is a CCW pointer), rather than a CCW pointer
that is only a TIC CCW. This will make it easier to reuse this code for
other CCW commands.
Eric Farman [Thu, 6 Jun 2019 20:28:24 +0000 (22:28 +0200)]
s390/cio: Refactor the routine that handles TIC CCWs
Extract the "does the target of this TIC already exist?" check from
ccwchain_handle_tic(), so that it's easier to refactor that function
into one that cp_init() is able to use.
Eric Farman [Thu, 6 Jun 2019 20:28:23 +0000 (22:28 +0200)]
s390/cio: Squash cp_free() and cp_unpin_free()
The routine cp_free() does nothing but call cp_unpin_free(), and while
most places call cp_free() there is one caller of cp_unpin_free() used
when the cp is guaranteed to have not been marked initialized.
This seems like a dubious way to make a distinction, so let's combine
these routines and make cp_free() do all the work.
Heiko Carstens [Sat, 8 Jun 2019 10:13:57 +0000 (12:13 +0200)]
processor: get rid of cpu_relax_yield
stop_machine is the only user left of cpu_relax_yield. Given that it
now has special semantics which are tied to stop_machine introduce a
weak stop_machine_yield function which architectures can override, and
get rid of the generic cpu_relax_yield implementation.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The stop_machine loop to advance the state machine and to wait for all
affected CPUs to check-in calls cpu_relax_yield in a tight loop until
the last missing CPUs acknowledged the state transition.
On a virtual system where not all logical CPUs are backed by real CPUs
all the time it can take a while for all CPUs to check-in. With the
current definition of cpu_relax_yield a diagnose 0x44 is done which
tells the hypervisor to schedule *some* other CPU. That can be any
CPU and not necessarily one of the CPUs that need to run in order to
advance the state machine. This can lead to a pretty bad diagnose 0x44
storm until the last missing CPU finally checked-in.
Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
directed yield. Each CPU in the wait loop will pick up the next CPU
in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
hypervisor to run this next CPU instead of the current one. If there
is only a limited number of real CPUs backing the virtual CPUs we
end up with the real CPUs passed around in a round-robin fashion.
[heiko.carstens@de.ibm.com]:
Use cpumask_next_wrap as suggested by Peter Zijlstra.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Mon, 1 Oct 2018 17:01:58 +0000 (19:01 +0200)]
virtio/s390: use DMA memory for ccw I/O and classic notifiers
Before virtio-ccw could get away with not using DMA API for the pieces of
memory it does ccw I/O with. With protected virtualization this has to
change, since the hypervisor needs to read and sometimes also write these
pieces of memory.
The hypervisor is supposed to poke the classic notifiers, if these are
used, out of band with regards to ccw I/O. So these need to be allocated
as DMA memory (which is shared memory for protected virtualization
guests).
Let us factor out everything from struct virtio_ccw_device that needs to
be DMA memory in a satellite that is allocated as such.
Note: The control blocks of I/O instructions do not need to be shared.
These are marshalled by the ultravisor.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Mon, 3 Dec 2018 16:18:07 +0000 (17:18 +0100)]
virtio/s390: add indirection to indicators access
This will come in handy soon when we pull out the indicators from
virtio_ccw_device to a memory area that is shared with the hypervisor
(in particular for protected virtualization guests).
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Thu, 13 Sep 2018 16:57:16 +0000 (18:57 +0200)]
s390/airq: use DMA memory for adapter interrupts
Protected virtualization guests have to use shared pages for airq
notifier bit vectors, because the hypervisor needs to write these bits.
Let us make sure we allocate DMA memory for the notifier bit vectors by
replacing the kmem_cache with a dma_cache and kalloc() with
cio_dma_zalloc().
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sebastian Ott <sebott@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Tue, 26 Mar 2019 11:41:09 +0000 (12:41 +0100)]
s390/cio: add basic protected virtualization support
As virtio-ccw devices are channel devices, we need to use the
dma area within the common I/O layer for any communication with
the hypervisor.
Note that we do not need to use that area for control blocks
directly referenced by instructions, e.g. the orb.
It handles neither QDIO in the common code, nor any device type specific
stuff (like channel programs constructed by the DASD driver).
An interesting side effect is that virtio structures are now going to
get allocated in 31 bit addressable storage.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sebastian Ott <sebott@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Tue, 2 Apr 2019 16:47:29 +0000 (18:47 +0200)]
s390/cio: introduce DMA pools to cio
To support protected virtualization cio will need to make sure the
memory used for communication with the hypervisor is DMA memory.
Let us introduce one global pool for cio.
Our DMA pools are implemented as a gen_pool backed with DMA pages. The
idea is to avoid each allocation effectively wasting a page, as we
typically allocate much less than PAGE_SIZE.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Sebastian Ott <sebott@linux.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Halil Pasic [Thu, 13 Sep 2018 16:57:16 +0000 (18:57 +0200)]
s390/mm: force swiotlb for protected virtualization
On s390, protected virtualization guests have to use bounced I/O
buffers. That requires some plumbing.
Let us make sure, any device that uses DMA API with direct ops correctly
is spared from the problems, that a hypervisor attempting I/O to a
non-shared page would bring.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.ibm.com> Tested-by: Michael Mueller <mimu@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
s390/crypto: sha: Use -ENODEV instead of -EOPNOTSUPP
Let's use the error value that is typically used if HW support is not
available when trying to load a module - this is also what systemd's
systemd-modules-load.service expects.
Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
s390/crypto: prng: Use -ENODEV instead of -EOPNOTSUPP
Let's use the error value that is typically used if HW support is not
available when trying to load a module - this is also what systemd's
systemd-modules-load.service expects.
Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
s390/crypto: ghash: Use -ENODEV instead of -EOPNOTSUPP
Let's use the error value that is typically used if HW support is not
available when trying to load a module - this is also what systemd's
systemd-modules-load.service expects.
Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
systemd-modules-load.service automatically tries to load the pkey module
on systems that have MSA.
Pkey also requires the MSA3 facility and a bunch of subfunctions.
Failing with -EOPNOTSUPP makes "systemd-modules-load.service" fail on
any system that does not have all needed subfunctions. For example,
when running under QEMU TCG (but also on systems where protected keys
are disabled via the HMC).
Let's use -ENODEV, so systemd-modules-load.service properly ignores
failing to load the pkey module because of missing HW functionality.
While at it, also convert the -EOPNOTSUPP in pkey_clr2protkey() to -ENODEV.
Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Instead of keeping the documentation inside s390dbf.rst,
move them to arch/s390/include/asm/debug.h, using standard
kernel-doc markups.
Keeping the documentation close to the code helps to keep it
updated. It also makes easier to document other stuff inside
debug.h, as all it needs is to add kernel-doc markups inside
it, as the file will be already be included at the produced
documentation.
-
Those were converted to kerneldoc using this script specially
designed to parse ths file, and manually editted:
<script>
use strict;
my $mode = "";
my $parameter = "";
my $ret = "";
my $descr = "";
sub add_var($)
{
my $ln = shift;
$ln =~ s/^\s+//;
$ln =~ s/\s+$//;
return if ($ln eq "");
$ln =~ s/^(\S+)\s+/$1\t/;
print " * \@$ln\n";
}
sub add_return($)
{
my $ln = shift;
print " *\n * Return:\n" if ($mode ne "Return Value:");
$ln =~ s/^\s+//;
$ln =~ s/\s+$//;
return if ($ln eq "");
print " * - $ln\n";
}
sub add_description($)
{
my $ln = shift;
print " *\n * \n" if ($mode ne "Description:");
$ln =~ s/^\s+//;
$ln =~ s/\s+$//;
return if ($ln eq "");
print " * $ln\n";
}
sub flush_results()
{
print " */\n\n";
}
while (<>) {
if (m/^[\-]+$/) {
flush_results();
$mode = "";
$parameter = "";
$ret = "";
$descr = "";
next;
}
if (m/(Parameter:)(.*)/) {
print " *\n" if ($mode eq "func");
add_var($2);
$mode = $1;
next;
}
if (m/(Return Value:)(.*)/) {
add_return($2);
$mode = $1;
next;
}
if (m/(Description:)(.*)/) {
add_description($2);
$mode = $1;
next;
}
if ($mode eq "Parameter:") {
add_var($_);
next;
}
if ($mode eq "Return Value:") {
add_return($_);
next;
}
if ($mode eq "Description:") {
add_description($_);
next;
}
next if (m/^\s*$/);
docs: s390: convert docs to ReST and rename to *.rst
Convert all text files with s390 documentation to ReST format.
Tried to preserve as much as possible the original document
format. Still, some of the files required some work in order
for it to be visible on both plain text and after converted
to html.
The conversion is actually:
- add blank lines and identation in order to identify paragraphs;
- fix tables markups;
- add some lists markups;
- mark literal blocks;
- adjust title markups.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Julian Wiedmann [Mon, 3 Jun 2019 05:47:04 +0000 (07:47 +0200)]
s390/qdio: handle PENDING state for QEBSM devices
When a CQ-enabled device uses QEBSM for SBAL state inspection,
get_buf_states() can return the PENDING state for an Output Queue.
get_outbound_buffer_frontier() isn't prepared for this, and any PENDING
buffer will permanently stall all further completion processing on this
Queue.
This isn't a concern for non-QEBSM devices, as get_buf_states() for such
devices will manually turn PENDING buffers into EMPTY ones.
Fixes: 104ea556ee7f ("qdio: support asynchronous delivery of storage blocks") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
arch/s390/boot/ipl_report.c: In function 'find_bootdata_space':
arch/s390/boot/ipl_report.c:42:26: warning: taking address of packed member of 'struct ipl_rb_components' may result in an unaligned pointer value [-Waddress-of-packed-member]
42 | for_each_rb_entry(comp, comps)
| ^~~~~
This is effectively the s390 variant of commit 20c6c1890455
("x86/boot: Disable the address-of-packed-member compiler warning").
Sebastian Ott [Tue, 4 Jun 2019 11:51:36 +0000 (13:51 +0200)]
s390/cio: fix kdoc for tiqdio_thinint_handler
Add missing parameter description to fix the following warning:
drivers/s390/cio/qdio_thinint.c:183: warning:
Function parameter or member 'floating' not described in 'tiqdio_thinint_handler'
Signed-off-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Within an EP11 cprb there exists a byte field flags. Bit 0x20
of this field indicates a special cprb. A special cprb triggers
special handling in the firmware below the OS layer.
However, a special cprb also needs to have the S bit in GPR0
set when NQAP is called. This was not the case for EP11 cprbs
and this patch now introduces the code to support this.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Masahiro Yamada [Tue, 4 Jun 2019 08:29:47 +0000 (17:29 +0900)]
s390: fix unrecognized __aligned() in uapi header
__aligned() is a shorthand that is only available in the kernel space
because it is defined in include/linux/compiler_attributes.h, which is
not exported to the user space.
Detected by compile-testing exported headers.
./usr/include/asm/runtime_instr.h:60:37: error: expected declaration specifiers or ‘...’ before numeric constant
} __attribute__((packed)) __aligned(8);
^
Remove the CONFIG_UEVENT_HELPER_PATH because:
1. It is disabled since commit 1be01d4a5714 ("driver: base: Disable
CONFIG_UEVENT_HELPER by default") as its dependency (UEVENT_HELPER) was
made default to 'n',
2. It is not recommended (help message: "This should not be used today
[...] creates a high system load") and was kept only for ancient
userland,
3. Certain userland specifically requests it to be disabled (systemd
README: "Legacy hotplug slows down the system and confuses udev").
Heiko Carstens [Mon, 3 Jun 2019 13:22:25 +0000 (15:22 +0200)]
s390: enforce CONFIG_HOTPLUG_CPU
x86 and powerpc (partially) enforce already CONFIG_HOTPLUG_CPU. On
s390 it is enabled on all distributions by default since ages.
The only exception is our zfcpdump kernel.
However to simplify testing, enforce HOTPLUG_CPU. This was suggested
by Paul McKenney, since his rcutorture test environments for CONFIG_SMP=y
only support HOTPLUG_CPU=y.
Suggested-by: Paul E. McKenney <paulmck@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Heiko Carstens [Mon, 3 Jun 2019 12:25:18 +0000 (14:25 +0200)]
s390: enforce CONFIG_SMP
There never have been distributions that shiped with CONFIG_SMP=n for
s390. In addition the kernel currently doesn't even compile with
CONFIG_SMP=n for s390. Most likely it wouldn't even work, even if we
fix the compile error, since nobody tests it, since there is no use
case that I can think of.
Therefore simply enforce CONFIG_SMP and get rid of some more or
less unused code.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
s390/mm: mmap base does not depend on ADDR_NO_RANDOMIZE personality
randomize_stack_top() checks for current task flag PF_RANDOMIZE in order
to use stack randomization and PF_RANDOMIZE is set when
ADDR_NO_RANDOMIZE is unset, so no need to check for ADDR_NO_RANDOMIZE
in stack_maxrandom_size.
[heiko.carstens@de.ibm.com]: See also commit 01578e36163c ("x86/elf:
Remove the unnecessary ADDR_NO_RANDOMIZE checks")
Masahiro Yamada [Fri, 17 May 2019 07:54:27 +0000 (16:54 +0900)]
s390: drop meaningless 'targets' from tools Makefile
'targets' should be specified to include .*.cmd files to evaluate
if_changed or friends.
Here, facility-defs.h and dis-defs.h are generated by filechk.
Because filechk does not generate .*.cmd file, the 'targets' addition
is meaningless. The filechk correctly updates the target when its
content is changed.
Masahiro Yamada [Fri, 17 May 2019 07:54:24 +0000 (16:54 +0900)]
s390: do not pass $(LINUXINCLUDE) to gen_opcode_table.c
I guess HOSTCFLAGS_gen_opcode_table.o was blindly copied from
HOSTCFLAGS_gen_facilities.o
The reason of adding $(LINUXINCLUDE) to HOSTCFLAGS_gen_facilities.o
is because gen_facilities.c references some CONFIG options. (Kbuild
does not cater to this for host tools automatically.)
On the other hand, gen_opcode_table.c does not reference CONFIG
options at all. So, there is no good reason to pass $(LINUXINCLUDE).
s390/jump_label: replace stop_machine with smp_call_function
The use of stop_machine to replace the mask bits of the jump label branch
is a very heavy-weight operation. This is in fact not necessary, the
mask of the branch can simply be updated, followed by a signal processor
to all the other CPUs to force them to pick up the modified instruction.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
[heiko.carstens@de.ibm.com]: Change jump_label_make_nop() so we get
brcl 0,offset instead of brcl 0,0. This
makes sure that only the mask part of the
instruction gets changed when updated. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Eric Farman [Thu, 16 May 2019 16:14:03 +0000 (18:14 +0200)]
s390/cio: Remove vfio-ccw checks of command codes
If the CCW being processed is a No-Operation, then by definition no
data is being transferred. Let's fold those checks into the normal
CCW processors, rather than skipping out early.
Likewise, if the CCW being processed is a "test" (a category defined
here as an opcode that contains zero in the lowest four bits) then no
special processing is necessary as far as vfio-ccw is concerned.
These command codes have not been valid since the S/370 days, meaning
they are invalid in the same way as one that ends in an eight [1] or
an otherwise valid command code that is undefined for the device type
in question. Considering that, let's just process "test" CCWs like
any other CCW, and send everything to the hardware.
[1] POPS states that a x08 is a TIC CCW, and that having any high-order
bits enabled is invalid for format-1 CCWs. For format-0 CCWs, the
high-order bits are ignored.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190516161403.79053-4-farman@linux.ibm.com> Acked-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
CCW[0] moves a little more than two pages, but also has the
Suppress Length Indication (SLI) bit set to handle the expectation
that considerably less data will be moved. CCW[1] also has the SLI
bit set, and has a length of zero. Once vfio-ccw does its magic,
the kernel issues a start subchannel on behalf of the guest with this:
Both CCWs were converted to an IDAL and have the corresponding flags
set (which is by design), but only the address of the first data
address is converted to something the host is aware of. The second
CCW still has the address used by the guest, which happens to be (A)
(probably) an invalid address for the host, and (B) an invalid IDAW
address (doubleword boundary, etc.).
While the I/O fails, it doesn't fail correctly. In this example, we
would receive a program check for an invalid IDAW address, instead of
a unit check for an invalid command.
To fix this, revert commit 4cebc5d6a6ff ("vfio: ccw: validate the
count field of a ccw before pinning") and allow the individual fetch
routines to process them like anything else. We'll make a slight
adjustment to our allocation of the pfn_array (for direct CCWs) or
IDAL (for IDAL CCWs) memory, so that we have room for at least one
address even though no guest memory will be pinned and thus the
IDAW will not be populated with a host address.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190516161403.79053-3-farman@linux.ibm.com> Acked-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Thu, 16 May 2019 16:14:01 +0000 (18:14 +0200)]
s390/cio: Don't pin vfio pages for empty transfers
The skip flag of a CCW offers the possibility of data not being
transferred, but is only meaningful for certain commands.
Specifically, it is only applicable for a read, read backward, sense,
or sense ID CCW and will be ignored for any other command code
(SA22-7832-11 page 15-64, and figure 15-30 on page 15-75).
(A sense ID is xE4, while a sense is x04 with possible modifiers in the
upper four bits. So we will cover the whole "family" of sense CCWs.)
For those scenarios, since there is no requirement for the target
address to be valid, we should skip the call to vfio_pin_pages() and
rely on the IDAL address we have allocated/built for the channel
program. The fact that the individual IDAWs within the IDAL are
invalid is fine, since they aren't actually checked in these cases.
Set pa_nr to zero when skipping the pfn_array_pin() call, since it is
defined as the number of pages pinned and is used to determine
whether to call vfio_unpin_pages() upon cleanup.
The pfn_array_pin() routine returns the number of pages that were
pinned, but now might be skipped for some CCWs. Thus we need to
calculate the expected number of pages ourselves such that we are
guaranteed to allocate a reasonable number of IDAWs, which will
provide a valid address in CCW.CDA regardless of whether the IDAWs
are filled in with pinned/translated addresses or not.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190516161403.79053-2-farman@linux.ibm.com> Acked-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 14 May 2019 23:42:45 +0000 (01:42 +0200)]
s390/cio: Initialize the host addresses in pfn_array
Let's initialize the host address to something that is invalid,
rather than letting it default to zero. This just makes it easier
to notice when a pin operation has failed or been skipped.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190514234248.36203-5-farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 14 May 2019 23:42:44 +0000 (01:42 +0200)]
s390/cio: Split pfn_array_alloc_pin into pieces
The pfn_array_alloc_pin routine is doing too much. Today, it does the
alloc of the pfn_array struct and its member arrays, builds the iova
address lists out of a contiguous piece of guest memory, and asks vfio
to pin the resulting pages.
Let's effectively revert a significant portion of commit 5c1cfb1c3948
("vfio: ccw: refactor and improve pfn_array_alloc_pin()") such that we
break pfn_array_alloc_pin() into its component pieces, and have one
routine that allocates/populates the pfn_array structs, and another
that actually pins the memory. In the future, we will be able to
handle scenarios where pinning memory isn't actually appropriate.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190514234248.36203-4-farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 14 May 2019 23:42:43 +0000 (01:42 +0200)]
s390/cio: Set vfio-ccw FSM state before ioeventfd
Otherwise, the guest can believe it's okay to start another I/O
and bump into the non-idle state. This results in a cc=2 (with
the asynchronous CSCH/HSCH code) returned to the guest, which is
unfortunate since everything is otherwise working normally.
Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Message-Id: <20190514234248.36203-3-farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Eric Farman [Tue, 14 May 2019 23:42:42 +0000 (01:42 +0200)]
s390/cio: Update SCSW if it points to the end of the chain
Per the POPs [1], when processing an interrupt the SCSW.CPA field of an
IRB generally points to 8 bytes after the last CCW that was executed
(there are exceptions, but this is the most common behavior).
In the case of an error, this points us to the first un-executed CCW
in the chain. But in the case of normal I/O, the address points beyond
the end of the chain. While the guest generally only cares about this
when possibly restarting a channel program after error recovery, we
should convert the address even in the good scenario so that we provide
a consistent, valid, response upon I/O completion.
[1] Figure 16-6 in SA22-7832-11. The footnotes in that table also state
that this is true even if the resulting address is invalid or protected,
but moving to the end of the guest chain should not be a surprise.
Signed-off-by: Eric Farman <farman@linux.ibm.com>
Message-Id: <20190514234248.36203-2-farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Linus Torvalds [Sun, 2 Jun 2019 18:10:01 +0000 (11:10 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"Two fixes: a quirk for KVM guests running on certain AMD CPUs, and a
KASAN related build fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/CPU/AMD: Don't force the CPB cap when running under a hypervisor
x86/boot: Provide KASAN compatible aliases for string routines
Linus Torvalds [Sun, 2 Jun 2019 18:08:12 +0000 (11:08 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"On the kernel side there's a bunch of ring-buffer ordering fixes for a
reproducible bug, plus a PEBS constraints regression fix.
Plus tooling fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tools headers UAPI: Sync kvm.h headers with the kernel sources
perf record: Fix s390 missing module symbol and warning for non-root users
perf machine: Read also the end of the kernel
perf test vmlinux-kallsyms: Ignore aliases to _etext when searching on kallsyms
perf session: Add missing swap ops for namespace events
perf namespace: Protect reading thread's namespace
tools headers UAPI: Sync drm/drm.h with the kernel
tools headers UAPI: Sync drm/i915_drm.h with the kernel
tools headers UAPI: Sync linux/fs.h with the kernel
tools headers UAPI: Sync linux/sched.h with the kernel
tools arch x86: Sync asm/cpufeatures.h with the with the kernel
tools include UAPI: Update copy of files related to new fspick, fsmount, fsconfig, fsopen, move_mount and open_tree syscalls
perf arm64: Fix mksyscalltbl when system kernel headers are ahead of the kernel
perf data: Fix 'strncat may truncate' build failure with recent gcc
perf/ring-buffer: Use regular variables for nesting
perf/ring-buffer: Always use {READ,WRITE}_ONCE() for rb->user_page data
perf/ring_buffer: Add ordering to rb->nest increment
perf/ring_buffer: Fix exposing a temporarily decreased data_head
perf/x86/intel/ds: Fix EVENT vs. UEVENT PEBS constraints
Linus Torvalds [Sun, 2 Jun 2019 18:06:13 +0000 (11:06 -0700)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fixes from Ingo Molnar:
"Two EFI fixes: a quirk for weird systabs, plus add more robust error
handling in the old 1:1 mapping code"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Allow the number of EFI configuration tables entries to be zero
efi/x86/Add missing error handling to old_memmap 1:1 mapping code
Linus Torvalds [Sun, 2 Jun 2019 17:22:38 +0000 (10:22 -0700)]
Merge tag 'spdx-5.2-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull SPDX fixes from Greg KH:
"Here are just two small patches, that fix up some found SPDX
identifier issues.
The first patch fixes an error in a previous SPDX fixup patch, that
causes build errors when doing 'make clean' on the tree (the fact that
almost no one noticed it reflects the fact that kernel developers
don't like doing that option very often...)
The second patch fixes up a number of places in the tree where people
mistyped the string "SPDX-License-Identifier". Given that people can
not even type their own name all the time without mistakes, this was
bound to happen, and odds are, we will have to add some type of check
for this to checkpatch.pl to catch this happening in the future.
Both of these have passed testing by 0-day"
* tag 'spdx-5.2-rc3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
treewide: fix typos of SPDX-License-Identifier
crypto: ux500 - fix license comment syntax error
Linus Torvalds [Sun, 2 Jun 2019 17:21:04 +0000 (10:21 -0700)]
Merge tag 'powerpc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A minor fix to our IMC PMU code to print a less confusing error
message when the driver can't initialise properly.
A fix for a bug where a user requesting an unsupported branch sampling
filter can corrupt PMU state, preventing the PMU from counting
properly.
And finally a fix for a bug in our support for kexec_file_load(),
which prevented loading a kernel and initramfs. Most versions of kexec
don't yet use kexec_file_load().
Thanks to: Anju T Sudhakar, Dave Young, Madhavan Srinivasan, Ravi
Bangoria, Thiago Jung Bauermann"
* tag 'powerpc-5.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kexec: Fix loading of kernel + initramfs with kexec_file_load()
powerpc/perf: Fix MMCRA corruption by bhrb_filter
powerpc/powernv: Return for invalid IMC domain
Linus Torvalds [Sun, 2 Jun 2019 17:19:39 +0000 (10:19 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Fixes for PPC and s390"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry()
KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9
KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pages
KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslots
KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts
KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE device
KVM: PPC: Book3S HV: XIVE: Fix the enforced limit on the vCPU identifier
KVM: PPC: Book3S HV: XIVE: Do not test the EQ flag validity when resetting
KVM: PPC: Book3S HV: XIVE: Clear file mapping when device is released
KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu
KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list
KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup
KVM: PPC: Book3S HV: Avoid touching arch.mmu_ready in XIVE release functions
KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID
kvm: fix compile on s390 part 2
Linus Torvalds [Sun, 2 Jun 2019 17:16:09 +0000 (10:16 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal
Pull thermal SoC fix from Eduardo Valentin:
"A single revert, detected to cause issues on the tsens driver"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal:
Revert "drivers: thermal: tsens: Add new operation to check if a sensor is enabled"
Linus Torvalds [Sun, 2 Jun 2019 17:14:25 +0000 (10:14 -0700)]
Merge tag 'led-fixes-for-5.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED fix from Jacek Anaszewski:
"Fix for a recent change in LED core, that didn't take into account the
possibility of calling led_blink_setup() from atomic context"
* tag 'led-fixes-for-5.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
leds: avoid flush_work in atomic context
Linus Torvalds [Sun, 2 Jun 2019 16:26:34 +0000 (09:26 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Six minor fixes to device drivers and one to the multipath alua
handler.
The most extensive fix is the zfcp port remove prevention one, but
it's impact is only s390"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: libsas: delete sas port if expander discover failed
scsi: libsas: only clear phy->in_shutdown after shutdown event done
scsi: scsi_dh_alua: Fix possible null-ptr-deref
scsi: smartpqi: properly set both the DMA mask and the coherent DMA mask
scsi: zfcp: fix to prevent port_remove with pure auto scan LUNs (only sdevs)
scsi: zfcp: fix missing zfcp_port reference put on -EBUSY from port_remove
scsi: libcxgbi: add a check for NULL pointer in cxgbi_check_route()
Linus Torvalds [Sun, 2 Jun 2019 15:51:30 +0000 (08:51 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"Various fixes and followups"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, compaction: make sure we isolate a valid PFN
include/linux/generic-radix-tree.h: fix kerneldoc comment
kernel/signal.c: trace_signal_deliver when signal_group_exit
drivers/iommu/intel-iommu.c: fix variable 'iommu' set but not used
spdxcheck.py: fix directory structures
kasan: initialize tag to 0xff in __kasan_kmalloc
z3fold: fix sheduling while atomic
scripts/gdb: fix invocation when CONFIG_COMMON_CLK is not set
mm/gup: continue VM_FAULT_RETRY processing even for pre-faults
ocfs2: fix error path kobject memory leak
memcg: make it work on sparse non-0-node systems
mm, memcg: consider subtrees in memory.events
prctl_set_mm: downgrade mmap_sem to read lock
prctl_set_mm: refactor checks from validate_prctl_map
kernel/fork.c: make max_threads symbol static
arch/arm/boot/compressed/decompress.c: fix build error due to lz4 changes
arch/parisc/configs/c8000_defconfig: remove obsoleted CONFIG_DEBUG_SLAB_LEAK
mm/vmalloc.c: fix typo in comment
lib/sort.c: fix kernel-doc notation warnings
mm: fix Documentation/vm/hmm.rst Sphinx warnings
When we have holes in a normal memory zone, we could endup having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled ( via __reset_isolation_suitable(),
triggered by kswapd).
Later if we fail to find a page via fast_isolate_freepages(), we may end
up using the migrate_pfn we started the search with, as valid page. This
could lead to accessing NULL pointer derefernces like below, due to an
invalid mem_section pointer.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
Mem abort info:
ESR = 0x96000004
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
[0000000000000008] pgd=0000000000000000
Internal error: Oops: 96000004 [#1] SMP
...
CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : set_pfnblock_flags_mask+0x58/0xe8
lr : compaction_alloc+0x300/0x950
[...]
Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
Call trace:
set_pfnblock_flags_mask+0x58/0xe8
compaction_alloc+0x300/0x950
migrate_pages+0x1a4/0xbb0
compact_zone+0x750/0xde8
compact_zone_order+0xd8/0x118
try_to_compact_pages+0xb4/0x290
__alloc_pages_direct_compact+0x84/0x1e0
__alloc_pages_nodemask+0x5e0/0xe18
alloc_pages_vma+0x1cc/0x210
do_huge_pmd_anonymous_page+0x108/0x7c8
__handle_mm_fault+0xdd4/0x1190
handle_mm_fault+0x114/0x1c0
__get_user_pages+0x198/0x3c0
get_user_pages_unlocked+0xb4/0x1d8
__gfn_to_pfn_memslot+0x12c/0x3b8
gfn_to_pfn_prot+0x4c/0x60
kvm_handle_guest_abort+0x4b0/0xcd8
handle_exit+0x140/0x1b8
kvm_arch_vcpu_ioctl_run+0x260/0x768
kvm_vcpu_ioctl+0x490/0x898
do_vfs_ioctl+0xc4/0x898
ksys_ioctl+0x8c/0xa0
__arm64_sys_ioctl+0x28/0x38
el0_svc_common+0x74/0x118
el0_svc_handler+0x38/0x78
el0_svc+0x8/0xc
Code: f8607840f100001f8b0114019a801020 (f9400400)
---[ end trace af6a35219325a9b6 ]---
The issue was reported on an arm64 server with 128GB with holes in the
zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
running 100 KVM guest instances.
This patch fixes the issue by ensuring that the page belongs to a valid
PFN when we fallback to using the lower limit of the scan range upon
failure in fast_isolate_freepages().
Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target") Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reported-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zhenliang Wei [Sat, 1 Jun 2019 05:30:52 +0000 (22:30 -0700)]
kernel/signal.c: trace_signal_deliver when signal_group_exit
In the fixes commit, removing SIGKILL from each thread signal mask and
executing "goto fatal" directly will skip the call to
"trace_signal_deliver". At this point, the delivery tracking of the
SIGKILL signal will be inaccurate.
Therefore, we need to add trace_signal_deliver before "goto fatal" after
executing sigdelset.
Note: SEND_SIG_NOINFO matches the fact that SIGKILL doesn't have any info.
Link: http://lkml.kernel.org/r/20190425025812.91424-1-weizhenliang@huawei.com Fixes: cf43a757fd4944 ("signal: Restore the stop PTRACE_EVENT_EXIT") Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com> Reviewed-by: Christian Brauner <christian@brauner.io> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Ivan Delalande <colona@arista.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 1 Jun 2019 05:30:49 +0000 (22:30 -0700)]
drivers/iommu/intel-iommu.c: fix variable 'iommu' set but not used
Commit cf04eee8bf0e ("iommu/vt-d: Include ACPI devices in iommu=pt")
added for_each_active_iommu() in iommu_prepare_static_identity_mapping()
but never used the each element, i.e, "drhd->iommu".
drivers/iommu/intel-iommu.c: In function
'iommu_prepare_static_identity_mapping':
drivers/iommu/intel-iommu.c:3037:22: warning: variable 'iommu' set but
not used [-Wunused-but-set-variable]
struct intel_iommu *iommu;
Fixed the warning by appending a compiler attribute __maybe_unused for it.
Link: http://lkml.kernel.org/r/20190523013314.2732-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joerg Roedel <jroedel@suse.de> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The LICENSE directory has recently changed structure and this makes
spdxcheck fails as per below:
FAIL: "Blob or Tree named 'other' not found"
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 240, in <module>
spdx = read_spdxdata(repo)
File "scripts/spdxcheck.py", line 41, in read_spdxdata
for el in lictree[d].traverse():
[...]
KeyError: "Blob or Tree named 'other' not found"
Fix the script to restore the correctness on checkpatch License checking.
References: 62be257e986d ("LICENSES: Rename other to deprecated")
References: 8ea8814fcdcb ("LICENSES: Clearly mark dual license only licenses") Link: http://lkml.kernel.org/r/20190523084755.56739-1-vincenzo.frascino@arm.com Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Joe Perches <joe@perches.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jeremy Cline <jcline@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
warns:
mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
used here [-Wuninitialized]
kasan_unpoison_shadow(set_tag(object, tag), size);
^~~
set_tag ignores tag in this configuration but clang doesn't realize it at
this point in its pipeline, as it points to arch_kasan_set_tag as being
the point where it is used, which will later be expanded to (void
*)(object) without a use of tag. Initialize tag to 0xff, as it removes
this warning and doesn't change the meaning of the code.
Fabiano Rosas [Sat, 1 Jun 2019 05:30:36 +0000 (22:30 -0700)]
scripts/gdb: fix invocation when CONFIG_COMMON_CLK is not set
CLK_GET_RATE_NOCACHE depends on CONFIG_COMMON_CLK. Importing constants.py
when CONFIG_COMMON_CLK is not defined causes:
(gdb) lx-symbols
(...)
File "scripts/gdb/linux/proc.py", line 15, in <module>
from linux import constants
File "scripts/gdb/linux/constants.py", line 2, in <module>
LX_CLK_GET_RATE_NOCACHE = gdb.parse_and_eval("CLK_GET_RATE_NOCACHE")
gdb.error: No symbol "CLK_GET_RATE_NOCACHE" in current context.
Link: http://lkml.kernel.org/r/20190523195313.24701-1-farosas@linux.ibm.com Fixes: e7e6f462c1be ("scripts/gdb: print cached rate in lx-clk-summary") Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Reviewed-by: Stephen Boyd <swboyd@chromium.org> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Leonard Crestez <leonard.crestez@nxp.com> Cc: Jackie Liu <liuyun01@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sat, 1 Jun 2019 05:30:33 +0000 (22:30 -0700)]
mm/gup: continue VM_FAULT_RETRY processing even for pre-faults
When get_user_pages*() is called with pages = NULL, the processing of
VM_FAULT_RETRY terminates early without actually retrying to fault-in all
the pages.
If the pages in the requested range belong to a VMA that has userfaultfd
registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
has populated the page, but for the gup pre-fault case there's no actual
retry and the caller will get no pages although they are present.
This issue was uncovered when running post-copy memory restore in CRIU
after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
copy_fpstate_to_sigframe() fails").
After this change, the copying of FPU state to the sigframe switched from
copy_to_user() variants which caused a real page fault to get_user_pages()
with pages parameter set to NULL.
In post-copy mode of CRIU, the destination memory is managed with
userfaultfd and lack of the retry for pre-fault case in get_user_pages()
causes a crash of the restored process.
Making the pre-fault behavior of get_user_pages() the same as the "normal"
one fixes the issue.
Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails") Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940] Tested-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Borislav Petkov <bp@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a call to kobject_init_and_add() fails we should call kobject_put()
otherwise we leak memory.
Add call to kobject_put() in the error path of call to
kobject_init_and_add(). Please note, this has the side effect that the
release method is called if kobject_init_and_add() fails.
Link: http://lkml.kernel.org/r/20190513033458.2824-1-tobin@kernel.org Signed-off-by: Tobin C. Harding <tobin@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Sat, 1 Jun 2019 05:30:26 +0000 (22:30 -0700)]
memcg: make it work on sparse non-0-node systems
We have a single node system with node 0 disabled:
Scanning NUMA topology in Northbridge 24
Number of physical nodes 2
Skipping disabled node 0
Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
This causes crashes in memcg when system boots:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
#PF error: [normal kernel read fault]
...
RIP: 0010:list_lru_add+0x94/0x170
...
Call Trace:
d_lru_add+0x44/0x50
dput.part.34+0xfc/0x110
__fput+0x108/0x230
task_work_run+0x9f/0xc0
exit_to_usermode_loop+0xf5/0x100
It is reproducible as far as 4.12. I did not try older kernels. You have
to have a new enough systemd, e.g. 241 (the reason is unknown -- was not
investigated). Cannot be reproduced with systemd 234.
The system crashes because the size of lru array is never updated in
memcg_update_all_list_lrus and the reads are past the zero-sized array,
causing dereferences of random memory.
The root cause are list_lru_memcg_aware checks in the list_lru code. The
test in list_lru_memcg_aware is broken: it assumes node 0 is always
present, but it is not true on some systems as can be seen above.
So fix this by avoiding checks on node 0. Remember the memcg-awareness by
a bool flag in struct list_lru.
Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists") Signed-off-by: Jiri Slaby <jslaby@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Down [Sat, 1 Jun 2019 05:30:22 +0000 (22:30 -0700)]
mm, memcg: consider subtrees in memory.events
memory.stat and other files already consider subtrees in their output, and
we should too in order to not present an inconsistent interface.
The current situation is fairly confusing, because people interacting with
cgroups expect hierarchical behaviour in the vein of memory.stat,
cgroup.events, and other files. For example, this causes confusion when
debugging reclaim events under low, as currently these always read "0" at
non-leaf memcg nodes, which frequently causes people to misdiagnose breach
behaviour. The same confusion applies to other counters in this file when
debugging issues.
Aggregation is done at write time instead of at read-time since these
counters aren't hot (unlike memory.stat which is per-page, so it does it
at read time), and it makes sense to bundle this with the file
notifications.
After this patch, events are propagated up the hierarchy:
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 0
oom 0
oom_kill 0
[root@ktst ~]# systemd-run -p MemoryMax=1 true
Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
[root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
low 0
high 0
max 7
oom 1
oom_kill 1
As this is a change in behaviour, this can be reverted to the old
behaviour by mounting with the `memory_localevents' flag set. However, we
use the new behaviour by default as there's a lack of evidence that there
are any current users of memory.events that would find this change
undesirable.
akpm: this is a behaviour change, so Cc:stable. THis is so that
forthcoming distros which use cgroup v2 are more likely to pick up the
revised behaviour.
Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Koutný [Sat, 1 Jun 2019 05:30:19 +0000 (22:30 -0700)]
prctl_set_mm: downgrade mmap_sem to read lock
The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
semaphore taken.") added synchronization of reading argument/environment
boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
the coarse use of mmap_sem in similar situations. But there still
remained two places that (mis)use mmap_sem.
get_cmdline should also use arg_lock instead of mmap_sem when it reads the
boundaries.
The second place that should use arg_lock is in prctl_set_mm. By
protecting the boundaries fields with the arg_lock, we can downgrade
mmap_sem to reader lock (analogous to what we already do in
prctl_set_mm_map).
[akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Co-developed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Koutný [Sat, 1 Jun 2019 05:30:16 +0000 (22:30 -0700)]
prctl_set_mm: refactor checks from validate_prctl_map
Despite comment of validate_prctl_map claims there are no capability
checks, it is not completely true since commit 4d28df6152aa ("prctl: Allow
local CAP_SYS_ADMIN changing exe_file"). Extract the check out of the
function and make the function perform purely arithmetic checks.
This patch should not change any behavior, it is mere refactoring for
following patch.
[akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190502125203.24014-2-mkoutny@suse.com Signed-off-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
arch/arm/boot/compressed/decompress.c: fix build error due to lz4 changes
include/linux/cpumask.h: In function 'cpumask_parse':
include/linux/cpumask.h:636:21: error: implicit declaration of function 'strchrnul'; did you mean 'strchr'? [-Werror=implicit-function-declaration]
Because arch/arm/boot/compressed/decompress.c does
#define _LINUX_STRING_H_
preventing linux/string.h from providing strchrnul. It also #includes
asm/string.h, which for arm has a declaration of strchr(), explaining why
this didn't use to fail.
Link: http://lkml.kernel.org/r/20190528115346.f5a7kn3hdnuf5rts@linutronix.de Fixes: 3713a4e1fdb8d ("include/linux/cpumask.h: fix double string traverse in cpumask_parse") Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Yury Norov <ynorov@marvell.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>