Ravi Bangoria [Thu, 14 May 2020 11:17:29 +0000 (16:47 +0530)]
powerpc/watchpoint/ptrace: Return actual num of available watchpoints
User can ask for num of available watchpoints(dbginfo.num_data_bps)
using ptrace(PPC_PTRACE_GETHWDBGINFO). Return actual number of
available watchpoints on the machine rather than hardcoded 1.
Ravi Bangoria [Thu, 14 May 2020 11:17:28 +0000 (16:47 +0530)]
powerpc/watchpoint: Introduce function to get nr watchpoints dynamically
So far we had only one watchpoint, so we have hardcoded HBP_NUM to 1.
But Power10 is introducing 2nd DAWR and thus kernel should be able to
dynamically find actual number of watchpoints supported by hw it's
running on. Introduce function for the same. Also convert HBP_NUM macro
to HBP_NUM_MAX, which will now represent maximum number of watchpoints
supported by Powerpc.
Jordan Niethe [Wed, 6 May 2020 03:40:49 +0000 (13:40 +1000)]
powerpc sstep: Add support for prefixed load/stores
This adds emulation support for the following prefixed integer
load/stores:
* Prefixed Load Byte and Zero (plbz)
* Prefixed Load Halfword and Zero (plhz)
* Prefixed Load Halfword Algebraic (plha)
* Prefixed Load Word and Zero (plwz)
* Prefixed Load Word Algebraic (plwa)
* Prefixed Load Doubleword (pld)
* Prefixed Store Byte (pstb)
* Prefixed Store Halfword (psth)
* Prefixed Store Word (pstw)
* Prefixed Store Doubleword (pstd)
* Prefixed Load Quadword (plq)
* Prefixed Store Quadword (pstq)
the follow prefixed floating-point load/stores:
* Prefixed Load Floating-Point Single (plfs)
* Prefixed Load Floating-Point Double (plfd)
* Prefixed Store Floating-Point Single (pstfs)
* Prefixed Store Floating-Point Double (pstfd)
and for the following prefixed VSX load/stores:
* Prefixed Load VSX Scalar Doubleword (plxsd)
* Prefixed Load VSX Scalar Single-Precision (plxssp)
* Prefixed Load VSX Vector [0|1] (plxv, plxv0, plxv1)
* Prefixed Store VSX Scalar Doubleword (pstxsd)
* Prefixed Store VSX Scalar Single-Precision (pstxssp)
* Prefixed Store VSX Vector [0|1] (pstxv, pstxv0, pstxv1)
Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Balamuruhan S <bala24@linux.ibm.com>
[mpe: Use CONFIG_PPC64 not __powerpc64__, use get_op()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-30-jniethe5@gmail.com
Jordan Niethe [Wed, 6 May 2020 03:40:48 +0000 (13:40 +1000)]
powerpc: Support prefixed instructions in alignment handler
If a prefixed instruction results in an alignment exception, the
SRR1_PREFIXED bit is set. The handler attempts to emulate the
responsible instruction and then increment the NIP past it. Use
SRR1_PREFIXED to determine by how much the NIP should be incremented.
Prefixed instructions are not permitted to cross 64-byte boundaries. If
they do the alignment interrupt is invoked with SRR1 BOUNDARY bit set.
If this occurs send a SIGBUS to the offending process if in user mode.
If in kernel mode call bad_page_fault().
Jordan Niethe [Fri, 15 May 2020 02:12:55 +0000 (12:12 +1000)]
powerpc: Add prefixed instructions to instruction data type
For powerpc64, redefine the ppc_inst type so both word and prefixed
instructions can be represented. On powerpc32 the type will remain the
same. Update places which had assumed instructions to be 4 bytes long.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Rework the get_user_inst() macros to be parameterised, and don't
assign to the dest if an error occurred. Use CONFIG_PPC64 not
__powerpc64__ in a few places. Address other comments from
Christophe. Fix some sparse complaints.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-24-jniethe5@gmail.com
Jordan Niethe [Fri, 15 May 2020 01:15:28 +0000 (11:15 +1000)]
powerpc/optprobes: Add register argument to patch_imm64_load_insns()
Currently patch_imm32_load_insns() is used to load an instruction to
r4 to be emulated by emulate_step(). For prefixed instructions we
would like to be able to load a 64bit immediate to r4. To prepare for
this make patch_imm64_load_insns() take an argument that decides which
register to load an immediate to - rather than hardcoding r3.
Jordan Niethe [Wed, 6 May 2020 03:40:42 +0000 (13:40 +1000)]
powerpc: Define new SRR1 bits for a ISA v3.1
Add the BOUNDARY SRR1 bit definition for when the cause of an
alignment exception is a prefixed instruction that crosses a 64-byte
boundary. Add the PREFIXED SRR1 bit definition for exceptions caused
by prefixed instructions.
Bit 35 of SRR1 is called SRR1_ISI_N_OR_G. This name comes from it
being used to indicate that an ISI was due to the access being no-exec
or guarded. ISA v3.1 adds another purpose. It is also set if there is
an access in a cache-inhibited location for prefixed instruction.
Rename from SRR1_ISI_N_OR_G to SRR1_ISI_N_G_OR_CIP.
Alistair Popple [Wed, 6 May 2020 03:40:41 +0000 (13:40 +1000)]
powerpc: Enable Prefixed Instructions
Prefix instructions have their own FSCR bit which needs to enabled via
a CPU feature. The kernel will save the FSCR for problem state but it
needs to be enabled initially.
If prefixed instructions are made unavailable by the [H]FSCR, attempting
to use them will cause a facility unavailable exception. Add "PREFIX" to
the facility_strings[].
Currently there are no prefixed instructions that are actually emulated
by emulate_instruction() within facility_unavailable_exception().
However, when caused by a prefixed instructions the SRR1 PREFIXED bit is
set. Prepare for dealing with emulated prefixed instructions by checking
for this bit.
Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Link: https://lore.kernel.org/r/20200506034050.24806-22-jniethe5@gmail.com
Jordan Niethe [Wed, 6 May 2020 03:40:40 +0000 (13:40 +1000)]
powerpc: Make test_translate_branch() independent of instruction length
test_translate_branch() uses two pointers to instructions within a
buffer, p and q, to test patch_branch(). The pointer arithmetic done on
them assumes a size of 4. This will not work if the instruction length
changes. Instead do the arithmetic relative to the void * to the buffer.
Jordan Niethe [Wed, 6 May 2020 03:40:39 +0000 (13:40 +1000)]
powerpc/xmon: Move insertion of breakpoint for xol'ing
When a new breakpoint is created, the second instruction of that
breakpoint is patched with a trap instruction. This assumes the length
of the instruction is always the same. In preparation for prefixed
instructions, remove this assumption. Insert the trap instruction at the
same time the first instruction is inserted.
Jordan Niethe [Wed, 6 May 2020 03:40:38 +0000 (13:40 +1000)]
powerpc/xmon: Use a function for reading instructions
Currently in xmon, mread() is used for reading instructions. In
preparation for prefixed instructions, create and use a new function,
mread_instr(), especially for reading instructions.
Jordan Niethe [Wed, 6 May 2020 03:40:35 +0000 (13:40 +1000)]
powerpc/kprobes: Use patch_instruction()
Instead of using memcpy() and flush_icache_range() use
patch_instruction() which not only accomplishes both of these steps but
will also make it easier to add support for prefixed instructions.
Jordan Niethe [Wed, 6 May 2020 03:40:34 +0000 (13:40 +1000)]
powerpc: Add a probe_kernel_read_inst() function
Introduce a probe_kernel_read_inst() function to use in cases where
probe_kernel_read() is used for getting an instruction. This will be
more useful for prefixed instructions.
Jordan Niethe [Wed, 6 May 2020 03:40:33 +0000 (13:40 +1000)]
powerpc: Add a probe_user_read_inst() function
Introduce a probe_user_read_inst() function to use in cases where
probe_user_read() is used for getting an instruction. This will be
more useful for prefixed instructions.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Reviewed-by: Alistair Popple <alistair@popple.id.au>
[mpe: Don't write to *inst on error, fold in __user annotations] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-14-jniethe5@gmail.com
Jordan Niethe [Wed, 6 May 2020 03:40:32 +0000 (13:40 +1000)]
powerpc: Use a function for reading instructions
Prefixed instructions will mean there are instructions of different
length. As a result dereferencing a pointer to an instruction will not
necessarily give the desired result. Introduce a function for reading
instructions from memory into the instruction data type.
Jordan Niethe [Wed, 6 May 2020 03:40:31 +0000 (13:40 +1000)]
powerpc: Use a datatype for instructions
Currently unsigned ints are used to represent instructions on powerpc.
This has worked well as instructions have always been 4 byte words.
However, ISA v3.1 introduces some changes to instructions that mean
this scheme will no longer work as well. This change is Prefixed
Instructions. A prefixed instruction is made up of a word prefix
followed by a word suffix to make an 8 byte double word instruction.
No matter the endianness of the system the prefix always comes first.
Prefixed instructions are only planned for powerpc64.
Introduce a ppc_inst type to represent both prefixed and word
instructions on powerpc64 while keeping it possible to exclusively
have word instructions on powerpc32.
Jordan Niethe [Wed, 6 May 2020 03:40:28 +0000 (13:40 +1000)]
powerpc: Use a function for getting the instruction op code
In preparation for using a data type for instructions that can not be
directly used with the '>>' operator use a function for getting the op
code of an instruction.
Jordan Niethe [Wed, 6 May 2020 03:40:27 +0000 (13:40 +1000)]
powerpc: Use an accessor for instructions
In preparation for introducing a more complicated instruction type to
accommodate prefixed instructions use an accessor for getting an
instruction as a u32.
Jordan Niethe [Wed, 6 May 2020 03:40:26 +0000 (13:40 +1000)]
powerpc: Use a macro for creating instructions from u32s
In preparation for instructions having a more complex data type start
using a macro, ppc_inst(), for making an instruction out of a u32. A
macro is used so that instructions can be used as initializer elements.
Currently this does nothing, but it will allow for creating a data type
that can represent prefixed instructions.
Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
[mpe: Change include guard to _ASM_POWERPC_INST_H] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-7-jniethe5@gmail.com
Jordan Niethe [Wed, 6 May 2020 03:40:25 +0000 (13:40 +1000)]
powerpc: Change calling convention for create_branch() et. al.
create_branch(), create_cond_branch() and translate_branch() return the
instruction that they create, or return 0 to signal an error. Separate
these concerns in preparation for an instruction type that is not just
an unsigned int. Fill the created instruction to a pointer passed as
the first parameter to the function and use a non-zero return value to
signify an error.
Jordan Niethe [Wed, 6 May 2020 03:40:24 +0000 (13:40 +1000)]
powerpc/xmon: Use bitwise calculations in_breakpoint_table()
A modulo operation is used for calculating the current offset from a
breakpoint within the breakpoint table. As instruction lengths are
always a power of 2, this can be replaced with a bitwise 'and'. The
current check for word alignment can be replaced with checking that the
lower 2 bits are not set.
Suggested-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Jordan Niethe <jniethe5@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Alistair Popple <alistair@popple.id.au> Link: https://lore.kernel.org/r/20200506034050.24806-5-jniethe5@gmail.com
Jordan Niethe [Wed, 6 May 2020 03:40:23 +0000 (13:40 +1000)]
powerpc/xmon: Move breakpoints to text section
The instructions for xmon's breakpoint are stored bpt_table[] which is in
the data section. This is problematic as the data section may be marked
as no execute. Move bpt_table[] to the text section.
Jordan Niethe [Wed, 6 May 2020 03:40:22 +0000 (13:40 +1000)]
powerpc/xmon: Move breakpoint instructions to own array
To execute an instruction out of line after a breakpoint, the NIP is set
to the address of struct bpt::instr. Here a copy of the instruction that
was replaced with a breakpoint is kept, along with a trap so normal flow
can be resumed after XOLing. The struct bpt's are located within the
data section. This is problematic as the data section may be marked as
no execute.
Instead of each struct bpt holding the instructions to be XOL'd, make a
new array, bpt_table[], with enough space to hold instructions for the
number of supported breakpoints. A later patch will move this to the
text section.
Make struct bpt::instr a pointer to the instructions in bpt_table[]
associated with that breakpoint. This association is a simple mapping:
bpts[n] -> bpt_table[n * words per breakpoint]. Currently we only need
the copied instruction followed by a trap, so 2 words per breakpoint.
Jordan Niethe [Wed, 6 May 2020 03:40:21 +0000 (13:40 +1000)]
powerpc/xmon: Remove store_inst() for patch_instruction()
For modifying instructions in xmon, patch_instruction() can serve the
same role that store_inst() is performing with the advantage of not
being specific to xmon. In some places patch_instruction() is already
being using followed by store_inst(). In these cases just remove the
store_inst(). Otherwise replace store_inst() with patch_instruction().
Geoff Levand [Sat, 9 May 2020 18:58:32 +0000 (18:58 +0000)]
powerpc/ps3: Fix kexec shutdown hang
The ps3_mm_region_destroy() and ps3_mm_vas_destroy() routines
are called very late in the shutdown via kexec's mmu_cleanup_all
routine. By the time mmu_cleanup_all runs it is too late to use
udbg_printf, and calling it will cause PS3 systems to hang.
Remove all debugging statements from ps3_mm_region_destroy() and
ps3_mm_vas_destroy() and replace any error reporting with calls
to lv1_panic.
With this change builds with 'DEBUG' defined will not cause kexec
reboots to hang, and builds with 'DEBUG' defined or not will end
in lv1_panic if an error is encountered.
Since commit dcebd755926b ("block: use bio_for_each_bvec() to compute
multi-page bvec count"), the kernel will bug_on on the PS3 because
bio_split() is called with sectors == 0:
The problem originates from setting the segment boundary of the
request queue to -1UL. This makes get_max_segment_size() return zero
when offset is zero, whatever the max segment size. The test with
BLK_SEG_BOUNDARY_MASK fails and 'mask - (mask & offset) + 1' overflows
to zero in the return statement.
Not setting the segment boundary and using the default
value (BLK_SEG_BOUNDARY_MASK) fixes the problem.
Nicholas Piggin [Fri, 8 May 2020 04:34:07 +0000 (14:34 +1000)]
powerpc/traps: Make unrecoverable NMIs die instead of panic
System Reset and Machine Check interrupts that are not recoverable due
to being nested or interrupting when RI=0 currently panic. This is not
necessary, and can often just kill the current context and recover.
Nicholas Piggin [Fri, 8 May 2020 04:34:06 +0000 (14:34 +1000)]
powerpc/traps: Do not trace system reset
Similarly to the previous patch, do not trace system reset. This code
is used when there is a crash or hang, and tracing disturbs the system
more and has been known to crash in the crash handling path.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20200508043408.886394-15-npiggin@gmail.com
Nicholas Piggin [Fri, 8 May 2020 04:34:05 +0000 (14:34 +1000)]
powerpc/64s: machine check do not trace real-mode handler
Rather than notrace annotations throughout a significant part of the
machine check code across kernel/ pseries/ and powernv/ which can
easily be broken and is infrequently tested, use paca->ftrace_enabled
to blanket-disable tracing of the real-mode non-maskable handler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Link: https://lore.kernel.org/r/20200508043408.886394-14-npiggin@gmail.com
machine_check_early() is taken as an NMI, so nmi_enter() is used
there. machine_check_exception() is no longer taken as an NMI (it's
invoked via irq_work in the case a machine check hits in kernel mode),
so remove the nmi_enter() from that case.
In NMI context, hash faults don't try to refill the hash table, which
can lead to crashes accessing non-pinned kernel pages. System reset
still has this potential problem.
Nicholas Piggin [Fri, 8 May 2020 04:34:02 +0000 (14:34 +1000)]
powerpc/pseries: Machine check use rtas_call_unlocked() with args on stack
With the previous patch, machine checks can use rtas_call_unlocked()
which avoids the RTAS spinlock which would deadlock if a machine
check hits while making an RTAS call.
This also avoids the complex RTAS error logging which has more RTAS
calls and includes kmalloc (which can return memory beyond RMA, which
would also crash).
Nicholas Piggin [Fri, 8 May 2020 04:34:00 +0000 (14:34 +1000)]
powerpc/pseries/ras: fwnmi sreset should not interlock
PAPR does not specify that fwnmi sreset should be interlocked, and
PowerVM (and therefore now QEMU) do not require it.
These "ibm,nmi-interlock" calls are ignored by firmware, but there
is a possibility that the sreset could have interrupted a machine
check and release the machine check's interlock too early, corrupting
it if another machine check came in.
This is an extremely rare case, but it should be fixed for clarity
and reducing the code executed in the sreset path. Firmware also
does not provide error information for the sreset case to look at, so
remove that comment.
Nicholas Piggin [Fri, 8 May 2020 04:33:58 +0000 (14:33 +1000)]
powerpc/pseries/ras: Fix FWNMI_VALID off by one
This was discovered developing qemu fwnmi sreset support. This
off-by-one bug means the last 16 bytes of the rtas area can not
be used for a 16 byte save area.
It's not a serious bug, and QEMU implementation has to retain a
workaround for old kernels, but it's good to tighten it.
Nicholas Piggin [Fri, 8 May 2020 04:33:56 +0000 (14:33 +1000)]
powerpc/64s/exceptions: Machine check reconcile irq state
pseries fwnmi machine check code pops the soft-irq checks in rtas_call
(after the next patch to remove rtas_token from this call path).
Rather than play whack a mole with these and forever having fragile
code, it seems better to have the early machine check handler perform
the same kind of reconcile as the other NMI interrupts.
Nicholas Piggin [Fri, 8 May 2020 04:33:55 +0000 (14:33 +1000)]
powerpc/64s/exceptions: Change irq reconcile for NMIs from reusing _DAR to RESULT
A spare interrupt stack slot is needed to save irq state when
reconciling NMIs (sreset and decrementer soft-nmi). _DAR is used
for this, but we want to reconcile machine checks as well, which
do use _DAR. Switch to using RESULT instead, as it's used by
system calls.
The architecture allows for machine check exceptions to cause idle
wakeups which resume at the 0x200 address which has to return via
the idle wakeup code, but the early machine check handler is run
first.
The case of a no state-loss sleep is broken because the early
handler uses non-volatile register r1 , which is needed for the wakeup
protocol, but it is not restored.
Fix this by loading r1 from the MCE exception frame before returning
to the idle wakeup code. Also update the comment which has become
stale since the idle rewrite in C.
This crash was found and fix confirmed with a machine check injection
test in qemu powernv model (which is not upstream in qemu yet).
Sam Bobroff [Tue, 28 Apr 2020 03:45:06 +0000 (13:45 +1000)]
powerpc/eeh: Release EEH device state synchronously
EEH device state is currently removed (by eeh_remove_device()) during
the device release handler, which is invoked as the device's reference
count drops to zero. This may take some time, or forever, as other
threads may hold references.
However, the PCI device state is released synchronously by
pci_stop_and_remove_bus_device(). This mismatch causes problems, for
example the device may be re-discovered as a new device before the
release handler has been called, leaving the PCI and EEH state
mismatched.
So instead, call eeh_remove_device() from the bus device removal
handlers, which are called synchronously in the removal path.
Sam Bobroff [Tue, 28 Apr 2020 03:45:05 +0000 (13:45 +1000)]
powerpc/eeh: Fix pseries_eeh_configure_bridge()
If a device is hot unplgged during EEH recovery, it's possible for the
RTAS call to ibm,configure-pe in pseries_eeh_configure() to return
parameter error (-3), however negative return values are not checked
for and this leads to an infinite loop.
Fix this by correctly bailing out on negative values.
Our mitigation is currently always a barrier instruction, which
doesn't map that well onto the existing possibilities for the PR_SPEC
values.
However even if we added a "barrier" type PR_SPEC value, userspace
would still need to consult some other source to work out which type
of barrier to use. So reporting "vulnerable" seems sufficient, as
userspace can see that and then consult its source to determine what
barrier to use.
Michael Ellerman [Thu, 23 Apr 2020 06:00:38 +0000 (16:00 +1000)]
drivers/macintosh: Fix memleak in windfarm_pm112 driver
create_cpu_loop() calls smu_sat_get_sdb_partition() which does
kmalloc() and returns the allocated buffer. In fact it's called twice,
and neither buffer is freed.
Michael Ellerman [Sun, 26 Apr 2020 11:44:10 +0000 (21:44 +1000)]
selftests/powerpc: Add a test of counting larx/stcx
This is based on the count_instructions test.
However this one also counts the number of failed stcx's, and in
conjunction with knowing the size of the stcx loop, can calculate the
total number of instructions executed even in the face of
non-deterministic stcx failures.
Michael Ellerman [Tue, 28 Apr 2020 12:31:52 +0000 (22:31 +1000)]
powerpc: Drop unneeded cast in task_pt_regs()
There's no need to cast in task_pt_regs() as tsk->thread.regs should
already be a struct pt_regs. If someone's using task_pt_regs() on
something that's not a task but happens to have a thread.regs then
we'll deal with them later.
ie. if MSR_VSX is set then both of MSR_FP and MSR_VEC are also set.
Dumping tsk->thread.regs->msr we see that it's: 0x1db60000
Which is not a normal looking MSR, in fact the only valid bit is
MSR_VSX, all the other bits are reserved in the current definition of
the MSR.
We can see from the oops that it was swapper/0 that we were switching
from when we hit the warning, ie. init_task. So its thread.regs points
to the base (high addresses) in init_stack.
Dumping the content of init_task->thread.regs, with the members of
pt_regs annotated (the 16 bytes larger version), we see:
This looks suspiciously like stack frames, not a pt_regs. If we look
closely we can see return addresses from the stack trace above, c000000002004820 (start_kernel) and c00000000000c49c (start_here_common).
init_task->thread.regs is setup at build time in processor.h:
The early boot code where we setup the initial stack is:
LOAD_REG_ADDR(r3,init_thread_union)
/* set up a stack pointer */
LOAD_REG_IMMEDIATE(r1,THREAD_SIZE)
add r1,r3,r1
li r0,0
stdu r0,-STACK_FRAME_OVERHEAD(r1)
Which creates a stack frame of size 112 bytes (STACK_FRAME_OVERHEAD).
Which is far too small to contain a pt_regs.
So the result is init_task->thread.regs is pointing at some stack
frames on the init stack, not at a pt_regs.
We have gotten away with this for so long because with pt_regs at its
current size the MSR happens to point into the first frame, at a
location that is not written to by the early asm. With the 16 byte
expansion the MSR falls into the second frame, which is used by the
compiler, and collides with a saved register that tends to be
non-zero.
As far as I can see this has been wrong since the original merge of
64-bit ppc support, back in 2002.
Conceptually swapper should have no regs, it never entered from
userspace, and in fact that's what we do on 32-bit. It's also
presumably what the "bogus" comment is referring to.
So I think the right fix is to just not-initialise regs at all. I'm
slightly worried this will break some code that isn't prepared for a
NULL regs, but we'll have to see.
Remove the comment in head_64.S which refers to us setting up the
regs (even though we never did), and is otherwise not really accurate
any more.
powerpc/mm: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
powerpc: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:
struct foo {
int stuff;
struct boo array[];
};
By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.
Also, notice that, dynamic memory allocations won't be affected by
this change:
"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]
sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.
Nicholas Piggin [Thu, 7 May 2020 12:13:32 +0000 (22:13 +1000)]
powerpc: Use trap metadata to prevent double restart rather than zeroing trap
It's not very nice to zero trap for this, because then system calls no
longer have trap_is_syscall(regs) invariant, and we can't distinguish
between sc and scv system calls (in a later patch).
Take one last unused bit from the low bits of the pt_regs.trap word
for this instead. There is not a really good reason why it should be
in trap as opposed to another field, but trap has some concept of
flags and it exists. Ideally I think we would move trap to 2-byte
field and have 2 more bytes available independently.
Add a selftests case for this, which can be seen to fail if
trap_norestart() is changed to return false.
Nicholas Piggin [Thu, 7 May 2020 12:13:30 +0000 (22:13 +1000)]
powerpc: Use set_trap() and avoid open-coding trap masking
The pt_regs.trap field keeps 4 low bits for some metadata about the
trap or how it was handled, which is masked off in order to test the
architectural trap number.
Add a set_trap() accessor to set this, equivalent to TRAP() for
returning it. This is actually not quite the equivalent of TRAP()
because it always clears the low bits, which may be harmless if
it can only be updated via ptrace syscall, but it seems dangerous.
In fact settting TRAP from ptrace doesn't seem like a great idea
so maybe it's better deleted.
powerpc: module_[32|64].c: replace swap function with built-in one
Replace relaswap with built-in one, because relaswap
does a simple byte to byte swap.
Since Spectre mitigations have made indirect function calls more
expensive, and the default simple byte copies swap is implemented
without them, an "optimized" custom swap function is now
a waste of time as well as code.
Fix a cut'n'paste error in a warning message. This should be
'cpu-idle-state-residency-ns' to match the property searched in the
previous 'of_property_read_u32_array()'
Fixes: 9c7b185ab2fe ("powernv/cpuidle: Parse dt idle properties into global structure") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200502115949.139000-1-christophe.jaillet@wanadoo.fr
Merge the lockless page table walk rework into next
This merges the lockless page table walk rework series from Aneesh.
Because it touches powerpc KVM code we are sharing it with the kvm-ppc
tree in our topic/ppc-kvm branch.
This is the cover letter from Aneesh:
Avoid IPI while updating page table entries.
Problem Summary:
Slow termination of KVM guest with large guest RAM config due to a
large number of IPIs that were caused by clearing level 1 PTE
entries (THP) entries. This is shown in the stack trace below.
Why we need to do IPI when clearing PMD entries:
This was added as part of commit: 13bd817bb884 ("powerpc/thp: Serialize pmd clear against a linux page table walk")
serialize_against_pte_lookup makes sure that all parallel lockless
page table walk completes before we convert a PMD pte entry to regular
pmd entry. We end up doing that conversion in the below scenarios
The lockless page table walk work with the assumption that we can
dereference the page table contents without holding a lock. For this
to work, we need to make sure we read the page table contents
atomically and page table pages are not going to be freed/released
while we are walking the table pages. We can achieve by using a rcu
based freeing for page table pages or if the architecture implements
broadcast tlbie, we can block the IPI as we walk the page table pages.
To support both the above framework, lockless page table walk is done
with irq disabled instead of rcu_read_lock()
We do have two interface for lockless page table walk, gup fast and
__find_linux_pte. This patch series makes __find_linux_pte table walk
safe against the conversion of PMD PTE to regular PMD.
gup fast:
gup fast is already safe against THP split because kernel now
differentiate between a pmd split and a compound page split. gup fast
can run parallel to a pmd split and we prevent a parallel gup fast to
a hugepage split, by freezing the page refcount and failing the
speculative page ref increment.
Similar to how gup is safe against parallel pmd split, this patch
series updates the __find_linux_pte callers to be safe against a
parallel pmd split. We do that by enforcing the following rules.
1) Don't reload the pte value, because that can be updated in
parallel.
2) Code should be able to work with a stale PTE value and not the
recent one. ie, the pte value that we are looking at may not be the
latest value in the page table.
3) Before looking at pte value check for _PAGE_PTE bit. We now do this
as part of pte_present() check.
Performance:
This speeds up Qemu guest RAM del/unplug time as below
128 core, 496GB guest:
powerpc/mm/book3s64: Fix MADV_DONTNEED and parallel page fault race
MADV_DONTNEED holds mmap_sem in read mode and that implies a
parallel page fault is possible and the kernel can end up with a level 1 PTE
entry (THP entry) converted to a level 0 PTE entry without flushing
the THP TLB entry.
Most architectures including POWER have issues with kernel instantiating a level
0 PTE entry while holding level 1 TLB entries.
The code sequence I am looking at is
down_read(mmap_sem) down_read(mmap_sem)
zap_pmd_range()
zap_huge_pmd()
pmd lock held
pmd_cleared
table details added to mmu_gather
pmd_unlock()
insert a level 0 PTE entry()
tlb_finish_mmu().
Fix this by forcing a tlb flush before releasing pmd lock if this is
not a fullmm invalidate. We can safely skip this invalidate for
task exit case (fullmm invalidate) because in that case we are sure
there can be no parallel fault handlers.
This do change the Qemu guest RAM del/unplug time as below
powerpc/mm/book3s64: Avoid sending IPI on clearing PMD
Now that all the lockless page table walk is careful w.r.t the PTE
address returned, we can now revert
commit: 13bd817bb884 ("powerpc/thp: Serialize pmd clear against a linux page table walk.")
We also drop the equivalent IPI from other pte updates routines. We still keep
IPI in hash pmdp collapse and that is to take care of parallel hash page table
insert. The radix pmdp collapse flush can possibly be removed once I am sure
generic code doesn't have the any expectations around parallel gup walk.
This speeds up Qemu guest RAM del/unplug time as below
powerpc/kvm/book3s: use find_kvm_host_pte in pute_tce functions
Current code just hold rmap lock to ensure parallel page table update is
prevented. That is not sufficient. The kernel should also check whether
a mmu_notifer callback was running in parallel.
powerpc/kvm/nested: Add helper to walk nested shadow linux page table.
The locking rules for walking nested shadow linux page table is different from process
scoped table. Hence add a helper for nested page table walk and also
add check whether we are holding the right locks.
powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
The locking rules for walking partition scoped table is different from process
scoped table. Hence add a helper for secondary linux page table walk and also
add check whether we are holding the right locks.
powerpc/perf/callchain: Use __get_user_pages_fast in read_user_stack_slow
read_user_stack_slow is called with interrupts soft disabled and it copies contents
from the page which we find mapped to a specific address. To convert
userspace address to pfn, the kernel now uses lockless page table walk.
The kernel needs to make sure the pfn value read remains stable and is not released
and reused for another process while the contents are read from the page. This
can only be achieved by holding a page reference.
One of the first approaches I tried was to check the pte value after the kernel
copies the contents from the page. But as shown below we can still get it wrong
A lockless page table walk should be safe against parallel THP collapse, THP
split and madvise(MADV_DONTNEED)/parallel fault. This patch makes sure kernel
won't reload the pteval when checking for different conditions. The patch also added
a check for pte_present to make sure the kernel is indeed operating
on a PTE and not a pointer to level 0 table page.
The pfn value we find here can be different from the actual pfn on which
machine check happened. This can happen if we raced with a parallel update
of the page table. In such a scenario we end up isolating a wrong pfn. But that
doesn't have any other side effect.
powerpc/book3s64/hash: Use the pte_t address from the caller
Don't fetch the pte value using lockless page table walk. Instead use the value from the
caller. hash_preload is called with ptl lock held. So it is safe to use the
pte_t address directly.