powerpc/perf: Exclude pmc5/6 from the irrelevant PMU group constraints
PMU counter support functions enforces event constraints for group of
events to check if all events in a group can be monitored. Incase of
event codes using PMC5 and PMC6 ( 500fa and 600f4 respectively ), not
all constraints are applicable, say the threshold or sample bits. But
current code includes pmc5 and pmc6 in some group constraints (like
IC_DC Qualifier bits) which is actually not applicable and hence
results in those events not getting counted when scheduled along with
group of other events. Patch fixes this by excluding PMC5/6 from
constraints which are not relevant for it.
All threads of a SMT4/SMT8 core can either be part of CPU's coregroup
mask or outside the coregroup. Use this relation to reduce the
number of iterations needed to find all the CPUs that share the same
coregroup
Use a temporary mask to iterate through the CPUs that may share
coregroup mask. Also instead of setting one CPU at a time into
cpu_coregroup_mask, copy the SMT4/SMT8/submask at one shot.
powerpc/smp: Move coregroup mask updation to a new function
Move the logic for updating the coregroup mask of a CPU to its own
function. This will help in reworking the updation of coregroup mask in
subsequent patch.
All threads of a SMT4 core can either be part of this CPU's l2-cache
mask or not related to this CPU l2-cache mask. Use this relation to
reduce the number of iterations needed to find all the CPUs that share
the same l2-cache.
Use a temporary mask to iterate through the CPUs that may share l2_cache
mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask,
copy the SMT4/sub mask at one shot.
powerpc/smp: Check for duplicate topologies and consolidate
CACHE and COREGROUP domains are now part of default topology. However on
systems that don't support CACHE or COREGROUP, these domains will
eventually be degenerated. The degeneration happens per CPU. Do note the
current fixup_topology() logic ensures that mask of a domain that is not
supported on the current platform is set to the previous domain.
Instead of waiting for the scheduler to degenerated try to consolidate
based on their masks and sd_flags. This is done just before setting
the scheduler topology.
powerpc/smp: Depend on cpu_l1_cache_map when adding CPUs
Currently on hotplug/hotunplug, CPU iterates through all the CPUs in
its core to find threads in its thread group. However this info is
already captured in cpu_l1_cache_map. Hence reduce iterations and
cleanup add_cpu_to_smallcore_masks function.
powerpc/smp: Stop passing mask to update_mask_by_l2
update_mask_by_l2 is called only once. But it passes cpu_l2_cache_mask
as parameter. Instead of passing cpu_l2_cache_mask, use it directly in
update_mask_by_l2.
powerpc/smp: Limit CPUs traversed to within a node.
All the arch specific topology cpumasks are within a node/DIE.
However when setting these per CPU cpumasks, system traverses through
all the online CPUs. This is redundant.
Reduce the traversal to only CPUs that are online in the node to which
the CPU belongs to.
While offlining a CPU, system currently iterate through all the CPUs in
the DIE to clear sibling, l2_cache and smallcore maps. However if there
are more cores in a DIE, system can end up spending more time iterating
through CPUs which are completely unrelated.
Optimize this by only iterating through smaller but relevant cpumap.
If shared_cache is set, cpu_l2_cache_map should be relevant else
cpu_sibling_map would be relevant.
Anton Blanchard reported that his 4096 vcpu KVM guest took around 30
minutes to boot. He also analyzed it to the time taken to iterate while
setting the cpu_core_mask.
Further analysis shows that cpu_core_mask and cpu_cpu_mask for any CPU
would be equal on Power. However updating cpu_core_mask took forever to
update as its a per cpu cpumask variable. Instead cpu_cpu_mask was a per
NODE /per DIE cpumask that was shared by all the respective CPUs.
Also cpu_cpu_mask is needed from a scheduler perspective. However
cpu_core_map is an exported symbol. Hence stop updating cpu_core_map
and make it point to cpu_cpu_mask.
On Power, cpu_core_mask and cpu_cpu_mask refer to the same set of CPUs.
cpu_cpu_mask is needed by scheduler, hence look at deprecating
cpu_core_mask. Before deleting the cpu_core_mask, ensure its only user
is moved to cpu_cpu_mask.
powerpc/tm: Save and restore AMR on treclaim and trechkpt
Althought AMR is stashed in the checkpoint area, currently we don't save
it to the per thread checkpoint struct after a treclaim and so we don't
restore it either from that struct when we trechkpt. As a consequence when
the transaction is later rolled back the kernel space AMR value when the
trechkpt was done appears in userspace.
That commit saves and restores AMR accordingly on treclaim and trechkpt.
Since AMR value is also used in kernel space in other functions, it also
takes care of stashing kernel live AMR into the stack before treclaim and
before trechkpt, restoring it later, just before returning from tm_reclaim
and __tm_recheckpoint.
Is also fixes two nonrelated comments about CR and MSR.
When support for EEH on PowerNV was added a lot of pseries specific code
was made "generic" and some of the quirks of pseries EEH came along for the
ride. One of the stranger quirks is eeh_pe containing two types of PE
address: pe->addr and pe->config_addr. There reason for this appears to be
historical baggage rather than any real requirements.
On pseries EEH PEs are manipulated using RTAS calls. Each EEH RTAS call
takes a "PE configuration address" as an input which is used to identify
which EEH PE is being manipulated by the call. When initialising the EEH
state for a device the first thing we need to do is determine the
configuration address for the PE which contains the device so we can enable
EEH on that PE. This process is outlined in PAPR which is the modern
(i.e post-2003) FW specification for pseries. However, EEH support was
first described in the pSeries RISC Platform Architecture (RPA) and
although they are mostly compatible EEH is one of the areas where they are
not.
The major difference is that RPA doesn't actually have the concept of a PE.
On RPA systems the EEH RTAS calls are done on a per-device basis using the
same config_addr that would be passed to the RTAS functions to access PCI
config space (e.g. ibm,read-pci-config). The config_addr is not identical
since the function and config register offsets of the config_addr must be
set to zero. EEH operations being done on a per-device basis doesn't make a
whole lot of sense when you consider how EEH was implemented on legacy PCI
systems.
For legacy PCI(-X) systems EEH was implemented using special PCI-PCI
bridges which contained logic to detect errors and freeze the secondary
bus when one occurred. This means that the EEH enabled state is shared
among all devices behind that EEH bridge. As a result there's no way to
implement the per-device control required for the semantics specified by
RPA. It can be made to work if we assume that a separate EEH bridge exists
for each EEH capable PCI slot and there are no bridges behind those slots.
However, RPA also specifies the ibm,configure-bridge RTAS call for
re-initalising bridges behind EEH capable slots after they are reset due
to an EEH event so that is probably not a valid assumption. This
incoherence was fixed in later PAPR, which succeeded RPA. Unfortunately,
since Linux EEH support seems to have been implemented based on the RPA
spec some of the legacy assumptions were carried over (probably for POWER4
compatibility).
The fix made in PAPR was the introduction of the "PE" concept and
redefining the EEH RTAS calls (set-eeh-option, reset-slot, etc) to operate
on a per-PE basis so all devices behind an EEH bride would share the same
EEH state. The "config_addr" argument to the EEH RTAS calls became the
"PE_config_addr" and the OS was required to use the
ibm,get-config-addr-info RTAS call to find the correct PE address for the
device. When support for the new interfaces was added to Linux it was
implemented using something like:
config_addr = pdn->eeh_config_addr;
if (pdn->eeh_pe_config_addr)
config_addr = pdn->eeh_pe_config_addr;
rtas_call(..., config_addr, ...);
In other words, if the ibm,get-config-addr-info RTAS call is implemented
and returned a valid result we'd use that as the argument to the EEH
RTAS calls. If not, Linux would fall back to using the device's
config_addr. Over time these addresses have moved around going from pci_dn
to eeh_dev and finally into eeh_pe. Today the users look like this:
config_addr = pe->config_addr;
if (pe->addr)
config_addr = pe->addr;
rtas_call(..., config_addr, ...);
However, considering the EEH core always operates on a per-PE basis and
even on pseries the only per-device operation is the initial call to
ibm,set-eeh-option I'm not sure if any of this actually works on an RPA
system today. It doesn't make much sense to have the fallback address in
a generic structure either since the bulk of the code which reference it
is in pseries anyway.
The EEH core makes a token effort to support looking up a PE using the
config_addr by having two arguments to eeh_pe_get(). However, a survey of
all the callers to eeh_pe_get() shows that all bar one have the config_addr
argument hard-coded to zero.The only caller that doesn't is in
eeh_pe_tree_insert() which has:
if (!eeh_has_flag(EEH_VALID_PE_ZERO) && !edev->pe_config_addr)
return -EINVAL;
pe = eeh_pe_get(hose, edev->pe_config_addr, edev->bdfn);
The third argument (config_addr) is only used if the second (pe->addr)
argument is invalid. The preceding check ensures that the call to
eeh_pe_get() will never happen if edev->pe_config_addr is invalid so there
is no situation where eeh_pe_get() will search for a PE based on the 3rd
argument. The check also means that we'll never insert a PE into the tree
where pe_config_addr is zero since EEH_VALID_PE_ZERO is never set on
pseries. All the users of the fallback address on pseries never actually
use the fallback and all the only caller that supplies something for the
config_addr argument to eeh_pe_get() never use it either. It's all dead
code.
This patch removes the fallback address from eeh_pe since nothing uses it.
Specificly, we do this by:
1) Removing pe->config_addr
2) Removing the EEH_VALID_PE_ZERO flag
3) Removing the fallback address argument to eeh_pe_get().
4) Removing all the checks for pe->addr being zero in the pseries EEH code.
This leaves us with PE's only being identified by what's in their pe->addr
field and the EEH core relying on the platform to ensure that eeh_dev's are
only inserted into the EEH tree if they're actually inside a PE.
powerpc/pseries/eeh: Allow zero to be a valid PE configuration address
There's no real reason why zero can't be a valid PE configuration address.
Under qemu each sPAPR PHB (i.e. EEH supporting) has the passed-though
devices on bus zero, so the PE address of bus <dddd>:00 should be zero.
However, all previous versions of Linux will reject that, so Qemu at least
goes out of it's way to avoid it. The Qemu implementation of
ibm,get-config-addr-info2 RTAS has the following comment:
> /*
> * We always have PE address of form "00BB0001". "BB"
> * represents the bus number of PE's primary bus.
> */
So qemu puts a one into the register portion of the PE's config_addr to
avoid it being zero. The whole is pretty silly considering that RTAS will
return a negative error code if it can't map the device's config_addr to a
PE.
This patch fixes Linux to treat zero as a valid PE address. This shouldn't
have any real effects due to the Qemu hack mentioned above. And the fact
that Linux EEH has worked historically on PowerVM means they never pass
through devices on bus zero so we would never see the problem there either.
powerpc/pseries/eeh: Rework device EEH PE determination
The process Linux uses for determining if a device supports EEH or not
appears to be at odds with what PAPR says the OS should be doing. The
current flow is something like:
1. Assume pe_config_addr is equal the the device's config_addr.
2. Attempt to enable EEH on that PE
3. Verify EEH was enabled (POWER4 bug workaround)
4. Try find the pe_config_addr using the ibm,get-config-addr-info2 RTAS
call.
5. If that fails walk the pci_dn tree upwards trying to find a parent
device with EEH support. If we find one then add the device to that PE.
The first major problem with this process is that we need the PE config
address in step 2) since its needs to be passed to the ibm,set-eeh-option
RTAS call when enabling EEH for th PE. We hack around this requirement in
by making the assumption in 1) and delay finding the actual PE address
until 4). This is fine if:
a) The PCI device is the 0th function, and
b) The device is on the PE's root bus.
Granted, the current sequence does appear to work on most systems even when
these conditions are false. At a guess PowerVM's RTAS has workarounds to
accommodate Linux's quirks or the RTAS call to enable EEH is treated as
no-op on most platforms since EEH is usually enabled by default. However,
what is currently implemented is a bit sketch and is downright confusing
since it doesn't match up with what what PAPR suggests we should be doing.
This patch re-works how we handle EEH init so that we find the PE config
address using the ibm,get-config-addr-info2 RTAS call first, then use the
found address to finish the EEH init process. It also drops the Power4
workaround since as of commit 471d7ff8b51b ("powerpc/64s: Remove POWER4
support") the kernel does not support running on a Power4 CPU so there's
no need to support the Power4 platform's quirks either. With the patch
applied the sequence is now:
1. Find the pe_config_addr from the device using the RTAS call.
2. Enable the PE.
3. Insert the edev into the tree and create an eeh_pe if needed.
The other change made here is ignoring unsupported devices entirely.
Currently the device's BARs are saved to the eeh_dev even if the device is
not part of an EEH PE. Not being part of a PE means that an EEH recovery
pass will never see that device so the saving the BARs is pointless.
powerpc/pseries/eeh: Clean up pe_config_addr lookups
De-duplicate, and fix up the comments, and make the prototype just take a
pci_dn since the job of the function is to return the pe_config_addr of the
PE which contains a given device.
powerpc/eeh: Move EEH initialisation to an arch initcall
The initialisation of EEH mostly happens in a core_initcall_sync initcall,
followed by registering a bus notifier later on in an arch_initcall.
Anything involving initcall dependecies is mostly incomprehensible unless
you've spent a while staring at code so here's the full sequence:
ppc_md.setup_arch <-- pci_controllers are created here
...time passes...
core_initcall <-- pci_dns are created from DT nodes
core_initcall_sync <-- platforms call eeh_init()
postcore_initcall <-- PCI bus type is registered
postcore_initcall_sync
arch_initcall <-- EEH pci_bus notifier registered
subsys_initcall <-- PHBs are scanned here
There's no real requirement to do the EEH setup at the core_initcall_sync
level. It just needs to be done after pci_dn's are created and before we
start scanning PHBs. Simplify the flow a bit by moving the platform EEH
inititalisation to an arch_initcall so we can fold the bus notifier
registration into eeh_init().
Fold pseries_eeh_init() into eeh_pseries_init() rather than having
eeh_init() call it via eeh_ops->init(). It's simpler and it'll let us
delete eeh_ops.init.
Fold pnv_eeh_init() into eeh_powernv_init() rather than having eeh_init()
call it via eeh_ops->init(). It's simpler and it'll let us delete
eeh_ops.init.
Drop the EEH register / unregister ops thing and have the platform pass the
ops structure into eeh_init() directly. This takes one initcall out of the
EEH setup path and it means we're only doing EEH setup on the platforms
which actually support it. It's also less code and generally easier to
follow.
Daniel Axtens [Thu, 24 Sep 2020 01:49:22 +0000 (11:49 +1000)]
powerpc: PPC_SECURE_BOOT should not require PowerNV
In commit 61f879d97ce4 ("powerpc/pseries: Detect secure and trusted
boot state of the system.") we taught the kernel how to understand the
secure-boot parameters used by a pseries guest.
However, CONFIG_PPC_SECURE_BOOT still requires PowerNV. I didn't
catch this because pseries_le_defconfig includes support for
PowerNV and so everything still worked. Indeed, most configs will.
Nonetheless, technically PPC_SECURE_BOOT doesn't require PowerNV
any more.
The secure variables support (PPC_SECVAR_SYSFS) doesn't do anything
on pSeries yet, but I don't think it's worth adding a new condition -
at some stage we'll want to add a backend for pSeries anyway.
Fixes: 61f879d97ce4 ("powerpc/pseries: Detect secure and trusted boot state of the system.") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200924014922.172914-1-dja@axtens.net
Wang Wensheng [Fri, 18 Sep 2020 08:59:51 +0000 (08:59 +0000)]
powerpc/papr_scm: Fix warnings about undeclared variable
Build the kernel with 'make C=2':
arch/powerpc/platforms/pseries/papr_scm.c:825:1: warning: symbol
'dev_attr_perf_stats' was not declared. Should it be static?
Nicholas Piggin [Tue, 15 Sep 2020 11:46:46 +0000 (21:46 +1000)]
powerpc/64: fix irq replay pt_regs->softe value
Replayed interrupts get an "artificial" struct pt_regs constructed to
pass to interrupt handler functions. This did not get the softe field
set correctly, it's as though the interrupt has hit while irqs are
disabled. It should be IRQS_ENABLED.
This is possibly harmless, asynchronous handlers should not be testing
if irqs were disabled, but it might be possible for example some code
is shared with synchronous or NMI handlers, and it makes more sense if
debug output looks at this.
Nicholas Piggin [Tue, 15 Sep 2020 11:46:45 +0000 (21:46 +1000)]
powerpc/64: fix irq replay missing preempt
Prior to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt
replay in C"), replayed interrupts returned by the regular interrupt
exit code, which performs preemption in case an interrupt had set
need_resched.
This logic was missed by the conversion. Adding preempt_disable/enable
around the interrupt replay and final irq enable will reschedule if
needed.
Nicholas Piggin [Wed, 16 Sep 2020 03:02:34 +0000 (13:02 +1000)]
powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address
The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.
TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
cp_abort to the tlbiel sequence.
powerpc/powernv/elog: Fix race while processing OPAL error log event.
Every error log reported by OPAL is exported to userspace through a
sysfs interface and notified using kobject_uevent(). The userspace
daemon (opal_errd) then reads the error log and acknowledges the error
log is saved safely to disk. Once acknowledged the kernel removes the
respective sysfs file entry causing respective resources to be
released including kobject.
However it's possible the userspace daemon may already be scanning
elog entries when a new sysfs elog entry is created by the kernel.
User daemon may read this new entry and ack it even before kernel can
notify userspace about it through kobject_uevent() call. If that
happens then we have a potential race between
elog_ack_store->kobject_put() and kobject_uevent which can lead to
use-after-free of a kernfs object resulting in a kernel crash. eg:
BUG: Unable to handle kernel data access on read at 0x6b6b6b6b6b6b6bfb
Faulting instruction address: 0xc0000000008ff2a0
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
CPU: 27 PID: 805 Comm: irq/29-opal-elo Not tainted 5.9.0-rc2-gcc-8.2.0-00214-g6f56a67bcbb5-dirty #363
...
NIP kobject_uevent_env+0xa0/0x910
LR elog_event+0x1f4/0x2d0
Call Trace:
0x5deadbeef0000122 (unreliable)
elog_event+0x1f4/0x2d0
irq_thread_fn+0x4c/0xc0
irq_thread+0x1c0/0x2b0
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x6c
This patch fixes this race by protecting the sysfs file
creation/notification by holding a reference count on kobject until we
safely send kobject_uevent().
The function create_elog_obj() returns the elog object which if used
by caller function will end up in use-after-free problem again.
However, the return value of create_elog_obj() function isn't being
used today and there is no need as well. Hence change it to return
void to make this fix complete.
Fixes: 774fea1a38c6 ("powerpc/powernv: Read OPAL error log and export it through sysfs") Cc: stable@vger.kernel.org # v3.15+ Reported-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
[mpe: Rework the logic to use a single return, reword comments, add oops] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201006122051.190176-1-mpe@ellerman.id.au
CC arch/powerpc/kernel/traps.o
../arch/powerpc/kernel/traps.c:1663:6: error: no previous prototype for ‘stack_overflow_exception’ [-Werror=missing-prototypes]
void stack_overflow_exception(struct pt_regs *regs)
^~~~~~~~~~~~~~~~~~~~~~~~
Fixes: 3978eb78517c ("powerpc/32: Add early stack overflow detection with VMAP stack.") Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914211007.2285999-8-clg@kaod.org
powerpc/sstep: Remove empty if statement checking for invalid form
The check should be performed by the caller. This fixes a compile
error with W=1.
../arch/powerpc/lib/sstep.c: In function ‘mlsd_8lsd_ea’:
../arch/powerpc/lib/sstep.c:225:3: error: suggest braces around empty body in an ‘if’ statement [-Werror=empty-body]
; /* Invalid form. Should already be checked for by caller! */
^
powerpc/sysfs: Remove unused 'err' variable in sysfs_create_dscr_default()
This fixes a compile error with W=1.
arch/powerpc/kernel/sysfs.c: In function ‘sysfs_create_dscr_default’:
arch/powerpc/kernel/sysfs.c:228:7: error: variable ‘err’ set but not used [-Werror=unused-but-set-variable]
int err = 0;
^~~
cc1: all warnings being treated as errors
Michael Ellerman [Fri, 21 Aug 2020 10:34:07 +0000 (20:34 +1000)]
powerpc/prom_init: Check display props exist before enabling btext
It's possible to enable CONFIG_PPC_EARLY_DEBUG_BOOTX for a pseries
kernel (maybe it shouldn't be), which is then booted with qemu/slof.
But if you do that the kernel crashes in draw_byte(), with a DAR
pointing somewhere near INT_MAX.
Adding some debug to prom_init we see that we're not able to read the
"address" property from OF, so we're just using whatever junk value
was on the stack.
So check the properties can be read properly from OF, if not we bail
out before initialising btext, which avoids the crash.
Michael Ellerman [Wed, 19 Aug 2020 01:56:34 +0000 (11:56 +1000)]
powerpc/smp: Move ppc_md.cpu_die() to smp_ops.cpu_offline_self()
We have smp_ops->cpu_die() and ppc_md.cpu_die(). One of them offlines
the current CPU and one offlines another CPU, can you guess which is
which? Also one is in smp_ops and one is in ppc_md?
So rename ppc_md.cpu_die(), to cpu_offline_self(), because that's what
it does. And move it into smp_ops where it belongs.
Michael Ellerman [Wed, 19 Aug 2020 01:56:32 +0000 (11:56 +1000)]
powerpc: Move arch_cpu_idle_dead() into smp.c
arch_cpu_idle_dead() is in idle.c, which makes sense, but it's inside
a CONFIG_HOTPLUG_CPU block.
It would be more at home in smp.c, inside the existing
CONFIG_HOTPLUG_CPU block. Note that CONFIG_HOTPLUG_CPU depends on
CONFIG_SMP so even though smp.c is not built for SMP=n builds, that's
fine.
Michael Ellerman [Wed, 16 Sep 2020 11:56:37 +0000 (21:56 +1000)]
powerpc/perf: Add declarations to fix sparse warnings
Sparse warns about all the init functions:
symbol init_ppc970_pmu was not declared. Should it be static?
symbol init_power5p_pmu was not declared. Should it be static?
symbol init_power5_pmu was not declared. Should it be static?
symbol init_power6_pmu was not declared. Should it be static?
symbol init_power7_pmu was not declared. Should it be static?
symbol init_power9_pmu was not declared. Should it be static?
symbol init_power8_pmu was not declared. Should it be static?
symbol init_generic_compat_pmu was not declared. Should it be static?
They're already declared in internal.h, so just make sure all the C
files include that directly or indirectly.
Yang Yingliang [Thu, 17 Sep 2020 02:06:43 +0000 (10:06 +0800)]
powerpc/book3s64: fix link error with CONFIG_PPC_RADIX_MMU=n
Fix link error when CONFIG_PPC_RADIX_MMU is disabled:
powerpc64-linux-gnu-ld: arch/powerpc/platforms/pseries/lpar.o:(.toc+0x0): undefined reference to `mmu_pid_bits'
Michael Ellerman [Thu, 17 Sep 2020 02:20:16 +0000 (12:20 +1000)]
powerpc/process: Fix uninitialised variable error
Clang, and GCC with -Wmaybe-uninitialized, can't see that val is
unused in get_fpexec_mode():
arch/powerpc/kernel/process.c:1940:7: error: variable 'val' is used
uninitialized whenever 'if' condition is true
if (cpu_has_feature(CPU_FTR_SPE)) {
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
We know that CPU_FTR_SPE will only be true iff CONFIG_SPE is also
true, but the compiler doesn't.
Avoid it by initialising val to zero.
Reported-by: kernel test robot <lkp@intel.com> Fixes: 532ed1900d37 ("powerpc/process: Remove useless #ifdef CONFIG_SPE") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20200917024509.3253837-1-mpe@ellerman.id.au
Michael Ellerman [Wed, 16 Sep 2020 12:21:05 +0000 (22:21 +1000)]
Merge coregroup support into next
From Srikar's cover letter, with some reformatting:
Cleanup of existing powerpc topologies and add coregroup support on
powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.
Summary of some of the testing done with coregroup patchset.
It includes ebizzy, schbench, perf bench sched pipe and topology
verification. On the left side are results from powerpc/next tree and
on the right are the results with the patchset applied. Topological
verification clearly shows that there is no change in topology with
and without the patches on all the 3 class of systems that were
tested.
Power 9 PowerNV (2 Node/ 160 Cpu System)
----------------------------------------
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
perf bench sched pipe (lesser time and higher ops/sec is better)
Running 'sched/pipe' benchmark: Running 'sched/pipe' benchmark:
Executed 1000000 pipe operations between two processes Executed 1000000 pipe operations between two processes
Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and
Power 9 (2 node 128 Cpu LPAR) and they showed the topology before and
after the patch to be identical. If Interested, I could provide the
same.
On Power 9 (with device-tree enablement to show coregroups):
Lookup the coregroup id from the associativity array.
If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.
Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.
If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.
Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain. Do note this patch is only creating stubs
for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
would be in a subsequent patch.
powerpc/smp: Allocate cpumask only after searching thread group
If allocated earlier and the search fails, then cpu_l1_cache_map cpumask
is unnecessarily cleared. However cpu_l1_cache_map can be allocated /
cleared after we search thread group.
Please note CONFIG_CPUMASK_OFFSTACK is not set on Powerpc. Hence cpumask
allocated by zalloc_cpumask_var_node is never freed.
Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups, i.e one sub-group
per DIE.
Currently there are no firmwares that are exposing this grouping. Hence
allow the basis for grouping to be abstract. Once the firmware starts
using this grouping, code would be added to detect the type of grouping
and adjust the sd domain flags accordingly.
In start_secondary, even if shared_cache was already set, system does a
redundant match for cpumask. This redundant check can be removed by
checking if shared_cache is already set.
While here, localize the sibling_mask variable to within the if
condition.
powerpc/smp: Merge Power9 topology with Power topology
A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.
Currently Linux kernel with CONFIG_NUMA on a system with multiple
possible nodes, marks node 0 as online at boot. However in practice,
there are systems which have node 0 as memoryless and cpuless.
This can cause numa_balancing to be enabled on systems with only one node
with memory and CPUs. The existence of this dummy node which is cpuless and
memoryless node can confuse users/scripts looking at output of lscpu /
numactl.
By marking, node 0 as offline, lets stop assuming that node 0 is
always online. If node 0 has CPU or memory that are online, node 0 will
again be set as online.
Note: On Powerpc, cpu_to_node of possible but not present cpus would
previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
queried from vphn"). Without the 2 commits, Powerpc system might crash.
1. User space applications like Numactl, lscpu, that parse the sysfs tend to
believe there is an extra online node. This tends to confuse users and
applications. Other user space applications start believing that system was
not able to use all the resources (i.e missing resources) or the system was
not setup correctly.
2. Also existence of dummy node also leads to inconsistent information. The
number of online nodes is inconsistent with the information in the
device-tree and resource-dump
3. When the dummy node is present, single node non-Numa systems end up showing
up as NUMA systems and numa_balancing gets enabled. This will mean we take
the hit from the unnecessary numa hinting faults.
Node id queried from the static device tree may not
be correct. For example: it may always show 0 on a shared processor.
Hence prefer the node id queried from vphn and fallback on the device tree
based node id if vphn query fails.
A Powerpc system with multiple possible nodes and with CONFIG_NUMA
enabled always used to have a node 0, even if node 0 does not any cpus
or memory attached to it. As per PAPR, node affinity of a cpu is only
available once its present / online. For all cpus that are possible but
not present, cpu_to_node() would point to node 0.
To ensure a cpuless, memoryless dummy node is not online, powerpc need
to make sure all possible but not present cpu_to_node are set to a
proper node.
powerpc/numa: Restrict possible nodes based on platform
As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
Abstraction Services (RTAS) Node" available at:
https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
... there are 2 device tree properties:
"ibm,max-associativity-domains"
which defines the maximum number of domains that the firmware i.e
PowerVM can support.
and:
"ibm,current-associativity-domains"
which defines the maximum number of domains that the current
platform can support.
The value of "ibm,max-associativity-domains" is always greater than or
equal to "ibm,current-associativity-domains" property. If the latter
property is not available, use "ibm,max-associativity-domain" as a
fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
is mentioned in page 833 / B.5.3 which is covered under under
"Appendix B. System Binding" section
Currently powerpc uses the "ibm,max-associativity-domains" property
while setting the possible number of nodes. This is currently set at
32. However the possible number of nodes for a platform may be
significantly less. Hence set the possible number of nodes based on
"ibm,current-associativity-domains" property.
Nathan Lynch had raised a valid concern that post LPM (Live Partition
Migration), a user could DLPAR add processors and memory after LPM
with "new" associativity properties:
https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
He also pointed out that "ibm,max-associativity-domains" has the same
contents on all currently available PowerVM systems, unlike
"ibm,current-associativity-domains" and hence may be better able to
handle the new NUMA associativity properties.
However with the recent commit dbce45628085 ("powerpc/numa: Limit
possible nodes to within num_possible_nodes"), all new NUMA
associativity properties are capped to initially set nr_node_ids.
Hence this commit should be safe with any new DLPAR add post LPM.
Note the maximum nodes this platform can support is only 2 but the
possible nodes is set to 32.
This is important because lot of kernel and user space code allocate
structures for all possible nodes leading to a lot of memory that is
allocated but not used.
I ran a simple experiment to create and destroy 100 memory cgroups on
boot on a 8 node machine (Power8 Alpine).
On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8
core for backward compatibility reasons, with the fusion of two SMT4 cores.
Powerpc allows LPARs to be live migrated from Power8 to Power9. Existing
software developed/configured for Power8, expects to see a SMT8 core.
In order to maintain userspace backward compatibility (with Power8 chips in
case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
has to be set to SMT8 core.
cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
Hence override the default cpu_smt_mask() to be powerpc specific
allowing for better scheduling behaviour on Power.
perf bench sched pipe (cede disabled)
usec/ops : lesser is better
Without patch
N Min Max Median Avg Stddev
101 7.884227 12.576538 7.956474 8.0170722 0.46159054
Observations: With the patch, the initial run/iteration takes a slight
longer time. This can be attributed to the fact that now we pick a CPU
from a idle core which could be sleep mode. Once we remove the cede,
state the numbers improve in favour of the patch.
ebizzy:
transactions per second (higher is better)
without patch
N Min Max Median Avg Stddev
100 1018433130447011932081182315.7 60018.733
sched/topology: Allow archs to override cpu_smt_mask
cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
most architectures. One of the users of cpu_smt_mask(), would be to
identify idle-cores. On Power9, a pair of SMT4 cores can be presented
by the firmware as a SMT8 core for backward compatibility reasons.
powerpc allows LPARs to be live migrated from Power8 to Power9. Do
note Power8 had only SMT8 cores. Existing software which has been
developed/configured for Power8 would expect to see SMT8 core.
Maintaining the illusion of SMT8 core is a requirement to make that
work.
In order to maintain above userspace backward compatibility with
previous versions of processor, Power9 onwards there is option to the
firmware to advertise a pair of SMT4 cores as a fused cores aka SMT8
core. On Power9 this pair shares the L2 cache as well. However, from
the scheduler's point of view, a core should be determined by SMT4,
since its a completely independent unit of compute. Hence allow
powerpc architecture to override the default cpu_smt_mask() to point
to the SMT4 cores in a SMT8 mode.
This will ensure the scheduler is always given the right information.
powerpc/papr_scm: Fix warning triggered by perf_stats_show()
A warning is reported by the kernel in case perf_stats_show() returns
an error code. The warning is of the form below:
papr_scm ibm,persistent-memory:ibm,pmemory@44100001:
Failed to query performance stats, Err:-10
dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
On investigation it looks like that the compiler is silently
truncating the return value of drc_pmem_query_stats() from 'long' to
'int', since the variable used to store the return code 'rc' is an
'int'. This truncated value is then returned back as a 'ssize_t' back
from perf_stats_show() to 'dev_attr_show()' which thinks of it as a
large unsigned number and triggers this warning..
To fix this we update the type of variable 'rc' from 'int' to
'ssize_t' that prevents the compiler from truncating the return value
of drc_pmem_query_stats() and returning correct signed value back from
perf_stats_show().
Fixes: 2d02bf835e57 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP") Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200912081451.66225-1-vaibhav@linux.ibm.com
Nicholas Piggin [Mon, 14 Sep 2020 04:52:19 +0000 (14:52 +1000)]
powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.
That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.
Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
only if it hasn't been switched-to by the time the IPI is processed.
This relies on commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
switch sequences.
Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Depends-on: 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB invalidate") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com
Nicholas Piggin [Mon, 14 Sep 2020 04:52:18 +0000 (14:52 +1000)]
sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.
The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to aoperate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).
io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.
The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.
arch/sparc/kernel/smp_64.c:
if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}
vs fs/io_uring.c
if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
!mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);
mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.
I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.
The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.
Nicholas Piggin [Mon, 14 Sep 2020 04:52:17 +0000 (14:52 +1000)]
powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.
Nicholas Piggin [Mon, 14 Sep 2020 04:52:16 +0000 (14:52 +1000)]
mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.
This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).
Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:
If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.
Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().
So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.
This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.
powerpc/pci: unmap legacy INTx interrupts when a PHB is removed
When a passthrough IO adapter is removed from a pseries machine using
hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
guest OS to clear all page table entries related to the adapter. If
some are still present, the RTAS call which isolates the PCI slot
returns error 9001 "valid outstanding translations" and the removal of
the IO adapter fails. This is because when the PHBs are scanned, Linux
maps automatically the INTx interrupts in the Linux interrupt number
space but these are never removed.
To solve this problem, we introduce a PPC platform specific
pcibios_remove_bus() routine which clears all interrupt mappings when
the bus is removed. This also clears the associated page table entries
of the ESB pages when using XIVE.
For this purpose, we record the logical interrupt numbers of the
mapped interrupt under the PHB structure and let pcibios_remove_bus()
do the clean up.
Since some PCI adapters, like GPUs, use the "interrupt-map" property
to describe interrupt mappings other than the legacy INTx interrupts,
we can not restrict the size of the mapping array to PCI_NUM_INTX. The
number of interrupt mappings is computed from the "interrupt-map"
property and the mapping array is allocated accordingly.
Nicholas Piggin [Wed, 19 Aug 2020 09:47:00 +0000 (19:47 +1000)]
powerpc/powernv/idle: add a basic stop 0-3 driver for POWER10
This driver does not restore stop > 3 state, so it limits itself
to states which do not lose full state or TB.
The POWER10 SPRs are sufficiently different from P9 that it seems
easier to split out the P10 code. The POWER10 deep sleep code
(e.g., the BHRB restore) has been taken out, but it can be re-added
when stop > 3 support is added.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Pratik Rajesh Sampat<psampat@linux.ibm.com> Tested-by: Vaidyanathan Srinivasan <svaidy@linux.ibm.com> Reviewed-by: Pratik Rajesh Sampat<psampat@linux.ibm.com> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200819094700.493399-1-npiggin@gmail.com
powerepc/book3s64/hash: Align start/end address correctly with bolt mapping
This ensures we don't do a partial mapping of memory. With nvdimm, when
creating namespaces with size not aligned to 16MB, the kernel ends up partially
mapping the pages. This can result in kernel adding multiple hash page table
entries for the same range. A new namespace will result in
create_section_mapping() with start and end overlapping an already existing
bolted hash page table entry.
commit: 6acd7d5ef264 ("libnvdimm/namespace: Enforce memremap_compat_align()")
made sure that we always create namespaces aligned to 16MB. But we can do
better by avoiding mapping pages that are not aligned. This helps to catch
access to these partially mapped pages early.
Jason Yan [Fri, 11 Sep 2020 02:01:21 +0000 (10:01 +0800)]
powerpc/ps3: make two symbols static
This addresses the following sparse warning:
arch/powerpc/platforms/ps3/spu.c:451:33: warning: symbol
'spu_management_ps3_ops' was not declared. Should it be static?
arch/powerpc/platforms/ps3/spu.c:592:28: warning: symbol
'spu_priv1_ps3_ops' was not declared. Should it be static?
Before the commit identified below, pages tables allocation was
performed after the allocation of final shadow area for linear memory.
But that commit switched the order, leading to page tables being
already allocated at the time 8xx kasan_init_shadow_8M() is called.
Due to this, kasan_init_shadow_8M() doesn't map the needed
shadow entries because there are already page tables.
kasan_init_shadow_8M() installs huge PMD entries instead of page
tables. We could at that time free the page tables, but there is no
point in creating page tables that get freed before being used.
Only book3s/32 hash needs early allocation of page tables. For other
variants, we can keep the initial order and create remaining page
tables after the allocation of final shadow memory for linear mem.
Move back the allocation of shadow page tables for
CONFIG_KASAN_VMALLOC into kasan_init() after the loop which creates
final shadow memory for linear mem.
powerpc/32: Fix vmap stack - Properly set r1 before activating MMU
We need r1 to be properly set before activating MMU, otherwise any new
exception taken while saving registers into the stack in exception
prologs will use the user stack, which is wrong and will even lockup
or crash when KUAP is selected.
Do that by switching the meaning of r11 and r1 until we have saved r1
to the stack: copy r1 into r11 and setup the new stack pointer in r1.
To avoid complicating and impacting all generic and specific prolog
code (and more), copy back r1 into r11 once r11 is save onto
the stack.
We could get rid of copying r1 back and forth at the cost of
rewriting everything to use r1 instead of r11 all the way when
CONFIG_VMAP_STACK is set, but the effort is probably not worth it.
powerpc/32: Fix vmap stack - Do not activate MMU before reading task struct
We need r1 to be properly set before activating MMU, so
reading task_struct->stack must be done with MMU off.
This means we need an additional register to play with MSR
bits while r11 now points to the stack. For that, move r10
back to CR (As is already done for hash MMU) and use r10.
We still don't have r1 correct yet when we activate MMU.
It is done in following patch.
powerpc/uaccess: Switch __put_user_size_allowed() to __put_user_asm_goto()
__put_user_asm_goto() provides more flexibility to GCC and avoids using
a local variable to tell if the write succeeded or not.
GCC can then avoid implementing a cmp in the fast path.
See the difference for a small function like the PPC64 version of
save_general_regs() in arch/powerpc/kernel/signal_32.c:
Christophe Leroy [Mon, 31 Aug 2020 08:30:44 +0000 (08:30 +0000)]
powerpc/8xx: Support 16k hugepages with 4k pages
The 8xx has 4 page sizes: 4k, 16k, 512k and 8M
4k and 16k can be selected at build time as standard page sizes,
and 512k and 8M are hugepages.
When 4k standard pages are selected, 16k pages are not available.
Allow 16k pages as hugepages when 4k pages are used.
To allow that, implement arch_make_huge_pte() which receives
the necessary arguments to allow setting the PTE in accordance
with the page size:
- 512 k pages must have _PAGE_HUGE and _PAGE_SPS. They are set
by pte_mkhuge(). arch_make_huge_pte() does nothing.
- 16 k pages must have only _PAGE_SPS. arch_make_huge_pte() clears
_PAGE_HUGE.
Christophe Leroy [Mon, 31 Aug 2020 08:30:43 +0000 (08:30 +0000)]
powerpc/8xx: Refactor calculation of number of entries per PTE in page tables
On 8xx, the number of entries occupied by a PTE in the page tables
depends on the size of the page. At the time being, this calculation
is done in two places: in pte_update() and in set_huge_pte_at()
Refactor this calculation into a helper called
number_of_cells_per_pte(). For the time being, the val param is
unused. It will be used by following patch.
Instead of opencoding is_hugepd(), use hugepd_ok() with a forward
declaration.
This fault is due to hugetlb_free_pgd_range() freeing page tables
that are also used by regular pages.
As explain in the comment at the beginning of
hugetlb_free_pgd_range(), the verification done in free_pgd_range()
on floor and ceiling is not done here, which means
hugetlb_free_pte_range() can free outside the expected range.
As the verification cannot be done in hugetlb_free_pgd_range(), it
must be done in hugetlb_free_pte_range().