git.proxmox.com Git - mirror_ubuntu-artful-kernel.git/log

UBUNTU: SAUCE: LSM stacking: add configs for LSM stacking

Add the config options for LSM stacking. Default to only apparmor
being in the stack. Users can experiment with stacking by specifying
the
     security=

kernel boot parameter.
  eg.
     security=apparmor,selinux

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: add Kconfig to set default display LSM

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: add /proc/<pid>/attr/display_lsm

Add /proc/<pid>/attr/display_lsm so that scripts can easily introspect
the display lsm of a give task.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: make sure LSM blob align on 64 bit boundaries

The security->task blob reserving the first 12 bytes means that LSM
blobs don't align on 64 byte boundaries. This is not a problem
for x86 but if an LSM stores a long or ptr in its blob, then some
architectures require it be aligned to the arch word size and
will through a fault.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: provide a way to specify the default display lsm

When the system boots up the desired default display LSM maybe different
than the first LSM initialized. Allow it to be set by specifying an
LSM with
security.display=apparmor

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: verify display LSM

Make sure the display LSM is verified to be a registered LSM, to
avoid breakage when a bad name is passed.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: keep an index for each registered LSM

Keep an index of the registered LSMs so that it can be used in table
lookups and ordering comparisons.

pulled from the full LSM stacking patch

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: inherit current display LSM

If a current display LSM is set it should be inherited. As per 2017
LSS discussion.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: provide prctl interface for setting context

Separate out the prctl interface added in the full LSM stacking patches
to allow tasks to set the display "LSM".

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: allow selecting multiple LSMs using kernel boot params

The base stacking code does not provide a way for users to specify the
desired stack using the kernel boot parameters. Enable specifying an
LSM stack on the command line by providing a comma separated list

ie.
security=apparmor,selinux

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: fixup stacking kconfig

The stack configs in the base stacking patches are confusing and
separate the selinux/smack stacking from the other LSMs with an
"extreme" stacking entry which is extremely confusing.

Switch the "extreme" stacking to a select for mutually exclusive
LSMs, which provides a better explanation of what is happening.

Fixes: 6c5100029055 ("LSM: general but not extreme module stacking")
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: fixup apparmor stacking enablement

AppArmor doesn't need to register twice.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: add stacking support to apparmor network hooks

The stacking patches weren't developed against apparmor networking hooks.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: add support for stacking getpeersec_stream

getpeersec_stream needs to use the "current" display LSM set by the
prctl.

Split out the getpeersec_stream implementation from the full stacking
patch.

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: fixup: alloc_task_ctx is dead code

Fixes: 7b27ec622c90 ("LSM: manage credential security blobs)
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: fixup initialize task->security

Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: fixup procsfs: add smack subdir to attrs

The patch that adds the smack subdirs also changes how set_procattr
is handled. It makes the assumption that each LSM will attempt
to handle the context written in turn and return ENOENT, if
the LSM doesn't handle the context and another should try.

This is wrong in two ways, LSMs may return ENOENT as an error for
a context that is valid but is not present in currently loaded
policy, and the code is based on earlier versions where each
LSM is tried in turn.

Under the current patchset there is the concept of the current LSM
which will be used by applications using the old interfaces, to
ensure those interfaces which were not designed for multiple LSM
input/out only ever have to deal with a single LSM.

Fixes: 86400b03f812 ("procfs: add smack subdir to attrs")
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: Complete task_alloc hook

The Task alloc hook needs to allocate the data.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: general but not extreme module stacking

Leverage the infrastructure management of the credential and
file security blobs to allow stacking of security modules in
all but the most extreme case. Security modules are informed
of the location of their data within the blobs at module
initialization.

Stacking is optional. If stacking is not configured the old
limit of one "major" security module applies. If stacking is
configured any combination that does not include both SELinux
and Smack is allowed.

A subdirectory has been added to /proc/.../attr for each of
SELinux and AppArmor (Smack introduced such a subdirectory earlier)
to disambiguate what data is provided in the proc/.../attr
interfaces. An entry "context" is added to /proc/.../attr and
to each of the subdirectories. The "context" entry provides
process attribute information in the form:

lsm-name='lsm-data'[,lsm-name='lsm-data']...

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: Infrastructure management of the remaining blobs

Move management of the inode, ipc, key, msg_msg, sock and superblock
security blobs from the security modules to the infrastructure.
Use of the blob pointers is abstracted in the security modules.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: manage task security blobs

Move management of task security blobs into the security
infrastructure. Modules are required to identify the space
they require. At this time there are no modules that use
task blobs.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: Manage file security blobs

Move the management of file security blobs from the individual
security modules to the security infrastructure. The security modules
using file blobs have been updated accordingly. Modules are required
to identify the space they need at module initialization. In some
cases a module no longer needs to supply a blob management hook, in
which case the hook has been removed.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: LSM: manage credential security blobs

Move the management of credential security blobs from the
individual security modules to the security infrastructure.
The security modules using credential blobs have been updated
accordingly. Modules are required to identify the space they
require at module initialization. In some cases a module no
longer needs to supply blob management hook, in which case
the hook has been removed.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: SAUCE: LSM stacking: procfs: add smack subdir to attrs

Back in 2007 I made what turned out to be a rather serious
mistake in the implementation of the Smack security module.
The SELinux module used an interface in /proc to manipulate
the security context on processes. Rather than use a similar
interface, I used the same interface. The AppArmor team did
likewise. Now /proc/.../attr/current will tell you the
security "context" of the process, but it will be different
depending on the security module you're using.

This patch provides a subdirectory in /proc/.../attr for
Smack. Smack user space can use the "current" file in
this subdirectory and never have to worry about getting
SELinux attributes by mistake. Programs that use the
old interface will continue to work (or fail, as the case
may be) as before.

This patch does not include subdirectories for SELinux
or AppArmor. I do have a patch that provides those, and
will happily make it available should anyone see value
in it.

The original implementation is by Kees Cook.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

xhci: set missing SuperSpeedPlus Link Protocol bit in roothub descriptor

BugLink: http://bugs.launchpad.net/bugs/1720045
A SuperSpeedPlus roothub needs to have the Link Protocol (LP) bit set in
the bmSublinkSpeedAttr[] entry of a SuperSpeedPlus descriptor.

If the xhci controller has an optional Protocol Speed ID (PSI) table then
that will be used as a base to create the roothub SuperSpeedPlus
descriptor.
The PSI table does not however necessary contain the LP bit so we need
to set it manually.

Check the psi speed and set LP bit if speed is 10Gbps or higher.
We're not setting it for 5 to 10Gbps as USB 3.1 specification always
mention SuperSpeedPlus for 10Gbps or higher, and some SSIC USB 3.0 speeds
can be over 5Gbps, such as SSIC-G3B-L1 at 5830 Mbps

Cc: <stable@vger.kernel.org> # 4.6+
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7bea22b124d77845c85a62eaa29a85ba6cc2f899 linux-next)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: [Config] CONFIG_INTEL_RDT=y

BugLink: http://bugs.launchpad.net/bugs/1591609
CONFIG_INTEL_RDT_A changed name to CONFIG_INTEL_RDT, update our
configs accordingly.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Turn off most RDT features on Skylake

BugLink: http://bugs.launchpad.net/bugs/1713619
Errata list is included in this document:
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/6th-gen-x-series-spec-update.pdf
with more details in:
https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

But the tl;dr summary (using tags from first of the documents) is:
SKZ4 MBM does not accurately track write bandwidth
SKZ17 CMT counters may not count accurately
SKZ18 CAT may not restrict cacheline allocation under certain conditions
SKZ19 MBM counters may undercount

Disable all these features on Skylake models. Users who understand the
errata may re-enable using boot command line options.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/3aea0a3bae219062c812668bd9b7b8f1a25003ba.1503512900.git.tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d56593eb5eda8f593db92927059697bbf89bc4b3)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Add command line options for resource director technology

BugLink: http://bugs.launchpad.net/bugs/1713619
Command line options allow us to ignore features that we don't want.
Also we can re-enable options that have been disabled on a platform
(so long as the underlying h/w actually supports the option).

[ tglx: Marked the option array __initdata and the helper function __init ]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/0c37b0d4dbc30977a3c1cee08b66420f83662694.1503512900.git.tony.luck@intel.com
(cherry picked from commit 1d9807fc64c131a83a96917f2b2da1c9b00cf127)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Move special case code for Haswell to a quirk function

BugLink: http://bugs.launchpad.net/bugs/1713619
No functional change, but lay the ground work for other per-model
quirks.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fenghua" <fenghua.yu@intel.com>
Cc: Ravi V" <ravi.v.shankar@intel.com>
Cc: "Peter Zijlstra" <peterz@infradead.org>
Cc: "Stephane Eranian" <eranian@google.com>
Cc: "Andi Kleen" <ak@linux.intel.com>
Cc: "David Carrillo-Cisneros" <davidcc@google.com>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: http://lkml.kernel.org/r/f195a83751b5f8b1d8a78bd3c1914300c8fa3142.1503512900.git.tony.luck@intel.com
(cherry picked from commit 0576113a387e0c8a5d9e24b4cd62605d1c9c0db8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Remove redundant ternary operator on return

BugLink: http://bugs.launchpad.net/bugs/1591609
The use of the ternary operator is redundant as ret can never be
non-zero at that point. Instead, just return nbytes.

Detected by CoverityScan, CID#1452658 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20170808092859.13021-1-colin.king@canonical.com
(cherry picked from commit 5707b46a4206be2068444eb6b514a1ee070651c8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Improve limbo list processing

BugLink: http://bugs.launchpad.net/bugs/1591609
During a mkdir, the entire limbo list is synchronously checked on each
package for free RMIDs by sending IPIs. With a large number of RMIDs (SKL
has 192) this creates a intolerable amount of work in IPIs.

Replace the IPI based checking of the limbo list with asynchronous worker
threads on each package which periodically scan the limbo list and move the
RMIDs that have:

llc_occupancy < threshold_occupancy

on all packages to the free list.

mkdir now returns -ENOSPC if the free list and the limbo list ere empty or
returns -EBUSY if there are RMIDs on the limbo list and the free list is
empty.

Getting rid of the IPIs also simplifies the data structures and the
serialization required for handling the lists.

[ tglx: Rewrote changelog ... ]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1502845243-20454-3-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 24247aeeabe99eab13b798ccccc2dec066dd6f07)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug

BugLink: http://bugs.launchpad.net/bugs/1591609
When a CPU is dying, the overflow worker is canceled and rescheduled on a
different CPU in the same domain. But if the timer is already about to
expire this essentially doubles the interval which might result in a non
detected overflow.

Cancel the overflow worker and reschedule it immediately on a different CPU
in same domain. The work could be flushed as well, but that would
reschedule it on the same CPU.

[ tglx: Rewrote changelog once again ]

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1502845243-20454-2-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit bbc4615e0b7df5e21d0991adb4b2798508354924)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Modify the intel_pqr_state for better performance

BugLink: http://bugs.launchpad.net/bugs/1591609
Currently we have pqr_state and rdt_default_state which store the cached
CLOSID/RMIDs and the user configured cpu default values respectively. We
touch both of these during context switch. Put all of them in one
structure so that we can spare a cache line.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: sai.praneeth.prakhya@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1502304395-7166-3-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit a9110b552d44fedbd1221eb0e5bde81da32d9350)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Clear the default RMID during hotcpu

BugLink: http://bugs.launchpad.net/bugs/1591609
The user configured per cpu default RMID is not cleared during cpu
hotplug. This may lead to incorrect RMID values after a cpu goes offline
and again comes back online. Clear the per cpu default RMID during cpu
offline and online handling.

Reported-by: Prakyha Sai Praneeth <sai.praneeth.prakhya@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1502304395-7166-2-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit eda61c265f3656be8345fdf0334b3a77829437fc)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Show bitmask of shareable resource with other executing units

BugLink: http://bugs.launchpad.net/bugs/1591609
CPUID.(EAX=0x10, ECX=res#):EBX[31:0] reports a bit mask for a resource.
Each set bit within the length of the CBM indicates the corresponding
unit of the resource allocation may be used by other entities in the
platform (e.g. an integrated graphics engine or hardware units outside
the processor core and have direct access to the resource). Each
cleared bit within the length of the CBM indicates the corresponding
allocation unit can be configured to implement a priority-based
allocation scheme without interference with other hardware agents in
the system. Bits outside the length of the CBM are reserved.

More details on the bit mask are described in x86 Software Developer's
Manual.

The bitmask is shown in "info" directory for each resource. It's
up to user to decide how to use the bitmask within a CBM in a partition
to share or isolate a resource with other executing units.

Suggested-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: vikas.shivappa@linux.intel.com
Link: http://lkml.kernel.org/r/20170725223904.12996-1-tony.luck@intel.com
(cherry picked from commit 0dd2d7494cd818d06a2ae1cd840cd62124a2d25e)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/mbm: Handle counter overflow

BugLink: http://bugs.launchpad.net/bugs/1591609
Set up a delayed work queue for each domain that will read all
the MBM counters of active RMIDs once per second to make sure
that they don't wrap around between reads from users.

[Tony: Added the initializations for the work structure and completed
the patch]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-29-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit e33026831bdb5f051499fec6a606f79fe1f94cc8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/mbm: Add mbm counter initialization

BugLink: http://bugs.launchpad.net/bugs/1591609
MBM counters are monotonically increasing counts representing the total
memory bytes at a particular time. In order to calculate total_bytes for
an rdtgroup, we store the value of the counter when we create an
rdtgroup or when a new domain comes online.

When the total_bytes(all memory controller bytes) or local_bytes(local
memory controller bytes) file in "mon_data" is read it shows the
total bytes for that rdtgroup since its creation. User can snapshot this
at different time intervals to obtain bytes/second.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-28-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit a4de1dfdd726537e2a78b55759fc646d9b0a0be8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/mbm: Basic counting of MBM events (total and local)

BugLink: http://bugs.launchpad.net/bugs/1591609
Check CPUID bits for whether each of the MBM events is supported.
Allocate space for each RMID for each counter in each domain to save
previous MSR counter value and running total of data.
Create files in each of the monitor directories.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-27-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 9f52425ba303d91c8370719e91d7e578bfdf309f)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add CPU hotplug support

BugLink: http://bugs.launchpad.net/bugs/1591609
Resource groups have a per domain directory under "mon_data". Add or
remove these directories as and when domains come online and go offline.
Also update the per cpu rmids and cache upon onlining and offlining
cpus.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-26-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 895c663ecef16c8138e20a7d5c052e0fcc400241)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add sched_in support

BugLink: http://bugs.launchpad.net/bugs/1591609
OS associates an RMID/CLOSid to a task by writing the per CPU
IA32_PQR_ASSOC MSR when a task is scheduled in.

The sched_in code will stay as no-op unless we are running on Intel SKU
which supports either resource control or monitoring and we also enable
them by mounting the resctrl fs. The per cpu CLOSid/RMID values are
cached and the write is performed only when a task with a different
CLOSid/RMID is scheduled in.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-25-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 748b6b881ccdda8f0663c68605f431279e06f49a)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Introduce rdt_enable_key for scheduling

BugLink: http://bugs.launchpad.net/bugs/1591609
Introduce the usage of rdt_enable_key in sched_in code as a preparation
to add RDT monitoring support for sched_in.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-24-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 4be6c078428b08d1a948cc09faca8f1326231866)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add mount,umount support

BugLink: http://bugs.launchpad.net/bugs/1591609
Add monitoring support during mount and unmount. Since root directory is
a "ctrl_mon" directory which can control and monitor resources create
the "mon_groups" directory which can hold monitor groups and a
"mon_data" directory which would hold all monitoring data like the rest
of resource groups.

The mount succeeds if either of monitoring or control/allocation is
enabled. If only monitoring is enabled user can still create monitor
groups under the "/sys/fs/resctrl/mon_groups/" and any mkdir under root
would fail. If only control/allocation is enabled all of the monitoring
related directories/files would not exist and resctrl would work in
legacy mode.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-23-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 4af4a88e0c9246990f95c88eeba781265f27c58e)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add rmdir support

BugLink: http://bugs.launchpad.net/bugs/1591609
Resource groups (ctrl_mon and monitor groups) are represented by
directories in resctrl fs. Add support to remove the directories.

When a ctrl_mon directory is removed all the cpus and tasks are assigned
back to the root rdtgroup. When a monitor group is removed the cpus and
tasks are returned to the parent control group.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-22-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit f3cbeacaa06e2b8c2f3ce8531e9aa3fe1f2762cd)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Separate the ctrl bits from rmdir

BugLink: http://bugs.launchpad.net/bugs/1591609
Re-factor the code to separate the ctrl group removal from the rmdir to
prepare to add RDT monitoring group removal.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-21-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit f9049547f7e72377049d717354b2f56f36a5854a)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add mon_data

BugLink: http://bugs.launchpad.net/bugs/1591609
Add a mon_data directory for the root rdtgroup and all other rdtgroups.
The directory holds all of the monitored data for all domains and events
of all resources being monitored.

The mon_data itself has a list of directories in the format
mon_<domain_name>_<domain_id>. Each of these subdirectories contain one
file per event in the mode "0444". Reading the file displays a snapshot
of the monitored data for the event the file represents.

For ex, on a 2 socket Broadwell with llc_occupancy being
monitored the mon_data contents look as below:

$ ls /sys/fs/resctrl/p1/mon_data/
mon_L3_00
mon_L3_01

Each domain directory has one file per event:
$ ls /sys/fs/resctrl/p1/mon_data/mon_L3_00/
llc_occupancy

To read current llc_occupancy of ctrl_mon group p1
$ cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
33789096

[This patch idea is based on Tony's sample patches to organise data in a
per domain directory and have one file per event (and use the fp->priv to
store mon data bits)]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-20-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit d89b7379015fc561060a4094676d143e6ed264e7)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Prepare for RDT monitor data support

BugLink: http://bugs.launchpad.net/bugs/1591609
Rename the intel_rdt_schemata file to intel_rdt_ctrlmondata as we now
want to add support for RDT monitoring data for the events that are
supported in later patches.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-19-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 90c403e83101c87ee9e6df8c8d30ea8628ff8bfc)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add cpus file support

BugLink: http://bugs.launchpad.net/bugs/1591609
The cpus file is extended to support resource monitoring. This is used
to over-ride the RMID of the default group when running on specific
CPUs. It works similar to the resource control. The "cpus" and
"cpus_list" file is present in default group, ctrl_mon groups and
monitor groups.

Each "cpus" file or cpu_list file reads a cpumask or list showing which
CPUs belong to the resource group. By default all online cpus belong to
the default root group. A CPU can be present in one "ctrl_mon" and one
"monitor" group simultaneously. They can be added to a resource group by
writing the CPU to the file. When a CPU is added to a ctrl_mon group it
is automatically removed from the previous ctrl_mon group. A CPU can be
added to a monitor group only if it is present in the parent ctrl_mon
group and when a CPU is added to a monitor group, it is automatically
removed from the previous monitor group. When CPUs go offline, they are
automatically removed from the ctrl_mon and monitor groups.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-18-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit a9fcf8627dc01049c390023bbb0323db3c785b91)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Prepare to add RDT monitor cpus file support

BugLink: http://bugs.launchpad.net/bugs/1591609
Separate the ctrl cpus file handling from the generic cpus file handling
and convert the per cpu closid from u32 to a struct which will be used
later to add rmid to the same struct. Also cleanup some name space.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-17-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit b09d981b3f346690dafa3e4ebedfcf3e44b68e83)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add tasks file support

BugLink: http://bugs.launchpad.net/bugs/1591609
The root directory, ctrl_mon and monitor groups are populated
with a read/write file named "tasks". When read, it shows all the task
IDs assigned to the resource group.

Tasks can be added to groups by writing the PID to the file. A task can
be present in one "ctrl_mon" group "and" one "monitor" group. IOW a
PID_x can be seen in a ctrl_mon group and a monitor group at the same
time. When a task is added to a ctrl_mon group, it is automatically
removed from the previous ctrl_mon group where it belonged. Similarly if
a task is moved to a monitor group it is removed from the previous
monitor group . Also since the monitor groups can only have subset of
tasks of parent ctrl_mon group, a task can be moved to a monitor group
only if its already present in the parent ctrl_mon group.

Task membership is indicated by a new field in the task_struct "u32
rmid" which holds the RMID for the task. RMID=0 is reserved for the
default root group where the tasks belong to at mount.

[tony: zero the rmid if rdtgroup was deleted when task was being moved]

Signed-off-by: Tony Luck <tony.luck@linux.intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-16-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit d6aaba615a482ce7d3ec218cf7b8d02d0d5753b8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Change closid type from int to u32

BugLink: http://bugs.launchpad.net/bugs/1591609
OS associates a CLOSid(Class of service id) to a task by writing the
high 32 bits of per CPU IA32_PQR_ASSOC MSR when a task is scheduled in.
CPUID.(EAX=10H, ECX=1):EDX[15:0] enumerates the max CLOSID supported and
it is zero indexed. Hence change the type to u32 from int.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-15-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 0734ded1abee9439b0c5d7b62af1ead78aab895b)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add mkdir support for RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Resource control groups can be created using mkdir in resctrl
fs(rdtgroup). In order to extend the resctrl interface to support
monitoring the control groups, extend the current mkdir to support
resource monitoring also.

This allows the rdtgroup created under the root directory to be able to
both control and monitor resources (ctrl_mon group). The ctrl_mon groups
are associated with one CLOSID like the legacy rdtgroups and one
RMID(Resource monitoring ID) as well. Hardware uses RMID to track the
resource usage. Once either of the CLOSID or RMID are exhausted, the
mkdir fails with -ENOSPC. If there are RMIDs in limbo list but not free
an -EBUSY is returned. User can also monitor a subset of the ctrl_mon
rdtgroup's tasks/cpus using the monitor groups. The monitor groups are
created using mkdir under the "mon_groups" directory in every ctrl_mon
group.

[Merged Tony's code: Removed a lot of common mkdir code, a fix to handling
of the list of the child rdtgroups and some cleanups in list
traversal. Also the changes to have similar alloc and free for CLOS/RMID
and return -EBUSY when RMIDs are in limbo and not free]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-14-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit c7d9aac6131148abe29ed1dc6bd73ad1213d1f56)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Prepare for RDT monitoring mkdir support

BugLink: http://bugs.launchpad.net/bugs/1591609
Separate the ctrl mkdir code from the rest in order to prepare for
adding support for RDT monitoring mkdir support as well.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-13-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 65b4f403057e32da73c36e33d403890173c4c324)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add info files for RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Add info directory files specific to RDT monitoring.

num_rmids:
    The number of RMIDs which are valid for the resource.

mon_features:
    Lists the monitoring events if monitoring is enabled for the
    resource.

max_threshold_occupancy:
    This is specific to llc_occupancy monitoring and is used to
    determine if an RMID can be reused. Provides an upper bound on the
    threshold and is shown to the user in bytes though the internal
    value will be rounded to the scaling factor supported by the h/w.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-12-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit d4ab33201029913b594ae785a9665f45040396ab)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Simplify info and base file lists

BugLink: http://bugs.launchpad.net/bugs/1591609
The info directory files and base files need to be different for each
resource like cache and Memory bandwidth. With in each resource, the
files would be further different for monitoring and ctrl. This leads to
a lot of different static array declarations given that we are adding
resctrl monitoring.

Simplify this to one common list of files and then declare a set of
flags to choose the files based on the resource, whether it is info or
base and if it is control type file. This is as a preparation to include
monitoring based info and base files.

No functional change.

[Vikas: Extended the flags to have few bits per category like resource,
info/base etc]

Signed-off-by: Tony luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-11-git-send-email-vikas.shivappa@linux.intel.com
(backported from commit 5dc1d5c6bac2cfe3420cf353dfb0ef2e543f7c10)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Conflicts:
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c

x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management

BugLink: http://bugs.launchpad.net/bugs/1591609
Hardware uses RMID(Resource monitoring ID) to keep track of each of the
RDT events associated with tasks. The number of RMIDs is dependent on
the SKU and is enumerated via CPUID. We add support to manage the RMIDs
which include managing the RMID allocation and reading LLC occupancy
for an RMID.

RMID allocation is managed by keeping a free list which is initialized
to all available RMIDs except for RMID 0 which is always reserved for
root group. RMIDs goto a limbo list once they are
freed since the RMIDs are still tagged to cache lines of the tasks which
were using them - thereby still having some occupancy. They continue to
be in limbo list until the occupancy < threshold_occupancy. The
threshold_occupancy is a user configurable value.
OS uses IA32_QM_CTR MSR to read the occupancy associated with an RMID
after programming the IA32_EVENTSEL MSR with the RMID.

[Tony: Improved limbo search]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-10-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit edf6fa1c4a951b3a03e94b63e6483c5d9da3ab11)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add RDT monitoring initialization

BugLink: http://bugs.launchpad.net/bugs/1591609
Add common data structures for RDT resource monitoring and perform RDT
monitoring related data structure initializations which include setting
up the RMID(Resource monitoring ID) lists and event list which the
resource supports.

[ tony: some cleanup to make adding MBM easier later, remove "cqm" from
some names, make some data structure local to intel_rdt_monitor.c
static. Add copyright header]

[ tglx: Made it readable ]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-9-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 6a445edce657810992594c7b9e679219aaf78ad9)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Make rdt_resources_all more readable

BugLink: http://bugs.launchpad.net/bugs/1591609
Change the format of the global rdt_resources_all. This holds all the
RDT resource structure initialization values. Make this more readable by
using the format:

rdt_resources_all[] = {
[<resource_index>] =
{...
}
...
}

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-8-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit dd131853f3fbc1c3aa051c34a2967c2f76309024)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Cleanup namespace to support RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Few of the data-structures have generic names although they are RDT
allocation specific. Rename them to be allocation specific to
accommodate RDT monitoring. E.g. s/enabled/alloc_enabled/

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-7-git-send-email-vikas.shivappa@linux.intel.com
(backported from commit 1b5c0b7583173b787b5c93ff89838a950d0e23ff)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Conflicts:
arch/x86/kernel/cpu/intel_rdt.c

x86/intel_rdt: Mark rdt_root and closid_alloc as static

BugLink: http://bugs.launchpad.net/bugs/1591609
Sparse reports that both of these can be static.

Make it so.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1501017287-28083-6-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit cb2200e967c65519ca6c5426644a49dca65f6294)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Change file names to accommodate RDT monitor code

BugLink: http://bugs.launchpad.net/bugs/1591609
Because the "perf cqm" and resctrl code were separately added and
indivdually configurable, there seem to be separate context switch code
and also things on global .h which are not really needed.

Move only the scheduling specific code and definitions to
<asm/intel_rdt_sched.h> and the put all the other declarations to a
local intel_rdt.h.

h/t to Reinette Chatre for pointing out that we should separate the
public interfaces used by other parts of the kernel from private
objects shared between the various files comprising RDT.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-5-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 0583020456cea9fcf43b84bb13a41eab059ae0a8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Introduce a common compile option for RDT

BugLink: http://bugs.launchpad.net/bugs/1591609
We currently have a CONFIG_RDT_A which is for RDT(Resource directory
technology) allocation based resctrl filesystem interface. As a
preparation to add support for RDT monitoring as well into the same
resctrl filesystem, change the config option to be CONFIG_RDT which
would include both RDT allocation and monitoring code.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit f01d7d51f577b5dc0fa5919ab8a9228e2bf49f3e)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Documentation for resctrl based RDT Monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Add a description of resctrl based RDT(resource director technology)
monitoring extension and its usage.

[Tony: Added descriptions for how monitoring and allocation are measured
and some cleanups]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-3-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 1640ae9471ae41eb18d2b214f1f40af3c4ed3828)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/perf/cqm: Wipe out perf based cqm

BugLink: http://bugs.launchpad.net/bugs/1591609
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support.  The hardware uses RMIDs to
track the llc occupancy of tasks and these RMIDs are per package. This
makes monitoring a hierarchy like cgroup along with monitoring of tasks
separately difficult and several patches sent to lkml to fix them were
NACKed. Further more, the following issues in the current perf cqm make
it almost unusable:

    1. No support to monitor the same group of tasks for which we do
    allocation using resctrl.

    2. It gives random and inaccurate data (mostly 0s) once we run out
    of RMIDs due to issues in Recycling.

    3. Recycling results in inaccuracy of data because we cannot
    guarantee that the RMID was stolen from a task when it was not
    pulling data into cache or even when it pulled the least data. Also
    for monitoring llc_occupancy, if we stop using an RMID_x and then
    start using an RMID_y after we reclaim an RMID from an other event,
    we miss accounting all the occupancy that was tagged to RMID_x at a
    later perf_count.

    2. Recycling code makes the monitoring code complex including
    scheduling because the event can lose RMID any time. Since MBM
    counters count bandwidth for a period of time by taking snap shot of
    total bytes at two different times, recycling complicates the way we
    count MBM in a hierarchy. Also we need a spin lock while we do the
    processing to account for MBM counter overflow. We also currently
    use a spin lock in scheduling to prevent the RMID from being taken
    away.

    4. Lack of support when we run different kind of event like task,
    system-wide and cgroup events together. Data mostly prints 0s. This
    is also because we can have only one RMID tied to a cpu as defined
    by the cqm hardware but a perf can at the same time tie multiple
    events during one sched_in.

    5. No support of monitoring a group of tasks. There is partial support
    for cgroup but it does not work once there is a hierarchy of cgroups
    or if we want to monitor a task in a cgroup and the cgroup itself.

    6. No support for monitoring tasks for the lifetime without perf
    overhead.

    7. It reported the aggregate cache occupancy or memory bandwidth over
    all sockets. But most cloud and VMM based use cases want to know the
    individual per-socket usage.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit c39a0e2c8850f08249383f2425dbd8dbe4baad69)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

KVM: VMX: Do not BUG() on out-of-bounds guest IRQ

The value of the guest_irq argument to vmx_update_pi_irte() is
ultimately coming from a KVM_IRQFD API call. Do not BUG() in
vmx_update_pi_irte() if the value is out-of bounds. (Especially,
since KVM as a whole seems to hang after that.)

Instead, print a message only once if we find that we don't have a
route for a certain IRQ (which can be out-of-bounds or within the
array).

This fixes CVE-2017-1000252.

Fixes: efc644048ecde54 ("KVM: x86: Update IRTE for posted-interrupts")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 3a8b0677fc6180a467e26cc32ce6b0c09a32f9bb)
CVE-2017-1000252
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add P9 NX support for 842 compression engine

BugLink: http://bugs.launchpad.net/bugs/1718292
This patch adds P9 NX support for 842 compression engine. Virtual
Accelerator Switchboard (VAS) is used to access 842 engine on P9.

For each NX engine per chip, setup receive window using
vas_rx_win_open() which configures RxFIFo with FIFO address, lpid,
pid and tid values. This unique (lpid, pid, tid) combination will
be used to identify the target engine.

For crypto open request, open send window on the NX engine for
the corresponding chip / cpu where the open request is executed.
This send window will be closed upon crypto close request.

NX provides high and normal priority FIFOs. For compression /
decompression requests, we use only hight priority FIFOs in kernel.

Each NX request will be communicated to VAS using copy/paste
instructions with vas_copy_crb() / vas_paste_crb() functions.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b0d6c9bab5e41d07f2bece1ef8c81cd2175b5f88)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add P9 NX specific error codes for 842 engine

BugLink: http://bugs.launchpad.net/bugs/1718292
This patch adds changes for checking P9 specific 842 engine
error codes. These errros are reported in coprocessor status
block (CSB) for failures.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 146e9f1b65478643f2729a97ccb8be60bb4492e5)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Use kzalloc for workmem allocation

BugLink: http://bugs.launchpad.net/bugs/1718292
Send window is opened / closed for each crypto session.
So initializes txwin in workmem.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit f05368336b3ae399f66cf511c52c6d69c7bc6b39)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add nx842_add_coprocs_list function

BugLink: http://bugs.launchpad.net/bugs/1718292
Updating coprocessor list is moved to nx842_add_coprocs_list().
This function will be used for both icswx and VAS functions.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit cd38a8a8a2ab95e43a031410db6a32f2d84e3fc0)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Create nx842_delete_coprocs function

BugLink: http://bugs.launchpad.net/bugs/1718292
Move deleting coprocessors info upon exit or failure to
nx842_delete_coprocs().

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 1ee51b28ee6ad7919da4fbe9672263dd274dcbfe)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Create nx842_configure_crb function

BugLink: http://bugs.launchpad.net/bugs/1718292
Configure CRB is moved to nx842_configure_crb() so that it can
be used for icswx and VAS exec functions. VAS function will be
added later with P9 support.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 56c10d5ea68b9a1425df0dedbe5d2710cc7d8bfa)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Rename nx842_powernv_function as icswx function

BugLink: http://bugs.launchpad.net/bugs/1718292
Rename nx842_powernv_function to nx842_powernv_exec.
nx842_powernv_exec points to nx842_exec_icswx and
will be point to VAS exec function which will be added later
for P9 NX support.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit c97f8169fb227cae5adeac56cafa980f25978031)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: [Config] CONFIG_PPC_VAS=y

BugLink: http://bugs.launchpad.net/bugs/1718293
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define copy/paste interfaces

BugLink: http://bugs.launchpad.net/bugs/1718293
Define interfaces (wrappers) to the 'copy' and 'paste'
instructions (which are new in PowerISA 3.0). These are intended to be
used to by NX driver(s) to submit Coprocessor Request Blocks (CRBs) to
the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 2392c8c8c0450293625dbef19ff5e206fb7b6749)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_tx_win_open()

BugLink: http://bugs.launchpad.net/bugs/1718293
Define an interface to open a VAS send window. This interface is
intended to be used the Nest Accelerator (NX) driver(s) to open
a send window and use it to submit compression/encryption requests
to a VAS receive window.

The receive window, identified by the [vasid, cop] parameters, must
already be open in VAS (i.e connected to an NX engine).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 5239af679a07427647b009ebb9c70b1a03ebca9b)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_win_close() interface

BugLink: http://bugs.launchpad.net/bugs/1718293
Define the vas_win_close() interface which should be used to close a
send or receive windows.

While the hardware configurations required to open send and receive
windows differ, the configuration to close a window is the same for
both. So we use a single interface to close the window.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 98271d4198699947d66d6f8a02c09bd27cb90022)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_rx_win_open() interface

BugLink: http://bugs.launchpad.net/bugs/1718293
Define the vas_rx_win_open() interface. This interface is intended to
be used by the Nest Accelerator (NX) driver(s) to setup receive
windows for one or more NX engines (which implement compression &
encryption algorithms in the hardware).

Follow-on patches will provide an interface to close the window and to
open a send window that kernel subsystems can use to access the NX
engines.

The interface to open a receive window is expected to be invoked for
each instance of VAS in the system.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 62c4eda4fabe89709ec43dcf1efe9fbea007a734)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to alloc/free windows

BugLink: http://bugs.launchpad.net/bugs/1718293
Define helpers to allocate/free VAS window objects. These will be used
in follow-on patches when opening/closing windows.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit bbfe59f8a7057f80f67a74e77fb4e941240e90b9)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to init window context

BugLink: http://bugs.launchpad.net/bugs/1718293
Define helpers to initialize window context registers of the VAS
hardware. These will be used in follow-on patches when opening/closing
VAS windows.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b25b33ac18b35775949ab227bb3075bb6cb11bc3)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to access MMIO regions

BugLink: http://bugs.launchpad.net/bugs/1718293
Define some helper functions to access the MMIO regions. We use these
in follow-on patches to read/write VAS hardware registers. They are
also used to later issue 'paste' instructions to submit requests to
the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 180fe15a8299c14f77347c5835c98c2446226ee6)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_init() and vas_exit()

BugLink: http://bugs.launchpad.net/bugs/1718293
Implement vas_init() and vas_exit() functions for a new VAS module.
This VAS module is essentially a library for other device drivers
and kernel users of the NX coprocessors like NX-842 and NX-GZIP.
In the future this will be extended to add support for user space
to access the NX coprocessors.

VAS is currently only supported with 64K page size.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 4dea2d1a927c61114a168d4509b56329ea6effb7)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Move GET_FIELD/SET_FIELD to vas.h

BugLink: http://bugs.launchpad.net/bugs/1718293
Move the GET_FIELD and SET_FIELD macros to vas.h as VAS and other
users of VAS, including NX-842 can use those macros.

There is a lot of related code between the VAS/NX kernel drivers
and skiboot. For consistency, switch the order of parameters in
SET_FIELD to match the order in skiboot.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b6622a339e8670a1025d4dd84be473c76dabed33)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define macros, register fields and structures

BugLink: http://bugs.launchpad.net/bugs/1718293
Define macros for the VAS hardware registers and bit-fields as well
as couple of data structures needed by the VAS driver.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[mpe: Fixup include guard to use _ASM_POWERPC_VAS_H]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 967689141eb37c4365eac0fac82d857773098475)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Enable PCI peer-to-peer

BugLink: http://bugs.launchpad.net/bugs/1718293
P9 has support for PCI peer-to-peer, enabling a device to write in the
MMIO space of another device directly, without interrupting the CPU.

This patch adds support for it on powernv, by adding a new API to be
called by drivers. The pnv_pci_set_p2p(...) call configures an
'initiator', i.e the device which will issue the MMIO operation, and a
'target', i.e. the device on the receiving side.

P9 really only supports MMIO stores for the time being but that's
expected to change in the future, so the API allows to define both
load and store operations.

  /* PCI p2p descriptor */
  #define OPAL_PCI_P2P_ENABLE           0x1
  #define OPAL_PCI_P2P_LOAD             0x2
  #define OPAL_PCI_P2P_STORE            0x4

  int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
                      u64 desc)

It uses a new OPAL call, as the configuration magic is done on the
PHBs by skiboot.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
[mpe: Drop unrelated OPAL calls, s/uint64_t/u64/, minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 2552910084a5e12e280caf082ab01468e187a064)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add support to set power-shifting-ratio

BugLink: http://bugs.launchpad.net/bugs/1718293
This patch adds support to set power-shifting-ratio which hints the
firmware how to distribute/throttle power between different entities
in a system (e.g CPU v/s GPU). This ratio is used by OCC for power
capping algorithm.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8e84b2d1f0f6a00b6476790f7bce6dcbffe91980)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add support for powercap framework

BugLink: http://bugs.launchpad.net/bugs/1718293
Adds a generic powercap framework to change the system powercap
inband through OPAL-OCC command/response interface.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit cb8b340de21e1c57e1c6d4f26ccc4af46a3ed559)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/perf: Add nest IMC PMU support

BugLink: http://bugs.launchpad.net/bugs/1718293
Add support to register Nest In-Memory Collection PMU counters.
Patch adds a new device file called "imc-pmu.c" under powerpc/perf
folder to contain all the device PMU functions.

Device tree parser code added to parse the PMU events information
and create sysfs event attributes for the PMU.

Cpumask attribute added along with Cpu hotplug online/offline functions
specific for nest PMU. A new state "CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE"
added for the cpu hotplug callbacks. Error handle path frees the memory
and unregisters the CPU hotplug callbacks.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 885dcd709ba9120b9935415b8b0f9d1b94e5826b)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Detect and create IMC device

BugLink: http://bugs.launchpad.net/bugs/1718293
Code to create platform device for the In-Memory Collection (IMC)
counters. Platform devices are created based on the IMC compatibility.
New header file created to contain the data structures and macros
needed for In-Memory Collection (IMC) counter pmu devices.

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes for these IMC
PMUs. Device probe() parses the device to locate three possible IMC
device types (Nest/Core/Thread). Function then branch to parse each
unit nodes to populate vital information such as device memory sizes,
event nodes information, base address for reserve memory access (if
any) and so on. Simple bare-minimum shutdown function added which only
"stops" the engines.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix build with CONFIG_PERF_EVENTS=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8f95faaac56c18b32d0e23ace55417a440abdb7e)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add IMC OPAL APIs

BugLink: http://bugs.launchpad.net/bugs/1718293
In-Memory Collection (IMC) counters are performance monitoring
infrastructure. These counters need special sequence of SCOMs to
init/start/stop which is handled by OPAL. And OPAL provides three APIs
to init and control these IMC engines.

OPAL API documentation:
https://github.com/open-power/skiboot/blob/master/doc/opal-api/opal-imc-counters.rst

Patch updates the kernel side powernv platform code to support the new
OPAL APIs

Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 28a5db0061014c8afbbb98560cf420c29bc4d8e1)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: d/s/m/insert-ubuntu-changes: use full version numbers

Ignore: yes

Make insert-ubuntu-changes to consider full version numbers when looping
through debian.master/changelog entries and comparing the version number
of each entry with the arguments passed to the script to decide which
entries should be included in the output changelog file.

Previously, only the last number in the version was used in this
comparison. For example, when comparing 4.4.0-50.51 and 4.4.0-83.84 only
the numbers 51 and 84 were actually used in the comparison. That however
might not work properly when the major version is bumped.

For instance, using "end" as 4.4.0-50.51 and "start" as 4.4.0-83.84 used
to work fine because 84 is greater than 51. However when using "end"
as 4.11.0-10.11 and "start" as 4.13.0-2.3, no entry was being selected
since 3 is not greater than 11.

Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

Linux 4.13.4

BugLink: http://bugs.launchpad.net/bugs/1720154
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

iwlwifi: add workaround to disable wide channels in 5GHz

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 01a9c948a09348950515bf2abb6113ed83e696d8 upstream.

The OTP in some SKUs have erroneously allowed 40MHz and 80MHz channels
in the 5.2GHz band. The firmware has been modified to not allow this
in those SKUs, so the driver needs to do the same otherwise the
firmware will assert when we try to use it.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream.

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix bch_hprint crash and improve output

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9276717b9e297a62d1151a43d1cd286213f68eb7 upstream.

Most importantly, solve a crash where %llu was used to format signed
numbers.  This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.

Always use the units mechanism rather than having different output
paths for simplicity.

Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100.  (Remainders of >= 1000 would print as .10).

Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown.  Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix for gc and write-back race

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.

gc and write-back get raced (see the email "bcache get stucked" I sended
before):
gc thread                               write-back thread
|                                       |bch_writeback_thread()
|bch_gc_thread()                        |
|                                       |==>read_dirty()
|==>bch_btree_gc()                      |
|==>btree_root() //get btree root       |
|                //node write locker    |
|==>bch_btree_gc_root()                 |
|                                       |==>read_dirty_submit()
|                                       |==>write_dirty()
|                                       |==>continue_at(cl,
|                                       |               write_dirty_finish,
|                                       |               system_wq);
|                                       |==>write_dirty_finish()//excute
|                                       |               //in system_wq
|                                       |==>bch_btree_insert()
|                                       |==>bch_btree_map_leaf_nodes()
|                                       |==>__bch_btree_map_nodes()
|                                       |==>btree_root //try to get btree
|                                       |              //root node read
|                                       |              //lock
|                                       |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
|                            //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to excute
|                               //journal_write in system_wq
|                               //but work queue is excuting
|                               //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
|                   //over and wake up gc,
|-------------stuck here
|==>release root node write locker

This patch alloc a separate work-queue for write-back thread to avoid such
race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix sequential large write IO bypass

BugLink: http://bugs.launchpad.net/bugs/1720154
commit c81ffa32a214c84b08900fbc9d432187bd948eba upstream.

Sequential write IOs were tested with bs=1M by FIO in writeback cache
mode, these IOs were expected to be bypassed, but actually they did not.
We debug the code, and find in check_should_bypass():
    if (!congested &&
        mode == CACHE_MODE_WRITEBACK &&
        op_is_write(bio_op(bio)) &&
        (bio-＞bi_opf & REQ_SYNC))
        goto rescale
that means, If in writeback mode, a write IO with REQ_SYNC flag will not
be bypassed though it is a sequential large IO, It's not a correct thing
to do actually, so this patch remove these codes.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: Correct return value for sysfs attach errors

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 77fa100f27475d08a569b9d51c17722130f089e7 upstream.

If you encounter any errors in bch_cached_dev_attach it will return
a negative error code. The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written which can cause incorrect user space behavior. Utilize 1
signed variable to use throughout the function to preserve error return
capability.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: correct cache_dirty_target in __update_writeback_rate()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit a8394090a9129b40f9d90dcb7f4a49d60c727ca6 upstream.

__update_write_rate() uses a Proportion-Differentiation Controller
algorithm to control writeback rate. A dirty target number is used in
this PD controller to control writeback rate. A larger target number
will make the writeback rate smaller, on the versus, a smaller target
number will make the writeback rate larger.

bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

The calculation at step 1) for cache_sectors is incorrect, which does
not consider dirty blocks occupied by flash only volume.

A flash only volume can be took as a bcache device without cached
device. All data sectors allocated for it are persistent on cache device
and marked dirty, they are not touched by bcache writeback and garbage
collection code. So data blocks of flash only volume should be ignore
when calculating cache_sectors of cache set.

Current code does not subtract dirty sectors of flash only volume, which
results a larger target number from the above 3 steps. And in sequence
the cache device's writeback rate is smaller then a correct value,
writeback speed is slower on all cached devices.

This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().

(Commit log composed by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: do not subtract sectors_to_gc for bypassed IO

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 69daf03adef5f7bc13e0ac86b4b8007df1767aab upstream.

Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
trigger gc thread.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: Fix leak of bdev reference

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 4b758df21ee7081ab41448d21d60367efaa625b3 upstream.

If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: initialize dirty stripes in flash_dev_run()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 175206cf9ab63161dec74d9cd7f9992e062491f5 upstream.

bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.

Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.

This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().

(Commit log is composed by Coly Li)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>