git.proxmox.com Git - mirror_ubuntu-artful-kernel.git/log

x86/intel_rdt: Prepare to add RDT monitor cpus file support

BugLink: http://bugs.launchpad.net/bugs/1591609
Separate the ctrl cpus file handling from the generic cpus file handling
and convert the per cpu closid from u32 to a struct which will be used
later to add rmid to the same struct. Also cleanup some name space.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-17-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit b09d981b3f346690dafa3e4ebedfcf3e44b68e83)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add tasks file support

BugLink: http://bugs.launchpad.net/bugs/1591609
The root directory, ctrl_mon and monitor groups are populated
with a read/write file named "tasks". When read, it shows all the task
IDs assigned to the resource group.

Tasks can be added to groups by writing the PID to the file. A task can
be present in one "ctrl_mon" group "and" one "monitor" group. IOW a
PID_x can be seen in a ctrl_mon group and a monitor group at the same
time. When a task is added to a ctrl_mon group, it is automatically
removed from the previous ctrl_mon group where it belonged. Similarly if
a task is moved to a monitor group it is removed from the previous
monitor group . Also since the monitor groups can only have subset of
tasks of parent ctrl_mon group, a task can be moved to a monitor group
only if its already present in the parent ctrl_mon group.

Task membership is indicated by a new field in the task_struct "u32
rmid" which holds the RMID for the task. RMID=0 is reserved for the
default root group where the tasks belong to at mount.

[tony: zero the rmid if rdtgroup was deleted when task was being moved]

Signed-off-by: Tony Luck <tony.luck@linux.intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-16-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit d6aaba615a482ce7d3ec218cf7b8d02d0d5753b8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Change closid type from int to u32

BugLink: http://bugs.launchpad.net/bugs/1591609
OS associates a CLOSid(Class of service id) to a task by writing the
high 32 bits of per CPU IA32_PQR_ASSOC MSR when a task is scheduled in.
CPUID.(EAX=10H, ECX=1):EDX[15:0] enumerates the max CLOSID supported and
it is zero indexed. Hence change the type to u32 from int.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-15-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 0734ded1abee9439b0c5d7b62af1ead78aab895b)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add mkdir support for RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Resource control groups can be created using mkdir in resctrl
fs(rdtgroup). In order to extend the resctrl interface to support
monitoring the control groups, extend the current mkdir to support
resource monitoring also.

This allows the rdtgroup created under the root directory to be able to
both control and monitor resources (ctrl_mon group). The ctrl_mon groups
are associated with one CLOSID like the legacy rdtgroups and one
RMID(Resource monitoring ID) as well. Hardware uses RMID to track the
resource usage. Once either of the CLOSID or RMID are exhausted, the
mkdir fails with -ENOSPC. If there are RMIDs in limbo list but not free
an -EBUSY is returned. User can also monitor a subset of the ctrl_mon
rdtgroup's tasks/cpus using the monitor groups. The monitor groups are
created using mkdir under the "mon_groups" directory in every ctrl_mon
group.

[Merged Tony's code: Removed a lot of common mkdir code, a fix to handling
of the list of the child rdtgroups and some cleanups in list
traversal. Also the changes to have similar alloc and free for CLOS/RMID
and return -EBUSY when RMIDs are in limbo and not free]

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-14-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit c7d9aac6131148abe29ed1dc6bd73ad1213d1f56)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Prepare for RDT monitoring mkdir support

BugLink: http://bugs.launchpad.net/bugs/1591609
Separate the ctrl mkdir code from the rest in order to prepare for
adding support for RDT monitoring mkdir support as well.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-13-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 65b4f403057e32da73c36e33d403890173c4c324)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add info files for RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Add info directory files specific to RDT monitoring.

num_rmids:
    The number of RMIDs which are valid for the resource.

mon_features:
    Lists the monitoring events if monitoring is enabled for the
    resource.

max_threshold_occupancy:
    This is specific to llc_occupancy monitoring and is used to
    determine if an RMID can be reused. Provides an upper bound on the
    threshold and is shown to the user in bytes though the internal
    value will be rounded to the scaling factor supported by the h/w.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-12-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit d4ab33201029913b594ae785a9665f45040396ab)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Simplify info and base file lists

BugLink: http://bugs.launchpad.net/bugs/1591609
The info directory files and base files need to be different for each
resource like cache and Memory bandwidth. With in each resource, the
files would be further different for monitoring and ctrl. This leads to
a lot of different static array declarations given that we are adding
resctrl monitoring.

Simplify this to one common list of files and then declare a set of
flags to choose the files based on the resource, whether it is info or
base and if it is control type file. This is as a preparation to include
monitoring based info and base files.

No functional change.

[Vikas: Extended the flags to have few bits per category like resource,
info/base etc]

Signed-off-by: Tony luck <tony.luck@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-11-git-send-email-vikas.shivappa@linux.intel.com
(backported from commit 5dc1d5c6bac2cfe3420cf353dfb0ef2e543f7c10)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Conflicts:
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c

x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management

BugLink: http://bugs.launchpad.net/bugs/1591609
Hardware uses RMID(Resource monitoring ID) to keep track of each of the
RDT events associated with tasks. The number of RMIDs is dependent on
the SKU and is enumerated via CPUID. We add support to manage the RMIDs
which include managing the RMID allocation and reading LLC occupancy
for an RMID.

RMID allocation is managed by keeping a free list which is initialized
to all available RMIDs except for RMID 0 which is always reserved for
root group. RMIDs goto a limbo list once they are
freed since the RMIDs are still tagged to cache lines of the tasks which
were using them - thereby still having some occupancy. They continue to
be in limbo list until the occupancy < threshold_occupancy. The
threshold_occupancy is a user configurable value.
OS uses IA32_QM_CTR MSR to read the occupancy associated with an RMID
after programming the IA32_EVENTSEL MSR with the RMID.

[Tony: Improved limbo search]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-10-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit edf6fa1c4a951b3a03e94b63e6483c5d9da3ab11)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Add RDT monitoring initialization

BugLink: http://bugs.launchpad.net/bugs/1591609
Add common data structures for RDT resource monitoring and perform RDT
monitoring related data structure initializations which include setting
up the RMID(Resource monitoring ID) lists and event list which the
resource supports.

[ tony: some cleanup to make adding MBM easier later, remove "cqm" from
some names, make some data structure local to intel_rdt_monitor.c
static. Add copyright header]

[ tglx: Made it readable ]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-9-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 6a445edce657810992594c7b9e679219aaf78ad9)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Make rdt_resources_all more readable

BugLink: http://bugs.launchpad.net/bugs/1591609
Change the format of the global rdt_resources_all. This holds all the
RDT resource structure initialization values. Make this more readable by
using the format:

rdt_resources_all[] = {
[<resource_index>] =
{...
}
...
}

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-8-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit dd131853f3fbc1c3aa051c34a2967c2f76309024)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Cleanup namespace to support RDT monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Few of the data-structures have generic names although they are RDT
allocation specific. Rename them to be allocation specific to
accommodate RDT monitoring. E.g. s/enabled/alloc_enabled/

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-7-git-send-email-vikas.shivappa@linux.intel.com
(backported from commit 1b5c0b7583173b787b5c93ff89838a950d0e23ff)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Conflicts:
arch/x86/kernel/cpu/intel_rdt.c

x86/intel_rdt: Mark rdt_root and closid_alloc as static

BugLink: http://bugs.launchpad.net/bugs/1591609
Sparse reports that both of these can be static.

Make it so.

Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Link: http://lkml.kernel.org/r/1501017287-28083-6-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit cb2200e967c65519ca6c5426644a49dca65f6294)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Change file names to accommodate RDT monitor code

BugLink: http://bugs.launchpad.net/bugs/1591609
Because the "perf cqm" and resctrl code were separately added and
indivdually configurable, there seem to be separate context switch code
and also things on global .h which are not really needed.

Move only the scheduling specific code and definitions to
<asm/intel_rdt_sched.h> and the put all the other declarations to a
local intel_rdt.h.

h/t to Reinette Chatre for pointing out that we should separate the
public interfaces used by other parts of the kernel from private
objects shared between the various files comprising RDT.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-5-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 0583020456cea9fcf43b84bb13a41eab059ae0a8)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt: Introduce a common compile option for RDT

BugLink: http://bugs.launchpad.net/bugs/1591609
We currently have a CONFIG_RDT_A which is for RDT(Resource directory
technology) allocation based resctrl filesystem interface. As a
preparation to add support for RDT monitoring as well into the same
resctrl filesystem, change the config option to be CONFIG_RDT which
would include both RDT allocation and monitoring code.

No functional change.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit f01d7d51f577b5dc0fa5919ab8a9228e2bf49f3e)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/intel_rdt/cqm: Documentation for resctrl based RDT Monitoring

BugLink: http://bugs.launchpad.net/bugs/1591609
Add a description of resctrl based RDT(resource director technology)
monitoring extension and its usage.

[Tony: Added descriptions for how monitoring and allocation are measured
and some cleanups]

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-3-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit 1640ae9471ae41eb18d2b214f1f40af3c4ed3828)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

x86/perf/cqm: Wipe out perf based cqm

BugLink: http://bugs.launchpad.net/bugs/1591609
'perf cqm' never worked due to the incompatibility between perf
infrastructure and cqm hardware support.  The hardware uses RMIDs to
track the llc occupancy of tasks and these RMIDs are per package. This
makes monitoring a hierarchy like cgroup along with monitoring of tasks
separately difficult and several patches sent to lkml to fix them were
NACKed. Further more, the following issues in the current perf cqm make
it almost unusable:

    1. No support to monitor the same group of tasks for which we do
    allocation using resctrl.

    2. It gives random and inaccurate data (mostly 0s) once we run out
    of RMIDs due to issues in Recycling.

    3. Recycling results in inaccuracy of data because we cannot
    guarantee that the RMID was stolen from a task when it was not
    pulling data into cache or even when it pulled the least data. Also
    for monitoring llc_occupancy, if we stop using an RMID_x and then
    start using an RMID_y after we reclaim an RMID from an other event,
    we miss accounting all the occupancy that was tagged to RMID_x at a
    later perf_count.

    2. Recycling code makes the monitoring code complex including
    scheduling because the event can lose RMID any time. Since MBM
    counters count bandwidth for a period of time by taking snap shot of
    total bytes at two different times, recycling complicates the way we
    count MBM in a hierarchy. Also we need a spin lock while we do the
    processing to account for MBM counter overflow. We also currently
    use a spin lock in scheduling to prevent the RMID from being taken
    away.

    4. Lack of support when we run different kind of event like task,
    system-wide and cgroup events together. Data mostly prints 0s. This
    is also because we can have only one RMID tied to a cpu as defined
    by the cqm hardware but a perf can at the same time tie multiple
    events during one sched_in.

    5. No support of monitoring a group of tasks. There is partial support
    for cgroup but it does not work once there is a hierarchy of cgroups
    or if we want to monitor a task in a cgroup and the cgroup itself.

    6. No support for monitoring tasks for the lifetime without perf
    overhead.

    7. It reported the aggregate cache occupancy or memory bandwidth over
    all sockets. But most cloud and VMM based use cases want to know the
    individual per-socket usage.

Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: ravi.v.shankar@intel.com
Cc: tony.luck@intel.com
Cc: fenghua.yu@intel.com
Cc: peterz@infradead.org
Cc: eranian@google.com
Cc: vikas.shivappa@intel.com
Cc: ak@linux.intel.com
Cc: davidcc@google.com
Cc: reinette.chatre@intel.com
Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com
(cherry picked from commit c39a0e2c8850f08249383f2425dbd8dbe4baad69)
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

KVM: VMX: Do not BUG() on out-of-bounds guest IRQ

The value of the guest_irq argument to vmx_update_pi_irte() is
ultimately coming from a KVM_IRQFD API call. Do not BUG() in
vmx_update_pi_irte() if the value is out-of bounds. (Especially,
since KVM as a whole seems to hang after that.)

Instead, print a message only once if we find that we don't have a
route for a certain IRQ (which can be out-of-bounds or within the
array).

This fixes CVE-2017-1000252.

Fixes: efc644048ecde54 ("KVM: x86: Update IRTE for posted-interrupts")
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 3a8b0677fc6180a467e26cc32ce6b0c09a32f9bb)
CVE-2017-1000252
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add P9 NX support for 842 compression engine

BugLink: http://bugs.launchpad.net/bugs/1718292
This patch adds P9 NX support for 842 compression engine. Virtual
Accelerator Switchboard (VAS) is used to access 842 engine on P9.

For each NX engine per chip, setup receive window using
vas_rx_win_open() which configures RxFIFo with FIFO address, lpid,
pid and tid values. This unique (lpid, pid, tid) combination will
be used to identify the target engine.

For crypto open request, open send window on the NX engine for
the corresponding chip / cpu where the open request is executed.
This send window will be closed upon crypto close request.

NX provides high and normal priority FIFOs. For compression /
decompression requests, we use only hight priority FIFOs in kernel.

Each NX request will be communicated to VAS using copy/paste
instructions with vas_copy_crb() / vas_paste_crb() functions.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b0d6c9bab5e41d07f2bece1ef8c81cd2175b5f88)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add P9 NX specific error codes for 842 engine

BugLink: http://bugs.launchpad.net/bugs/1718292
This patch adds changes for checking P9 specific 842 engine
error codes. These errros are reported in coprocessor status
block (CSB) for failures.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 146e9f1b65478643f2729a97ccb8be60bb4492e5)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Use kzalloc for workmem allocation

BugLink: http://bugs.launchpad.net/bugs/1718292
Send window is opened / closed for each crypto session.
So initializes txwin in workmem.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit f05368336b3ae399f66cf511c52c6d69c7bc6b39)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Add nx842_add_coprocs_list function

BugLink: http://bugs.launchpad.net/bugs/1718292
Updating coprocessor list is moved to nx842_add_coprocs_list().
This function will be used for both icswx and VAS functions.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit cd38a8a8a2ab95e43a031410db6a32f2d84e3fc0)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Create nx842_delete_coprocs function

BugLink: http://bugs.launchpad.net/bugs/1718292
Move deleting coprocessors info upon exit or failure to
nx842_delete_coprocs().

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 1ee51b28ee6ad7919da4fbe9672263dd274dcbfe)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Create nx842_configure_crb function

BugLink: http://bugs.launchpad.net/bugs/1718292
Configure CRB is moved to nx842_configure_crb() so that it can
be used for icswx and VAS exec functions. VAS function will be
added later with P9 support.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 56c10d5ea68b9a1425df0dedbe5d2710cc7d8bfa)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

crypto/nx: Rename nx842_powernv_function as icswx function

BugLink: http://bugs.launchpad.net/bugs/1718292
Rename nx842_powernv_function to nx842_powernv_exec.
nx842_powernv_exec points to nx842_exec_icswx and
will be point to VAS exec function which will be added later
for P9 NX support.

Signed-off-by: Haren Myneni <haren@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit c97f8169fb227cae5adeac56cafa980f25978031)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: [Config] CONFIG_PPC_VAS=y

BugLink: http://bugs.launchpad.net/bugs/1718293
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define copy/paste interfaces

BugLink: http://bugs.launchpad.net/bugs/1718293
Define interfaces (wrappers) to the 'copy' and 'paste'
instructions (which are new in PowerISA 3.0). These are intended to be
used to by NX driver(s) to submit Coprocessor Request Blocks (CRBs) to
the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 2392c8c8c0450293625dbef19ff5e206fb7b6749)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_tx_win_open()

BugLink: http://bugs.launchpad.net/bugs/1718293
Define an interface to open a VAS send window. This interface is
intended to be used the Nest Accelerator (NX) driver(s) to open
a send window and use it to submit compression/encryption requests
to a VAS receive window.

The receive window, identified by the [vasid, cop] parameters, must
already be open in VAS (i.e connected to an NX engine).

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 5239af679a07427647b009ebb9c70b1a03ebca9b)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_win_close() interface

BugLink: http://bugs.launchpad.net/bugs/1718293
Define the vas_win_close() interface which should be used to close a
send or receive windows.

While the hardware configurations required to open send and receive
windows differ, the configuration to close a window is the same for
both. So we use a single interface to close the window.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 98271d4198699947d66d6f8a02c09bd27cb90022)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_rx_win_open() interface

BugLink: http://bugs.launchpad.net/bugs/1718293
Define the vas_rx_win_open() interface. This interface is intended to
be used by the Nest Accelerator (NX) driver(s) to setup receive
windows for one or more NX engines (which implement compression &
encryption algorithms in the hardware).

Follow-on patches will provide an interface to close the window and to
open a send window that kernel subsystems can use to access the NX
engines.

The interface to open a receive window is expected to be invoked for
each instance of VAS in the system.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 62c4eda4fabe89709ec43dcf1efe9fbea007a734)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to alloc/free windows

BugLink: http://bugs.launchpad.net/bugs/1718293
Define helpers to allocate/free VAS window objects. These will be used
in follow-on patches when opening/closing windows.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit bbfe59f8a7057f80f67a74e77fb4e941240e90b9)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to init window context

BugLink: http://bugs.launchpad.net/bugs/1718293
Define helpers to initialize window context registers of the VAS
hardware. These will be used in follow-on patches when opening/closing
VAS windows.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b25b33ac18b35775949ab227bb3075bb6cb11bc3)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define helpers to access MMIO regions

BugLink: http://bugs.launchpad.net/bugs/1718293
Define some helper functions to access the MMIO regions. We use these
in follow-on patches to read/write VAS hardware registers. They are
also used to later issue 'paste' instructions to submit requests to
the NX hardware engines.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 180fe15a8299c14f77347c5835c98c2446226ee6)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define vas_init() and vas_exit()

BugLink: http://bugs.launchpad.net/bugs/1718293
Implement vas_init() and vas_exit() functions for a new VAS module.
This VAS module is essentially a library for other device drivers
and kernel users of the NX coprocessors like NX-842 and NX-GZIP.
In the future this will be extended to add support for user space
to access the NX coprocessors.

VAS is currently only supported with 64K page size.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 4dea2d1a927c61114a168d4509b56329ea6effb7)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Move GET_FIELD/SET_FIELD to vas.h

BugLink: http://bugs.launchpad.net/bugs/1718293
Move the GET_FIELD and SET_FIELD macros to vas.h as VAS and other
users of VAS, including NX-842 can use those macros.

There is a lot of related code between the VAS/NX kernel drivers
and skiboot. For consistency, switch the order of parameters in
SET_FIELD to match the order in skiboot.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit b6622a339e8670a1025d4dd84be473c76dabed33)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv/vas: Define macros, register fields and structures

BugLink: http://bugs.launchpad.net/bugs/1718293
Define macros for the VAS hardware registers and bit-fields as well
as couple of data structures needed by the VAS driver.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[mpe: Fixup include guard to use _ASM_POWERPC_VAS_H]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 967689141eb37c4365eac0fac82d857773098475)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Enable PCI peer-to-peer

BugLink: http://bugs.launchpad.net/bugs/1718293
P9 has support for PCI peer-to-peer, enabling a device to write in the
MMIO space of another device directly, without interrupting the CPU.

This patch adds support for it on powernv, by adding a new API to be
called by drivers. The pnv_pci_set_p2p(...) call configures an
'initiator', i.e the device which will issue the MMIO operation, and a
'target', i.e. the device on the receiving side.

P9 really only supports MMIO stores for the time being but that's
expected to change in the future, so the API allows to define both
load and store operations.

  /* PCI p2p descriptor */
  #define OPAL_PCI_P2P_ENABLE           0x1
  #define OPAL_PCI_P2P_LOAD             0x2
  #define OPAL_PCI_P2P_STORE            0x4

  int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
                      u64 desc)

It uses a new OPAL call, as the configuration magic is done on the
PHBs by skiboot.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
[mpe: Drop unrelated OPAL calls, s/uint64_t/u64/, minor formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit 2552910084a5e12e280caf082ab01468e187a064)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add support to set power-shifting-ratio

BugLink: http://bugs.launchpad.net/bugs/1718293
This patch adds support to set power-shifting-ratio which hints the
firmware how to distribute/throttle power between different entities
in a system (e.g CPU v/s GPU). This ratio is used by OCC for power
capping algorithm.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8e84b2d1f0f6a00b6476790f7bce6dcbffe91980)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add support for powercap framework

BugLink: http://bugs.launchpad.net/bugs/1718293
Adds a generic powercap framework to change the system powercap
inband through OPAL-OCC command/response interface.

Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(backported from commit cb8b340de21e1c57e1c6d4f26ccc4af46a3ed559)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/perf: Add nest IMC PMU support

BugLink: http://bugs.launchpad.net/bugs/1718293
Add support to register Nest In-Memory Collection PMU counters.
Patch adds a new device file called "imc-pmu.c" under powerpc/perf
folder to contain all the device PMU functions.

Device tree parser code added to parse the PMU events information
and create sysfs event attributes for the PMU.

Cpumask attribute added along with Cpu hotplug online/offline functions
specific for nest PMU. A new state "CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE"
added for the cpu hotplug callbacks. Error handle path frees the memory
and unregisters the CPU hotplug callbacks.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 885dcd709ba9120b9935415b8b0f9d1b94e5826b)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Detect and create IMC device

BugLink: http://bugs.launchpad.net/bugs/1718293
Code to create platform device for the In-Memory Collection (IMC)
counters. Platform devices are created based on the IMC compatibility.
New header file created to contain the data structures and macros
needed for In-Memory Collection (IMC) counter pmu devices.

The device tree for IMC counters starts at the node "imc-counters".
This node contains all the IMC PMU nodes and event nodes for these IMC
PMUs. Device probe() parses the device to locate three possible IMC
device types (Nest/Core/Thread). Function then branch to parse each
unit nodes to populate vital information such as device memory sizes,
event nodes information, base address for reserve memory access (if
any) and so on. Simple bare-minimum shutdown function added which only
"stops" the engines.

Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
[mpe: Fix build with CONFIG_PERF_EVENTS=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 8f95faaac56c18b32d0e23ace55417a440abdb7e)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

powerpc/powernv: Add IMC OPAL APIs

BugLink: http://bugs.launchpad.net/bugs/1718293
In-Memory Collection (IMC) counters are performance monitoring
infrastructure. These counters need special sequence of SCOMs to
init/start/stop which is handled by OPAL. And OPAL provides three APIs
to init and control these IMC engines.

OPAL API documentation:
https://github.com/open-power/skiboot/blob/master/doc/opal-api/opal-imc-counters.rst

Patch updates the kernel side powernv platform code to support the new
OPAL APIs

Signed-off-by: Hemant Kumar <hemant@linux.vnet.ibm.com>
Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
(cherry picked from commit 28a5db0061014c8afbbb98560cf420c29bc4d8e1)
Signed-off-by: Gustavo Walbon <gwalbon@linux.vnet.ibm.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

UBUNTU: d/s/m/insert-ubuntu-changes: use full version numbers

Ignore: yes

Make insert-ubuntu-changes to consider full version numbers when looping
through debian.master/changelog entries and comparing the version number
of each entry with the arguments passed to the script to decide which
entries should be included in the output changelog file.

Previously, only the last number in the version was used in this
comparison. For example, when comparing 4.4.0-50.51 and 4.4.0-83.84 only
the numbers 51 and 84 were actually used in the comparison. That however
might not work properly when the major version is bumped.

For instance, using "end" as 4.4.0-50.51 and "start" as 4.4.0-83.84 used
to work fine because 84 is greater than 51. However when using "end"
as 4.11.0-10.11 and "start" as 4.13.0-2.3, no entry was being selected
since 3 is not greater than 11.

Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
Acked-by: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

Linux 4.13.4

BugLink: http://bugs.launchpad.net/bugs/1720154
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

iwlwifi: add workaround to disable wide channels in 5GHz

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 01a9c948a09348950515bf2abb6113ed83e696d8 upstream.

The OTP in some SKUs have erroneously allowed 40MHz and 80MHz channels
in the 5.2GHz band. The firmware has been modified to not allow this
in those SKUs, so the driver needs to do the same otherwise the
firmware will assert when we try to use it.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

sched/cpuset/pm: Fix cpuset vs. suspend-resume bugs

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 50e76632339d4655859523a39249dd95ee5e93e7 upstream.

Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
because it now resulted in non-cpuset usage breaking too.

On suspend cpuset_cpu_inactive() doesn't call into
cpuset_update_active_cpus() because it doesn't want to move tasks about,
there is no need, all tasks are frozen and won't run again until after
we've resumed everything.

But this means that when we finally do call into
cpuset_update_active_cpus() after resuming the last frozen cpu in
cpuset_cpu_active(), the top_cpuset will not have any difference with
the cpu_active_mask and this it will not in fact do _anything_.

So the cpuset configuration will not be restored. This was largely
hidden because we would unconditionally create identity domains and
mobile users would not in fact use cpusets much. And servers what do use
cpusets tend to not suspend-resume much.

An addition problem is that we'd not in fact wait for the cpuset work to
finish before resuming the tasks, allowing spurious migrations outside
of the specified domains.

Fix the rebuild by introducing cpuset_force_rebuild() and fix the
ordering with cpuset_wait_for_hotplug().

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix bch_hprint crash and improve output

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9276717b9e297a62d1151a43d1cd286213f68eb7 upstream.

Most importantly, solve a crash where %llu was used to format signed
numbers.  This would cause a buffer overflow when reading sysfs
writeback_rate_debug, as only 20 bytes were allocated for this and
%llu writes 20 characters plus a null.

Always use the units mechanism rather than having different output
paths for simplicity.

Also, correct problems with display output where 1.10 was a larger
number than 1.09, by multiplying by 10 and then dividing by 1024 instead
of dividing by 100.  (Remainders of >= 1000 would print as .10).

Minor changes: Always display the decimal point instead of trying to
omit it based on number of digits shown.  Decide what units to use
based on 1000 as a threshold, not 1024 (in other words, always print
at most 3 digits before the decimal point).

Signed-off-by: Michael Lyle <mlyle@lyle.org>
Reported-by: Dmitry Yu Okunev <dyokunev@ut.mephi.ru>
Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix for gc and write-back race

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9baf30972b5568d8b5bc8b3c46a6ec5b58100463 upstream.

gc and write-back get raced (see the email "bcache get stucked" I sended
before):
gc thread                               write-back thread
|                                       |bch_writeback_thread()
|bch_gc_thread()                        |
|                                       |==>read_dirty()
|==>bch_btree_gc()                      |
|==>btree_root() //get btree root       |
|                //node write locker    |
|==>bch_btree_gc_root()                 |
|                                       |==>read_dirty_submit()
|                                       |==>write_dirty()
|                                       |==>continue_at(cl,
|                                       |               write_dirty_finish,
|                                       |               system_wq);
|                                       |==>write_dirty_finish()//excute
|                                       |               //in system_wq
|                                       |==>bch_btree_insert()
|                                       |==>bch_btree_map_leaf_nodes()
|                                       |==>__bch_btree_map_nodes()
|                                       |==>btree_root //try to get btree
|                                       |              //root node read
|                                       |              //lock
|                                       |-----stuck here
|==>bch_btree_set_root()
|==>bch_journal_meta()
|==>bch_journal()
|==>journal_try_write()
|==>journal_write_unlocked() //journal_full(&c->journal)
|                            //condition satisfied
|==>continue_at(cl, journal_write, system_wq); //try to excute
|                               //journal_write in system_wq
|                               //but work queue is excuting
|                               //write_dirty_finish()
|==>closure_sync(); //wait journal_write execute
|                   //over and wake up gc,
|-------------stuck here
|==>release root node write locker

This patch alloc a separate work-queue for write-back thread to avoid such
race.

(Commit log re-organized by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: fix sequential large write IO bypass

BugLink: http://bugs.launchpad.net/bugs/1720154
commit c81ffa32a214c84b08900fbc9d432187bd948eba upstream.

Sequential write IOs were tested with bs=1M by FIO in writeback cache
mode, these IOs were expected to be bypassed, but actually they did not.
We debug the code, and find in check_should_bypass():
    if (!congested &&
        mode == CACHE_MODE_WRITEBACK &&
        op_is_write(bio_op(bio)) &&
        (bio-＞bi_opf & REQ_SYNC))
        goto rescale
that means, If in writeback mode, a write IO with REQ_SYNC flag will not
be bypassed though it is a sequential large IO, It's not a correct thing
to do actually, so this patch remove these codes.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: Correct return value for sysfs attach errors

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 77fa100f27475d08a569b9d51c17722130f089e7 upstream.

If you encounter any errors in bch_cached_dev_attach it will return
a negative error code. The variable 'v' which stores the result is
unsigned, thus user space sees a very large value returned for bytes
written which can cause incorrect user space behavior. Utilize 1
signed variable to use throughout the function to preserve error return
capability.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: correct cache_dirty_target in __update_writeback_rate()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit a8394090a9129b40f9d90dcb7f4a49d60c727ca6 upstream.

__update_write_rate() uses a Proportion-Differentiation Controller
algorithm to control writeback rate. A dirty target number is used in
this PD controller to control writeback rate. A larger target number
will make the writeback rate smaller, on the versus, a smaller target
number will make the writeback rate larger.

bcache uses the following steps to calculate the target number,
1) cache_sectors = all-buckets-of-cache-set * buckets-size
2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
3) target = cache_dirty_target *
(sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)

The calculation at step 1) for cache_sectors is incorrect, which does
not consider dirty blocks occupied by flash only volume.

A flash only volume can be took as a bcache device without cached
device. All data sectors allocated for it are persistent on cache device
and marked dirty, they are not touched by bcache writeback and garbage
collection code. So data blocks of flash only volume should be ignore
when calculating cache_sectors of cache set.

Current code does not subtract dirty sectors of flash only volume, which
results a larger target number from the above 3 steps. And in sequence
the cache device's writeback rate is smaller then a correct value,
writeback speed is slower on all cached devices.

This patch fixes the incorrect slower writeback rate by subtracting
dirty sectors of flash only volumes in __update_writeback_rate().

(Commit log composed by Coly Li to pass checkpatch.pl checking)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: do not subtract sectors_to_gc for bypassed IO

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 69daf03adef5f7bc13e0ac86b4b8007df1767aab upstream.

Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
trigger gc thread.

Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Acked-by: Coly Li <colyli@suse.de>
Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: Fix leak of bdev reference

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 4b758df21ee7081ab41448d21d60367efaa625b3 upstream.

If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
block device using lookup_bdev() to detect which situation we are in to
properly report error. However we never drop the reference returned to
us from lookup_bdev(). Fix that.

Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

bcache: initialize dirty stripes in flash_dev_run()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 175206cf9ab63161dec74d9cd7f9992e062491f5 upstream.

bcache uses a Proportion-Differentiation Controller algorithm to control
writeback rate to cached devices. In the PD controller algorithm, dirty
stripes of thin flash device should not be counted in, because flash only
volumes never write back dirty data.

Currently dirty stripe counter for thin flash device is not initialized
when the thin flash device starts. Which means the following calculation
in PD controller will reference an undefined dirty stripes number, and
all cached devices attached to the same cache set where the thin flash
device lies on may have an inaccurate writeback rate.

This patch calles bch_sectors_dirty_init() in flash_dev_run(), to
correctly initialize dirty stripe counter when the thin flash device
starts to run. This patch also does following parameter data type change,
-void bch_sectors_dirty_init(struct cached_dev *dc);
+void bch_sectors_dirty_init(struct bcache_device *);
to call this function conveniently in flash_dev_run().

(Commit log is composed by Coly Li)

Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

ALSA: seq: Cancel pending autoload work at unbinding device

BugLink: http://bugs.launchpad.net/bugs/1720154
commit fc27fe7e8deef2f37cba3f2be2d52b6ca5eb9d57 upstream.

ALSA sequencer core has a mechanism to load the enumerated devices
automatically, and it's performed in an off-load work.  This seems
causing some race when a sequencer is removed while the pending
autoload work is running.  As syzkaller spotted, it may lead to some
use-after-free:
  BUG: KASAN: use-after-free in snd_rawmidi_dev_seq_free+0x69/0x70
  sound/core/rawmidi.c:1617
  Write of size 8 at addr ffff88006c611d90 by task kworker/2:1/567

  CPU: 2 PID: 567 Comm: kworker/2:1 Not tainted 4.13.0+ #29
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events autoload_drivers
  Call Trace:
   __dump_stack lib/dump_stack.c:16 [inline]
   dump_stack+0x192/0x22c lib/dump_stack.c:52
   print_address_description+0x78/0x280 mm/kasan/report.c:252
   kasan_report_error mm/kasan/report.c:351 [inline]
   kasan_report+0x230/0x340 mm/kasan/report.c:409
   __asan_report_store8_noabort+0x1c/0x20 mm/kasan/report.c:435
   snd_rawmidi_dev_seq_free+0x69/0x70 sound/core/rawmidi.c:1617
   snd_seq_dev_release+0x4f/0x70 sound/core/seq_device.c:192
   device_release+0x13f/0x210 drivers/base/core.c:814
   kobject_cleanup lib/kobject.c:648 [inline]
   kobject_release lib/kobject.c:677 [inline]
   kref_put include/linux/kref.h:70 [inline]
   kobject_put+0x145/0x240 lib/kobject.c:694
   put_device+0x25/0x30 drivers/base/core.c:1799
   klist_devices_put+0x36/0x40 drivers/base/bus.c:827
   klist_next+0x264/0x4a0 lib/klist.c:403
   next_device drivers/base/bus.c:270 [inline]
   bus_for_each_dev+0x17e/0x210 drivers/base/bus.c:312
   autoload_drivers+0x3b/0x50 sound/core/seq_device.c:117
   process_one_work+0x9fb/0x1570 kernel/workqueue.c:2097
   worker_thread+0x1e4/0x1350 kernel/workqueue.c:2231
   kthread+0x324/0x3f0 kernel/kthread.c:231
   ret_from_fork+0x25/0x30 arch/x86/entry/entry_64.S:425

The fix is simply to assure canceling the autoload work at removing
the device.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

PM / devfreq: Fix memory leak when fail to register device

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9e14de1077e9c34f141cf98bdba60cdd5193d962 upstream.

When the devfreq_add_device fails to register deivce, the memory
leak of devfreq instance happen. So, this patch fix the memory
leak issue. Before freeing the devfreq instance checks whether
devfreq instance is NULL or not because the device_unregister()
frees the devfreq instance when jumping to the 'err_init'.
It is to prevent the duplicate the kfee(devfreq).

Fixes: ac4b281176a5 ("PM / devfreq: fix duplicated kfree on devfreq pointer")
Signed-off-by: Chanwoo Choi <cw00.choi@samsung.com>
Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

media: adv7180: add missing adv7180cp, adv7180st i2c device IDs

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 281ddc3cdc10413b98531d701ab5323c4f3ff1f4 upstream.

Fixes a crash on Renesas R8A7793 Gose board that uses these "compatible"
entries.

Signed-off-by: Ulrich Hecht <ulrich.hecht+renesas@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

media: uvcvideo: Prevent heap overflow when accessing mapped controls

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 7e09f7d5c790278ab98e5f2c22307ebe8ad6e8ba upstream.

The size of uvc_control_mapping is user controlled leading to a
potential heap overflow in the uvc driver. This adds a check to verify
the user provided size fits within the bounds of the defined buffer
size.

Originally-from: Richard Simmons <rssimmo@amazon.com>

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

media: venus: fix copy/paste error in return_buf_error

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 0de0ef6c3f2dd7e9965270683445917e10384ab0 upstream.

Call function v4l2_m2m_dst_buf_remove_by_buf() instead of
v4l2_m2m_src_buf_remove_by_buf()

Addresses-Coverity-ID: 1415317

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Stanimir Varbanov <stanimir.varbanov@linaro.org>
Signed-off-by: Hans Verkuil <hansverk@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

media: Revert "[media] lirc_dev: remove superfluous get/put_device() calls"

BugLink: http://bugs.launchpad.net/bugs/1720154
commit a607f51e5a4c421e2097077db88105402099c528 upstream.

This reverts commit 5be2b76a9ca4ea5fd3e221114d62eeb0d78267ca.

Only when the lirc device is freed, should we drop our reference to
rc_dev, else we the rc_dev is freed to early. If userspace has
a file descriptor open during unplug, it goes bang.

==================================================================
BUG: KASAN: use-after-free in __lock_acquire+0x7bb/0x1e10
Read of size 8 at addr ffff8801d7d61ed0 by task ir-rec/2609

-snip-
mutex_lock_nested+0x1b/0x20
? mutex_lock_nested+0x1b/0x20
rc_close.part.6+0x20/0x60 [rc_core]
rc_close+0x13/0x20 [rc_core]
lirc_dev_fop_close+0x62/0xd0 [lirc_dev]
__fput+0x236/0x410
? fput+0xb0/0xb0
? do_raw_spin_trylock+0x110/0x110
? set_rq_offline.part.70+0xa0/0xa0
____fput+0xe/0x10
task_work_run+0x116/0x180
? task_work_cancel+0x170/0x170
? _raw_spin_unlock+0x27/0x40
? switch_task_namespaces+0x5f/0x90
do_exit+0x68b/0xe80

Fixes: 5be2b76a9ca4 ("[media] lirc_dev: remove superfluous get/put_device() calls")
Signed-off-by: Sean Young <sean@mess.org>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

media: v4l2-compat-ioctl32: Fix timespec conversion

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9c7ba1d7634cef490b85bc64c4091ff004821bfd upstream.

Certain syscalls like recvmmsg support 64 bit timespec values for the
X32 ABI. The helper function compat_put_timespec converts a timespec
value to a 32 bit or 64 bit value depending on what ABI is used. The
v4l2 compat layer, however, is not designed to support 64 bit timespec
values and always uses 32 bit values. Hence, compat_put_timespec must
not be used.

Without this patch, user space will be provided with bad timestamp
values from the VIDIOC_DQEVENT ioctl. Also, fields of the struct
v4l2_event32 that come immediately after timestamp get overwritten,
namely the field named id.

Fixes: 81993e81a994 ("compat: Get rid of (get|put)_compat_time(val|spec)")
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Cc: Tiffany Lin <tiffany.lin@mediatek.com>
Cc: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 3ef0c7a730de0bae03d86c19570af764fa3c4445 upstream.

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
pair.

Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all buckets")
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

PCI: pciehp: Report power fault only once until we clear it

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 7612b3b28c0b900dcbcdf5e9b9747cc20a1e2455 upstream.

When a power fault occurs, the power controller sets Power Fault Detected
in the Slot Status register, and pciehp_isr() queues an INT_POWER_FAULT
event to handle it.

It also clears Power Fault Detected, but since nothing has yet changed to
correct the power fault, the power controller will likely set it again
immediately, which may cause an infinite loop when pcie_isr() rechecks
Slot Status.

Fix that by masking off Power Fault Detected from new events if the driver
hasn't seen the power fault clear from the previous handling attempt.

Fixes: fad214b0aa72 ("PCI: pciehp: Process all hotplug events before looking for new ones")
Signed-off-by: Keith Busch <keith.busch@intel.com>
[bhelgaas: changelog, pull test out and add comment]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Mayurkumar Patel <mayurkumar.patel@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

PCI: shpchp: Enable bridge bus mastering if MSI is enabled

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 48b79a14505349a29b3e20f03619ada9b33c4b17 upstream.

An SHPC may generate MSIs to notify software about slot or controller
events (SHPC spec r1.0, sec 4.7). A PCI device can only generate an MSI if
it has bus mastering enabled.

Enable bus mastering if the bridge contains an SHPC that uses MSI for event
notifications.

Signed-off-by: Aleksandr Bezzubikov <zuban32s@gmail.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

ARC: Re-enable MMU upon Machine Check exception

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 1ee55a8f7f6b7ca4c0c59e0b4b4e3584a085c2d3 upstream.

I recently came upon a scenario where I would get a double fault
machine check exception tiriggered by a kernel module.
However the ensuing crash stacktrace (ksym lookup) was not working
correctly.

Turns out that machine check auto-disables MMU while modules are allocated
in kernel vaddr spapce.

This patch re-enables the MMU before start printing the stacktrace
making stacktracing of modules work upon a fatal exception.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Reviewed-by: Alexey Brodkin <abrodkin@synopsys.com>
Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
[vgupta: moved code into low level handler to avoid in 2 places]
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

tracing: Apply trace_clock changes to instance max buffer

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 170b3b1050e28d1ba0700e262f0899ffa4fccc52 upstream.

Currently trace_clock timestamps are applied to both regular and max
buffers only for global trace. For instance trace, trace_clock
timestamps are applied only to regular buffer. But, regular and max
buffers can be swapped, for example, following a snapshot. So, for
instance trace, bad timestamps can be seen following a snapshot.
Let's apply trace_clock timestamps to instance max buffer as well.

Link: http://lkml.kernel.org/r/ebdb168d0be042dcdf51f81e696b17fabe3609c1.1504642143.git.tom.zanussi@linux.intel.com
Fixes: 277ba0446 ("tracing: Add interface to allow multiple trace buffers")
Signed-off-by: Baohong Liu <baohong.liu@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

tracing: Fix clear of RECORDED_TGID flag when disabling trace event

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 7685ab6c58557c6234f3540260195ecbee7fc4b3 upstream.

When disabling one trace event, the RECORDED_TGID flag in the event
file is not correctly cleared. It's clearing RECORDED_CMD flag when
it should clear RECORDED_TGID flag.

Link: http://lkml.kernel.org/r/1504589806-8425-1-git-send-email-chuhu@redhat.com
Cc: Joel Fernandes <joelaf@google.com>
Fixes: d914ba37d7 ("tracing: Add support for recording tgid of tasks")
Signed-off-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

tracing: Add barrier to trace_printk() buffer nesting modification

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 3d9622c12c8873911f4cc0ccdabd0362c2fca06b upstream.

trace_printk() uses 4 buffers, one for each context (normal, softirq, irq
and NMI), such that it does not need to worry about one context preempting
the other. There's a nesting counter that gets incremented to figure out
which buffer to use. If the context gets preempted by another context which
calls trace_printk() it will increment the counter and use the next buffer,
and restore the counter when it is finished.

The problem is that gcc may optimize the modification of the buffer nesting
counter and it may not be incremented in memory before the buffer is used.
If this happens, and the context gets interrupted by another context, it
could pick the same buffer and corrupt the one that is being used.

Compiler barriers need to be added after the nesting variable is incremented
and before it is decremented to prevent usage of the context buffers by more
than one context at the same time.

Cc: Andy Lutomirski <luto@kernel.org>
Fixes: e2ace00117 ("tracing: Choose static tp_printk buffer by explicit nesting count")
Hat-tip-to: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

ftrace: Fix memleak when unregistering dynamic ops when tracing disabled

BugLink: http://bugs.launchpad.net/bugs/1720154
commit edb096e00724f02db5f6ec7900f3bbd465c6c76f upstream.

If function tracing is disabled by the user via the function-trace option or
the proc sysctl file, and a ftrace_ops that was allocated on the heap is
unregistered, then the shutdown code exits out without doing the proper
clean up. This was found via kmemleak and running the ftrace selftests, as
one of the tests unregisters with function tracing disabled.

# cat kmemleak
unreferenced object 0xffffffffa0020000 (size 4096):
  comm "swapper/0", pid 1, jiffies 4294668889 (age 569.209s)
  hex dump (first 32 bytes):
    55 ff 74 24 10 55 48 89 e5 ff 74 24 18 55 48 89  U.t$.UH...t$.UH.
    e5 48 81 ec a8 00 00 00 48 89 44 24 50 48 89 4c  .H......H.D$PH.L
  backtrace:
    [<ffffffff81d64665>] kmemleak_vmalloc+0x85/0xf0
    [<ffffffff81355631>] __vmalloc_node_range+0x281/0x3e0
    [<ffffffff8109697f>] module_alloc+0x4f/0x90
    [<ffffffff81091170>] arch_ftrace_update_trampoline+0x160/0x420
    [<ffffffff81249947>] ftrace_startup+0xe7/0x300
    [<ffffffff81249bd2>] register_ftrace_function+0x72/0x90
    [<ffffffff81263786>] trace_selftest_ops+0x204/0x397
    [<ffffffff82bb8971>] trace_selftest_startup_function+0x394/0x624
    [<ffffffff81263a75>] run_tracer_selftest+0x15c/0x1d7
    [<ffffffff82bb83f1>] init_trace_selftests+0x75/0x192
    [<ffffffff81002230>] do_one_initcall+0x90/0x1e2
    [<ffffffff82b7d620>] kernel_init_freeable+0x350/0x3fe
    [<ffffffff81d61ec3>] kernel_init+0x13/0x122
    [<ffffffff81d72c6a>] ret_from_fork+0x2a/0x40
    [<ffffffffffffffff>] 0xffffffffffffffff

Fixes: 12cce594fa ("ftrace/x86: Allow !CONFIG_PREEMPT dynamic ops to use allocated trampolines")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

ftrace: Fix selftest goto location on error

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 46320a6acc4fb58f04bcf78c4c942cc43b20f986 upstream.

In the second iteration of trace_selftest_ops(), the error goto label is
wrong in the case where trace_selftest_test_global_cnt is off. In the
case of error, it leaks the dynamic ops that was allocated.

Fixes: 95950c2e ("ftrace: Add self-tests for multiple function trace users")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

ftrace: Fix debug preempt config name in stack_tracer_{en,dis}able

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 60361e12d01676e23a8de89a5ef4a349ae97f616 upstream.

stack_tracer_disable()/stack_tracer_enable() had been using the wrong
name for the config symbol to enable their preempt-debugging checks --
fix with a word swap.

Link: http://lkml.kernel.org/r/20170831154036.4xldyakmmhuts5x7@hatter.bewilderbeest.net
Fixes: 8aaf1ee70e ("tracing: Rename trace_active to disable_stack_tracer and inline its modification")
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

mailbox: bcm-flexrm-mailbox: Fix mask used in CMPL_START_ADDR_VALUE()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 6d2061b981af165d3e45462e0804b5a1f2f4c7bc upstream.

The mask used in CMPL_START_ADDR_VALUE() should be 27bits instead of
26bits. This incorrect mask was causing completion writes to 40bits
physical address fail.

This patch fixes mask used in CMPL_START_ADDR_VALUE() macro.

Fixes: dbc049eee730 ("mailbox: Add driver for Broadcom FlexRM
ring manager")

Signed-off-by: Anup Patel <anup.patel@broadcom.com>
Reviewed-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Fix an integer overflow in sysfs code

BugLink: http://bugs.launchpad.net/bugs/1720154
commit e6f77540c067b48dee10f1e33678415bfcc89017 upstream.

The value of "size" comes from the user.  When we add "start + size" it
could lead to an integer overflow bug.

It means we vmalloc() a lot more memory than we had intended.  I believe
that on 64 bit systems vmalloc() can succeed even if we ask it to
allocate huge 4GB buffers.  So we would get memory corruption and likely
a crash when we call ha->isp_ops->write_optrom() and ->read_optrom().

Only root can trigger this bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=194061
Fixes: b7cc176c9eb3 ("[SCSI] qla2xxx: Allow region-based flash-part accesses.")
Reported-by: shqking <shqking@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Use fabric name for Get Port Speed command

BugLink: http://bugs.launchpad.net/bugs/1720154
commit b2e8ae3f0e342a3308b4573790bd42528e51885a upstream.

The Get Port Speed switch command needs the fabric port name of the
remote device. Current code uses the registered WWPN.

Fixes: 726b85487067d ("qla2xxx: Add framework for async fabric discovery")
Cc: <stable@vger.kernel.org> # 4.10+
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Use BIT_6 to acquire FAWWPN from switch

BugLink: http://bugs.launchpad.net/bugs/1720154
commit fcc5b5cd726c0779cd689362aea82cc9d5a61346 upstream.

If FA-WWPN feature disabled on the switch side and enabled for the
adapter, then driver would update the port name with switch port name.

This patch fixes issue by checking correct BIT flag to validate.

Fixes: 41dc529a4602 ("qla2xxx: Improve RSCN handling in driver")
Signed-off-by: Sawan Chandak <sawan.chandak@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Fix target multiqueue configuration

BugLink: http://bugs.launchpad.net/bugs/1720154
commit b7edfa235effb4b4a9816c2345620b11609c123e upstream.

Following error will be logged in to message file while trying to
configure target with multiqueue.

"Cmd 0x1f aborted with timeout since ISP Abort is pending"
"qla25xx_init_queues Rsp que: 1 init failed."

Fixes: 82de802ad46e ("scsi: qla2xxx: Preparation for Target MQ.")
Signed-off-by: Quinn Tran <quinn.tran@cavium.com>
Signed-off-by: Michael Hernandez <michael.hernandez@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Correction to vha->vref_count timeout

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 6e98095f8fb6d98da34c4e6c34e69e7c638d79c0 upstream.

Fix incorrect second argument for wait_event_timeout()

Fixes: c4a9b538ab2a ("qla2xxx: Allow vref count to timeout on vport delete.")
Signed-off-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qla2xxx: Update fw_started flags at qpair creation.

BugLink: http://bugs.launchpad.net/bugs/1720154
commit e6373f33a6bba0de9f543f4a7faeaaa536c62997 upstream.

Fixes: 4b60c82736d0 ("scsi: qla2xxx: Add fw_started flags to qpair")
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: sg: fixup infoleak when using SG_GET_REQUEST_TABLE

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 3e0097499839e0fe3af380410eababe5a47c4cf9 upstream.

When calling SG_GET_REQUEST_TABLE ioctl only a half-filled table is
returned; the remaining part will then contain stale kernel memory
information. This patch zeroes out the entire table to avoid this
issue.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: sg: factor out sg_fill_request_table()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 4759df905a474d245752c9dc94288e779b8734dd upstream.

Factor out sg_fill_request_table() for better readability.

[mkp: typos, applied by hand]

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: storvsc: fix memory leak on ring buffer busy

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 0208eeaa650c5c866a3242201678a19e6dc4a14e upstream.

When storvsc is sending I/O to Hyper-v, it may allocate a bigger buffer
descriptor for large data payload that can't fit into a pre-allocated
buffer descriptor. This bigger buffer is freed on return path.

If I/O request to Hyper-v fails due to ring buffer busy, the storvsc
allocated buffer descriptor should also be freed.

[mkp: applied by hand]

Fixes: be0cf6ca301c ("scsi: storvsc: Set the tablesize based on the information given by the host")
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: megaraid_sas: Return pended IOCTLs with cmd_status MFI_STAT_WRONG_STATE in case adapter is dead

BugLink: http://bugs.launchpad.net/bugs/1720154
commit eb3fe263a48b0d27b229c213929c4cb3b1b39a0f upstream.

After a kill adapter, since the cmd_status is not set, the IOCTLs will
be hung in driver resulting in application hang. Set cmd_status
MFI_STAT_WRONG_STATE when completing pended IOCTLs.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: megaraid_sas: Check valid aen class range to avoid kernel panic

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 91b3d9f0069c8307d0b3a4c6843b65a439183318 upstream.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: megaraid_sas: set minimum value of resetwaittime to be 1 secs

BugLink: http://bugs.launchpad.net/bugs/1720154
commit e636a7a430f41efb0ff2727960ce61ef9f8f6769 upstream.

Setting resetwaittime to 0 during a FW fault will result in driver not
calling the OCR.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: megaraid_sas: mismatch of allocated MFI frame size and length exposed in MFI MPT pass through command

BugLink: http://bugs.launchpad.net/bugs/1720154
commit ed2983f458bed9dc827ec60c8486253b1669bb52 upstream.

Driver allocated 256 byte MFI frames bytes but while sending MFI frame
(embedded inside chain frame of MPT frame) to firmware, driver sets the
length as 4k. This results in DMA read error messages during boot.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: aacraid: Fix command send race condition

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 1ae948fa4f00f3a2823e7cb19a3049ef27dd6947 upstream.

This fixes a potential race condition observed on Power systems.

Several places throughout the aacraid driver call aac_fib_send or
similar to send a command to the aacraid adapter, then check the return
code to determine if the command was actually sent to the adapter, then
update the phase field in the scsi command scratch pad area to track
that the firmware now owns this command.  However, there is nothing that
ensures that by the time the aac_fib_send function returns and we go to
write to the scsi command, that the command hasn't already completed and
the scsi command has been freed.  This was causing random crashes in the
TCP stack which was tracked down to be caused by memory that had been a
struct request + scsi_cmnd being now used for an skbuff. Memory
poisoning was enabled in the kernel to debug this which showed that the
last owner of the memory that had been freed was aacraid and that it was
a struct request.  The memory that was corrupted was the exact data
pattern of AAC_OWNER_FIRMWARE and it was at the same offset that aacraid
writes, which is scsicmd->SCp.phase. The patch below resolves this
issue.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Tested-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Reviewed-by: Dave Carroll <david.carroll@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: qedi: off by one in qedi_get_cmd_from_tid()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit fa2d9d6e894e096678a50ef0f65f7a8c3d8a40b8 upstream.

The > here should be >= or we end up reading one element beyond the end
of the qedi->itt_map[] array. The qedi->itt_map[] array is allocated in
qedi_alloc_itt().

Fixes: ace7f46ba5fd ("scsi: qedi: Add QLogic FastLinQ offload iSCSI driver framework.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Manish Rangankar <Manish.Rangankar@cavium.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: trace high part of "new" 64 bit SCSI LUN

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 5d4a3d0a2ff23799b956e5962b886287614e7fad upstream.

Complements debugging aspects of the otherwise functionally complete
v3.17 commit 9cb78c16f5da ("scsi: use 64-bit LUNs").

While I don't have access to a target exporting 3 or 4 level LUNs,
I did test it by explicitly attaching a non-existent fake 4 level LUN
by means of zfcp sysfs attribute "unit_add".
In order to see corresponding trace records of otherwise successful
events, we had to increase the trace level of area SCSI and HBA to 6.

$ echo 6 > /sys/kernel/debug/s390dbf/zfcp_0.0.1880_scsi/level
$ echo 6 > /sys/kernel/debug/s390dbf/zfcp_0.0.1880_hba/level

$ echo 0x4011402240334044 > \
  /sys/bus/ccw/drivers/zfcp/0.0.1880/0x50050763031bd327/unit_add

Example output formatted by an updated zfcpdbf from the s390-tools
package interspersed with kernel messages at scsi_logging_level=4605:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : scsla_1
LUN            : 0x4011402240334044
WWPN           : 0x50050763031bd327
D_ID           : 0x00......
Adapter status : 0x5400050b
Port status    : 0x54000001
LUN status     : 0x41000000
Ready count    : 0x00000001
Running count  : 0x00000000
ERP want       : 0x01
ERP need       : 0x01

scsi 2:0:0:4630896905707208721: scsi scan: INQUIRY pass 1 length 36
scsi 2:0:0:4630896905707208721: scsi scan: INQUIRY successful with code 0x0

Timestamp      : ...
Area           : HBA
Subarea        : 00
Level          : 6
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : fs_norm
Request ID     : 0x<inquiry2-req-id>
Request status : 0x00000010
FSF cmnd       : 0x00000001
FSF sequence no: 0x...
FSF issued     : ...
FSF stat       : 0x00000000
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001
Prot stat qual : ........ ........ 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x...
|
Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 6
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : rsl_nor
Request ID     : 0x<inquiry2-req-id>
SCSI ID        : 0x00000000
SCSI LUN       : 0x40224011
SCSI LUN high  : 0x40444033 <=======================
SCSI result    : 0x00000000
SCSI retries   : 0x00
SCSI allowed   : 0x03
SCSI scribble  : 0x<inquiry2-req-id>
SCSI opcode    : 12000000 a4000000 00000000 00000000
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000000 00000000
                 00000000 00000000

scsi 2:0:0:4630896905707208721: scsi scan: INQUIRY pass 2 length 164
scsi 2:0:0:4630896905707208721: scsi scan: INQUIRY successful with code 0x0
scsi 2:0:0:4630896905707208721: scsi scan: peripheral device type of 31, \
no device added

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 9cb78c16f5da ("scsi: use 64-bit LUNs")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Reviewed-by: Jens Remus <jremus@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: trace HBA FSF response by default on dismiss or timedout late response

BugLink: http://bugs.launchpad.net/bugs/1720154
commit fdb7cee3b9e3c561502e58137a837341f10cbf8b upstream.

At the default trace level, we only trace unsuccessful events including
FSF responses.

zfcp_dbf_hba_fsf_response() only used protocol status and FSF status to
decide on an unsuccessful response. However, this is only one of multiple
possible sources determining a failed struct zfcp_fsf_req.

An FSF request can also "fail" if its response runs into an ERP timeout
or if it gets dismissed because a higher level recovery was triggered
[trace tags "erscf_1" or "erscf_2" in zfcp_erp_strategy_check_fsfreq()].
FSF requests with ERP timeout are:
FSF_QTCB_EXCHANGE_CONFIG_DATA, FSF_QTCB_EXCHANGE_PORT_DATA,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT or
FSF_QTCB_CLOSE_PHYSICAL_PORT for target ports,
FSF_QTCB_OPEN_LUN, FSF_QTCB_CLOSE_LUN.
One example is slow queue processing which can cause follow-on errors,
e.g. FSF_PORT_ALREADY_OPEN after FSF_QTCB_OPEN_PORT_WITH_DID timed out.
In order to see the root cause, we need to see late responses even if the
channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fcegpf1
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Ready count    : 0x00000001
Running count  : 0x...
ERP want       : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
ERP need       : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
|
Timestamp      : ... 30 seconds later
Area           : REC
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 2
Tag            : erscf_2
LUN            : 0xffffffffffffffff
WWPN           : 0x<WWPN>
D_ID           : 0x00<D_ID>
Adapter status : 0x5400050b
Port status    : 0x41200000
LUN status     : 0x00000000
Request ID     : 0x<request_ID>
ERP status     : 0x10000000 ZFCP_STATUS_ERP_TIMEDOUT
ERP step       : 0x0800 ZFCP_ERP_STEP_PORT_OPENING
ERP action     : 0x02 ZFCP_ERP_ACTION_REOPEN_PORT
ERP count      : 0x00
|
Timestamp      : ... later than previous record
Area           : HBA
Subarea        : 00
Level          : 5 > default level => 3 <= default level
Exception      : -
CPU ID         : 00
Caller         : ...
Record ID      : 1
Tag            : fs_qtcb => fs_rerr
Request ID     : 0x<request_ID>
Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000005
FSF sequence no: 0x...
FSF issued     : ... > 30 seconds ago
FSF stat       : 0x00000000 FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001 FSF_PROT_GOOD
Prot stat qual : 00000000 00000000 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x00000000
QTCB log length: ...
QTCB log info  : ...

In case of problems detecting that new responses are waiting on the input
queue, we sooner or later trigger adapter recovery due to an FSF request
timeout (trace tag "fsrth_1").
FSF requests with FSF request timeout are:
typically FSF_QTCB_ABORT_FCP_CMND; but theoretically also
FSF_QTCB_EXCHANGE_CONFIG_DATA or FSF_QTCB_EXCHANGE_PORT_DATA via sysfs,
FSF_QTCB_OPEN_PORT_WITH_DID or FSF_QTCB_CLOSE_PORT for WKA ports,
FSF_QTCB_FCP_CMND for task management function (LUN / target reset).
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.

In a theroretical case, inject code can create an erroneous FSF request
on purpose. If data router is enabled, it uses deferred error reporting.
A READ SCSI command can succeed with FSF_PROT_GOOD, FSF_GOOD, and
SAM_STAT_GOOD. But on writing the read data to host memory via DMA,
it can still fail, e.g. if an intentionally wrong scatter list does not
provide enough space. Rather than getting an unsuccessful response,
we get a QDIO activate check which in turn triggers adapter recovery.
One or more pending requests can meanwhile have FSF_PROT_GOOD and FSF_GOOD
because the channel filled in the response via DMA into the request's QTCB.
Example trace records formatted with zfcpdbf from the s390-tools package:

Timestamp      : ...
Area           : HBA
Subarea        : 00
Level          : 6 > default level => 3 <= default level
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : fs_norm => fs_rerr
Request ID     : 0x<request_ID2>
Request status : 0x00001010 ZFCP_STATUS_FSFREQ_DISMISSED
| ZFCP_STATUS_FSFREQ_CLEANUP
FSF cmnd       : 0x00000001
FSF sequence no: 0x...
FSF issued     : ...
FSF stat       : 0x00000000 FSF_GOOD
FSF stat qual  : 00000000 00000000 00000000 00000000
Prot stat      : 0x00000001 FSF_PROT_GOOD
Prot stat qual : ........ ........ 00000000 00000000
Port handle    : 0x...
LUN handle     : 0x...
|
Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU ID         : ..
Caller         : ...
Record ID      : 1
Tag            : rsl_err
Request ID     : 0x<request_ID2>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x000e0000 DID_TRANSPORT_DISRUPTED
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x<request_ID2>
SCSI opcode    : 28... Read(10)
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000000 00000000
                                         ^^ SAM_STAT_GOOD
                 00000000 00000000

Only with luck in both above cases, we could see a follow-on trace record
of an unsuccesful event following a successful but late FSF response with
FSF_PROT_GOOD and FSF_GOOD. Typically this was the case for I/O requests
resulting in a SCSI trace record "rsl_err" with DID_TRANSPORT_DISRUPTED
[On ZFCP_STATUS_FSFREQ_DISMISSED, zfcp_fsf_protstatus_eval() sets
ZFCP_STATUS_FSFREQ_ERROR seen by the request handler functions as failure].
However, the reason for this follow-on trace was invisible because the
corresponding HBA trace record was missing at the default trace level
(by default hidden records with tags "fs_norm", "fs_qtcb", or "fs_open").

On adapter recovery, after we had shut down the QDIO queues, we perform
unsuccessful pseudo completions with flag ZFCP_STATUS_FSFREQ_DISMISSED
for each pending FSF request in zfcp_fsf_req_dismiss_all().
In order to find the root cause, we need to see all pseudo responses even
if the channel presented them successfully with FSF_PROT_GOOD and FSF_GOOD.

Therefore, check zfcp_fsf_req.status for ZFCP_STATUS_FSFREQ_DISMISSED
or ZFCP_STATUS_FSFREQ_ERROR and trace with a new tag "fs_rerr".

It does not matter that there are numerous places which set
ZFCP_STATUS_FSFREQ_ERROR after the location where we trace an FSF response
early. These cases are based on protocol status != FSF_PROT_GOOD or
== FSF_PROT_FSF_STATUS_PRESENTED and are thus already traced by default
as trace tag "fs_perr" or "fs_ferr" respectively.

NB: The trace record with tag "fssrh_1" for status read buffers on dismiss
all remains. zfcp_fsf_req_complete() handles this and returns early.
All other FSF request types are handled separately and as described above.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
Fixes: 2e261af84cdb ("[SCSI] zfcp: Only collect FSF/HBA debug data for matching trace levels")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: fix payload with full FCP_RSP IU in SCSI trace records

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 12c3e5754c8022a4f2fd1e9f00d19e99ee0d3cc1 upstream.

If the FCP_RSP UI has optional parts (FCP_SNS_INFO or FCP_RSP_INFO) and
thus does not fit into the fsp_rsp field built into a SCSI trace record,
trace the full FCP_RSP UI with all optional parts as payload record
instead of just FCP_SNS_INFO as payload and
a 1 byte RSP_INFO_CODE part of FCP_RSP_INFO built into the SCSI record.

That way we would also get the full FCP_SNS_INFO in case a
target would ever send more than
min(SCSI_SENSE_BUFFERSIZE==96, ZFCP_DBF_PAY_MAX_REC==256)==96.

The mandatory part of FCP_RSP IU is only 24 bytes.
PAYload costs at least one full PAY record of 256 bytes anyway.
We cap to the hardware response size which is only FSF_FCP_RSP_SIZE==128.
So we can just put the whole FCP_RSP IU with any optional parts into
PAYload similarly as we do for SAN PAY since v4.9 commit aceeffbb59bb
("zfcp: trace full payload of all SAN records (req,resp,iels)").
This does not cause any additional trace records wasting memory.

Decoded trace records were confusing because they showed a hard-coded
sense data length of 96 even if the FCP_RSP_IU field FCP_SNS_LEN showed
actually less.

Since the same commit, we set pl_len for SAN traces to the full length of a
request/response even if we cap the corresponding trace.
In contrast, here for SCSI traces we set pl_len to the pre-computed
length of FCP_RSP IU considering SNS_LEN or RSP_LEN if valid.
Nonetheless we trace a hardcoded payload of length FSF_FCP_RSP_SIZE==128
if there were optional parts.
This makes it easier for the zfcpdbf tool to format only the relevant
part of the long FCP_RSP UI buffer. And any trailing information is still
available in the payload trace record just in case.

Rename the payload record tag from "fcp_sns" to "fcp_riu" to make the new
content explicit to zfcpdbf which can then pick a suitable field name such
as "FCP rsp IU all:" instead of "Sense info :"
Also, the same zfcpdbf can still be backwards compatible with "fcp_sns".

Old example trace record before this fix, formatted with the tool zfcpdbf
from s390-tools:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU id         : ..
Caller         : 0x...
Record id      : 1
Tag            : rsl_err
Request id     : 0x<request_id>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x00000002
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x<request_id>
SCSI opcode    : 00000000 00000000 00000000 00000000
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000202 00000000
                                       ^^==FCP_SNS_LEN_VALID
                 00000020 00000000
                 ^^^^^^^^==FCP_SNS_LEN==32
Sense len      : 96 <==min(SCSI_SENSE_BUFFERSIZE,ZFCP_DBF_PAY_MAX_REC)
Sense info     : 70000600 00000018 00000000 29000000
                 00000400 00000000 00000000 00000000
                 00000000 00000000 00000000 00000000<==superfluous
                 00000000 00000000 00000000 00000000<==superfluous
                 00000000 00000000 00000000 00000000<==superfluous
                 00000000 00000000 00000000 00000000<==superfluous

New example trace records with this fix:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : rsl_err
Request ID     : 0x<request_id>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x00000002
SCSI retries   : 0x00
SCSI allowed   : 0x03
SCSI scribble  : 0x<request_id>
SCSI opcode    : a30c0112 00000000 02000000 00000000
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000a02 00000200
                 00000020 00000000
FCP rsp IU len : 56
FCP rsp IU all : 00000000 00000000 00000a02 00000200
                                       ^^=FCP_RESID_UNDER|FCP_SNS_LEN_VALID
                 00000020 00000000 70000500 00000018
                 ^^^^^^^^==FCP_SNS_LEN
                                   ^^^^^^^^^^^^^^^^^
                 00000000 240000cb 00011100 00000000
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 00000000 00000000
                 ^^^^^^^^^^^^^^^^^==FCP_SNS_INFO

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : lr_okay
Request ID     : 0x<request_id>
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x00000000
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x<request_id>
SCSI opcode    : <CDB of unrelated SCSI command passed to eh handler>
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000100 00000000
                 00000000 00000008
FCP rsp IU len : 32
FCP rsp IU all : 00000000 00000000 00000100 00000000
                                       ^^==FCP_RSP_LEN_VALID
                 00000000 00000008 00000000 00000000
                          ^^^^^^^^==FCP_RSP_LEN
                                   ^^^^^^^^^^^^^^^^^==FCP_RSP_INFO

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 250a1352b95e ("[SCSI] zfcp: Redesign of the debug tracing for SCSI records.")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: fix missing trace records for early returns in TMF eh handlers

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 1a5d999ebfc7bfe28deb48931bb57faa8e4102b6 upstream.

For problem determination we need to see that we were in scsi_eh
as well as whether and why we were successful or not.

The following commits introduced new early returns without adding
a trace record:

v2.6.35 commit a1dbfddd02d2
("[SCSI] zfcp: Pass return code from fc_block_scsi_eh to scsi eh")
on fc_block_scsi_eh() returning != 0 which is FAST_IO_FAIL,

v2.6.30 commit 63caf367e1c9
("[SCSI] zfcp: Improve reliability of SCSI eh handlers in zfcp")
on not having gotten an FSF request after the maximum number of retry
attempts and thus could not issue a TMF and has to return FAILED.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: a1dbfddd02d2 ("[SCSI] zfcp: Pass return code from fc_block_scsi_eh to scsi eh")
Fixes: 63caf367e1c9 ("[SCSI] zfcp: Improve reliability of SCSI eh handlers in zfcp")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: fix passing fsf_req to SCSI trace on TMF to correlate with HBA

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 9fe5d2b2fd30aa8c7827ec62cbbe6d30df4fe3e3 upstream.

Without this fix we get SCSI trace records on task management functions
which cannot be correlated to HBA trace records because all fields
related to the FSF request are empty (zero).
Also, the FCP_RSP_IU is missing as well as any sense data if available.

This was caused by v2.6.14 commit 8a36e4532ea1 ("[SCSI] zfcp: enhancement
of zfcp debug features") introducing trace records for TMFs but
hard coding NULL for a possibly existing TMF FSF request.
The scsi_cmnd scribble is also zero or unrelated for the TMF request
so it also could not lookup a suitable FSF request from there.

A broken example trace record formatted with zfcpdbf from the s390-tools
package:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : lr_fail
Request ID     : 0x0000000000000000
                   ^^^^^^^^^^^^^^^^ no correlation to HBA record
SCSI ID        : 0x<scsitarget>
SCSI LUN       : 0x<scsilun>
SCSI result    : 0x000e0000
SCSI retries   : 0x00
SCSI allowed   : 0x05
SCSI scribble  : 0x0000000000000000
SCSI opcode    : 2a000017 3bb80000 08000000 00000000
FCP rsp inf cod: 0x00
                   ^^ no TMF response
FCP rsp IU     : 00000000 00000000 00000000 00000000
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                 00000000 00000000
                 ^^^^^^^^^^^^^^^^^ no interesting FCP_RSP_IU
Sense len      : ...
^^^^^^^^^^^^^^^^^^^^ no sense data length
Sense info     : ...
^^^^^^^^^^^^^^^^^^^^ no sense data content, even if present

There are some true cases where we really do not have an FSF request:
"rsl_fai" from zfcp_dbf_scsi_fail_send() called for early
returns / completions in zfcp_scsi_queuecommand(),
"abrt_or", "abrt_bl", "abrt_ru", "abrt_ar" from
zfcp_scsi_eh_abort_handler() where we did not get as far,
"lr_nres", "tr_nres" from zfcp_task_mgmt_function() where we're
successful and do not need to do anything because adapter stopped.
For these cases it's correct to pass NULL for fsf_req to _zfcp_dbf_scsi().

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 8a36e4532ea1 ("[SCSI] zfcp: enhancement of zfcp debug features")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: fix capping of unsuccessful GPN_FT SAN response trace records

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 975171b4461be296a35e83ebd748946b81cf0635 upstream.

v4.9 commit aceeffbb59bb ("zfcp: trace full payload of all SAN records
(req,resp,iels)") fixed trace data loss of 2.6.38 commit 2c55b750a884
("[SCSI] zfcp: Redesign of the debug tracing for SAN records.")
necessary for problem determination, e.g. to see the
currently active zone set during automatic port scan.

While it already saves space by not dumping any empty residual entries
of the large successful GPN_FT response (4 pages), there are seldom cases
where the GPN_FT response is unsuccessful and likely does not have
FC_NS_FID_LAST set in fp_flags so we did not cap the trace record.
We typically see such case for an initiator WWPN, which is not in any zone.

Cap unsuccessful responses to at least the actual basic CT_IU response
plus whatever fits the SAN trace record built-in "payload" buffer
just in case there's trailing information
of which we would at least see the existence and its beginning.

In order not to erroneously cap successful responses, we need to swap
calling the trace function and setting the CT / ELS status to success (0).

Example trace record pair formatted with zfcpdbf:

Timestamp      : ...
Area           : SAN
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : fssct_1
Request ID     : 0x<request_id>
Destination ID : 0x00fffffc
SAN req short  : 01000000 fc020000 01720ffc 00000000
                 00000008
SAN req length : 20
|
Timestamp      : ...
Area           : SAN
Subarea        : 00
Level          : 1
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 2
Tag            : fsscth2
Request ID     : 0x<request_id>
Destination ID : 0x00fffffc
SAN resp short : 01000000 fc020000 80010000 00090700
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
SAN resp length: 16384
San resp info  : 01000000 fc020000 80010000 00090700
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]
                 00000000 00000000 00000000 00000000 [trailing info]

The fix saves all but one of the previously associated 64 PAYload trace
record chunks of size 256 bytes each.

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: aceeffbb59bb ("zfcp: trace full payload of all SAN records (req,resp,iels)")
Fixes: 2c55b750a884 ("[SCSI] zfcp: Redesign of the debug tracing for SAN records.")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: add handling for FCP_RESID_OVER to the fcp ingress path

BugLink: http://bugs.launchpad.net/bugs/1720154
commit a099b7b1fc1f0418ab8d79ecf98153e1e134656e upstream.

Up until now zfcp would just ignore the FCP_RESID_OVER flag in the FCP
response IU. When this flag is set, it is possible, in regards to the
FCP standard, that the storage-server processes the command normally, up
to the point where data is missing and simply ignores those.

In this case no CHECK CONDITION would be set, and because we ignored the
FCP_RESID_OVER flag we resulted in at least a data loss or even
-corruption as a follow-up error, depending on how the
applications/layers on top behave. To prevent this, we now set the
host-byte of the corresponding scsi_cmnd to DID_ERROR.

Other storage-behaviors, where the same condition results in a CHECK
CONDITION set in the answer, don't need to be changed as they are
handled in the mid-layer already.

Following is an example trace record decoded with zfcpdbf from the
s390-tools package. We forcefully injected a fc_dl which is one byte too
small:

Timestamp      : ...
Area           : SCSI
Subarea        : 00
Level          : 3
Exception      : -
CPU ID         : ..
Caller         : 0x...
Record ID      : 1
Tag            : rsl_err
Request ID     : 0x...
SCSI ID        : 0x...
SCSI LUN       : 0x...
SCSI result    : 0x00070000
                     ^^DID_ERROR
SCSI retries   : 0x..
SCSI allowed   : 0x..
SCSI scribble  : 0x...
SCSI opcode    : 2a000000 00000000 08000000 00000000
FCP rsp inf cod: 0x00
FCP rsp IU     : 00000000 00000000 00000400 00000001
                                       ^^fr_flags==FCP_RESID_OVER
                                         ^^fr_status==SAM_STAT_GOOD
                                            ^^^^^^^^fr_resid
                 00000000 00000000

As of now, we don't actively handle to possibility that a response IU
has both flags - FCP_RESID_OVER and FCP_RESID_UNDER - set at once.

Reported-by: Luke M. Hopkins <lmhopkin@us.ibm.com>
Reviewed-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: 553448f6c483 ("[SCSI] zfcp: Message cleanup")
Fixes: ea127f975424 ("[PATCH] s390 (7/7): zfcp host adapter.") (tglx/history.git)
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

scsi: zfcp: fix queuecommand for scsi_eh commands when DIX enabled

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 71b8e45da51a7b64a23378221c0a5868bd79da4f upstream.

Since commit db007fc5e20c ("[SCSI] Command protection operation"),
scsi_eh_prep_cmnd() saves scmd->prot_op and temporarily resets it to
SCSI_PROT_NORMAL.
Other FCP LLDDs such as qla2xxx and lpfc shield their queuecommand()
to only access any of scsi_prot_sg...() if
(scsi_get_prot_op(cmd) != SCSI_PROT_NORMAL).

Do the same thing for zfcp, which introduced DIX support with
commit ef3eb71d8ba4 ("[SCSI] zfcp: Introduce experimental support for
DIF/DIX").

Otherwise, TUR SCSI commands as part of scsi_eh likely fail in zfcp,
because the regular SCSI command with DIX protection data, that scsi_eh
re-uses in scsi_send_eh_cmnd(), of course still has
(scsi_prot_sg_count() != 0) and so zfcp sends down bogus requests to the
FCP channel hardware.

This causes scsi_eh_test_devices() to have (finish_cmds == 0)
[not SCSI device is online or not scsi_eh_tur() failed]
so regular SCSI commands, that caused / were affected by scsi_eh,
are moved to work_q and scsi_eh_test_devices() itself returns false.
In turn, it unnecessarily escalates in our case in scsi_eh_ready_devs()
beyond host reset to finally scsi_eh_offline_sdevs()
which sets affected SCSI devices offline with the following kernel message:

"kernel: sd H:0:T:L: Device offlined - not ready after error recovery"

Signed-off-by: Steffen Maier <maier@linux.vnet.ibm.com>
Fixes: ef3eb71d8ba4 ("[SCSI] zfcp: Introduce experimental support for DIF/DIX")
Reviewed-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

skd: Submit requests to firmware before triggering the doorbell

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 5fbd545cd3fd311ea1d6e8be4cedddd0ee5684c7 upstream.

Ensure that the members of struct skd_msg_buf have been transferred
to the PCIe adapter before the doorbell is triggered. This patch
avoids that I/O fails sporadically and that the following error
message is reported:

(skd0:STM000196603:[0000:00:09.0]): Completion mismatch comp_id=0x0000 skreq=0x0400 new=0x0000

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

skd: Avoid that module unloading triggers a use-after-free

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 7277cc67b3916eed47558c64f9c9c0de00a35cda upstream.

Since put_disk() triggers a disk_release() call and since that
last function calls blk_put_queue() if disk->queue != NULL, clear
the disk->queue pointer before calling put_disk(). This avoids
that unloading the skd kernel module triggers the following
use-after-free:

WARNING: CPU: 8 PID: 297 at lib/refcount.c:128 refcount_sub_and_test+0x70/0x80
refcount_t: underflow; use-after-free.
CPU: 8 PID: 297 Comm: kworker/8:1 Not tainted 4.11.10-300.fc26.x86_64 #1
Workqueue: events work_for_cpu_fn
Call Trace:
dump_stack+0x63/0x84
__warn+0xcb/0xf0
warn_slowpath_fmt+0x5a/0x80
refcount_sub_and_test+0x70/0x80
refcount_dec_and_test+0x11/0x20
kobject_put+0x1f/0x50
blk_put_queue+0x15/0x20
disk_release+0xae/0xf0
device_release+0x32/0x90
kobject_release+0x67/0x170
kobject_put+0x2b/0x50
put_disk+0x17/0x20
skd_destruct+0x5c/0x890 [skd]
skd_pci_probe+0x124d/0x13a0 [skd]
local_pci_probe+0x42/0xa0
work_for_cpu_fn+0x14/0x20
process_one_work+0x19e/0x470
worker_thread+0x1dc/0x4a0
kthread+0x125/0x140
ret_from_fork+0x25/0x30

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

md/bitmap: disable bitmap_resize for file-backed bitmaps.

BugLink: http://bugs.launchpad.net/bugs/1720154
commit e8a27f836f165c26f867ece7f31eb5c811692319 upstream.

bitmap_resize() does not work for file-backed bitmaps.
The buffer_heads are allocated and initialized when
the bitmap is read from the file, but resize doesn't
read from the file, it loads from the internal bitmap.
When it comes time to write the new bitmap, the bh is
non-existent and we crash.

The common case when growing an array involves making the array larger,
and that normally means making the bitmap larger. Doing
that inside the kernel is possible, but would need more code.
It is probably easier to require people who use file-backed
bitmaps to remove them and re-add after a reshape.

So this patch disables the resizing of arrays which have
file-backed bitmaps. This is better than crashing.

Reported-by: Zhilong Liu <zlliu@suse.com>
Fixes: d60b479d177a ("md/bitmap: add bitmap_resize function to allow bitmap resizing.")
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

md/bitmap: copy correct data for bitmap super

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 8031c3ddc70ab93099e7d1814382dba39f57b43e upstream.

raid5 cache could write bitmap superblock before bitmap superblock is
initialized. The bitmap superblock is less than 512B. The current code will
only copy the superblock to a new page and write the whole 512B, which will
zero the the data after the superblock. Unfortunately the data could include
bitmap, which we should preserve. The patch will make superblock read do 4k
chunk and we always copy the 4k data to new page, so the superblock write will
old data to disk and we don't change the bitmap.

Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

block: directly insert blk-mq request from blk_insert_cloned_request()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 157f377beb710e84bd8bc7a3c4475c0674ebebd7 upstream.

A NULL pointer crash was reported for the case of having the BFQ IO
scheduler attached to the underlying blk-mq paths of a DM multipath
device.  The crash occured in blk_mq_sched_insert_request()'s call to
e->type->ops.mq.insert_requests().

Paolo Valente correctly summarized why the crash occured with:
"the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
blk_rq_prep_clone) creates a cloned request without invoking
e->type->ops.mq.prepare_request for the target elevator e.  The cloned
request is therefore not initialized for the scheduler, but it is
however inserted into the scheduler by blk_mq_sched_insert_request."

All said, a request-based DM multipath device's IO scheduler should be
the only one used -- when the original requests are issued to the
underlying paths as cloned requests they are inserted directly in the
underlying dispatch queue(s) rather than through an additional elevator.

But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
schedulers") switched blk_insert_cloned_request() from using
blk_mq_insert_request() to blk_mq_sched_insert_request().  Which
incorrectly added elevator machinery into a call chain that isn't
supposed to have any.

To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
that blk_insert_cloned_request() calls to insert the request without
involving any elevator that may be attached to the cloned request's
request_queue.

Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: Bart Van Assche <Bart.VanAssche@wdc.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

block: Relax a check in blk_start_queue()

BugLink: http://bugs.launchpad.net/bugs/1720154
commit 4ddd56b003f251091a67c15ae3fe4a5c5c5e390a upstream.

Calling blk_start_queue() from interrupt context with the queue
lock held and without disabling IRQs, as the skd driver does, is
safe. This patch avoids that loading the skd driver triggers the
following warning:

WARNING: CPU: 11 PID: 1348 at block/blk-core.c:283 blk_start_queue+0x84/0xa0
RIP: 0010:blk_start_queue+0x84/0xa0
Call Trace:
skd_unquiesce_dev+0x12a/0x1d0 [skd]
skd_complete_internal+0x1e7/0x5a0 [skd]
skd_complete_other+0xc2/0xd0 [skd]
skd_isr_completion_posted.isra.30+0x2a5/0x470 [skd]
skd_isr+0x14f/0x180 [skd]
irq_forced_thread_fn+0x2a/0x70
irq_thread+0x144/0x1a0
kthread+0x125/0x140
ret_from_fork+0x2a/0x40

Fixes: commit a038e2536472 ("[PATCH] blk_start_queue() must be called with irq disabled - add warning")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Andrew Morton <akpm@osdl.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>