Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-ic3g837xg8ob3kcpkspxwz0g@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reported-by: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-afsu9eegd43ppihiuafhh9qv@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf unwind: Do not look just at the global callchain_param.record_mode
When setting up DWARF callchains on specific events, without using
'record' or 'trace' --call-graph, but instead doing it like:
perf trace -e cycles/call-graph=dwarf/
The unwind__prepare_access() call in thread__insert_map() when we
process PERF_RECORD_MMAP(2) metadata events were not being performed,
precluding us from using per-event DWARF callchains, handling them just
when we asked for all events to be DWARF, using "--call-graph dwarf".
We do it in the PERF_RECORD_MMAP because we have to look at one of the
executable maps to figure out the executable type (64-bit, 32-bit) of
the DSO laid out in that mmap. Also to look at the architecture where
the perf.data file was recorded.
All this probably should be deferred to when we process a sample for
some thread that has callchains, so that we do this processing only for
the threads with samples, not for all of them.
For now, fix using DWARF on specific events.
Before:
# perf trace --no-syscalls -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1
PING ::1(::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.048 ms
--- ::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.048/0.048/0.048/0.000 ms
0.000 probe_libc:inet_pton:(7fe9597bb350))
Problem processing probe_libc:inet_pton callchain, skipping...
#
After:
# perf trace --no-syscalls -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1
PING ::1(::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.060 ms
Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/r/20180116182650.GE16107@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When setting the "dwarf" unwinder for a specific event and not
specifying the max-stack, the attr.sample_max_stack ended up using an
uninitialized callchain_param.max_stack, fix it by using designated
initializers for that callchain_param variable, zeroing all non
explicitely initialized struct members.
Here is what happened:
# perf trace -vv --no-syscalls --max-stack 4 -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1
callchain: type DWARF
callchain: stack dump size 8192
perf_event_attr:
type 2
size 112
config 0x730
{ sample_period, sample_freq } 1
sample_type IP|TID|TIME|ADDR|CALLCHAIN|CPU|PERIOD|RAW|REGS_USER|STACK_USER|DATA_SRC
exclude_callchain_user 1
{ wakeup_events, wakeup_watermark } 1
sample_regs_user 0xff0fff
sample_stack_user 8192
sample_max_stack 50656
sys_perf_event_open failed, error -75
Value too large for defined data type
# perf trace -vv --no-syscalls --max-stack 4 -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1
callchain: type DWARF
callchain: stack dump size 8192
perf_event_attr:
type 2
size 112
config 0x730
sample_type IP|TID|TIME|ADDR|CALLCHAIN|CPU|PERIOD|RAW|REGS_USER|STACK_USER|DATA_SRC
exclude_callchain_user 1
sample_regs_user 0xff0fff
sample_stack_user 8192
sample_max_stack 30448
sys_perf_event_open failed, error -75
Value too large for defined data type
#
Now the attr.sample_max_stack is set to zero and the above works as
expected:
# perf trace --no-syscalls --max-stack 4 -e probe_libc:inet_pton/call-graph=dwarf/ ping -6 -c 1 ::1
PING ::1(::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.072 ms
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-is9tramondqa9jlxxsgcm9iz@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
- applies to acme's perf/{core,urgent} branches, likely elsewhere
- Report is self-contained within the tool.
Record requires enabling the kernel SPE driver by
setting CONFIG_ARM_SPE_PMU.
- The intel-bts implementation was used as a starting point; its
min/default/max buffer sizes and power of 2 pages granularity need to be
revisited for ARM SPE
- Recording across multiple SPE clusters/domains not supported
- Snapshot support (record -S), and conversion to native perf events
(e.g., via 'perf inject --itrace'), are also not supported
- Technically both cs-etm and spe can be used simultaneously, however
disabled for simplicity in this release
Signed-off-by: Kim Phillips <kim.phillips@arm.com> Reviewed-by: Dongjiu Geng <gengdongjiu@huawei.com> Acked-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rob Herring <robh@kernel.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wang Nan <wangnan0@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180114132850.0b127434b704a26bad13268f@arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools lib traceevent: Fix get_field_str() for dynamic strings
If a field is a dynamic string, get_field_str() returned just the
offset/size value and not the string. Have it parse the offset/size
correctly to return the actual string. Otherwise filtering fails when
trying to filter fields that are dynamic strings.
Reported-by: Gopanapalli Pradeep <prap_hai@yahoo.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20180112004823.146333275@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Taeung Song [Fri, 12 Jan 2018 00:47:50 +0000 (19:47 -0500)]
tools lib traceevent: Fix missing break in FALSE case of pevent_filter_clear_trivial()
Currently the FILTER_TRIVIAL_FALSE case has a missing break statement,
if the trivial type is FALSE, it will also run into the TRUE case, and
always be skipped as the TRUE statement will continue the loop on the
inverse condition of the FALSE statement.
Federico Vaga [Fri, 12 Jan 2018 00:47:48 +0000 (19:47 -0500)]
tools lib traceevent: Use asprintf when possible
It makes the code clearer and less error prone.
clearer:
- less code
- the code is now using the same format to create strings dynamically
less error prone:
- no magic number +2 +9 +5 to compute the size
- no copy&paste of the strings to compute the size and to concatenate
The function `asprintf` is not POSIX standard but the program
was already using it. Later it can be decided to use only POSIX
functions, then we can easly replace all the `asprintf(3)` with a local
implementation of that function.
tools lib traceevent: Show contents (in hex) of data of unrecognized type records
When a record has an unrecognized type, an error message is reported,
but it would also be helpful to see the contents of that record. At
least show what it is in hex, instead of just showing a blank line.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20180112004822.542204577@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools lib traceevent: Handle new pointer processing of bprint strings
The Linux kernel printf() has some extended use cases that dereference
the pointer. This is dangerouse for tracing because the pointer that is
dereferenced can change or even be unmapped. It also causes issues when
the trace data is extracted, because user space does not have access to
the contents of the pointer even if it still exists.
To handle this, the kernel was updated to process these dereferenced
pointers at the time they are recorded, and not post processed. Now they
exist in the tracing buffer, and no dereference is needed at the time of
reading the trace.
The event parsing library needs to handle this new case.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20180112004822.403349289@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools lib traceevent: Simplify pointer print logic and fix %pF
When processing %pX in pretty_print(), simplify the logic slightly by
incrementing the ptr to the format string if isalnum(ptr[1]) is true.
This follows the logic a bit more closely to what is in the kernel.
Also, this fixes a small bug where %pF was not giving the offset of the
function.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20180112004822.260262257@goodmis.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools lib traceevent: Show value of flags that have not been parsed
If the value contains bits that are not defined by print_flags() helper,
then show the remaining bits. This aligns with the functionality of the
kernel.
Michael Sartain [Fri, 12 Jan 2018 00:47:42 +0000 (19:47 -0500)]
tools lib traceevent: Fix bad force_token escape sequence
Older kernels have a bug that creates invalid symbols. event-parse.c
handles them by replacing them with a "%s" token. But the fix included
an extra backslash, and "\%s" was added incorrectly.
perf trace: Fix setting of --call-graph/--max-stack for non-syscall events
The raw_syscalls:sys_{enter,exit} were first supported in 'perf trace',
together with minor and major page faults, then we supported
--call-graph, then --max-stack, but when the other tracepoints got
supported, and bpf, etc, I forgot to make those global call-graph
settings apply to them.
Fix it by realizing that the global --max-stack and --call-graph
settings are done via:
And then, when we go to parse the events in -e via:
OPT_CALLBACK('e', "event", &trace, "event",
"event/syscall selector. use 'perf list' to list available events",
trace__parse_events_option),
And trace__parse_sevents_option() calls:
struct option o = OPT_CALLBACK('e', "event", &trace->evlist, "event",
"event selector. use 'perf list' to list available events",
parse_events_option);
err = parse_events_option(&o, lists[0], 0);
parse_events_option() will override the global --call-graph and
--max-stack if the "call-graph" and/or "max-stack" terms are in the
event definition, such as in the probe_libc:inet_pton event in one of the
examples below (-e probe_libc:inet_pton/max-stack=2).
And then setting up a global setting (dwarf, max-stack=4), that will
affect the raw_syscall:sys_enter for the 'sendto' syscall and that will
be overriden in the probe_libc:inet_pton call to just one entry.
# perf trace --max-stack=4 --call-graph dwarf -e sendto -e probe_libc:inet_pton/max-stack=1/ ping -6 -c 1 ::1
PING ::1(::1) 56 data bytes
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.090 ms
Install iputils-debuginfo to get those /usr/bin/ping addresses resolved,
those routines are not on its .dymsym nor .symtab :-)
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-qgl2gse8elhh9zztw4ajopg3@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf evsel: Check if callchain is enabled before setting it up
The construct:
if (callchain_param)
perf_evsel__config_callchain(evsel, opts, &callchain_param);
happens in several places, so make perf_evsel__config_callchain() work
just like free(NULL), do nothing if param->enabled is not set.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-ykk0qzxnxwx3o611ctjnmxav@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Tue, 9 Jan 2018 13:39:23 +0000 (14:39 +0100)]
perf tools: Fix copyfile_offset update of output offset
We need to increase output offset in each iteration, not decrease it as
we currently do.
I guess we were lucky to finish in most cases in first iteration, so the
bug never showed. However it shows a lot when working with big (~4GB)
size data.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 9c9f5a2f1944 ("perf tools: Introduce copyfile_offset() function") Link: http://lkml.kernel.org/r/20180109133923.25406-1-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf trace: No need to set PERF_SAMPLE_IDENTIFIER explicitely
Since 75562573bab3 ("perf tools: Add support for
PERF_SAMPLE_IDENTIFIER") we don't need explicitely set
PERF_SAMPLE_IDENTIFIER, as perf_evlist__config() will do this for us,
i.e. when there are more than one evsel in an evlist, it will check if
some evsel has a sample_type different than the one on the first evsel
in the list, setting PERF_SAMPLE_IDENTIFIER in that case.
So, to simplify 'perf trace' codebase, ditch that check.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Hendrick Brueckner <brueckner@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Richter <tmricht@linux.vnet.ibm.com> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-12xq6orhwttee2tdtu96ucrp@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Kan Liang [Thu, 4 Jan 2018 20:59:55 +0000 (12:59 -0800)]
perf script python: Add script to profile and resolve physical mem type
There could be different types of memory in the system. E.g normal
System Memory, Persistent Memory. To understand how the workload maps to
those memories, it's important to know the I/O statistics of them. Perf
can collect physical addresses, but those are raw data. It still needs
extra work to resolve the physical addresses. Provide a script to
facilitate the physical addresses resolving and I/O statistics.
Profile with MEM_INST_RETIRED.ALL_LOADS or MEM_UOPS_RETIRED.ALL_LOADS
event if any of them is available.
Look up the /proc/iomem and resolve the physical address. Provide
memory type summary.
Changes since V2:
- Apply the new license rules.
- Add comments for globals
Changes since V1:
- Do not mix DLA and Load Latency. Do not compare the loads and stores.
Only profile the loads.
- Use event name to replace the RAW event
Signed-off-by: Kan Liang <Kan.liang@intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Stephane Eranian <eranian@google.com> Link: https://lkml.kernel.org/r/1515099595-34770-1-git-send-email-kan.liang@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The trailing semicolon is an empty statement that does no operation.
Removing it since it doesn't do anything.
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180111155020.9782-1-luisbg@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Mathieu Poirier [Wed, 10 Jan 2018 20:46:51 +0000 (13:46 -0700)]
perf evsel: Fix incorrect handling of type _TERM_DRV_CFG
Commit ("d0565132605f perf evsel: Enable type checking for
perf_evsel_config_term types") assumes PERF_EVSEL__CONFIG_TERM_DRV_CFG
isn't used and as such adds a BUG_ON().
Since the enumeration type is used in macro ADD_CONFIG_TERM() the change
break CoreSight trace acquisition.
This patch restores the original code.
Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: d0565132605f ("perf evsel: Enable type checking for perf_evsel_config_term types") Link: http://lkml.kernel.org/r/1515617211-32024-1-git-send-email-mathieu.poirier@linaro.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ingo Molnar [Thu, 11 Jan 2018 05:53:06 +0000 (06:53 +0100)]
Merge tag 'perf-core-for-mingo-4.16-20180110' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
- The 'perf test bpf' entry hooked a eBPF proggie to the
SyS_epoll_wait() kernel function and expected it to be hit when calling
the epoll_wait() libc wrapper, which changed recently, in systems such
as Fedora 27, with the glibc wrapper calling instead the epoll_pwait()
syscall, so switch to epoll_pwait() for both the kernel and libc
function, getting it to work both in old and new systems (Arnaldo Carvalho de Melo)
- Beautify 'gettid' syscall result in 'perf trace', and in doing so
noticed that we need to handle namespaces in 'perf trace', will be
dealt with in follow up patches where we'll try to figure out if
the recent support for namespace in tools/perf/ can be used for this
purpose as well. (Arnaldo Carvalho de Melo)
- Introduce 'perf report --mmaps' and 'perf report --tasks' to show
info present in 'perf.data' (Jiri Olsa, Arnaldo Carvalho de Melo)
- Fix a wrong offset issue when using /proc/kcore (Jin Yao)
- Fix bug that prevented annotating symbols in perf.data files
generated with 'perf record --branch-any' (Jin Yao)
- Add infrastructure to record first and last sample time to the
perf.data file header, so that when processing all samples in
a 'perf record' session, such as when doing build-id processing,
or when specifically requesting that that info be recorded, use
that in 'perf report --time', that also got support for percent
slices in addition to absolute ones.
I.e. now it is possible to ask for the samples in the 10%-20%
time slice of a perf.data file (Jin Yao)
- Enable building with libbabeltrace by default (Jiri Olsa)
- Display perf_event_attr::namespaces when duping the attributes
in verbose mode (Jiri Olsa)
- Allocate context task_ctx_data for child event (Jiri Olsa)
- Update comments for PERF_RECORD_ITRACE_START and PERF_RECORD_MISC_* (Jiri Olsa)
- Add support for showing PERF_RECORD_LOST events in 'perf script' (Jiri Olsa)
- Add 'perf report --stats' option to display quick statistics about
metadata events (PERF_RECORD_*) i.e. what we get at the end of 'perf
report -D' (Jiri Olsa)
- Fix compile error with libunwind x86 (Wang Nan)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
None of those changes have an effect on tooling, so do a plain copy.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-qqzcs8ri3vks8cypg0puk0ae@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Similar to --tasks, producing the same output plus /proc/<PID>/maps
similar lines for each mmap record present in a perf.data file.
Please note that not all mmaps are stored, for instance, some of the
non-executable mmaps are only stored when 'perf record --data' is used,
when the user wants to resolve data accesses in addition to asking for
executable mmaps to get the DSO with symtabs.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-zulwdlg5rfowogr1qznorvvc@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:56 +0000 (17:03 +0100)]
perf report: Add --tasks option to display monitored tasks
Add --tasks option to display monitored tasks stored in perf.data.
Displaying pid/tid/ppid plus the command string aligned to distinguish
parent and child tasks.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-13-jolsa@kernel.org
[ Make it --tasks, plural, --task works as well, as its unambiguous ]
[ Use machine__find_thread(), not findnew(), as pointed out by Namhyung ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-kyg4gz2yy0vkrrh2vtq29u71@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:55 +0000 (17:03 +0100)]
perf report: Add --stats option to display quick data statistics
Add --stats option to display quick data statistics of event numbers,
without any further processing, like the one at the end of the perf
report -D command.
I found this useful when hunting lost events for another change.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-12-jolsa@kernel.org
[ Rename it to --stats, plural ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:54 +0000 (17:03 +0100)]
perf tools: Make the tool's warning messages optional
I want to display the pure events status coming in the next patch and
the tool's warnings are superfluous in the output. Making it optional,
enabled by default.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-11-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-10-jolsa@kernel.org
[ Use PRIu64 when printing u64 values, fixing the build in some arches ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:52 +0000 (17:03 +0100)]
perf script: Add support to display sample misc field
Adding support to display sample misc field in form
of letter for each bit:
# perf script -F +misc ...
sched-messaging 1414 K 28690.636582: 4590 cycles ...
sched-messaging 1407 U 28690.636600: 325620 cycles ...
sched-messaging 1414 K 28690.636608: 19473 cycles ...
misc field __________/
The misc bits are assigned to following letters:
PERF_RECORD_MISC_KERNEL K
PERF_RECORD_MISC_USER U
PERF_RECORD_MISC_HYPERVISOR H
PERF_RECORD_MISC_GUEST_KERNEL G
PERF_RECORD_MISC_GUEST_USER g
PERF_RECORD_MISC_MMAP_DATA* M
PERF_RECORD_MISC_COMM_EXEC E
PERF_RECORD_MISC_SWITCH_OUT S
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-9-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:51 +0000 (17:03 +0100)]
perf: Update PERF_RECORD_MISC_* comment for perf_event_header::misc bit 13
The perf_event_header::misc bit 13 is shared on different events and
next patch is adding yet another bit 13 user. Updating the comment to
make it more structured and clear which events use bit 13.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/20180107160356.28203-8-jolsa@kernel.org
[ Update the tools/include/uapi/linux copy ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:47 +0000 (17:03 +0100)]
perf: Allocate context task_ctx_data for child event
Currently we use perf_event_context::task_ctx_data to save and restore
the LBR status when the task is scheduled out and in.
We don't allocate it for child contexts, which results in shorter task's
LBR stack, because we don't save the history from previous run and start
over every time we schedule the task in.
I made a test to generate samples with LBR call stack and got higher
numbers on bigger chain depths:
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 4af57ef28c2c ("perf: Add pmu specific data for perf task context") Link: http://lkml.kernel.org/r/20180107160356.28203-4-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sun, 7 Jan 2018 16:03:45 +0000 (17:03 +0100)]
perf tools: Enable LIBBABELTRACE by default
There's no reason anymore to treat babel trace in a special way, because
a) we no longer display its state b) the needed babeltrace library is
now out and well adopted among distros.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20180107160356.28203-2-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:46 +0000 (21:13 +0800)]
perf script: Support time percent and multiple time ranges
perf script has a --time option to limit the time range of output. It
only supports absolute time.
Now this option is extended to support multiple time ranges and support
the percent of time.
For example:
1. Select the first and second 10% time slices:
perf script --time 10%/1,10%/2
2. Select from 0% to 10% and 30% to 40% slices:
perf script --time 0%-10%,30%-40%
Changelog:
v6: Fix the merge issue with latest perf/core branch.
No functional changes.
v5: Add checking of first/last sample time to detect if it's recorded
in perf.data. If it's not recorded, returns error message to user.
v4: Remove perf_time__skip_sample, only uses perf_time__ranges_skip_sample
v3: Since the definitions of first_sample_time/last_sample_time
are moved from perf_session to perf_evlist so change the
related code.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-7-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:45 +0000 (21:13 +0800)]
perf report: Support time percent and multiple time ranges
perf report has a --time option to limit the time range of output. It
only supports absolute time.
Now this option is extended to support multiple time ranges and support
the percent of time.
For example:
1. Select the first and second 10% time slices:
perf report --time 10%/1,10%/2
2. Select from 0% to 10% and 30% to 40% slices:
perf report --time 0%-10%,30%-40%
Changelog:
v6: Fix the merge issue with latest perf/core branch.
No functional changes.
v5: Add checking of first/last sample time to detect if it's recorded
in perf.data. If it's not recorded, returns error message to user.
v4: Remove perf_time__skip_sample, only uses perf_time__ranges_skip_sample
v3: Since the definitions of first_sample_time/last_sample_time
are moved from perf_session to perf_evlist so change the
related code.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-6-git-send-email-yao.jin@linux.intel.com
[ Add missing colons at end of examples in the man page ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:44 +0000 (21:13 +0800)]
perf tools: Create function to perform multiple time range checking
Previous patch supports the multiple time range.
For example, select the first and second 10% time slices.
perf report --time 10%/1,10%/2
We need a function to check if a timestamp is in the ranges of
[0, 10%) and [10%, 20%].
Note that it includes the last element in [10%, 20%] but it doesn't
include the last element in [0, 10%). It's to avoid the overlap.
This patch implments a new function perf_time__ranges_skip_sample
for this checking.
Change log:
v4: Let perf_time__ranges_skip_sample be compatible with
perf_time__skip_sample when only one time range.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-5-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:43 +0000 (21:13 +0800)]
perf tools: Create function to parse time percent
Current perf report/script/... have a --time option to limit the time
range of output. But right now it only supports absolute time, add
support for time percentage.
For example:
1. Select the second 10% time slice
perf report --time 10%/2
2. Select from 0% to 10% time slice
perf report --time 0%-10%
It also support the multiple time ranges.
3. Select the first and second 10% time slices
perf report --time 10%/1,10%/2
4. Select from 0% to 10% and 30% to 40% slices
perf report --time 0%-10%,30%-40%
Changelog:
v4: An issue is found. Following passes.
perf script --time 10%/10x12321xsdfdasfdsafdsafdsa
Now it uses strtol to replace atoi.
Committer notes:
This just puts in place the infrastructure, so the examples in this cset
comment will only work later, after more patches in this series are
applied.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-4-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:42 +0000 (21:13 +0800)]
perf record: Record the first and last sample time in the header
In the default 'perf record' configuration, all samples are processed,
to create the HEADER_BUILD_ID table. So it's very easy to get the
first/last samples and save the time to perf file header via the
function write_sample_time().
Later, at post processing time, perf report/script will fetch the time
from perf file header.
Committer testing:
# perf record -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.099 MB perf.data (1101 samples) ]
[root@jouet home]# perf report --header | grep "time of "
# time of first sample : 22947.909226
# time of last sample : 22948.910704
#
# perf report -D | grep PERF_RECORD_SAMPLE\(
0 22947909226101 0x20bb68 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 0/0: 0xffffffffa21b1af3 period: 1 addr: 0
0 22947909229928 0x20bb98 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 0/0: 0xffffffffa200d204 period: 1 addr: 0
<SNIP>
3 22948910397351 0x219360 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 28251/28251: 0xffffffffa22071d8 period: 169518 addr: 0
0 22948910652380 0x20f120 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 0/0: 0xffffffffa2856816 period: 198807 addr: 0
2 22948910704034 0x2172d0 [0x30]: PERF_RECORD_SAMPLE(IP, 0x4001): 0/0: 0xffffffffa2856816 period: 88111 addr: 0
#
Changelog:
v7: Just update the patch description according to Arnaldo's suggestion.
v6: Currently '--buildid-all' is not enabled at default. So the walking
on all samples is the default operation. There is no big overhead
to calculate the timestamp boundary in process_sample_event handler
once we already go through all samples. So the timestamp boundary
calculation is enabled by default when '--buildid-all' is not enabled.
While if '--buildid-all' is enabled, we creates a new option
"--timestamp-boundary" for user to decide if it enables the
timestamp boundary calculation.
v5: There is an issue that the sample walking can only work when
'--buildid-all' is not enabled. So we need to let the walking
be able to work even if '--buildid-all' is enabled and let the
processing skips the dso hit marking for this case.
At first, I want to provide a new option "--record-time-boundaries".
While after consideration, I think a new option is not very
necessary.
v3: Remove the definitions of first_sample_time and last_sample_time
from struct record and directly save them in perf_evlist.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-3-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Fri, 8 Dec 2017 13:13:41 +0000 (21:13 +0800)]
perf header: Add infrastructure to record first and last sample time
perf report/script/... have a --time option to limit the time range of
output. That's very useful to slice large traces, e.g. when processing
the output of perf script for some analysis.
But right now --time only supports absolute time. Also there is no fast
way to get the start/end times of a given trace except for looking at
it. This makes it hard to e.g. only decode the first half of the trace,
which is useful for parallelization of scripts
Another problem is that perf records are variable size and there is no
synchronization mechanism. So the only way to find the last sample
reliably would be to walk all samples. But we want to avoid that in perf
report/... because it is already quite expensive. That is why storing
the first sample time and last sample time in perf record is better.
This patch creates a new header feature type HEADER_SAMPLE_TIME and
related ops. Save the first sample time and the last sample time to the
feature section in perf file header. That will be done when, for
instance, processing build-ids, where we already have to process all
samples to create the build-id table, take advantage of that to further
amortize that processing by storing HEADER_SAMPLE_TIME to make 'perf
report/script' faster when using --time.
Committer testing:
After this patch is applied the header is written with zeroes, we need
the next patch, for "perf record" to actually write the timestamps:
# perf report -D | grep PERF_RECORD_SAMPLE\( 22501155244406 0x44f0 [0x28]: PERF_RECORD_SAMPLE(IP, 0x4001): 25016/25016: 0xffffffffa21be8c5 period: 1 addr: 0
<SNIP> 22501155793625 0x4a30 [0x28]: PERF_RECORD_SAMPLE(IP, 0x4001): 25016/25016: 0xffffffffa21ffd50 period: 2828043 addr: 0
# perf report --header | grep "time of "
# time of first sample : 0.000000
# time of last sample : 0.000000
#
Changelog:
v7: 1. Rebase to latest perf/core branch.
2. Add following clarification in patch description according to
Arnaldo's suggestion.
"That will be done when, for instance, processing build-ids,
where we already have to process all samples to create the
build-id table, take advantage of that to further amortize
that processing by storing HEADER_SAMPLE_TIME to make
'perf report/script' faster when using --time."
v4: Use perf script time style for timestamp printing. Also add with
the printing of sample duration.
v3: Remove the definitions of first_sample_time/last_sample_time from
perf_session. Just define them in perf_evlist
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1512738826-2628-2-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jin Yao [Tue, 26 Dec 2017 10:42:43 +0000 (18:42 +0800)]
perf report: Fix a no annotate browser displayed issue
When enabling '-b' option in perf record, for example,
perf record -b ...
perf report
and then browsing the annotate browser from perf report (press 'A'), it
would fail (annotate browser can't be displayed).
It's because the '.add_entry_cb' op of struct report is overwritten by
hist_iter__branch_callback() in builtin-report.c. But this function doesn't do
something like mapping symbols and sources. So next, do_annotate() will return
directly.
notes = symbol__annotation(act->ms.sym);
if (!notes->src)
return 0;
This patch adds the lost code to hist_iter__branch_callback (refer to
hist_iter__report_callback).
v2:
Fix a crash bug when perform 'perf report --stdio'.
The reason is that we init the symbol annotation only in browser mode, it
doesn't allocate/init resources for stdio mode.
So now in hist_iter__branch_callback(), it will return directly if it's not in
browser mode.
Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1514284963-18587-1-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The issue is the value which is passed to parameter 'addr' in
__get_srcline() is the objdump address. It's not correct if we calculate
the offset by using 'addr - sym->start'.
This patch creates a new parameter 'ip' in __get_srcline(). It is not
converted to objdump address.
Make sure you get any valid vmlinux out of the way, using '-v' on the
'perf report' case and deleting it from places where perf searches them,
like your kernel build dir and the build-id cache, in ~/.debug/.
Reported-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Jin Yao <yao.jin@linux.intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1514564812-17344-1-git-send-email-yao.jin@linux.intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Wang Nan [Wed, 6 Dec 2017 01:50:40 +0000 (01:50 +0000)]
perf tools: Fix compile error with libunwind x86
Fix a compile error:
...
CC util/libunwind/x86_32.o
In file included from util/libunwind/x86_32.c:33:0:
util/libunwind/../../arch/x86/util/unwind-libunwind.c: In function 'libunwind__x86_reg_id':
util/libunwind/../../arch/x86/util/unwind-libunwind.c:110:11: error: 'EINVAL' undeclared (first use in this function)
return -EINVAL;
^
util/libunwind/../../arch/x86/util/unwind-libunwind.c:110:11: note: each undeclared identifier is reported only once for each function it appears in
mv: cannot stat 'util/libunwind/.x86_32.o.tmp': No such file or directory
make[4]: *** [util/libunwind/x86_32.o] Error 1
make[3]: *** [util] Error 2
make[2]: *** [libperf-in.o] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2
It happens when libunwind-x86 feature is detected.
The 'perf test bpf' was hooking a eBPF program on the SyS_epoll_wait()
kernel function, that was what the epoll_wait() glibc function ended up
calling, but since at least glibc 2.26, the one that comes with, for
instance, Fedora 27, glibc ends up calling SyS_epoll_pwait() when
epoll_wait() is used, causing this 'perf test' entry to fail.
So switch to using epoll_pwait() and hook the eBPF program to the
SyS_epoll_pwait() kernel function to make it work on a wider range of
glibc and kernel versions.
Tested-by: Wang Nan <wangnan0@huawei.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lkml.kernel.org/n/tip-zynvquy63er8s5mrgsz65pto@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf test bpf: Use designated struct field initializers
To follow standard practice in the kernel sources, documenting the
initialization better and helping quickly finding the value for some
field in a struct with many entries.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-syn3hz9hz7ukxlxbx5x6hv20@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-403npu7daupv6b2bmxliv5pk@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf/x86/msr: Add support for MSR_IA32_THERM_STATUS
This patch adds support for the Digital Readout provided by the
IA32_THERM_STATUS MSR (0x19C) on Intel X86 processors. The readout
shows the number of degrees Celcius to the TCC (critical temperature)
supported by the processor. Thus, the larger, the better.
The perf_event support is provided via the msr PMU. The new
logical event is called cpu_thermal_margin. It comes with a unit and
snapshot files. The event shows the current temprature distance (margin).
It is not an accumulating event. The unit is degrees C. The event
is provided per logical CPU to make things simpler but it is the
same for both hyper-threads sharing a physical core.
$ perf stat -I 1000 -a -A -e msr/cpu_thermal_margin/
This will print the temperature for all logical CPUs.
time CPU counts unit events
1.000123741 CPU0 38 C msr/cpu_thermal_margin/
1.000161837 CPU1 37 C msr/cpu_thermal_margin/
1.000187906 CPU2 36 C msr/cpu_thermal_margin/
1.000189046 CPU3 39 C msr/cpu_thermal_margin/
1.000283044 CPU4 40 C msr/cpu_thermal_margin/
1.000344297 CPU5 40 C msr/cpu_thermal_margin/
1.000365832 CPU6 39 C msr/cpu_thermal_margin/
...
In case the temperature margin cannot be read, the reported value would be -1.
Works on all processors supporting the Digital Readout (dtherm in cpuinfo)
Signed-off-by: Stephane Eranian <eranian@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1515169132-3980-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Fri, 5 Jan 2018 21:02:46 +0000 (13:02 -0800)]
Merge tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"We have two more fixes for 4.15, both aimed for stable.
The leak fix is obvious, the second patch fixes a bug revealed by the
refcount API, when it behaves differently than previous atomic_t and
reports refs going from 0 to 1 in one case"
* tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
btrfs: Fix flush bio leak
Linus Torvalds [Fri, 5 Jan 2018 20:59:32 +0000 (12:59 -0800)]
Merge tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull XFS fixes from Darrick Wong:
"I have just a few fixes for bugs and resource cleanup problems this
week:
- Fix resource cleanup of failed quota initialization
- Fix integer overflow problems wrt s_maxbytes"
* tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix s_maxbytes overflow problems
xfs: quota: check result of register_shrinker()
xfs: quota: fix missed destroy of qi_tree_lock
Linus Torvalds [Fri, 5 Jan 2018 20:23:57 +0000 (12:23 -0800)]
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more x86 pti fixes from Thomas Gleixner:
"Another small stash of fixes for fallout from the PTI work:
- Fix the modules vs. KASAN breakage which was caused by making
MODULES_END depend of the fixmap size. That was done when the cpu
entry area moved into the fixmap, but now that we have a separate
map space for that this is causing more issues than it solves.
- Use the proper cache flush methods for the debugstore buffers as
they are mapped/unmapped during runtime and not statically mapped
at boot time like the rest of the cpu entry area.
- Make the map layout of the cpu_entry_area consistent for 4 and 5
level paging and fix the KASLR vaddr_end wreckage.
- Use PER_CPU_EXPORT for per cpu variable and while at it unbreak
nvidia gfx drivers by dropping the GPL export. The subject line of
the commit tells it the other way around, but I noticed that too
late.
- Fix the ASM alternative macros so they can be used in the middle of
an inline asm block.
- Rename the BUG_CPU_INSECURE flag to BUG_CPU_MELTDOWN so the attack
vector is properly identified. The Spectre mitigations will come
with their own bug bits later"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pti: Rename BUG_CPU_INSECURE to BUG_CPU_MELTDOWN
x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
x86/tlb: Drop the _GPL from the cpu_tlbstate export
x86/events/intel/ds: Use the proper cache flush method for mapping ds buffers
x86/kaslr: Fix the vaddr_end mess
x86/mm: Map cpu_entry_area at the same place on 4/5 level
x86/mm: Set MODULES_END to 0xffffffffff000000
Linus Torvalds [Fri, 5 Jan 2018 20:10:06 +0000 (12:10 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
- racy use of ctx->rcvused in af_alg
- algif_aead crash in chacha20poly1305
- freeing bogus pointer in pcrypt
- build error on MIPS in mpi
- memory leak in inside-secure
- memory overwrite in inside-secure
- NULL pointer dereference in inside-secure
- state corruption in inside-secure
- build error without CRYPTO_GF128MUL in chelsio
- use after free in n2"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: inside-secure - do not use areq->result for partial results
crypto: inside-secure - fix request allocations in invalidation path
crypto: inside-secure - free requests even if their handling failed
crypto: inside-secure - per request invalidation
lib/mpi: Fix umul_ppmm() for MIPS64r6
crypto: pcrypt - fix freeing pcrypt instances
crypto: n2 - cure use after free
crypto: af_alg - Fix race around ctx->rcvused by making it atomic_t
crypto: chacha20poly1305 - validate the digest size
crypto: chelsio - select CRYPTO_GF128MUL
Linus Torvalds [Fri, 5 Jan 2018 19:26:09 +0000 (11:26 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"9 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mailmap: update Mark Yao's email address
userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails
mm/sparse.c: wrong allocation for mem_section
mm/zsmalloc.c: include fs.h
mm/debug.c: provide useful debugging information for VM_BUG
kernel/exit.c: export abort() to modules
mm/mprotect: add a cond_resched() inside change_pmd_range()
kernel/acct.c: fix the acct->needcheck check in check_free_space()
mm: check pfn_valid first in zero_resv_unavail
David Woodhouse [Thu, 4 Jan 2018 14:37:05 +0000 (14:37 +0000)]
x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.
Fixes: 9cebed423c84 ("x86, alternative: Use .pushsection/.popsection") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: gnomes@lxorguk.ukuu.org.uk Cc: Rik van Riel <riel@redhat.com> Cc: ak@linux.intel.com Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kees Cook <keescook@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20180104143710.8961-8-dwmw@amazon.co.uk
Sinan Kaya [Wed, 3 Jan 2018 12:32:45 +0000 (07:32 -0500)]
mfd: rtsx: Release IRQ during shutdown
'Commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during
shutdown")' revealed a resource leak in rtsx_pci driver during shutdown.
Issue shows up as a warning during shutdown as follows:
remove_proc_entry: removing non-empty directory 'irq/17', leaking at least
'rtsx_pci'
WARNING: CPU: 0 PID: 1578 at fs/proc/generic.c:572
remove_proc_entry+0x11d/0x130
Modules linked in <long list but none that are out-of-tree>
...
Call Trace:
unregister_irq_proc
free_desc
irq_free_descs
mp_unmap_irq
acpi_unregister_gsi_apic
acpi_pci_irq_disable
do_pci_disable_device
pci_disable_device
device_shutdown
kernel_restart
Sys_reboot
Even though rtsx_pci driver implements a shutdown callback, it is not
releasing the interrupt that it registered during probe. This is causing
the ACPI layer to complain that the shared IRQ is in use while freeing
IRQ.
This code releases the IRQ to prevent resource leak and eliminate the
warning.
Fixes: cc27b735ad3a ("PCI/portdrv: Turn off PCIe services during shutdown") Link: https://bugzilla.kernel.org/show_bug.cgi?id=198141 Reported-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Lee Jones <lee.jones@linaro.org>
Linus Torvalds [Fri, 5 Jan 2018 02:02:55 +0000 (18:02 -0800)]
Merge tag 'drm-fixes-for-v4.15-rc7' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Just collecting some fixes to finish my hoildays :-).
A few fixes for i915 (one documentation build fix), one ttm fix, one
AMD display fix, one omapdrm fix, and a set of armada fixes from
Russell.
All seem pretty small, you can now return to your latest security news
site"
* tag 'drm-fixes-for-v4.15-rc7' of git://people.freedesktop.org/~airlied/linux:
drm/i915: Apply Display WA #1183 on skl, kbl, and cfl
drm/ttm: check the return value of kzalloc
drm/amd/display: call set csc_default if enable adjustment is false
docs: fix, intel_guc_loader.c has been moved to intel_guc_fw.c
omapdrm/dss/hdmi4_cec: fix interrupt handling
documentation/gpu/i915: fix docs build error after file rename
drm/i915: Put all non-blocking modesets onto an ordered wq
drm/i915: Disable DC states around GMBUS on GLK
drm/i915/psr: Fix register name mess up.
drm/armada: fix YUV planar format framebuffer offsets
drm/armada: improve efficiency of armada_drm_plane_calc_addrs()
drm/armada: fix UV swap code
drm/armada: fix SRAM powerdown
drm/armada: fix leak of crtc structure
userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails
The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
fix fork use after free") corrected the refcounting in case of
UFFD_EVENT_FORK failure for the fork userfault paths.
That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
were set to point to the aborted new uffd ctx earlier in
dup_userfaultfd.
Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baoquan He [Fri, 5 Jan 2018 00:18:06 +0000 (16:18 -0800)]
mm/sparse.c: wrong allocation for mem_section
In commit 83e3c48729d9 ("mm/sparsemem: Allocate mem_section at runtime
for CONFIG_SPARSEMEM_EXTREME=y") mem_section is allocated at runtime to
save memory.
It allocates the first dimension of array with sizeof(struct mem_section).
It costs extra memory, should be sizeof(struct mem_section *).
Fix it.
Link: http://lkml.kernel.org/r/1513932498-20350-1-git-send-email-bhe@redhat.com Fixes: 83e3c48729 ("mm/sparsemem: Allocate mem_section at runtime for CONFIG_SPARSEMEM_EXTREME=y") Signed-off-by: Baoquan He <bhe@redhat.com> Tested-by: Dave Young <dyoung@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Atsushi Kumagai <ats-kumagai@wm.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox [Fri, 5 Jan 2018 00:17:59 +0000 (16:17 -0800)]
mm/debug.c: provide useful debugging information for VM_BUG
With the recent addition of hashed kernel pointers, places which need to
produce useful debug output have to specify %px, not %p. This patch
fixes all the VM debug to use %px. This is appropriate because it's
debug output that the user should never be able to trigger, and kernel
developers need to see the actual pointers.
Link: http://lkml.kernel.org/r/20171219133236.GE13680@bombadil.infradead.org Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Tobin C. Harding" <me@tobin.cc> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mprotect: add a cond_resched() inside change_pmd_range()
While testing on a large CPU system, detected the following RCU stall
many times over the span of the workload. This problem is solved by
adding a cond_resched() in the change_pmd_range() function.
Oleg Nesterov [Fri, 5 Jan 2018 00:17:49 +0000 (16:17 -0800)]
kernel/acct.c: fix the acct->needcheck check in check_free_space()
As Tsukada explains, the time_is_before_jiffies(acct->needcheck) check
is very wrong, we need time_is_after_jiffies() to make sys_acct() work.
Ignoring the overflows, the code should "goto out" if needcheck >
jiffies, while currently it checks "needcheck < jiffies" and thus in the
likely case check_free_space() does nothing until jiffies overflow.
In particular this means that sys_acct() is simply broken, acct_on()
sets acct->needcheck = jiffies and expects that check_free_space()
should set acct->active = 1 after the free-space check, but this won't
happen if jiffies increments in between.
This was broken by commit 32dc73086015 ("get rid of timer in
kern/acct.c") in 2011, then another (correct) commit 795a2f22a8ea
("acct() should honour the limits from the very beginning") made the
problem more visible.
Link: http://lkml.kernel.org/r/20171213133940.GA6554@redhat.com Fixes: 32dc73086015 ("get rid of timer in kern/acct.c") Reported-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Suggested-by: TSUKADA Koutaro <tsukada@ascade.co.jp> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In zero_resv_unavail it would be better to check pfn_valid first before
zero the page struct. This fixes the problem and potential other
similar problems. Also as Pavel Tatashin suggested checks pfn_valid at
the beginning of the section.
The range is backed by real memory. The memory range is efi "Boot
Service Data", that means after ExitBootServices() these ranges can be
used as system ram. But some of them need to be reserved, for example
the bgrt image address in an acpi table, if the image memory is freed
then kexec reboot will fail because kexec inherit same acpi table to
initialize the driver.
Link: http://lkml.kernel.org/r/20171201095048.GA3084@dhcp-128-65.nay.redhat.com Fixes: a4a3ede2132a ("mm: zero reserved and unavailable struct pages") Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Thu, 4 Jan 2018 21:19:04 +0000 (22:19 +0100)]
x86/tlb: Drop the _GPL from the cpu_tlbstate export
The recent changes for PTI touch cpu_tlbstate from various tlb_flush
inlines. cpu_tlbstate is exported as GPL symbol, so this causes a
regression when building out of tree drivers for certain graphics cards.
Aside of that the export was wrong since it was introduced as it should
have been EXPORT_PER_CPU_SYMBOL_GPL().
Use the correct PER_CPU export and drop the _GPL to restore the previous
state which allows users to utilize the cards they payed for.
As always I'm really thrilled to make this kind of change to support the
#friends (or however the hot hashtag of today is spelled) from that closet
sauce graphics corp.
Fixes: 1e02ce4cccdc ("x86: Store a per-cpu shadow copy of CR4") Fixes: 6fd166aae78c ("x86/mm: Use/Fix PCID to optimize user/kernel switches") Reported-by: Kees Cook <keescook@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org
Thomas Gleixner [Thu, 4 Jan 2018 11:32:03 +0000 (12:32 +0100)]
x86/kaslr: Fix the vaddr_end mess
vaddr_end for KASLR is only documented in the KASLR code itself and is
adjusted depending on config options. So it's not surprising that a change
of the memory layout causes KASLR to have the wrong vaddr_end. This can map
arbitrary stuff into other areas causing hard to understand problems.
Remove the whole ifdef magic and define the start of the cpu_entry_area to
be the end of the KASLR vaddr range.
Add documentation to that effect.
Fixes: 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap") Reported-by: Benjamin Gilbert <benjamin.gilbert@coreos.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Benjamin Gilbert <benjamin.gilbert@coreos.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable <stable@vger.kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com>, Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801041320360.1771@nanos
Dave Airlie [Thu, 4 Jan 2018 23:25:01 +0000 (09:25 +1000)]
Merge tag 'drm-intel-fixes-2018-01-04' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
drm/i915 fixes for v4.15-rc7
- couple of documentation build fixes
- serialize non-blocking modesets
- prevent DMC from messing up GMBUS transfers
- PSR regression fix
* tag 'drm-intel-fixes-2018-01-04' of git://anongit.freedesktop.org/drm/drm-intel:
drm/i915: Apply Display WA #1183 on skl, kbl, and cfl
docs: fix, intel_guc_loader.c has been moved to intel_guc_fw.c
documentation/gpu/i915: fix docs build error after file rename
drm/i915: Put all non-blocking modesets onto an ordered wq
drm/i915: Disable DC states around GMBUS on GLK
drm/i915/psr: Fix register name mess up.
Dave Airlie [Thu, 4 Jan 2018 23:24:26 +0000 (09:24 +1000)]
Merge branch 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
- backport of a DC change which fixes a greenish tint on some RV hw
- properly handle kzalloc fail in ttm
* 'drm-fixes-4.15' of git://people.freedesktop.org/~agd5f/linux:
drm/ttm: check the return value of kzalloc
drm/amd/display: call set csc_default if enable adjustment is false
Thomas Gleixner [Thu, 4 Jan 2018 12:01:40 +0000 (13:01 +0100)]
x86/mm: Map cpu_entry_area at the same place on 4/5 level
There is no reason for 4 and 5 level pagetables to have a different
layout. It just makes determining vaddr_end for KASLR harder than
necessary.
Fixes: 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Gilbert <benjamin.gilbert@coreos.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable <stable@vger.kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Garnier <thgarnie@google.com>, Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801041320360.1771@nanos
Andrey Ryabinin [Thu, 28 Dec 2017 16:06:20 +0000 (19:06 +0300)]
x86/mm: Set MODULES_END to 0xffffffffff000000
Since f06bdd4001c2 ("x86/mm: Adapt MODULES_END based on fixmap section size")
kasan_mem_to_shadow(MODULES_END) could be not aligned to a page boundary.
So passing page unaligned address to kasan_populate_zero_shadow() have two
possible effects:
1) It may leave one page hole in supposed to be populated area. After commit 21506525fb8d ("x86/kasan/64: Teach KASAN about the cpu_entry_area") that
hole happens to be in the shadow covering fixmap area and leads to crash:
BUG: unable to handle kernel paging request at fffffbffffe8ee04
RIP: 0010:check_memory_region+0x5c/0x190
Note, the crash likely disappeared after commit 92a0f81d8957, which
changed kasan_populate_zero_shadow() call the way it was before
commit 21506525fb8d.
2) Attempt to load module near MODULES_END will fail, because
__vmalloc_node_range() called from kasan_module_alloc() will hit the
WARN_ON(!pte_none(*pte)) in the vmap_pte_range() and bail out with error.
To fix this we need to make kasan_mem_to_shadow(MODULES_END) page aligned
which means that MODULES_END should be 8*PAGE_SIZE aligned.
The whole point of commit f06bdd4001c2 was to move MODULES_END down if
NR_CPUS is big, so the cpu_entry_area takes a lot of space.
But since 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap")
the cpu_entry_area is no longer in fixmap, so we could just set
MODULES_END to a fixed 8*PAGE_SIZE aligned address.
Fixes: f06bdd4001c2 ("x86/mm: Adapt MODULES_END based on fixmap section size") Reported-by: Jakub Kicinski <kubakici@wp.pl> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Link: https://lkml.kernel.org/r/20171228160620.23818-1-aryabinin@virtuozzo.com
Linus Torvalds [Thu, 4 Jan 2018 19:14:36 +0000 (11:14 -0800)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Fixes this time include mostly device tree changes, as usual, the
notable ones include:
- A number of patches to fix most of the remaining DTC warnings that
got introduced when DTC started warning about some obvious
mistakes. We still have some remaining warnings that probably may
have to wait until 4.16 to get fixed while we try to figure out
what the correct contents should be.
- On Allwinner A64, Ethernet PHYs need a fix after a mistake in
coordination between patches merged through multiple branches.
- Various fixes for PMICs on allwinner based boards
- Two fixes for ethernet link detection on some Renesas machines
- Two stability fixes for rockchip based boards
Aside from device-tree, two other areas got fixes for older problems:
- For TI Davinci DM365, a couple of fixes were needed to repair the
MMC DMA engine support, apparently this has been broken for a
while.
- One important fix for all Allwinner chips with the PMIC driver as a
loadable module"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (23 commits)
arm64: dts: uniphier: fix gpio-ranges property of PXs3 SoC
arm64: dts: renesas: ulcb: Remove renesas, no-ether-link property
arm64: dts: renesas: salvator-x: Remove renesas, no-ether-link property
ARM: dts: tango4: remove bogus interrupt-controller property
ARM: dts: ls1021a: fix incorrect clock references
ARM: dts: aspeed-g4: Correct VUART IRQ number
ARM: dts: exynos: Enable Mixer node for Exynos5800 Peach Pi machine
ARM: dts: sun8i: a711: Reinstate the PMIC compatible
ARM: davinci: fix mmc entries in dm365's dma_slave_map
ARM: dts: da850-lego-ev3: Fix battery voltage gpio
ARM: davinci: Add dma_mask to dm365's eDMA device
ARM: davinci: Use platform_device_register_full() to create pdev for dm365's eDMA
arm64: dts: rockchip: limit rk3328-rock64 gmac speed to 100MBit for now
arm64: dts: rockchip: remove vdd_log from rk3399-puma
arm64: dts: orange-pi-zero-plus2: fix sdcard detect
arm64: allwinner: a64-sopine: Fix to use dcdc1 regulator instead of vcc3v3
ARM: dts: sunxi: Convert to CCU index macros for HDMI controller
sunxi-rsb: Include OF based modalias in device uevent
ARM: dts: at91: disable the nxp,se97b SMBUS timeout on the TSE-850
arm64: dts: rockchip: fix trailing 0 in rk3328 tsadc interrupts
...
Arnd Bergmann [Thu, 4 Jan 2018 16:06:25 +0000 (17:06 +0100)]
Merge tag 'sunxi-fixes-for-4.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/sunxi/linux into fixes
Pull "Allwinner fixes for 4.15" from Chen-Yu Tsai:
First, one fix that adds proper regulator references for the EMAC
external PHYs on A64 boards. The EMAC bindings were developed for 4.13,
but reverted at the last minute. They were finalized and brought back
for 4.15. However in the time between, regulator support for the A64
boards was merged. When EMAC device tree changes were reintroduced,
this was not taken into account.
Second, a patch that adds OF based modalias uevent for RSB slave devices.
This has been missing since the introduction of RSB, and recently with
PMIC regulator support introduced for the A64, has been seen affecting
distributions, which have the all-important PMIC mfd drivers built as
modules, which then don't get loaded.
Other minor cleanups include final conversion of raw indices to CCU
binding macros for sun[4567]i HDMI, cleanup of dummy regulators on the
A64 SOPINE, a SD card detection polarity fix for the Orange Pi Zero
Plus2, and adding a missing compatible for the PMIC on the TBS A711
tablet.
* tag 'sunxi-fixes-for-4.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/sunxi/linux:
ARM: dts: sun8i: a711: Reinstate the PMIC compatible
arm64: dts: orange-pi-zero-plus2: fix sdcard detect
arm64: allwinner: a64-sopine: Fix to use dcdc1 regulator instead of vcc3v3
ARM: dts: sunxi: Convert to CCU index macros for HDMI controller
sunxi-rsb: Include OF based modalias in device uevent
arm64: allwinner: a64: add Ethernet PHY regulator for several boards
Arnd Bergmann [Thu, 4 Jan 2018 16:05:06 +0000 (17:05 +0100)]
Merge tag 'renesas-fixes-for-v4.15' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/horms/renesas into fixes
Pull "Renesas ARM Based SoC Fixes for v4.15" from Simon Horman:
Vladimir Zapolskiy says:
The present change is a bug fix for AVB link iteratively up/down.
Steps to reproduce:
- start AVB TX stream (Using aplay via MSE),
- disconnect+reconnect the eth cable,
- after a reconnection the eth connection goes iteratively up/down
without user interaction,
- this may heal after some seconds or even stay for minutes.
As the documentation specifies, the "renesas,no-ether-link" option
should be used when a board does not provide a proper AVB_LINK signal.
There is no need for this option enabled on RCAR H3/M3 Salvator-X/XS
and ULCB starter kits since the AVB_LINK is correctly handled by HW.
Choosing to keep or remove the "renesas,no-ether-link" option will
have impact on the code flow in the following ways:
- keeping this option enabled may lead to unexpected behavior since
the RX & TX are enabled/disabled directly from adjust_link function
without any HW interrogation,
- removing this option, the RX & TX will only be enabled/disabled after
HW interrogation. The HW check is made through the LMON pin in PSR
register which specifies AVB_LINK signal value (0 - at low level;
1 - at high level).
In conclusion, the change is also a safety improvement because it
removes the "renesas,no-ether-link" option leading to a proper way
of detecting the link state based on HW interrogation and not on
software heuristic.
Note that DTS files for V3M Starter Kit, Draak and Eagle boards
contain the same property, the files are untouched due to unavailable
schematics to verify if the fix applies to these boards as well.
Lucas De Marchi [Tue, 2 Jan 2018 20:18:37 +0000 (12:18 -0800)]
drm/i915: Apply Display WA #1183 on skl, kbl, and cfl
Display WA #1183 was recently added to workaround
"Failures when enabling DPLL0 with eDP link rate 2.16
or 4.32 GHz and CD clock frequency 308.57 or 617.14 MHz
(CDCLK_CTL CD Frequency Select 10b or 11b) used in this
enabling or in previous enabling."
This workaround was designed to minimize the impact only
to save the bad case with that link rates. But HW engineers
indicated that it should be safe to apply broadly, although
they were expecting the DPLL0 link rate to be unchanged on
runtime.
We need to cover 2 cases: when we are in fact enabling DPLL0
and when we are just changing the frequency with small
differences.
This is based on previous patch by Rodrigo Vivi with suggestions
from Ville Syrjälä.
Linus Torvalds [Thu, 4 Jan 2018 00:41:07 +0000 (16:41 -0800)]
Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 page table isolation fixes from Thomas Gleixner:
"A couple of urgent fixes for PTI:
- Fix a PTE mismatch between user and kernel visible mapping of the
cpu entry area (differs vs. the GLB bit) and causes a TLB mismatch
MCE on older AMD K8 machines
- Fix the misplaced CR3 switch in the SYSCALL compat entry code which
causes access to unmapped kernel memory resulting in double faults.
- Fix the section mismatch of the cpu_tss_rw percpu storage caused by
using a different mechanism for declaration and definition.
- Two fixes for dumpstack which help to decode entry stack issues
better
- Enable PTI by default in Kconfig. We should have done that earlier,
but it slipped through the cracks.
- Exclude AMD from the PTI enforcement. Not necessarily a fix, but if
AMD is so confident that they are not affected, then we should not
burden users with the overhead"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/process: Define cpu_tss_rw in same section as declaration
x86/pti: Switch to kernel CR3 at early in entry_SYSCALL_compat()
x86/dumpstack: Print registers for first stack frame
x86/dumpstack: Fix partial register dumps
x86/pti: Make sure the user/kernel PTEs match
x86/cpu, x86/pti: Do not enable PTI on AMD processors
x86/pti: Enable PTI by default
Thomas Gleixner [Wed, 3 Jan 2018 18:52:04 +0000 (19:52 +0100)]
x86/pti: Switch to kernel CR3 at early in entry_SYSCALL_compat()
The preparation for PTI which added CR3 switching to the entry code
misplaced the CR3 switch in entry_SYSCALL_compat().
With PTI enabled the entry code tries to access a per cpu variable after
switching to kernel GS. This fails because that variable is not mapped to
user space. This results in a double fault and in the worst case a kernel
crash.
Move the switch ahead of the access and clobber RSP which has been saved
already.
Fixes: 8a09317b895f ("x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching") Reported-by: Lars Wendler <wendler.lars@web.de> Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Betkov <bp@alien8.de> Cc: Andy Lutomirski <luto@kernel.org>, Cc: Dave Hansen <dave.hansen@linux.intel.com>, Cc: Peter Zijlstra <peterz@infradead.org>, Cc: Greg KH <gregkh@linuxfoundation.org>, , Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>, Cc: Juergen Gross <jgross@suse.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801031949200.1957@nanos
Linus Torvalds [Wed, 3 Jan 2018 19:03:07 +0000 (11:03 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull pid allocation bug fix from Eric Biederman:
"The replacement of the pid hash table and the pid bitmap with an idr
resulted in an implementation that now fails more often in low memory
situations. Allowing fuzzers to observe bad behavior from a memory
allocation failure during pid allocation.
This is a small change to fix this by making the kernel more robust in
the case of error. The non-error paths are left alone so the only
danger is to the already broken error path. I have manually injected
errors and verified that this new error handling works"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
pid: Handle failure to allocate the first pid in a pid namespace
Linus Torvalds [Wed, 3 Jan 2018 18:58:56 +0000 (10:58 -0800)]
Merge branch 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs/fscache fixes from David Howells:
- Fix the default return of fscache_maybe_release_page() when a cache
isn't in use - it prevents a filesystem from releasing pages. This
can cause a system to OOM.
- Fix a potential uninitialised variable in AFS.
- Fix AFS unlink's handling of the nlink count. It needs to use the
nlink manipulation functions so that inode structs of deleted inodes
actually get scheduled for destruction.
- Fix error handling in afs_write_end() so that the page gets unlocked
and put if we can't fill the unwritten portion.
* 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix missing error handling in afs_write_end()
afs: Fix unlink
afs: Potential uninitialized variable in afs_extract_data()
fscache: Fix the default for fscache_maybe_release_page()
Kees Cook [Tue, 2 Jan 2018 23:21:33 +0000 (15:21 -0800)]
exec: Weaken dumpability for secureexec
This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
for setting dumpability")
This weakens dumpability back to checking only for uid/gid changes in
current (which is useless), but userspace depends on dumpability not
being tied to secureexec.
Josh Poimboeuf [Sun, 31 Dec 2017 16:18:06 +0000 (10:18 -0600)]
x86/dumpstack: Fix partial register dumps
The show_regs_safe() logic is wrong. When there's an iret stack frame,
it prints the entire pt_regs -- most of which is random stack data --
instead of just the five registers at the end.
show_regs_safe() is also poorly named: the on_stack() checks aren't for
safety. Rename the function to show_regs_if_on_stack() and add a
comment to explain why the checks are needed.
These issues were introduced with the "partial register dump" feature of
the following commit:
b02fcf9ba121 ("x86/unwinder: Handle stack overflows more gracefully")
That patch had gone through a few iterations of development, and the
above issues were artifacts from a previous iteration of the patch where
'regs' pointed directly to the iret frame rather than to the (partially
empty) pt_regs.
Tested-by: Alexander Tsoy <alexander@tsoy.me> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toralf Förster <toralf.foerster@gmx.de> Cc: stable@vger.kernel.org Fixes: b02fcf9ba121 ("x86/unwinder: Handle stack overflows more gracefully") Link: http://lkml.kernel.org/r/5b05b8b344f59db2d3d50dbdeba92d60f2304c54.1514736742.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tom Lendacky [Wed, 27 Dec 2017 05:43:54 +0000 (23:43 -0600)]
x86/cpu, x86/pti: Do not enable PTI on AMD processors
AMD processors are not subject to the types of attacks that the kernel
page table isolation feature protects against. The AMD microarchitecture
does not allow memory references, including speculative references, that
access higher privileged data when running in a lesser privileged mode
when that access would result in a page fault.
Disable page table isolation by default on AMD processors by not setting
the X86_BUG_CPU_INSECURE feature, which controls whether X86_FEATURE_PTI
is set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20171227054354.20369.94587.stgit@tlendack-t1.amdoffice.net