Masami Hiramatsu [Fri, 28 Aug 2009 22:13:19 +0000 (18:13 -0400)]
x86: Allow x86-32 instruction decoder selftest on x86-64
Pass $(CONFIG_64BIT) to the x86 insn decoder selftest in case we are
decoding 32-bit code on x86-64, which happens when building the kernel
with ARCH=i386 on x86-64.
Masami Hiramatsu [Thu, 27 Aug 2009 17:23:25 +0000 (13:23 -0400)]
kprobes/x86-64: Fix to move common_interrupt to .kprobes.text
Since nmi, debug and int3 return to irq_return inside common_interrupt,
probing this function will cause an int3 loop, so it should be marked
as __kprobes.
Masami Hiramatsu [Thu, 27 Aug 2009 17:23:11 +0000 (13:23 -0400)]
kprobes/x86: Fix to add __kprobes to in-kernel fault handling functions
Add __kprobes to the functions which handle in-kernel fixable page
faults. Since kprobes can cause those in-kernel page faults by accessing
kprobe data structures, probing those fault functions will cause a
fault-int3 loop (do_page_fault has already been marked as __kprobes).
Masami Hiramatsu [Thu, 27 Aug 2009 17:23:04 +0000 (13:23 -0400)]
kprobes/x86-64: Allow to reenter probe on post_handler
Allow reentering a probe in the post_handler of another probe on x86-64,
because x86-64 already allows reentering int3.
In that case, the reentered probe just increments kp.nmissed and returns.
Masami Hiramatsu [Thu, 27 Aug 2009 17:22:58 +0000 (13:22 -0400)]
kprobes/x86: Call BUG() when reentering probe into KPROBES_HIT_SS
Call BUG() when a probe has been hit in the middle of the kprobe
processing path, because such probes are currently unrecoverable
(recovering would cause an infinite loop and a stack overflow).
The original code seems to assume that this is caused by an int3
which another subsystem inserted into the out-of-line single-step buffer
if the hit probe is the same as the current probe. However, in that case,
the int3 hit address is in the out-of-line buffer and should be
different from the first (current) int3 address.
Thus, I decided to remove the code.
I also removed arch_disarm_kprobe() because it involves other things
in text_poke().
tracing: Restore the const qualifier for field names and types definition
Restore the const qualifier on the field name and type parameters of
trace_define_field() that was lost while solving a conflict.
Field names and types are defined as builtin constant strings in
static TRACE_EVENTs, but kprobes allocates these dynamically.
That said, we still want to always pass these strings as const char *
to trace_define_field() to avoid any further accidental writes to
the pointed-to strings.
Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
tracing/kprobes: Dump the culprit kprobe in case of kprobe recursion
Kprobes can enter a probing recursion, i.e. a kprobe that loops
endlessly because one of the core functions used during probing is
itself probed.
This patch helps pinpoint the kprobe that raised such a recursion
by dumping it and raising a BUG instead of a warning (we also disarm
the kprobe to try to avoid recursion in BUG itself). Having a BUG
instead of a warning stops the stacktrace in the right place and
doesn't pollute the logs with hundreds of traces that eventually end
up in a stack overflow.
Masami Hiramatsu [Thu, 13 Aug 2009 20:35:34 +0000 (16:35 -0400)]
tracing: Kprobe tracer assigns new event ids for each event
Assign new event ids for each kprobes event. This doesn't clear the
ring buffer when unregistering each kprobe event. Thus, if 'Unknown
event' messages bother you, clear the buffer manually after changing
kprobe events.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203534.31965.49105.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:35:11 +0000 (16:35 -0400)]
tracing: Add kprobe-based event tracer
Add a kprobes-based event tracer on ftrace.
This tracer is similar to the events tracer, which is based on the
Tracepoint infrastructure. Instead of Tracepoint, this tracer is based
on kprobes (kprobe and kretprobe). It can probe anywhere kprobes can
(that is, all function bodies except those of __kprobes functions).
Like the events tracer, this tracer doesn't need to be activated
via current_tracer. Instead, just set probe points via
/sys/kernel/debug/tracing/kprobe_events. You can also set filters on each
probe event via /sys/kernel/debug/tracing/events/kprobes/<EVENT>/filter.
This tracer supports the following probe arguments for each probe.
%REG : Fetch register REG
sN : Fetch Nth entry of stack (N >= 0)
sa : Fetch stack address.
@ADDR : Fetch memory at ADDR (ADDR should be in kernel)
@SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
aN : Fetch function argument. (N >= 0)
rv : Fetch return value.
ra : Fetch return address.
+|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.
See Documentation/trace/kprobetrace.txt in the next patch for details.
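As a rough illustration of the argument syntax listed above, here is a
small user-space sketch that classifies a fetch-argument string. It is
not the kernel parser: the category names, the classify() helper and the
restriction to decimal offsets are assumptions made for this example;
only the argument spellings come from the list above.

```python
import re

# Illustrative classifier for the fetch-argument syntax listed above.
# The category names and the decimal-only offsets are assumptions of this
# sketch; only the argument spellings come from the commit message.
_SIMPLE = [
    ("register", re.compile(r"^%(\w+)$")),
    ("stack_entry", re.compile(r"^s(\d+)$")),
    ("stack_address", re.compile(r"^sa$")),
    ("memory_address", re.compile(r"^@(0x[0-9a-fA-F]+|\d+)$")),
    ("memory_symbol", re.compile(r"^@([A-Za-z_]\w*)([+-]\d+)?$")),
    ("argument", re.compile(r"^a(\d+)$")),
    ("return_value", re.compile(r"^rv$")),
    ("return_address", re.compile(r"^ra$")),
]
_DEREF = re.compile(r"^([+-]\d+)\((.+)\)$")


def classify(arg):
    """Return a tuple describing one fetch argument, or raise ValueError."""
    deref = _DEREF.match(arg)
    if deref:
        # +|-offs(FETCHARG): recurse into the inner fetch argument.
        return ("deref", int(deref.group(1)), classify(deref.group(2)))
    for kind, pattern in _SIMPLE:
        match = pattern.match(arg)
        if match:
            return (kind,) + tuple(g for g in match.groups() if g is not None)
    raise ValueError("unrecognized fetch argument: %r" % arg)
```

For example, classify("+4(%sp)") describes a dereference at offset 4 from
the value of register sp, and nested forms such as "+0(+4(a0))" recurse
the same way.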
Changes from v13:
- Support 'sa' for stack address.
- Use call->data instead of container_of() macro.
[fweisbec@gmail.com: Fixed conflict against latest tracing/core]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203510.31965.29123.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:53 +0000 (16:34 -0400)]
tracing: Ftrace dynamic ftrace_event_call support
Add dynamic ftrace_event_call support to ftrace. Trace engines can add
new ftrace_event_call to ftrace on the fly. Each operator function of
the call takes an ftrace_event_call data structure as an argument,
because these functions may be shared among several ftrace_event_calls.
Changes from v13:
- Define remove_subsystem_dir() always (revert a2ca5e03), because
trace_remove_event_call() uses it.
- Modify syscall tracer because of ftrace_event_call change.
[fweisbec@gmail.com: Fixed conflict against latest tracing/core]
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203453.31965.71901.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:44 +0000 (16:34 -0400)]
x86: Add pt_regs register and stack access APIs
Add the following APIs for accessing registers and stack entries from
pt_regs.
These APIs are required by the kprobes-based event tracer on ftrace.
Some other debugging tools might be able to use them too.
- regs_query_register_offset(const char *name)
Query the offset of "name" register.
- regs_query_register_name(unsigned int offset)
Query the name of register by its offset.
- regs_get_register(struct pt_regs *regs, unsigned int offset)
Get the value of a register by its offset.
- regs_within_kernel_stack(struct pt_regs *regs, unsigned long addr)
Check whether the address is in the kernel stack.
- regs_get_kernel_stack_nth(struct pt_regs *reg, unsigned int nth)
Get Nth entry of the kernel stack. (N >= 0)
- regs_get_argument_nth(struct pt_regs *reg, unsigned int nth)
Get Nth argument at function call. (N >= 0)
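To make the stack accessors above a little more concrete, here is a toy
user-space model of a bounds-checked "Nth stack entry" lookup. It is only
a sketch: THREAD_SIZE, WORD, the dict standing in for memory and both
function names are assumptions of this example, not the kernel
implementation.

```python
# Toy model of a bounds-checked kernel stack read. All values here are
# assumed for illustration; they do not describe any real configuration.
THREAD_SIZE = 8192  # assumed size (and alignment) of one kernel stack
WORD = 8            # assumed size of one stack entry in bytes


def within_kernel_stack(sp, addr):
    """True if addr lies in the same THREAD_SIZE-aligned block as sp."""
    mask = ~(THREAD_SIZE - 1)
    return (addr & mask) == (sp & mask)


def kernel_stack_nth(memory, sp, n):
    """Return the Nth entry above sp, or 0 if it falls outside the stack."""
    addr = sp + n * WORD
    if within_kernel_stack(sp, addr):
        return memory.get(addr, 0)
    return 0
```

The point of the check is that a caller can ask for any N without
reading past the end of the stack; an out-of-range request simply
yields 0 in this model.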
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: linux-arch@vger.kernel.org Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203444.31965.26374.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:36 +0000 (16:34 -0400)]
kprobes: Cleanup fix_riprel() using insn decoder on x86
Clean up fix_riprel() in arch/x86/kernel/kprobes.c by using the new x86
instruction decoder instead of comparisons with raw ad hoc numeric
opcodes.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203436.31965.34374.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:28 +0000 (16:34 -0400)]
kprobes: Checks probe address is instruction boundary on x86
Ensure that inserting kprobes is safe by checking whether the specified
address is at the first byte of an instruction on x86.
This is done by decoding the probed function from its head to the probe
point.
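The check described above can be sketched in user space as a walk over
instruction lengths. This is a simplified model, not the kernel code:
the list of lengths stands in for what an instruction decoder would
report, and is_instruction_boundary() is a hypothetical helper name.

```python
def is_instruction_boundary(insn_lengths, probe_offset):
    """Return True if probe_offset is the first byte of an instruction.

    insn_lengths holds the byte lengths of consecutive instructions
    starting at the head of the function; it stands in for decoder output.
    """
    offset = 0
    for length in insn_lengths:
        if offset == probe_offset:
            return True
        if offset > probe_offset:
            # We stepped over the probe point: it is inside an instruction.
            return False
        offset += length
    # Ran off the end of the function without landing on the probe point.
    return False
```

With lengths [1, 3, 2] the instruction starts are offsets 0, 1 and 4, so
a probe at offset 2 or 3 would be rejected in this model.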
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203428.31965.21939.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:21 +0000 (16:34 -0400)]
x86: X86 instruction decoder build-time selftest
Add a user-space selftest of the x86 instruction decoder at kernel build
time.
When CONFIG_X86_DECODER_SELFTEST=y, Kbuild builds a test harness for the
x86 instruction decoder and runs it after building vmlinux.
The test compares the results of objdump and the x86 instruction decoder
code and checks that there are no differences.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Frank Ch. Eigler <fche@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jason Baron <jbaron@redhat.com> Cc: K.Prasad <prasad@linux.vnet.ibm.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Przemysław Pawełczyk <przemyslaw@pawelczyk.it> Cc: Roland McGrath <roland@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Vegard Nossum <vegard.nossum@gmail.com>
LKML-Reference: <20090813203421.31965.29006.stgit@localhost.localdomain> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Masami Hiramatsu [Thu, 13 Aug 2009 20:34:13 +0000 (16:34 -0400)]
x86: Instruction decoder API
Add an x86 instruction decoder to the arch-specific libraries. This
decoder can decode x86 instructions used in the kernel into prefix,
opcode, modrm, sib, displacement and immediates. It can also report the
length of instructions.
This version introduces instruction attributes for decoding
instructions.
The instruction attribute tables are generated from the opcode map file
(x86-opcode-map.txt) by the generator script (gen-insn-attr-x86.awk).
Currently, the opcode maps are based on the opcode maps in Intel(R) 64 and
IA-32 Architectures Software Developers Manual Vol.2: Appendix.A,
and consist of the two types of opcode tables below.
1-byte/2-byte/3-byte opcode tables, which have 256 elements, are
written as below;
Steven Rostedt [Wed, 26 Aug 2009 04:32:37 +0000 (00:32 -0400)]
tracing: add comments to explain TRACE_EVENT out of protection
The commit:
commit 5ac35daa9343936038a3c9c4f4d6d3fe6a2a7bd8
Author: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
tracing/events: fix the include file dependencies
Moved the TRACE_EVENT out of the ifdef protection of tracepoints.h
but uses the define of TRACE_EVENT itself as protection. This patch
adds comments to explain why.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Xiao Guangrong [Tue, 25 Aug 2009 06:06:22 +0000 (14:06 +0800)]
tracing/events: fix the include file dependencies
TRACE_EVENT depends on include/linux/tracepoint.h first and
include/trace/ftrace.h later. If we include ftrace.h early,
a build error will occur.
If both trace_a.h and trace_b.h define TRACE_EVENT and we include
them in a .c file, like this:
#define CREATE_TRACE_POINTS
#include <trace/events/trace_a.h>
#include <trace/events/trace_b.h>
The above will not work, because TRACE_EVENT was redefined by
the previous .h file.
Zhaolei [Tue, 25 Aug 2009 08:12:56 +0000 (16:12 +0800)]
ftrace: Move setting of clock-source out of options
There are many clock sources for the tracing system but we can only
enable/disable one at a time with the trace/options file.
We can move the setting of clock-source out of options and add a separate
file for it:
# cat trace_clock
[local] global
# echo global > trace_clock
# cat trace_clock
local [global]
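A script reading this file needs to pick out the bracketed entry. The
following is a minimal sketch of that, assuming the format is exactly
what the session above shows (space-separated names with the active one
in square brackets); parse_trace_clock() is a hypothetical helper, not
part of any existing tool.

```python
def parse_trace_clock(text):
    """Return (selected, available) from trace_clock file contents.

    Assumes the format shown above: space-separated clock names with the
    currently selected one wrapped in square brackets.
    """
    selected = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return selected, available
```

Given the two outputs above, this yields "local" and then "global" as
the selected clock, with both names listed as available each time.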
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
LKML-Reference: <4A939D08.6050604@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Li Zefan [Fri, 7 Aug 2009 02:33:43 +0000 (10:33 +0800)]
tracing/filters: Support filtering for char * strings
Usually, char * entries are dangerous in traces because the string
can be released whereas a pointer to it can still wait to be read from
the ring buffer.
But sometimes we can assume it's safe, as in the case of RO data
(e.g. __file__ or __line__, used in the bkl trace event). If this RO data
is in a module and so is the call to the trace event, then it's safe,
because the ring buffer will be flushed once this module gets unloaded.
Josh Triplett [Thu, 6 Aug 2009 14:57:01 +0000 (07:57 -0700)]
tracing: Add vim script to enable folding for function_graph traces
function_graph traces look like nested function calls, complete with
braces denoting the start and end of functions. function-graph-fold.vim
teaches vim how to fold these functions, to make it more convenient to
browse them.
To use, :source function-graph-fold.vim while viewing a function_graph
trace, or use "view -S function-graph-fold.vim some-trace" to load it
from the command-line together with a trace. You can then use the usual
vim fold commands, such as "za", to open and close nested functions.
While closed, a fold will show the total time taken for a call, as would
normally appear on the line with the closing brace. Folded functions
will not include finish_task_switch(), so folding should remain
relatively sane even through a context switch.
Note that this will almost certainly only work well with a single-CPU
trace (e.g. trace-cmd report --cpu 1). It also takes some time to run
(a few seconds for a large trace on my laptop). Nevertheless, I found
it very handy to get an overview of a trace and then drill down on
problematic calls.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
LKML-Reference: <20090806145701.GB7661@feather> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Thu, 6 Aug 2009 18:59:32 +0000 (14:59 -0400)]
tracing/sched: show CPU task wakes up on in trace event
While debugging the scheduler push / pull algorithm, I found
it very annoying that the sched wake up events did not show
the CPU that the task was waking on. In order to analyze the
scheduler, I needed that information.
This patch adds recording of the CPU that a task is waking up
on.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Josh Stone [Mon, 24 Aug 2009 21:43:14 +0000 (14:43 -0700)]
tracing: Create generic syscall TRACE_EVENTs
This converts the syscall_enter/exit tracepoints into TRACE_EVENTs, so
you can have generic ftrace events that capture all system calls with
arguments and return values. These generic events are also renamed to
sys_enter/exit, so they're more closely aligned to the specific
sys_enter_foo events.
Signed-off-by: Josh Stone <jistone@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-5-git-send-email-jistone@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Josh Stone [Mon, 24 Aug 2009 21:43:13 +0000 (14:43 -0700)]
tracing: Move tracepoint callbacks from declaration to definition
It's not strictly correct for the tracepoint reg/unreg callbacks to
occur when a client is hooking up, because the actual tracepoint may not
be present yet. This happens to be fine for syscall, since that's in
the core kernel, but it would cause problems for tracepoints defined in
a module that hasn't been loaded yet. It also means the reg/unreg has
to be EXPORTed for any modules to use the tracepoint (as in SystemTap).
This patch removes DECLARE_TRACE_WITH_CALLBACK, and instead introduces
DEFINE_TRACE_FN which stores the callbacks in struct tracepoint. The
callbacks are used now when the active state of the tracepoint changes
in set_tracepoint & disable_tracepoint.
This also introduces TRACE_EVENT_FN, so ftrace events can also provide
registration callbacks if needed.
Signed-off-by: Josh Stone <jistone@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-4-git-send-email-jistone@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Josh Stone [Mon, 24 Aug 2009 21:43:12 +0000 (14:43 -0700)]
tracing: Make syscall tracepoints conditional
The syscall enter/exit tracepoints are only supported on archs that
HAVE_SYSCALL_TRACEPOINTS, so the declarations should be #ifdef'ed.
Also, the definition of syscall_regfunc and syscall_unregfunc should
depend on this same config, rather than the ftrace-specific one.
Signed-off-by: Josh Stone <jistone@redhat.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <1251150194-1713-3-git-send-email-jistone@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Li Zefan [Wed, 19 Aug 2009 07:52:25 +0000 (15:52 +0800)]
tracing/syscalls: Fix fields format for enter events
The "format" file of a trace event is originally for parsers to
parse ftrace binary output.
But the "format" file of a syscall event can only be used by
perfcounter, because it describes the format of struct
syscall_enter_record not struct syscall_trace_enter.
To fix this, we remove struct syscall_enter_record, and then
struct syscall_trace_enter will be used by both perf profile
and ftrace.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A8BAF39.1030404@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Tue, 18 Aug 2009 08:41:57 +0000 (10:41 +0200)]
[S390] ftrace: update system call tracer support
Commit fb34a08c3 ("tracing: Add trace events for each syscall
entry/exit") changed the lowlevel API to ftrace syscall tracing
but did not update s390 which started making use of it recently.
This broke the s390 build, as reported by Paul Mundt.
Update the callbacks with the syscall number and the syscall
return code values. This allows per-syscall tracepoints,
syscall argument enumeration under /debug/tracing/events/syscalls/,
and perfcounters support and integration on s390 too.
Reported-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <tip-fb34a08c3469b2be9eae626ccb96476b4687b810@git.kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Li Zefan [Mon, 17 Aug 2009 08:52:53 +0000 (16:52 +0800)]
trace_stat: Fix missing entry in stat file
One entry is missing in the output of a stat file.
The cause is that when stat_seq_start() is called the 2nd time, we
should start from the (pos-1)th element in the rbtree rather than the
pos-th, because pos == 0 is the header.
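The off-by-one described above can be modelled in a few lines. This is a
simplified sketch, not the kernel's seq_file code: a Python list stands
in for the rbtree, and HEADER, seq_start() and seq_start_buggy() are
names invented for this example.

```python
HEADER = "<header>"


def seq_start(entries, pos):
    """Item at sequence position pos, where position 0 is the header."""
    if pos == 0:
        return HEADER
    index = pos - 1  # position 1 is the first real entry
    return entries[index] if index < len(entries) else None


def seq_start_buggy(entries, pos):
    """Same lookup without the pos-1 adjustment: one entry gets skipped."""
    if pos == 0:
        return HEADER
    return entries[pos] if pos < len(entries) else None
```

Restarting at position 1 returns the first entry with seq_start(), while
seq_start_buggy() returns the second entry, so the first one never
appears in the output.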
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <4A891A65.70009@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Thu, 13 Aug 2009 21:37:26 +0000 (23:37 +0200)]
tracing: Fix syscall tracing on !HAVE_FTRACE_SYSCALLS architectures
The new syscall_regfunc()/unregfunc() functions rely on
the existence of TIF_SYSCALL_FTRACE - but that TIF flag
is only offered by HAVE_FTRACE_SYSCALLS.
Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
tracing: Support for syscall events raw records in perfcounters
This brings support for raw syscall events to perfcounters.
The arguments or exit value are saved as a raw sample using
the PERF_SAMPLE_RAW attribute in a perf counter.
Example (for now you must explicitly set the PERF_SAMPLE_RAW flag
in perf record):
perf record -e syscalls:sys_enter_open -f -F 1 -a
perf report -D
tracing: Add ftrace event call parameter to its field descriptor handler
Add the struct ftrace_event_call as a parameter of its show_format()
callback. This way we can use it from the syscall trace events to
retrieve the syscall name from the ftrace event call parameter and
describe its fields using the syscalls metadata.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Cc: Jason Baron <jbaron@redhat.com>
Jason Baron [Mon, 10 Aug 2009 20:52:47 +0000 (16:52 -0400)]
tracing: Add trace events for each syscall entry/exit
Layer Frederic's syscall tracer on tracepoints. We create trace events
via hooking into the SYSCALL_DEFINE macros. This allows us to
individually toggle syscall entry and exit points on/off.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Jason Baron [Mon, 10 Aug 2009 20:52:44 +0000 (16:52 -0400)]
tracing: Add ftrace_event_call void * 'data' field
Add an optional void * pointer to 'ftrace_event_call' that is
passed in for regfunc and unregfunc.
This prepares for the creation of syscall tracepoints by passing the
name of the syscall we want to trace and then retrieving its number
through our arch syscall table.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Jason Baron [Mon, 10 Aug 2009 20:52:39 +0000 (16:52 -0400)]
tracing: Raw_init() bailout in trace event register fail case
Allow the return value of raw_init() trace event callback to bail us out
of creating a trace event file, in case we fail to register our
event.
Also, we plan to return -ENOSYS for syscall events that don't match any
syscalls listed in our arch tracing syscall table. We don't want to warn
in that case; we just want this event to be invisible in debugfs and
ignored.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Jason Baron [Mon, 10 Aug 2009 20:52:27 +0000 (16:52 -0400)]
tracing: Add DECLARE_TRACE_WITH_CALLBACK() macro
Introduce a new 'DECLARE_TRACE_WITH_CALLBACK()' macro, so that
tracepoints can associate an external register/unregister function.
This prepares for the syscalls tracer conversion to trace events. We
will need to perform arch level operations once a syscall event is
turned on/off, such as TIF flags setting, hence the need of such
specific callbacks.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Jiaying Zhang <jiayingz@google.com> Cc: Martin Bligh <mbligh@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Zhaolei [Fri, 7 Aug 2009 10:53:21 +0000 (18:53 +0800)]
tracing: Rename set_tracer_flags()'s local variable trace_flags
set_tracer_flags() has a local variable named trace_flags, which has
the same name as a global one in the same scope.
This leads to confusion; the name tracer_flags better reflects its
meaning.
Changelog:
v1->v2: Simplified another patch in this patchset, no change in this
patch.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Linus Torvalds [Mon, 10 Aug 2009 20:21:19 +0000 (13:21 -0700)]
pty: fix data loss when stopped (^S/^Q)
Commit d945cb9cc ("pty: Rework the pty layer to use the normal buffering
logic") dropped the test for 'tty->stopped' in pty_write_room(), which
then causes the n_tty line discipline thing to not throttle the data
properly when the tty is stopped.
So instead of pausing the write due to the tty being stopped, the ldisc
layer would go ahead and push it down to the pty. The pty write()
routine would then refuse to take the data (because it _did_ check
'stopped'), and the data wouldn't actually be written.
This whole stopped test should eventually be moved into the tty ldisc
layer rather than have low-level tty drivers care about these things,
but right now the fix is to just re-instate the missing pty 'stopped'
handling.
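The failure mode described above can be illustrated with a toy model. It
is only a sketch of the interaction, not the tty code: ToyPty,
ldisc_push() and the check_stopped switch are invented here to contrast
the behaviour with and without the 'stopped' test in the write-room
query.

```python
class ToyPty:
    """Toy stand-in for a pty; not the kernel's data structures."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.buffer = ""
        self.stopped = False

    def write_room(self, check_stopped):
        # With the fix, a stopped pty reports no room at all.
        if check_stopped and self.stopped:
            return 0
        return self.capacity - len(self.buffer)

    def write(self, data):
        # write() refuses data while stopped, with or without the fix.
        if self.stopped:
            return 0
        accepted = data[: self.capacity - len(self.buffer)]
        self.buffer += accepted
        return len(accepted)


def ldisc_push(pty, data, check_stopped):
    """Push data the way a line discipline might; return (pending, lost)."""
    room = pty.write_room(check_stopped)
    chunk, pending = data[:room], data[room:]
    accepted = pty.write(chunk)
    # Anything pushed down but refused is gone in this model.
    return pending, chunk[accepted:]
```

With the pty stopped, ldisc_push(pty, "hello", check_stopped=False)
reports "hello" as lost, whereas check_stopped=True keeps it pending so
it can be written once the pty is restarted.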
Reported-and-tested-by: Artur Skawina <art.08.09@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 10 Aug 2009 18:48:51 +0000 (11:48 -0700)]
Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
perf_counter: Zero dead bytes from ftrace raw samples size alignment
perf_counter: Subtract the buffer size field from the event record size
perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data
perf_counter: Correct PERF_SAMPLE_RAW output
perf tools: callchain: Fix bad rounding of minimum rate
perf_counter tools: Fix libbfd detection for systems with libz dependency
perf: "Longum est iter per praecepta, breve et efficax per exempla"
perf_counter: Fix a race on perf_counter_ctx
perf_counter: Fix tracepoint sampling to be part of generic sampling
perf_counter: Work around gcc warning by initializing tracepoint record unconditionally
perf tools: callchain: Fix sum of percentages to be 100% by displaying amount of ignored chains in fractal mode
perf tools: callchain: Fix 'perf report' display to be callchain by default
perf tools: callchain: Fix spurious 'perf report' warnings: ignore empty callchains
perf record: Fix the -A UI for empty or non-existent perf.data
perf util: Fix do_read() to fail on EOF instead of busy-looping
perf list: Fix the output to not include tracepoints without an id
perf_counter/powerpc: Fix oops on cpus without perf_counter hardware support
perf stat: Fix tool option consistency: rename -S/--scale to -c/--scale
perf report: Add debug help for the finding of symbol bugs - show the symtab origin (DSO, build-id, kernel, etc)
perf report: Fix per task mult-counter stat reporting
...
Linus Torvalds [Mon, 10 Aug 2009 18:00:37 +0000 (11:00 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI hotplug: SGI hotplug: do not use hotplug_slot_attr
PCI hotplug: SGI hotplug: fix build failure
Wei Chong Tan reported a fast-PIT-calibration corner-case:
| pit_expect_msb() is vulnerable to SMI disturbance corner case
| in some platforms which causes /proc/cpuinfo to show wrong
| CPU MHz value when quick_pit_calibrate() jumps to success
| section.
I think that the real issue isn't even an SMI - but the fact
that in the very last iteration of the loop, there's no
serializing instruction _after_ the last 'rdtsc'. So even in
the absence of SMIs, we do have a situation where the cycle
counter was read without proper serialization.
The last check should be done outside the outer loop, since
_inside_ the outer loop, we'll be testing that the PIT MSB
has the right value in the next iteration.
So only the _last_ iteration is special, because that's the one
that will not check the PIT MSB value any more, and because the
final 'get_cycles()' isn't serialized.
In other words:
- I'd like to move the PIT MSB check to after the last
iteration, rather than in every iteration
- I think we should comment on the fact that it's also a
serializing instruction and so 'fences in' the TSC read.
Linus Torvalds [Mon, 10 Aug 2009 16:00:47 +0000 (09:00 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
mm_for_maps: take ->cred_guard_mutex to fix the race with exec
mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps: simplify, use ptrace_may_access()
Figo.zhang [Sat, 8 Aug 2009 13:01:22 +0000 (21:01 +0800)]
mempool.c: clean up type-casting
Clean up a redundant double type-cast. "size_t" is a typedef for
"unsigned long" on 64-bit systems and for "unsigned int" on 32-bit
systems, so the intermediate cast to 'long' is pointless.
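The following is a small sketch of why the intermediate cast adds nothing, assuming a platform where long, size_t and pointers have the same width (as on Linux). The helper names are hypothetical and only mimic the pattern of storing an element size in an opaque pool_data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack an element size into an opaque pointer. */
static void *size_to_pool_data(size_t size)
{
	return (void *)(uintptr_t)size;
}

/* Old style: pointer -> long -> size_t. */
static size_t pool_data_to_size_old(void *pool_data)
{
	return (size_t)(long)pool_data;
}

/* Cleaned-up style: a single cast yields the same value. */
static size_t pool_data_to_size(void *pool_data)
{
	/* Assumption of this sketch: size_t is pointer-sized. */
	assert(sizeof(size_t) == sizeof(void *));
	return (size_t)pool_data;
}
```

Both conversions return the value that was packed in, so dropping the cast to 'long' does not change behavior under the stated assumption.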
perf_counter: Zero dead bytes from ftrace raw samples size alignment
After aligning the ftrace raw samples, there are dead bytes storing
random data from the stack. We don't want to leak these to userspace,
so zero them out.
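As an illustration only (the kernel patch may zero the padding differently), a sketch of the general technique: after copying a payload into a buffer whose size was rounded up for alignment, clear the trailing bytes so stale memory is never exposed. fill_raw_record() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy payload into an aligned buffer and zero the padding that follows,
 * so whatever the buffer held before cannot leak to the reader.
 */
static void fill_raw_record(uint8_t *buf, size_t buf_size,
			    const void *payload, size_t payload_size)
{
	assert(payload_size <= buf_size);
	memcpy(buf, payload, payload_size);
	memset(buf + payload_size, 0, buf_size - payload_size);
}
```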
perf_counter: Subtract the buffer size field from the event record size
We compute the perf raw sample size by aligning the raw ftrace
event size plus the buffer size field itself. We do that
instead of aligning only the perf raw sample size, so that we
might save some space in some cases.
But this buffer size field is not stored in the perf raw
sample, so we must subtract its size from the buffer once we have
computed the alignment; otherwise we may get a useless u32 field
in the buffer.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090810141129.GA5124@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>
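The size computation described above can be sketched as follows. This is an assumption-laden model, not the kernel code: ALIGN_UP and both function names are made up for the example, the size field is taken to be a 32-bit value and the alignment to be 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* Without the fix: the u32 size field stays counted in the payload size. */
static size_t raw_sample_size_unfixed(size_t entry_size)
{
	return ALIGN_UP(entry_size + sizeof(uint32_t), sizeof(uint64_t));
}

/* With the fix: align (entry + size field), then drop the size field. */
static size_t raw_sample_size(size_t entry_size)
{
	size_t size = ALIGN_UP(entry_size + sizeof(uint32_t), sizeof(uint64_t));

	return size - sizeof(uint32_t);
}
```

For a 12-byte entry the unfixed variant yields 16, which carries 4 unused bytes, while the fixed variant yields 12; in both cases the payload plus the 4-byte size field ends on an 8-byte boundary.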
mm_for_maps: take ->cred_guard_mutex to fix the race with exec
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.
Strictly speaking, this also fixes another very minor problem. Unless
the security check fails or the task exits, mm_for_maps() should never
return NULL; the caller should get either the old or the new ->mm.
mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps() takes ->mmap_sem after security checks, which looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
Oleg Nesterov [Tue, 23 Jun 2009 19:25:32 +0000 (21:25 +0200)]
mm_for_maps: simplify, use ptrace_may_access()
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.
Convert mm_for_maps() to use ptrace_may_access(); this also simplifies
the code a little bit.
Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all; the caller should
take this lock.
With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Peter Zijlstra [Mon, 10 Aug 2009 09:16:52 +0000 (11:16 +0200)]
perf_counter: Correct PERF_SAMPLE_RAW output
PERF_SAMPLE_* output switches should unconditionally output the
correct format, as they are the only way to unambiguously parse
the PERF_EVENT_SAMPLE data.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1249896447.17467.74.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
powerpc/dma: pci_set_dma_mask() shouldn't fail if mask fits in RAM
On an iMac G5, the b43 driver is failing to initialise because trying to
set the dma mask to 30-bit fails, even though there's only 512MiB of RAM
in the machine anyway:
https://bugzilla.redhat.com/show_bug.cgi?id=514787
We should probably let it succeed if the available RAM in the system
doesn't exceed the requested limit.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
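A hedged sketch of the check this message argues for: accept a DMA mask whenever it can address the last byte of installed RAM. dma_mask_covers_ram() is a hypothetical helper written for this example, and DMA_BIT_MASK is redefined locally to mirror the kernel macro of the same name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copy mirroring the kernel's DMA_BIT_MASK(n). */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * ram_end is the first address past the end of RAM. The mask is good
 * enough if it reaches the highest RAM address, ram_end - 1.
 */
static bool dma_mask_covers_ram(uint64_t mask, uint64_t ram_end)
{
	assert(ram_end > 0);
	return mask >= ram_end - 1;
}
```

Under this model a 30-bit mask (0x3fffffff) is accepted on a machine with 512MiB of RAM and rejected on one with 2GiB.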
Linus Torvalds [Sun, 9 Aug 2009 21:58:21 +0000 (14:58 -0700)]
Merge branch 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Avoid redelivery of edge interrupt before next edge
KVM: MMU: limit rmap chain length
KVM: ia64: fix build failures due to ia64/unsigned long mismatches
KVM: Make KVM_HPAGES_PER_HPAGE unsigned long to avoid build error on powerpc
KVM: fix ack not being delivered when msi present
KVM: s390: fix wait_queue handling
KVM: VMX: Fix locking imbalance on emulation failure
KVM: VMX: Fix locking order in handle_invalid_guest_state
KVM: MMU: handle n_free_mmu_pages > n_alloc_mmu_pages in kvm_mmu_change_mmu_pages
KVM: SVM: force new asid on vcpu migration
KVM: x86: verify MTRR/PAT validity
KVM: PIT: fix kpit_elapsed division by zero
KVM: Fix KVM_GET_MSR_INDEX_LIST
Linus Torvalds [Sun, 9 Aug 2009 21:57:41 +0000 (14:57 -0700)]
Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix_cpu_timers_exit_group(): Do not use thread_group_cputimer()
Linus Torvalds [Sun, 9 Aug 2009 21:57:26 +0000 (14:57 -0700)]
Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Fix/complete ftrace event records sampling
perf_counter, ftrace: Fix perf_counter integration
tracing/filters: Always free pred on filter_add_subsystem_pred() failure
tracing/filters: Don't use pred on alloc failure
ring-buffer: Fix memleak in ring_buffer_free()
tracing: Fix recordmcount.pl to handle sections with only weak functions
ring-buffer: Fix advance of reader in rb_buffer_peek()
tracing: do not use functions starting with .L in recordmcount.pl
ring-buffer: do not disable ring buffer on oops_in_progress
ring-buffer: fix check of try_to_discard result
Linus Torvalds [Sun, 9 Aug 2009 21:57:09 +0000 (14:57 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix buffer overflow in efi_init()
x86: Add quirk to make Apple MacBookPro5,1 use reboot=pci
x86: Fix MSI-X initialization by using online_mask for x2apic target_cpus
x86: Fix VMI && stack protector
Linus Torvalds [Sun, 9 Aug 2009 21:56:51 +0000 (14:56 -0700)]
Merge branch 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix typos in documentation
lockdep: Fix file mode of lock_stat
rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()
It is abnormal to get a 7.14% branch when we passed a 10%
filter.
The problem is that we round down the minimum threshold. This
happens mostly when we have a very low number of events. If the
total amount of your branch is 4 and you have a sub-branch of 3
events, filtering to 90% will be computed as follows:
limit = 4 * 0.9;
The result is about 3.6, but the cast to integer will round
down to 3. This means that our filter is actually 75%.
We must therefore explicitly round up the minimum threshold.
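The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the round-up is written by hand here (rather than with a library call) purely to keep the example free of extra link dependencies; it is not claimed to be how the perf tools implement it.

```c
#include <assert.h>
#include <stdint.h>

/* Problematic variant: the integer cast truncates 3.6 down to 3. */
static uint64_t min_hits_truncated(uint64_t total, double min_fraction)
{
	return (uint64_t)((double)total * min_fraction);
}

/* Rounded-up variant: a chain must really reach the requested fraction. */
static uint64_t min_hits_rounded_up(uint64_t total, double min_fraction)
{
	double exact;
	uint64_t limit;

	assert(min_fraction >= 0.0);
	exact = (double)total * min_fraction;
	limit = (uint64_t)exact;
	if ((double)limit < exact)
		limit++; /* any fractional part bumps the threshold up */
	return limit;
}
```

With a total of 4 and a 90% filter, the truncated limit is 3, so a 3-event sub-branch (75%) slips through; the rounded-up limit is 4, so it is filtered out.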