1 | -*-Mode: outline-*- |
2 | ||
3 | Light-weight System Calls for IA-64 | |
4 | ----------------------------------- | |
5 | ||
6 | Started: 13-Jan-2003 | |
7 | Last update: 27-Sep-2003 | |
8 | ||
9 | David Mosberger-Tang | |
10 | <davidm@hpl.hp.com> | |
11 | ||
12 | Using the "epc" instruction effectively introduces a new mode of | |
13 | execution to the ia64 linux kernel. We call this mode the | |
14 | "fsys-mode". To recap, the normal states of execution are: | |
15 | ||
16 | - kernel mode: | |
17 | Both the register stack and the memory stack have been | |
18 | switched over to kernel memory. The user-level state is saved | |
19 | in a pt_regs structure at the top of the kernel memory stack. | |
20 | ||
21 | - user mode: | |
22 | Both the register stack and the memory stack are in | |
23 | user memory. The user-level state is contained in the | |
24 | CPU registers. | |
25 | ||
26 | - bank 0 interruption-handling mode: | |
27 | This is the non-interruptible state which all | |
28 | interruption-handlers start execution in. The user-level | |
29 | state remains in the CPU registers and some kernel state may | |
30 | be stored in bank 0 of registers r16-r31. | |
31 | ||
32 | In contrast, fsys-mode has the following special properties: | |
33 | ||
34 | - execution is at privilege level 0 (most-privileged) | |
35 | ||
36 | - CPU registers may contain a mixture of user-level and kernel-level | |
37 | state (it is the responsibility of the kernel to ensure that no | |
38 | security-sensitive kernel-level state is leaked back to | |
39 | user-level) | |
40 | ||
41 | - execution is interruptible and preemptible (an fsys-mode handler | |
42 | can disable interrupts and avoid all other interruption-sources | |
43 | to avoid preemption) | |
44 | ||
45 | - neither the memory-stack nor the register-stack can be trusted while | |
46 | in fsys-mode (they point to the user-level stacks, which may | |
47 | be invalid, or completely bogus addresses) | |
48 | ||
49 | In summary, fsys-mode is much more similar to running in user-mode | |
50 | than it is to running in kernel-mode. Of course, given that the | |
51 | privilege level is at level 0, this means that fsys-mode requires some | |
52 | care (see below). | |
53 | ||
54 | ||
55 | * How to tell fsys-mode | |
56 | ||
57 | Linux operates in fsys-mode when (a) the privilege level is 0 (most | |
58 | privileged) and (b) the stacks have NOT been switched to kernel memory | |
59 | yet. For convenience, the header file <asm-ia64/ptrace.h> provides | |
60 | three macros: | |
61 | ||
62 | user_mode(regs) | |
63 | user_stack(task,regs) | |
64 | fsys_mode(task,regs) | |
65 | ||
66 | The "regs" argument is a pointer to a pt_regs structure. The "task" | |
67 | argument is a pointer to the task structure to which the "regs" | |
68 | pointer belongs. user_mode() returns TRUE if the CPU state pointed | |
69 | to by "regs" was executing in user mode (privilege level 3). | |
70 | user_stack() returns TRUE if the state pointed to by "regs" was | |
71 | executing on the user-level stack(s). Finally, fsys_mode() returns | |
72 | TRUE if the CPU state pointed to by "regs" was executing in fsys-mode. | |
73 | The fsys_mode() macro is equivalent to the expression: | |
74 | ||
75 | !user_mode(regs) && user_stack(task,regs) | |
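
As a rough illustration of how the three predicates relate, the following C sketch models them on a simplified register snapshot. The struct fake_regs type, its cpl and on_user_stack fields, and the model_*() names are hypothetical stand-ins invented for this example. The real macros in <asm-ia64/ptrace.h> operate on a pt_regs structure and a task pointer and derive these facts differently.

```c
#include <stdbool.h>

/* Hypothetical, simplified snapshot of interrupted CPU state.
 * The real kernel uses struct pt_regs plus the task pointer. */
struct fake_regs {
    unsigned int cpl;    /* privilege level: 0 = most privileged, 3 = user */
    bool on_user_stack;  /* true if the stacks were not yet switched */
};

/* Models user_mode(): the state was executing at privilege level 3. */
static bool model_user_mode(const struct fake_regs *regs)
{
    return regs->cpl == 3;
}

/* Models user_stack(): the state was executing on the user-level stack(s). */
static bool model_user_stack(const struct fake_regs *regs)
{
    return regs->on_user_stack;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct fake_regs *regs)
{
    return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, a snapshot with cpl 0 and on_user_stack set is the only combination reported as fsys-mode. Ordinary user mode and full kernel mode both yield false.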
76 | ||
77 | * How to write an fsyscall handler | |
78 | ||
79 | The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers | |
80 | (fsyscall_table). This table contains one entry for each system call. | |
81 | By default, a system call is handled by fsys_fallback_syscall(). This | |
82 | routine takes care of entering (full) kernel mode and calling the | |
83 | normal Linux system call handler. For performance-critical system | |
84 | calls, it is possible to write a hand-tuned fsyscall handler. For | |
85 | example, fsys.S contains fsys_getpid(), which is a hand-tuned version | |
86 | of the getpid() system call. | |
87 | ||
88 | The entry and exit-state of an fsyscall handler is as follows: | |
89 | ||
90 | ** Machine state on entry to fsyscall handler: | |
91 | ||
92 | - r10 = 0 | |
93 | - r11 = saved ar.pfs (a user-level value) | |
94 | - r15 = system call number | |
95 | - r16 = "current" task pointer (in normal kernel-mode, this is in r13) | |
96 | - r32-r39 = system call arguments | |
97 | - b6 = return address (a user-level value) | |
98 | - ar.pfs = previous frame-state (a user-level value) | |
99 | - PSR.be = cleared to zero (i.e., little-endian byte order is in effect) | |
100 | - all other registers may contain values passed in from user-mode | |
101 | ||
102 | ** Required machine state on exit from fsyscall handler: | |
103 | ||
104 | - r11 = saved ar.pfs (as passed into the fsyscall handler) | |
105 | - r15 = system call number (as passed into the fsyscall handler) | |
106 | - r32-r39 = system call arguments (as passed into the fsyscall handler) | |
107 | - b6 = return address (as passed into the fsyscall handler) | |
108 | - ar.pfs = previous frame-state (as passed into the fsyscall handler) | |
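
The exit requirements amount to saying that a fixed set of registers must compare equal before and after the handler runs. The C sketch below expresses that check over a hypothetical struct fsys_frame. The structure and the fsys_frame_preserved() helper are assumptions made for illustration and do not exist in the kernel, where this state lives in CPU registers rather than in memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the registers listed above. */
struct fsys_frame {
    uint64_t r11;      /* saved ar.pfs (a user-level value) */
    uint64_t r15;      /* system call number */
    uint64_t args[8];  /* r32-r39: system call arguments */
    uint64_t b6;       /* return address (a user-level value) */
    uint64_t ar_pfs;   /* previous frame-state (a user-level value) */
};

/* Returns true if every register that must survive the handler is unchanged. */
static bool fsys_frame_preserved(const struct fsys_frame *entry,
                                 const struct fsys_frame *exit_state)
{
    return entry->r11 == exit_state->r11 &&
           entry->r15 == exit_state->r15 &&
           entry->b6 == exit_state->b6 &&
           entry->ar_pfs == exit_state->ar_pfs &&
           memcmp(entry->args, exit_state->args, sizeof(entry->args)) == 0;
}
```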
109 | ||
110 | Fsyscall handlers can execute with very little overhead, but with that | |
111 | speed comes a set of restrictions: | |
112 | ||
113 | o Fsyscall-handlers MUST check for any pending work in the flags | |
114 | member of the thread-info structure and if any of the | |
115 | TIF_ALLWORK_MASK flags are set, the handler needs to fall back on | |
116 | doing a full system call (by calling fsys_fallback_syscall). | |
117 | ||
118 | o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11, | |
119 | r15, b6, and ar.pfs) because they will be needed in case of a | |
120 | system call restart. Of course, all "preserved" registers also | |
121 | must be preserved, in accordance with the normal calling conventions. | |
122 | ||
123 | o Fsyscall-handlers MUST check whether argument registers contain a | |
124 | NaT value before using them in any way that could trigger a | |
125 | NaT-consumption fault. If a system call argument is found to | |
126 | contain a NaT value, an fsyscall-handler may return immediately | |
127 | with r8=EINVAL, r10=-1. | |
128 | ||
129 | o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform | |
130 | any other operation that would trigger mandatory RSE | |
131 | (register-stack engine) traffic. | |
132 | ||
133 | o Fsyscall-handlers MUST NOT write to any stacked registers because | |
134 | it is not safe to assume that user-level called a handler with the | |
135 | proper number of arguments. | |
136 | ||
137 | o Fsyscall-handlers need to be careful when accessing per-CPU variables: | |
138 | unless proper safe-guards are taken (e.g., interruptions are avoided), | |
139 | execution may be pre-empted and resumed on another CPU at any given | |
140 | time. | |
141 | ||
142 | o Fsyscall-handlers must be careful not to leak sensitive kernel | |
143 | information back to user-level. In particular, before returning to | |
144 | user-level, care needs to be taken to clear any scratch registers | |
145 | that could contain sensitive information (note that the current | |
146 | task pointer is not considered sensitive: it's already exposed | |
147 | through ar.k6). | |
148 | ||
149 | o Fsyscall-handlers MUST NOT access user-memory without first | |
150 | validating access-permission (this can be done typically via | |
151 | probe.r.fault and/or probe.w.fault) and without guarding against | |
152 | memory access exceptions (this can be done with the EX() macros | |
153 | defined by asmmacro.h). | |
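
The first and third restrictions describe a decision every handler makes before doing any real work. The C sketch below models that decision only; real handlers are written in IA-64 assembly. The enum, the fsys_precheck() function, and the MODEL_TIF_ALLWORK_MASK value are hypothetical placeholders, and the order of the two checks is an arbitrary choice for this example rather than something the rules above prescribe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Possible outcomes of the checks a handler performs up front. */
enum fsys_outcome {
    FSYS_FAST_PATH,      /* safe to continue in fsys-mode */
    FSYS_FALLBACK,       /* pending work: do a full system call */
    FSYS_RETURN_EINVAL,  /* NaT argument: return r8=EINVAL, r10=-1 */
};

/* Placeholder value; the real TIF_ALLWORK_MASK is defined by the kernel. */
#define MODEL_TIF_ALLWORK_MASK 0x1fUL

/* Decide how to proceed given the thread-info flags and whether any
 * argument register that the handler will use holds a NaT value. */
static enum fsys_outcome fsys_precheck(uint64_t thread_flags, bool arg_is_nat)
{
    if (thread_flags & MODEL_TIF_ALLWORK_MASK)
        return FSYS_FALLBACK;
    if (arg_is_nat)
        return FSYS_RETURN_EINVAL;
    return FSYS_FAST_PATH;
}
```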
154 | ||
155 | The above restrictions may seem draconian, but remember that it's | |
156 | possible to trade off some of the restrictions by paying a slightly | |
157 | higher overhead. For example, if an fsyscall-handler could benefit | |
158 | from the shadow register bank, it could temporarily disable PSR.i and | |
159 | PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as | |
160 | needed. In other words, following the above rules yields extremely | |
161 | fast system call execution (while fully preserving system call | |
162 | semantics), but there is also a lot of flexibility in handling more | |
163 | complicated cases. | |
164 | ||
165 | * Signal handling | |
166 | ||
167 | The delivery of (asynchronous) signals must be delayed until fsys-mode | |
168 | is exited. This is accomplished with the help of the lower-privilege | |
169 | transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user() | |
170 | checks whether the interrupted task was in fsys-mode and, if so, sets | |
171 | PSR.lp and returns immediately. When fsys-mode is exited via the | |
172 | "br.ret" instruction that lowers the privilege level, a trap will | |
173 | occur. The trap handler clears PSR.lp again and returns immediately. | |
174 | The kernel exit path then checks for and delivers any pending signals. | |
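
The deferral sequence can be pictured as a small state machine. The C sketch below is a model of the steps just described, not kernel code: struct model_task and both functions are hypothetical, and signal delivery is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical model of the task state relevant to signal deferral. */
struct model_task {
    bool in_fsys_mode;    /* interrupted while in fsys-mode */
    bool psr_lp;          /* lower-privilege transfer trap armed */
    bool signal_pending;  /* an asynchronous signal is waiting */
    int delivered;        /* number of signals delivered so far */
};

/* Models do_notify_resume_user(): defer delivery while in fsys-mode. */
static void model_notify_resume(struct model_task *t)
{
    if (t->in_fsys_mode) {
        t->psr_lp = true;  /* arm the trap and return immediately */
        return;
    }
    if (t->signal_pending) {
        t->signal_pending = false;
        t->delivered++;
    }
}

/* Models the "br.ret" that leaves fsys-mode and lowers the privilege level. */
static void model_fsys_return(struct model_task *t)
{
    t->in_fsys_mode = false;
    if (t->psr_lp) {
        t->psr_lp = false;       /* trap handler clears PSR.lp again */
        model_notify_resume(t);  /* exit path now delivers the signal */
    }
}
```

A signal that arrives while in_fsys_mode is set stays pending after model_notify_resume() and is delivered only once model_fsys_return() has run.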
175 | ||
176 | * PSR Handling | |
177 | ||
178 | The "epc" instruction doesn't change the contents of PSR at all. This | |
179 | is in contrast to a regular interruption, which clears almost all | |
180 | bits. Because of that, some care needs to be taken to ensure things | |
181 | work as expected. The following discussion describes how each PSR bit | |
182 | is handled. | |
183 | ||
184 | PSR.be Cleared when entering fsys-mode. A srlz.d instruction is used | |
185 | to ensure the CPU is in little-endian mode before the first | |
186 | load/store instruction is executed. PSR.be is normally NOT | |
187 | restored upon return from an fsys-mode handler. In other | |
188 | words, user-level code must not rely on PSR.be being preserved | |
189 | across a system call. | |
190 | PSR.up Unchanged. | |
191 | PSR.ac Unchanged. | |
192 | PSR.mfl Unchanged. Note: fsys-mode handlers must not write-registers! | |
193 | PSR.mfh Unchanged. Note: fsys-mode handlers must not write-registers! | |
194 | PSR.ic Unchanged. Note: fsys-mode handlers can clear the bit, if needed. | |
195 | PSR.i Unchanged. Note: fsys-mode handlers can clear the bit, if needed. | |
196 | PSR.pk Unchanged. | |
197 | PSR.dt Unchanged. | |
198 | PSR.dfl Unchanged. Note: fsys-mode handlers must not write-registers! | |
199 | PSR.dfh Unchanged. Note: fsys-mode handlers must not write-registers! | |
200 | PSR.sp Unchanged. | |
201 | PSR.pp Unchanged. | |
202 | PSR.di Unchanged. | |
203 | PSR.si Unchanged. | |
204 | PSR.db Unchanged. The kernel prevents user-level from setting a hardware | |
205 | breakpoint that triggers at any privilege level other than 3 (user-mode). | |
206 | PSR.lp Unchanged. | |
207 | PSR.tb Lazy redirect. If a taken-branch trap occurs while in | |
208 | fsys-mode, the trap-handler modifies the saved machine state | |
209 | such that execution resumes in the gate page at | |
210 | syscall_via_break(), with privilege level 3. Note: the | |
211 | taken-branch trap would occur on the branch invoking the | |
212 | fsyscall-handler, at which point, by definition, a syscall | |
213 | restart is still safe. If the system call number is invalid, | |
214 | the fsys-mode handler will return directly to user-level. This | |
215 | return will trigger a taken-branch trap, but since the trap is | |
216 | taken _after_ restoring the privilege level, the CPU has already | |
217 | left fsys-mode, so no special treatment is needed. | |
218 | PSR.rt Unchanged. | |
219 | PSR.cpl Cleared to 0. | |
220 | PSR.is Unchanged (guaranteed to be 0 on entry to the gate page). | |
221 | PSR.mc Unchanged. | |
222 | PSR.it Unchanged (guaranteed to be 1). | |
223 | PSR.id Unchanged. Note: the ia64 linux kernel never sets this bit. | |
224 | PSR.da Unchanged. Note: the ia64 linux kernel never sets this bit. | |
225 | PSR.dd Unchanged. Note: the ia64 linux kernel never sets this bit. | |
226 | PSR.ss Lazy redirect. If set, "epc" will cause a Single Step Trap to | |
227 | be taken. The trap handler then modifies the saved machine | |
228 | state such that execution resumes in the gate page at | |
229 | syscall_via_break(), with privilege level 3. | |
230 | PSR.ri Unchanged. | |
231 | PSR.ed Unchanged. Note: This bit could only have an effect if an fsys-mode | |
232 | handler performed a speculative load that gets NaTted. If so, this | |
233 | would be the normal & expected behavior, so no special treatment is | |
234 | needed. | |
235 | PSR.bn Unchanged. Note: fsys-mode handlers may clear the bit, if needed. | |
236 | Doing so requires clearing PSR.i and PSR.ic as well. | |
237 | PSR.ia Unchanged. Note: the ia64 linux kernel never sets this bit. | |
238 | ||
239 | * Using fast system calls | |
240 | ||
241 | To use fast system calls, userspace applications need only call | |
242 | __kernel_syscall_via_epc(). For example: | |
243 | ||
244 | -- example fgettimeofday() call -- | |
245 | -- fgettimeofday.S -- | |
246 | ||
247 | #include <asm/asmmacro.h> | |
248 | ||
249 | GLOBAL_ENTRY(fgettimeofday) | |
250 | .prologue | |
251 | .save ar.pfs, r11 | |
252 | mov r11 = ar.pfs | |
253 | .body | |
254 | ||
255 | movl r2 = 0xa000000000020660;; // gate address (64-bit immediate needs movl) | |
256 | // found by inspection of System.map for the | |
257 | // __kernel_syscall_via_epc() function. See | |
258 | // below for how to do this for real. | |
259 | ||
260 | mov b7 = r2 | |
261 | mov r15 = 1087 // gettimeofday syscall | |
262 | ;; | |
263 | br.call.sptk.many b6 = b7 | |
264 | ;; | |
265 | ||
266 | .restore sp | |
267 | ||
268 | mov ar.pfs = r11 | |
269 | br.ret.sptk.many rp;; // return to caller | |
270 | END(fgettimeofday) | |
271 | ||
272 | -- end fgettimeofday.S -- | |
273 | ||
274 | In reality, getting the gate address is accomplished by two extra | |
275 | values passed via the ELF auxiliary vector (include/asm-ia64/elf.h): | |
276 | ||
277 | o AT_SYSINFO : is the address of __kernel_syscall_via_epc() | |
278 | o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO | |
279 | ||
280 | The ELF DSO is a pre-linked library that is mapped in by the kernel at | |
281 | the gate page. It is a proper ELF shared object so, with a dynamic | |
282 | loader that recognises the library, you should be able to make calls to | |
283 | the exported functions within it as with any other shared library. | |
284 | AT_SYSINFO points into the kernel DSO at the | |
285 | __kernel_syscall_via_epc() function for historical reasons (it was | |
286 | used before the kernel DSO) and as a convenience. |
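
One way a user-level program might read these two values is through getauxval() from <sys/auxv.h>. This is a sketch under assumptions: getauxval() is a C library interface that this document does not mention, and struct gate_info and read_gate_info() are names invented for the example. getauxval() returns 0 for an entry that the kernel did not supply, so callers should treat 0 as "not available".

```c
#include <elf.h>       /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>  /* getauxval() */

/* Hypothetical holder for the two gate-related auxiliary vector entries. */
struct gate_info {
    unsigned long sysinfo;       /* address of __kernel_syscall_via_epc() */
    unsigned long sysinfo_ehdr;  /* address of the kernel gate ELF DSO */
};

/* Read both entries; a field is 0 if the kernel did not provide it. */
static struct gate_info read_gate_info(void)
{
    struct gate_info info;

    info.sysinfo = getauxval(AT_SYSINFO);
    info.sysinfo_ehdr = getauxval(AT_SYSINFO_EHDR);
    return info;
}
```

A caller would then use the sysinfo value in place of the hard-coded gate address in the assembly example above, falling back to the ordinary system call path when it is 0.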