.. _addsyscalls:

Adding a New System Call
========================

This document describes what's involved in adding a new system call to the
Linux kernel, over and above the normal submission advice in
:ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.


System Call Alternatives
------------------------

The first thing to consider when adding a new system call is whether one of
the alternatives might be suitable instead. Although system calls are the
most traditional and most obvious interaction points between userspace and the
kernel, there are other possibilities -- choose what fits best for your
interface.

 - If the operations involved can be made to look like a filesystem-like
   object, it may make more sense to create a new filesystem or device. This
   also makes it easier to encapsulate the new functionality in a kernel module
   rather than requiring it to be built into the main kernel.

     - If the new functionality involves operations where the kernel notifies
       userspace that something has happened, then returning a new file
       descriptor for the relevant object allows userspace to use
       ``poll``/``select``/``epoll`` to receive that notification.
     - However, operations that don't map to
       :manpage:`read(2)`/:manpage:`write(2)`-like operations
       have to be implemented as :manpage:`ioctl(2)` requests, which can lead
       to a somewhat opaque API.

 - If you're just exposing runtime system information, a new node in sysfs
   (see ``Documentation/filesystems/sysfs.rst``) or the ``/proc`` filesystem may
   be more appropriate. However, access to these mechanisms requires that the
   relevant filesystem is mounted, which might not always be the case (e.g.
   in a namespaced/sandboxed/chrooted environment). Avoid adding any API to
   debugfs, as this is not considered a 'production' interface to userspace.
 - If the operation is specific to a particular file or file descriptor, then
   an additional :manpage:`fcntl(2)` command option may be more appropriate. However,
   :manpage:`fcntl(2)` is a multiplexing system call that hides a lot of complexity, so
   this option is best for when the new function is closely analogous to
   existing :manpage:`fcntl(2)` functionality, or the new functionality is very simple
   (for example, getting/setting a simple flag related to a file descriptor).
 - If the operation is specific to a particular task or process, then an
   additional :manpage:`prctl(2)` command option may be more appropriate. As
   with :manpage:`fcntl(2)`, this system call is a complicated multiplexor so
   is best reserved for near-analogs of existing ``prctl()`` commands or
   getting/setting a simple flag related to a process.


Designing the API: Planning for Extension
-----------------------------------------

A new system call forms part of the API of the kernel, and has to be supported
indefinitely. As such, it's a very good idea to explicitly discuss the
interface on the kernel mailing list, and it's important to plan for future
extensions of the interface.

(The syscall table is littered with historical examples where this wasn't done,
together with the corresponding follow-up system calls --
``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
``pipe``/``pipe2``, ``renameat``/``renameat2`` -- so
learn from the history of the kernel and plan for extensions from the start.)

For simpler system calls that only take a couple of arguments, the preferred
way to allow for future extensibility is to include a flags argument to the
system call. To make sure that userspace programs can safely use flags
between kernel versions, check whether the flags value holds any unknown
flags, and reject the system call (with ``EINVAL``) if it does::

    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
        return -EINVAL;

(If no flags values are used yet, check that the flags argument is zero.)
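
As a minimal sketch of that check, the fragment below wraps it in a helper
that can be compiled and exercised in userspace. The flag values and the
``thing_check_flags()`` name are hypothetical and chosen only for
illustration; a real system call would define its own flags.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for an imaginary system call. */
#define THING_FLAG1 0x1u
#define THING_FLAG2 0x2u
#define THING_FLAG3 0x4u
#define THING_VALID_FLAGS (THING_FLAG1 | THING_FLAG2 | THING_FLAG3)

/* Each flag should occupy its own bit so that the mask test is meaningful. */
static_assert((THING_FLAG1 & THING_FLAG2) == 0 &&
	      (THING_FLAG1 & THING_FLAG3) == 0 &&
	      (THING_FLAG2 & THING_FLAG3) == 0,
	      "flag bits must not overlap");

/* Return 0 if every set bit is a known flag, -EINVAL otherwise. */
static int thing_check_flags(unsigned int flags)
{
	if (flags & ~THING_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

Because unknown bits are rejected today, a later kernel can assign them a
meaning without silently changing the behaviour of existing programs.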

For more sophisticated system calls that involve a larger number of arguments,
it's preferred to encapsulate the majority of the arguments into a structure
that is passed in by pointer. Such a structure can cope with future extension
by including a size argument in the structure::

    struct xyzzy_params {
        u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
        u32 param_1;
        u64 param_2;
        u64 param_3;
    };

As long as any subsequently added field, say ``param_4``, is designed so that a
zero value gives the previous behaviour, then this allows both directions of
version mismatch:

 - To cope with a later userspace program calling an older kernel, the kernel
   code should check that any memory beyond the size of the structure that it
   expects is zero (effectively checking that ``param_4 == 0``).
 - To cope with an older userspace program calling a newer kernel, the kernel
   code can zero-extend a smaller instance of the structure (effectively
   setting ``param_4 = 0``).

See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
``kernel/events/core.c``) for an example of this approach.
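
The two mismatch cases can be sketched in plain C as follows. This is a
userspace model under stated assumptions, not the ``perf_copy_attr()`` code:
the ``xyzzy_copy_params()`` helper is hypothetical, it reads from an ordinary
buffer rather than from ``__user`` memory, and the choice of ``-E2BIG`` for
unknown non-zero trailing data is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct xyzzy_params {
	uint32_t size;		/* caller sets size = sizeof(its own struct) */
	uint32_t param_1;
	uint64_t param_2;
	uint64_t param_3;
};

static_assert(sizeof(struct xyzzy_params) == 24, "no implicit padding expected");

/*
 * Copy a caller-supplied structure of any version into the version this
 * code understands. 'src' must point to at least as many bytes as its
 * leading size field claims. Returns 0, -EINVAL or -E2BIG.
 */
static int xyzzy_copy_params(struct xyzzy_params *dst, const void *src)
{
	const unsigned char *bytes = src;
	const size_t ksize = sizeof(*dst);
	uint32_t usize;
	size_t i;

	memcpy(&usize, bytes, sizeof(usize));
	if (usize < sizeof(usize))
		return -EINVAL;

	/* Newer caller: fields this version does not know must be zero. */
	for (i = ksize; i < usize; i++)
		if (bytes[i] != 0)
			return -E2BIG;

	/* Older caller: missing trailing fields read as zero. */
	memset(dst, 0, ksize);
	memcpy(dst, bytes, usize < ksize ? usize : ksize);
	return 0;
}
```

Kernel code would perform the same checks while copying from userspace;
recent kernels provide ``copy_struct_from_user()`` for this purpose.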


Designing the API: Other Considerations
---------------------------------------

If your new system call allows userspace to refer to a kernel object, it
should use a file descriptor as the handle for that object -- don't invent a
new type of userspace object handle when the kernel already has mechanisms and
well-defined semantics for using file descriptors.

If your new :manpage:`xyzzy(2)` system call does return a new file descriptor,
then the flags argument should include a value that is equivalent to setting
``O_CLOEXEC`` on the new FD. This makes it possible for userspace to close
the timing window between ``xyzzy()`` and calling
``fcntl(fd, F_SETFD, FD_CLOEXEC)``, where an unexpected ``fork()`` and
``execve()`` in another thread could leak a descriptor to
the exec'ed program. (However, resist the temptation to re-use the actual value
of the ``O_CLOEXEC`` constant, as it is architecture-specific and is part of a
numbering space of ``O_*`` flags that is fairly full.)

If your system call returns a new file descriptor, you should also consider
what it means to use the :manpage:`poll(2)` family of system calls on that file
descriptor. Making a file descriptor ready for reading or writing is the
normal way for the kernel to indicate to userspace that an event has
occurred on the corresponding kernel object.

If your new :manpage:`xyzzy(2)` system call involves a filename argument::

    int sys_xyzzy(const char __user *path, ..., unsigned int flags);

you should also consider whether an :manpage:`xyzzyat(2)` version is more appropriate::

    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);

This allows more flexibility for how userspace specifies the file in question;
in particular it allows userspace to request the functionality for an
already-opened file descriptor using the ``AT_EMPTY_PATH`` flag, effectively
giving an :manpage:`fxyzzy(3)` operation for free::

 - xyzzyat(AT_FDCWD, path, ..., 0) is equivalent to xyzzy(path,...)
 - xyzzyat(fd, "", ..., AT_EMPTY_PATH) is equivalent to fxyzzy(fd, ...)

(For more details on the rationale of the \*at() calls, see the
:manpage:`openat(2)` man page; for an example of AT_EMPTY_PATH, see the
:manpage:`fstatat(2)` man page.)

If your new :manpage:`xyzzy(2)` system call involves a parameter describing an
offset within a file, make its type ``loff_t`` so that 64-bit offsets can be
supported even on 32-bit architectures.

If your new :manpage:`xyzzy(2)` system call involves privileged functionality,
it needs to be governed by the appropriate Linux capability bit (checked with
a call to ``capable()``), as described in the :manpage:`capabilities(7)` man
page. Choose an existing capability bit that governs related functionality,
but try to avoid combining lots of only vaguely related functions together
under the same bit, as this goes against capabilities' purpose of splitting
the power of root. In particular, avoid adding new uses of the already
overly-general ``CAP_SYS_ADMIN`` capability.

If your new :manpage:`xyzzy(2)` system call manipulates a process other than
the calling process, it should be restricted (using a call to
``ptrace_may_access()``) so that only a calling process with the same
permissions as the target process, or with the necessary capabilities, can
manipulate the target process.

Finally, be aware that some non-x86 architectures have an easier time if
system call parameters that are explicitly 64-bit fall on odd-numbered
arguments (i.e. parameter 1, 3, 5), to allow use of contiguous pairs of 32-bit
registers. (This concern does not apply if the arguments are part of a
structure that's passed in by pointer.)


Proposing the API
-----------------

To make new system calls easy to review, it's best to divide up the patchset
into separate chunks. These should include at least the following items as
distinct commits (each of which is described further below):

 - The core implementation of the system call, together with prototypes,
   generic numbering, Kconfig changes and fallback stub implementation.
 - Wiring up of the new system call for one particular architecture, usually
   x86 (including all of x86_64, x86_32 and x32).
 - A demonstration of the use of the new system call in userspace via a
   selftest in ``tools/testing/selftests/``.
 - A draft man-page for the new system call, either as plain text in the
   cover letter, or as a patch to the (separate) man-pages repository.

New system call proposals, like any change to the kernel's API, should always
be cc'ed to linux-api@vger.kernel.org.


Generic System Call Implementation
----------------------------------

The main entry point for your new :manpage:`xyzzy(2)` system call will be called
``sys_xyzzy()``, but you add this entry point with the appropriate
``SYSCALL_DEFINEn()`` macro rather than explicitly. The 'n' indicates the
number of arguments to the system call, and the macro takes the system call name
followed by the (type, name) pairs for the parameters as arguments. Using
this macro allows metadata about the new system call to be made available for
other tools.

The new entry point also needs a corresponding function prototype, in
``include/linux/syscalls.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long sys_xyzzy(...);

Some architectures (e.g. x86) have their own architecture-specific syscall
tables, but several other architectures share a generic syscall table. Add your
new system call to the generic list by adding an entry to the list in
``include/uapi/asm-generic/unistd.h``::

    #define __NR_xyzzy 292
    __SYSCALL(__NR_xyzzy, sys_xyzzy)

Also update the ``__NR_syscalls`` count to reflect the additional system call, and
note that if multiple new system calls are added in the same merge window,
your new syscall number may get adjusted to resolve conflicts.

The file ``kernel/sys_ni.c`` provides a fallback stub implementation of each
system call, returning ``-ENOSYS``. Add your new system call here too::

    COND_SYSCALL(xyzzy);

Your new kernel functionality, and the system call that controls it, should
normally be optional, so add a ``CONFIG`` option (typically to
``init/Kconfig``) for it. As usual for new ``CONFIG`` options:

 - Include a description of the new functionality and system call controlled
   by the option.
 - Make the option depend on EXPERT if it should be hidden from normal users.
 - Make any new source files implementing the function dependent on the CONFIG
   option in the Makefile (e.g. ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
 - Double check that the kernel still builds with the new CONFIG option turned
   off.

To summarize, you need a commit that includes:

 - ``CONFIG`` option for the new function, normally in ``init/Kconfig``
 - ``SYSCALL_DEFINEn(xyzzy, ...)`` for the entry point
 - corresponding prototype in ``include/linux/syscalls.h``
 - generic table entry in ``include/uapi/asm-generic/unistd.h``
 - fallback stub in ``kernel/sys_ni.c``


x86 System Call Implementation
------------------------------

To wire up your new system call for x86 platforms, you need to update the
master syscall tables. Assuming your new system call isn't special in some
way (see below), this involves a "common" entry (for x86_64 and x32) in
``arch/x86/entry/syscalls/syscall_64.tbl``::

    333   common   xyzzy     sys_xyzzy

and an "i386" entry in ``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy

Again, these numbers are liable to be changed if there are conflicts in the
relevant merge window.


Compatibility System Calls (Generic)
------------------------------------

For most system calls the same 64-bit implementation can be invoked even when
the userspace program is itself 32-bit; even if the system call's parameters
include an explicit pointer, this is handled transparently.

However, there are a couple of situations where a compatibility layer is
needed to cope with size differences between 32-bit and 64-bit.

The first is if the 64-bit kernel also supports 32-bit userspace programs, and
so needs to parse areas of (``__user``) memory that could hold either 32-bit or
64-bit values. In particular, this is needed whenever a system call argument
is:

 - a pointer to a pointer
 - a pointer to a struct containing a pointer (e.g. ``struct iovec __user *``)
 - a pointer to a varying sized integral type (``time_t``, ``off_t``,
   ``long``, ...)
 - a pointer to a struct containing a varying sized integral type.

The second situation that requires a compatibility layer is if one of the
system call's arguments has a type that is explicitly 64-bit even on a 32-bit
architecture, for example ``loff_t`` or ``__u64``. In this case, a value that
arrives at a 64-bit kernel from a 32-bit application will be split into two
32-bit values, which then need to be re-assembled in the compatibility layer.
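
The re-assembly step itself is a shift and an OR. The helpers below are a
hypothetical illustration of that arithmetic only; which half arrives in
which argument slot depends on the architecture and its endianness, and is
not modelled here.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
	      "a 64-bit value splits into exactly two 32-bit halves");

/* Rebuild a 64-bit value from the two halves passed by a 32-bit caller. */
static uint64_t join_halves(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Inverse operation, i.e. what the 32-bit calling convention does. */
static void split_halves(uint64_t value, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)(value & 0xffffffffu);
	*hi = (uint32_t)(value >> 32);
}
```

The cast to ``uint64_t`` before shifting matters: shifting a 32-bit value by
32 bits would be undefined behaviour in C.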

(Note that a system call argument that's a pointer to an explicit 64-bit type
does **not** need a compatibility layer; for example, :manpage:`splice(2)`'s arguments of
type ``loff_t __user *`` do not trigger the need for a ``compat_`` system call.)

The compatibility version of the system call is called ``compat_sys_xyzzy()``,
and is added with the ``COMPAT_SYSCALL_DEFINEn()`` macro, analogously to
``SYSCALL_DEFINEn``. This version of the implementation runs as part of a 64-bit
kernel, but expects to receive 32-bit parameter values and does whatever is
needed to deal with them. (Typically, the ``compat_sys_`` version converts the
values to 64-bit versions and either calls on to the ``sys_`` version, or both of
them call a common inner implementation function.)

The compat entry point also needs a corresponding function prototype, in
``include/linux/compat.h``, marked as asmlinkage to match the way that system
calls are invoked::

    asmlinkage long compat_sys_xyzzy(...);

If the system call involves a structure that is laid out differently on 32-bit
and 64-bit systems, say ``struct xyzzy_args``, then the ``include/linux/compat.h``
header file should also include a compat version of the structure (``struct
compat_xyzzy_args``) where each variable-size field has the appropriate
``compat_`` type that corresponds to the type in ``struct xyzzy_args``. The
``compat_sys_xyzzy()`` routine can then use this ``compat_`` structure to
parse the arguments from a 32-bit invocation.

For example, if there are fields::

    struct xyzzy_args {
        const char __user *ptr;
        __kernel_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };

in ``struct xyzzy_args``, then ``struct compat_xyzzy_args`` would have::

    struct compat_xyzzy_args {
        compat_uptr_t ptr;
        compat_long_t varying_val;
        u64 fixed_val;
        /* ... */
    };
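
The conversion between the two layouts can be modelled in userspace as shown
below. This is a sketch under stated assumptions: the ``compat_uptr_t`` and
``compat_long_t`` typedefs are local stand-ins for the kernel types, the
``__user`` annotation is dropped, and ``xyzzy_args_from_compat()`` is a
hypothetical helper name. Kernel code would use ``compat_ptr()`` for the
pointer conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's fixed-width compat types. */
typedef uint32_t compat_uptr_t;
typedef int32_t compat_long_t;

struct xyzzy_args {
	const char *ptr;
	long varying_val;
	uint64_t fixed_val;
};

struct compat_xyzzy_args {
	compat_uptr_t ptr;
	compat_long_t varying_val;
	uint64_t fixed_val;
};

static_assert(sizeof(struct compat_xyzzy_args) == 16,
	      "compat layout is fixed regardless of the native word size");

/* Widen each field of the 32-bit layout into the native layout. */
static void xyzzy_args_from_compat(struct xyzzy_args *dst,
				   const struct compat_xyzzy_args *src)
{
	dst->ptr = (const char *)(uintptr_t)src->ptr;	/* zero-extend */
	dst->varying_val = src->varying_val;		/* sign-extend */
	dst->fixed_val = src->fixed_val;		/* already 64-bit */
}
```

Note the different widening rules: pointers are zero-extended, while a signed
``long`` must keep its sign.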

The generic system call list also needs adjusting to allow for the compat
version; the entry in ``include/uapi/asm-generic/unistd.h`` should use
``__SC_COMP`` rather than ``__SYSCALL``::

    #define __NR_xyzzy 292
    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)

To summarize, you need:

 - a ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` for the compat entry point
 - corresponding prototype in ``include/linux/compat.h``
 - (if needed) 32-bit mapping struct in ``include/linux/compat.h``
 - instance of ``__SC_COMP`` not ``__SYSCALL`` in
   ``include/uapi/asm-generic/unistd.h``


Compatibility System Calls (x86)
--------------------------------

To wire up the x86 architecture of a system call with a compatibility version,
the entries in the syscall tables need to be adjusted.

First, the entry in ``arch/x86/entry/syscalls/syscall_32.tbl`` gets an extra
column to indicate that a 32-bit userspace program running on a 64-bit kernel
should hit the compat entry point::

    380   i386     xyzzy     sys_xyzzy    __ia32_compat_sys_xyzzy

Second, you need to figure out what should happen for the x32 ABI version of
the new system call. There's a choice here: the layout of the arguments
should either match the 64-bit version or the 32-bit version.

If there's a pointer-to-a-pointer involved, the decision is easy: x32 is
ILP32, so the layout should match the 32-bit version, and the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is split so that x32 programs hit
the compatibility wrapper::

    333   64       xyzzy     sys_xyzzy
    ...
    555   x32      xyzzy     __x32_compat_sys_xyzzy

If no pointers are involved, then it is preferable to re-use the 64-bit system
call for the x32 ABI (and consequently the entry in
``arch/x86/entry/syscalls/syscall_64.tbl`` is unchanged).

In either case, you should check that the types involved in your argument
layout do indeed map exactly from x32 (-mx32) to either the 32-bit (-m32) or
64-bit (-m64) equivalents.


System Calls Returning Elsewhere
--------------------------------

For most system calls, once the system call is complete the user program
continues exactly where it left off -- at the next instruction, with the
stack the same and most of the registers the same as before the system call,
and with the same virtual memory space.

However, a few system calls do things differently. They might return to a
different location (``rt_sigreturn``) or change the memory space
(``fork``/``vfork``/``clone``) or even architecture (``execve``/``execveat``)
of the program.

To allow for this, the kernel implementation of the system call may need to
save and restore additional registers to the kernel stack, allowing complete
control of where and how execution continues after the system call.

This is arch-specific, but typically involves defining assembly entry points
that save/restore additional registers and invoke the real system call entry
point.

For x86_64, this is implemented as a ``stub_xyzzy`` entry point in
``arch/x86/entry/entry_64.S``, and the entry in the syscall table
(``arch/x86/entry/syscalls/syscall_64.tbl``) is adjusted to match::

    333   common   xyzzy     stub_xyzzy

The equivalent for 32-bit programs running on a 64-bit kernel is normally
called ``stub32_xyzzy`` and implemented in ``arch/x86/entry/entry_64_compat.S``,
with the corresponding syscall table adjustment in
``arch/x86/entry/syscalls/syscall_32.tbl``::

    380   i386     xyzzy     sys_xyzzy    stub32_xyzzy

If the system call needs a compatibility layer (as in the previous section)
then the ``stub32_`` version needs to call on to the ``compat_sys_`` version
of the system call rather than the native 64-bit version. Also, if the x32 ABI
implementation is not common with the x86_64 version, then its syscall
table will also need to invoke a stub that calls on to the ``compat_sys_``
version.

For completeness, it's also nice to set up a mapping so that user-mode Linux
still works -- its syscall table will reference ``stub_xyzzy``, but the UML build
doesn't include the ``arch/x86/entry/entry_64.S`` implementation (because UML
simulates registers etc). Fixing this is as simple as adding a #define to
``arch/x86/um/sys_call_table_64.c``::

    #define stub_xyzzy sys_xyzzy


Other Details
-------------

Most of the kernel treats system calls in a generic way, but there is the
occasional exception that may need updating for your particular system call.

The audit subsystem is one such special case; it includes (arch-specific)
functions that classify some special types of system call -- specifically
file open (``open``/``openat``), program execution (``execve``/``execveat``) or
socket multiplexor (``socketcall``) operations. If your new system call is
analogous to one of these, then the audit system should be updated.

More generally, if there is an existing system call that is analogous to your
new system call, it's worth doing a kernel-wide grep for the existing system
call to check there are no other special cases.
456 | ||
457 | Testing | |
458 | ------- | |
459 | ||
460 | A new system call should obviously be tested; it is also useful to provide | |
461 | reviewers with a demonstration of how user space programs will use the system | |
462 | call. A good way to combine these aims is to include a simple self-test | |
12983bcd | 463 | program in a new directory under ``tools/testing/selftests/``. |
4983953d DD |
464 | |
465 | For a new system call, there will obviously be no libc wrapper function and so | |
12983bcd | 466 | the test will need to invoke it using ``syscall()``; also, if the system call |
4983953d DD |
467 | involves a new userspace-visible structure, the corresponding header will need |
468 | to be installed to compile the test. | |
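
A selftest built on ``syscall()`` might be shaped like the sketch below. Since
``xyzzy`` does not exist, the code uses ``SYS_getpid`` as a stand-in so that
it can actually be compiled and run; the ``TEST_NR`` macro and the function
names are hypothetical. A real test would use the ``__NR_xyzzy`` value from
the newly installed kernel headers and pass the call's real arguments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Stand-in number; replace with __NR_xyzzy in a real selftest. */
#define TEST_NR SYS_getpid

/* Invoke the call directly, since libc has no wrapper for a new syscall. */
static long xyzzy_raw(void)
{
	return syscall(TEST_NR);
}

/* Returns 1 if the running kernel implements the call, 0 on ENOSYS. */
static int xyzzy_supported(void)
{
	errno = 0;
	return !(xyzzy_raw() == -1 && errno == ENOSYS);
}

/* Smallest useful check: the call exists and returns a plausible value. */
static void xyzzy_smoke_test(void)
{
	assert(xyzzy_supported());
	assert(xyzzy_raw() > 0);
}
```

Checking for ``ENOSYS`` first lets the test skip cleanly, rather than fail,
when it is run on a kernel built without the new call.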

Make sure the selftest runs successfully on all supported architectures. For
example, check that it works when compiled as an x86_64 (-m64), x86_32 (-m32)
and x32 (-mx32) ABI program.

For more extensive and thorough testing of new functionality, you should also
consider adding tests to the Linux Test Project, or to the xfstests project
for filesystem-related changes.

 - https://linux-test-project.github.io/
 - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git


Man Page
--------

All new system calls should come with a complete man page, ideally using groff
markup, but plain text will do. If groff is used, it's helpful to include a
pre-rendered ASCII version of the man page in the cover email for the
patchset, for the convenience of reviewers.

The man page should be cc'ed to linux-man@vger.kernel.org; for more details,
see https://www.kernel.org/doc/man-pages/patches.html


Do not call System Calls in the Kernel
--------------------------------------

System calls are, as stated above, interaction points between userspace and
the kernel. Therefore, system call functions such as ``sys_xyzzy()`` or
``compat_sys_xyzzy()`` should only be called from userspace via the syscall
table, but not from elsewhere in the kernel. If the syscall functionality is
useful to be used within the kernel, needs to be shared between an old and a
new syscall, or needs to be shared between a syscall and its compatibility
variant, it should be implemented by means of a "helper" function (such as
``kern_xyzzy()``). This kernel function may then be called within the
syscall stub (``sys_xyzzy()``), the compatibility syscall stub
(``compat_sys_xyzzy()``), and/or other kernel code.
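
The helper pattern can be modelled in plain C for the "old and new syscall"
case. Everything in this sketch is hypothetical: ``xyzzy2`` is an imaginary
flags-taking successor of ``xyzzy``, the entry points are ordinary functions
rather than ``SYSCALL_DEFINEn()`` expansions, and the helper just echoes its
argument as a placeholder for real work.

```c
#include <assert.h>
#include <errno.h>

#define XYZZY_VALID_FLAGS 0x1u	/* hypothetical flag mask */

/* Shared helper: the only place where the real work would happen. */
static long kern_xyzzy(int fd, unsigned int flags)
{
	if (flags & ~XYZZY_VALID_FLAGS)
		return -EINVAL;
	if (fd < 0)
		return -EBADF;
	return fd;	/* placeholder result */
}

/* Model of the original entry point, which had no flags argument. */
static long xyzzy_entry(int fd)
{
	return kern_xyzzy(fd, 0);
}

/* Model of the follow-up entry point that adds flags. */
static long xyzzy2_entry(int fd, unsigned int flags)
{
	return kern_xyzzy(fd, flags);
}
```

Neither entry point calls the other; both are thin wrappers around the
helper, which is also what other kernel code would call.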

At least on 64-bit x86, it will be a hard requirement from v4.17 onwards to not
call system call functions in the kernel. It uses a different calling
convention for system calls where ``struct pt_regs`` is decoded on-the-fly in a
syscall wrapper which then hands processing over to the actual syscall function.
This means that only those parameters which are actually needed for a specific
syscall are passed on during syscall entry, instead of filling in six CPU
registers with random user space content all the time (which may cause serious
trouble down the call chain).

Moreover, rules on how data may be accessed may differ between kernel data and
user data. This is another reason why calling ``sys_xyzzy()`` is generally a
bad idea.

Exceptions to this rule are only allowed in architecture-specific overrides,
architecture-specific compatibility wrappers, or other code in arch/.


References and Sources
----------------------

 - LWN article from Michael Kerrisk on use of flags argument in system calls:
   https://lwn.net/Articles/585415/
 - LWN article from Michael Kerrisk on how to handle unknown flags in a system
   call: https://lwn.net/Articles/588444/
 - LWN article from Jake Edge describing constraints on 64-bit system call
   arguments: https://lwn.net/Articles/311630/
 - Pair of LWN articles from David Drysdale that describe the system call
   implementation paths in detail for v3.14:

    - https://lwn.net/Articles/604287/
    - https://lwn.net/Articles/604515/

 - Architecture-specific requirements for system calls are discussed in the
   :manpage:`syscall(2)` man-page:
   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
 - Collated emails from Linus Torvalds discussing the problems with ``ioctl()``:
   https://yarchive.net/comp/linux/ioctl.html
 - "How to not invent kernel interfaces", Arnd Bergmann,
   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
 - LWN article from Michael Kerrisk on avoiding new uses of CAP_SYS_ADMIN:
   https://lwn.net/Articles/486306/
 - Recommendation from Andrew Morton that all related information for a new
   system call should come in the same email thread:
   https://lkml.org/lkml/2014/7/24/641
 - Recommendation from Michael Kerrisk that a new system call should come with
   a man page: https://lkml.org/lkml/2014/6/13/309
 - Suggestion from Thomas Gleixner that x86 wire-up should be in a separate
   commit: https://lkml.org/lkml/2014/11/19/254
 - Suggestion from Greg Kroah-Hartman that it's good for new system calls to
   come with a man-page & selftest: https://lkml.org/lkml/2014/3/19/710
 - Discussion from Michael Kerrisk of new system call vs. :manpage:`prctl(2)` extension:
   https://lkml.org/lkml/2014/6/3/411
 - Suggestion from Ingo Molnar that system calls that involve multiple
   arguments should encapsulate those arguments in a struct, which includes a
   size field for future extensibility: https://lkml.org/lkml/2015/7/30/117
 - Numbering oddities arising from (re-)use of O_* numbering space flags:

    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
      check")
    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
      conflict")
    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")

 - Discussion from Matthew Wilcox about restrictions on 64-bit arguments:
   https://lkml.org/lkml/2008/12/12/187
 - Recommendation from Greg Kroah-Hartman that unknown flags should be
   policed: https://lkml.org/lkml/2014/7/17/577
 - Recommendation from Linus Torvalds that x32 system calls should prefer
   compatibility with 64-bit versions rather than 32-bit versions:
   https://lkml.org/lkml/2011/8/31/244