[mirror_ubuntu-jammy-kernel.git] / Documentation / process / botching-up-ioctls.rst

=================================
(How to avoid) Botching up ioctls
=================================

From: https://blog.ffwll.ch/2013/11/botching-up-ioctls.html

By: Daniel Vetter, Copyright © 2013 Intel Corporation

One clear insight kernel graphics hackers gained in the past few years is that
trying to come up with a unified interface to manage the execution units and
memory on completely different GPUs is a futile effort. So nowadays every
driver has its own set of ioctls to allocate memory and submit work to the GPU.
Which is nice, since there's no more insanity in the form of fake-generic, but
actually only used once interfaces. But the clear downside is that there's much
more potential to screw things up.

To avoid repeating all the same mistakes again I've written up some of the
lessons learned while botching the job for the drm/i915 driver. Most of these
only cover technicalities and not the big-picture issues like what the command
submission ioctl exactly should look like. Learning these lessons is probably
something every GPU driver has to do on its own.


Prerequisites
-------------

First the prerequisites. Without these you have already failed, because you
will need to add a 32-bit compat layer:

 * Only use fixed sized integers. To avoid conflicts with typedefs in userspace
   the kernel has special types like __u32, __s64. Use them.

 * Align everything to the natural size and use explicit padding. 32-bit
   platforms don't necessarily align 64-bit values to 64-bit boundaries, but
   64-bit platforms do. So we always need padding to the natural size to get
   this right.

 * Pad the entire struct to a multiple of 64-bits if the structure contains
   64-bit types - the structure size will otherwise differ on 32-bit versus
   64-bit. Having a different structure size hurts when passing arrays of
   structures to the kernel, or if the kernel checks the structure size, which
   e.g. the drm core does.

 * Pointers are __u64, cast from/to a uintprt_t on the userspace side and
   from/to a void __user * in the kernel. Try really hard not to delay this
   conversion or worse, fiddle the raw __u64 through your code since that
   diminishes the checking tools like sparse can provide. The macro
   u64_to_user_ptr can be used in the kernel to avoid warnings about integers
   and pointers of different sizes.


Basics
------

With the joys of writing a compat layer avoided we can take a look at the basic
fumbles. Neglecting these will make backward and forward compatibility a real
pain. And since getting things wrong on the first attempt is guaranteed you
will have a second iteration or at least an extension for any given interface.

 * Have a clear way for userspace to figure out whether your new ioctl or ioctl
   extension is supported on a given kernel. If you can't rely on old kernels
   rejecting the new flags/modes or ioctls (since doing that was botched in the
   past) then you need a driver feature flag or revision number somewhere.

 * Have a plan for extending ioctls with new flags or new fields at the end of
   the structure. The drm core checks the passed-in size for each ioctl call
   and zero-extends any mismatches between kernel and userspace. That helps,
   but isn't a complete solution since newer userspace on older kernels won't
   notice that the newly added fields at the end get ignored. So this still
   needs a new driver feature flags.

 * Check all unused fields and flags and all the padding for whether it's 0,
   and reject the ioctl if that's not the case. Otherwise your nice plan for
   future extensions is going right down the gutters since someone will submit
   an ioctl struct with random stack garbage in the yet unused parts. Which
   then bakes in the ABI that those fields can never be used for anything else
   but garbage. This is also the reason why you must explicitly pad all
   structures, even if you never use them in an array - the padding the compiler
   might insert could contain garbage.

 * Have simple testcases for all of the above.


Fun with Error Paths
--------------------

Nowadays we don't have any excuse left any more for drm drivers being neat
little root exploits. This means we both need full input validation and solid
error handling paths - GPUs will die eventually in the oddmost corner cases
anyway:

 * The ioctl must check for array overflows. Also it needs to check for
   over/underflows and clamping issues of integer values in general. The usual
   example is sprite positioning values fed directly into the hardware with the
   hardware just having 12 bits or so. Works nicely until some odd display
   server doesn't bother with clamping itself and the cursor wraps around the
   screen.

 * Have simple testcases for every input validation failure case in your ioctl.
   Check that the error code matches your expectations. And finally make sure
   that you only test for one single error path in each subtest by submitting
   otherwise perfectly valid data. Without this an earlier check might reject
   the ioctl already and shadow the codepath you actually want to test, hiding
   bugs and regressions.

 * Make all your ioctls restartable. First X really loves signals and second
   this will allow you to test 90% of all error handling paths by just
   interrupting your main test suite constantly with signals. Thanks to X's
   love for signal you'll get an excellent base coverage of all your error
   paths pretty much for free for graphics drivers. Also, be consistent with
   how you handle ioctl restarting - e.g. drm has a tiny drmIoctl helper in its
   userspace library. The i915 driver botched this with the set_tiling ioctl,
   now we're stuck forever with some arcane semantics in both the kernel and
   userspace.

 * If you can't make a given codepath restartable make a stuck task at least
   killable. GPUs just die and your users won't like you more if you hang their
   entire box (by means of an unkillable X process). If the state recovery is
   still too tricky have a timeout or hangcheck safety net as a last-ditch
   effort in case the hardware has gone bananas.

 * Have testcases for the really tricky corner cases in your error recovery code
   - it's way too easy to create a deadlock between your hangcheck code and
   waiters.


Time, Waiting and Missing it
----------------------------

GPUs do most everything asynchronously, so we have a need to time operations and
wait for outstanding ones. This is really tricky business; at the moment none of
the ioctls supported by the drm/i915 get this fully right, which means there's
still tons more lessons to learn here.

 * Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and
   v4l use by default nowadays. But let userspace know which timestamps are
   derived from different clock domains like your main system clock (provided
   by the kernel) or some independent hardware counter somewhere else. Clocks
   will mismatch if you look close enough, but if performance measuring tools
   have this information they can at least compensate. If your userspace can
   get at the raw values of some clocks (e.g. through in-command-stream
   performance counter sampling instructions) consider exposing those also.

 * Use __s64 seconds plus __u64 nanoseconds to specify time. It's not the most
   convenient time specification, but it's mostly the standard.

 * Check that input time values are normalized and reject them if not. Note
   that the kernel native struct ktime has a signed integer for both seconds
   and nanoseconds, so beware here.

 * For timeouts, use absolute times. If you're a good fellow and made your
   ioctl restartable relative timeouts tend to be too coarse and can
   indefinitely extend your wait time due to rounding on each restart.
   Especially if your reference clock is something really slow like the display
   frame counter. With a spec lawyer hat on this isn't a bug since timeouts can
   always be extended - but users will surely hate you if their neat animations
   starts to stutter due to this.

 * Consider ditching any synchronous wait ioctls with timeouts and just deliver
   an asynchronous event on a pollable file descriptor. It fits much better
   into event driven applications' main loop.

 * Have testcases for corner-cases, especially whether the return values for
   already-completed events, successful waits and timed-out waits are all sane
   and suiting to your needs.


Leaking Resources, Not
----------------------

A full-blown drm driver essentially implements a little OS, but specialized to
the given GPU platforms. This means a driver needs to expose tons of handles
for different objects and other resources to userspace. Doing that right
entails its own little set of pitfalls:

 * Always attach the lifetime of your dynamically created resources to the
   lifetime of a file descriptor. Consider using a 1:1 mapping if your resource
   needs to be shared across processes -  fd-passing over unix domain sockets
   also simplifies lifetime management for userspace.

 * Always have O_CLOEXEC support.

 * Ensure that you have sufficient insulation between different clients. By
   default pick a private per-fd namespace which forces any sharing to be done
   explicitly. Only go with a more global per-device namespace if the objects
   are truly device-unique. One counterexample in the drm modeset interfaces is
   that the per-device modeset objects like connectors share a namespace with
   framebuffer objects, which mostly are not shared at all. A separate
   namespace, private by default, for framebuffers would have been more
   suitable.

 * Think about uniqueness requirements for userspace handles. E.g. for most drm
   drivers it's a userspace bug to submit the same object twice in the same
   command submission ioctl. But then if objects are shareable userspace needs
   to know whether it has seen an imported object from a different process
   already or not. I haven't tried this myself yet due to lack of a new class
   of objects, but consider using inode numbers on your shared file descriptors
   as unique identifiers - it's how real files are told apart, too.
   Unfortunately this requires a full-blown virtual filesystem in the kernel.


Last, but not Least
-------------------

Not every problem needs a new ioctl:

 * Think hard whether you really want a driver-private interface. Of course
   it's much quicker to push a driver-private interface than engaging in
   lengthy discussions for a more generic solution. And occasionally doing a
   private interface to spearhead a new concept is what's required. But in the
   end, once the generic interface comes around you'll end up maintainer two
   interfaces. Indefinitely.

 * Consider other interfaces than ioctls. A sysfs attribute is much better for
   per-device settings, or for child objects with fairly static lifetimes (like
   output connectors in drm with all the detection override attributes). Or
   maybe only your testsuite needs this interface, and then debugfs with its
   disclaimer of not having a stable ABI would be better.

Finally, the name of the game is to get it right on the first attempt, since if
your driver proves popular and your hardware platforms long-lived then you'll
be stuck with a given ioctl essentially forever. You can try to deprecate
horrible ioctls on newer iterations of your hardware, but generally it takes
years to accomplish this. And then again years until the last user able to
complain about regressions disappears, too.
Commit	Line	Data
5c04dcea	1	=================================
efe4a772 ME	2	(How to avoid) Botching up ioctls
	3	=================================
	4
e7b4311e	5	From: https://blog.ffwll.ch/2013/11/botching-up-ioctls.html
efe4a772 ME	6
	7	By: Daniel Vetter, Copyright © 2013 Intel Corporation
	8
	9	One clear insight kernel graphics hackers gained in the past few years is that
	10	trying to come up with a unified interface to manage the execution units and
	11	memory on completely different GPUs is a futile effort. So nowadays every
	12	driver has its own set of ioctls to allocate memory and submit work to the GPU.
	13	Which is nice, since there's no more insanity in the form of fake-generic, but
	14	actually only used once interfaces. But the clear downside is that there's much
	15	more potential to screw things up.
	16
	17	To avoid repeating all the same mistakes again I've written up some of the
	18	lessons learned while botching the job for the drm/i915 driver. Most of these
	19	only cover technicalities and not the big-picture issues like what the command
	20	submission ioctl exactly should look like. Learning these lessons is probably
	21	something every GPU driver has to do on its own.
	22
	23
	24	Prerequisites
	25	-------------
	26
	27	First the prerequisites. Without these you have already failed, because you
8da9704c	28	will need to add a 32-bit compat layer:
efe4a772 ME	29
	30	* Only use fixed sized integers. To avoid conflicts with typedefs in userspace
	31	the kernel has special types like __u32, __s64. Use them.
	32
	33	* Align everything to the natural size and use explicit padding. 32-bit
	34	platforms don't necessarily align 64-bit values to 64-bit boundaries, but
	35	64-bit platforms do. So we always need padding to the natural size to get
	36	this right.
	37
c6517b78 LA	38	* Pad the entire struct to a multiple of 64-bits if the structure contains
	39	64-bit types - the structure size will otherwise differ on 32-bit versus
	40	64-bit. Having a different structure size hurts when passing arrays of
	41	structures to the kernel, or if the kernel checks the structure size, which
	42	e.g. the drm core does.
efe4a772 ME	43
	44	* Pointers are __u64, cast from/to a uintprt_t on the userspace side and
	45	from/to a void __user * in the kernel. Try really hard not to delay this
	46	conversion or worse, fiddle the raw __u64 through your code since that
c6517b78 LA	47	diminishes the checking tools like sparse can provide. The macro
c6517b78 LA	48	u64_to_user_ptr can be used in the kernel to avoid warnings about integers
b275fb60	49	and pointers of different sizes.
efe4a772 ME	50
	51
	52	Basics
	53	------
	54
	55	With the joys of writing a compat layer avoided we can take a look at the basic
	56	fumbles. Neglecting these will make backward and forward compatibility a real
	57	pain. And since getting things wrong on the first attempt is guaranteed you
	58	will have a second iteration or at least an extension for any given interface.
	59
	60	* Have a clear way for userspace to figure out whether your new ioctl or ioctl
	61	extension is supported on a given kernel. If you can't rely on old kernels
	62	rejecting the new flags/modes or ioctls (since doing that was botched in the
	63	past) then you need a driver feature flag or revision number somewhere.
	64
	65	* Have a plan for extending ioctls with new flags or new fields at the end of
	66	the structure. The drm core checks the passed-in size for each ioctl call
	67	and zero-extends any mismatches between kernel and userspace. That helps,
	68	but isn't a complete solution since newer userspace on older kernels won't
	69	notice that the newly added fields at the end get ignored. So this still
	70	needs a new driver feature flags.
	71
	72	* Check all unused fields and flags and all the padding for whether it's 0,
	73	and reject the ioctl if that's not the case. Otherwise your nice plan for
	74	future extensions is going right down the gutters since someone will submit
	75	an ioctl struct with random stack garbage in the yet unused parts. Which
	76	then bakes in the ABI that those fields can never be used for anything else
1897e8f3 DV	77	but garbage. This is also the reason why you must explicitly pad all
	78	structures, even if you never use them in an array - the padding the compiler
	79	might insert could contain garbage.
efe4a772 ME	80
	81	* Have simple testcases for all of the above.
	82
	83
	84	Fun with Error Paths
	85	--------------------
	86
	87	Nowadays we don't have any excuse left any more for drm drivers being neat
	88	little root exploits. This means we both need full input validation and solid
	89	error handling paths - GPUs will die eventually in the oddmost corner cases
	90	anyway:
	91
	92	* The ioctl must check for array overflows. Also it needs to check for
	93	over/underflows and clamping issues of integer values in general. The usual
	94	example is sprite positioning values fed directly into the hardware with the
	95	hardware just having 12 bits or so. Works nicely until some odd display
	96	server doesn't bother with clamping itself and the cursor wraps around the
	97	screen.
	98
	99	* Have simple testcases for every input validation failure case in your ioctl.
	100	Check that the error code matches your expectations. And finally make sure
	101	that you only test for one single error path in each subtest by submitting
	102	otherwise perfectly valid data. Without this an earlier check might reject
	103	the ioctl already and shadow the codepath you actually want to test, hiding
	104	bugs and regressions.
	105
	106	* Make all your ioctls restartable. First X really loves signals and second
	107	this will allow you to test 90% of all error handling paths by just
	108	interrupting your main test suite constantly with signals. Thanks to X's
	109	love for signal you'll get an excellent base coverage of all your error
	110	paths pretty much for free for graphics drivers. Also, be consistent with
	111	how you handle ioctl restarting - e.g. drm has a tiny drmIoctl helper in its
	112	userspace library. The i915 driver botched this with the set_tiling ioctl,
	113	now we're stuck forever with some arcane semantics in both the kernel and
	114	userspace.
	115
	116	* If you can't make a given codepath restartable make a stuck task at least
	117	killable. GPUs just die and your users won't like you more if you hang their
	118	entire box (by means of an unkillable X process). If the state recovery is
	119	still too tricky have a timeout or hangcheck safety net as a last-ditch
	120	effort in case the hardware has gone bananas.
	121
	122	* Have testcases for the really tricky corner cases in your error recovery code
	123	- it's way too easy to create a deadlock between your hangcheck code and
	124	waiters.
	125
	126
	127	Time, Waiting and Missing it
	128	----------------------------
	129
	130	GPUs do most everything asynchronously, so we have a need to time operations and
d53a7b8f	131	wait for outstanding ones. This is really tricky business; at the moment none of
efe4a772 ME	132	the ioctls supported by the drm/i915 get this fully right, which means there's
	133	still tons more lessons to learn here.
	134
	135	* Use CLOCK_MONOTONIC as your reference time, always. It's what alsa, drm and
	136	v4l use by default nowadays. But let userspace know which timestamps are
	137	derived from different clock domains like your main system clock (provided
	138	by the kernel) or some independent hardware counter somewhere else. Clocks
	139	will mismatch if you look close enough, but if performance measuring tools
	140	have this information they can at least compensate. If your userspace can
	141	get at the raw values of some clocks (e.g. through in-command-stream
	142	performance counter sampling instructions) consider exposing those also.
	143
	144	* Use __s64 seconds plus __u64 nanoseconds to specify time. It's not the most
	145	convenient time specification, but it's mostly the standard.
	146
	147	* Check that input time values are normalized and reject them if not. Note
	148	that the kernel native struct ktime has a signed integer for both seconds
	149	and nanoseconds, so beware here.
	150
	151	* For timeouts, use absolute times. If you're a good fellow and made your
	152	ioctl restartable relative timeouts tend to be too coarse and can
	153	indefinitely extend your wait time due to rounding on each restart.
	154	Especially if your reference clock is something really slow like the display
d53a7b8f	155	frame counter. With a spec lawyer hat on this isn't a bug since timeouts can
efe4a772 ME	156	always be extended - but users will surely hate you if their neat animations
	157	starts to stutter due to this.
	158
	159	* Consider ditching any synchronous wait ioctls with timeouts and just deliver
	160	an asynchronous event on a pollable file descriptor. It fits much better
	161	into event driven applications' main loop.
	162
	163	* Have testcases for corner-cases, especially whether the return values for
	164	already-completed events, successful waits and timed-out waits are all sane
	165	and suiting to your needs.
	166
	167
	168	Leaking Resources, Not
	169	----------------------
	170
	171	A full-blown drm driver essentially implements a little OS, but specialized to
	172	the given GPU platforms. This means a driver needs to expose tons of handles
	173	for different objects and other resources to userspace. Doing that right
	174	entails its own little set of pitfalls:
	175
	176	* Always attach the lifetime of your dynamically created resources to the
	177	lifetime of a file descriptor. Consider using a 1:1 mapping if your resource
	178	needs to be shared across processes - fd-passing over unix domain sockets
	179	also simplifies lifetime management for userspace.
	180
	181	* Always have O_CLOEXEC support.
	182
	183	* Ensure that you have sufficient insulation between different clients. By
	184	default pick a private per-fd namespace which forces any sharing to be done
d53a7b8f	185	explicitly. Only go with a more global per-device namespace if the objects
efe4a772 ME	186	are truly device-unique. One counterexample in the drm modeset interfaces is
	187	that the per-device modeset objects like connectors share a namespace with
	188	framebuffer objects, which mostly are not shared at all. A separate
	189	namespace, private by default, for framebuffers would have been more
	190	suitable.
	191
	192	* Think about uniqueness requirements for userspace handles. E.g. for most drm
	193	drivers it's a userspace bug to submit the same object twice in the same
	194	command submission ioctl. But then if objects are shareable userspace needs
	195	to know whether it has seen an imported object from a different process
	196	already or not. I haven't tried this myself yet due to lack of a new class
	197	of objects, but consider using inode numbers on your shared file descriptors
	198	as unique identifiers - it's how real files are told apart, too.
	199	Unfortunately this requires a full-blown virtual filesystem in the kernel.
	200
	201
	202	Last, but not Least
	203	-------------------
	204
	205	Not every problem needs a new ioctl:
	206
	207	* Think hard whether you really want a driver-private interface. Of course
	208	it's much quicker to push a driver-private interface than engaging in
	209	lengthy discussions for a more generic solution. And occasionally doing a
	210	private interface to spearhead a new concept is what's required. But in the
	211	end, once the generic interface comes around you'll end up maintainer two
	212	interfaces. Indefinitely.
	213
	214	* Consider other interfaces than ioctls. A sysfs attribute is much better for
	215	per-device settings, or for child objects with fairly static lifetimes (like
	216	output connectors in drm with all the detection override attributes). Or
	217	maybe only your testsuite needs this interface, and then debugfs with its
	218	disclaimer of not having a stable ABI would be better.
	219
	220	Finally, the name of the game is to get it right on the first attempt, since if
	221	your driver proves popular and your hardware platforms long-lived then you'll
	222	be stuck with a given ioctl essentially forever. You can try to deprecate
	223	horrible ioctls on newer iterations of your hardware, but generally it takes
	224	years to accomplish this. And then again years until the last user able to
	225	complain about regressions disappears, too.