--- /dev/null
+ ================
+ ARM64 ELF hwcaps
+ ================
+
+ This document describes the usage and semantics of the arm64 ELF hwcaps.
+
+
+ 1. Introduction
+ ---------------
+
+ Some hardware or software features are only available on some CPU
+ implementations, and/or with certain kernel configurations, but have no
+ architected discovery mechanism available to userspace code at EL0. The
+ kernel exposes the presence of these features to userspace through a set
+ of flags called hwcaps, exposed in the auxiliary vector.
+
+ Userspace software can test for features by acquiring the AT_HWCAP or
+ AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant
+ flags are set, e.g.::
+
+ bool floating_point_is_present(void)
+ {
+ unsigned long hwcaps = getauxval(AT_HWCAP);
+ if (hwcaps & HWCAP_FP)
+ return true;
+
+ return false;
+ }
+
+ Where software relies on a feature described by a hwcap, it should check
+ the relevant hwcap flag to verify that the feature is present before
+ attempting to make use of the feature.
+
+ Features cannot be probed reliably through other means. When a feature
+ is not available, attempting to use it may result in unpredictable
+ behaviour, and is not guaranteed to result in any reliable indication
+ that the feature is unavailable, such as a SIGILL.
+
+
+ 2. Interpretation of hwcaps
+ ---------------------------
+
+ The majority of hwcaps are intended to indicate the presence of features
+ which are described by architected ID registers inaccessible to
+ userspace code at EL0. These hwcaps are defined in terms of ID register
+ fields, and should be interpreted with reference to the definition of
+ these fields in the ARM Architecture Reference Manual (ARM ARM).
+
+ Such hwcaps are described below in the form::
+
+ Functionality implied by idreg.field == val.
+
+ Such hwcaps indicate the availability of functionality that the ARM ARM
+ defines as being present when idreg.field has value val, but do not
+ indicate that idreg.field is precisely equal to val, nor do they
+ indicate the absence of functionality implied by other values of
+ idreg.field.
+
+ Other hwcaps may indicate the presence of features which cannot be
+ described by ID registers alone. These may be described without
+ reference to ID registers, and may refer to other documentation.
+
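As a minimal sketch of the interpretation rule above: each flag reports that a piece of functionality is available, and each flag is tested on its own. The helpers take the AT_HWCAP word as a parameter so they can be exercised on any host. The EX_HWCAP_* names and their bit positions are assumptions made for this illustration; real code should use the HWCAP_* definitions from the kernel's uapi headers together with getauxval(AT_HWCAP).

```c
#include <stdbool.h>

/* Illustrative stand-ins; the bit positions are assumed for this sketch. */
#define EX_HWCAP_AES	(1UL << 3)
#define EX_HWCAP_PMULL	(1UL << 4)

/* The AES functionality may be used only if its flag is set. */
static bool ex_can_use_aes(unsigned long hwcaps)
{
	return (hwcaps & EX_HWCAP_AES) != 0;
}

/*
 * PMULL is reported by its own flag. Test it directly rather than
 * inferring it from, or ruling it out because of, the AES flag.
 */
static bool ex_can_use_pmull(unsigned long hwcaps)
{
	return (hwcaps & EX_HWCAP_PMULL) != 0;
}
```

A caller would typically evaluate these once at startup and select an implementation accordingly.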
+
+ 3. The hwcaps exposed in AT_HWCAP
+ ---------------------------------
+
+ HWCAP_FP
+ Functionality implied by ID_AA64PFR0_EL1.FP == 0b0000.
+
+ HWCAP_ASIMD
+ Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0000.
+
+ HWCAP_EVTSTRM
+ The generic timer is configured to generate events at a frequency of
+ approximately 100KHz.
+
+ HWCAP_AES
+ Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0001.
+
+ HWCAP_PMULL
+ Functionality implied by ID_AA64ISAR0_EL1.AES == 0b0010.
+
+ HWCAP_SHA1
+ Functionality implied by ID_AA64ISAR0_EL1.SHA1 == 0b0001.
+
+ HWCAP_SHA2
+ Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0001.
+
+ HWCAP_CRC32
+ Functionality implied by ID_AA64ISAR0_EL1.CRC32 == 0b0001.
+
+ HWCAP_ATOMICS
+ Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0010.
+
+ HWCAP_FPHP
+ Functionality implied by ID_AA64PFR0_EL1.FP == 0b0001.
+
+ HWCAP_ASIMDHP
+ Functionality implied by ID_AA64PFR0_EL1.AdvSIMD == 0b0001.
+
+ HWCAP_CPUID
+ EL0 access to certain ID registers is available, to the extent
+ described by Documentation/arm64/cpu-feature-registers.rst.
+
+ These ID registers may imply the availability of features.
+
+ HWCAP_ASIMDRDM
+ Functionality implied by ID_AA64ISAR0_EL1.RDM == 0b0001.
+
+ HWCAP_JSCVT
+ Functionality implied by ID_AA64ISAR1_EL1.JSCVT == 0b0001.
+
+ HWCAP_FCMA
+ Functionality implied by ID_AA64ISAR1_EL1.FCMA == 0b0001.
+
+ HWCAP_LRCPC
+ Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0001.
+
+ HWCAP_DCPOP
+ Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0001.
+
+ HWCAP2_DCPODP
+
+ Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010.
+
+ HWCAP_SHA3
+ Functionality implied by ID_AA64ISAR0_EL1.SHA3 == 0b0001.
+
+ HWCAP_SM3
+ Functionality implied by ID_AA64ISAR0_EL1.SM3 == 0b0001.
+
+ HWCAP_SM4
+ Functionality implied by ID_AA64ISAR0_EL1.SM4 == 0b0001.
+
+ HWCAP_ASIMDDP
+ Functionality implied by ID_AA64ISAR0_EL1.DP == 0b0001.
+
+ HWCAP_SHA512
+ Functionality implied by ID_AA64ISAR0_EL1.SHA2 == 0b0010.
+
+ HWCAP_SVE
+ Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001.
+
+ HWCAP2_SVE2
+
+ Functionality implied by ID_AA64ZFR0_EL1.SVEVer == 0b0001.
+
+ HWCAP2_SVEAES
+
+ Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001.
+
+ HWCAP2_SVEPMULL
+
+ Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010.
+
+ HWCAP2_SVEBITPERM
+
+ Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001.
+
+ HWCAP2_SVESHA3
+
+ Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001.
+
+ HWCAP2_SVESM4
+
+ Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001.
+
+ HWCAP_ASIMDFHM
+ Functionality implied by ID_AA64ISAR0_EL1.FHM == 0b0001.
+
+ HWCAP_DIT
+ Functionality implied by ID_AA64PFR0_EL1.DIT == 0b0001.
+
+ HWCAP_USCAT
+ Functionality implied by ID_AA64MMFR2_EL1.AT == 0b0001.
+
+ HWCAP_ILRCPC
+ Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0010.
+
+ HWCAP_FLAGM
+ Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0001.
+
++HWCAP2_FLAGM2
++
++ Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010.
++
+ HWCAP_SSBS
+ Functionality implied by ID_AA64PFR1_EL1.SSBS == 0b0010.
+
+ HWCAP_PACA
+ Functionality implied by ID_AA64ISAR1_EL1.APA == 0b0001 or
+ ID_AA64ISAR1_EL1.API == 0b0001, as described by
+ Documentation/arm64/pointer-authentication.rst.
+
+ HWCAP_PACG
+ Functionality implied by ID_AA64ISAR1_EL1.GPA == 0b0001 or
+ ID_AA64ISAR1_EL1.GPI == 0b0001, as described by
+ Documentation/arm64/pointer-authentication.rst.
+
++HWCAP2_FRINT
++
++ Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001.
++
+
+ 4. Unused AT_HWCAP bits
+ -----------------------
+
+ For interoperation with userspace, the kernel guarantees that bits 62
+ and 63 of AT_HWCAP will always be returned as 0.
--- /dev/null
+ ===================================================
+ Scalable Vector Extension support for AArch64 Linux
+ ===================================================
+
+ Author: Dave Martin <Dave.Martin@arm.com>
+
+ Date: 4 August 2017
+
+ This document outlines briefly the interface provided to userspace by Linux in
+ order to support use of the ARM Scalable Vector Extension (SVE).
+
+ This is an outline of the most important features and issues only and is not
+ intended to be exhaustive.
+
+ This document does not aim to describe the SVE architecture or programmer's
+ model. To aid understanding, a minimal description of relevant programmer's
+ model features for SVE is included in Appendix A.
+
+
+ 1. General
+ -----------
+
+ * SVE registers Z0..Z31, P0..P15 and FFR, and the current vector length VL, are
+ tracked per-thread.
+
+ * The presence of SVE is reported to userspace via HWCAP_SVE in the aux vector
+ AT_HWCAP entry. Presence of this flag implies the presence of the SVE
+ instructions and registers, and the Linux-specific system interfaces
+ described in this document. SVE is reported in /proc/cpuinfo as "sve".
+
+ * Support for the execution of SVE instructions in userspace can also be
+ detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS
+ instruction, and checking that the value of the SVE field is nonzero. [3]
+
+ It does not guarantee the presence of the system interfaces described in the
+ following sections: software that needs to verify that those interfaces are
+ present must check for HWCAP_SVE instead.
+
+ * On hardware that supports the SVE2 extensions, HWCAP2_SVE2 will also
+ be reported in the AT_HWCAP2 aux vector entry. In addition to this,
+ optional extensions to SVE2 may be reported by the presence of:
+
+ HWCAP2_SVE2
+ HWCAP2_SVEAES
+ HWCAP2_SVEPMULL
+ HWCAP2_SVEBITPERM
+ HWCAP2_SVESHA3
+ HWCAP2_SVESM4
+
+ This list may be extended over time as the SVE architecture evolves.
+
+ These extensions are also reported via the CPU ID register ID_AA64ZFR0_EL1,
+ which userspace can read using an MRS instruction. See elf_hwcaps.rst and
+ cpu-feature-registers.rst for details.
+
+ * Debuggers should restrict themselves to interacting with the target via the
+ NT_ARM_SVE regset. The recommended way of detecting support for this regset
+ is to connect to a target process first and then attempt a
+ ptrace(PTRACE_GETREGSET, pid, NT_ARM_SVE, &iov).
+
++* Whenever SVE scalable register values (Zn, Pn, FFR) are exchanged in memory
++ between userspace and the kernel, the register value is encoded in memory in
++ an endianness-invariant layout, with bits [(8 * i + 7) : (8 * i)] encoded at
++ byte offset i from the start of the memory representation. This affects for
++ example the signal frame (struct sve_context) and ptrace interface
++ (struct user_sve_header) and associated data.
++
++ Beware that on big-endian systems this results in a different byte order than
++ for the FPSIMD V-registers, which are stored as single host-endian 128-bit
++ values, with bits [(127 - 8 * i) : (120 - 8 * i)] of the register encoded at
++ byte offset i. (struct fpsimd_context, struct user_fpsimd_state).
++
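The two byte orders described above can be sketched as follows. This is an illustration only: it handles just the low 128 bits of a register, passed as two 64-bit halves, and the ex_* function names are hypothetical rather than part of any kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store a 128-bit value (lo = bits [63:0], hi = bits [127:64]) in the
 * endianness-invariant layout used for the SVE scalable registers:
 * bits [(8 * i + 7) : (8 * i)] go to byte offset i.
 */
static void ex_store_sve_layout(uint64_t lo, uint64_t hi, uint8_t out[16])
{
	assert(out != NULL);
	for (size_t i = 0; i < 8; i++) {
		out[i] = (uint8_t)(lo >> (8 * i));
		out[8 + i] = (uint8_t)(hi >> (8 * i));
	}
}

/*
 * Store the same value as a single big-endian 128-bit quantity, which is
 * what the text describes for the FPSIMD V-registers on a big-endian
 * system: bits [(127 - 8 * i) : (120 - 8 * i)] go to byte offset i.
 */
static void ex_store_be128_layout(uint64_t lo, uint64_t hi, uint8_t out[16])
{
	assert(out != NULL);
	for (size_t i = 0; i < 8; i++) {
		out[i] = (uint8_t)(hi >> (56 - 8 * i));
		out[8 + i] = (uint8_t)(lo >> (56 - 8 * i));
	}
}
```

For the same input the two functions produce byte sequences that are the reverse of each other, which is the difference a big-endian debugger or signal handler has to account for.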
+
+ 2. Vector length terminology
+ -----------------------------
+
+ The size of an SVE vector (Z) register is referred to as the "vector length".
+
+ To avoid confusion about the units used to express vector length, the kernel
+ adopts the following conventions:
+
+ * Vector length (VL) = size of a Z-register in bytes
+
+ * Vector quadwords (VQ) = size of a Z-register in units of 128 bits
+
+ (So, VL = 16 * VQ.)
+
+ The VQ convention is used where the underlying granularity is important, such
+ as in data structure definitions. In most other situations, the VL convention
+ is used. This is consistent with the meaning of the "VL" pseudo-register in
+ the SVE instruction set architecture.
+
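The relationship VL = 16 * VQ can be expressed as a pair of conversion helpers. The sketch below uses hypothetical ex_* names; this document refers to a helper called sve_vq_from_vl(), which should be preferred in real code.

```c
#include <assert.h>

#define EX_SVE_VQ_BYTES	16	/* one quadword is 128 bits, i.e. 16 bytes */

/* Convert a vector length in bytes to a count of 128-bit quadwords. */
static unsigned int ex_vq_from_vl(unsigned int vl)
{
	assert(vl % EX_SVE_VQ_BYTES == 0);
	return vl / EX_SVE_VQ_BYTES;
}

/* Convert a count of 128-bit quadwords to a vector length in bytes. */
static unsigned int ex_vl_from_vq(unsigned int vq)
{
	return vq * EX_SVE_VQ_BYTES;
}
```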
+
+ 3. System call behaviour
+ -------------------------
+
+ * On syscall, V0..V31 are preserved (as without SVE). Thus, bits [127:0] of
+ Z0..Z31 are preserved. All other bits of Z0..Z31, and all of P0..P15 and FFR
+ become unspecified on return from a syscall.
+
+ * The SVE registers are not used to pass arguments to or receive results from
+ any syscall.
+
+ * In practice the affected registers/bits will be preserved or will be replaced
+ with zeros on return from a syscall, but userspace should not make
+ assumptions about this. The kernel behaviour may vary on a case-by-case
+ basis.
+
+ * All other SVE state of a thread, including the currently configured vector
+ length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector
+ length (if any), is preserved across all syscalls, subject to the specific
+ exceptions for execve() described in section 6.
+
+ In particular, on return from a fork() or clone(), the parent and new child
+ process or thread share identical SVE configuration, matching that of the
+ parent before the call.
+
+
+ 4. Signal handling
+ -------------------
+
+ * A new signal frame record sve_context encodes the SVE registers on signal
+ delivery. [1]
+
+ * This record is supplementary to fpsimd_context. The FPSR and FPCR registers
+ are only present in fpsimd_context. For convenience, the content of V0..V31
+ is duplicated between sve_context and fpsimd_context.
+
+ * The signal frame record for SVE always contains basic metadata, in particular
+ the thread's vector length (in sve_context.vl).
+
+ * The SVE registers may or may not be included in the record, depending on
+ whether the registers are live for the thread. The registers are present if
+ and only if:
+ sve_context.head.size >= SVE_SIG_CONTEXT_SIZE(sve_vq_from_vl(sve_context.vl)).
+
+ * If the registers are present, the remainder of the record has a vl-dependent
+ size and layout. Macros SVE_SIG_* are defined [1] to facilitate access to
+ the members.
+
++* Each scalable register (Zn, Pn, FFR) is stored in an endianness-invariant
++ layout, with bits [(8 * i + 7) : (8 * i)] stored at byte offset i from the
++ start of the register's representation in memory.
++
+ * If the SVE context is too big to fit in sigcontext.__reserved[], then extra
+ space is allocated on the stack and an extra_context record referencing this
+ space is written in __reserved[]. sve_context is then written in the extra
+ space. Refer to [1] for further details about this mechanism.
+
+
+ 5. Signal return
+ -----------------
+
+ When returning from a signal handler:
+
+ * If there is no sve_context record in the signal frame, or if the record is
+ present but contains no register data as described in the previous section,
+ then the SVE registers/bits become non-live and take unspecified values.
+
+ * If sve_context is present in the signal frame and contains full register
+ data, the SVE registers become live and are populated with the specified
+ data. However, for backward compatibility reasons, bits [127:0] of Z0..Z31
+ are always restored from the corresponding members of fpsimd_context.vregs[]
+ and not from sve_context. The remaining bits are restored from sve_context.
+
+ * Inclusion of fpsimd_context in the signal frame remains mandatory,
+ irrespective of whether sve_context is present or not.
+
+ * The vector length cannot be changed via signal return. If sve_context.vl in
+ the signal frame does not match the current vector length, the signal return
+ attempt is treated as illegal, resulting in a forced SIGSEGV.
+
+
+ 6. prctl extensions
+ --------------------
+
+ Some new prctl() calls are added to allow programs to manage the SVE vector
+ length:
+
+ prctl(PR_SVE_SET_VL, unsigned long arg)
+
+ Sets the vector length of the calling thread and related flags, where
+ arg == vl | flags. Other threads of the calling process are unaffected.
+
+ vl is the desired vector length, where sve_vl_valid(vl) must be true.
+
+ flags:
+
+ PR_SVE_SET_VL_INHERIT
+
+ Inherit the current vector length across execve(). Otherwise, the
+ vector length is reset to the system default at execve(). (See
+ Section 9.)
+
+ PR_SVE_SET_VL_ONEXEC
+
+ Defer the requested vector length change until the next execve()
+ performed by this thread.
+
+ The effect is equivalent to implicit execution of the following
+ call immediately after the next execve() (if any) by the thread:
+
+ prctl(PR_SVE_SET_VL, arg & ~PR_SVE_SET_VL_ONEXEC)
+
+ This allows launching of a new program with a different vector
+ length, while avoiding runtime side effects in the caller.
+
+
+ Without PR_SVE_SET_VL_ONEXEC, the requested change takes effect
+ immediately.
+
+
+ Return value: a nonnegative value on success, or a negative value on error:
+ EINVAL: SVE not supported, invalid vector length requested, or
+ invalid flags.
+
+
+ On success:
+
+ * Either the calling thread's vector length or the deferred vector length
+ to be applied at the next execve() by the thread (dependent on whether
+ PR_SVE_SET_VL_ONEXEC is present in arg), is set to the largest value
+ supported by the system that is less than or equal to vl. If vl ==
+ SVE_VL_MAX, the value set will be the largest value supported by the
+ system.
+
+ * Any previously outstanding deferred vector length change in the calling
+ thread is cancelled.
+
+ * The returned value describes the resulting configuration, encoded as for
+ PR_SVE_GET_VL. The vector length reported in this value is the new
+ current vector length for this thread if PR_SVE_SET_VL_ONEXEC was not
+ present in arg; otherwise, the reported vector length is the deferred
+ vector length that will be applied at the next execve() by the calling
+ thread.
+
+ * Changing the vector length causes all of P0..P15, FFR and all bits of
+ Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become
+ unspecified. Calling PR_SVE_SET_VL with vl equal to the thread's current
+ vector length, or calling PR_SVE_SET_VL with the PR_SVE_SET_VL_ONEXEC
+ flag, does not constitute a change to the vector length for this purpose.
+
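The selection rule in the first "On success" item (the largest supported value that is less than or equal to vl) can be modelled as below. This is a sketch, not the kernel's implementation: the supported[] array is a hypothetical stand-in for the set of vector lengths the system supports, and returning 0 when nothing fits is a convention of this example only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the largest entry of supported[] that is <= requested, or 0 if
 * there is no such entry. All values are vector lengths in bytes.
 */
static unsigned int ex_pick_vl(const unsigned int *supported, size_t n,
			       unsigned int requested)
{
	unsigned int best = 0;

	assert(supported != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (supported[i] <= requested && supported[i] > best)
			best = supported[i];
	}
	return best;
}
```

With a supported set of 16, 32 and 64 bytes, a request for 48 yields 32 and a very large request yields 64, matching the behaviour described for vl == SVE_VL_MAX.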
+
+ prctl(PR_SVE_GET_VL)
+
+ Gets the vector length of the calling thread.
+
+ The following flag may be OR-ed into the result:
+
+ PR_SVE_SET_VL_INHERIT
+
+ Vector length will be inherited across execve().
+
+ There is no way to determine whether there is an outstanding deferred
+ vector length change (which would only normally be the case between a
+ fork() or vfork() and the corresponding execve() in typical use).
+
+ To extract the vector length from the result, bitwise-AND it with
+ PR_SVE_VL_LEN_MASK.
+
+ Return value: a nonnegative value on success, or a negative value on error:
+ EINVAL: SVE not supported.
+
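Decoding a successful PR_SVE_GET_VL or PR_SVE_SET_VL return value might look like the sketch below. The EX_* constants are illustrative stand-ins whose numeric values are assumptions of this example; real code should take the mask and flag definitions from <linux/prctl.h>.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values assumed for this sketch. */
#define EX_SVE_VL_LEN_MASK	0xffff
#define EX_SVE_VL_INHERIT	(1 << 17)

struct ex_sve_vl_config {
	unsigned int vl;	/* vector length in bytes */
	bool inherit;		/* is the length inherited across execve()? */
};

/* Split a nonnegative prctl() result into its length and flag parts. */
static struct ex_sve_vl_config ex_decode_vl(int ret)
{
	struct ex_sve_vl_config cfg;

	assert(ret >= 0);	/* negative values indicate an error */
	cfg.vl = (unsigned int)ret & EX_SVE_VL_LEN_MASK;
	cfg.inherit = (ret & EX_SVE_VL_INHERIT) != 0;
	return cfg;
}
```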
+
+ 7. ptrace extensions
+ ---------------------
+
+ * A new regset NT_ARM_SVE is defined for use with PTRACE_GETREGSET and
+ PTRACE_SETREGSET.
+
+ Refer to [2] for definitions.
+
+ The regset data starts with struct user_sve_header, containing:
+
+ size
+
+ Size of the complete regset, in bytes.
+ This depends on vl and possibly on other things in the future.
+
+ If a call to PTRACE_GETREGSET requests less data than the value of
+ size, the caller can allocate a larger buffer and retry in order to
+ read the complete regset.
+
+ max_size
+
+ Maximum size in bytes that the regset can grow to for the target
+ thread. The regset won't grow bigger than this even if the target
+ thread changes its vector length etc.
+
+ vl
+
+ Target thread's current vector length, in bytes.
+
+ max_vl
+
+ Maximum possible vector length for the target thread.
+
+ flags
+
+ either
+
+ SVE_PT_REGS_FPSIMD
+
+ SVE registers are not live (GETREGSET) or are to be made
+ non-live (SETREGSET).
+
+ The payload is of type struct user_fpsimd_state, with the same
+ meaning as for NT_PRFPREG, starting at offset
+ SVE_PT_FPSIMD_OFFSET from the start of user_sve_header.
+
+ Extra data might be appended in the future: the size of the
+ payload should be obtained using SVE_PT_FPSIMD_SIZE(vq, flags).
+
+ vq should be obtained using sve_vq_from_vl(vl).
+
+ or
+
+ SVE_PT_REGS_SVE
+
+ SVE registers are live (GETREGSET) or are to be made live
+ (SETREGSET).
+
+ The payload contains the SVE register data, starting at offset
+ SVE_PT_SVE_OFFSET from the start of user_sve_header, and with
+ size SVE_PT_SVE_SIZE(vq, flags);
+
+ ... OR-ed with zero or more of the following flags, which have the same
+ meaning and behaviour as the corresponding PR_SVE_SET_VL_* flags:
+
+ SVE_PT_VL_INHERIT
+
+ SVE_PT_VL_ONEXEC (SETREGSET only).
+
+ * The effects of changing the vector length and/or flags are equivalent to
+ those documented for PR_SVE_SET_VL.
+
+ The caller must make a further GETREGSET call if it needs to know what VL is
+ actually set by SETREGSET, unless it is known in advance that the requested
+ VL is supported.
+
+ * In the SVE_PT_REGS_SVE case, the size and layout of the payload depends on
+ the header fields. The SVE_PT_SVE_*() macros are provided to facilitate
+ access to the members.
+
+ * In either case, for SETREGSET it is permissible to omit the payload, in which
+ case only the vector length and flags are changed (along with any
+ consequences of those changes).
+
+ * For SETREGSET, if an SVE_PT_REGS_SVE payload is present and the
+ requested VL is not supported, the effect will be the same as if the
+ payload were omitted, except that an EIO error is reported. No
+ attempt is made to translate the payload data to the correct layout
+ for the vector length actually set. The thread's FPSIMD state is
+ preserved, but the remaining bits of the SVE registers become
+ unspecified. It is up to the caller to translate the payload layout
+ for the actual VL and retry.
+
+ * The effect of writing a partial, incomplete payload is unspecified.
+
+
+ 8. ELF coredump extensions
+ ---------------------------
+
+ * A NT_ARM_SVE note will be added to each coredump for each thread of the
+ dumped process. The contents will be equivalent to the data that would have
+ been read if a PTRACE_GETREGSET of NT_ARM_SVE were executed for each thread
+ when the coredump was generated.
+
+
+ 9. System runtime configuration
+ --------------------------------
+
+ * To mitigate the ABI impact of expansion of the signal frame, a policy
+ mechanism is provided for administrators, distro maintainers and developers
+ to set the default vector length for userspace processes:
+
+ /proc/sys/abi/sve_default_vector_length
+
+ Writing the text representation of an integer to this file sets the system
+ default vector length to the specified value, unless the value is greater
+ than the maximum vector length supported by the system in which case the
+ default vector length is set to that maximum.
+
+ The result can be determined by reopening the file and reading its
+ contents.
+
+ At boot, the default vector length is initially set to 64 or the maximum
+ supported vector length, whichever is smaller. This determines the initial
+ vector length of the init process (PID 1).
+
+ Reading this file returns the current system default vector length.
+
+ * At every execve() call, the new vector length of the new process is set to
+ the system default vector length, unless
+
+ * PR_SVE_SET_VL_INHERIT (or equivalently SVE_PT_VL_INHERIT) is set for the
+ calling thread, or
+
+ * a deferred vector length change is pending, established via the
+ PR_SVE_SET_VL_ONEXEC flag (or SVE_PT_VL_ONEXEC).
+
+ * Modifying the system default vector length does not affect the vector length
+ of any existing process or thread that does not make an execve() call.
+
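The policy described in this section can be summarised with a small model. It is a sketch under stated assumptions, not kernel code: the ex_* names are hypothetical, a deferred length of 0 is used here to mean "none pending", and giving a pending deferred change precedence over the inherit flag is this example's reading of the PR_SVE_SET_VL_ONEXEC description in section 6.

```c
#include <assert.h>
#include <stdbool.h>

/* Default at boot: 64 or the maximum supported VL, whichever is smaller. */
static unsigned int ex_boot_default_vl(unsigned int max_vl)
{
	assert(max_vl >= 16);
	return max_vl < 64 ? max_vl : 64;
}

/* Writing a value above the maximum sets the default to the maximum. */
static unsigned int ex_write_default_vl(unsigned int requested,
					unsigned int max_vl)
{
	assert(max_vl >= 16);
	return requested > max_vl ? max_vl : requested;
}

/* Vector length of the new program after execve(). */
static unsigned int ex_vl_after_execve(unsigned int current_vl, bool inherit,
				       unsigned int deferred_vl,
				       unsigned int default_vl)
{
	if (deferred_vl != 0)
		return deferred_vl;	/* pending ONEXEC change */
	if (inherit)
		return current_vl;	/* INHERIT flag set */
	return default_vl;		/* system default otherwise */
}
```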
+
+ Appendix A. SVE programmer's model (informative)
+ =================================================
+
+ This section provides a minimal description of the additions made by SVE to the
+ ARMv8-A programmer's model that are relevant to this document.
+
+ Note: This section is for information only and is not intended to be complete or
+ to replace any architectural specification.
+
+ A.1. Registers
+ ---------------
+
+ In A64 state, SVE adds the following:
+
+ * 32 8VL-bit vector registers Z0..Z31
+ For each Zn, Zn bits [127:0] alias the ARMv8-A vector register Vn.
+
+ A register write using a Vn register name zeros all bits of the corresponding
+ Zn except for bits [127:0].
+
+ * 16 VL-bit predicate registers P0..P15
+
+ * 1 VL-bit special-purpose predicate register FFR (the "first-fault register")
+
+ * a VL "pseudo-register" that determines the size of each vector register
+
+ The SVE instruction set architecture provides no way to write VL directly.
+ Instead, it can be modified only by EL1 and above, by writing appropriate
+ system registers.
+
+ * The value of VL can be configured at runtime by EL1 and above:
+ 16 <= VL <= VLmax, where VL must be a multiple of 16.
+
+ * The maximum vector length is determined by the hardware:
+ 16 <= VLmax <= 256.
+
+ (The SVE architecture specifies 256, but permits future architecture
+ revisions to raise this limit.)
+
+ * FPSR and FPCR are retained from ARMv8-A, and interact with SVE floating-point
+ operations in a similar way to the way in which they interact with ARMv8
+ floating-point operations::
+
+ 8VL-1 128 0 bit index
+ +---- //// -----------------+
+ Z0 | : V0 |
+ : :
+ Z7 | : V7 |
+ Z8 | : * V8 |
+ : : :
+ Z15 | : *V15 |
+ Z16 | : V16 |
+ : :
+ Z31 | : V31 |
+ +---- //// -----------------+
+ 31 0
+ VL-1 0 +-------+
+ +---- //// --+ FPSR | |
+ P0 | | +-------+
+ : | | *FPCR | |
+ P15 | | +-------+
+ +---- //// --+
+ FFR | | +-----+
+ +---- //// --+ VL | |
+ +-----+
+
+ (*) callee-save:
+ This only applies to bits [63:0] of Z-/V-registers.
+ FPCR contains callee-save and caller-save bits. See [4] for details.
+
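From the register list above, the raw amount of SVE register state for a given vector length follows directly: 32 Z-registers of VL bytes each, plus 16 P-registers and FFR of VL / 8 bytes each. The helper below is a sketch with a hypothetical name; it counts register bits only and deliberately ignores any headers, padding, FPSR or FPCR that the signal frame or ptrace layouts add.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Raw size in bytes of Z0..Z31, P0..P15 and FFR for a vector length of
 * vl bytes. Each Z-register is vl bytes (8 * vl bits); each predicate
 * register, including FFR, is vl bits, i.e. vl / 8 bytes.
 */
static size_t ex_sve_regs_bytes(size_t vl)
{
	assert(vl >= 16 && vl % 16 == 0);
	return 32 * vl + (16 + 1) * (vl / 8);
}
```

For the minimum vector length of 16 bytes this gives 546 bytes, and for a 256-byte vector length it gives 8736 bytes.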
+
+ A.2. Procedure call standard
+ -----------------------------
+
+ The ARMv8-A base procedure call standard is extended as follows with respect to
+ the additional SVE register state:
+
+ * All SVE register bits that are not shared with FP/SIMD are caller-save.
+
+ * Z8 bits [63:0] .. Z15 bits [63:0] are callee-save.
+
+ This follows from the way these bits are mapped to V8..V15, whose bits [63:0]
+ are callee-save in the base procedure call standard.
+
+
+ Appendix B. ARMv8-A FP/SIMD programmer's model
+ ===============================================
+
+ Note: This section is for information only and is not intended to be complete or
+ to replace any architectural specification.
+
+ Refer to [4] for more information.
+
+ ARMv8-A defines the following floating-point / SIMD register state:
+
+ * 32 128-bit vector registers V0..V31
+ * 2 32-bit status/control registers FPSR, FPCR
+
+ ::
+
+ 127 0 bit index
+ +---------------+
+ V0 | |
+ : : :
+ V7 | |
+ * V8 | |
+ : : : :
+ *V15 | |
+ V16 | |
+ : : :
+ V31 | |
+ +---------------+
+
+ 31 0
+ +-------+
+ FPSR | |
+ +-------+
+ *FPCR | |
+ +-------+
+
+ (*) callee-save:
+ This only applies to bits [63:0] of V-registers.
+ FPCR contains a mixture of callee-save and caller-save bits.
+
+
+ References
+ ==========
+
+ [1] arch/arm64/include/uapi/asm/sigcontext.h
+ AArch64 Linux signal ABI definitions
+
+ [2] arch/arm64/include/uapi/asm/ptrace.h
+ AArch64 Linux ptrace ABI definitions
+
+ [3] Documentation/arm64/cpu-feature-registers.rst
+
+ [4] ARM IHI0055C
+ http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055c/IHI0055C_beta_aapcs64.pdf
+ http://infocenter.arm.com/help/topic/com.arm.doc.subset.swdev.abi/index.html
+ Procedure Call Standard for the ARM 64-bit Architecture (AArch64)
void ktime_get_coarse_boottime_ts64( struct timespec64 * )
void ktime_get_coarse_real_ts64( struct timespec64 * )
void ktime_get_coarse_clocktai_ts64( struct timespec64 * )
- void ktime_get_coarse_raw_ts64( struct timespec64 * )
These are quicker than the non-coarse versions, but less accurate,
- corresponding to CLOCK_MONONOTNIC_COARSE and CLOCK_REALTIME_COARSE
+ corresponding to CLOCK_MONOTONIC_COARSE and CLOCK_REALTIME_COARSE
in user space, along with the equivalent boottime/tai/raw
timebase not available in user space.
--- /dev/null
+ NVMe Fault Injection
+ ====================
+ Linux's fault injection framework provides a systematic way to support
+ error injection via debugfs in the /sys/kernel/debug directory. When
+ enabled, the default NVME_SC_INVALID_OPCODE with no retry will be
+ injected into the nvme_end_request. Users can change the default status
+ code and the no-retry flag via debugfs. The list of Generic Command
+ Status codes can be found in include/linux/nvme.h.
+
+ The following examples show how to inject an error into an NVMe device.
+
+ First, enable the CONFIG_FAULT_INJECTION_DEBUG_FS kernel config option and
+ recompile the kernel. After booting the new kernel, do the
+ following.
+
+ Example 1: Inject default status code with no retry
+ ---------------------------------------------------
+
+ ::
+
+ mount /dev/nvme0n1 /mnt
+ echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
+ echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
+ cp a.file /mnt
+
+ Expected Result::
+
+ cp: cannot stat ‘/mnt/a.file’: Input/output error
+
+ Message from dmesg::
+
+ FAULT_INJECTION: forcing a failure.
+ name fault_inject, interval 1, probability 100, space 0, times 1
+ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
+ Hardware name: innotek GmbH VirtualBox/VirtualBox,
+ BIOS VirtualBox 12/01/2006
+ Call Trace:
+ <IRQ>
+ dump_stack+0x5c/0x7d
+ should_fail+0x148/0x170
+ nvme_should_fail+0x2f/0x50 [nvme_core]
+ nvme_process_cq+0xe7/0x1d0 [nvme]
+ nvme_irq+0x1e/0x40 [nvme]
+ __handle_irq_event_percpu+0x3a/0x190
+ handle_irq_event_percpu+0x30/0x70
+ handle_irq_event+0x36/0x60
+ handle_fasteoi_irq+0x78/0x120
+ handle_irq+0xa7/0x130
+ ? tick_irq_enter+0xa8/0xc0
+ do_IRQ+0x43/0xc0
+ common_interrupt+0xa2/0xa2
+ </IRQ>
+ RIP: 0010:native_safe_halt+0x2/0x10
+ RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
+ RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
+ RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
+ RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
+ R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
+ R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
+ ? __sched_text_end+0x4/0x4
+ default_idle+0x18/0xf0
+ do_idle+0x150/0x1d0
+ cpu_startup_entry+0x6f/0x80
+ start_kernel+0x4c4/0x4e4
+ ? set_init_arg+0x55/0x55
+ secondary_startup_64+0xa5/0xb0
+ print_req_error: I/O error, dev nvme0n1, sector 9240
+ EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
+ inode #2: comm cp: reading directory lblock 0
+
+ Example 2: Inject default status code with retry
+ ------------------------------------------------
+
+ ::
+
+ mount /dev/nvme0n1 /mnt
+ echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
+ echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
+ echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/status
+ echo 0 > /sys/kernel/debug/nvme0n1/fault_inject/dont_retry
+
+ cp a.file /mnt
+
+ Expected Result::
+
+ command succeeds without error
+
+ Message from dmesg::
+
+ FAULT_INJECTION: forcing a failure.
+ name fault_inject, interval 1, probability 100, space 0, times 1
+ CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-rc8+ #4
+ Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
+ Call Trace:
+ <IRQ>
+ dump_stack+0x5c/0x7d
+ should_fail+0x148/0x170
+ nvme_should_fail+0x30/0x60 [nvme_core]
+ nvme_loop_queue_response+0x84/0x110 [nvme_loop]
+ nvmet_req_complete+0x11/0x40 [nvmet]
+ nvmet_bio_done+0x28/0x40 [nvmet]
+ blk_update_request+0xb0/0x310
+ blk_mq_end_request+0x18/0x60
+ flush_smp_call_function_queue+0x3d/0xf0
+ smp_call_function_single_interrupt+0x2c/0xc0
+ call_function_single_interrupt+0xa2/0xb0
+ </IRQ>
+ RIP: 0010:native_safe_halt+0x2/0x10
+ RSP: 0018:ffffc9000068bec0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
+ RAX: ffffffff817a10c0 RBX: ffff88011a3c9680 RCX: 0000000000000000
+ RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
+ RBP: 0000000000000001 R08: 000000008e38c131 R09: 0000000000000000
+ R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011a3c9680
+ R13: ffff88011a3c9680 R14: 0000000000000000 R15: 0000000000000000
+ ? __sched_text_end+0x4/0x4
+ default_idle+0x18/0xf0
+ do_idle+0x150/0x1d0
+ cpu_startup_entry+0x6f/0x80
+ start_secondary+0x187/0x1e0
+ secondary_startup_64+0xa5/0xb0
++
++Example 3: Inject an error into the 10th admin command
++------------------------------------------------------
++
++::
++
++ echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
++ echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
++ echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
++ nvme reset /dev/nvme0
++
++Expected Result::
++
++ After NVMe controller reset, the reinitialization may or may not succeed.
++ It depends on which admin command is actually forced to fail.
++
++Message from dmesg::
++
++ nvme nvme0: resetting controller
++ FAULT_INJECTION: forcing a failure.
++ name fault_inject, interval 1, probability 100, space 1, times 1
++ CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-rc2+ #2
++ Hardware name: MSI MS-7A45/B150M MORTAR ARCTIC (MS-7A45), BIOS 1.50 04/25/2017
++ Call Trace:
++ <IRQ>
++ dump_stack+0x63/0x85
++ should_fail+0x14a/0x170
++ nvme_should_fail+0x38/0x80 [nvme_core]
++ nvme_irq+0x129/0x280 [nvme]
++ ? blk_mq_end_request+0xb3/0x120
++ __handle_irq_event_percpu+0x84/0x1a0
++ handle_irq_event_percpu+0x32/0x80
++ handle_irq_event+0x3b/0x60
++ handle_edge_irq+0x7f/0x1a0
++ handle_irq+0x20/0x30
++ do_IRQ+0x4e/0xe0
++ common_interrupt+0xf/0xf
++ </IRQ>
++ RIP: 0010:cpuidle_enter_state+0xc5/0x460
++ Code: ff e8 8f 5f 86 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 69 03 00 00 31 ff e8 62 aa 8c ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 37 03 00 00 4c 8b 45 d0 4c 2b 45 b8 48 ba cf f7 53
++ RSP: 0018:ffffffff88c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdc
++ RAX: ffff9dac25a2ac80 RBX: ffffffff88d53760 RCX: 000000000000001f
++ RDX: 0000000000000000 RSI: 000000002d958403 RDI: 0000000000000000
++ RBP: ffffffff88c03e18 R08: fffffff75e35ffb7 R09: 00000a49a56c0b48
++ R10: ffffffff88c03da0 R11: 0000000000001b0c R12: ffff9dac25a34d00
++ R13: 0000000000000006 R14: 0000000000000006 R15: ffffffff88d53760
++ cpuidle_enter+0x2e/0x40
++ call_cpuidle+0x23/0x40
++ do_idle+0x201/0x280
++ cpu_startup_entry+0x1d/0x20
++ rest_init+0xaa/0xb0
++ arch_call_rest_init+0xe/0x1b
++ start_kernel+0x51c/0x53b
++ x86_64_start_reservations+0x24/0x26
++ x86_64_start_kernel+0x74/0x77
++ secondary_startup_64+0xa4/0xb0
++ nvme nvme0: Could not set queue count (16385)
++ nvme nvme0: IO queues not created
--- /dev/null
- through the cpuset facility (Documentation/cgroup-v1/cpusets.txt).
+ ========================
+ Deadline Task Scheduling
+ ========================
+
+ .. CONTENTS
+
+ 0. WARNING
+ 1. Overview
+ 2. Scheduling algorithm
+ 2.1 Main algorithm
+ 2.2 Bandwidth reclaiming
+ 2.3 Energy-aware scheduling
+ 3. Scheduling Real-Time Tasks
+ 3.1 Definitions
+ 3.2 Schedulability Analysis for Uniprocessor Systems
+ 3.3 Schedulability Analysis for Multiprocessor Systems
+ 3.4 Relationship with SCHED_DEADLINE Parameters
+ 4. Bandwidth management
+ 4.1 System-wide settings
+ 4.2 Task interface
+ 4.3 Default behavior
+ 4.4 Behavior of sched_yield()
+ 5. Tasks CPU affinity
+ 5.1 SCHED_DEADLINE and cpusets HOWTO
+ 6. Future plans
+ A. Test suite
+ B. Minimal main()
+
+
+ 0. WARNING
+ ==========
+
+ Fiddling with these settings can result in unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+ 1. Overview
+ ===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks from one another.
+
+
+ 2. Scheduling algorithm
+ =======================
+
+ 2.1 Main algorithm
+ ------------------
+
+ SCHED_DEADLINE [18] uses three parameters, named "runtime", "period", and
+ "deadline", to schedule tasks. A SCHED_DEADLINE task should receive
+ "runtime" microseconds of execution time every "period" microseconds, and
+ these "runtime" microseconds are available within "deadline" microseconds
+ from the beginning of the period. In order to implement this behavior,
+ every time the task wakes up, the scheduler computes a "scheduling deadline"
+ consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
+ scheduled using EDF[1] on these scheduling deadlines (the task with the
+ earliest scheduling deadline is selected for execution). Notice that the
+ task actually receives "runtime" time units within "deadline" if a proper
+ "admission control" strategy (see Section "4. Bandwidth management") is used
+ (clearly, if the system is overloaded this guarantee cannot be respected).
+
+ Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
+ that each task runs for at most its runtime every period, avoiding any
+ interference between different tasks (bandwidth isolation), while the EDF[1]
+ algorithm selects the task with the earliest scheduling deadline as the one
+ to be executed next. Thanks to this feature, tasks that do not strictly comply
+ with the "traditional" real-time task model (see Section 3) can effectively
+ use the new policy.
+
+ In more detail, the CBS algorithm assigns scheduling deadlines to
+ tasks in the following way:
+
+ - Each SCHED_DEADLINE task is characterized by the "runtime",
+ "deadline", and "period" parameters;
+
+ - The state of the task is described by a "scheduling deadline", and
+ a "remaining runtime". These two parameters are initially set to 0;
+
+ - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
+ the scheduler checks if::
+
+ remaining runtime runtime
+ ---------------------------------- > ---------
+ scheduling deadline - current time period
+
+ then, if the scheduling deadline is smaller than the current time, or
+ this condition is verified, the scheduling deadline and the
+ remaining runtime are re-initialized as::
+
+ scheduling deadline = current time + deadline
+ remaining runtime = runtime
+
+ otherwise, the scheduling deadline and the remaining runtime are
+ left unchanged;
+
+ - When a SCHED_DEADLINE task executes for an amount of time t, its
+ remaining runtime is decreased as::
+
+ remaining runtime = remaining runtime - t
+
+ (technically, the runtime is decreased at every tick, or when the
+ task is descheduled / preempted);
+
+ - When the remaining runtime becomes less than or equal to 0, the task is
+ said to be "throttled" (also known as "depleted" in real-time literature)
+ and cannot be scheduled until its scheduling deadline. The "replenishment
+ time" for this task (see next item) is set to be equal to the current
+ value of the scheduling deadline;
+
+ - When the current time is equal to the replenishment time of a
+ throttled task, the scheduling deadline and the remaining runtime are
+ updated as::
+
+ scheduling deadline = scheduling deadline + period
+ remaining runtime = remaining runtime + runtime
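+
+ The wakeup, depletion and replenishment rules above can be sketched in C as
+ follows. This is a minimal illustration, not the kernel implementation: the
+ struct cbs_task type and the cbs_*() helpers are hypothetical names, and all
+ times are assumed to be signed integers expressed in a single unit (for
+ example nanoseconds).
+
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task CBS state; the names are illustrative only. */
struct cbs_task {
	int64_t runtime;        /* reserved runtime per period */
	int64_t deadline;       /* relative deadline */
	int64_t period;         /* reservation period */
	int64_t sched_deadline; /* absolute scheduling deadline */
	int64_t remaining;      /* remaining runtime */
};

/*
 * Wakeup rule: re-initialize the reservation if the scheduling deadline is
 * in the past, or if
 *     remaining / (sched_deadline - now) > runtime / period.
 * The test is cross-multiplied to avoid a division; this assumes that the
 * products fit in 64 bits. Returns true if the state was re-initialized.
 */
bool cbs_wakeup(struct cbs_task *t, int64_t now)
{
	int64_t left = t->sched_deadline - now;

	if (left < 0 || t->remaining * t->period > t->runtime * left) {
		t->sched_deadline = now + t->deadline;
		t->remaining = t->runtime;
		return true;
	}
	return false;
}

/* Account for t_exec units of execution; true means the task is throttled. */
bool cbs_consume(struct cbs_task *t, int64_t t_exec)
{
	t->remaining -= t_exec;
	return t->remaining <= 0;
}

/* Replenishment, applied when the replenishment time is reached. */
void cbs_replenish(struct cbs_task *t)
{
	t->sched_deadline += t->period;
	t->remaining += t->runtime;
}
```
+
+ For instance, a task with runtime 10 and deadline = period = 100 that first
+ wakes up at time 5 gets a scheduling deadline of 105 and a remaining runtime
+ of 10.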
+
+ The SCHED_FLAG_DL_OVERRUN flag in sched_attr's sched_flags field allows a task
+ to get informed about runtime overruns through the delivery of SIGXCPU
+ signals.
+
+
+ 2.2 Bandwidth reclaiming
+ ------------------------
+
+ Bandwidth reclaiming for deadline tasks is based on the GRUB (Greedy
+ Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
+ when flag SCHED_FLAG_RECLAIM is set.
+
+ The following diagram illustrates the state names for tasks handled by GRUB::
+
+ ------------
+ (d) | Active |
+ ------------->| |
+ | | Contending |
+ | ------------
+ | A |
+ ---------- | |
+ | | | |
+ | Inactive | |(b) | (a)
+ | | | |
+ ---------- | |
+ A | V
+ | ------------
+ | | Active |
+ --------------| Non |
+ (c) | Contending |
+ ------------
+
+ A task can be in one of the following states:
+
+ - ActiveContending: if it is ready for execution (or executing);
+
+ - ActiveNonContending: if it just blocked and has not yet surpassed the 0-lag
+ time;
+
+ - Inactive: if it is blocked and has surpassed the 0-lag time.
+
+ State transitions:
+
+ (a) When a task blocks, it does not become immediately inactive since its
+ bandwidth cannot be immediately reclaimed without breaking the
+ real-time guarantees. It therefore enters a transitional state called
+ ActiveNonContending. The scheduler arms the "inactive timer" to fire at
+ the 0-lag time, when the task's bandwidth can be reclaimed without
+ breaking the real-time guarantees.
+
+ The 0-lag time for a task entering the ActiveNonContending state is
+ computed as::
+
+ (runtime * dl_period)
+ deadline - ---------------------
+ dl_runtime
+
+ where runtime is the remaining runtime, while dl_runtime and dl_period
+ are the reservation parameters.
+
+ (b) If the task wakes up before the inactive timer fires, the task re-enters
+ the ActiveContending state and the "inactive timer" is canceled.
+ In addition, if the task wakes up on a different runqueue, then
+ the task's utilization must be removed from the previous runqueue's active
+ utilization and must be added to the new runqueue's active utilization.
+ In order to avoid races between a task waking up on a runqueue while the
+ "inactive timer" is running on a different CPU, the "dl_non_contending"
+ flag is used to indicate that a task is not on a runqueue but is active
+ (so, the flag is set when the task blocks and is cleared when the
+ "inactive timer" fires or when the task wakes up).
+
+ (c) When the "inactive timer" fires, the task enters the Inactive state and
+ its utilization is removed from the runqueue's active utilization.
+
+ (d) When an inactive task wakes up, it enters the ActiveContending state and
+ its utilization is added to the active utilization of the runqueue where
+ it has been enqueued.
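+
+ As an illustration of transition (a), the 0-lag time can be computed as in the
+ following sketch. The function name and the use of plain 64-bit integers in a
+ single time unit are assumptions made for this example; it is not the kernel
+ code.
+
```c
#include <stdint.h>

/*
 * 0-lag time of a task that blocks with "remaining" runtime left:
 *     deadline - (remaining * dl_period) / dl_runtime
 * The multiplication is done before the division to limit rounding error.
 * dl_runtime is assumed to be non-zero.
 */
int64_t zero_lag_time(int64_t deadline, int64_t remaining,
		      int64_t dl_runtime, int64_t dl_period)
{
	return deadline - (remaining * dl_period) / dl_runtime;
}
```
+
+ With a scheduling deadline of 8, a remaining runtime of 2, dl_runtime = 4 and
+ dl_period = 8, the 0-lag time is 8 - (2 * 8) / 4 = 4.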
+
+ For each runqueue, the GRUB algorithm keeps track of two different bandwidths:
+
+ - Active bandwidth (running_bw): this is the sum of the bandwidths of all
+ tasks in active state (i.e., ActiveContending or ActiveNonContending);
+
+ - Total bandwidth (this_bw): this is the sum of the bandwidths of all tasks
+ "belonging" to the runqueue, including the tasks in Inactive state.
+
+
+ The algorithm reclaims the bandwidth of the tasks in Inactive state.
+ It does so by decrementing the runtime of the executing task Ti at a pace equal
+ to::
+
+ dq = -max{ Ui / Umax, (1 - Uinact - Uextra) } dt
+
+ where:
+
+ - Ui is the bandwidth of task Ti;
+ - Umax is the maximum reclaimable utilization (subject to RT throttling
+ limits);
+ - Uinact is the (per runqueue) inactive utilization, computed as
+ (this_bw - running_bw);
+ - Uextra is the (per runqueue) extra reclaimable utilization
+ (subject to RT throttling limits).
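+
+ The depletion rule can be sketched as below. This is only an illustration
+ using floating point utilizations; the function names are hypothetical, and
+ Umax is assumed to be greater than zero.
+
```c
/*
 * Rate at which the runtime of the executing task is consumed:
 *     max{ Ui / Umax, 1 - Uinact - Uextra }
 * (this is the magnitude of dq/dt; the runtime decreases at this rate).
 */
double grub_depletion_rate(double u_i, double u_max,
			   double u_inact, double u_extra)
{
	double by_task = u_i / u_max;
	double by_rq = 1.0 - u_inact - u_extra;

	return by_task > by_rq ? by_task : by_rq;
}

/* Remaining runtime after executing for dt at the given depletion rate. */
double grub_charge(double remaining, double rate, double dt)
{
	return remaining - rate * dt;
}
```
+
+ Assuming Umax = 1 and Uextra = 0, a task with Ui = 0.5 is charged at rate 1
+ when Uinact = 0 and at rate 0.5 when Uinact = 0.5.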
+
+
+ Let's now see a trivial example of two deadline tasks with runtime equal
+ to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
+
+ A Task T1
+ |
+ | |
+ | |
+ |-------- |----
+ | | V
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
+
+
+ A Task T2
+ |
+ | |
+ | |
+ | ------------------------|
+ | | V
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
+
+
+ A running_bw
+ |
+ 1 ----------------- ------
+ | | |
+ 0.5- -----------------
+ | |
+ |---|---|---|---|---|---|---|---|--------->t
+ 0 1 2 3 4 5 6 7 8
+
+
+ - Time t = 0:
+
+ Both tasks are ready for execution and therefore in ActiveContending state.
+ Suppose Task T1 is the first task to start execution.
+ Since there are no inactive tasks, its runtime is decreased as dq = -1 dt.
+
+ - Time t = 2:
+
+ Suppose that task T1 blocks.
+ Task T1 therefore enters the ActiveNonContending state. Since its remaining
+ runtime is equal to 2, its 0-lag time is equal to t = 4.
+ Task T2 starts execution, with runtime still decreased as dq = -1 dt since
+ there are no inactive tasks.
+
+ - Time t = 4:
+
+ This is the 0-lag time for Task T1. Since it has not woken up in the
+ meantime, it enters the Inactive state. Its bandwidth is removed from
+ running_bw.
+ Task T2 continues its execution. However, its runtime is now decreased as
+ dq = - 0.5 dt because Uinact = 0.5.
+ Task T2 therefore reclaims the bandwidth unused by Task T1.
+
+ - Time t = 8:
+
+ Task T1 wakes up. It enters the ActiveContending state again, and the
+ running_bw is incremented.
+
+
+ 2.3 Energy-aware scheduling
+ ---------------------------
+
+ When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
+ GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
+ value that still allows the deadlines to be met. This behavior is currently
+ implemented only for ARM architectures.
+
+ Particular care must be taken when the time needed to change frequency
+ is of the same order of magnitude as the reservation period. In such cases,
+ setting a fixed CPU frequency results in fewer deadline misses.
+
+
+ 3. Scheduling Real-Time Tasks
+ =============================
+
+
+
+ .. BIG FAT WARNING ******************************************************
+
+ .. warning::
+
+ This section contains a (not-thorough) summary of classical deadline
+ scheduling theory, and how it applies to SCHED_DEADLINE.
+ The reader can "safely" skip to Section 4 if only interested in seeing
+ how the scheduling policy can be used. Anyway, we strongly recommend
+ coming back here and continuing to read (once the urge for testing is
+ satisfied :P) to be sure of fully understanding all technical details.
+
+ .. ************************************************************************
+
+ There are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic real-time tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+
+ 3.1 Definitions
+ ---------------
+
+ A typical real-time task is composed of a repetition of computation phases
+ (task instances, or jobs) which are activated in a periodic or sporadic
+ fashion.
+ Each job J_j (where J_j is the j^th job of the task) is characterized by an
+ arrival time r_j (the time when the job starts), an amount of computation
+ time c_j needed to finish the job, and a job absolute deadline d_j, which
+ is the time within which the job should be finished. The maximum execution
+ time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+ A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
+ sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
+ d_j = r_j + D, where D is the task's relative deadline.
+ Summing up, a real-time task can be described as::
+
+ Task = (WCET, D, P)
+
+ The utilization of a real-time task is defined as the ratio between its
+ WCET and its period (or minimum inter-arrival time), and represents
+ the fraction of CPU time needed to execute the task.
+
+ If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
+ to the number of CPUs), then the scheduler is unable to respect all the
+ deadlines.
+ Note that total utilization is defined as the sum of the utilizations
+ WCET_i/P_i over all the real-time tasks in the system. When considering
+ multiple real-time tasks, the parameters of the i-th task are indicated
+ with the "_i" suffix.
+ Moreover, if the total utilization is larger than M, then we risk starving
+ non-real-time tasks by real-time tasks.
+ If, instead, the total utilization is smaller than M, then non-real-time
+ tasks will not be starved and the system might be able to respect all the
+ deadlines.
+ As a matter of fact, in this case it is possible to provide an upper bound
+ for tardiness (defined as the maximum between 0 and the difference
+ between the finishing time of a job and its absolute deadline).
+ More precisely, it can be proven that using a global EDF scheduler the
+ maximum tardiness of each task is smaller than or equal to::
+
+ ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
+
+ where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
+ is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
+ utilization[12].
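+
+ As a worked illustration of the bound above, the following sketch evaluates
+ it for given task set parameters. The function name is hypothetical, and the
+ denominator M - (M - 2) * U_max is assumed to be positive.
+
```c
/*
 * Tardiness bound under global EDF:
 *     ((M - 1) * WCET_max - WCET_min) / (M - (M - 2) * U_max) + WCET_max
 */
double gedf_tardiness_bound(int m, double wcet_max, double wcet_min,
			    double u_max)
{
	return ((m - 1) * wcet_max - wcet_min) / (m - (m - 2) * u_max)
		+ wcet_max;
}
```
+
+ For M = 2, WCET_max = 10, WCET_min = 5 and U_max = 0.5 the bound evaluates to
+ (10 - 5) / 2 + 10 = 12.5 time units.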
+
+ 3.2 Schedulability Analysis for Uniprocessor Systems
+ ----------------------------------------------------
+
+ If M=1 (uniprocessor system), or in case of partitioned scheduling (each
+ real-time task is statically assigned to one and only one CPU), it is
+ possible to formally check if all the deadlines are respected.
+ If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
+ of all the tasks executing on a CPU if and only if the total utilization
+ of the tasks running on such a CPU is smaller than or equal to 1.
+ If D_i != P_i for some task, then it is possible to define the density of
+ a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
+ of all the tasks running on a CPU if the sum of the densities of the tasks
+ running on such a CPU is smaller than or equal to 1::
+
+ sum(WCET_i / min{D_i, P_i}) <= 1
+
+ It is important to notice that this condition is only sufficient, and not
+ necessary: there are task sets that are schedulable, but do not respect the
+ condition. For example, consider the task set {Task_1,Task_2} composed of
+ Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
+ EDF is clearly able to schedule the two tasks without missing any deadline
+ (Task_1 is scheduled as soon as it is released, and finishes just in time
+ to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
+ its response time cannot be larger than 50ms + 10ms = 60ms) even if
+
+ 50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
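+
+ The density-based test can be sketched in C as follows. The struct rt_task
+ type and the function names are hypothetical and only serve this example; all
+ three parameters of a task are assumed to use the same time unit.
+
```c
#include <stdbool.h>
#include <stddef.h>

/* A real-time task described as (WCET, D, P). */
struct rt_task {
	double wcet;
	double deadline;
	double period;
};

/* Sum of WCET_i / min{D_i, P_i} over all tasks. */
double total_density(const struct rt_task *tasks, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++) {
		double d = tasks[i].deadline < tasks[i].period ?
			   tasks[i].deadline : tasks[i].period;
		sum += tasks[i].wcet / d;
	}
	return sum;
}

/* Sufficient (not necessary) uniprocessor EDF test: total density <= 1. */
bool density_test_passes(const struct rt_task *tasks, size_t n)
{
	return total_density(tasks, n) <= 1.0;
}
```
+
+ Applied to the task set above, the test fails (the density is about 1.1) even
+ though EDF can schedule the two tasks, which shows that the condition is only
+ sufficient.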
+
+ Of course it is possible to test the exact schedulability of tasks with
+ D_i != P_i (checking a condition that is both sufficient and necessary),
+ but this cannot be done by comparing the total utilization or density with
+ a constant. Instead, the so called "processor demand" approach can be used,
+ computing the total amount of CPU time h(t) needed by all the tasks to
+ respect all of their deadlines in a time interval of size t, and comparing
+ such a time with the interval size t. If h(t) is smaller than t (that is,
+ the amount of time needed by the tasks in a time interval of size t is
+ smaller than the size of the interval) for all the possible values of t, then
+ EDF is able to schedule the tasks respecting all of their deadlines. Since
+ performing this check for all possible values of t is impossible, it has been
+ proven[4,5,6] that it is sufficient to perform the test for values of t
+ between 0 and a maximum value L. The cited papers contain all of the
+ mathematical details and explain how to compute h(t) and L.
+ In any case, this kind of analysis is too complex as well as too
+ time-consuming to be performed on-line. Hence, as explained in Section
+ 4, Linux uses an admission test based on the tasks' utilizations.
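+
+ The text above does not give the expression of h(t). The sketch below assumes
+ the demand bound function commonly used in the literature for tasks that are
+ all activated at the same time: each task with D_i <= t contributes
+ (floor((t - D_i) / P_i) + 1) * WCET_i. The struct and function names are
+ hypothetical, and this is not something the kernel computes.
+
```c
#include <stddef.h>
#include <stdint.h>

/* A real-time task described as (WCET, D, P), in integer time units. */
struct rt_task {
	int64_t wcet;
	int64_t deadline;
	int64_t period;
};

/*
 * h(t): CPU time needed by all jobs that are both released and due within
 * an interval of size t starting at a simultaneous activation.
 */
int64_t processor_demand(const struct rt_task *tasks, size_t n, int64_t t)
{
	int64_t h = 0;

	for (size_t i = 0; i < n; i++) {
		if (t >= tasks[i].deadline) {
			int64_t jobs = (t - tasks[i].deadline) / tasks[i].period + 1;
			h += jobs * tasks[i].wcet;
		}
	}
	return h;
}
```
+
+ For the task set Task_1=(50,50,100), Task_2=(10,100,100) used earlier, this
+ gives h(50) = 50, h(100) = 60 and h(150) = 110, none of which exceeds the
+ corresponding interval size.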
+
+ 3.3 Schedulability Analysis for Multiprocessor Systems
+ ------------------------------------------------------
+
+ On multiprocessor systems with global EDF scheduling (non partitioned
+ systems), a sufficient test for schedulability cannot be based on the
+ utilizations or densities: it can be shown that even if D_i = P_i, task
+ sets with utilizations slightly larger than 1 can miss deadlines regardless
+ of the number of CPUs.
+
+ Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
+ CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
+ and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
+ arbitrarily small worst case execution time (indicated as "e" here) and a
+ period smaller than the one of the first task. Hence, if all the tasks
+ activate at the same time t, global EDF schedules these M tasks first
+ (because their absolute deadlines are equal to t + P - 1, hence they are
+ smaller than the absolute deadline of Task_1, which is t + P). As a
+ result, Task_1 can be scheduled only at time t + e, and will finish at
+ time t + e + P, after its absolute deadline. The total utilization of the
+ task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
+ values of e this can become very close to 1. This is known as "Dhall's
+ effect"[7]. Note: the example in the original paper by Dhall has been
+ slightly simplified here (for example, Dhall more correctly computed
+ lim_{e->0}U).
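+
+ The arithmetic of this example can be checked with the short sketch below.
+ The function names are hypothetical; times are expressed relative to the
+ common activation time t.
+
```c
/*
 * Total utilization of the task set: M tasks (e, P-1, P-1) plus Task_1
 * (P, P, P), i.e. U = M * e / (P - 1) + 1.
 */
double dhall_total_utilization(int m, double e, double p)
{
	return m * e / (p - 1.0) + 1.0;
}

/*
 * Lateness of Task_1: it starts at e, runs for P and finishes at e + P,
 * while its absolute deadline is P.
 */
double dhall_task1_lateness(double e, double p)
{
	double finish = e + p;

	return finish - p;
}
```
+
+ With M = 4, e = 0.25 and P = 101 the total utilization is about 1.01, yet
+ Task_1 still finishes 0.25 time units after its deadline.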
+
+ More complex schedulability tests for global EDF have been developed in
+ real-time literature[8,9], but they are not based on a simple comparison
+ between total utilization (or density) and a fixed constant. If all tasks
+ have D_i = P_i, a sufficient schedulability condition can be expressed in
+ a simple way:
+
+ sum(WCET_i / P_i) <= M - (M - 1) · U_max
+
+ where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
+ M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
+ just confirms Dhall's effect. A more complete survey of the literature
+ about schedulability tests for multi-processor real-time scheduling can be
+ found in [11].
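+
+ The sufficient condition above (for task sets with D_i = P_i) can be sketched
+ as follows. The function name and the representation of the task set as an
+ array of utilizations WCET_i / P_i are assumptions made for this example.
+
```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Sufficient global EDF test on M CPUs:
 *     sum(WCET_i / P_i) <= M - (M - 1) * U_max
 * where util[i] = WCET_i / P_i and U_max is the largest of them.
 */
bool gedf_utilization_test(const double *util, size_t n, int m)
{
	double sum = 0.0;
	double u_max = 0.0;

	for (size_t i = 0; i < n; i++) {
		sum += util[i];
		if (util[i] > u_max)
			u_max = util[i];
	}
	return sum <= m - (m - 1) * u_max;
}
```
+
+ Three tasks with utilization 0.5 pass the test on two CPUs (1.5 <= 2 - 0.5),
+ while a task with utilization 1 together with a task with utilization 0.1
+ does not (1.1 > 1).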
+
+ As seen, enforcing that the total utilization is smaller than M does not
+ guarantee that global EDF schedules the tasks without missing any deadline
+ (in other words, global EDF is not an optimal scheduling algorithm). However,
+ a total utilization smaller than M is enough to guarantee that non real-time
+ tasks are not starved and that the tardiness of real-time tasks has an upper
+ bound[12] (as previously noted). Different bounds on the maximum tardiness
+ experienced by real-time tasks have been developed in various papers[13,14],
+ but the theoretical result that is important for SCHED_DEADLINE is that if
+ the total utilization is smaller than or equal to M then the response times of
+ the tasks are limited.
+
+ 3.4 Relationship with SCHED_DEADLINE Parameters
+ -----------------------------------------------
+
+ Finally, it is important to understand the relationship between the
+ SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
+ deadline and period) and the real-time task parameters (WCET, D, P)
+ described in this section. Note that a task's temporal constraints are
+ represented by its absolute deadlines d_j = r_j + D described above, while
+ SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
+ Section 2).
+ If an admission test is used to guarantee that the scheduling deadlines
+ are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
+ guaranteeing that all the jobs' deadlines of a task are respected.
+ In order to do this, a task must be scheduled by setting:
+
+ - runtime >= WCET
+ - deadline = D
+ - period <= P
+
+ IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
+ and the absolute deadlines (d_j) coincide, so a proper admission control
+ makes it possible to respect the jobs' absolute deadlines for this task (this
+ is what is called "hard schedulability property" and is an extension of Lemma 1
+ of [2]).
+ Notice that if runtime > deadline the admission control will surely reject
+ this task, as it is not possible to respect its temporal constraints.
+
+ References:
+
+ 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for
+ multiprogramming in a hard-real-time environment. Journal of the
+ Association for Computing Machinery, 20(1), 1973.
+ 2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
+ Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+ Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+ 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+ Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
+ 4 - J. Y. Leung and M. L. Merrill. A Note on Preemptive Scheduling of
+ Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
+ no. 3, pp. 115-118, 1980.
+ 5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
+ Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
+ 11th IEEE Real-time Systems Symposium, 1990.
+ 6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
+ Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
+ One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
+ 1990.
+ 7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
+ research, vol. 26, no. 1, pp 127-140, 1978.
+ 8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
+ Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
+ 9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
+ IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
+ pp 760-768, 2005.
+ 10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
+ Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
+ vol. 25, no. 2–3, pp. 187–205, 2003.
+ 11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
+ Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
+ http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
+ 12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
+ Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
+ no. 2, pp 133-189, 2008.
+ 13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
+ Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
+ the 26th IEEE Real-Time Systems Symposium, 2005.
+ 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
+ Global EDF. Proceedings of the 22nd Euromicro Conference on
+ Real-Time Systems, 2010.
+ 15 - G. Lipari, S. Baruah, Greedy reclamation of unused bandwidth in
+ constant-bandwidth servers, 12th IEEE Euromicro Conference on Real-Time
+ Systems, 2000.
+ 16 - L. Abeni, J. Lelli, C. Scordino, L. Palopoli, Greedy CPU reclaiming for
+ SCHED DEADLINE. In Proceedings of the Real-Time Linux Workshop (RTLWS),
+ Dusseldorf, Germany, 2014.
+ 17 - L. Abeni, G. Lipari, A. Parri, Y. Sun, Multicore CPU reclaiming: parallel
+ or sequential?. In Proceedings of the 31st Annual ACM Symposium on Applied
+ Computing, 2016.
+ 18 - J. Lelli, C. Scordino, L. Abeni, D. Faggioli, Deadline scheduling in the
+ Linux kernel, Software: Practice and Experience, 46(6): 821-839, June
+ 2016.
+ 19 - C. Scordino, L. Abeni, J. Lelli, Energy-Aware Real-Time Scheduling in
+ the Linux Kernel, 33rd ACM/SIGAPP Symposium On Applied Computing (SAC
+ 2018), Pau, France, April 2018.
+
+
+ 4. Bandwidth management
+ =======================
+
+ As previously mentioned, in order for -deadline scheduling to be
+ effective and useful (that is, to be able to provide "runtime" time units
+ within "deadline"), it is important to have some method to keep the allocation
+ of the available fractions of CPU time to the various tasks under control.
+ This is usually called "admission control" and if it is not performed, then
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ As already stated in Section 3, a necessary condition to be respected to
+ correctly schedule a set of real-time tasks is that the total utilization
+ is smaller than M. When talking about -deadline tasks, this requires that
+ the sum of the ratios between runtime and period for all tasks is smaller
+ than M. Notice that the ratio runtime/period is equivalent to the utilization
+ of a "traditional" real-time task, and is also often referred to as
+ "bandwidth".
+ The interface used to control the CPU bandwidth that can be allocated
+ to -deadline tasks is similar to the one already used for -rt
+ tasks with real-time group scheduling (a.k.a. RT-throttling - see
+ Documentation/scheduler/sched-rt-group.rst), and is based on readable/
+ writable control files located in procfs (for system-wide settings).
+ Notice that per-group settings (controlled through cgroupfs) are still not
+ defined for -deadline tasks, because more discussion is needed in order to
+ figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
+ level.
+
+ A main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need a higher level throttling mechanism to enforce the
+ desired bandwidth. In other words, this means that interface parameters are
+ only used at admission control time (i.e., when the user calls
+ sched_setattr()). Scheduling is then performed considering actual tasks'
+ parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
+ respecting their needs in terms of granularity. Therefore, using this simple
+ interface we can put a cap on total utilization of -deadline tasks (i.e.,
+ sum(runtime_i / period_i) < global_dl_utilization_cap).
+
+ 4.1 System-wide settings
+ ------------------------
+
+ The system-wide settings are configured under the /proc virtual file system.
+
+ For now the -rt knobs are used for -deadline admission control and the
+ -deadline runtime is accounted against the -rt runtime. We realize that this
+ isn't entirely desirable; however, it is better to have a small interface for
+ now, and be able to change it easily later. The ideal situation (see Section 6)
+ is to
+ run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
+ direct subset of dl_bw.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created while the sum of their bandwidths stays below:
+
+ M * (sched_rt_runtime_us / sched_rt_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ thus be free to oversubscribe the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
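+
+ The resulting admission rule can be sketched as below. This is a simplified
+ floating point illustration with a hypothetical function name, not the kernel
+ code. It uses a strict comparison because the text says the sum must stay
+ "below" the cap; how the kernel treats a sum exactly equal to the cap is not
+ covered here.
+
```c
#include <stdbool.h>

/*
 * Decide whether a new -deadline task with bandwidth new_bw
 * (runtime / period) can be admitted on a root_domain of m CPUs that
 * already hosts -deadline tasks with total bandwidth current_bw.
 * An rt_runtime_us value of -1 disables the check.
 */
bool dl_admission_ok(double current_bw, double new_bw, int m,
		     long rt_runtime_us, long rt_period_us)
{
	double cap;

	if (rt_runtime_us == -1)
		return true;

	cap = m * ((double)rt_runtime_us / (double)rt_period_us);
	return current_bw + new_bw < cap;
}
```
+
+ With the values 950000 and 1000000 on a 2-CPU root_domain the cap is 1.9, so
+ a task with bandwidth 0.5 is admitted on top of an existing total of 1.0 but
+ a task with bandwidth 0.2 is rejected on top of 1.8.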
+
+
+ 4.2 Task interface
+ ------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+
+ - a (maximum/typical) instance execution time,
+ - a minimum interval between consecutive instances,
+ - a time constraint by which each instance must be completed.
+
+ Therefore:
+
+ * a new struct sched_attr, containing all the necessary fields, is
+ provided;
+ * the new scheduling related syscalls that manipulate it, i.e.,
+ sched_setattr() and sched_getattr() are implemented.
+
+ For debugging purposes, the leftover runtime and absolute deadline of a
+ SCHED_DEADLINE task can be retrieved through /proc/<pid>/sched (entries
+ dl.runtime and dl.deadline, both values in ns). A programmatic way to
+ retrieve these values from production code is under discussion.
+
+
+ 4.3 Default behavior
+ ---------------------
+
+ By default, SCHED_DEADLINE bandwidth is limited by rt_runtime equal to
+ 950000. With rt_period equal to 1000000, this means that -deadline
+ tasks can use at most 95%, multiplied by the number of CPUs that compose the
+ root_domain, for each root_domain.
+ This means that non -deadline tasks will receive at least 5% of the CPU time,
+ and that -deadline tasks will receive their runtime with a guaranteed
+ worst-case delay with respect to the "deadline" parameter. If "deadline" = "period"
+ and the cpuset mechanism is used to implement partitioned scheduling (see
+ Section 5), then this simple setting of the bandwidth management is able to
+ deterministically guarantee that -deadline tasks will receive their runtime
+ in a period.
+
+ Finally, notice that in order not to jeopardize the admission control a
+ -deadline task cannot fork.
+
+
+ 4.4 Behavior of sched_yield()
+ -----------------------------
+
+ When a SCHED_DEADLINE task calls sched_yield(), it gives up its
+ remaining runtime and is immediately throttled, until the next
+ period, when its runtime will be replenished (a special flag
+ dl_yielded is set and used to correctly handle throttling and runtime
+ replenishment after a call to sched_yield()).
+
+ This behavior of sched_yield() allows the task to wake up exactly at
+ the beginning of the next period. Also, this may be useful in the
+ future with bandwidth reclaiming mechanisms, where sched_yield() will
+ make the leftover runtime available for reclamation by other
+ SCHED_DEADLINE tasks.
+
+
+ 5. Tasks CPU affinity
+ =====================
+
+ -deadline tasks cannot have an affinity mask smaller than the entire
+ root_domain they are created on. However, affinities can be specified
++ through the cpuset facility (Documentation/cgroup-v1/cpusets.rst).
+
+ 5.1 SCHED_DEADLINE and cpusets HOWTO
+ ------------------------------------
+
+ An example of a simple configuration (pin a -deadline task to CPU0)
+ follows (rt-app is used to create a -deadline task)::
+
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
+ # task affinity
+
+ 6. Future plans
+ ===============
+
+ Still missing:
+
+ - programmatic way to retrieve current runtime and absolute deadline
+ - refinements to deadline inheritance, especially regarding the possibility
+ of retaining bandwidth isolation among non-interacting tasks. This is
+ being studied from both theoretical and practical points of view, and
+ hopefully we should be able to produce some demonstrative code soon;
+ - (c)group based bandwidth management, and maybe scheduling;
+ - access control for non-root users (and related security concerns to
+ address): what is the best way to allow unprivileged use of the mechanisms,
+ and how can non-root users be prevented from "cheating" the system?
+
+ As already discussed, we are also planning to merge this work with the EDF
+ throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in
+ the preliminary phases of the merge and we really seek feedback that would
+ help us decide on the direction it should take.
+
+ Appendix A. Test suite
+ ======================
+
+ The SCHED_DEADLINE policy can be easily tested using two applications that
+ are part of a wider Linux Scheduler validation suite. The suite is
+ available as a GitHub repository: https://github.com/scheduler-tools.
+
+ The first testing application is called rt-app and can be used to
+ start multiple threads with specific parameters. rt-app supports
+ SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
+ parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
+ is a valuable tool, as it can be used to synthetically recreate certain
+ workloads (maybe mimicking real use-cases) and evaluate how the scheduler
+ behaves under such workloads. In this way, results are easily reproducible.
+ rt-app is available at: https://github.com/scheduler-tools/rt-app.
+
+ Thread parameters can be specified from the command line, with something like
+ this::
+
+ # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
+
+ The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
+ executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
+ priority 10, executes for 20ms every 150ms. The test will run for a total
+ of 5 seconds.
+
+ More interestingly, configurations can be described with a json file that
+ can be passed as input to rt-app with something like this::
+
+ # rt-app my_config.json
+
+ The parameters that can be specified with the second method are a superset
+ of the command line options. Please refer to rt-app documentation for more
+ details (`<rt-app-sources>/doc/*.json`).
+
+ The second testing application is a modification of schedtool, called
+ schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
+ certain pid/application. schedtool-dl is available at:
+ https://github.com/scheduler-tools/schedtool-dl.git.
+
+ The usage is straightforward::
+
+ # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
+
+ With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
+ of 10ms every 100ms (note that parameters are expressed in nanoseconds).
+ You can also use schedtool to create a reservation for an already running
+ application, given that you know its pid::
+
+ # schedtool -E -t 10000000:100000000 my_app_pid
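+
+ The fraction of a CPU that such a reservation claims is simply runtime
+ divided by period. The following is a minimal sketch of that arithmetic; the
+ helper name dl_bandwidth_ppm() is hypothetical and is not part of schedtool or
+ of the kernel interface:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: fraction of one CPU claimed by a runtime/period
 * pair, in parts per million. Both arguments must use the same time unit.
 */
static uint64_t dl_bandwidth_ppm(uint64_t runtime, uint64_t period)
{
	/* A reservation cannot ask for more than one full period. */
	assert(period > 0 && runtime <= period);
	return runtime * 1000000ULL / period;
}
```

+
+ For the 10000000:100000000 pair used above, this yields 100000 ppm, i.e.
+ 10% of one CPU.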
+
+ Appendix B. Minimal main()
+ ==========================
+
+ We provide in what follows a simple (ugly) self-contained code snippet
+ showing how SCHED_DEADLINE reservations can be created by a real-time
+ application developer::
+
+ #define _GNU_SOURCE
+ #include <unistd.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+ #include <time.h>
+ #include <linux/unistd.h>
+ #include <linux/kernel.h>
+ #include <linux/types.h>
+ #include <sys/syscall.h>
+ #include <pthread.h>
+
+ #define gettid() syscall(__NR_gettid)
+
+ #define SCHED_DEADLINE 6
+
+ /* XXX use the proper syscall numbers */
+ #ifdef __x86_64__
+ #define __NR_sched_setattr 314
+ #define __NR_sched_getattr 315
+ #endif
+
+ #ifdef __i386__
+ #define __NR_sched_setattr 351
+ #define __NR_sched_getattr 352
+ #endif
+
+ #ifdef __arm__
+ #define __NR_sched_setattr 380
+ #define __NR_sched_getattr 381
+ #endif
+
+ static volatile int done;
+
+ struct sched_attr {
+ __u32 size;
+
+ __u32 sched_policy;
+ __u64 sched_flags;
+
+ /* SCHED_NORMAL, SCHED_BATCH */
+ __s32 sched_nice;
+
+ /* SCHED_FIFO, SCHED_RR */
+ __u32 sched_priority;
+
+ /* SCHED_DEADLINE (nsec) */
+ __u64 sched_runtime;
+ __u64 sched_deadline;
+ __u64 sched_period;
+ };
+
+ int sched_setattr(pid_t pid,
+ const struct sched_attr *attr,
+ unsigned int flags)
+ {
+ return syscall(__NR_sched_setattr, pid, attr, flags);
+ }
+
+ int sched_getattr(pid_t pid,
+ struct sched_attr *attr,
+ unsigned int size,
+ unsigned int flags)
+ {
+ return syscall(__NR_sched_getattr, pid, attr, size, flags);
+ }
+
+ void *run_deadline(void *data)
+ {
+ struct sched_attr attr;
+ int x = 0;
+ int ret;
+ unsigned int flags = 0;
+
+ printf("deadline thread started [%ld]\n", gettid());
+
+ attr.size = sizeof(attr);
+ attr.sched_flags = 0;
+ attr.sched_nice = 0;
+ attr.sched_priority = 0;
+
+ /* This creates a 10ms/30ms reservation */
+ attr.sched_policy = SCHED_DEADLINE;
+ attr.sched_runtime = 10 * 1000 * 1000;
+ attr.sched_period = attr.sched_deadline = 30 * 1000 * 1000;
+
+ ret = sched_setattr(0, &attr, flags);
+ if (ret < 0) {
+ done = 0;
+ perror("sched_setattr");
+ exit(-1);
+ }
+
+ while (!done) {
+ x++;
+ }
+
+ printf("deadline thread dies [%ld]\n", gettid());
+ return NULL;
+ }
+
+ int main (int argc, char **argv)
+ {
+ pthread_t thread;
+
+ printf("main thread [%ld]\n", gettid());
+
+ pthread_create(&thread, NULL, run_deadline, NULL);
+
+ sleep(10);
+
+ done = 1;
+ pthread_join(thread, NULL);
+
+ printf("main dies [%ld]\n", gettid());
+ return 0;
+ }
--- /dev/null
- Documentation/cgroup-v1/cgroups.txt for more information about this filesystem.
+ =============
+ CFS Scheduler
+ =============
+
+
+ 1. OVERVIEW
+ ============
+
+ CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
+ scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
+ replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
+ code.
+
+ 80% of CFS's design can be summed up in a single sentence: CFS basically models
+ an "ideal, precise multi-tasking CPU" on real hardware.
+
+ "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
+ power and which can run each task at precise equal speed, in parallel, each at
+ 1/nr_running speed. For example: if there are 2 tasks running, then it runs
+ each at 50% physical power --- i.e., actually in parallel.
+
+ On real hardware, we can run only a single task at once, so we have to
+ introduce the concept of "virtual runtime." The virtual runtime of a task
+ specifies when its next timeslice would start execution on the ideal
+ multi-tasking CPU described above. In practice, the virtual runtime of a task
+ is its actual runtime normalized to the total number of running tasks.
+
+
+
+ 2. FEW IMPLEMENTATION DETAILS
+ ==============================
+
+ In CFS the virtual runtime is expressed and tracked via the per-task
+ p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
+ timestamp and measure the "expected CPU time" a task should have gotten.
+
+ [ small detail: on "ideal" hardware, at any time all tasks would have the same
+ p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
+ would ever get "out of balance" from the "ideal" share of CPU time. ]
+
+ CFS's task picking logic is based on this p->se.vruntime value and it is thus
+ very simple: it always tries to run the task with the smallest p->se.vruntime
+ value (i.e., the task which executed least so far). CFS always tries to split
+ up CPU time between runnable tasks as close to "ideal multitasking hardware" as
+ possible.
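+
+ The picking rule can be illustrated with a toy model. This sketch is not
+ kernel code: struct toy_task and toy_pick_next() are made-up names, and a
+ plain linear scan over an array stands in for the scheduler's real data
+ structure:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a task; only the field needed for picking. */
struct toy_task {
	int id;
	uint64_t vruntime;	/* nanoseconds */
};

/* Return the index of the task with the smallest vruntime. */
static size_t toy_pick_next(const struct toy_task *tasks, size_t n)
{
	size_t best = 0;

	assert(tasks != NULL && n > 0);
	for (size_t i = 1; i < n; i++)
		if (tasks[i].vruntime < tasks[best].vruntime)
			best = i;
	return best;
}
```

+
+ With three tasks whose vruntime values are 300, 100 and 200, the second
+ task is selected because it has executed least so far.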
+
+ Most of the rest of CFS's design just falls out of this really simple concept,
+ with a few add-on embellishments like nice levels, multiprocessing and various
+ algorithm variants to recognize sleepers.
+
+
+
+ 3. THE RBTREE
+ ==============
+
+ CFS's design is quite radical: it does not use the old data structures for the
+ runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
+ task execution, and thus has no "array switch" artifacts (by which both the
+ previous vanilla scheduler and RSDL/SD are affected).
+
+ CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
+ increasing value tracking the smallest vruntime among all tasks in the
+ runqueue. The total amount of work done by the system is tracked using
+ min_vruntime; that value is used to place newly activated entities on the left
+ side of the tree as much as possible.
+
+ The total number of running tasks in the runqueue is accounted through the
+ rq->cfs.load value, which is the sum of the weights of the tasks queued on the
+ runqueue.
+
+ CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
+ p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it.
+ As the system progresses forwards, the executed tasks are put into the tree
+ more and more to the right --- slowly but surely giving a chance for every task
+ to become the "leftmost task" and thus get on the CPU within a deterministic
+ amount of time.
+
+ Summing up, CFS works like this: it runs a task a bit, and when the task
+ schedules (or a scheduler tick happens) the task's CPU usage is "accounted
+ for": the (small) time it just spent using the physical CPU is added to
+ p->se.vruntime. Once p->se.vruntime gets high enough so that another task
+ becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
+ small amount of "granularity" distance relative to the leftmost task so that we
+ do not over-schedule tasks and thrash the cache), then the new leftmost task is
+ picked and the current task is preempted.
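+
+ The accounting and preemption steps can be sketched as follows. This is a
+ simplified illustration, not the kernel's implementation: the function names
+ are hypothetical, all tasks are assumed to have equal weight (no nice
+ levels), and vruntime wraparound is ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Charge delta_ns of physical CPU time to the current task's vruntime. */
static uint64_t toy_account(uint64_t curr_vruntime, uint64_t delta_ns)
{
	/* Toy model only: refuse to wrap around. */
	assert(curr_vruntime <= UINT64_MAX - delta_ns);
	return curr_vruntime + delta_ns;
}

/*
 * Preempt only once the current task is ahead of the leftmost task by
 * more than the granularity distance.
 */
static bool toy_should_preempt(uint64_t curr_vruntime,
			       uint64_t leftmost_vruntime,
			       uint64_t granularity_ns)
{
	return curr_vruntime > leftmost_vruntime &&
	       curr_vruntime - leftmost_vruntime > granularity_ns;
}
```

+
+ The granularity argument is what keeps two tasks with nearly equal vruntime
+ from preempting each other on every tick.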
+
+
+
+ 4. SOME FEATURES OF CFS
+ ========================
+
+ CFS uses nanosecond granularity accounting and does not rely on any jiffies or
+ other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
+ way the previous scheduler had, and has no heuristics whatsoever. There is
+ only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
+
+ /proc/sys/kernel/sched_min_granularity_ns
+
+ which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
+ "server" (i.e., good batching) workloads. It defaults to a setting suitable
+ for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.
+
+ Due to its design, the CFS scheduler is not prone to any of the "attacks" that
+ exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
+ chew.c, ring-test.c, massive_intr.c all work fine and do not impact
+ interactivity and produce the expected behavior.
+
+ The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
+ than the previous vanilla scheduler: both types of workloads are isolated much
+ more aggressively.
+
+ SMP load-balancing has been reworked/sanitized: the runqueue-walking
+ assumptions are gone from the load-balancing code now, and iterators of the
+ scheduling modules are used. The balancing code got quite a bit simpler as a
+ result.
+
+
+
+ 5. Scheduling policies
+ ======================
+
+ CFS implements three scheduling policies:
+
+ - SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
+ policy that is used for regular tasks.
+
+ - SCHED_BATCH: Does not preempt nearly as often as regular tasks
+ would, thereby allowing tasks to run longer and make better use of
+ caches but at the cost of interactivity. This is well suited for
+ batch jobs.
+
+ - SCHED_IDLE: This is even weaker than nice 19, but it is not a true
+ idle timer scheduler, in order to avoid priority inversion problems
+ that would deadlock the machine.
+
+ SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
+ POSIX.
+
+ The command chrt from util-linux-ng 2.13.1.1 can set all of these except
+ SCHED_IDLE.
+
+
+
+ 6. SCHEDULING CLASSES
+ ======================
+
+ The new CFS scheduler has been designed in such a way as to introduce "Scheduling
+ Classes," an extensible hierarchy of scheduler modules. These modules
+ encapsulate scheduling policy details and are handled by the scheduler core
+ without the core code assuming too much about them.
+
+ sched/fair.c implements the CFS scheduler described above.
+
+ sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
+ the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
+ priority levels, instead of 140 in the previous scheduler) and it needs no
+ expired array.
+
+ Scheduling classes are implemented through the sched_class structure, which
+ contains hooks to functions that must be called whenever an interesting event
+ occurs.
+
+ This is the (partial) list of the hooks:
+
+ - enqueue_task(...)
+
+ Called when a task enters a runnable state.
+ It puts the scheduling entity (task) into the red-black tree and
+ increments the nr_running variable.
+
+ - dequeue_task(...)
+
+ When a task is no longer runnable, this function is called to keep the
+ corresponding scheduling entity out of the red-black tree. It decrements
+ the nr_running variable.
+
+ - yield_task(...)
+
+ This function is basically just a dequeue followed by an enqueue, unless the
+ compat_yield sysctl is turned on; in that case, it places the scheduling
+ entity at the right-most end of the red-black tree.
+
+ - check_preempt_curr(...)
+
+ This function checks if a task that entered the runnable state should
+ preempt the currently running task.
+
+ - pick_next_task(...)
+
+ This function chooses the most appropriate task eligible to run next.
+
+ - set_curr_task(...)
+
+ This function is called when a task changes its scheduling class or changes
+ its task group.
+
+ - task_tick(...)
+
+ This function is mostly called from time tick functions; it might lead to
+ a process switch. This drives the running preemption.
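+
+ The hook mechanism amounts to a structure of function pointers that the core
+ calls without knowing the policy behind them. The sketch below is a toy
+ model under that assumption: every name prefixed with toy_ is hypothetical,
+ the hook signatures are invented for illustration, and only two hooks are
+ shown:

```c
#include <assert.h>

/* Toy runqueue and scheduling entity. */
struct toy_rq {
	unsigned int nr_running;
};

struct toy_entity {
	int id;
};

/* Toy scheduling class: a table of hooks called by the core. */
struct toy_sched_class {
	void (*enqueue_task)(struct toy_rq *rq, struct toy_entity *se);
	void (*dequeue_task)(struct toy_rq *rq, struct toy_entity *se);
};

/* Entity became runnable: count it (tree insertion omitted). */
static void toy_enqueue(struct toy_rq *rq, struct toy_entity *se)
{
	(void)se;
	rq->nr_running++;
}

/* Entity is no longer runnable: stop counting it. */
static void toy_dequeue(struct toy_rq *rq, struct toy_entity *se)
{
	(void)se;
	assert(rq->nr_running > 0);
	rq->nr_running--;
}

static const struct toy_sched_class toy_fair_class = {
	.enqueue_task = toy_enqueue,
	.dequeue_task = toy_dequeue,
};
```

+
+ Core code would then call toy_fair_class.enqueue_task(rq, se) without
+ assuming anything about how the class orders its entities.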
+
+
+
+
+ 7. GROUP SCHEDULER EXTENSIONS TO CFS
+ =====================================
+
+ Normally, the scheduler operates on individual tasks and strives to provide
+ fair CPU time to each task. Sometimes, it may be desirable to group tasks and
+ provide fair CPU time to each such task group. For example, it may be
+ desirable to first provide fair CPU time to each user on the system and then to
+ each task belonging to a user.
+
+ CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks be
+ grouped and divides CPU time fairly among such groups.
+
+ CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and
+ SCHED_RR) tasks.
+
+ CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and
+ SCHED_BATCH) tasks.
+
+ These options need CONFIG_CGROUPS to be defined, and let the administrator
+ create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
++ Documentation/cgroup-v1/cgroups.rst for more information about this filesystem.
+
+ When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
+ group created using the pseudo filesystem. See example steps below to create
+ task groups and modify their CPU share using the "cgroups" pseudo filesystem::
+
+ # mount -t tmpfs cgroup_root /sys/fs/cgroup
+ # mkdir /sys/fs/cgroup/cpu
+ # mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
+ # cd /sys/fs/cgroup/cpu
+
+ # mkdir multimedia # create "multimedia" group of tasks
+ # mkdir browser # create "browser" group of tasks
+
+ # # Configure the multimedia group to receive twice the CPU bandwidth
+ # # of the browser group
+
+ # echo 2048 > multimedia/cpu.shares
+ # echo 1024 > browser/cpu.shares
+
+ # firefox & # Launch firefox and move it to "browser" group
+ # echo <firefox_pid> > browser/tasks
+
+ # #Launch gmplayer (or your favourite movie player)
+ # echo <movie_player_pid> > multimedia/tasks
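+
+ The effect of the two cpu.shares values can be estimated with simple
+ arithmetic. The sketch below assumes that only these two groups compete at
+ this level of the hierarchy and that both always have runnable tasks; the
+ helper name toy_share_percent() is hypothetical:

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage of CPU time a group can expect when
 * every competing group is busy, rounded down.
 */
static unsigned int toy_share_percent(unsigned long shares,
				      unsigned long total_shares)
{
	assert(total_shares > 0 && shares <= total_shares);
	return (unsigned int)(shares * 100 / total_shares);
}
```

+
+ With 2048 shares for multimedia and 1024 for browser, the total is 3072, so
+ multimedia can expect about 66% of the CPU and browser about 33%.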
--- /dev/null
-Documentation/cgroup-v1/cgroups.txt as well.
+ ==========================
+ Real-Time group scheduling
+ ==========================
+
+ .. CONTENTS
+
+ 0. WARNING
+ 1. Overview
+ 1.1 The problem
+ 1.2 The solution
+ 2. The interface
+ 2.1 System-wide settings
+ 2.2 Default behaviour
+ 2.3 Basis for grouping tasks
+ 3. Future plans
+
+
+ 0. WARNING
+ ==========
+
+ Fiddling with these settings can result in an unstable system. The knobs are
+ root only and assume root knows what he is doing.
+
+ Most notable:
+
+ * very small values in sched_rt_period_us can result in an unstable
+ system when the period is smaller than either the available hrtimer
+ resolution, or the time it takes to handle the budget refresh itself.
+
+ * very small values in sched_rt_runtime_us can result in an unstable
+ system when the runtime is so small the system has difficulty making
+ forward progress (NOTE: the migration thread and kstopmachine both
+ are real-time processes).
+
+ 1. Overview
+ ===========
+
+
+ 1.1 The problem
+ ---------------
+
+ Realtime scheduling is all about determinism: a group has to be able to rely on
+ the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
+ multiple groups of realtime tasks, each group must be assigned a fixed portion
+ of the CPU time available. Without a minimum guarantee a realtime group can
+ obviously fall short. A fuzzy upper limit is of no use since it cannot be
+ relied upon. That leaves us with just the single fixed portion.
+
+ 1.2 The solution
+ ----------------
+
+ CPU time is divided by means of specifying how much time can be spent running
+ in a given period. We allocate this "run time" for each realtime group which
+ the other realtime groups will not be permitted to use.
+
+ Any time not allocated to a realtime group will be used to run normal priority
+ tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
+ SCHED_OTHER.
+
+ Let's consider an example: a fixed frame rate realtime renderer must deliver 25
+ frames a second, which yields a period of 0.04s per frame. Now say it will also
+ have to play some music and respond to input, leaving it with around 80% CPU
+ time dedicated for the graphics. We can then give this group a run time of 0.8
+ * 0.04s = 0.032s.
+
+ This way the graphics group will have a 0.04s period with a 0.032s run time
+ limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
+ needs only about 3% CPU time to do so, it can make do with a run time of
+ 0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of
+ 0.005s and a run time of 0.00015s.
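+
+ The same arithmetic, written out in microseconds. This is only a sketch: the
+ helper name rt_runtime_us() and the use of whole percentages are assumptions
+ made for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: run time (in microseconds) needed to obtain
 * cpu_percent of the CPU within a period of period_us microseconds.
 */
static uint64_t rt_runtime_us(uint64_t period_us, unsigned int cpu_percent)
{
	assert(cpu_percent <= 100);
	return period_us * cpu_percent / 100;
}
```

+
+ For the graphics group, a 40000us period at 80% gives 32000us (0.032s); for
+ the audio group, a 5000us period at 3% gives 150us (0.00015s).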
+
+ The remaining CPU time will be used for user input and other tasks. Because
+ realtime tasks have explicitly allocated the CPU time they need to perform
+ their tasks, buffer underruns in the graphics or audio can be eliminated.
+
+ NOTE: the above example is not fully implemented yet. We still
+ lack an EDF scheduler to make non-uniform periods usable.
+
+
+ 2. The Interface
+ ================
+
+
+ 2.1 System-wide settings
+ ------------------------
+
+ The system-wide settings are configured under the /proc virtual file system:
+
+ /proc/sys/kernel/sched_rt_period_us:
+ The scheduling period that is equivalent to 100% CPU bandwidth
+
+ /proc/sys/kernel/sched_rt_runtime_us:
+ A global limit on how much time realtime scheduling may use. Even without
+ CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
+ processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
+ available to all realtime groups.
+
+ * Time is specified in us because the interface is s32. This gives an
+ operating range from 1us to about 35 minutes.
+ * sched_rt_period_us takes values from 1 to INT_MAX.
+ * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+ * A run time of -1 specifies runtime == period, i.e. no limit.
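+
+ The "about 35 minutes" figure follows directly from the width of a signed
+ 32-bit value counted in microseconds. A minimal check of that arithmetic
+ (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a non-negative s32 microsecond count to whole minutes. */
static unsigned int us_to_whole_minutes(int32_t us)
{
	assert(us >= 0);
	return (unsigned int)(us / (60 * 1000000));
}
```

+
+ INT32_MAX is 2147483647us, which is roughly 2147 seconds, so the helper
+ returns 35 whole minutes for the largest representable value.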
+
+
+ 2.2 Default behaviour
+ ---------------------
+
+ The default values are sched_rt_period_us = 1000000 (1s) and
+ sched_rt_runtime_us = 950000 (0.95s). This leaves 0.05s to be used by
+ SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway
+ realtime task will not lock up the machine but will leave a little time to
+ recover it. By setting runtime to -1 you'd get the old behaviour back.
+
+ By default all bandwidth is assigned to the root group and new groups get the
+ period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
+ want to assign bandwidth to another group, reduce the root group's bandwidth
+ and assign some or all of the difference to another group.
+
+ Realtime group scheduling means you have to assign a portion of total CPU
+ bandwidth to the group before it will accept realtime tasks. Therefore you will
+ not be able to run realtime tasks as any user other than root until you have
+ done that, even if the user has the rights to run processes with realtime
+ priority!
+
+
+ 2.3 Basis for grouping tasks
+ ----------------------------
+
+ Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real
+ CPU bandwidth to task groups.
+
+ This uses the cgroup virtual file system and "<cgroup>/cpu.rt_runtime_us"
+ to control the CPU time reserved for each control group.
+
+ For more information on working with control groups, you should read
++Documentation/cgroup-v1/cgroups.rst as well.
+
+ Group settings are checked against the following limits in order to keep the
+ configuration schedulable:
+
+ \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
+
+ For now, this can be simplified to just the following (but see Future plans):
+
+ \Sum_{i} runtime_{i} <= global_runtime
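+
+ The simplified check can be sketched as a sum over the per-group run times.
+ This is an illustration only, not the kernel's validation code: the function
+ name is hypothetical, all values are in microseconds, and the special value
+ -1 (no limit) is not handled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the group run times fit within the
 * global run time, i.e. Sum_i runtime_i <= global_runtime.
 */
static bool rt_groups_fit(const int64_t *runtime_us, size_t n,
			  int64_t global_runtime_us)
{
	int64_t sum = 0;

	assert(n == 0 || runtime_us != NULL);
	for (size_t i = 0; i < n; i++)
		sum += runtime_us[i];
	return sum <= global_runtime_us;
}
```

+
+ With the default global run time of 950000us, two groups of 500000us and
+ 400000us fit, while two groups of 500000us each do not.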
+
+
+ 3. Future plans
+ ===============
+
+ There is work in progress to make the scheduling period for each group
+ ("<cgroup>/cpu.rt_period_us") configurable as well.
+
+ The constraint on the period is that a subgroup must have a smaller or
+ equal period to its parent. But realistically it's not very useful _yet_
+ as it's prone to starvation without deadline scheduling.
+
+ Consider two sibling groups A and B; both have 50% bandwidth, but A's
+ period is twice the length of B's.
+
+ * group A: period=100000us, runtime=50000us
+
+ - this runs for 0.05s once every 0.1s
+
+ * group B: period= 50000us, runtime=25000us
+
+ - this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
+
+ This means that currently a while (1) loop in A will run for the full period of
+ B and can starve B's tasks (assuming they are of lower priority) for a whole
+ period.
+
+ The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
+ full deadline scheduling to the Linux kernel. Deadline scheduling the above
+ groups and treating the end of the period as a deadline will ensure that they
+ both get their allocated time.
+
+ Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
+ the biggest challenge as the current Linux PI infrastructure is geared towards
+ the limited static priority levels 0-99. With deadline scheduling you need to
+ do deadline inheritance (since priority is inversely proportional to the
+ deadline delta (deadline - now)).
+
+ This means the whole PI machinery will have to be reworked - and that is one of
+ the most complex pieces of code we have.
amount of system memory that are available to a certain class of tasks.
For more information on the features of cpusets, see
-Documentation/cgroup-v1/cpusets.txt.
+Documentation/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of
- configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
+ configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst.
For the purposes of this introduction, we'll assume a very primitive NUMA
emulation setup of "numa=fake=4*512,". This will split our system memory into