--- /dev/null
+================================
+Application Data Integrity (ADI)
+================================
+
+SPARC M7 processor adds the Application Data Integrity (ADI) feature.
+ADI allows a task to set version tags on any subset of its address
+space. Once ADI is enabled and version tags are set for ranges of
+address space of a task, the processor will compare the tag in pointers
+to memory in these ranges to the version set by the application
+previously. Access to memory is granted only if the tag in given pointer
+matches the tag set by the application. In case of mismatch, processor
+raises an exception.
+
+The following steps must be taken by a task to enable ADI fully:
+
+1. Set the user mode PSTATE.mcde bit. This acts as master switch for
+ the task's entire address space to enable/disable ADI for the task.
+
+2. Set TTE.mcd bit on any TLB entries that correspond to the range of
+ addresses ADI is being enabled on. MMU checks the version tag only
+ on the pages that have TTE.mcd bit set.
+
+3. Set the version tag for virtual addresses using stxa instruction
+ and one of the MCD specific ASIs. Each stxa instruction sets the
+ given tag for one ADI block size number of bytes. This step must
+ be repeated for entire page to set tags for entire page.
+
+ADI block size for the platform is provided by the hypervisor to kernel
+in machine description tables. Hypervisor also provides the number of
+top bits in the virtual address that specify the version tag. Once
+version tag has been set for a memory location, the tag is stored in the
+physical memory and the same tag must be present in the ADI version tag
+bits of the virtual address being presented to the MMU. For example on
+SPARC M7 processor, MMU uses bits 63-60 for version tags and ADI block
+size is same as cacheline size which is 64 bytes. A task that sets ADI
+version to, say 10, on a range of memory, must access that memory using
+virtual addresses that contain 0xa in bits 63-60.
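+
+The tag placement described above is plain bit arithmetic. The
+following is a minimal sketch, not kernel code: the helper name
+adi_tag_address() is hypothetical, and it assumes 64-bit virtual
+addresses with the version tag held in the top nbits bits, where nbits
+is the value reported for the platform (4 on SPARC M7)::
+
+    #include <stdint.h>
+
+    /* Hypothetical helper: return addr with its top nbits bits
+     * replaced by version. Any tag already present is discarded.
+     */
+    static inline uint64_t adi_tag_address(uint64_t addr, uint64_t version,
+                                           unsigned int nbits)
+    {
+        uint64_t tag_mask;
+
+        if (nbits == 0 || nbits >= 64)
+            return addr;    /* no usable tag field */
+
+        tag_mask = ~0ULL << (64 - nbits);
+        return (addr & ~tag_mask) |
+               ((version << (64 - nbits)) & tag_mask);
+    }
+
+With nbits = 4 and version = 10, an address such as 0x0000123456789000
+becomes 0xa000123456789000.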
+
+ADI is enabled on a set of pages using mprotect() with PROT_ADI flag.
+When ADI is enabled on a set of pages by a task for the first time,
+kernel sets the PSTATE.mcde bit for the task. Version tags for memory
+addresses are set with an stxa instruction on the addresses using
+ASI_MCD_PRIMARY or ASI_MCD_ST_BLKINIT_PRIMARY. ADI block size is
+provided by the hypervisor to the kernel. Kernel returns the value of
+ADI block size to userspace using auxiliary vector along with other ADI
+info. Following auxiliary vectors are provided by the kernel:
+
+ ============ ===========================================
+ AT_ADI_BLKSZ ADI block size. This is the granularity and
+ alignment, in bytes, of ADI versioning.
+ AT_ADI_NBITS Number of ADI version bits in the VA
+ ============ ===========================================
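+
+A task can also read these values with getauxval() rather than walking
+the auxiliary vector by hand. The sketch below assumes a C library
+that provides getauxval() in <sys/auxv.h> (glibc does). The helper
+name adi_query() is hypothetical, and the fallback values 48 and 49
+are the ones used by the sample program in this document::
+
+    #include <sys/auxv.h>
+
+    #ifndef AT_ADI_BLKSZ
+    #define AT_ADI_BLKSZ 48
+    #endif
+    #ifndef AT_ADI_NBITS
+    #define AT_ADI_NBITS 49
+    #endif
+
+    /* Hypothetical helper: fetch the ADI parameters.
+     * Returns 1 if the kernel reported a non-zero ADI block size,
+     * 0 otherwise (getauxval() returns 0 for a missing entry).
+     */
+    static int adi_query(unsigned long *blksz, unsigned long *nbits)
+    {
+        *blksz = getauxval(AT_ADI_BLKSZ);
+        *nbits = getauxval(AT_ADI_NBITS);
+        return *blksz != 0;
+    }
+
+A return value of 0 should be treated as "ADI not supported", in the
+same way the sample program treats a zero block size.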
+
+
+IMPORTANT NOTES
+===============
+
+- Version tag values of 0x0 and 0xf are reserved. These values match any
+ tag in virtual address and never generate a mismatch exception.
+
+- Version tags are set on virtual addresses from userspace even though
+ tags are stored in physical memory. Tags are set on a physical page
+ after it has been allocated to a task and a pte has been created for
+ it.
+
+- When a task frees a memory page it had set version tags on, the page
+ goes back to free page pool. When this page is re-allocated to a task,
+ kernel clears the page using block initialization ASI which clears the
+ version tags as well for the page. If a page allocated to a task is
+ freed and allocated back to the same task, old version tags set by the
+ task on that page will no longer be present.
+
+- ADI tag mismatches are not detected for non-faulting loads.
+
+- Kernel does not set any tags for user pages and it is entirely a
+ task's responsibility to set any version tags. Kernel does ensure the
+ version tags are preserved if a page is swapped out to the disk and
+ swapped back in. It also preserves the version tags if a page is
+ migrated.
+
+- ADI works for any size pages. A userspace task need not be aware of
+ page size when using ADI. It can simply select a virtual address
+ range, enable ADI on the range using mprotect() and set version tags
+ for the entire range. mprotect() ensures range is aligned to page size
+ and is a multiple of page size.
+
+- ADI tags can only be set on writable memory. For example, ADI tags
+ cannot be set on read-only mappings.
+
+
+
+ADI related traps
+=================
+
+With ADI enabled, the following new traps may occur:
+
+Disrupting memory corruption
+----------------------------
+
+ When a store accesses a memory location that has TTE.mcd=1,
+ the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
+ tag in the address used (bits 63:60) does not match the tag set on
+ the corresponding cacheline, a memory corruption trap occurs. By
+ default, it is a disrupting trap and is sent to the hypervisor
+ first. Hypervisor creates a sun4v error report and sends a
+ resumable error (TT=0x7e) trap to the kernel. The kernel sends
+ a SIGSEGV to the task that resulted in this trap with the following
+ info::
+
+ siginfo.si_signo = SIGSEGV;
+ siginfo.si_errno = 0;
+ siginfo.si_code = SEGV_ADIDERR;
+ siginfo.si_addr = addr; /* PC where first mismatch occurred */
+ siginfo.si_trapno = 0;
+
+
+Precise memory corruption
+-------------------------
+
+ When a store accesses a memory location that has TTE.mcd=1,
+ the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
+ tag in the address used (bits 63:60) does not match the tag set on
+ the corresponding cacheline, a memory corruption trap occurs. If
+ MCD precise exception is enabled (MCDPERR=1), a precise
+ exception is sent to the kernel with TT=0x1a. The kernel sends
+ a SIGSEGV to the task that resulted in this trap with the following
+ info::
+
+ siginfo.si_signo = SIGSEGV;
+ siginfo.si_errno = 0;
+ siginfo.si_code = SEGV_ADIPERR;
+ siginfo.si_addr = addr; /* address that caused trap */
+ siginfo.si_trapno = 0;
+
+ NOTE:
+ ADI tag mismatch on a load always results in precise trap.
+
+
+MCD disabled
+------------
+
+ When a task has not enabled ADI and attempts to set ADI version
+ on a memory address, processor sends an MCD disabled trap. This
+ trap is handled by hypervisor first and the hypervisor vectors this
+ trap through to the kernel as Data Access Exception trap with
+ fault type set to 0xa (invalid ASI). When this occurs, the kernel
+ sends the task SIGSEGV signal with following info::
+
+ siginfo.si_signo = SIGSEGV;
+ siginfo.si_errno = 0;
+ siginfo.si_code = SEGV_ACCADI;
+ siginfo.si_addr = addr; /* address that caused trap */
+ siginfo.si_trapno = 0;
+
+
+Sample program to use ADI
+-------------------------
+
+Following sample program is meant to illustrate how to use the ADI
+functionality::
+
+ #include <unistd.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <elf.h>
+ #include <sys/ipc.h>
+ #include <sys/shm.h>
+ #include <sys/mman.h>
+ #include <asm/asi.h>
+
+ #ifndef AT_ADI_BLKSZ
+ #define AT_ADI_BLKSZ 48
+ #endif
+ #ifndef AT_ADI_NBITS
+ #define AT_ADI_NBITS 49
+ #endif
+
+ #ifndef PROT_ADI
+ #define PROT_ADI 0x10
+ #endif
+
+ #define BUFFER_SIZE (32*1024*1024UL)
+
+ int main(int argc, char* argv[], char* envp[])
+ {
+ unsigned long i, adi_blksz, adi_nbits;
+ char *shmaddr, *tmp_addr, *end, *veraddr;
+ int shmid, version;
+ Elf64_auxv_t *auxv;
+
+ adi_blksz = 0;
+ adi_nbits = 0;
+
+ while(*envp++ != NULL);
+ for (auxv = (Elf64_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++) {
+ switch (auxv->a_type) {
+ case AT_ADI_BLKSZ:
+ adi_blksz = auxv->a_un.a_val;
+ break;
+ case AT_ADI_NBITS:
+ adi_nbits = auxv->a_un.a_val;
+ break;
+ }
+ }
+ if (adi_blksz == 0) {
+ fprintf(stderr, "Oops! ADI is not supported\n");
+ exit(1);
+ }
+
+ printf("ADI capabilities:\n");
+ printf("\tBlock size = %lu\n", adi_blksz);
+ printf("\tNumber of bits = %lu\n", adi_nbits);
+
+ if ((shmid = shmget(2, BUFFER_SIZE,
+ IPC_CREAT | SHM_R | SHM_W)) < 0) {
+ perror("shmget failed");
+ exit(1);
+ }
+
+ shmaddr = shmat(shmid, NULL, 0);
+ if (shmaddr == (char *)-1) {
+ perror("shm attach failed");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(1);
+ }
+
+ if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE|PROT_ADI)) {
+ perror("mprotect failed");
+ goto err_out;
+ }
+
+ /* Set the ADI version tag on the shm segment
+ */
+ version = 10;
+ tmp_addr = shmaddr;
+ end = shmaddr + BUFFER_SIZE;
+ while (tmp_addr < end) {
+ asm volatile(
+ "stxa %1, [%0]0x90\n\t"
+ :
+ : "r" (tmp_addr), "r" (version));
+ tmp_addr += adi_blksz;
+ }
+ asm volatile("membar #Sync\n\t");
+
+ /* Create a versioned address from the normal address by placing
+ * version tag in the upper adi_nbits bits
+ */
+ tmp_addr = (void *) ((unsigned long)shmaddr << adi_nbits);
+ tmp_addr = (void *) ((unsigned long)tmp_addr >> adi_nbits);
+ veraddr = (void *) (((unsigned long)version << (64-adi_nbits))
+ | (unsigned long)tmp_addr);
+
+ printf("Starting the writes:\n");
+ for (i = 0; i < BUFFER_SIZE; i++) {
+ veraddr[i] = (char)(i);
+ if (!(i % (1024 * 1024)))
+ printf(".");
+ }
+ printf("\n");
+
+ printf("Verifying data...");
+ fflush(stdout);
+ for (i = 0; i < BUFFER_SIZE; i++)
+ if (veraddr[i] != (char)i)
+ printf("\nIndex %lu mismatched\n", i);
+ printf("Done.\n");
+
+ /* Disable ADI and clean up
+ */
+ if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE)) {
+ perror("mprotect failed");
+ goto err_out;
+ }
+
+ if (shmdt((const void *)shmaddr) != 0)
+ perror("Detach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+
+ exit(0);
+
+ err_out:
+ if (shmdt((const void *)shmaddr) != 0)
+ perror("Detach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(1);
+ }
+++ /dev/null
-Application Data Integrity (ADI)
-================================
-
-SPARC M7 processor adds the Application Data Integrity (ADI) feature.
-ADI allows a task to set version tags on any subset of its address
-space. Once ADI is enabled and version tags are set for ranges of
-address space of a task, the processor will compare the tag in pointers
-to memory in these ranges to the version set by the application
-previously. Access to memory is granted only if the tag in given pointer
-matches the tag set by the application. In case of mismatch, processor
-raises an exception.
-
-Following steps must be taken by a task to enable ADI fully:
-
-1. Set the user mode PSTATE.mcde bit. This acts as master switch for
- the task's entire address space to enable/disable ADI for the task.
-
-2. Set TTE.mcd bit on any TLB entries that correspond to the range of
- addresses ADI is being enabled on. MMU checks the version tag only
- on the pages that have TTE.mcd bit set.
-
-3. Set the version tag for virtual addresses using stxa instruction
- and one of the MCD specific ASIs. Each stxa instruction sets the
- given tag for one ADI block size number of bytes. This step must
- be repeated for entire page to set tags for entire page.
-
-ADI block size for the platform is provided by the hypervisor to kernel
-in machine description tables. Hypervisor also provides the number of
-top bits in the virtual address that specify the version tag. Once
-version tag has been set for a memory location, the tag is stored in the
-physical memory and the same tag must be present in the ADI version tag
-bits of the virtual address being presented to the MMU. For example on
-SPARC M7 processor, MMU uses bits 63-60 for version tags and ADI block
-size is same as cacheline size which is 64 bytes. A task that sets ADI
-version to, say 10, on a range of memory, must access that memory using
-virtual addresses that contain 0xa in bits 63-60.
-
-ADI is enabled on a set of pages using mprotect() with PROT_ADI flag.
-When ADI is enabled on a set of pages by a task for the first time,
-kernel sets the PSTATE.mcde bit fot the task. Version tags for memory
-addresses are set with an stxa instruction on the addresses using
-ASI_MCD_PRIMARY or ASI_MCD_ST_BLKINIT_PRIMARY. ADI block size is
-provided by the hypervisor to the kernel. Kernel returns the value of
-ADI block size to userspace using auxiliary vector along with other ADI
-info. Following auxiliary vectors are provided by the kernel:
-
- AT_ADI_BLKSZ ADI block size. This is the granularity and
- alignment, in bytes, of ADI versioning.
- AT_ADI_NBITS Number of ADI version bits in the VA
-
-
-IMPORTANT NOTES:
-
-- Version tag values of 0x0 and 0xf are reserved. These values match any
- tag in virtual address and never generate a mismatch exception.
-
-- Version tags are set on virtual addresses from userspace even though
- tags are stored in physical memory. Tags are set on a physical page
- after it has been allocated to a task and a pte has been created for
- it.
-
-- When a task frees a memory page it had set version tags on, the page
- goes back to free page pool. When this page is re-allocated to a task,
- kernel clears the page using block initialization ASI which clears the
- version tags as well for the page. If a page allocated to a task is
- freed and allocated back to the same task, old version tags set by the
- task on that page will no longer be present.
-
-- ADI tag mismatches are not detected for non-faulting loads.
-
-- Kernel does not set any tags for user pages and it is entirely a
- task's responsibility to set any version tags. Kernel does ensure the
- version tags are preserved if a page is swapped out to the disk and
- swapped back in. It also preserves that version tags if a page is
- migrated.
-
-- ADI works for any size pages. A userspace task need not be aware of
- page size when using ADI. It can simply select a virtual address
- range, enable ADI on the range using mprotect() and set version tags
- for the entire range. mprotect() ensures range is aligned to page size
- and is a multiple of page size.
-
-- ADI tags can only be set on writable memory. For example, ADI tags can
- not be set on read-only mappings.
-
-
-
-ADI related traps
------------------
-
-With ADI enabled, following new traps may occur:
-
-Disrupting memory corruption
-
- When a store accesses a memory localtion that has TTE.mcd=1,
- the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
- tag in the address used (bits 63:60) does not match the tag set on
- the corresponding cacheline, a memory corruption trap occurs. By
- default, it is a disrupting trap and is sent to the hypervisor
- first. Hypervisor creates a sun4v error report and sends a
- resumable error (TT=0x7e) trap to the kernel. The kernel sends
- a SIGSEGV to the task that resulted in this trap with the following
- info:
-
- siginfo.si_signo = SIGSEGV;
- siginfo.errno = 0;
- siginfo.si_code = SEGV_ADIDERR;
- siginfo.si_addr = addr; /* PC where first mismatch occurred */
- siginfo.si_trapno = 0;
-
-
-Precise memory corruption
-
- When a store accesses a memory location that has TTE.mcd=1,
- the task is running with ADI enabled (PSTATE.mcde=1), and the ADI
- tag in the address used (bits 63:60) does not match the tag set on
- the corresponding cacheline, a memory corruption trap occurs. If
- MCD precise exception is enabled (MCDPERR=1), a precise
- exception is sent to the kernel with TT=0x1a. The kernel sends
- a SIGSEGV to the task that resulted in this trap with the following
- info:
-
- siginfo.si_signo = SIGSEGV;
- siginfo.errno = 0;
- siginfo.si_code = SEGV_ADIPERR;
- siginfo.si_addr = addr; /* address that caused trap */
- siginfo.si_trapno = 0;
-
- NOTE: ADI tag mismatch on a load always results in precise trap.
-
-
-MCD disabled
-
- When a task has not enabled ADI and attempts to set ADI version
- on a memory address, processor sends an MCD disabled trap. This
- trap is handled by hypervisor first and the hypervisor vectors this
- trap through to the kernel as Data Access Exception trap with
- fault type set to 0xa (invalid ASI). When this occurs, the kernel
- sends the task SIGSEGV signal with following info:
-
- siginfo.si_signo = SIGSEGV;
- siginfo.errno = 0;
- siginfo.si_code = SEGV_ACCADI;
- siginfo.si_addr = addr; /* address that caused trap */
- siginfo.si_trapno = 0;
-
-
-Sample program to use ADI
--------------------------
-
-Following sample program is meant to illustrate how to use the ADI
-functionality.
-
-#include <unistd.h>
-#include <stdio.h>
-#include <stdlib.h>
-#include <elf.h>
-#include <sys/ipc.h>
-#include <sys/shm.h>
-#include <sys/mman.h>
-#include <asm/asi.h>
-
-#ifndef AT_ADI_BLKSZ
-#define AT_ADI_BLKSZ 48
-#endif
-#ifndef AT_ADI_NBITS
-#define AT_ADI_NBITS 49
-#endif
-
-#ifndef PROT_ADI
-#define PROT_ADI 0x10
-#endif
-
-#define BUFFER_SIZE 32*1024*1024UL
-
-main(int argc, char* argv[], char* envp[])
-{
- unsigned long i, mcde, adi_blksz, adi_nbits;
- char *shmaddr, *tmp_addr, *end, *veraddr, *clraddr;
- int shmid, version;
- Elf64_auxv_t *auxv;
-
- adi_blksz = 0;
-
- while(*envp++ != NULL);
- for (auxv = (Elf64_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++) {
- switch (auxv->a_type) {
- case AT_ADI_BLKSZ:
- adi_blksz = auxv->a_un.a_val;
- break;
- case AT_ADI_NBITS:
- adi_nbits = auxv->a_un.a_val;
- break;
- }
- }
- if (adi_blksz == 0) {
- fprintf(stderr, "Oops! ADI is not supported\n");
- exit(1);
- }
-
- printf("ADI capabilities:\n");
- printf("\tBlock size = %ld\n", adi_blksz);
- printf("\tNumber of bits = %ld\n", adi_nbits);
-
- if ((shmid = shmget(2, BUFFER_SIZE,
- IPC_CREAT | SHM_R | SHM_W)) < 0) {
- perror("shmget failed");
- exit(1);
- }
-
- shmaddr = shmat(shmid, NULL, 0);
- if (shmaddr == (char *)-1) {
- perror("shm attach failed");
- shmctl(shmid, IPC_RMID, NULL);
- exit(1);
- }
-
- if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE|PROT_ADI)) {
- perror("mprotect failed");
- goto err_out;
- }
-
- /* Set the ADI version tag on the shm segment
- */
- version = 10;
- tmp_addr = shmaddr;
- end = shmaddr + BUFFER_SIZE;
- while (tmp_addr < end) {
- asm volatile(
- "stxa %1, [%0]0x90\n\t"
- :
- : "r" (tmp_addr), "r" (version));
- tmp_addr += adi_blksz;
- }
- asm volatile("membar #Sync\n\t");
-
- /* Create a versioned address from the normal address by placing
- * version tag in the upper adi_nbits bits
- */
- tmp_addr = (void *) ((unsigned long)shmaddr << adi_nbits);
- tmp_addr = (void *) ((unsigned long)tmp_addr >> adi_nbits);
- veraddr = (void *) (((unsigned long)version << (64-adi_nbits))
- | (unsigned long)tmp_addr);
-
- printf("Starting the writes:\n");
- for (i = 0; i < BUFFER_SIZE; i++) {
- veraddr[i] = (char)(i);
- if (!(i % (1024 * 1024)))
- printf(".");
- }
- printf("\n");
-
- printf("Verifying data...");
- fflush(stdout);
- for (i = 0; i < BUFFER_SIZE; i++)
- if (veraddr[i] != (char)i)
- printf("\nIndex %lu mismatched\n", i);
- printf("Done.\n");
-
- /* Disable ADI and clean up
- */
- if (mprotect(shmaddr, BUFFER_SIZE, PROT_READ|PROT_WRITE)) {
- perror("mprotect failed");
- goto err_out;
- }
-
- if (shmdt((const void *)shmaddr) != 0)
- perror("Detach failure");
- shmctl(shmid, IPC_RMID, NULL);
-
- exit(0);
-
-err_out:
- if (shmdt((const void *)shmaddr) != 0)
- perror("Detach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(1);
-}
--- /dev/null
+Steps for sending 'break' on sunhv console
+==========================================
+
+On Baremetal:
+ 1. press Esc + 'B'
+
+On LDOM:
+ 1. press Ctrl + ']'
+ 2. telnet> send break
+++ /dev/null
-Steps for sending 'break' on sunhv console:
-===========================================
-
-On Baremetal:
- 1. press Esc + 'B'
-
-On LDOM:
- 1. press Ctrl + ']'
- 2. telnet> send break
--- /dev/null
+:orphan:
+
+==================
+Sparc Architecture
+==================
+
+.. toctree::
+ :maxdepth: 1
+
+ console
+ adi
+
+ oradax/oracle-dax
--- /dev/null
+=======================================
+Oracle Data Analytics Accelerator (DAX)
+=======================================
+
+DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
+(DAX2) processor chips, and has direct access to the CPU's L3 caches
+as well as physical memory. It can perform several operations on data
+streams with various input and output formats. A driver provides a
+transport mechanism and has limited knowledge of the various opcodes
+and data formats. A user space library provides high level services
+and translates these into low level commands which are then passed
+into the driver and subsequently the Hypervisor and the coprocessor.
+The library is the recommended way for applications to use the
+coprocessor, and the driver interface is not intended for general use.
+This document describes the general flow of the driver, its
+structures, and its programmatic interface. It also provides example
+code sufficient to write user or kernel applications that use DAX
+functionality.
+
+The user library is open source and available at:
+
+ https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
+
+The Hypervisor interface to the coprocessor is described in detail in
+the accompanying document, dax-hv-api.txt, which is a plain text
+excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
+Specification" version 3.0.20+15, dated 2017-09-25.
+
+
+High Level Overview
+===================
+
+A coprocessor request is described by a Command Control Block
+(CCB). The CCB contains an opcode and various parameters. The opcode
+specifies what operation is to be done, and the parameters specify
+options, flags, sizes, and addresses. The CCB (or an array of CCBs)
+is passed to the Hypervisor, which handles queueing and scheduling of
+requests to the available coprocessor execution units. A status code
+returned indicates if the request was submitted successfully or if
+there was an error. One of the addresses given in each CCB is a
+pointer to a "completion area", which is a 128 byte memory block that
+is written by the coprocessor to provide execution status. No
+interrupt is generated upon completion; the completion area must be
+polled by software to find out when a transaction has finished, but
+the M7 and later processors provide a mechanism to pause the virtual
+processor until the completion status has been updated by the
+coprocessor. This is done using the monitored load and mwait
+instructions, which are described in more detail later. The DAX
+coprocessor was designed so that after a request is submitted, the
+kernel is no longer involved in the processing of it. The polling is
+done at the user level, which results in almost zero latency between
+completion of a request and resumption of execution of the requesting
+thread.
+
+
+Addressing Memory
+=================
+
+The kernel does not have access to physical memory in the Sun4v
+architecture, as there is an additional level of memory virtualization
+present. This intermediate level is called "real" memory, and the
+kernel treats this as if it were physical. The Hypervisor handles the
+translations between real memory and physical so that each logical
+domain (LDOM) can have a partition of physical memory that is isolated
+from that of other LDOMs. When the kernel sets up a virtual mapping,
+it specifies a virtual address and the real address to which it should
+be mapped.
+
+The DAX coprocessor can only operate on physical memory, so before a
+request can be fed to the coprocessor, all the addresses in a CCB must
+be converted into physical addresses. The kernel cannot do this since
+it has no visibility into physical addresses. So a CCB may contain
+either the virtual or real addresses of the buffers or a combination
+of them. An "address type" field is available for each address that
+may be given in the CCB. In all cases, the Hypervisor will translate
+all the addresses to physical before dispatching to hardware. Address
+translations are performed using the context of the process initiating
+the request.
+
+
+The Driver API
+==============
+
+An application makes requests to the driver via the write() system
+call, and gets results (if any) via read(). The completion areas are
+made accessible via mmap(), and are read-only for the application.
+
+The request may either be an immediate command or an array of CCBs to
+be submitted to the hardware.
+
+Each open instance of the device is exclusive to the thread that
+opened it, and must be used by that thread for all subsequent
+operations. The driver open function creates a new context for the
+thread and initializes it for use. This context contains pointers and
+values used internally by the driver to keep track of submitted
+requests. The completion area buffer is also allocated, and this is
+large enough to contain the completion areas for many concurrent
+requests. When the device is closed, any outstanding transactions are
+flushed and the context is cleaned up.
+
+On a DAX1 system (M7), the device will be called "oradax1", while on a
+DAX2 system (M8) it will be "oradax2". If an application requires one
+or the other, it should simply attempt to open the appropriate
+device. Only one of the devices will exist on any given system, so the
+name can be used to determine what the platform supports.
+
+The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
+all of these, success is indicated by a return value from write()
+equal to the number of bytes given in the call. Otherwise -1 is
+returned and errno is set.
+
+CCB_DEQUEUE
+-----------
+
+Tells the driver to clean up resources associated with past
+requests. Since no interrupt is generated upon the completion of a
+request, the driver must be told when it may reclaim resources. No
+further status information is returned, so the user should not
+subsequently call read().
+
+CCB_KILL
+--------
+
+Kills a CCB during execution. The CCB is guaranteed to not continue
+executing once this call returns successfully. On success, read() must
+be called to retrieve the result of the action.
+
+CCB_INFO
+--------
+
+Retrieves information about a currently executing CCB. Note that some
+Hypervisors might return 'notfound' when the CCB is in 'inprogress'
+state. To ensure a CCB in the 'notfound' state will never be executed,
+CCB_KILL must be invoked on that CCB. Upon success, read() must be
+called to retrieve the details of the action.
+
+Submission of an array of CCBs for execution
+---------------------------------------------
+
+A write() whose length is a multiple of the CCB size is treated as a
+submit operation. The file offset is treated as the index of the
+completion area to use, and may be set via lseek() or using the
+pwrite() system call. If -1 is returned then errno is set to indicate
+the error. Otherwise, the return value is the length of the array that
+was actually accepted by the coprocessor. If the accepted length is
+equal to the requested length, then the submission was completely
+successful and there is no further status needed; hence, the user
+should not subsequently call read(). Partial acceptance of the CCB
+array is indicated by a return value less than the requested length,
+and read() must be called to retrieve further status information. The
+status will reflect the error caused by the first CCB that was not
+accepted, and status_data will provide additional data in some cases.
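+
+The three outcomes described above can be told apart from the return
+value of write() alone. The following is a minimal sketch; the enum
+and the helper name dax_classify_submit() are hypothetical and are not
+part of the driver API::
+
+    #include <stddef.h>
+    #include <sys/types.h>
+
+    enum dax_submit_result {
+        DAX_SUBMIT_ERROR,     /* write() returned -1, see errno */
+        DAX_SUBMIT_PARTIAL,   /* some CCBs rejected, call read() */
+        DAX_SUBMIT_OK         /* all CCBs accepted, do not call read() */
+    };
+
+    /* ret is the value returned by write() or pwrite(), requested is
+     * the byte length that was passed to that call.
+     */
+    static enum dax_submit_result dax_classify_submit(ssize_t ret,
+                                                      size_t requested)
+    {
+        if (ret < 0)
+            return DAX_SUBMIT_ERROR;
+        if ((size_t)ret < requested)
+            return DAX_SUBMIT_PARTIAL;
+        return DAX_SUBMIT_OK;
+    }
+
+Only the DAX_SUBMIT_PARTIAL case should be followed by a read() to
+fetch the status of the first CCB that was not accepted.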
+
+MMAP
+----
+
+The mmap() function provides access to the completion area allocated
+in the driver. Note that the completion area is not writeable by the
+user process, and the mmap call must not specify PROT_WRITE.
+
+
+Completion of a Request
+=======================
+
+The first byte in each completion area is the command status which is
+updated by the coprocessor hardware. Software may take advantage of
+new M7/M8 processor capabilities to efficiently poll this status byte.
+First, a "monitored load" is achieved via a Load from Alternate Space
+(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
+"monitored wait" is achieved via the mwait instruction (a write to
+%asr28). This instruction is like pause in that it suspends execution
+of the virtual processor for the given number of nanoseconds, but in
+addition will terminate early when one of several events occur. If the
+block of data containing the monitored location is modified, then the
+mwait terminates. This causes software to resume execution immediately
+(without a context switch or kernel to user transition) after a
+transaction completes. Thus the latency between transaction completion
+and resumption of execution may be just a few nanoseconds.
+
+
+Application Life Cycle of a DAX Submission
+==========================================
+
+ - open dax device
+ - call mmap() to get the completion area address
+ - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
+ - submit CCB via write() or pwrite()
+ - go into a loop executing monitored load + monitored wait and
+ terminate when the command status indicates the request is complete
+ (CCB_KILL or CCB_INFO may be used any time as necessary)
+ - perform a CCB_DEQUEUE
+ - call munmap() for completion area
+ - close the dax device
+
+
+Memory Constraints
+==================
+
+The DAX hardware operates only on physical addresses. Therefore, it is
+not aware of virtual memory mappings and the discontiguities that may
+exist in the physical memory that a virtual buffer maps to. There is
+no I/O TLB or any scatter/gather mechanism. All buffers, whether input
+or output, must reside in a physically contiguous region of memory.
+
+The Hypervisor translates all addresses within a CCB to physical
+before handing off the CCB to DAX. The Hypervisor determines the
+virtual page size for each virtual address given, and uses this to
+program a size limit for each address. This prevents the coprocessor
+from reading or writing beyond the bound of the virtual page, even
+though it is accessing physical memory directly. A simpler way of
+saying this is that a DAX operation will never "cross" a virtual page
+boundary. If an 8k virtual page is used, then the data is strictly
+limited to 8k. If a user's buffer is larger than 8k, then a larger
+page size must be used, or the transaction size will be truncated to
+8k.
+
+Huge pages. A user may allocate huge pages using standard interfaces.
+Memory buffers residing on huge pages may be used to achieve much
+larger DAX transaction sizes, but the rules must still be followed,
+and no transaction will cross a page boundary, even a huge page. A
+major caveat is that Linux on Sparc presents 8 MB as one of the huge
+page sizes. Sparc does not actually provide an 8 MB hardware page size,
+and this size is synthesized by pasting together two 4 MB pages. The
+reasons for this are historical, and it creates an issue because only
+half of this 8 MB page can actually be used for any given buffer in a
+DAX request, and it must be either the first half or the second half;
+it cannot be a 4 MB chunk in the middle, since that crosses a
+(hardware) page boundary. Note that this entire issue may be hidden by
+higher level libraries.
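+
+An application that manages its own buffers can check the page
+boundary rule before submitting a request. The following is a minimal
+sketch; the helper name dax_buf_in_one_page() is hypothetical, and it
+assumes the caller knows the hardware page size backing the buffer
+(a power of two, for example 4 MB for each half of a synthesized huge
+page) and that addr + len does not wrap around::
+
+    #include <stdbool.h>
+    #include <stddef.h>
+    #include <stdint.h>
+
+    /* Hypothetical helper: true if [addr, addr + len) lies entirely
+     * within a single page of size page_size (a power of two).
+     */
+    static bool dax_buf_in_one_page(uint64_t addr, size_t len,
+                                    uint64_t page_size)
+    {
+        uint64_t page_mask = ~(page_size - 1);
+
+        if (len == 0 || len > page_size)
+            return false;
+        return (addr & page_mask) == ((addr + len - 1) & page_mask);
+    }
+
+For a 4 MB hardware page, a 4 MB buffer starting at offset 0 passes
+the check, while the same buffer starting 2 MB into the page does not.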
+
+
+CCB Structure
+-------------
+
+A CCB is an array of 8 64-bit words. Several of these words provide
+command opcodes, parameters, flags, etc., and the rest are addresses
+for the completion area, output buffer, and various inputs::
+
+ struct ccb {
+ u64 control;
+ u64 completion;
+ u64 input0;
+ u64 access;
+ u64 input1;
+ u64 op_data;
+ u64 output;
+ u64 table;
+ };
+
+See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
+each of these fields, and see dax-hv-api.txt for a complete description
+of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
+
+The first word (control) is examined by the driver for the following:
+ - CCB version, which must be consistent with hardware version
+ - Opcode, which must be one of the documented allowable commands
+ - Address types, which must be set to "virtual" for all the addresses
+ given by the user, thereby ensuring that the application can
+ only access memory that it owns
+
+
+Example Code
+============
+
+The DAX is accessible to both user and kernel code. The kernel code
+can make hypercalls directly while the user code must use wrappers
+provided by the driver. The setup of the CCB is nearly identical for
+both; the only difference is in preparation of the completion area. An
+example of user code is given now, with kernel code afterwards.
+
+In order to program using the driver API, the file
+arch/sparc/include/uapi/asm/oradax.h must be included.
+
+First, the proper device must be opened. For M7 it will be
+/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
+procedure is to attempt to open both, as only one will succeed::
+
+ fd = open("/dev/oradax1", O_RDWR);
+ if (fd < 0)
+ fd = open("/dev/oradax2", O_RDWR);
+ if (fd < 0)
+ /* No DAX found */
+
+Next, the completion area must be mapped::
+
+ completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
+
+All input and output buffers must be fully contained in one hardware
+page, since as explained above, the DAX is strictly constrained by
+virtual page boundaries. In addition, the output buffer must be
+64-byte aligned and its size must be a multiple of 64 bytes because
+the coprocessor writes in units of cache lines.
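+
+The output buffer constraints can be sketched as follows. This is an
+illustration only: DAX_OUT_ALIGN, dax_bitmap_bytes() and
+dax_output_ok() are hypothetical names, not part of oradax.h. The
+sketch assumes a bit-vector output with one bit per input element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DAX_OUT_ALIGN 64	/* cache line size, as stated in the text */

/* Bytes needed for an nbits-wide bitmap, rounded up to 64 bytes. */
static size_t dax_bitmap_bytes(size_t nbits)
{
	size_t bytes;

	/* Keep the rounding arithmetic below from overflowing. */
	assert(nbits <= SIZE_MAX - 7);

	bytes = (nbits + 7) / 8;
	return (bytes + DAX_OUT_ALIGN - 1) / DAX_OUT_ALIGN * DAX_OUT_ALIGN;
}

/* True if buf is 64-byte aligned and len is a non-zero multiple of 64. */
static bool dax_output_ok(const void *buf, size_t len)
{
	return ((uintptr_t)buf % DAX_OUT_ALIGN) == 0 &&
	       len != 0 && (len % DAX_OUT_ALIGN) == 0;
}
```

+
+A buffer of dax_bitmap_bytes(nbits) bytes obtained from an aligned
+allocator such as posix_memalign() with an alignment of 64 satisfies
+both conditions; it must additionally fit in one hardware page.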
+
+This example demonstrates the DAX Scan command, which takes as input a
+vector and a match value, and produces a bitmap as the output. For
+each input element that matches the value, the corresponding bit is
+set in the output.
+
+In this example, the input vector consists of a series of single bits,
+and the match value is 0. So each 0 bit in the input will produce a 1
+in the output, and vice versa; the output bitmap is therefore the
+input bitmap inverted.
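+
+A software reference for this 1-bit scan can be useful when checking
+the coprocessor's output. The sketch below is not taken from the
+driver or from libdax; scan_bits_ref() is a hypothetical name, and the
+MSB-first bit order within each byte is an assumption. The result for
+1-bit elements does not depend on that order as long as input and
+output use the same one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Software model of a 1-bit scan: output bit i is set when input
 * bit i equals match (0 or 1), and cleared otherwise. Bits beyond
 * nbits in the last output byte are left untouched.
 */
static void scan_bits_ref(const uint8_t *in, uint8_t *out,
			  size_t nbits, unsigned int match)
{
	size_t i;

	assert(match <= 1);

	for (i = 0; i < nbits; i++) {
		/* Assumption: bit 0 of the stream is the MSB of byte 0. */
		uint8_t mask = (uint8_t)(0x80u >> (i % 8));
		unsigned int bit = (in[i / 8] & mask) != 0;

		if (bit == match)
			out[i / 8] |= mask;
		else
			out[i / 8] &= (uint8_t)~mask;
	}
}
```

+
+With a match value of 0, the output produced by this model is the
+bitwise inverse of the first nbits input bits.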
+
+For details of all the parameters and bits used in this CCB, please
+refer to section 36.2.1.3 of the DAX Hypervisor API document, which
+describes the Scan command in detail::
+
+ ccb->control = /* Table 36.1, CCB Header Format */
+ (2L << 48) /* command = Scan Value */
+ | (3L << 40) /* output address type = primary virtual */
+ | (3L << 34) /* primary input address type = primary virtual */
+ /* Section 36.2.1, Query CCB Command Formats */
+ | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */
+ | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
+ | (8 << 10) /* 36.2.1.1.6 output format = bit vector */
+ | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
+ | (31 << 0); /* 36.2.1.3 Disable second scan criteria */
+
+ ccb->completion = 0; /* Completion area address, to be filled in by driver */
+
+ ccb->input0 = (unsigned long) input; /* primary input address */
+
+ ccb->access = /* Section 36.2.1.2, Data Access Control */
+ (2 << 24) /* Primary input length format = bits */
+ | (nbits - 1); /* number of bits in primary input stream, minus 1 */
+
+ ccb->input1 = 0; /* secondary input address, unused */
+
+ ccb->op_data = 0; /* scan criteria (value to be matched) */
+
+ ccb->output = (unsigned long) output; /* output address */
+
+ ccb->table = 0; /* table address, unused */
+
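+The control word above can also be assembled with fixed-width types,
+which avoids depending on the width of long. This is a sketch only:
+ccb_field() and scan_control_word() are hypothetical names, and the
+shifts and values are copied from the example rather than from the
+Hypervisor API document.

```c
#include <assert.h>
#include <stdint.h>

/* Place a value at a bit offset in the 64-bit control word. */
static uint64_t ccb_field(uint64_t value, unsigned int shift)
{
	assert(shift < 64);
	return value << shift;
}

/* Control word for the 1-bit Scan Value example in the text. */
static uint64_t scan_control_word(void)
{
	return ccb_field(2, 48)		/* command = Scan Value */
	     | ccb_field(3, 40)		/* output address type */
	     | ccb_field(3, 34)		/* primary input address type */
	     | ccb_field(1, 28)		/* input format = bit packed */
	     | ccb_field(0, 23)		/* input element size = 1 bit */
	     | ccb_field(8, 10)		/* output format = bit vector */
	     | ccb_field(0, 5)		/* first scan criteria size */
	     | ccb_field(31, 0);	/* disable second scan criteria */
}
```

+
+For the values used in this example the resulting word is
+0x0002030C1000201F.
+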
+The CCB submission is a write() or pwrite() system call to the
+driver. If the call fails, then a read() must be used to retrieve the
+status::
+
+ if (pwrite(fd, ccb, 64, 0) != 64) {
+ struct ccb_exec_result status;
+ read(fd, &status, sizeof(status));
+ /* bail out */
+ }
+
+After a successful submission of the CCB, the completion area may be
+polled to determine when the DAX is finished. Detailed information on
+the contents of the completion area can be found in section 36.2.2 of
+the DAX HV API document::
+
+ while (1) {
+ /* Monitored Load */
+ __asm__ __volatile__("lduba [%1] 0x84, %0\n"
+ : "=r" (status)
+ : "r" (completion_area));
+
+ if (status) /* 0 indicates command in progress */
+ break;
+
+ /* MWAIT */
+ __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
+ }
+
+A completion area status of 1 indicates successful completion of the
+CCB and validity of the output bitmap, which may be used immediately.
+All other non-zero values indicate error conditions which are
+described in section 36.2.2::
+
+ if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
+ /* completion_area[0] contains the completion status */
+ /* completion_area[1] contains an error code, see 36.2.2 */
+ }
+
+After the completion area has been processed, the driver must be
+notified that it can release any resources associated with the
+request. This is done via the dequeue operation::
+
+ struct dax_command cmd;
+ cmd.command = CCB_DEQUEUE;
+ if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
+ /* bail out */
+ }
+
+Finally, normal program cleanup should be done, i.e., unmapping the
+completion area, closing the DAX device, and freeing memory.
+
+Kernel example
+--------------
+
+The only difference in using the DAX in kernel code is the treatment
+of the completion area. Unlike user applications which mmap the
+completion area allocated by the driver, kernel code must allocate its
+own memory to use for the completion area, and this address and its
+type must be given in the CCB::
+
+ ccb->control |= /* Table 36.1, CCB Header Format */
+ (3L << 32); /* completion area address type = primary virtual */
+
+ ccb->completion = (unsigned long) completion_area; /* Completion area address */
+
+The dax submit hypercall is made directly. The flags used in the
+ccb_submit call are documented in the DAX HV API in section 36.3.1.
+
+::
+
+ #include <asm/hypervisor.h>
+
+ hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
+ HV_CCB_QUERY_CMD |
+ HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
+ HV_CCB_VA_PRIVILEGED,
+ 0, &bytes_accepted, &status_data);
+
+ if (hv_rv != HV_EOK) {
+ /* hv_rv is an error code, status_data contains */
+ /* potential additional status, see 36.3.1.1 */
+ }
+
+After the submission, the completion area polling code is identical to
+that in user land::
+
+ while (1) {
+ /* Monitored Load */
+ __asm__ __volatile__("lduba [%1] 0x84, %0\n"
+ : "=r" (status)
+ : "r" (completion_area));
+
+ if (status) /* 0 indicates command in progress */
+ break;
+
+ /* MWAIT */
+ __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
+ }
+
+ if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
+ /* completion_area[0] contains the completion status */
+ /* completion_area[1] contains an error code, see 36.2.2 */
+ }
+
+The output bitmap is ready for consumption immediately after the
+completion status indicates success.
+
+Excerpt from UltraSPARC Virtual Machine Specification
+=====================================================
+
+ .. include:: dax-hv-api.txt
+ :literal:
+++ /dev/null
-Oracle Data Analytics Accelerator (DAX)
----------------------------------------
-
-DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
-(DAX2) processor chips, and has direct access to the CPU's L3 caches
-as well as physical memory. It can perform several operations on data
-streams with various input and output formats. A driver provides a
-transport mechanism and has limited knowledge of the various opcodes
-and data formats. A user space library provides high level services
-and translates these into low level commands which are then passed
-into the driver and subsequently the Hypervisor and the coprocessor.
-The library is the recommended way for applications to use the
-coprocessor, and the driver interface is not intended for general use.
-This document describes the general flow of the driver, its
-structures, and its programmatic interface. It also provides example
-code sufficient to write user or kernel applications that use DAX
-functionality.
-
-The user library is open source and available at:
- https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
-
-The Hypervisor interface to the coprocessor is described in detail in
-the accompanying document, dax-hv-api.txt, which is a plain text
-excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
-Specification" version 3.0.20+15, dated 2017-09-25.
-
-
-High Level Overview
--------------------
-
-A coprocessor request is described by a Command Control Block
-(CCB). The CCB contains an opcode and various parameters. The opcode
-specifies what operation is to be done, and the parameters specify
-options, flags, sizes, and addresses. The CCB (or an array of CCBs)
-is passed to the Hypervisor, which handles queueing and scheduling of
-requests to the available coprocessor execution units. A status code
-returned indicates if the request was submitted successfully or if
-there was an error. One of the addresses given in each CCB is a
-pointer to a "completion area", which is a 128 byte memory block that
-is written by the coprocessor to provide execution status. No
-interrupt is generated upon completion; the completion area must be
-polled by software to find out when a transaction has finished, but
-the M7 and later processors provide a mechanism to pause the virtual
-processor until the completion status has been updated by the
-coprocessor. This is done using the monitored load and mwait
-instructions, which are described in more detail later. The DAX
-coprocessor was designed so that after a request is submitted, the
-kernel is no longer involved in the processing of it. The polling is
-done at the user level, which results in almost zero latency between
-completion of a request and resumption of execution of the requesting
-thread.
-
-
-Addressing Memory
------------------
-
-The kernel does not have access to physical memory in the Sun4v
-architecture, as there is an additional level of memory virtualization
-present. This intermediate level is called "real" memory, and the
-kernel treats this as if it were physical. The Hypervisor handles the
-translations between real memory and physical so that each logical
-domain (LDOM) can have a partition of physical memory that is isolated
-from that of other LDOMs. When the kernel sets up a virtual mapping,
-it specifies a virtual address and the real address to which it should
-be mapped.
-
-The DAX coprocessor can only operate on physical memory, so before a
-request can be fed to the coprocessor, all the addresses in a CCB must
-be converted into physical addresses. The kernel cannot do this since
-it has no visibility into physical addresses. So a CCB may contain
-either the virtual or real addresses of the buffers or a combination
-of them. An "address type" field is available for each address that
-may be given in the CCB. In all cases, the Hypervisor will translate
-all the addresses to physical before dispatching to hardware. Address
-translations are performed using the context of the process initiating
-the request.
-
-
-The Driver API
---------------
-
-An application makes requests to the driver via the write() system
-call, and gets results (if any) via read(). The completion areas are
-made accessible via mmap(), and are read-only for the application.
-
-The request may either be an immediate command or an array of CCBs to
-be submitted to the hardware.
-
-Each open instance of the device is exclusive to the thread that
-opened it, and must be used by that thread for all subsequent
-operations. The driver open function creates a new context for the
-thread and initializes it for use. This context contains pointers and
-values used internally by the driver to keep track of submitted
-requests. The completion area buffer is also allocated, and this is
-large enough to contain the completion areas for many concurrent
-requests. When the device is closed, any outstanding transactions are
-flushed and the context is cleaned up.
-
-On a DAX1 system (M7), the device will be called "oradax1", while on a
-DAX2 system (M8) it will be "oradax2". If an application requires one
-or the other, it should simply attempt to open the appropriate
-device. Only one of the devices will exist on any given system, so the
-name can be used to determine what the platform supports.
-
-The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
-all of these, success is indicated by a return value from write()
-equal to the number of bytes given in the call. Otherwise -1 is
-returned and errno is set.
-
-CCB_DEQUEUE
-
-Tells the driver to clean up resources associated with past
-requests. Since no interrupt is generated upon the completion of a
-request, the driver must be told when it may reclaim resources. No
-further status information is returned, so the user should not
-subsequently call read().
-
-CCB_KILL
-
-Kills a CCB during execution. The CCB is guaranteed to not continue
-executing once this call returns successfully. On success, read() must
-be called to retrieve the result of the action.
-
-CCB_INFO
-
-Retrieves information about a currently executing CCB. Note that some
-Hypervisors might return 'notfound' when the CCB is in 'inprogress'
-state. To ensure a CCB in the 'notfound' state will never be executed,
-CCB_KILL must be invoked on that CCB. Upon success, read() must be
-called to retrieve the details of the action.
-
-Submission of an array of CCBs for execution
-
-A write() whose length is a multiple of the CCB size is treated as a
-submit operation. The file offset is treated as the index of the
-completion area to use, and may be set via lseek() or using the
-pwrite() system call. If -1 is returned then errno is set to indicate
-the error. Otherwise, the return value is the length of the array that
-was actually accepted by the coprocessor. If the accepted length is
-equal to the requested length, then the submission was completely
-successful and there is no further status needed; hence, the user
-should not subsequently call read(). Partial acceptance of the CCB
-array is indicated by a return value less than the requested length,
-and read() must be called to retrieve further status information. The
-status will reflect the error caused by the first CCB that was not
-accepted, and status_data will provide additional data in some cases.
-
-MMAP
-
-The mmap() function provides access to the completion area allocated
-in the driver. Note that the completion area is not writeable by the
-user process, and the mmap call must not specify PROT_WRITE.
-
-
-Completion of a Request
------------------------
-
-The first byte in each completion area is the command status which is
-updated by the coprocessor hardware. Software may take advantage of
-new M7/M8 processor capabilities to efficiently poll this status byte.
-First, a "monitored load" is achieved via a Load from Alternate Space
-(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
-"monitored wait" is achieved via the mwait instruction (a write to
-%asr28). This instruction is like pause in that it suspends execution
-of the virtual processor for the given number of nanoseconds, but in
-addition will terminate early when one of several events occur. If the
-block of data containing the monitored location is modified, then the
-mwait terminates. This causes software to resume execution immediately
-(without a context switch or kernel to user transition) after a
-transaction completes. Thus the latency between transaction completion
-and resumption of execution may be just a few nanoseconds.
-
-
-Application Life Cycle of a DAX Submission
-------------------------------------------
-
- - open dax device
- - call mmap() to get the completion area address
- - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
- - submit CCB via write() or pwrite()
- - go into a loop executing monitored load + monitored wait and
- terminate when the command status indicates the request is complete
- (CCB_KILL or CCB_INFO may be used any time as necessary)
- - perform a CCB_DEQUEUE
- - call munmap() for completion area
- - close the dax device
-
-
-Memory Constraints
-------------------
-
-The DAX hardware operates only on physical addresses. Therefore, it is
-not aware of virtual memory mappings and the discontiguities that may
-exist in the physical memory that a virtual buffer maps to. There is
-no I/O TLB or any scatter/gather mechanism. All buffers, whether input
-or output, must reside in a physically contiguous region of memory.
-
-The Hypervisor translates all addresses within a CCB to physical
-before handing off the CCB to DAX. The Hypervisor determines the
-virtual page size for each virtual address given, and uses this to
-program a size limit for each address. This prevents the coprocessor
-from reading or writing beyond the bound of the virtual page, even
-though it is accessing physical memory directly. A simpler way of
-saying this is that a DAX operation will never "cross" a virtual page
-boundary. If an 8k virtual page is used, then the data is strictly
-limited to 8k. If a user's buffer is larger than 8k, then a larger
-page size must be used, or the transaction size will be truncated to
-8k.
-
-Huge pages. A user may allocate huge pages using standard interfaces.
-Memory buffers residing on huge pages may be used to achieve much
-larger DAX transaction sizes, but the rules must still be followed,
-and no transaction will cross a page boundary, even a huge page. A
-major caveat is that Linux on Sparc presents 8Mb as one of the huge
-page sizes. Sparc does not actually provide a 8Mb hardware page size,
-and this size is synthesized by pasting together two 4Mb pages. The
-reasons for this are historical, and it creates an issue because only
-half of this 8Mb page can actually be used for any given buffer in a
-DAX request, and it must be either the first half or the second half;
-it cannot be a 4Mb chunk in the middle, since that crosses a
-(hardware) page boundary. Note that this entire issue may be hidden by
-higher level libraries.
-
-
-CCB Structure
--------------
-A CCB is an array of 8 64-bit words. Several of these words provide
-command opcodes, parameters, flags, etc., and the rest are addresses
-for the completion area, output buffer, and various inputs:
-
- struct ccb {
- u64 control;
- u64 completion;
- u64 input0;
- u64 access;
- u64 input1;
- u64 op_data;
- u64 output;
- u64 table;
- };
-
-See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
-each of these fields, and see dax-hv-api.txt for a complete description
-of the Hypervisor API available to the guest OS (ie, Linux kernel).
-
-The first word (control) is examined by the driver for the following:
- - CCB version, which must be consistent with hardware version
- - Opcode, which must be one of the documented allowable commands
- - Address types, which must be set to "virtual" for all the addresses
- given by the user, thereby ensuring that the application can
- only access memory that it owns
-
-
-Example Code
-------------
-
-The DAX is accessible to both user and kernel code. The kernel code
-can make hypercalls directly while the user code must use wrappers
-provided by the driver. The setup of the CCB is nearly identical for
-both; the only difference is in preparation of the completion area. An
-example of user code is given now, with kernel code afterwards.
-
-In order to program using the driver API, the file
-arch/sparc/include/uapi/asm/oradax.h must be included.
-
-First, the proper device must be opened. For M7 it will be
-/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
-procedure is to attempt to open both, as only one will succeed:
-
- fd = open("/dev/oradax1", O_RDWR);
- if (fd < 0)
- fd = open("/dev/oradax2", O_RDWR);
- if (fd < 0)
- /* No DAX found */
-
-Next, the completion area must be mapped:
-
- completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
-
-All input and output buffers must be fully contained in one hardware
-page, since as explained above, the DAX is strictly constrained by
-virtual page boundaries. In addition, the output buffer must be
-64-byte aligned and its size must be a multiple of 64 bytes because
-the coprocessor writes in units of cache lines.
-
-This example demonstrates the DAX Scan command, which takes as input a
-vector and a match value, and produces a bitmap as the output. For
-each input element that matches the value, the corresponding bit is
-set in the output.
-
-In this example, the input vector consists of a series of single bits,
-and the match value is 0. So each 0 bit in the input will produce a 1
-in the output, and vice versa, which produces an output bitmap which
-is the input bitmap inverted.
-
-For details of all the parameters and bits used in this CCB, please
-refer to section 36.2.1.3 of the DAX Hypervisor API document, which
-describes the Scan command in detail.
-
- ccb->control = /* Table 36.1, CCB Header Format */
- (2L << 48) /* command = Scan Value */
- | (3L << 40) /* output address type = primary virtual */
- | (3L << 34) /* primary input address type = primary virtual */
- /* Section 36.2.1, Query CCB Command Formats */
- | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */
- | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
- | (8 << 10) /* 36.2.1.1.6 output format = bit vector */
- | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
- | (31 << 0); /* 36.2.1.3 Disable second scan criteria */
-
- ccb->completion = 0; /* Completion area address, to be filled in by driver */
-
- ccb->input0 = (unsigned long) input; /* primary input address */
-
- ccb->access = /* Section 36.2.1.2, Data Access Control */
- (2 << 24) /* Primary input length format = bits */
- | (nbits - 1); /* number of bits in primary input stream, minus 1 */
-
- ccb->input1 = 0; /* secondary input address, unused */
-
- ccb->op_data = 0; /* scan criteria (value to be matched) */
-
- ccb->output = (unsigned long) output; /* output address */
-
- ccb->table = 0; /* table address, unused */
-
-The CCB submission is a write() or pwrite() system call to the
-driver. If the call fails, then a read() must be used to retrieve the
-status:
-
- if (pwrite(fd, ccb, 64, 0) != 64) {
- struct ccb_exec_result status;
- read(fd, &status, sizeof(status));
- /* bail out */
- }
-
-After a successful submission of the CCB, the completion area may be
-polled to determine when the DAX is finished. Detailed information on
-the contents of the completion area can be found in section 36.2.2 of
-the DAX HV API document.
-
- while (1) {
- /* Monitored Load */
- __asm__ __volatile__("lduba [%1] 0x84, %0\n"
- : "=r" (status)
- : "r" (completion_area));
-
- if (status) /* 0 indicates command in progress */
- break;
-
- /* MWAIT */
- __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
- }
-
-A completion area status of 1 indicates successful completion of the
-CCB and validity of the output bitmap, which may be used immediately.
-All other non-zero values indicate error conditions which are
-described in section 36.2.2.
-
- if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
- /* completion_area[0] contains the completion status */
- /* completion_area[1] contains an error code, see 36.2.2 */
- }
-
-After the completion area has been processed, the driver must be
-notified that it can release any resources associated with the
-request. This is done via the dequeue operation:
-
- struct dax_command cmd;
- cmd.command = CCB_DEQUEUE;
- if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
- /* bail out */
- }
-
-Finally, normal program cleanup should be done, i.e., unmapping
-completion area, closing the dax device, freeing memory etc.
-
-[Kernel example]
-
-The only difference in using the DAX in kernel code is the treatment
-of the completion area. Unlike user applications which mmap the
-completion area allocated by the driver, kernel code must allocate its
-own memory to use for the completion area, and this address and its
-type must be given in the CCB:
-
- ccb->control |= /* Table 36.1, CCB Header Format */
- (3L << 32); /* completion area address type = primary virtual */
-
- ccb->completion = (unsigned long) completion_area; /* Completion area address */
-
-The dax submit hypercall is made directly. The flags used in the
-ccb_submit call are documented in the DAX HV API in section 36.3.1.
-
-#include <asm/hypervisor.h>
-
- hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
- HV_CCB_QUERY_CMD |
- HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
- HV_CCB_VA_PRIVILEGED,
- 0, &bytes_accepted, &status_data);
-
- if (hv_rv != HV_EOK) {
- /* hv_rv is an error code, status_data contains */
- /* potential additional status, see 36.3.1.1 */
- }
-
-After the submission, the completion area polling code is identical to
-that in user land:
-
- while (1) {
- /* Monitored Load */
- __asm__ __volatile__("lduba [%1] 0x84, %0\n"
- : "=r" (status)
- : "r" (completion_area));
-
- if (status) /* 0 indicates command in progress */
- break;
-
- /* MWAIT */
- __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
- }
-
- if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
- /* completion_area[0] contains the completion status */
- /* completion_area[1] contains an error code, see 36.2.2 */
- }
-
-The output bitmap is ready for consumption immediately after the
-completion status indicates success.
* the recommended way for applications to use the coprocessor, and
* the driver interface is not intended for general use.
*
- * See Documentation/sparc/oradax/oracle-dax.txt for more details.
+ * See Documentation/sparc/oradax/oracle-dax.rst for more details.
*/
#include <linux/uaccess.h>