ceph/src/spdk/intel-ipsec-mb/ReleaseNotes.txt

========================================================================
Release Notes for Intel(R) Multi-Buffer Crypto for IPsec Library

v0.49 March 2018
========================================================================

21 Mar, 2018

General
- AES-CMAC support added (AES-CMAC-128 and AES-CMAC-96)
- 3DES support added
- Library compiles to SO/DLL by default
- Install/uninstall targets added to makefiles
- Multiple API header files consolidated into one (intel-ipsec-mb.h)
- Unhalted cycles support added to LibPerfApp (Linux only at the moment)
- ELF stack execute protection added for assembly files
- VZEROUPPER instruction issued after AVX2/AVX512 code to avoid
  expensive SSE<->AVX transitions
- MAN page added
- README documentation extensions and updates
- AVX512 DES performance smoothed out
- Multi-buffer manager instance allocate and free APIs added
- Core affinity support added in LibPerfApp

v0.48 December 2017
========================================================================

12 Dec, 2017

General
- Linux SO compilation option added
- Windows DLL compilation option added
- AES CCM 128 support added
- Multithread command line option added to LibPerfApp
- Coding style fixes
- Coding style target added to Makefile

v0.47 October 2017
========================================================================

Oct 5, 2017

Intel(R) AVX-512 Instructions
- DES CBC AVX512 implementation
- DOCSIS DES AVX512 implementation
General
- DES CBC cipher added (generic x86 implementation)
- DOCSIS DES cipher added (generic x86 implementation)
- DES and DOCSIS DES tests added
- RPM SPEC file created

v0.46 June 2017
========================================================================

Jun 27, 2017

General
- AES GCM optimizations for AVX2
- Change of AES GCM API: renamed and expanded keys separated from the context
- New AES GCM API via job structure and APIs
  - use of the interface may simplify application design at the expense of
    slightly lower performance vs direct AES GCM APIs
- AES GCM IV automatically padded with block counter (no need for application to do it)
- IV in AES CTR mode can be 12 bytes (no block counter); 16 byte format still allowed
- Macros added to ease access to job API for specific architecture
  - use of these macros can simplify application design but it may produce worse
    performance than calling architecture job APIs directly
- submit_job_nocheck() API added to gain some cycles by not validating job structure
- Result stability improvements in LibPerfApp

v0.45 March 2017
========================================================================

Mar 29, 2017

Intel(R) AVX-512 Instructions
- Added optimized HMAC-SHA224 and HMAC-SHA256
- Added optimized HMAC-SHA384 and HMAC-SHA512
General
- Windows x64 compilation target
- New DOCSIS SEC BPI V3.1 cipher
- GCM128 and GCM256 updates (with new API that is scatter gather list friendly)
- GCM192 added
- Added library API benchmark tool 'ipsec_perf' and
  script to compare results 'ipsec_diff_tool.py'
Bug Fixes (vs v0.44)
- AES CTR mode fix to allow message sizes that are not a multiple of the
  AES block size
- RSI and RDI registers clobbered when running HMAC-SHA224 or HMAC-SHA256
  on Windows using SHA extensions

v0.44 November 2016
========================================================================

Nov 21, 2016

Intel(R) AVX-512 Instructions
- AVX512 multi buffer manager added (uses AVX2 implementations by default)
- Optimized SHA1 implementation added
Intel(R) SHA Extensions
- SHA1, SHA224 and SHA256 implementations added for Intel(R) SSE
General
- NULL cipher added
- NULL hash added
- NASM tool chain compilation added (default)

=======================================
Feb 11, 2015

Fixed so that the job field auth_tag_output_len_in_bytes takes a different
value for different MAC types. In particular, the valid values are (in bytes):
SHA1 - 12
SHA224 - 14
SHA256 - 16
SHA384 - 24
SHA512 - 32
XCBC - 12
MD5 - 12

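The list above can be captured in a small lookup helper. This is only a
sketch: the enum and function names are hypothetical and are not taken
from the library headers; only the byte counts come from the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical MAC identifiers; the library's own enum may differ. */
enum mac_type {
    MAC_SHA1, MAC_SHA224, MAC_SHA256, MAC_SHA384, MAC_SHA512,
    MAC_XCBC, MAC_MD5, MAC_TYPE_COUNT
};

/* Valid auth_tag_output_len_in_bytes per MAC type, from the list above. */
static const size_t mac_tag_len[] = {
    [MAC_SHA1]   = 12,
    [MAC_SHA224] = 14,
    [MAC_SHA256] = 16,
    [MAC_SHA384] = 24,
    [MAC_SHA512] = 32,
    [MAC_XCBC]   = 12,
    [MAC_MD5]    = 12,
};

static_assert(sizeof mac_tag_len / sizeof mac_tag_len[0] == MAC_TYPE_COUNT,
              "one tag length per MAC type");

/* Returns the expected tag length, or 0 for an unknown MAC type. */
size_t expected_tag_len(enum mac_type mac)
{
    if ((int)mac < 0 || (int)mac >= MAC_TYPE_COUNT)
        return 0;
    return mac_tag_len[mac];
}
```
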
=======================================
Oct 24, 2011

SHA_256 added to multibuffer
------------------------
12 Aug 2011

API

  The GCM API is distinct from the Multi-buffer API. This is because
  the GCM code is an optimized single-buffer implementation. By
  packaging them separately, the application has the option of where,
  when, and how to call the GCM code, independent of how it is calling
  the multi-buffer code.

  For example, the application might be enqueuing multi-buffer requests
  for a separate thread to process. In this scenario, if a particular
  packet used GCM, then the application could choose whether to call
  the GCM routines directly, or whether to enqueue those requests and
  have the compute thread call the GCM routines.

GCM API

  The GCM functions are defined as described in the header
  files. They are simple computational routines, with no state
  associated with them.

Multi-Buffer API: Two Sets of Functions

  There are two parallel interfaces, one suffixed with "_sse" and one
  suffixed with "_avx". These are functionally equivalent. The "_sse"
  functions work on WSM (Westmere) and later processors. The "_avx"
  functions offer better performance, but they only run on processors
  after WSM, i.e. SNB (Sandy Bridge) and later.

  The same interface object structures are used for both sets of
  interfaces, although one cannot mix the two interfaces on the same
  initialized object (e.g. it would be wrong to initialize with
  init_mb_mgr_sse() and then to pass that to submit_job_avx() ). After
  the MB_MGR structure has been initialized with one of the two
  initialization functions (init_mb_mgr_sse() or init_mb_mgr_avx()),
  only the corresponding functions should be used on it.

  There are several ways in which an application could use these
  interfaces.

  1) Direct
     If an application is only going to be run on a post-WSM machine,
     it can just call the "_avx" functions directly. Conversely, if it
     is just going to be run on WSM machines, it can call the "_sse"
     functions directly.

  2) Via Branches
     If an application can run on both WSM and SNB and wants the
     improved performance on SNB, then it can use some method to
     determine if it is on SNB, and then use a conditional branch to
     determine which function to call. E.g. this could be wrapped in a
     macro along the lines of:
     #define submit_job(mb_mgr) \
        ((_use_avx) ? submit_job_avx(mb_mgr) : submit_job_sse(mb_mgr))

  3) Via a Function Table
     One can embed the function addresses into a structure, call them
     through this structure, and change the structure based on which
     set of functions one wishes to use, e.g.

        struct funcs_t {
            init_mb_mgr_t       init_mb_mgr;
            get_next_job_t      get_next_job;
            submit_job_t        submit_job;
            get_completed_job_t get_completed_job;
            flush_job_t         flush_job;
        };

        struct funcs_t funcs_sse = {
            init_mb_mgr_sse,
            get_next_job_sse,
            submit_job_sse,
            get_completed_job_sse,
            flush_job_sse
        };
        struct funcs_t funcs_avx = {
            init_mb_mgr_avx,
            get_next_job_avx,
            submit_job_avx,
            get_completed_job_avx,
            flush_job_avx
        };
        struct funcs_t *funcs = &funcs_sse;
        ...
        if (do_avx)
            funcs = &funcs_avx;
        ...
        funcs->init_mb_mgr(&mb_mgr);

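  The function-table approach in 3) can be combined with a runtime CPU
  check. The following is a minimal, self-contained sketch under stated
  assumptions: the MB_MGR and JOB_AES_HMAC definitions and the function
  bodies are placeholder stubs, the signatures are guesses based on the
  calls shown in this document, and cpu_has_avx() relies on the GCC/Clang
  builtin __builtin_cpu_supports(), which is not part of this library.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder types and stubs so the sketch compiles on its own; the
 * real MB_MGR, JOB_AES_HMAC and *_sse/*_avx functions come from the
 * library, and their signatures are assumed here. */
typedef struct { int initialized_by_avx; } MB_MGR;
typedef struct { int unused; } JOB_AES_HMAC;

static void init_mb_mgr_sse(MB_MGR *m) { assert(m); m->initialized_by_avx = 0; }
static void init_mb_mgr_avx(MB_MGR *m) { assert(m); m->initialized_by_avx = 1; }
static JOB_AES_HMAC *submit_job_sse(MB_MGR *m) { (void)m; return NULL; }
static JOB_AES_HMAC *submit_job_avx(MB_MGR *m) { (void)m; return NULL; }

struct funcs_t {
    void (*init_mb_mgr)(MB_MGR *);
    JOB_AES_HMAC *(*submit_job)(MB_MGR *);
};

static const struct funcs_t funcs_sse = { init_mb_mgr_sse, submit_job_sse };
static const struct funcs_t funcs_avx = { init_mb_mgr_avx, submit_job_avx };

/* Returns 1 if the CPU reports AVX support (GCC/Clang on x86 only). */
int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0; /* unknown platform: fall back to the "_sse" set */
#endif
}

/* Pick one function set once. All later calls on a given MB_MGR should
 * go through the same table, so the two sets are never mixed. */
const struct funcs_t *select_funcs(int use_avx)
{
    return use_avx ? &funcs_avx : &funcs_sse;
}
```

  A caller would typically evaluate select_funcs(cpu_has_avx()) once at
  start-up and keep the returned pointer next to the MB_MGR it
  initializes through it.
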
  For simplicity in the rest of this document, the functions will be
  referred to without a suffix.

API: Overview

  The basic unit of work is a "job". It is represented by a
  JOB_AES_HMAC structure. It contains all of the information needed to
  perform encryption/decryption and SHA1/HMAC authentication on one
  buffer for IPsec processing.

  The basic paradigm is that the application needs to be able to
  provide new jobs before old jobs have completed processing. One
  might call this an "asynchronous" interface.

  The basic interface is that the application "submits" a job to the
  multi-buffer manager (MB_MGR), and it may receive a completed job
  back, or it may receive NULL. The returned job, if there is one,
  will not be the same as the submitted job, but the jobs will be
  returned in the same order in which they are submitted.

  Since there can be a semi-arbitrary number of outstanding jobs,
  management of the job object is handled by the MB_MGR. The
  application gets a pointer to a new job object by calling
  get_next_job(). It then fills in the data fields and submits it by
  calling submit_job(). If a job is returned, then that job has been
  completed, and the application should do whatever it needs to do in
  order to further process that buffer.

  The job object is not explicitly returned to the MB_MGR. Rather it
  is implicitly returned by the next call to get_next_job(). Another
  way to put this is that the data within the job object is
  guaranteed to be valid until the next call to get_next_job().

  In order to reduce latency, there is an optional function that may
  be called, get_completed_job(). This returns the next job if that
  job has previously been completed. But if that job has not been
  completed, no processing is done, and the function returns
  NULL. This may be used to reduce the number of outstanding jobs
  within the MB_MGR.

  At times, it may be necessary to process the jobs currently within
  the MB_MGR without providing new jobs as input. This process is
  called "flushing", and it is invoked by calling flush_job(). If
  there are any jobs within the MB_MGR, this will complete processing
  on the earliest job and return it. It will only return NULL if there
  are no jobs within the MB_MGR.

  Flushing will be described in more detail below.

  The presumption is that the same AES key will apply to a number of
  buffers. For increased efficiency, the library requires that the AES
  key expansion happens as a distinct step apart from buffer
  encryption/decryption. The expanded keys are stored in a data
  structure (array), and this expanded key structure is used by the
  job object.

  There are two variants provided, MB_MGR and MB_MGR2. They are
  functionally equivalent. The reason that two are provided is that
  they differ slightly in their implementation, and so they may have
  slightly different characteristics in terms of latency and overhead.

API: Usage Skeleton
  The basic usage is illustrated in the following pseudocode:

    init_mb_mgr(&mb_mgr);
    ...
    aes_keyexp_128(key, enc_exp_keys, dec_exp_keys);
    ...
    while (work_to_be_done) {
        job = get_next_job(&mb_mgr);
        // TODO: Fill in job fields
        job = submit_job(&mb_mgr);
        while (job) {
            // TODO: Complete processing on job
            job = get_completed_job(&mb_mgr);
        }
    }

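  To make the control flow of the skeleton concrete, here is a runnable
  toy model. Everything prefixed with toy_ is a hypothetical stand-in
  written for this illustration: it performs no cryptography and is not
  how the library is implemented. It only mimics the behaviour described
  above (jobs are held until a batch is available, are returned in
  submission order, and can be forced out by flushing), and it adds the
  final flush loop that the skeleton leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LANES 4 /* pending jobs needed before a batch is "processed" */
#define TOY_RING  8 /* job slots owned by the toy manager */

typedef enum { TOY_BEING_PROCESSED, TOY_COMPLETED } toy_status;
typedef struct { int user_data; toy_status status; } toy_job;
typedef struct {
    toy_job jobs[TOY_RING];
    unsigned next;     /* slot handed out by toy_get_next_job() */
    unsigned earliest; /* oldest job not yet returned */
} toy_mgr;

void toy_init_mb_mgr(toy_mgr *m) { m->next = m->earliest = 0; }

toy_job *toy_get_next_job(toy_mgr *m) { return &m->jobs[m->next % TOY_RING]; }

/* Returns the oldest job if it has completed, else NULL (no processing). */
toy_job *toy_get_completed_job(toy_mgr *m)
{
    toy_job *j = &m->jobs[m->earliest % TOY_RING];
    if (m->earliest == m->next || j->status != TOY_COMPLETED)
        return NULL;
    m->earliest++;
    return j;
}

/* Queues the job obtained from toy_get_next_job(); once TOY_LANES jobs
 * are pending, all of them are marked completed at once. */
toy_job *toy_submit_job(toy_mgr *m)
{
    m->jobs[m->next % TOY_RING].status = TOY_BEING_PROCESSED;
    m->next++;
    if (m->next - m->earliest >= TOY_LANES)
        for (unsigned i = m->earliest; i != m->next; i++)
            m->jobs[i % TOY_RING].status = TOY_COMPLETED;
    return toy_get_completed_job(m);
}

/* Forces completion of the oldest pending job; NULL if none is pending. */
toy_job *toy_flush_job(toy_mgr *m)
{
    if (m->earliest == m->next)
        return NULL;
    m->jobs[m->earliest % TOY_RING].status = TOY_COMPLETED;
    return toy_get_completed_job(m);
}

/* Submits n jobs, drains completions, then flushes the remainder.
 * Returns the number of jobs handed back; *in_order is cleared if any
 * job comes back out of submission order. */
int process_all(toy_mgr *m, int n, int *in_order)
{
    int returned = 0;
    assert(m != NULL && in_order != NULL);
    *in_order = 1;
    for (int i = 0; i < n; i++) {
        toy_job *job = toy_get_next_job(m);
        job->user_data = i;              /* fill in job fields */
        job = toy_submit_job(m);
        while (job != NULL) {            /* consume completed jobs */
            if (job->user_data != returned++)
                *in_order = 0;
            job = toy_get_completed_job(m);
        }
    }
    /* No more input: flush until the manager is empty. */
    for (toy_job *job = toy_flush_job(m); job != NULL; job = toy_flush_job(m))
        if (job->user_data != returned++)
            *in_order = 0;
    return returned;
}
```
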
API: Job Fields
  The mode is determined by the fields "cipher_direction" and
  "chain_order". The first specifies encrypt or decrypt, and the
  second specifies whether the hash should be done before or
  after the cipher operation.
  In the current implementation, only two combinations of these are
  supported. For encryption, these should be set to "ENCRYPT" and
  "CIPHER_HASH", and for decryption, these should be set to "DECRYPT"
  and "HASH_CIPHER".

  The expanded keys are pointed to by "aes_enc_key_expanded" and
  "aes_dec_key_expanded". These arrays must be aligned on a 16-byte
  boundary. Only one of these is necessary (as determined by
  "cipher_direction" and, for counter mode, as described under
  "AES Key Usage" below).

  One selects AES128 vs AES256 by using the "aes_key_len_in_bytes"
  field. The only valid values are 16 (AES128) and 32 (AES256).

  One selects the AES mode (CBC versus counter-mode) using
  "cipher_mode".

  One selects the hash algorithm (SHA1-HMAC, AES-XCBC, or MD5-HMAC)
  using "hash_alg".

  The data to be encrypted/decrypted is defined by
  "src + cipher_start_src_offset_in_bytes". The length of data is
  given by "msg_len_to_cipher_in_bytes". It must be a multiple of
  16 bytes.

  The destination for the cipher operation is given by "dst" (NOT by
  "dst + cipher_start_src_offset_in_bytes"). In many/most applications,
  the destination pointer may overlap the source pointer. That is,
  "dst" may be equal to "src + cipher_start_src_offset_in_bytes".

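  The relationship between "src", the source offset and "dst" is easy to
  get wrong, so here is a small sketch of setting up an in-place cipher
  region. The struct is a stand-in containing only the fields named
  above; the field types and the helper name are assumptions, and the
  real JOB_AES_HMAC structure has more members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with only the fields discussed above. Field names follow the
 * text; the types are assumptions. */
struct job_cipher_fields {
    const uint8_t *src;
    uint8_t *dst;
    uint64_t cipher_start_src_offset_in_bytes;
    uint64_t msg_len_to_cipher_in_bytes;
};

/* Describes an in-place cipher operation over buf[offset .. offset+len).
 * Returns 0 on success, or -1 if len is not a multiple of 16 bytes or
 * the region does not fit inside the buffer. */
int set_cipher_region_in_place(struct job_cipher_fields *job, uint8_t *buf,
                               size_t buf_len, size_t offset, size_t len)
{
    assert(job != NULL && buf != NULL);
    if (len % 16 != 0 || offset > buf_len || len > buf_len - offset)
        return -1;
    job->src = buf;
    job->cipher_start_src_offset_in_bytes = offset;
    job->msg_len_to_cipher_in_bytes = len;
    /* dst points directly at the first output byte: the source offset
     * is NOT added to it again. */
    job->dst = buf + offset;
    return 0;
}
```
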
  The IV for the cipher operation is given by "iv". The
  "iv_len_in_bytes" should be 16. This pointer does not need to be
  aligned.

  The data to be hashed is defined by
  "src + hash_start_src_offset_in_bytes". The length of data is
  given by "msg_len_to_hash_in_bytes".

  The output of the hash operation is defined by
  "auth_tag_output". The number of bytes written is given by
  "auth_tag_output_len_in_bytes". At the time of this note, the only
  valid value for this parameter was 12; the Feb 11, 2015 entry above
  lists the per-algorithm values that apply since that fix.

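  The constraints stated so far in this section can be collected into a
  pre-submit check. This is a sketch under assumptions: the enum values,
  field types and function name are invented for the example, and the
  limits are the ones given in this (Aug 2011) section, some of which
  later releases relax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in enums and struct. Names follow the text; values and types
 * are assumptions made so the sketch is self-contained. */
enum cipher_direction { ENCRYPT = 1, DECRYPT };
enum chain_order { CIPHER_HASH = 1, HASH_CIPHER };

struct job_params {
    enum cipher_direction cipher_direction;
    enum chain_order chain_order;
    uint64_t aes_key_len_in_bytes;
    uint64_t msg_len_to_cipher_in_bytes;
    uint64_t iv_len_in_bytes;
    uint64_t auth_tag_output_len_in_bytes;
};

/* Returns 1 if the parameters satisfy every constraint listed in this
 * section, 0 otherwise. */
int job_params_valid(const struct job_params *p)
{
    assert(p != NULL);
    /* Only encrypt-then-hash and hash-then-decrypt are supported. */
    int mode_ok =
        (p->cipher_direction == ENCRYPT && p->chain_order == CIPHER_HASH) ||
        (p->cipher_direction == DECRYPT && p->chain_order == HASH_CIPHER);
    int key_ok = p->aes_key_len_in_bytes == 16 ||
                 p->aes_key_len_in_bytes == 32;
    int len_ok = p->msg_len_to_cipher_in_bytes % 16 == 0;
    int iv_ok = p->iv_len_in_bytes == 16;
    int tag_ok = p->auth_tag_output_len_in_bytes == 12;
    return mode_ok && key_ok && len_ok && iv_ok && tag_ok;
}
```
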
  The ipad and opad are given as the result of hashing the HMAC key
  xor'ed with the appropriate value. That is, rather than passing in
  the HMAC key and rehashing the initial block for every buffer, the
  hashing of the initial block is done separately, and the results of
  this hash are used as input in the job structure.

  Similar to the expanded AES keys, the premise here is that one HMAC
  key will apply to many buffers, so we want to do that hashing once
  and not for each buffer.

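  The "appropriate values" are the standard HMAC pad bytes 0x36 (inner)
  and 0x5c (outer) from RFC 2104. The sketch below only builds the two
  64-byte blocks to be hashed; the function name is hypothetical, and the
  one-block digest step is omitted because the exact signature of the
  library's digest helper is not given in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HMAC_BLOCK_SIZE 64 /* block size of SHA1 and MD5, in bytes */

/* Builds the two 64-byte blocks (zero-padded key XOR 0x36 and zero-padded
 * key XOR 0x5c) whose one-block digests serve as the ipad and opad job
 * inputs. Assumes key_len <= 64; RFC 2104 requires longer keys to be
 * hashed down first, which this sketch does not do. */
void hmac_build_pad_blocks(const uint8_t *key, size_t key_len,
                           uint8_t ipad_block[HMAC_BLOCK_SIZE],
                           uint8_t opad_block[HMAC_BLOCK_SIZE])
{
    assert(key != NULL && key_len <= HMAC_BLOCK_SIZE);
    /* Bytes beyond the key are 0 XOR pad, i.e. the pad byte itself. */
    memset(ipad_block, 0x36, HMAC_BLOCK_SIZE);
    memset(opad_block, 0x5c, HMAC_BLOCK_SIZE);
    for (size_t i = 0; i < key_len; i++) {
        ipad_block[i] ^= key[i];
        opad_block[i] ^= key[i];
    }
}
```

  Each block would then be run through a single-block digest once per
  key, and the two results reused for every job that uses that key.
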
  The "status" reflects the status of the returned job. It should be
  "STS_COMPLETED".

  The "user_data" field is ignored. It can be used to attach
  application data to the job object.

Flushing Concerns
  As long as jobs are coming in at a reasonable rate, jobs should be
  returned at a reasonable rate. However, if there is a lull in the
  arrival of new jobs, the last few jobs that were submitted tend to
  stay in the MB_MGR until new jobs arrive. This might result in there
  being an unreasonable latency for these jobs.

  In this case, flush_job() should be used to complete processing on
  these outstanding jobs and prevent them from having excessive
  latency.

  Exactly when and how to use flush_job() is up to the application,
  and is a balancing act. The processing of flush_job() is less
  efficient than that of submit_job(), so calling flush_job() too
  often will lower the system efficiency. Conversely, calling
  flush_job() too rarely may result in some jobs seeing excessive
  latency.

  There are several strategies that the application may employ for
  flushing. One usage model is that there is a (thread-safe) queue
  containing work items. One or more threads put work onto this
  queue, and one or more processing threads remove items from this
  queue and process them through the MB_MGR. In this usage, a simple
  flushing strategy is that when the processing thread wants to do
  more work, but the queue is empty, it then proceeds to flush jobs
  until either the queue contains more work, or the MB_MGR no longer
  contains jobs (i.e. flush_job() returns NULL). A variation on
  this is that when the work queue is empty, the processing thread
  might pause for a short time to see if any new work appears, before
  it starts flushing.

  In other usage models, there may be no such queue. An alternate
  flushing strategy is to have a separate "flush thread" hanging
  around. It wakes up periodically and checks to see if any work has
  been requested since the last time it woke up. If some period of
  time has gone by with no new work appearing, it would proceed to
  flush the MB_MGR.

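  The "flush thread" idea boils down to a small piece of bookkeeping. The
  sketch below shows one possible form of that decision; the structure,
  its fields and the idle_limit parameter are all hypothetical, and
  access to the shared counters is assumed to be protected by whatever
  locking the application already uses around the MB_MGR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping shared by the submitting code and a periodic
 * flush check. */
struct flush_state {
    uint64_t jobs_submitted;      /* incremented on every submit */
    uint64_t seen_at_last_wakeup; /* jobs_submitted at the previous check */
    unsigned idle_wakeups;        /* consecutive checks with no new work */
};

/* Called on each periodic wakeup. Returns 1 once no new work has arrived
 * for idle_limit consecutive wakeups, meaning the caller should start
 * calling flush_job() until it returns NULL. Returns 0 otherwise. */
int should_flush(struct flush_state *s, unsigned idle_limit)
{
    assert(s != NULL);
    if (s->jobs_submitted != s->seen_at_last_wakeup) {
        /* New work arrived since the last check: reset the idle count. */
        s->seen_at_last_wakeup = s->jobs_submitted;
        s->idle_wakeups = 0;
        return 0;
    }
    if (s->idle_wakeups < idle_limit)
        s->idle_wakeups++;
    return s->idle_wakeups >= idle_limit;
}
```

  A larger idle_limit favours throughput (fewer flushes); a smaller one
  favours latency, which is the balancing act described above.
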
AES Key Usage
  If the AES mode is CBC, then the fields aes_enc_key_expanded or
  aes_dec_key_expanded are used depending on whether the data is
  being encrypted or decrypted. However, if the AES mode is CNTR
  (counter mode), then only aes_enc_key_expanded is used, even for a
  decrypt operation.

  The application can handle this dichotomy, or it might choose to
  simply set both fields in all cases.

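  An application that prefers to handle the dichotomy explicitly could
  use a helper like the following. It is a sketch only: the enum values
  and the function name are assumptions, while the selection rule is the
  one stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in enums. Names follow the text (CBC, CNTR, ENCRYPT, DECRYPT);
 * the values are assumptions. */
enum cipher_mode { CBC = 1, CNTR };
enum cipher_direction { ENCRYPT = 1, DECRYPT };

/* Returns the expanded key a job will actually use: the decrypt key
 * only for CBC decryption, the encrypt key in every other case. */
const void *select_expanded_key(enum cipher_mode mode,
                                enum cipher_direction dir,
                                const void *enc_exp_keys,
                                const void *dec_exp_keys)
{
    if (mode == CBC && dir == DECRYPT) {
        assert(dec_exp_keys != NULL);
        return dec_exp_keys;
    }
    assert(enc_exp_keys != NULL);
    return enc_exp_keys;
}
```
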
Thread Safety
  The MB_MGR and the associated functions ARE NOT thread safe. If
  there are multiple threads that may be calling these functions
  (e.g. a processing thread and a flushing thread), it is the
  responsibility of the application to put in place sufficient locking
  so that no two threads will make calls to the same MB_MGR object at
  the same time.

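  One simple way to provide that locking is to pair each manager with a
  mutex and route every call through a wrapper. The sketch below uses
  POSIX threads; struct toy_mgr and toy_op() are hypothetical stand-ins
  for the MB_MGR and for calls such as submit_job() or flush_job().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in manager so the sketch is self-contained; in real code this
 * would be the library's MB_MGR. */
struct toy_mgr { int calls; };

/* One lock per manager object: every call on mgr is made under lock. */
struct locked_mgr {
    pthread_mutex_t lock;
    struct toy_mgr mgr;
};

/* Runs op(&lm->mgr, arg) while holding the manager's lock, so that a
 * processing thread and a flushing thread never use it concurrently. */
void *with_mgr_locked(struct locked_mgr *lm,
                      void *(*op)(struct toy_mgr *, void *), void *arg)
{
    assert(lm != NULL && op != NULL);
    pthread_mutex_lock(&lm->lock);
    void *result = op(&lm->mgr, arg);
    pthread_mutex_unlock(&lm->lock);
    return result;
}

/* Example operation standing in for submit_job() or flush_job(). */
void *toy_op(struct toy_mgr *mgr, void *arg)
{
    (void)arg;
    mgr->calls++;
    return NULL;
}
```

  Note that a returned job stays valid only until the next get_next_job()
  call on the same manager, so the application may need to hold the lock,
  or copy what it needs, while it consumes the result.
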
XMM Register Usage
  The current implementation is designed for integration in the Linux
  Kernel. All of the functions satisfy the Linux ABI with respect to
  general purpose registers. However, the submit_job() and flush_job()
  functions use XMM registers without saving/restoring any of them. It
  is up to the application to manage the saving/restoring of XMM
  registers itself.

Auxiliary Functions
  There are several auxiliary functions packed with MB_MGR. These may
  be used, or the application may choose to use its own versions. Two
  of these, aes_keyexp_128() and aes_keyexp_256(), expand AES keys into
  a form that is acceptable for reference in the job structure.

  In the case of AES128, the expanded key structure should be an array
  of 11 128-bit words, aligned on a 16-byte boundary. In the case of
  AES256, it should be an array of 15 128-bit words, aligned on a
  16-byte boundary.

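  In C11 the size and alignment requirements can be expressed directly in
  the storage declaration. The type and macro names below are
  hypothetical; only the word counts and the 16-byte alignment come from
  the text above.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define AES_WORD_BYTES       16 /* one 128-bit word */
#define AES128_EXP_KEY_WORDS 11 /* 128-bit words for an AES128 key */
#define AES256_EXP_KEY_WORDS 15 /* 128-bit words for an AES256 key */

/* Storage for one expanded AES128 key pair, aligned on 16 bytes. */
struct aes128_exp_keys {
    alignas(16) uint8_t enc[AES128_EXP_KEY_WORDS * AES_WORD_BYTES];
    alignas(16) uint8_t dec[AES128_EXP_KEY_WORDS * AES_WORD_BYTES];
};

/* Storage for one expanded AES256 key pair, aligned on 16 bytes. */
struct aes256_exp_keys {
    alignas(16) uint8_t enc[AES256_EXP_KEY_WORDS * AES_WORD_BYTES];
    alignas(16) uint8_t dec[AES256_EXP_KEY_WORDS * AES_WORD_BYTES];
};

static_assert(sizeof(((struct aes128_exp_keys *)0)->enc) == 176,
              "11 words of 16 bytes");
static_assert(sizeof(((struct aes256_exp_keys *)0)->enc) == 240,
              "15 words of 16 bytes");

/* Returns 1 if p is aligned on a 16-byte boundary, 0 otherwise. */
int is_16_byte_aligned(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}
```

  The enc and dec members would then be passed as the expanded-key
  arguments of the key expansion call shown in the usage skeleton.
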
  There is also a function, sha1(), which will compute the SHA1 digest
  of a single 64-byte block. It can be used to compute the ipad and
  opad digests. There is a similar function, md5(), which can be used
  when using MD5-HMAC.

  For further details on the usage of these functions, see the sample
  test application.