ceph/doc/dev/crimson/poseidonstore.rst

   1 ===============
   2  PoseidonStore
   3 ===============
   4
   5 Key concepts and goals
   6 ======================
   7
   8 * As one of the pluggable backend stores for Crimson, PoseidonStore targets only
   9   high-end NVMe SSDs (not concerned with ZNS devices).
  10 * Designed entirely for low CPU consumption
  11
  12   - Hybrid update strategies for different data types (in-place, out-of-place) to
  13     minimize CPU consumption by reducing host-side GC.
  14   - Remove a black-box component like RocksDB and a file abstraction layer in BlueStore
  15     to avoid unnecessary overheads (e.g., data copy and serialization/deserialization)
  16   - Utilize NVMe feature (atomic large write command, Atomic Write Unit Normal).
  17     Make use of io_uring, new kernel asynchronous I/O interface, to selectively use the interrupt
  18     driven mode for CPU efficiency (or polled mode for low latency).
  19 * Sharded data/processing model
  20
  21 Background
  22 ----------
  23
  24 Both in-place and out-of-place update strategies have their pros and cons.
  25
  26 * Log-structured store
  27
  28   Log-structured based storage system is a typical example that adopts an update-out-of-place approach.
  29   It never modifies the written data. Writes always go to the end of the log. It enables I/O sequentializing.
  30
  31   * Pros
  32
  33     - Without a doubt, one sequential write is enough to store the data
  34     - It naturally supports transaction (this is no overwrite, so the store can rollback
  35       previous stable state)
  36     - Flash friendly (it mitigates GC burden on SSDs)
  37   * Cons
  38
  39     - There is host-side GC that induces overheads
  40
  41       - I/O amplification (host-side)
  42       - More host-CPU consumption
  43
  44     - Slow metadata lookup
  45     - Space overhead (live and unused data co-exist)
  46
  47 * In-place update store
  48
  49   The update-in-place strategy has been used widely for conventional file systems such as ext4 and xfs.
  50   Once a block has been placed in a given disk location, it doesn't move.
  51   Thus, writes go to the corresponding location in the disk.
  52
  53   * Pros
  54
  55     - Less host-CPU consumption (No host-side GC is required)
  56     - Fast lookup
  57     - No additional space for log-structured, but there is internal fragmentation
  58   * Cons
  59
  60     - More writes occur to record the data (metadata and data section are separated)
  61     - It cannot support transaction. Some form of WAL required to ensure update atomicity
  62       in the general case
  63     - Flash unfriendly (Give more burdens on SSDs due to device-level GC)
  64
  65 Motivation and Key idea
  66 -----------------------
  67
  68 In modern distributed storage systems, a server node can be equipped with multiple
  69 NVMe storage devices. In fact, ten or more NVMe SSDs could be attached on a server.
  70 As a result, it is hard to achieve NVMe SSD's full performance due to the limited CPU resources
  71 available in a server node. In such environments, CPU tends to become a performance bottleneck.
  72 Thus, now we should focus on minimizing host-CPU consumption, which is the same as the Crimson's objective.
  73
  74 Towards an object store highly optimized for CPU consumption, three design choices have been made.
  75
  76 * **PoseidonStore does not have a black-box component like RocksDB in BlueStore.**
  77
  78   Thus, it can avoid unnecessary data copy and serialization/deserialization overheads.
  79   Moreover, we can remove an unnecessary file abstraction layer, which was required to run RocksDB.
  80   Object data and metadata is now directly mapped to the disk blocks.
  81   Eliminating all these overheads will reduce CPU consumption (e.g., pre-allocation, NVME atomic feature).
  82
  83 * **PoseidonStore uses hybrid update strategies for different data size, similar to BlueStore.**
  84
  85   As we discussed, both in-place and out-of-place update strategies have their pros and cons.
  86   Since CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to mininize CPU consumption
  87   while choosing update-out-of-place for large I/O to avoid double write. Double write for small data may be better than host-GC overhead
  88   in terms of CPU consumption in the long run. Although it leaves GC entirely up to SSDs,
  89
  90 * **PoseidonStore makes use of io_uring, new kernel asynchronous I/O interface to exploit interrupt-driven I/O.**
  91
  92   User-space driven I/O solutions like SPDK provide high I/O performance by avoiding syscalls and enabling zero-copy
  93   access from the application. However, it does not support interrupt-driven I/O, which is only possible with kernel-space driven I/O.
  94   Polling is good for low-latency but bad for CPU efficiency. On the other hand, interrupt is good for CPU efficiency and bad for
  95   low-latency (but not that bad as I/O size increases). Note that network acceleration solutions like DPDK also excessively consume
  96   CPU resources for polling. Using polling both for network and storage processing aggravates CPU consumption.
  97   Since network is typically much faster and has a higher priority than storage, polling should be applied only to network processing.
  98
  99 high-end NVMe SSD has enough powers to handle more works. Also, SSD lifespan is not a practical concern these days
 100 (there is enough program-erase cycle limit [#f1]_). On the other hand, for large I/O workloads, the host can afford process host-GC.
 101 Also, the host can garbage collect invalid objects more effectively when their size is large
 102
 103 Observation
 104 -----------
 105
 106 Two data types in Ceph
 107
 108 * Data (object data)
 109
 110   - The cost of double write is high
 111   - The best method to store this data is in-place update
 112
 113     - At least two operations required to store the data: 1) data and 2) location of
 114       data. Nevertheless, a constant number of operations would be better than out-of-place
 115       even if it aggravates WAF in SSDs
 116
 117 * Metadata or small data (e.g., object_info_t, snapset, pg_log, and collection)
 118
 119   - Multiple small-sized metadata entries for an object
 120   - The best solution to store this data is WAL + Using cache
 121
 122     - The efficient way to store metadata is to merge all metadata related to data
 123       and store it though a single write operation even though it requires background
 124       flush to update the data partition
 125
 126
 127 Design
 128 ======
 129 .. ditaa::
 130
 131   +-WAL partition-|----------------------Data partition-------------------------------+
 132   | Sharded partition                                                                 |
 133   +-----------------------------------------------------------------------------------+
 134   | WAL -> |      | Super block | Freelist info | Onode radix tree info| Data blocks  |
 135   +-----------------------------------------------------------------------------------+
 136   | Sharded partition 2
 137   +-----------------------------------------------------------------------------------+
 138   | WAL -> |      | Super block | Freelist info | Onode radix tree info| Data blocks  |
 139   +-----------------------------------------------------------------------------------+
 140   | Sharded partition N
 141   +-----------------------------------------------------------------------------------+
 142   | WAL -> |      | Super block | Freelist info | Onode radix tree info| Data blocks  |
 143   +-----------------------------------------------------------------------------------+
 144   | Global information (in reverse order)
 145   +-----------------------------------------------------------------------------------+
 146   | Global WAL -> | | SB | Freelist |                                                 |
 147   +-----------------------------------------------------------------------------------+
 148
 149
 150 * WAL
 151
 152   - Log, metadata and small data are stored in the WAL partition
 153   - Space within the WAL partition is continually reused in a circular manner
 154   - Flush data to trim WAL as necessary
 155 * Disk layout
 156
 157   - Data blocks are metadata blocks or data blocks
 158   - Freelist manages the root of free space B+tree
 159   - Super block contains management info for a data partition
 160   - Onode radix tree info contains the root of onode radix tree
 161
 162
 163 I/O procedure
 164 -------------
 165 * Write
 166
 167   For incoming writes, data is handled differently depending on the request size;
 168   data is either written twice (WAL) or written in a log-structured manner.
 169
 170   #. If Request Size ≤ Threshold (similar to minimum allocation size in BlueStore)
 171
 172      Write data and metadata to [WAL] —flush—> Write them to [Data section (in-place)] and
 173      [Metadata section], respectively.
 174
 175      Since the CPU becomes the bottleneck for small I/O workloads, in-place update scheme is used.
 176      Double write for small data may be better than host-GC overhead in terms of CPU consumption
 177      in the long run
 178   #. Else if Request Size > Threshold
 179
 180      Append data to [Data section (log-structure)] —> Write the corresponding metadata to [WAL]
 181      —flush—> Write the metadata to [Metadata section]
 182
 183   For large I/O workloads, the host can afford process host-GC
 184   Also, the host can garbage collect invalid objects more effectively when their size is large
 185
 186   Note that Threshold can be configured to a very large number so that only the scenario (1) occurs.
 187   With this design, we can control the overall I/O procedure with the optimizations for crimson
 188   as described above.
 189
 190   * Detailed flow
 191
 192     We make use of a NVMe write command which provides atomicity guarantees (Atomic Write Unit Power Fail)
 193     For example, 512 Kbytes of data can be atomically written at once without fsync().
 194
 195     * stage 1
 196
 197       - if the data is small
 198         WAL (written) --> | TxBegin A | Log Entry | TxEnd A |
 199         Append a log entry that contains pg_log, snapset, object_infot_t and block allocation
 200         using NVMe atomic write command on the WAL
 201       - if the data is large
 202         Data partition (written) --> | Data blocks |
 203     * stage 2
 204
 205       - if the data is small
 206         No need.
 207       - if the data is large
 208         Then, append the metadata to WAL.
 209         WAL --> | TxBegin A | Log Entry | TxEnd A |
 210
 211 * Read
 212
 213   - Use the cached object metadata to find out the data location
 214   - If not cached, need to search WAL after checkpoint and Object meta partition to find the
 215     latest meta data
 216
 217 * Flush (WAL --> Data partition)
 218
 219   - Flush WAL entries that have been committed. There are two conditions
 220     (1. the size of WAL is close to full, 2. a signal to flush).
 221     We can mitigate the overhead of frequent flush via batching processing, but it leads to
 222     delaying completion.
 223
 224
 225 Crash consistency
 226 ------------------
 227
 228 * Large case
 229
 230   #. Crash occurs right after writing Data blocks
 231
 232      - Data partition --> | Data blocks |
 233      - We don't need to care this case. Data is not alloacted yet in reality. The blocks will be reused.
 234   #. Crash occurs right after WAL
 235
 236      - Data partition --> | Data blocks |
 237      - WAL --> | TxBegin A | Log Entry | TxEnd A |
 238      - Write procedure is completed, so there is no data loss or inconsistent state
 239
 240 * Small case
 241
 242   #. Crash occurs right after writing WAL
 243
 244      - WAL --> | TxBegin A | Log Entry| TxEnd A |
 245      - All data has been written
 246
 247
 248 Comparison
 249 ----------
 250
 251 * Best case (pre-allocation)
 252
 253   - Only need writes on both WAL and Data partition without updating object metadata (for the location).
 254 * Worst case
 255
 256   - At least three writes are required additionally on WAL, object metadata, and data blocks.
 257   - If the flush from WAL to the data partition occurs frequently, radix tree onode structure needs to be update
 258     in many times. To minimize such overhead, we can make use of batch processing to minimize the update on the tree
 259     (the data related to the object has a locality because it will have the same parent node, so updates can be minimized)
 260
 261 * WAL needs to be flushed if the WAL is close to full or a signal to flush.
 262
 263   - The premise behind this design is OSD can manage the latest metadata as a single copy. So,
 264     appended entries are not to be read
 265 * Either best of the worst case does not produce severe I/O amplification (it produce I/Os, but I/O rate is constant)
 266   unlike LSM-tree DB (the proposed design is similar to LSM-tree which has only level-0)
 267
 268
 269 Detailed Design
 270 ===============
 271
 272 * Onode lookup
 273
 274   * Radix tree
 275     Our design is entirely based on the prefix tree. Ceph already makes use of the characteristic of OID's prefix to split or search
 276     the OID (e.g., pool id + hash + oid). So, the prefix tree fits well to store or search the object. Our scheme is designed
 277     to lookup the prefix tree efficiently.
 278
 279   * Sharded partition
 280     A few bits (leftmost bits of the hash) of the OID determine a sharded partition where the object is located.
 281     For example, if the number of partitions is configured as four, The entire space of the hash in hobject_t
 282     can be divided into four domains (0x0xxx ~ 0x3xxx, 0x4xxx ~ 0x7xxx, 0x8xxx ~ 0xBxxx and 0xCxxx ~ 0xFxxx).
 283
 284   * Ondisk onode
 285
 286     .. code-block:: c
 287
 288             struct onode {
 289               extent_tree block_maps;
 290               b+_tree omaps;
 291               map xattrs;
 292             }
 293
 294     onode contains the radix tree nodes for lookup, which means we can search for objects using tree node information in onode.
 295     Also, if the data size is small, the onode can embed the data and xattrs.
 296     The onode is fixed size (256 or 512 byte). On the other hands, omaps and block_maps are variable-length by using pointers in the onode.
 297
 298     .. ditaa::
 299
 300            +----------------+------------+--------+
 301            | on\-disk onode | block_maps | omaps  |
 302            +----------+-----+------------+--------+
 303                       |           ^         ^
 304                       |           |         |
 305                       +-----------+---------+
 306
 307
 308   * Lookup
 309     The location of the root of onode tree is specified on Onode radix tree info, so we can find out where the object
 310     is located by using the root of prefix tree. For example, shared partition is determined by OID as described above.
 311     Using the rest of the OID's bits and radix tree, lookup procedure find outs the location of the onode.
 312     The extent tree (block_maps) contains where data chunks locate, so we finally figure out the data location.
 313
 314
 315 * Allocation
 316
 317   * Sharded partitions
 318
 319     The entire disk space is divided into several  data chunks called sharded partition (SP).
 320     Each SP has its own data structures to manage the partition.
 321
 322   * Data allocation
 323
 324     As we explained above, the management infos (e.g., super block, freelist info, onode radix tree info) are pre-allocated
 325     in each shared partition. Given OID, we can map any data in Data block section to the extent tree in the onode.
 326     Blocks can be allocated by searching the free space tracking data structure (we explain below).
 327
 328     ::
 329
 330             +-----------------------------------+
 331             | onode radix tree root node block  |
 332             |          (Per-SP Meta)            |
 333             |                                   |
 334             |           # of records            |
 335             |    left_sibling / right_sibling   |
 336             | +--------------------------------+|
 337             | | keys[# of records]             ||
 338             | | +-----------------------------+||
 339             | | |    start onode ID           |||
 340             | | |           ...               |||
 341             | | +-----------------------------+||
 342             | +--------------------------------||
 343             | +--------------------------------+|
 344             | | ptrs[# of records]             ||
 345             | | +-----------------------------+||
 346             | | |       SP block number       |||
 347             | | |           ...               |||
 348             | | +-----------------------------+||
 349             | +--------------------------------+|
 350             +-----------------------------------+
 351
 352   * Free space tracking
 353     The freespace is tracked on a per-SP basis. We can use extent-based B+tree in XFS for free space tracking.
 354     The freelist info contains the root of free space B+tree. Granularity is a data block in Data blocks partition.
 355     The data block is the smallest and fixed size unit of data.
 356
 357     ::
 358
 359              +-----------------------------------+
 360              | Free space B+tree root node block |
 361              |          (Per-SP Meta)            |
 362              |                                   |
 363              |           # of records            |
 364              |    left_sibling / right_sibling   |
 365              | +--------------------------------+|
 366              | | keys[# of records]             ||
 367              | | +-----------------------------+||
 368              | | |   startblock / blockcount   |||
 369              | | |           ...               |||
 370              | | +-----------------------------+||
 371              | +--------------------------------||
 372              | +--------------------------------+|
 373              | | ptrs[# of records]             ||
 374              | | +-----------------------------+||
 375              | | |       SP block number       |||
 376              | | |           ...               |||
 377              | | +-----------------------------+||
 378              | +--------------------------------+|
 379              +-----------------------------------+
 380
 381 * Omap and xattr
 382   In this design, omap and xattr data is tracked by b+tree in onode. The onode only has the root node of b+tree.
 383   The root node contains entries which indicate where the key onode exists.
 384   So, if we know the onode, omap can be found via omap b+tree.
 385
 386 * Fragmentation
 387
 388   - Internal fragmentation
 389
 390     We pack different types of data/metadata in a single block as many as possible to reduce internal fragmentation.
 391     Extent-based B+tree may help reduce this further by allocating contiguous blocks that best fit for the object
 392
 393   - External fragmentation
 394
 395     Frequent object create/delete may lead to external fragmentation
 396     In this case, we need cleaning work (GC-like) to address this.
 397     For this, we are referring the NetApp’s Continuous Segment Cleaning, which seems similar to the SeaStore’s approach
 398     Countering Fragmentation in an Enterprise Storage System (NetApp, ACM TOS, 2020)
 399
 400 .. ditaa::
 401
 402
 403        +---------------+-------------------+-------------+
 404        | Freelist info | Onode radix tree  | Data blocks +-------+
 405        +---------------+---------+---------+-+-----------+       |
 406                                  |           |                   |
 407             +--------------------+           |                   |
 408             |                                |                   |
 409             |        OID                     |                   |
 410             |                                |                   |
 411         +---+---+                            |                   |
 412         | Root  |                            |                   |
 413         +---+---+                            |                   |
 414             |                                |                   |
 415             v                                |                   |
 416        /-----------------------------\       |                   |
 417        |          Radix tree         |       |                   v
 418        +---------+---------+---------+       |      /---------------\
 419        | onode   | ...     | ...     |       |      | Num Chunk     |
 420        +---------+---------+---------+       |      |               |
 421     +--+ onode   | ...     | ...     |       |      | <Offset, len> |
 422     |  +---------+---------+---------+       |      | <Offset, len> +-------+
 423     |                                        |      | ...           |       |
 424     |                                        |      +---------------+       |
 425     |                                        |      ^                       |
 426     |                                        |      |                       |
 427     |                                        |      |                       |
 428     |                                        |      |                       |
 429     |  /---------------\  /-------------\    |      |                       v
 430     +->| onode         |  | onode       |<---+      |       /------------+------------\
 431        +---------------+  +-------------+           |       | Block0     | Block1     |
 432        | OID           |  | OID         |           |       +------------+------------+
 433        | Omaps         |  | Omaps       |           |       | Data       | Data       |
 434        | Data Extent   |  | Data Extent +-----------+       +------------+------------+
 435        +---------------+  +-------------+
 436
 437 WAL
 438 ---
 439 Each SP has a WAL.
 440 The data written to the WAL are metadata updates, free space update and small data.
 441 Note that only data smaller than the predefined threshold needs to be written to the WAL.
 442 The larger data is written to the unallocated free space and its onode's extent_tree is updated accordingly
 443 (also on-disk extent tree). We statically allocate WAL partition aside from data partition pre-configured.
 444
 445
 446 Partition and Reactor thread
 447 ----------------------------
 448 In early stage development, PoseidonStore will employ static allocation of partition. The number of sharded partitions
 449 is fixed and the size of each partition also should be configured before running cluster.
 450 But, the number of partitions can grow as below. We leave this as a future work.
 451 Also, each reactor thread has a static set of SPs.
 452
 453 .. ditaa::
 454
 455    +------+------+-------------+------------------+
 456    | SP 1 | SP N | -->     <-- | global partition |
 457    +------+------+-------------+------------------+
 458
 459
 460
 461 Cache
 462 -----
 463 There are mainly two cache data structures; onode cache and block cache.
 464 It looks like below.
 465
 466 #. Onode cache:
 467    lru_map <OID, OnodeRef>;
 468 #. Block cache (data and omap):
 469    Data cache --> lru_map <paddr, value>
 470
 471 To fill the onode data structure, the target onode needs to be retrieved using the prefix tree.
 472 Block cache is used for caching a block contents. For a transaction, all the updates to blocks
 473 (including object meta block, data block) are first performed in the in-memory block cache.
 474 After writing a transaction to the WAL, the dirty blocks are flushed to their respective locations in the
 475 respective partitions.
 476 PoseidonStore can configure cache size for each type. Simple LRU cache eviction strategy can be used for both.
 477
 478
 479 Sharded partitions (with cross-SP transaction)
 480 ----------------------------------------------
 481 The entire disk space is divided into a number of chunks called sharded partitions (SP).
 482 The prefixes of the parent collection ID (original collection ID before collection splitting. That is, hobject.hash)
 483 is used to map any collections to SPs.
 484 We can use BlueStore's approach for collection splitting, changing the number of significant bits for the collection prefixes.
 485 Because the prefixes of the parent collection ID do not change even after collection splitting, the mapping between
 486 the collection and SP are maintained.
 487 The number of SPs may be configured to match the number of CPUs allocated for each disk so that each SP can hold
 488 a number of objects large enough for cross-SP transaction not to occur.
 489
 490 In case of need of cross-SP transaction, we could use the global WAL. The coordinator thread (mainly manages global partition) handles
 491 cross-SP transaction via acquire the source SP and target SP locks before processing the cross-SP transaction.
 492 Source and target probably are blocked.
 493
 494 For the load unbalanced situation,
 495 Poseidonstore can create partitions to make full use of entire space efficiently and provide load balaning.
 496
 497
 498 CoW/Clone
 499 ---------
 500 As for CoW/Clone, a clone has its own onode like other normal objects.
 501
 502 Although each clone has its own onode, data blocks should be shared between the original object and clones
 503 if there are no changes on them to minimize the space overhead.
 504 To do so, the reference count for the data blocks is needed to manage those shared data blocks.
 505
 506 To deal with the data blocks which has the reference count, poseidon store makes use of shared_blob
 507 which maintains the referenced data block.
 508
 509 As shown the figure as below,
 510 the shared_blob tracks the data blocks shared between other onodes by using a reference count.
 511 The shared_blobs are managed by shared_blob_list in the superblock.
 512
 513
 514 .. ditaa::
 515
 516
 517     /----------\         /----------\
 518     | Object A |         | Object B |
 519     +----------+         +----------+
 520     | Extent   |         | Extent   |
 521     +---+--+---+         +--+----+--+
 522         |  |                |    |
 523         |  |     +----------+    |
 524         |  |     |               |
 525         |  +---------------+     |
 526         |        |         |     |
 527         v        v         v     v
 528     +---------------+---------------+
 529     | Data block 1  | Data block 2  |
 530     +-------+-------+------+--------+
 531             |              |
 532             v              v
 533     /---------------+---------------\
 534     | shared_blob 1 | shared_blob 2 |
 535     +---------------+---------------+ shared_blob_list
 536     | refcount      | refcount      |
 537     +---------------+---------------+
 538
 539 Plans
 540 =====
 541
 542 All PRs should contain unit tests to verify its minimal functionality.
 543
 544 * WAL and block cache implementation
 545
 546   As a first step, we are going to build the WAL including the I/O procedure to read/write the WAL.
 547   With WAL development, the block cache needs to be developed together.
 548   Besides, we are going to add an I/O library to read/write from/to the NVMe storage to
 549   utilize NVMe feature and the asynchronous interface.
 550
 551 * Radix tree and onode
 552
 553   First, submit a PR against this file with a more detailed on disk layout and lookup strategy for the onode radix tree.
 554   Follow up with implementation based on the above design once design PR is merged.
 555   The second PR will be the implementation regarding radix tree which is the key structure to look up
 556   objects.
 557
 558 * Extent tree
 559
 560   This PR is the extent tree to manage data blocks in the onode. We build the extent tree, and
 561   demonstrate how it works when looking up the object.
 562
 563 * B+tree for omap
 564
 565   We will put together a simple key/value interface for omap. This probably will be a separate PR.
 566
 567 * CoW/Clone
 568
 569   To support CoW/Clone, shared_blob and shared_blob_list will be added.
 570
 571 * Integration to Crimson as to I/O interfaces
 572
 573   At this stage, interfaces for interacting with Crimson such as queue_transaction(), read(), clone_range(), etc.
 574   should work right.
 575
 576 * Configuration
 577
 578   We will define Poseidon store configuration in detail.
 579
 580 * Stress test environment and integration to teuthology
 581
 582   We will add stress tests and teuthology suites.
 583
 584 .. rubric:: Footnotes
 585
 586 .. [#f1] Stathis Maneas, Kaveh Mahdaviani, Tim Emami, Bianca Schroeder: A Study of SSD Reliability in Large Scale Enterprise Storage Deployments. FAST 2020: 137-149