==========
SeaStore
==========

Goals and Basics
================

* Target NVMe devices. Not primarily concerned with pmem or HDD.
* Make use of SPDK for user-space driven IO.
* Use the Seastar futures programming model to facilitate
  run-to-completion and a sharded memory/processing model.
* Allow zero- (or minimal) data copying on read and write paths when
  combined with a seastar-based messenger using DPDK.

Motivation and background
-------------------------

All flash devices are internally structured in terms of segments that
can be written efficiently but must be erased in their entirety. The
NVMe device generally has limited knowledge about what data in a
segment is still "live" (hasn't been logically discarded), making the
inevitable garbage collection within the device inefficient. We can
design an on-disk layout that is friendly to GC at lower layers and
drive garbage collection at higher layers.

In principle a fine-grained discard could communicate our intent to
the device, but in practice discard is poorly implemented in the
device and intervening software layers.

The basic idea is that all data will be streamed out sequentially to
large segments on the device. In SSD hardware, segments are likely to
be on the order of hundreds of MB to tens of GB.

SeaStore's logical segments would ideally be perfectly aligned with
the hardware segments. In practice, it may be challenging to
determine geometry and to sufficiently hint to the device that LBAs
being written should be aligned to the underlying hardware. In the
worst case, we can structure our logical segments to correspond to
e.g. 5x the physical segment size so that we have about 20% of our
data misaligned.

When we reach some utilization threshold, we mix cleaning work in with
the ongoing write workload in order to evacuate live data from
previously written segments. Once they are completely free we can
discard the entire segment so that it can be erased and reclaimed by
the device.

The key is to mix a small bit of cleaning work with every write
transaction to avoid spikes and variance in write latency.

Data layout basics
------------------

One or more cores/shards will be reading and writing to the device at
once. Each shard will have its own independent data to operate on,
and will stream to its own open segments. Devices that support
streams can be hinted accordingly so that data from different shards
is not mixed on the underlying media.

Persistent Memory
-----------------

As the initial sequential design above matures, we'll introduce
persistent memory support for metadata and caching structures.

Design
======

The design is based heavily on both f2fs and btrfs. Each reactor
manages its own root. Prior to reusing a segment, we rewrite any live
blocks to an open segment.

Because we are only writing sequentially to open segments, we must
“clean” one byte of an existing segment for every byte written at
steady state. Generally, we’ll need to reserve some portion of the
usable capacity in order to ensure that write amplification remains
acceptably low (20% for 2x? -- TODO: find prior work). As a design
choice, we want to avoid a background gc scheme as it tends to
complicate estimating operation cost and tends to introduce
non-deterministic latency behavior. Thus, we want a set of structures
that permits us to relocate blocks from existing segments inline with
ongoing client IO.
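
For intuition, a standard log-structured-storage estimate (offered
here for illustration, not derived in this document): if segments are
cleaned while a fraction :math:`u` of their bytes is still live,
freeing :math:`1-u` bytes of space costs relocating :math:`u` bytes,
so the cleaning write amplification is

.. math::

   WA = \frac{1}{1-u}

Cleaning segments at 50% live utilization, for example, gives 2x write
amplification. Since the cleaner can prefer the emptiest segments, the
live fraction of cleaned segments is typically below the device-wide
average, which is why a modest reserved fraction can suffice.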

To that end, at a high level, we’ll maintain 2 basic metadata trees.
First, we need a tree mapping ghobject_t->onode_t (onode_by_hobject).
Second, we need a way to find live blocks within a segment and a way
to decouple internal references from physical locations (lba_tree).

Each onode contains xattrs directly as well as the top of the omap and
extent trees (optimization: we ought to be able to fit small enough
objects into the onode).
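
For illustration only, the onode contents described above might look
roughly like the following sketch (field names and types are
assumptions, not the actual on-disk format):

.. code-block:: cpp

   #include <cstdint>
   #include <map>
   #include <string>

   using laddr_t = uint64_t;  // logical address resolved via the lba_tree

   struct onode_sketch_t {
     // xattrs stored inline in the onode
     std::map<std::string, std::string> xattrs;

     // roots of the per-object trees; per the optimization noted above,
     // small enough objects could instead be inlined here
     laddr_t omap_root;    // top of the omap tree
     laddr_t extent_root;  // top of the data extent tree
     uint64_t size;        // object size in bytes
   };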

Segment Layout
--------------

The backing storage is abstracted into a set of segments. Each
segment can be in one of 3 states: empty, open, closed. The byte
contents of a segment are a sequence of records. A record is prefixed
by a header (including length and checksums) and contains a sequence
of deltas and/or blocks. Each delta describes a logical mutation for
some block. Each included block is an aligned extent addressable by
<segment_id_t, segment_offset_t>. A transaction can be implemented by
constructing a record combining deltas and updated blocks and writing
it to an open segment.

Note that segments will generally be large (something like >=256MB),
so there will not typically be very many of them.

::

  record: [ header | delta | delta... | block | block ... ]
  segment: [ record ... ]

See src/crimson/os/seastore/journal.h for the Journal implementation.
See src/crimson/os/seastore/seastore_types.h for most seastore structures.
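
A minimal sketch of the record format described above (names and type
widths are assumptions; see seastore_types.h for the real
definitions):

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   using segment_id_t = uint32_t;    // assumed widths
   using segment_off_t = uint32_t;

   // physical address: <segment, offset within segment>
   struct paddr_sketch_t {
     segment_id_t segment;
     segment_off_t offset;
   };

   // a logical mutation to apply to an existing extent
   struct delta_sketch_t {
     paddr_sketch_t extent_addr;     // extent the delta applies to
     uint32_t block_type;            // selects the replay logic
     std::vector<uint8_t> payload;   // type-specific encoding
   };

   // one atomically written unit within a segment; the header (not
   // shown) carries lengths and checksums for crash-safe replay
   struct record_sketch_t {
     std::vector<delta_sketch_t> deltas;          // logical mutations
     std::vector<std::vector<uint8_t>> extents;   // full aligned blocks
   };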

Each shard will keep N segments open for writes:

- HDD: N is probably 1 on one shard
- NVMe/SSD: N is probably 2 per shard, one for "journal" records and
  one for finished data records, as their lifetimes are different

I think the exact number to keep open and how to partition writes
among them will be a tuning question -- gc/layout should be flexible.
Where practical, the goal is probably to partition blocks by expected
lifetime so that a segment has either long-lived or short-lived
blocks.

The backing physical layer is exposed via a segment-based interface.
See src/crimson/os/seastore/segment_manager.h

Journal and Atomicity
---------------------

One open segment is designated to be the journal. A transaction is
represented by an atomically written record. A record will contain
blocks written as part of the transaction as well as deltas which
are logical mutations to existing physical extents. Transaction deltas
are always written to the journal. If the transaction is associated
with blocks written to other segments, the final record with the deltas
should be written only once the other blocks are persisted. Crash
recovery is done by finding the segment containing the beginning of
the current journal, loading the root node, replaying the deltas, and
loading blocks into the cache as needed.

See src/crimson/os/seastore/journal.h
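
The ordering constraint above might look schematically like the
following (transaction_t and the helpers are hypothetical stand-ins,
not the actual SeaStore API):

.. code-block:: cpp

   #include <seastar/core/future.hh>

   struct transaction_t;   // stands in for the real transaction type
   struct record_t {};     // deltas plus any journaled blocks

   // assumed helpers, for illustration only:
   seastar::future<> write_blocks_to_open_segments(transaction_t &);
   record_t make_record(transaction_t &);
   seastar::future<> journal_submit_record(record_t);

   // The out-of-line blocks must be durable before the record carrying
   // the deltas is written; otherwise replay could find a record that
   // references unwritten blocks.
   seastar::future<> commit(transaction_t &t) {
     return write_blocks_to_open_segments(t).then([&t] {
       return journal_submit_record(make_record(t));
     });
   }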

Block Cache
-----------

Every block is in one of two states:

- clean: may or may not be in the cache; reads may or may not cause
  cache residence
- dirty: the current version of the block requires overlaying deltas
  from the journal, and must be fully present in the cache

Periodically, we need to trim the journal (else, we’d have to replay
journal deltas from the beginning of time). To do this, we need to
create a checkpoint by rewriting the root blocks and all currently
dirty blocks. Note, we can do journal checkpoints relatively
infrequently, and they needn’t block the write stream.

Note, deltas may not be byte range modifications. Consider a btree
node structured with keys to the left and values to the right (a
common trick for improving point query/key scan performance).
Inserting a key/value into that node at the min would involve moving a
bunch of bytes, which would be expensive (or verbose) to express
purely as a sequence of byte operations. As such, each delta indicates
the type as well as the location of the corresponding extent. Each
block type can therefore implement CachedExtent::apply_delta as
appropriate.

See src/crimson/os/seastore/cached_extent.h.
See src/crimson/os/seastore/cache.h.
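
For illustration, the dispatch this implies might be shaped roughly as
follows (a sketch, not the actual interface in cached_extent.h):

.. code-block:: cpp

   #include <cstdint>
   #include <vector>

   using delta_payload_t = std::vector<uint8_t>;

   // each block type knows how to replay its own deltas
   struct CachedExtentSketch {
     virtual ~CachedExtentSketch() = default;

     // apply a type-specific mutation to the in-memory copy
     virtual void apply_delta(const delta_payload_t &payload) = 0;
   };

   struct BtreeNodeExtentSketch final : CachedExtentSketch {
     void apply_delta(const delta_payload_t &payload) override {
       // decode a logical operation such as "insert key K at slot P"
       // and replay it against the node's key and value regions,
       // rather than applying a raw byte-range overwrite
     }
   };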

GC
---

Prior to reusing a segment, we must relocate all live blocks. Because
we only write sequentially to empty segments, for every byte we write
to currently open segments, we need to clean a byte of an existing
closed segment. As a design choice, we’d like to avoid background
work as it complicates estimating operation cost and has a tendency to
create non-deterministic latency spikes. Thus, under normal operation
each seastore reactor will be inserting enough work to clean a segment
at the same rate as incoming operations.

In order to make this cheap for sparse segments, we need a way to
positively identify dead blocks. Thus, for every block written, an
entry will be added to the lba tree with a pointer to the previous lba
in the segment. Any transaction that moves a block or modifies the
reference set of an existing one will include deltas/blocks required
to update the lba tree to update or remove the previous block
allocation. The gc state thus simply needs to maintain an iterator
(of a sort) into the lba tree segment linked list for the segment
currently being cleaned and a pointer to the next record to be
examined -- records not present in the allocation tree may still
contain roots (like allocation tree blocks) and so the record metadata
must be checked for a flag indicating root blocks.

For each transaction, we evaluate a heuristic function of the
currently available space and currently live space in order to
determine whether we need to do cleaning work (could be simply a range
of live/used space ratios).
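
One possible shape for such a heuristic (purely illustrative; the
thresholds are made-up tuning knobs, and as noted below no GC
implementation exists yet):

.. code-block:: cpp

   #include <cstddef>

   // decide how many bytes of cleaning work to attach to a transaction
   // based on how full the device currently is
   std::size_t cleaning_bytes_for(std::size_t bytes_to_write,
                                  double used_ratio /* used/total */) {
     if (used_ratio < 0.5) {
       return 0;                    // plenty of free segments
     } else if (used_ratio < 0.8) {
       return bytes_to_write;       // steady state: a byte per byte
     } else {
       return 2 * bytes_to_write;   // low on space: clean ahead
     }
   }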

TODO: there is not yet a GC implementation

Logical Layout
==============

Using the above block and delta semantics, we build two root level trees:

- onode tree: maps hobject_t to onode_t
- lba_tree: maps lba_t to lba_range_t

Each of the above structures is comprised of blocks with mutations
encoded in deltas. Each node of the above trees maps onto a block.
Each block is either physically addressed (root blocks and the
lba_tree nodes) or is logically addressed (everything else).
Physically addressed blocks are located by a paddr_t: <segment_id_t,
segment_off_t> tuple and are marked as physically addressed in the
record. Logical blocks are addressed by laddr_t and require a lookup in
the lba_tree to address.
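
Schematically, the two addressing modes differ in how a reader locates
the bytes (a sketch with assumed types; note that lba_tree nodes must
be physically addressed, since resolving them through the lba_tree
would be circular):

.. code-block:: cpp

   #include <cstdint>
   #include <variant>

   struct paddr_sketch_t {      // physical: <segment, offset>
     uint32_t segment;
     uint32_t offset;
   };
   using laddr_t = uint64_t;    // logical address

   // a block reference is either physical or logical
   using block_addr_t = std::variant<paddr_sketch_t, laddr_t>;

   // assumed helper: resolve a logical address via the lba_tree
   paddr_sketch_t lba_tree_lookup(laddr_t);

   paddr_sketch_t resolve(const block_addr_t &addr) {
     if (auto *p = std::get_if<paddr_sketch_t>(&addr)) {
       return *p;               // already physical, no lookup needed
     }
     return lba_tree_lookup(std::get<laddr_t>(addr));
   }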

Because the cache/transaction machinery lives below the level of the
lba tree, we can represent atomic mutations of the lba tree and other
structures by simply including both in a transaction.

LBAManager/BtreeLBAManager
--------------------------

Implementations of the LBAManager interface are responsible for managing
the logical->physical mapping -- see crimson/os/seastore/lba_manager.h.

The BtreeLBAManager implements this interface directly on top of
Journal and SegmentManager using a wandering btree approach.

Because SegmentManager does not let us predict the location of a
committed record (a property of both SMR and Zone devices), references
to blocks created within the same transaction will necessarily be
*relative* addresses. The BtreeLBAManager maintains an invariant by
which the in-memory copy of any block will contain only absolute
addresses when !is_pending() -- on_commit and complete_load fill in
absolute addresses based on the actual block addr, and on_delta_write
does so based on the just-committed record. When is_pending(),
references in memory are block_relative if is_initial_pending (because
they will be written at the original block location) and
record_relative otherwise (the value will be written to a delta).
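
Schematically, the invariant might be expressed like this (a sketch;
the enum and fields are assumptions, not the real paddr_t encoding):

.. code-block:: cpp

   #include <cstdint>

   enum class addr_type_t {
     absolute,        // <segment, offset>; the only form once !is_pending()
     block_relative,  // offset from the block's final location
                      // (is_initial_pending case)
     record_relative  // offset from the record being committed
                      // (delta case)
   };

   struct relative_paddr_sketch_t {
     addr_type_t type;
     uint64_t offset;

     // invoked from on_commit/complete_load/on_delta_write once the
     // final block or record position is known
     relative_paddr_sketch_t make_absolute(uint64_t base) const {
       return {addr_type_t::absolute, base + offset};
     }
   };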

TransactionManager
------------------

The TransactionManager is responsible for presenting a unified
interface on top of the Journal, SegmentManager, Cache, and
LBAManager. Users can allocate and mutate extents based on logical
addresses with segment cleaning handled in the background.

See crimson/os/seastore/transaction_manager.h
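
Usage might look schematically like the following (all names here are
hypothetical stand-ins, only loosely modeled on
transaction_manager.h):

.. code-block:: cpp

   #include <cstdint>
   #include <seastar/core/future.hh>

   struct TransactionManagerSketch;  // stands in for TransactionManager
   struct Transaction;

   // assumed operations, for illustration only:
   Transaction *create_transaction(TransactionManagerSketch &);
   seastar::future<> mutate_extent_at(Transaction &, uint64_t laddr);
   seastar::future<> submit(TransactionManagerSketch &, Transaction &);

   // callers deal purely in logical addresses; journaling, caching,
   // lba updates, and any attached cleaning work happen inside submit()
   seastar::future<> update_block(TransactionManagerSketch &tm,
                                  uint64_t laddr) {
     Transaction *t = create_transaction(tm);
     return mutate_extent_at(*t, laddr).then([&tm, t] {
       return submit(tm, *t);
     });
   }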

Next Steps
==========

Journal
-------

- Support for scanning a segment to find physically addressed blocks
- Add support for trimming the journal and releasing segments

Cache
-----

- Support for rewriting dirty blocks

  - Need to add support to CachedExtent for finding/updating
    dependent blocks
  - Need to add support for adding dirty block writeout to
    try_construct_record

LBAManager
----------

- Add support for pinning
- Add a segment -> laddr mapping for use in GC
- Support for locating remaining used blocks in segments

GC
---

- Initial implementation
- Support in BtreeLBAManager for tracking used blocks in segments
- Heuristic for identifying segments to clean

Other
------

- Add support for periodically generating a journal checkpoint
- Onode tree
- Extent tree
- Remaining ObjectStore integration

ObjectStore considerations
==========================

Splits, merges, and sharding
----------------------------

One of the current ObjectStore requirements is to be able to split a
collection (PG) in O(1) time. Starting in mimic, we also need to be
able to merge two collections into one (i.e., exactly the reverse of a
split).

However, the PGs that we split into would hash to different shards of
the OSD in the current sharding scheme. One can imagine replacing
that sharding scheme with a temporary mapping directing the smaller
child PG to the right shard since we generally then migrate that PG to
another OSD anyway, but this wouldn't help us in the merge case where
the constituent pieces may start out on different shards and
ultimately need to be handled in the same collection (and be operated
on via single transactions).

This suggests that we likely need a way for data written via one shard
to "switch ownership" and later be read and managed by a different
shard.