==========
 SeaStore
==========

Goals and Basics
================

* Target NVMe devices. Not primarily concerned with pmem or HDD.
* Make use of SPDK for user-space driven IO.
* Use Seastar futures programming model to facilitate
  run-to-completion and a sharded memory/processing model.
* Allow zero- (or minimal) data copying on read and write paths when
  combined with a seastar-based messenger using DPDK.

Motivation and background
-------------------------

All flash devices are internally structured in terms of segments that
can be written efficiently but must be erased in their entirety. The
NVMe device generally has limited knowledge about what data in a
segment is still "live" (hasn't been logically discarded), making the
inevitable garbage collection within the device inefficient. We can
design an on-disk layout that is friendly to GC at lower layers and
drive garbage collection at higher layers.

In principle a fine-grained discard could communicate our intent to
the device, but in practice discard is poorly implemented in the
device and intervening software layers.

The basic idea is that all data will be streamed out sequentially to
large segments on the device. In SSD hardware, segments are likely to
be on the order of hundreds of MB to tens of GB.

SeaStore's logical segments would ideally be perfectly aligned with
the hardware segments. In practice, it may be challenging to
determine geometry and to sufficiently hint to the device that LBAs
being written should be aligned to the underlying hardware. In the
worst case, we can structure our logical segments to correspond to
e.g. 5x the physical segment size so that only about 20% of our data
is misaligned.

When we reach some utilization threshold, we mix cleaning work in with
the ongoing write workload in order to evacuate live data from
previously written segments. Once a segment is completely free, we can
discard it so that it can be erased and reclaimed by the device.

The key is to mix a small bit of cleaning work with every write
transaction to avoid spikes and variance in write latency.

Data layout basics
------------------

One or more cores/shards will be reading and writing to the device at
once. Each shard will have its own independent data that it is
operating on and will stream to its own open segments. Devices that
support streams can be hinted accordingly so that data from different
shards is not mixed on the underlying media.

Persistent Memory
-----------------

As the initial sequential design above matures, we'll introduce
persistent memory support for metadata and caching structures.

Design
======

The design is based heavily on both f2fs and btrfs. Each reactor
manages its own root. Prior to reusing a segment, we rewrite any live
blocks to an open segment.

Because we are only writing sequentially to open segments, we must
“clean” one byte of an existing segment for every byte written at
steady state. Generally, we’ll need to reserve some portion of the
usable capacity in order to ensure that write amplification remains
acceptably low (20% for 2x? -- TODO: find prior work). As a design
choice, we want to avoid a background gc scheme as it tends to
complicate estimating operation cost and tends to introduce
non-deterministic latency behavior. Thus, we want a set of structures
that permits us to relocate blocks from existing segments inline with
ongoing client IO.
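
As a rough way to see the space/write-amplification tradeoff: if the
segments we clean still hold a live fraction ``u``, then each cleaned
segment yields ``1 - u`` of its capacity as writable space at the cost
of rewriting the ``u`` fraction that is live, so overall write
amplification works out to roughly ``1 / (1 - u)`` (2x at ``u = 0.5``).
How ``u`` relates to the amount of capacity we reserve depends on the
workload and on how segments are chosen for cleaning, which is what
the TODO above refers to.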

To that end, at a high level, we'll maintain 2 basic metadata trees.
First, we need a tree mapping ghobject_t->onode_t (onode_by_hobject).
Second, we need a way to find live blocks within a segment and a way
to decouple internal references from physical locations (lba_tree).

Each onode contains xattrs directly as well as the top of the omap and
extent trees (optimization: we ought to be able to fit small enough
objects into the onode).

Segment Layout
--------------

The backing storage is abstracted into a set of segments. Each
segment can be in one of 3 states: empty, open, closed. The byte
contents of a segment are a sequence of records. A record is prefixed
by a header (including length and checksums) and contains a sequence
of deltas and/or blocks. Each delta describes a logical mutation for
some block. Each included block is an aligned extent addressable by
<segment_id_t, segment_offset_t>. A transaction can be implemented by
constructing a record combining deltas and updated blocks and writing
it to an open segment.

Note that segments will generally be large (something like >=256MB),
so there will not typically be very many of them.

::

  record:  [ header | delta | delta... | block | block ... ]
  segment: [ record ... ]

See src/crimson/os/seastore/journal.h for the Journal implementation.
See src/crimson/os/seastore/seastore_types.h for most seastore structures.
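
A simplified sketch of the structures involved (illustrative only; the
real definitions live in src/crimson/os/seastore/seastore_types.h and
differ in detail)::

  #include <cstdint>
  #include <vector>

  struct paddr_t {                 // physical address of an extent
    uint32_t segment_id;           // which segment
    uint32_t segment_offset;       // byte offset within that segment
  };

  struct delta_t {                 // logical mutation to an existing extent
    paddr_t paddr;                 // extent the delta applies to
    uint16_t extent_type;          // selects the apply_delta() implementation
    std::vector<char> payload;     // encoded mutation
  };

  struct record_t {                // unit written atomically to a segment
    // on disk, a header with lengths and checksums precedes the payload
    std::vector<delta_t> deltas;             // mutations to existing extents
    std::vector<std::vector<char>> extents;  // full blocks in this record
  };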

Each shard will keep N segments open for writes:

- HDD: N is probably 1 on one shard
- NVMe/SSD: N is probably 2/shard, one for the "journal" and one for
  finished data records, as their lifetimes are different.

I think the exact number to keep open and how to partition writes
among them will be a tuning question -- gc/layout should be flexible.
Where practical, the goal is probably to partition blocks by expected
lifetime so that a segment holds either long-lived or short-lived
blocks.

The backing physical layer is exposed via a segment-based interface.
See src/crimson/os/seastore/segment_manager.h

Journal and Atomicity
---------------------

One open segment is designated to be the journal. A transaction is
represented by an atomically written record. A record will contain
blocks written as part of the transaction as well as deltas, which
are logical mutations to existing physical extents. Transaction deltas
are always written to the journal. If the transaction is associated
with blocks written to other segments, the final record with the
deltas should be written only once the other blocks have been
persisted. Crash recovery is done by finding the segment containing
the beginning of the current journal, loading the root node, replaying
the deltas, and loading blocks into the cache as needed.

See src/crimson/os/seastore/journal.h
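
A minimal sketch of that commit ordering (the helper names here are
hypothetical, not the actual Journal interface)::

  #include <vector>

  struct block_write_t {};  // a full block destined for a non-journal segment
  struct delta_t {};        // a logical mutation, as sketched earlier

  // Hypothetical stand-ins for SegmentManager/Journal writes.
  void persist_data_blocks(const std::vector<block_write_t> &) {}
  void append_journal_record(const std::vector<delta_t> &) {}

  // Blocks written outside the journal must be durable before the journal
  // record carrying the deltas that reference them; otherwise replay could
  // apply deltas against blocks that never reached the media.
  void commit_transaction(const std::vector<block_write_t> &blocks,
                          const std::vector<delta_t> &deltas) {
    persist_data_blocks(blocks);    // step 1: data blocks hit their segments
    append_journal_record(deltas);  // step 2: the record commits the txn
  }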

Block Cache
-----------

Every block is in one of two states:

- clean: may or may not be in the cache; reads may or may not cause it
  to become cache resident
- dirty: the current version requires overlaying deltas from the
  journal and must be fully present in the cache

Periodically, we need to trim the journal (else, we’d have to replay
journal deltas from the beginning of time). To do this, we need to
create a checkpoint by rewriting the root blocks and all currently
dirty blocks. Note that we can do journal checkpoints relatively
infrequently, and they needn’t block the write stream.

Note that deltas need not be byte-range modifications. Consider a
btree node structured with keys to the left and values to the right (a
common trick for improving point query/key scan performance).
Inserting a key/value into that node at the min would involve moving a
bunch of bytes, which would be expensive (or verbose) to express
purely as a sequence of byte operations. As such, each delta indicates
the type as well as the location of the corresponding extent. Each
block type can therefore implement CachedExtent::apply_delta as
appropriate.

See src/crimson/os/seastore/cached_extent.h.
See src/crimson/os/seastore/cache.h.
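
As an illustration of such a typed delta (a simplified sketch, not the
actual CachedExtent interface)::

  #include <map>
  #include <string>

  // A delta that says "insert this key/value into the node" rather than
  // spelling out every byte that moves inside the block.
  struct insert_delta_t {
    std::string key;
    std::string value;
  };

  // Simplified stand-in for a cached btree-node extent.
  struct btree_node_extent_t {
    std::map<std::string, std::string> kvs;  // decoded node contents

    // Analogous in spirit to CachedExtent::apply_delta: replay the logical
    // mutation against the cached copy of the block.
    void apply_delta(const insert_delta_t &delta) {
      kvs[delta.key] = delta.value;
    }
  };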

GC
---

Prior to reusing a segment, we must relocate all live blocks. Because
we only write sequentially to empty segments, for every byte we write
to currently open segments, we need to clean a byte of an existing
closed segment. As a design choice, we’d like to avoid background
work as it complicates estimating operation cost and has a tendency to
create non-deterministic latency spikes. Thus, under normal operation
each seastore reactor will insert enough cleaning work to keep pace
with incoming operations.

In order to make this cheap for sparse segments, we need a way to
positively identify dead blocks. Thus, for every block written, an
entry will be added to the lba tree with a pointer to the previous lba
in the segment. Any transaction that moves a block or modifies the
reference set of an existing one will include the deltas/blocks
required to update the lba tree to update or remove the previous block
allocation. The gc state thus simply needs to maintain an iterator (of
a sort) into the lba tree's per-segment linked list for the segment
currently being cleaned, along with a pointer to the next record to be
examined. Records not present in the allocation tree may still contain
roots (like allocation tree blocks), so the record metadata must be
checked for a flag indicating root blocks.

For each transaction, we evaluate a heuristic function of the
currently available space and currently live space in order to
determine whether we need to do cleaning work (it could be as simple
as a range of live/used space ratios).
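
A sketch of the shape such a heuristic might take (illustrative only;
the thresholds and names are made up, and per the TODO below there is
no GC implementation yet)::

  #include <cstddef>

  // How many bytes of relocation work to fold into a transaction that
  // writes `bytes_written` bytes of new data.
  std::size_t cleaning_work_for(std::size_t bytes_written,
                                double used_ratio) {  // used / usable space
    constexpr double low_water  = 0.5;   // hypothetical thresholds
    constexpr double high_water = 0.9;
    if (used_ratio <= low_water)
      return 0;                          // plenty of clean segments left
    if (used_ratio >= high_water)
      return 2 * bytes_written;          // tight on space: clean aggressively
    // Ramp linearly between the water marks; at steady state this settles
    // around one byte cleaned per byte written.
    const double ramp = (used_ratio - low_water) / (high_water - low_water);
    return static_cast<std::size_t>(2 * bytes_written * ramp);
  }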

TODO: there is not yet a GC implementation

Logical Layout
==============

Using the above block and delta semantics, we build two root level trees:

- onode tree: maps hobject_t to onode_t
- lba_tree: maps lba_t to lba_range_t

Each of the above structures is composed of blocks with mutations
encoded in deltas. Each node of the above trees maps onto a block.
Each block is either physically addressed (root blocks and the
lba_tree nodes) or logically addressed (everything else).
Physically addressed blocks are located by a paddr_t: a <segment_id_t,
segment_off_t> tuple; they are marked as physically addressed in the
record. Logical blocks are addressed by laddr_t and require a lookup
in the lba_tree to locate.
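
A sketch of the mapping the lba_tree maintains (the value fields here
are assumptions for illustration; the actual lba_range_t definition may
differ)::

  #include <cstdint>
  #include <map>

  struct paddr_t {            // physical address, as sketched earlier
    uint32_t segment_id;
    uint32_t segment_offset;
  };

  using laddr_t = uint64_t;   // logical address

  struct lba_range_t {        // illustrative value type
    paddr_t  paddr;           // where the extent currently lives
    uint32_t length;          // extent length in bytes
    uint32_t refcount;        // logical references to this range
  };

  // Conceptually an ordered map from logical address to physical range;
  // in SeaStore the mapping is itself a btree made of blocks and deltas.
  using lba_tree_t = std::map<laddr_t, lba_range_t>;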

Because the cache/transaction machinery lives below the level of the
lba tree, we can represent atomic mutations of the lba tree and other
structures by simply including both in a transaction.

LBAManager/BtreeLBAManager
--------------------------

Implementations of the LBAManager interface are responsible for managing
the logical->physical mapping -- see crimson/os/seastore/lba_manager.h.

The BtreeLBAManager implements this interface directly on top of
Journal and SegmentManager using a wandering btree approach.

Because SegmentManager does not let us predict the location of a
committed record (a property of both SMR and Zone devices), references
to blocks created within the same transaction will necessarily be
*relative* addresses. The BtreeLBAManager maintains the invariant that
the in-memory copy of any block contains only absolute addresses when
!is_pending() -- on_commit and complete_load fill in absolute addresses
based on the actual block address, and on_delta_write does so based on
the just-committed record. When is_pending(), in-memory references are
block_relative if is_initial_pending() (because they will be written at
the original block location) and record_relative otherwise (the value
will be written to a delta).
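
A sketch of the address fixup described above (names and logic are
simplified for illustration)::

  #include <cstdint>

  // Illustrative address kinds: absolute once the final location is known,
  // otherwise relative to the block being written or to the record.
  enum class addr_kind_t { absolute, block_relative, record_relative };

  struct sketch_paddr_t {
    addr_kind_t kind;
    uint64_t    offset;   // meaning depends on kind
  };

  // Once a record commits, the final block and record base addresses are
  // known, so relative references can be rewritten as absolute ones -- the
  // role the text above describes for on_commit/on_delta_write/complete_load.
  sketch_paddr_t make_absolute(sketch_paddr_t addr,
                               uint64_t block_base,    // final block location
                               uint64_t record_base) { // final record location
    switch (addr.kind) {
    case addr_kind_t::block_relative:
      return {addr_kind_t::absolute, block_base + addr.offset};
    case addr_kind_t::record_relative:
      return {addr_kind_t::absolute, record_base + addr.offset};
    default:
      return addr;  // already absolute
    }
  }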

TransactionManager
------------------

The TransactionManager is responsible for presenting a unified
interface on top of the Journal, SegmentManager, Cache, and
LBAManager. Users can allocate and mutate extents based on logical
addresses, with segment cleaning handled in the background.

See crimson/os/seastore/transaction_manager.h
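
The intended usage flow is roughly the following (a hypothetical,
simplified interface for illustration -- the real one in
crimson/os/seastore/transaction_manager.h is future-based and differs
in detail)::

  #include <cstddef>
  #include <memory>
  #include <utility>
  #include <vector>

  // Hypothetical stand-ins, illustration only.
  struct Transaction {
    std::vector<std::vector<char>> fresh_extents;  // blocks to be written
  };

  struct TransactionManager {
    std::unique_ptr<Transaction> create_transaction() {
      return std::make_unique<Transaction>();
    }
    // Allocate a new logical extent within the transaction.
    std::vector<char> &alloc_extent(Transaction &t, std::size_t len) {
      t.fresh_extents.emplace_back(len);
      return t.fresh_extents.back();
    }
    // Build a record from the transaction's blocks/deltas and journal it.
    void submit_transaction(std::unique_ptr<Transaction>) { /* ... */ }
  };

  void example(TransactionManager &tm) {
    auto t = tm.create_transaction();
    auto &extent = tm.alloc_extent(*t, 4096);
    extent[0] = 'x';                      // mutate the extent in memory
    tm.submit_transaction(std::move(t));  // journaled atomically
  }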

Next Steps
==========

Journal
-------

- Support for scanning a segment to find physically addressed blocks
- Add support for trimming the journal and releasing segments.

Cache
-----

- Support for rewriting dirty blocks

  - Need to add support to CachedExtent for finding/updating
    dependent blocks
  - Need to add support for adding dirty block writeout to
    try_construct_record

LBAManager
----------

- Add support for pinning
- Add segment -> laddr for use in GC
- Support for locating remaining used blocks in segments

GC
---

- Initial implementation
- Support in BtreeLBAManager for tracking used blocks in segments
- Heuristic for identifying segments to clean

Other
------

- Add support for periodically generating a journal checkpoint.
- Onode tree
- Extent tree
- Remaining ObjectStore integration

ObjectStore considerations
==========================

Splits, merges, and sharding
----------------------------

One of the current ObjectStore requirements is to be able to split a
collection (PG) in O(1) time. Starting in mimic, we also need to be
able to merge two collections into one (i.e., exactly the reverse of a
split).

However, the PGs that we split into would hash to different shards of
the OSD in the current sharding scheme. One can imagine replacing that
sharding scheme with a temporary mapping directing the smaller child
PG to the right shard, since we generally migrate that PG to another
OSD anyway; but this wouldn't help us in the merge case, where the
constituent pieces may start out on different shards and ultimately
need to be handled in the same collection (and be operated on via
single transactions).

This suggests that we likely need a way for data written via one shard
to "switch ownership" and later be read and managed by a different
shard.