==========
SeaStore
==========

Goals and Basics
================

* Target NVMe devices. Not primarily concerned with pmem or HDD.
* Make use of SPDK for user-space driven IO.
* Use the Seastar futures programming model to facilitate
  run-to-completion and a sharded memory/processing model.
* Allow zero- (or minimal) data copying on read and write paths when
  combined with a Seastar-based messenger using DPDK.

Motivation and background
-------------------------

All flash devices are internally structured in terms of segments that
can be written efficiently but must be erased in their entirety. The
NVMe device generally has limited knowledge about what data in a
segment is still "live" (hasn't been logically discarded), making the
inevitable garbage collection within the device inefficient. We can
design an on-disk layout that is friendly to GC at lower layers and
drive garbage collection at higher layers.

In principle, a fine-grained discard could communicate our intent to
the device, but in practice discard is poorly implemented in the
device and intervening software layers.

The basic idea is that all data will be streamed out sequentially to
large segments on the device. In the SSD hardware, segments are
likely to be on the order of hundreds of MB to tens of GB.

SeaStore's logical segments would ideally be perfectly aligned with
the hardware segments. In practice, it may be challenging to
determine geometry and to sufficiently hint to the device that LBAs
being written should be aligned to the underlying hardware. In the
worst case, we can structure our logical segments to correspond to,
e.g., 5x the physical segment size so that we have only about 20% of
our data misaligned.

When we reach some utilization threshold, we mix cleaning work in with
the ongoing write workload in order to evacuate live data from
previously written segments. Once they are completely free, we can
discard the entire segment so that it can be erased and reclaimed by
the device.

The key is to mix a small bit of cleaning work with every write
transaction to avoid spikes and variance in write latency.

Data layout basics
------------------

One or more cores/shards will be reading and writing to the device at
once. Each shard will have its own independent data it is operating
on and will stream to its own open segments. Devices that support
streams can be hinted accordingly so that data from different shards
is not mixed on the underlying media.

Persistent Memory
-----------------

As the initial sequential design above matures, we'll introduce
persistent memory support for metadata and caching structures.

Design
======

The design is based heavily on both f2fs and btrfs. Each reactor
manages its own root. Prior to reusing a segment, we rewrite any live
blocks to an open segment.

Because we are only writing sequentially to open segments, we must
"clean" one byte of an existing segment for every byte written at
steady state. Generally, we'll need to reserve some portion of the
usable capacity in order to ensure that write amplification remains
acceptably low (20% for 2x? -- TODO: find prior work). As a design
choice, we want to avoid a background gc scheme, as it tends to
complicate estimating operation cost and tends to introduce
non-deterministic latency behavior. Thus, we want a set of structures
that permits us to relocate blocks from existing segments inline with
ongoing client IO.
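
As a rough illustration of the reservation tradeoff, the following is a
simplified approximation (not SeaStore's actual cost model):

```cpp
// Illustrative estimate only: if segments selected for cleaning are on
// average a fraction `u` live, relocating that live data costs roughly
// 1/(1-u) total bytes written per byte of new client data. This is a
// standard log-structured-storage approximation, simplified.
inline double write_amplification(double u) {
  return 1.0 / (1.0 - u);  // 1 client byte + u/(1-u) relocated bytes
}
```

Under this naive model, 2x write amplification corresponds to cleaning
segments that are about 50% live; the real relationship depends on the
cleaning policy and how well blocks are segregated by lifetime.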

To that end, at a high level, we'll maintain two basic metadata trees.
First, we need a tree mapping ghobject_t->onode_t (onode_by_hobject).
Second, we need a way to find live blocks within a segment and a way
to decouple internal references from physical locations (lba_tree).

Each onode contains xattrs directly, as well as the tops of the omap
and extent trees (optimization: we ought to be able to fit small
enough objects into the onode).
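
A hypothetical sketch of what such an onode might contain (illustrative
only; the actual SeaStore onode encoding differs):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using laddr_t = uint64_t;  // logical address, per the lba_tree below

// Invented layout mirroring the description above, not the real struct.
struct onode_t {
  // xattrs are stored directly in the onode
  std::map<std::string, std::vector<uint8_t>> xattrs;
  laddr_t omap_root = 0;    // top of the omap tree
  laddr_t extent_root = 0;  // top of the extent tree
  // the optimization noted above: small objects could be inlined here
  std::vector<uint8_t> inline_data;
};
```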

Segment Layout
--------------

The backing storage is abstracted into a set of segments. Each
segment can be in one of three states: empty, open, or closed. The
byte contents of a segment are a sequence of records. A record is
prefixed by a header (including length and checksums) and contains a
sequence of deltas and/or blocks. Each delta describes a logical
mutation for some block. Each included block is an aligned extent
addressable by <segment_id_t, segment_offset_t>. A transaction can be
implemented by constructing a record combining deltas and updated
blocks and writing it to an open segment.

Note that segments will generally be large (something like >=256MB),
so there will not typically be very many of them.

::

  record: [ header | delta | delta... | block | block ... ]
  segment: [ record ... ]
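
The record structure above can be sketched with hypothetical types (the
field names here are invented; the actual definitions are in
src/crimson/os/seastore/seastore_types.h):

```cpp
#include <cstdint>
#include <vector>

// Illustrative shapes only, not the real Ceph definitions.
using segment_id_t = uint32_t;
using segment_off_t = uint32_t;

struct paddr_t {             // physical address of a block
  segment_id_t segment;
  segment_off_t offset;
};

struct delta_t {             // logical mutation of an existing extent
  paddr_t target;            // extent the delta applies to
  std::vector<uint8_t> payload;
};

struct block_t {             // full, aligned extent carried in the record
  paddr_t addr;
  std::vector<uint8_t> data;
};

struct record_t {            // header + deltas + blocks, written atomically
  uint32_t length = 0;       // header fields: length and checksum
  uint32_t checksum = 0;
  std::vector<delta_t> deltas;
  std::vector<block_t> blocks;
};
```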

See src/crimson/os/seastore/journal.h for the Journal implementation.
See src/crimson/os/seastore/seastore_types.h for most seastore structures.

Each shard will keep open N segments for writes:

- HDD: N is probably 1 on one shard.
- NVMe/SSD: N is probably 2 per shard, one for the "journal" and one
  for finished data records, as their lifetimes are different.

The exact number to keep open and how to partition writes among them
will be a tuning question -- gc/layout should be flexible. Where
practical, the goal is probably to partition blocks by expected
lifetime so that a segment contains either long-lived or short-lived
blocks.
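
A minimal sketch of lifetime-based placement under the assumed N=2
NVMe/SSD configuration (the names and policy here are invented):

```cpp
// Route each record to one of the open segments by expected lifetime,
// so blocks in a given segment die at roughly the same time: slot 0
// holds the short-lived journal stream, slot 1 long-lived data records.
enum class lifetime_hint { short_lived, long_lived };

inline int choose_open_segment(lifetime_hint h) {
  return h == lifetime_hint::short_lived ? 0 : 1;
}
```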

The backing physical layer is exposed via a segment-based interface.
See src/crimson/os/seastore/segment_manager.h

Journal and Atomicity
---------------------

One open segment is designated to be the journal. A transaction is
represented by an atomically written record. A record will contain
blocks written as part of the transaction as well as deltas, which
are logical mutations to existing physical extents. Transaction deltas
are always written to the journal. If the transaction is associated
with blocks written to other segments, the final record with the
deltas should be written only once the other blocks are persisted.
Crash recovery is done by finding the segment containing the beginning
of the current journal, loading the root node, replaying the deltas,
and loading blocks into the cache as needed.

See src/crimson/os/seastore/journal.h
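
The delta-replay step of recovery can be modeled with a toy function
(all types here are invented; the real replay logic lives in the
Journal implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using block_id_t = uint32_t;

// A trivially simple stand-in for a journal delta: set one byte.
struct delta_t {
  block_id_t target;   // block the mutation applies to
  std::size_t offset;  // byte offset within the block
  uint8_t value;       // new byte value
};

// Walk the journal from its start and overlay each delta, in order,
// onto an in-memory copy of the block it targets.
std::map<block_id_t, std::vector<uint8_t>> replay(
    std::map<block_id_t, std::vector<uint8_t>> blocks,  // loaded as needed
    const std::vector<delta_t>& journal) {
  for (const auto& d : journal)
    blocks[d.target][d.offset] = d.value;
  return blocks;
}
```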

Block Cache
-----------

Every block is in one of two states:

- clean: may or may not be in cache; reads may or may not cause cache
  residence
- dirty: the current version of the block requires overlaying deltas
  from the journal; must be fully present in the cache

Periodically, we need to trim the journal (otherwise, we'd have to
replay journal deltas from the beginning of time). To do this, we need
to create a checkpoint by rewriting the root blocks and all currently
dirty blocks. Note that we can do journal checkpoints relatively
infrequently, and they needn't block the write stream.

Note that deltas need not be byte-range modifications. Consider a btree
node structured with keys to the left and values to the right (a common
trick for improving point query/key scan performance). Inserting a
key/value pair at the minimum of that node would involve moving a bunch
of bytes, which would be expensive (or verbose) to express purely as a
sequence of byte operations. As such, each delta indicates its type as
well as the location of the corresponding extent. Each block type can
therefore implement CachedExtent::apply_delta as appropriate.

See src/crimson/os/seastore/cached_extent.h.
See src/crimson/os/seastore/cache.h.
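
As a sketch of why type-aware deltas help, a hypothetical btree-node
extent might decode its delta as a key insertion rather than a raw byte
overwrite (illustrative only; this is not the actual CachedExtent
interface or delta encoding):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Invented base class loosely modeled on CachedExtent::apply_delta.
struct CachedExtent {
  virtual ~CachedExtent() = default;
  // Apply a logical mutation replayed from the journal.
  virtual void apply_delta(const std::vector<uint8_t>& delta) = 0;
};

struct BtreeNodeExtent : CachedExtent {
  std::vector<std::pair<uint32_t, uint32_t>> kvs;  // sorted key/value pairs

  void apply_delta(const std::vector<uint8_t>& delta) override {
    // Hypothetical encoding: [key (4B LE) | value (4B LE)].
    uint32_t key = 0, value = 0;
    for (int i = 0; i < 4; ++i) key   |= uint32_t(delta[i])     << (8 * i);
    for (int i = 0; i < 4; ++i) value |= uint32_t(delta[4 + i]) << (8 * i);
    // Insert in sorted position: one small delta expresses a mutation
    // that moves many bytes in the serialized node.
    auto it = kvs.begin();
    while (it != kvs.end() && it->first < key) ++it;
    kvs.insert(it, {key, value});
  }
};
```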

GC
--

Prior to reusing a segment, we must relocate all live blocks. Because
we only write sequentially to empty segments, for every byte we write
to currently open segments, we need to clean a byte of an existing
closed segment. As a design choice, we'd like to avoid background
work, as it complicates estimating operation cost and has a tendency
to create non-deterministic latency spikes. Thus, under normal
operation, each seastore reactor will insert enough cleaning work to
keep pace with incoming operations.

In order to make this cheap for sparse segments, we need a way to
positively identify dead blocks. Thus, for every block written, an
entry will be added to the lba tree with a pointer to the previous lba
in the segment. Any transaction that moves a block or modifies the
reference set of an existing one will include the deltas/blocks
required to update the lba tree to update or remove the previous block
allocation. The gc state thus simply needs to maintain an iterator
(of a sort) into the lba tree's segment linked list for the segment
currently being cleaned and a pointer to the next record to be
examined -- records not present in the allocation tree may still
contain roots (like allocation tree blocks), and so the record
metadata must be checked for a flag indicating root blocks.

For each transaction, we evaluate a heuristic function of the
currently available space and currently live space in order to
determine whether we need to do cleaning work (this could simply be a
range of live/used space ratios).
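
Such a heuristic might look like the following sketch (the thresholds
and returned work amounts are invented for illustration):

```cpp
#include <cstdint>

// Hypothetical cleaning heuristic: how many bytes of relocation work
// to fold into a transaction writing `write_bytes`, as a function of
// overall space utilization. The cutoffs below are illustrative.
inline uint64_t cleaning_work(uint64_t live_bytes,
                              uint64_t total_bytes,
                              uint64_t write_bytes) {
  double utilization = double(live_bytes) / double(total_bytes);
  if (utilization < 0.6)
    return 0;                // plenty of empty segments: no cleaning yet
  if (utilization < 0.8)
    return write_bytes;      // steady state: clean a byte per byte written
  return 2 * write_bytes;    // space is tight: clean ahead of the writes
}
```

Folding the returned amount of relocation work into the same
transaction keeps cleaning proportional to the incoming write rate,
which is the no-background-work property argued for above.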

TODO: there is not yet a GC implementation.

Logical Layout
==============

Using the above block and delta semantics, we build two root-level
trees:

- onode tree: maps hobject_t to onode_t
- lba_tree: maps lba_t to lba_range_t

Each of the above structures is comprised of blocks with mutations
encoded in deltas. Each node of the above trees maps onto a block.
Each block is either physically addressed (root blocks and the
lba_tree nodes) or logically addressed (everything else).
Physically addressed blocks are located by a paddr_t: a <segment_id_t,
segment_off_t> tuple, and are marked as physically addressed in the
record. Logical blocks are addressed by laddr_t and require a lookup
in the lba_tree to address.
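
The two addressing modes can be sketched as follows (a std::map stands
in for the real on-disk lba btree, and the types are simplified
inventions):

```cpp
#include <cstdint>
#include <map>
#include <optional>

using laddr_t = uint64_t;

struct paddr_t {      // physical address: <segment_id_t, segment_off_t>
  uint32_t segment;
  uint32_t offset;
};

// Physically addressed blocks carry a paddr_t directly; logically
// addressed blocks need a lookup through the lba tree first.
struct lba_tree_t {
  std::map<laddr_t, paddr_t> mappings;  // laddr -> physical location

  std::optional<paddr_t> resolve(laddr_t laddr) const {
    auto it = mappings.find(laddr);
    if (it == mappings.end())
      return std::nullopt;  // no live mapping for this logical address
    return it->second;
  }
};
```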

Because the cache/transaction machinery lives below the level of the
lba tree, we can represent atomic mutations of the lba tree and other
structures by simply including both in a transaction.

LBAManager/BtreeLBAManager
--------------------------

Implementations of the LBAManager interface are responsible for
managing the logical->physical mapping -- see
crimson/os/seastore/lba_manager.h.

The BtreeLBAManager implements this interface directly on top of
Journal and SegmentManager using a wandering btree approach.

Because SegmentManager does not let us predict the location of a
committed record (a property of both SMR and Zone devices), references
to blocks created within the same transaction will necessarily be
*relative* addresses. The BtreeLBAManager maintains an invariant by
which the in-memory copy of any block will contain only absolute
addresses when !is_pending() -- on_commit and complete_load fill in
absolute addresses based on the actual block addr, and on_delta_write
does so based on the just-committed record. When is_pending(), if
is_initial_pending, references in memory are block_relative (because
they will be written to the original block location) and
record_relative otherwise (the value will be written to a delta).
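
A toy model of that rebasing (the addr_t/make_absolute names are
invented; the real logic lives in the BtreeLBAManager and the paddr
machinery):

```cpp
#include <cstdint>

// While the owning transaction is pending, intra-record references
// are stored as offsets; once the record's final position is known,
// they are rewritten to absolute addresses.
struct addr_t {
  bool relative;   // true while the owning transaction is pending
  uint64_t value;  // offset if relative, absolute address otherwise
};

// Analogue of the on_commit/on_delta_write fixup: rebase a relative
// reference against the just-committed record's base address.
inline addr_t make_absolute(addr_t a, uint64_t record_base) {
  if (a.relative)
    return {false, record_base + a.value};
  return a;  // already absolute; nothing to do
}
```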

TransactionManager
------------------

The TransactionManager is responsible for presenting a unified
interface on top of the Journal, SegmentManager, Cache, and
LBAManager. Users can allocate and mutate extents based on logical
addresses with segment cleaning handled in the background.

See crimson/os/seastore/transaction_manager.h

Next Steps
==========

Journal
-------

- Support for scanning a segment to find physically addressed blocks
- Add support for trimming the journal and releasing segments

Cache
-----

- Support for rewriting dirty blocks

  - Need to add support to CachedExtent for finding/updating
    dependent blocks
  - Need to add support for adding dirty block writeout to
    try_construct_record

LBAManager
----------

- Add support for pinning
- Add a segment -> laddr mapping for use in GC
- Support for locating remaining used blocks in segments

GC
--

- Initial implementation
- Support in BtreeLBAManager for tracking used blocks in segments
- Heuristic for identifying segments to clean

Other
-----

- Add support for periodically generating a journal checkpoint
- Onode tree
- Extent tree
- Remaining ObjectStore integration

ObjectStore considerations
==========================

Splits, merges, and sharding
----------------------------

One of the current ObjectStore requirements is to be able to split a
collection (PG) in O(1) time. Starting in mimic, we also need to be
able to merge two collections into one (i.e., exactly the reverse of a
split).

However, the PGs that we split into would hash to different shards of
the OSD in the current sharding scheme. One can imagine replacing
that sharding scheme with a temporary mapping directing the smaller
child PG to the right shard, since we generally then migrate that PG
to another OSD anyway, but this wouldn't help us in the merge case,
where the constituent pieces may start out on different shards and
ultimately need to be handled in the same collection (and be operated
on via single transactions).

This suggests that we likely need a way for data written via one shard
to "switch ownership" and later be read and managed by a different
shard.