==================================
BlueStore Configuration Reference
==================================

Devices
=======

BlueStore manages either one, two, or in certain cases three storage devices.
These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.

In the simplest case, BlueStore consumes all of a single storage device. This
device is known as the *primary device*. The primary device is identified by
the ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount. When this data directory is booted or
activated by ``ceph-volume``, it is populated with metadata files and links
that hold information about the OSD: for example, the OSD's identifier, the
name of the cluster that the OSD belongs to, and the OSD's private keyring.

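To see at a glance which devices back a given OSD, you can list the ``block*``
symlinks in its data directory. This is an illustrative sketch only: the OSD id
``0`` and the default ``/var/lib/ceph/osd`` path are assumptions that might not
match your deployment (containerized deployments, for example, use different
paths):

.. prompt:: bash #

   ls -l /var/lib/ceph/osd/ceph-0/block*
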
In more complicated cases, BlueStore is deployed across one or two additional
devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used to separate out BlueStore's internal journal or
  write-ahead log. Using a WAL device is advantageous only if the WAL device
  is faster than the primary device (for example, if the WAL device is an SSD
  and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  to store BlueStore's internal metadata. BlueStore (or more precisely, the
  embedded RocksDB) will put as much metadata as it can on the DB device in
  order to improve performance. If the DB device becomes full, metadata will
  spill back onto the primary device (where it would have been located in the
  absence of the DB device). Again, it is advantageous to provision a DB device
  only if it is faster than the primary device.

If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.

To provision a single-device (colocated) BlueStore OSD, run the following
command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device or DB device, run the following command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: The option ``--data`` can take as its argument any of the following
   devices: logical volumes specified using *vg/lv* notation, existing logical
   volumes, and GPT partitions.

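For example, assuming a pre-created data logical volume ``ceph-vg/block-lv``
and a DB logical volume ``ceph-db-vg/db-lv`` (both names are hypothetical), the
invocation might look like this:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data ceph-vg/block-lv --block.db ceph-db-vg/db-lv
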
Provisioning strategies
-----------------------

BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/sda

If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (for example, SSD and HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).

You must create these volume groups and logical volumes manually, because the
``ceph-volume`` tool is currently unable to create them automatically.

The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:

.. prompt:: bash $

   vgcreate ceph-block-0 /dev/sda
   vgcreate ceph-block-1 /dev/sdb
   vgcreate ceph-block-2 /dev/sdc
   vgcreate ceph-block-3 /dev/sdd

Next, to create the logical volumes for ``block``, run the following commands:

.. prompt:: bash $

   lvcreate -l 100%FREE -n block-0 ceph-block-0
   lvcreate -l 100%FREE -n block-1 ceph-block-1
   lvcreate -l 100%FREE -n block-2 ceph-block-2
   lvcreate -l 100%FREE -n block-3 ceph-block-3

Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:

.. prompt:: bash $

   vgcreate ceph-db-0 /dev/sdx
   lvcreate -L 50GB -n db-0 ceph-db-0
   lvcreate -L 50GB -n db-1 ceph-db-0
   lvcreate -L 50GB -n db-2 ceph-db-0
   lvcreate -L 50GB -n db-3 ceph-db-0

Finally, to create the four OSDs, run the following commands:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
   ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
   ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
   ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each OSD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.

Sizing
======
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be *as large as possible*.

It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.

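As a worked example (the drive sizes here are hypothetical), for an RGW OSD
whose ``block`` device is a 10TB HDD, the 4% guideline suggests a ``block.db``
logical volume of roughly 400GB, which could be created in a DB volume group
such as the ``ceph-db-0`` group used earlier:

.. prompt:: bash $

   lvcreate -L 400G -n db-0 ceph-db-0
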
In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.

Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
release brings experimental dynamic-level support. Because of these advances,
users of older releases might want to plan ahead by provisioning larger DB
devices today so that the benefits of scale can be realized when upgrades are
made in the future.

When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.

Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches, provided that
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm and caches do not shrink smaller than the size defined by
the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
with a hierarchy of priorities. But if priority information is not available,
the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.

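For example, to give every OSD in the cluster a memory target of roughly 6 GiB
(a hypothetical figure; choose a value appropriate to your hardware), you might
run:

.. prompt:: bash $

   ceph config set osd osd_memory_target 6442450944
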
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory consumed by each OSD to be used for its BlueStore cache is
determined by the ``bluestore_cache_size`` configuration option. If that option
has not been specified (that is, if it remains at 0), then Ceph uses a
different configuration option to determine the default memory budget:
``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` if the primary device is an SSD.

BlueStore and the rest of the Ceph OSD daemon make every effort to work within
this memory budget. Note that in addition to the configured cache size, there
is also memory consumed by the OSD itself. There is additional utilization due
to memory fragmentation and other allocator overhead.

The configured cache-memory budget can be used to store the following types of
things:

* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (that is, recently read or recently written object data)

Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
of the cache that is reserved for data is governed by both the effective
BlueStore cache size (which depends on the relevant
``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
device) and the "meta" and "kv" ratios. This data fraction can be calculated
with the following formula: ``<effective_cache_size> * (1 -
bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

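As an illustrative calculation (the numbers are hypothetical, not defaults),
suppose the effective cache size is 3 GiB and both the "meta" and "kv" ratios
are set to 0.45. The fraction left for data would then be::

   3 GiB * (1 - 0.45 - 0.45) = 0.3 GiB
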
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
contrast, data checksumming is handled by BlueStore and can use `crc32c`,
`xxhash32`, or `xxhash64`. The default checksum algorithm is `crc32c`, which is
suitable for most purposes.

Full data checksumming increases the amount of metadata that BlueStore must
store and manage. Whenever possible (for example, when clients hint that data
is written and read sequentially), BlueStore will checksum larger blocks. In
many cases, however, it must store a checksum value (usually 4 bytes) for every
4 KB block of data.

It is possible to obtain a smaller checksum value by truncating the checksum to
one or two bytes and reducing the metadata overhead. A drawback of this
approach is that it increases the probability of a random error going
undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
as the checksum algorithm.

The *checksum algorithm* can be specified either via a per-pool ``csum_type``
configuration option or via the global configuration option. For example:

.. prompt:: bash $

   ceph osd pool set <pool-name> csum_type <algorithm>

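For example, to halve the per-checksum overhead on a hypothetical pool named
``mypool`` by switching to 16-bit checksums, you might run:

.. prompt:: bash $

   ceph osd pool set mypool csum_type crc32c_16
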
.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.

Whether data in BlueStore is compressed is determined by two factors: (1) the
*compression mode* and (2) any client hints associated with a write operation.
The compression modes are as follows:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Do compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.

Note that data in BlueStore will be compressed only if the data chunk will be
sufficiently reduced in size (as determined by the ``bluestore compression
required ratio`` setting). No matter which compression mode is in use, if the
compressed chunk is not small enough, it will be discarded and the original
(uncompressed) data will be stored instead. For example, if ``bluestore
compression required ratio`` is set to ``.7``, then data compression will take
place only if the size of the compressed data is no more than 70% of the size
of the original data.

The *compression mode*, *compression algorithm*, *compression required ratio*,
*min blob size*, and *max blob size* settings can be specified either via a
per-pool property or via a global config option. To specify pool properties,
run the following commands:

.. prompt:: bash $

   ceph osd pool set <pool-name> compression_algorithm <algorithm>
   ceph osd pool set <pool-name> compression_mode <mode>
   ceph osd pool set <pool-name> compression_required_ratio <ratio>
   ceph osd pool set <pool-name> compression_min_blob_size <size>
   ceph osd pool set <pool-name> compression_max_blob_size <size>

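As a concrete (hypothetical) example, the following commands enable aggressive
``zstd`` compression on a pool named ``mypool``:

.. prompt:: bash $

   ceph osd pool set mypool compression_algorithm zstd
   ceph osd pool set mypool compression_mode aggressive
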
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
Prior to the Pacific release, all key-value data was stored in a single RocksDB
column family: 'default'. In Pacific and later releases, however, BlueStore can
divide key-value data into several RocksDB column families. BlueStore achieves
better caching and more precise compaction when keys are similar: specifically,
when keys have similar access frequency, similar modification frequency, and a
similar lifetime. Under such conditions, performance is improved and less disk
space is required during compaction (because each column family is smaller and
is able to compact independently of the others).

OSDs deployed in Pacific or later releases use RocksDB sharding by default.
However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.

To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:

.. prompt:: bash #

   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
     reshard

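To confirm which sharding definition an OSD is actually using, you can (with
the OSD still stopped) query it with the same tool; this sketch assumes the
``show-sharding`` subcommand available in recent releases:

.. prompt:: bash #

   ceph-bluestore-tool --path <data path> show-sharding
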
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

To use the SPDK driver for NVMe devices, you must first prepare your system.
See `SPDK document`__.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script that will configure the device automatically. Run this
script with root permissions:

.. prompt:: bash $

   sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with the
"spdk:" prefix for ``bluestore_block_path``.

In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:

.. prompt:: bash $

   lspci -mm -n -D -d 8086:0953

The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.

Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by running
the following command::

   bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"

You may also specify a remote NVMeoF target over the TCP transport, as in the
following example::

   bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"

To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.

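For example, to give each instance 2 GB of DPDK memory (a hypothetical
allocation; size it to your workload), the per-OSD configuration might include
the ``bluestore_spdk_mem`` option::

   bluestore_spdk_mem = 2048
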
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all I/Os are issued through SPDK::

   bluestore_block_db_path = ""
   bluestore_block_db_size = 0
   bluestore_block_wal_path = ""
   bluestore_block_wal_size = 0

If these settings are not entered, then the current implementation will
populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.

Minimum Allocation Size
=======================

There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question--:confval:`bluestore_min_alloc_size`--derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.

In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.

These changes were driven by space amplification that was experienced by Ceph
RADOS GateWay (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).

For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`bluestore_min_alloc_size` value, 4 KB of underlying drive space is
allocated. This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated
but never used: this corresponds to 300% overhead or 25% efficiency. Similarly,
a 5 KB user object will be stored as two RADOS objects, a 4 KB RADOS object and
a 1 KB RADOS object, with the result that 4KB of device capacity is stranded.
In this case, however, the overhead percentage is much smaller. Think of this
in terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as object size increases.

There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
instead of replication, the amplification might be even higher: for a ``k=4,
m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
of device capacity.

When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.

The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.

Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Errors can be avoided through careful
ordering of OSD creation, with custom OSD device classes, and especially by the
use of central configuration *masks*.

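A minimal sketch of such a mask, assuming a hypothetical host ``qlc-host-1``
that contains only coarse-IU QLC drives with a 16KB Indirection Unit (set this
*before* creating the OSDs on that host, because the value takes effect only at
OSD creation):

.. prompt:: bash #

   ceph config set osd/host:qlc-host-1 bluestore_min_alloc_size_ssd 16384
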
In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, ``iSCSI``,
or other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.

We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue, examine
the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.

.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
   baked into each OSD is conveniently reported by ``ceph osd metadata``.

To inspect a specific OSD, run the following command:

.. prompt:: bash #

   ceph osd metadata osd.1701 | egrep rotational\|alloc

This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.

This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.

.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size

DSA (Data Streaming Accelerator) Usage
======================================

If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.

.. _DML: https://github.com/intel/dml
.. _idxd-config: https://github.com/intel/idxd-config

After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:

.. prompt:: bash $

   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
   accel-config config-engine dsa0/engine0.1 --group-id=1
   accel-config enable-device dsa0
   accel-config enable-wq dsa0/wq0.1