==================================
 BlueStore Configuration Reference
==================================

Devices
=======

BlueStore manages either one, two, or in certain cases three storage devices.
These *devices* are "devices" in the Linux/Unix sense. This means that they are
assets listed under ``/dev`` or ``/devices``. Each of these devices may be an
entire storage drive, or a partition of a storage drive, or a logical volume.
BlueStore does not create or mount a conventional file system on devices that
it uses; BlueStore reads and writes to the devices directly in a "raw" fashion.

In the simplest case, BlueStore consumes all of a single storage device. This
device is known as the *primary device*. The primary device is identified by
the ``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount. When this data directory is booted or
activated by ``ceph-volume``, it is populated with metadata files and links
that hold information about the OSD: for example, the OSD's identifier, the
name of the cluster that the OSD belongs to, and the OSD's private keyring.

In more complicated cases, BlueStore is deployed across one or two additional
devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used to separate out BlueStore's internal journal or
  write-ahead log. Using a WAL device is advantageous only if the WAL device
  is faster than the primary device (for example, if the WAL device is an SSD
  and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  to store BlueStore's internal metadata. BlueStore (or more precisely, the
  embedded RocksDB) will put as much metadata as it can on the DB device in
  order to improve performance. If the DB device becomes full, metadata will
  spill back onto the primary device (where it would have been located in the
  absence of the DB device). Again, it is advantageous to provision a DB device
  only if it is faster than the primary device.

If there is only a small amount of fast storage available (for example, less
than a gigabyte), we recommend using the available space as a WAL device. But
if more fast storage is available, it makes more sense to provision a DB
device. Because the BlueStore journal is always placed on the fastest device
available, using a DB device provides the same benefit that using a WAL device
would, while *also* allowing additional metadata to be stored off the primary
device (provided that it fits). DB devices make this possible because whenever
a DB device is specified but an explicit WAL device is not, the WAL will be
implicitly colocated with the DB on the faster device.

To provision a single-device (colocated) BlueStore OSD, run the following
command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device or DB device, run the following command:

.. prompt:: bash $

   ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: The option ``--data`` can take as its argument any of the following
   devices: logical volumes specified using *vg/lv* notation, existing logical
   volumes, and GPT partitions.

Provisioning strategies
-----------------------

BlueStore differs from Filestore in that there are several ways to deploy a
BlueStore OSD. However, the overall deployment strategy for BlueStore can be
clarified by examining just these two common arrangements:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^
If all devices are of the same type (for example, they are all HDDs), and if
there are no fast devices available for the storage of metadata, then it makes
sense to specify the block device only and to leave ``block.db`` and
``block.wal`` unseparated. The :ref:`ceph-volume-lvm` command for a single
``/dev/sda`` device is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data /dev/sda

If the devices to be used for a BlueStore OSD are pre-created logical volumes,
then the :ref:`ceph-volume-lvm` call for a logical volume named
``ceph-vg/block-lv`` is as follows:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (for example, SSD or HDD), then we
recommend placing ``block.db`` on the faster device while ``block`` (that is,
the data) is stored on the slower device (that is, the rotational drive).

You must create these volume groups and logical volumes manually, because the
``ceph-volume`` tool is currently unable to create them automatically.

The following procedure illustrates the manual creation of volume groups and
logical volumes. For this example, we shall assume four rotational drives
(``sda``, ``sdb``, ``sdc``, and ``sdd``) and one (fast) SSD (``sdx``). First,
to create the volume groups, run the following commands:

.. prompt:: bash $

   vgcreate ceph-block-0 /dev/sda
   vgcreate ceph-block-1 /dev/sdb
   vgcreate ceph-block-2 /dev/sdc
   vgcreate ceph-block-3 /dev/sdd

Next, to create the logical volumes for ``block``, run the following commands:

.. prompt:: bash $

   lvcreate -l 100%FREE -n block-0 ceph-block-0
   lvcreate -l 100%FREE -n block-1 ceph-block-1
   lvcreate -l 100%FREE -n block-2 ceph-block-2
   lvcreate -l 100%FREE -n block-3 ceph-block-3

Because there are four HDDs, there will be four OSDs. Supposing that there is a
200GB SSD in ``/dev/sdx``, we can create four 50GB logical volumes by running
the following commands:

.. prompt:: bash $

   vgcreate ceph-db-0 /dev/sdx
   lvcreate -L 50GB -n db-0 ceph-db-0
   lvcreate -L 50GB -n db-1 ceph-db-0
   lvcreate -L 50GB -n db-2 ceph-db-0
   lvcreate -L 50GB -n db-3 ceph-db-0

Finally, to create the four OSDs, run the following commands:

.. prompt:: bash $

   ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
   ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
   ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
   ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

After this procedure is finished, there should be four OSDs, ``block`` should
be on the four HDDs, and each OSD should have a 50GB logical volume
(specifically, a DB device) on the shared SSD.

Sizing
======
When using a :ref:`mixed spinning-and-solid-drive setup
<bluestore-mixed-device-config>`, it is important to make a large enough
``block.db`` logical volume for BlueStore. The logical volumes associated with
``block.db`` should be made *as large as possible*.

It is generally recommended that the size of ``block.db`` be somewhere between
1% and 4% of the size of ``block``. For RGW workloads, it is recommended that
the ``block.db`` be at least 4% of the ``block`` size, because RGW makes heavy
use of ``block.db`` to store metadata (in particular, omap keys). For example,
if the ``block`` size is 1TB, then ``block.db`` should have a size of at least
40GB. For RBD workloads, however, ``block.db`` usually needs no more than 1% to
2% of the ``block`` size.
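
For illustration only, the following command creates a DB logical volume sized
at 4% of a hypothetical 4TB ``block`` device (the volume group and LV names
here are examples, not defaults):

.. prompt:: bash $

   # 4% of 4TB is roughly 160GB
   lvcreate -L 160G -n db-0 ceph-db-0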

In older releases, internal level sizes are such that the DB can fully utilize
only those specific partition / logical volume sizes that correspond to sums of
L0, L0+L1, L1+L2, and so on--that is, given default settings, sizes of roughly
3GB, 30GB, 300GB, and so on. Most deployments do not substantially benefit from
sizing that accommodates L3 and higher, though DB compaction can be facilitated
by doubling these figures to 6GB, 60GB, and 600GB.

Improvements in Nautilus 14.2.12, Octopus 15.2.6, and subsequent releases allow
for better utilization of arbitrarily-sized DB devices. Moreover, the Pacific
release brings experimental dynamic-level support. Because of these advances,
users of older releases might want to plan ahead by provisioning larger DB
devices today so that the benefits of scale can be realized when upgrades are
made in the future.

When *not* using a mix of fast and slow devices, there is no requirement to
create separate logical volumes for ``block.db`` or ``block.wal``. BlueStore
will automatically colocate these devices within the space of ``block``.

Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches, provided that
certain conditions are met: TCMalloc must be configured as the memory allocator
and the ``bluestore_cache_autotune`` configuration option must be enabled (note
that it is currently enabled by default). When automatic cache sizing is in
effect, BlueStore attempts to keep OSD heap-memory usage under a certain target
size (as determined by ``osd_memory_target``). This approach makes use of a
best-effort algorithm and caches do not shrink smaller than the size defined by
the value of ``osd_memory_cache_min``. Cache ratios are selected in accordance
with a hierarchy of priorities. But if priority information is not available,
the values specified in the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options are used as fallback cache ratios.
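
For example, to raise the autotuning target for all OSDs to 6 GiB, you might
run the following command (the value here is illustrative; choose a target
that matches your hardware):

.. prompt:: bash $

   ceph config set osd osd_memory_target 6442450944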

.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval


Manual Cache Sizing
===================

The amount of memory that each OSD devotes to its BlueStore cache is
determined by the ``bluestore_cache_size`` configuration option. If that option
has not been specified (that is, if it remains at 0), then Ceph uses a
different configuration option to determine the default memory budget:
``bluestore_cache_size_hdd`` if the primary device is an HDD, or
``bluestore_cache_size_ssd`` if the primary device is an SSD.

BlueStore and the rest of the Ceph OSD daemon make every effort to work within
this memory budget. Note that in addition to the configured cache size, there
is also memory consumed by the OSD itself. There is additional utilization due
to memory fragmentation and other allocator overhead.

The configured cache-memory budget can be used to store the following types of
things:

* Key/Value metadata (that is, RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (that is, recently read or recently written object data)

Cache memory usage is governed by the configuration options
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio``. The fraction
of the cache that is reserved for data is governed by both the effective
BlueStore cache size (which depends on the relevant
``bluestore_cache_size[_ssd|_hdd]`` option and the device class of the primary
device) and the "meta" and "kv" ratios. This data fraction can be calculated
with the following formula: ``<effective_cache_size> * (1 -
bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.
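
As an illustration only (the ratios below are arbitrary examples, not
defaults): with an effective cache size of 3 GiB,
``bluestore_cache_meta_ratio = 0.4``, and ``bluestore_cache_kv_ratio = 0.4``,
the memory available for cached object data would be::

    3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB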

.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and all data written to disk. Metadata
checksumming is handled by RocksDB and uses the `crc32c` algorithm. By
contrast, data checksumming is handled by BlueStore and can use either
`crc32c`, `xxhash32`, or `xxhash64`. Nonetheless, `crc32c` is the default
checksum algorithm and it is suitable for most purposes.

Full data checksumming increases the amount of metadata that BlueStore must
store and manage. Whenever possible (for example, when clients hint that data
is written and read sequentially), BlueStore will checksum larger blocks. In
many cases, however, it must store a checksum value (usually 4 bytes) for every
4 KB block of data.

It is possible to obtain a smaller checksum value by truncating the checksum to
one or two bytes and reducing the metadata overhead. A drawback of this
approach is that it increases the probability of a random error going
undetected: about one in four billion given a 32-bit (4 byte) checksum, 1 in
65,536 given a 16-bit (2 byte) checksum, and 1 in 256 given an 8-bit (1 byte)
checksum. To use the smaller checksum values, select `crc32c_16` or `crc32c_8`
as the checksum algorithm.

The *checksum algorithm* can be specified either via a per-pool ``csum_type``
configuration option or via the global ``bluestore_csum_type`` configuration
option. For example:

.. prompt:: bash $

   ceph osd pool set <pool-name> csum_type <algorithm>
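
For example, to use the truncated 16-bit checksum on a hypothetical pool named
``testpool``, you might run:

.. prompt:: bash $

   ceph osd pool set testpool csum_type crc32c_16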

.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, `lz4`, or `zstd`.

Whether data in BlueStore is compressed is determined by two factors: (1) the
*compression mode* and (2) any client hints associated with a write operation.
The compression modes are as follows:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* I/O hints,
see :c:func:`rados_set_alloc_hint`.

Note that data in BlueStore will be compressed only if compression shrinks the
data chunk sufficiently, as determined by the ``bluestore compression required
ratio`` setting. No matter which compression mode is in use, if the compressed
chunk is not small enough, the compressed version is discarded and the original
(uncompressed) data is stored instead. For example, if ``bluestore compression
required ratio`` is set to ``.7``, then data compression will take place only
if the size of the compressed data is no more than 70% of the size of the
original data.

The *compression mode*, *compression algorithm*, *compression required ratio*,
*min blob size*, and *max blob size* settings can be specified either via a
per-pool property or via a global config option. To specify pool properties,
run the following commands:

.. prompt:: bash $

   ceph osd pool set <pool-name> compression_algorithm <algorithm>
   ceph osd pool set <pool-name> compression_mode <mode>
   ceph osd pool set <pool-name> compression_required_ratio <ratio>
   ceph osd pool set <pool-name> compression_min_blob_size <size>
   ceph osd pool set <pool-name> compression_max_blob_size <size>
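
As a concrete illustration, the following commands enable ``lz4`` compression
in ``aggressive`` mode on a hypothetical pool named ``testpool`` (the pool name
and choice of algorithm are examples only):

.. prompt:: bash $

   ceph osd pool set testpool compression_algorithm lz4
   ceph osd pool set testpool compression_mode aggressive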

.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

BlueStore maintains several types of internal key-value data, all of which are
stored in RocksDB. Each data type in BlueStore is assigned a unique prefix.
Prior to the Pacific release, all key-value data was stored in a single RocksDB
column family: 'default'. In Pacific and later releases, however, BlueStore can
divide key-value data into several RocksDB column families. BlueStore achieves
better caching and more precise compaction when keys are similar: specifically,
when keys have similar access frequency, similar modification frequency, and a
similar lifetime. Under such conditions, performance is improved and less disk
space is required during compaction (because each column family is smaller and
is able to compact independently of the others).

OSDs deployed in Pacific or later releases use RocksDB sharding by default.
However, if Ceph has been upgraded to Pacific or a later version from a
previous version, sharding is disabled on any OSDs that were created before
Pacific.

To enable sharding and apply the Pacific defaults to a specific OSD, stop the
OSD and run the following command:

.. prompt:: bash #

   ceph-bluestore-tool \
     --path <data path> \
     --sharding="m(3) p(3,0-12) o(3,0-13)=block_cache={type=binned_lru} l p" \
     reshard
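
To verify which sharding definition (if any) is currently applied to an OSD's
RocksDB, you can run ``ceph-bluestore-tool`` with the ``show-sharding`` command
(available in Pacific and later releases) while the OSD is stopped:

.. prompt:: bash #

   ceph-bluestore-tool --path <data path> show-sharding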

.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

To use the SPDK driver for NVMe devices, you must first prepare your system.
See the `SPDK documentation`__ for details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script that will configure the device automatically. Run this
script with root permissions:

.. prompt:: bash $

   sudo src/spdk/scripts/setup.sh

To point BlueStore at an NVMe device managed by SPDK, set
``bluestore_block_path`` to the device selector of that device, prefixed with
``spdk:``.

In the following example, you first find the device selector of an Intel NVMe
SSD by running the following command:

.. prompt:: bash $

   lspci -mm -n -D -d 8086:0953

The form of the device selector is either ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.

Next, supposing that ``0000:01:00.0`` is the device selector found in the
output of the ``lspci`` command, you can specify the device selector by running
the following command::

   bluestore_block_path = "spdk:trtype:pcie traddr:0000:01:00.0"

You may also specify a remote NVMeoF target over the TCP transport, as in the
following example::

   bluestore_block_path = "spdk:trtype:tcp traddr:10.67.110.197 trsvcid:4420 subnqn:nqn.2019-02.io.spdk:cnode1"

To run multiple SPDK instances per node, you must make sure each instance uses
its own DPDK memory by specifying for each instance the amount of DPDK memory
(in MB) that the instance will use.
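
For example, the DPDK memory pool for an instance can be set in that OSD's
configuration (the ``bluestore_spdk_mem`` option name and the 512 MB value
here are assumptions; confirm that the option exists in your release)::

   bluestore_spdk_mem = 512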

In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the following
settings to ensure that all I/Os are issued through SPDK::

   bluestore_block_db_path = ""
   bluestore_block_db_size = 0
   bluestore_block_wal_path = ""
   bluestore_block_wal_size = 0

If these settings are not entered, then the current implementation will
populate the SPDK map files with kernel file system symbols and will use the
kernel driver to issue DB/WAL I/Os.

Minimum Allocation Size
=======================

There is a configured minimum amount of storage that BlueStore allocates on an
underlying storage device. In practice, this is the least amount of capacity
that even a tiny RADOS object can consume on each OSD's primary device. The
configuration option in question--:confval:`bluestore_min_alloc_size`--derives
its value from the value of either :confval:`bluestore_min_alloc_size_hdd` or
:confval:`bluestore_min_alloc_size_ssd`, depending on the OSD's ``rotational``
attribute. Thus if an OSD is created on an HDD, BlueStore is initialized with
the current value of :confval:`bluestore_min_alloc_size_hdd`; but with SSD OSDs
(including NVMe devices), BlueStore is initialized with the current value of
:confval:`bluestore_min_alloc_size_ssd`.

In Mimic and earlier releases, the default values were 64KB for rotational
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
changed the default value for non-rotational media (SSD) to 4KB, and the
Pacific release changed the default value for rotational media (HDD) to 4KB.

These changes were driven by space amplification that was experienced by Ceph
RADOS Gateway (RGW) deployments that hosted large numbers of small files
(S3/Swift objects).

For example, when an RGW client stores a 1 KB S3 object, that object is written
to a single RADOS object. In accordance with the default
:confval:`bluestore_min_alloc_size` value, 4 KB of underlying drive space is
allocated. This means that roughly 3 KB (that is, 4 KB minus 1 KB) is allocated
but never used: this corresponds to 300% overhead or 25% efficiency. Similarly,
a 5 KB user object will be stored as two RADOS objects, a 4 KB RADOS object and
a 1 KB RADOS object, with the result that 4KB of device capacity is stranded.
In this case, however, the overhead percentage is much smaller. Think of this
in terms of the remainder from a modulus operation. The overhead *percentage*
thus decreases rapidly as object size increases.

There is an additional subtlety that is easily missed: the amplification
phenomenon just described takes place for *each* replica. For example, when
using the default of three copies of data (3R), a 1 KB S3 object actually
strands roughly 9 KB of storage device capacity. If erasure coding (EC) is used
instead of replication, the amplification might be even higher: for a ``k=4,
m=2`` pool, our 1 KB S3 object allocates 24 KB (that is, 4 KB multiplied by 6)
of device capacity.

When an RGW bucket pool contains many relatively large user objects, the effect
of this phenomenon is often negligible. However, with deployments that can
expect a significant fraction of relatively small user objects, the effect
should be taken into consideration.

The 4KB default value aligns well with conventional HDD and SSD devices.
However, certain novel coarse-IU (Indirection Unit) QLC SSDs perform and wear
best when :confval:`bluestore_min_alloc_size_ssd` is specified at OSD creation
to match the device's IU: this might be 8KB, 16KB, or even 64KB. These novel
storage drives can achieve read performance that is competitive with that of
conventional TLC SSDs and write performance that is faster than that of HDDs,
with higher density and lower cost than TLC SSDs.

Note that when creating OSDs on these novel devices, one must be careful to
apply the non-default value only to appropriate devices, and not to
conventional HDD and SSD devices. Errors can be avoided through careful
ordering of OSD creation, with custom OSD device classes, and especially by the
use of central configuration *masks*, as shown in the example below.
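
For example, a larger value could be applied only to OSDs whose devices have
been assigned a hypothetical custom device class named ``qlc`` by using a
central configuration mask (the class name and the 16KB value are illustrative,
not recommendations):

.. prompt:: bash #

   ceph config set osd/class:qlc bluestore_min_alloc_size_ssd 16384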

In Quincy and later releases, you can use the
:confval:`bluestore_use_optimal_io_size_for_min_alloc_size` option to allow
automatic discovery of the correct value as each OSD is created. Note that the
use of ``bcache``, ``OpenCAS``, ``dmcrypt``, ``ATA over Ethernet``, ``iSCSI``,
or other device-layering and abstraction technologies might confound the
determination of correct values. Moreover, OSDs deployed on top of VMware
storage have sometimes been found to report a ``rotational`` attribute that
does not match the underlying hardware.

We suggest inspecting such OSDs at startup via logs and admin sockets in order
to ensure that their behavior is correct. Be aware that this kind of inspection
might not work as expected with older kernels. To check for this issue, examine
the presence and value of ``/sys/block/<drive>/queue/optimal_io_size``.

.. note:: When running Reef or a later Ceph release, the ``min_alloc_size``
   baked into each OSD is conveniently reported by ``ceph osd metadata``.

To inspect a specific OSD, run the following command:

.. prompt:: bash #

   ceph osd metadata osd.1701 | egrep rotational\|alloc

This space amplification might manifest as an unusually high ratio of raw to
stored data as reported by ``ceph df``. There might also be ``%USE`` / ``VAR``
values reported by ``ceph osd df`` that are unusually high in comparison to
other, ostensibly identical, OSDs. Finally, there might be unexpected balancer
behavior in pools that use OSDs that have mismatched ``min_alloc_size`` values.
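
To look for these symptoms, compare utilization across the cluster and across
ostensibly identical OSDs:

.. prompt:: bash #

   ceph df
   ceph osd df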

This BlueStore attribute takes effect *only* at OSD creation; if the attribute
is changed later, a specific OSD's behavior will not change unless and until
the OSD is destroyed and redeployed with the appropriate option value(s).
Upgrading to a later Ceph release will *not* change the value used by OSDs that
were deployed under older releases or with other settings.

.. confval:: bluestore_min_alloc_size
.. confval:: bluestore_min_alloc_size_hdd
.. confval:: bluestore_min_alloc_size_ssd
.. confval:: bluestore_use_optimal_io_size_for_min_alloc_size

DSA (Data Streaming Accelerator) Usage
======================================

If you want to use the DML library to drive the DSA device for offloading
read/write operations on persistent memory (PMEM) in BlueStore, you need to
install `DML`_ and the `idxd-config`_ library. This will work only on machines
that have an SPR (Sapphire Rapids) CPU.

.. _dml: https://github.com/intel/dml
.. _idxd-config: https://github.com/intel/idxd-config

After installing the DML software, configure the shared work queues (WQs) with
reference to the following WQ configuration example:

.. prompt:: bash $

   accel-config config-wq --group-id=1 --mode=shared --wq-size=16 --threshold=15 --type=user --name="myapp1" --priority=10 --block-on-fault=1 dsa0/wq0.1
   accel-config config-engine dsa0/engine0.1 --group-id=1
   accel-config enable-device dsa0
   accel-config enable-wq dsa0/wq0.1