==========================
BlueStore Config Reference
==========================

Devices
=======

BlueStore manages either one, two, or (in certain cases) three storage
devices.

In the simplest case, BlueStore consumes a single (primary) storage device.
The storage device is normally used as a whole, with the full device managed
directly by BlueStore. This *primary device* is normally identified by a
``block`` symlink in the data directory.

The data directory is a ``tmpfs`` mount which gets populated (at boot time, or
when ``ceph-volume`` activates it) with all the common OSD files that hold
information about the OSD, such as its identifier, the cluster it belongs to,
and its private keyring.

It is also possible to deploy BlueStore across one or two additional devices:

* A *write-ahead log (WAL) device* (identified as ``block.wal`` in the data
  directory) can be used for BlueStore's internal journal or write-ahead log.
  Using a WAL device is useful only if the device is faster than the primary
  device (e.g., when it is on an SSD and the primary device is an HDD).
* A *DB device* (identified as ``block.db`` in the data directory) can be used
  for storing BlueStore's internal metadata. BlueStore (or rather, the
  embedded RocksDB) will put as much metadata as it can on the DB device to
  improve performance. If the DB device fills up, metadata will spill back
  onto the primary device (where it would have been otherwise). Again,
  provisioning a DB device is helpful only if it is faster than the primary
  device.

If there is only a small amount of fast storage available (e.g., less
than a gigabyte), we recommend using it as a WAL device. If there is
more, provisioning a DB device makes more sense. The BlueStore
journal will always be placed on the fastest device available, so
using a DB device will provide the same benefit that the WAL device
would while *also* allowing additional metadata to be stored there (if
it will fit). This means that if a DB device is specified but an explicit
WAL device is not, the WAL will be implicitly colocated with the DB on the
faster device.

A single-device (colocated) BlueStore OSD can be provisioned with::

  ceph-volume lvm prepare --bluestore --data <device>

To specify a WAL device and/or DB device::

  ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>

.. note:: ``--data`` can be a Logical Volume using *vg/lv* notation. Other
   devices can be existing logical volumes or GPT partitions.

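For example, assuming hypothetical pre-created logical volumes
``ceph-vg/data-lv`` (data) and ``ceph-db-vg/db-lv`` (DB), a prepare call might
look like::

  ceph-volume lvm prepare --bluestore --data ceph-vg/data-lv --block.db ceph-db-vg/db-lv
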
Provisioning strategies
-----------------------

Although there are multiple ways to deploy a BlueStore OSD (unlike Filestore,
which had just one), there are two common arrangements that should help clarify
the deployment strategy:

.. _bluestore-single-type-device-config:

**block (data) only**
^^^^^^^^^^^^^^^^^^^^^

If all devices are the same type, for example all rotational drives, and
there are no fast devices to use for metadata, it makes sense to specify the
block device only and to not separate ``block.db`` or ``block.wal``. The
:ref:`ceph-volume-lvm` command for a single ``/dev/sda`` device looks like::

  ceph-volume lvm create --bluestore --data /dev/sda

If logical volumes have already been created for each device (a single LV
using 100% of the device), then the :ref:`ceph-volume-lvm` call for an LV named
``ceph-vg/block-lv`` would look like::

  ceph-volume lvm create --bluestore --data ceph-vg/block-lv

.. _bluestore-mixed-device-config:

**block and block.db**
^^^^^^^^^^^^^^^^^^^^^^

If you have a mix of fast and slow devices (SSD / NVMe and rotational),
it is recommended to place ``block.db`` on the faster device while ``block``
(data) lives on the slower device (the spinning drive).

You must create these volume groups and logical volumes manually as
the ``ceph-volume`` tool is currently not able to do so automatically.

For the below example, let us assume four rotational drives (``sda``, ``sdb``,
``sdc``, and ``sdd``) and one (fast) solid state drive (``sdx``). First create
the volume groups::

  $ vgcreate ceph-block-0 /dev/sda
  $ vgcreate ceph-block-1 /dev/sdb
  $ vgcreate ceph-block-2 /dev/sdc
  $ vgcreate ceph-block-3 /dev/sdd

Now create the logical volumes for ``block``::

  $ lvcreate -l 100%FREE -n block-0 ceph-block-0
  $ lvcreate -l 100%FREE -n block-1 ceph-block-1
  $ lvcreate -l 100%FREE -n block-2 ceph-block-2
  $ lvcreate -l 100%FREE -n block-3 ceph-block-3

We are creating 4 OSDs for the four slow spinning devices, so assuming a 200GB
SSD in ``/dev/sdx`` we will create 4 logical volumes, each of 50GB::

  $ vgcreate ceph-db-0 /dev/sdx
  $ lvcreate -L 50GB -n db-0 ceph-db-0
  $ lvcreate -L 50GB -n db-1 ceph-db-0
  $ lvcreate -L 50GB -n db-2 ceph-db-0
  $ lvcreate -L 50GB -n db-3 ceph-db-0

Finally, create the 4 OSDs with ``ceph-volume``::

  $ ceph-volume lvm create --bluestore --data ceph-block-0/block-0 --block.db ceph-db-0/db-0
  $ ceph-volume lvm create --bluestore --data ceph-block-1/block-1 --block.db ceph-db-0/db-1
  $ ceph-volume lvm create --bluestore --data ceph-block-2/block-2 --block.db ceph-db-0/db-2
  $ ceph-volume lvm create --bluestore --data ceph-block-3/block-3 --block.db ceph-db-0/db-3

These operations should end up creating four OSDs, with ``block`` on the slower
rotational drives and a 50 GB logical volume (DB) for each on the solid state
drive.

Sizing
======

When using a :ref:`mixed spinning and solid drive setup
<bluestore-mixed-device-config>` it is important to make a large enough
``block.db`` logical volume for BlueStore. Generally, the ``block.db`` logical
volumes should be *as large as possible*.

The general recommendation is to have the ``block.db`` size be between 1% and
4% of the ``block`` size. For RGW workloads, it is recommended that the
``block.db`` size not be smaller than 4% of ``block``, because RGW heavily
uses it to store metadata (omap keys). For example, if the ``block`` size is
1TB, then ``block.db`` shouldn't be less than 40GB. For RBD workloads, 1% to
2% of the ``block`` size is usually enough.

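As a quick check of the arithmetic, 4% of a 1 TB (1000 GB) ``block`` device
works out to the 40GB figure above::

  $ echo $(( 1000 * 4 / 100 ))   # 4% of 1000 GB, in GB
  40
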
In older releases, internal level sizes mean that the DB can fully utilize only
specific partition / LV sizes that correspond to sums of L0, L0+L1, L1+L2,
etc. sizes, which with default settings means roughly 3 GB, 30 GB, 300 GB, and
so forth. Most deployments will not substantially benefit from sizing to
accommodate L3 and higher, though DB compaction can be facilitated by doubling
these figures to 6 GB, 60 GB, and 600 GB.

Improvements in releases beginning with Nautilus 14.2.12 and Octopus 15.2.6
enable better utilization of arbitrary DB device sizes, and the Pacific
release brings experimental dynamic level support. Users of older releases may
thus wish to plan ahead by provisioning larger DB devices today so that their
benefits may be realized with future upgrades.

When *not* using a mix of fast and slow devices, it isn't required to create
separate logical volumes for ``block.db`` (or ``block.wal``). BlueStore will
automatically colocate these within the space of ``block``.

Automatic Cache Sizing
======================

BlueStore can be configured to automatically resize its caches when TCMalloc
is configured as the memory allocator and the ``bluestore_cache_autotune``
setting is enabled. This option is currently enabled by default. BlueStore
will attempt to keep OSD heap memory usage under a designated target size via
the ``osd_memory_target`` configuration option. This is a best-effort
algorithm and caches will not shrink smaller than the amount specified by
``osd_memory_cache_min``. Cache ratios will be chosen based on a hierarchy
of priorities. If priority information is not available, the
``bluestore_cache_meta_ratio`` and ``bluestore_cache_kv_ratio`` options are
used as fallbacks.

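For example, to set a 6 GiB memory target for all OSDs (an illustrative
value, not a recommendation)::

  ceph config set osd osd_memory_target 6442450944
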
.. confval:: bluestore_cache_autotune
.. confval:: osd_memory_target
.. confval:: bluestore_cache_autotune_interval
.. confval:: osd_memory_base
.. confval:: osd_memory_expected_fragmentation
.. confval:: osd_memory_cache_min
.. confval:: osd_memory_cache_resize_interval

Manual Cache Sizing
===================

The amount of memory consumed by each OSD for BlueStore caches is
determined by the ``bluestore_cache_size`` configuration option. If
that config option is not set (i.e., remains at 0), there is a
different default value that is used depending on whether an HDD or
SSD is used for the primary device (set by the
``bluestore_cache_size_ssd`` and ``bluestore_cache_size_hdd`` config
options).

BlueStore and the rest of the Ceph OSD daemon do the best they can
to work within this memory budget. Note that on top of the configured
cache size, there is also memory consumed by the OSD itself, and
some additional utilization due to memory fragmentation and other
allocator overhead.

The configured cache memory budget can be used in a few different ways:

* Key/Value metadata (i.e., RocksDB's internal cache)
* BlueStore metadata
* BlueStore data (i.e., recently read or written object data)

Cache memory usage is governed by the ``bluestore_cache_meta_ratio`` and
``bluestore_cache_kv_ratio`` options. The fraction of the cache devoted to
data is governed by the effective BlueStore cache size (which depends on the
``bluestore_cache_size[_ssd|_hdd]`` settings and the device class of the
primary device) as well as the meta and kv ratios. The data fraction can be
calculated as
``<effective_cache_size> * (1 - bluestore_cache_meta_ratio - bluestore_cache_kv_ratio)``.

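For example, with a hypothetical effective cache size of 3 GiB and meta and
kv ratios of 0.4 each, the data cache would be::

  3 GiB * (1 - 0.4 - 0.4) = 0.6 GiB (roughly 614 MiB)
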
.. confval:: bluestore_cache_size
.. confval:: bluestore_cache_size_hdd
.. confval:: bluestore_cache_size_ssd
.. confval:: bluestore_cache_meta_ratio
.. confval:: bluestore_cache_kv_ratio

Checksums
=========

BlueStore checksums all metadata and data written to disk. Metadata
checksumming is handled by RocksDB and uses `crc32c`. Data
checksumming is done by BlueStore and can make use of `crc32c`,
`xxhash32`, or `xxhash64`. The default is `crc32c` and should be
suitable for most purposes.

Full data checksumming does increase the amount of metadata that
BlueStore must store and manage. When possible, e.g., when clients
hint that data is written and read sequentially, BlueStore will
checksum larger blocks, but in many cases it must store a checksum
value (usually 4 bytes) for every 4 kilobyte block of data.

It is possible to use a smaller checksum value by truncating the
checksum to one or two bytes, reducing the metadata overhead. The
trade-off is that the probability that a random error will go
undetected is higher with a smaller checksum: from about one in
four billion with a 32-bit (4 byte) checksum, to one in 65,536 with a
16-bit (2 byte) checksum, or one in 256 with an 8-bit (1 byte) checksum.
The smaller checksum values can be used by selecting `crc32c_16` or
`crc32c_8` as the checksum algorithm.

The *checksum algorithm* can be set either via a per-pool
``csum_type`` property or the global config option. For example::

  ceph osd pool set <pool-name> csum_type <algorithm>

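For instance, to select 16-bit checksums on a pool (the pool name
``testpool`` here is hypothetical)::

  ceph osd pool set testpool csum_type crc32c_16
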
.. confval:: bluestore_csum_type

Inline Compression
==================

BlueStore supports inline compression using `snappy`, `zlib`, or
`lz4`. Please note that the `lz4` compression plugin is not
distributed in the official release.

Whether data in BlueStore is compressed is determined by a combination
of the *compression mode* and any hints associated with a write
operation. The modes are:

* **none**: Never compress data.
* **passive**: Do not compress data unless the write operation has a
  *compressible* hint set.
* **aggressive**: Compress data unless the write operation has an
  *incompressible* hint set.
* **force**: Try to compress data no matter what.

For more information about the *compressible* and *incompressible* IO
hints, see :c:func:`rados_set_alloc_hint`.

Note that regardless of the mode, if the size of the data chunk is not
reduced sufficiently, the compressed version will not be used and the
original (uncompressed) data will be stored. For example, if
``bluestore_compression_required_ratio`` is set to ``.7``, then the
compressed data must be 70% of the size of the original (or smaller).

The *compression mode*, *compression algorithm*, *compression required
ratio*, *min blob size*, and *max blob size* can be set either via a
per-pool property or a global config option. Pool properties can be
set with::

  ceph osd pool set <pool-name> compression_algorithm <algorithm>
  ceph osd pool set <pool-name> compression_mode <mode>
  ceph osd pool set <pool-name> compression_required_ratio <ratio>
  ceph osd pool set <pool-name> compression_min_blob_size <size>
  ceph osd pool set <pool-name> compression_max_blob_size <size>

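For example, to enable aggressive `snappy` compression on a hypothetical pool
named ``testpool``::

  ceph osd pool set testpool compression_algorithm snappy
  ceph osd pool set testpool compression_mode aggressive
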
.. confval:: bluestore_compression_algorithm
.. confval:: bluestore_compression_mode
.. confval:: bluestore_compression_required_ratio
.. confval:: bluestore_compression_min_blob_size
.. confval:: bluestore_compression_min_blob_size_hdd
.. confval:: bluestore_compression_min_blob_size_ssd
.. confval:: bluestore_compression_max_blob_size
.. confval:: bluestore_compression_max_blob_size_hdd
.. confval:: bluestore_compression_max_blob_size_ssd

.. _bluestore-rocksdb-sharding:

RocksDB Sharding
================

Internally BlueStore uses multiple types of key-value data, stored in
RocksDB. Each data type in BlueStore is assigned a unique prefix. Until
Pacific, all key-value data was stored in a single RocksDB column family:
'default'. Since Pacific, BlueStore can divide this data into multiple
RocksDB column families. When keys have similar access frequency,
modification frequency, and lifetime, BlueStore benefits from better caching
and more precise compaction. This improves performance and also requires
less disk space during compaction, since each column family is smaller and
can compact independently of the others.

OSDs deployed in Pacific or later use RocksDB sharding by default.
If Ceph is upgraded to Pacific from a previous version, sharding is off.

To enable sharding and apply the Pacific defaults, stop an OSD and run::

    ceph-bluestore-tool \
      --path <data path> \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard

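The sharding definition in use by a (stopped) OSD can be inspected with
``ceph-bluestore-tool``'s ``show-sharding`` command::

    ceph-bluestore-tool --path <data path> show-sharding
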
.. confval:: bluestore_rocksdb_cf
.. confval:: bluestore_rocksdb_cfs

Throttling
==========

.. confval:: bluestore_throttle_bytes
.. confval:: bluestore_throttle_deferred_bytes
.. confval:: bluestore_throttle_cost_per_io
.. confval:: bluestore_throttle_cost_per_io_hdd
.. confval:: bluestore_throttle_cost_per_io_ssd

SPDK Usage
==========

If you want to use the SPDK driver for NVMe devices, you must prepare your
system. Refer to the `SPDK document`__ for more details.

.. __: http://www.spdk.io/doc/getting_started.html#getting_started_examples

SPDK offers a script to configure the device automatically. Users can run the
script as root::

  $ sudo src/spdk/scripts/setup.sh

You will need to specify the subject NVMe device's device selector with
the "spdk:" prefix for ``bluestore_block_path``.

For example, you can find the device selector of an Intel PCIe SSD with::

  $ lspci -mm -n -D -d 8086:0953

The device selector always has the form of ``DDDD:BB:DD.FF`` or
``DDDD.BB.DD.FF``.

Then set::

  bluestore_block_path = spdk:0000:01:00.0

Where ``0000:01:00.0`` is the device selector found in the output of the
``lspci`` command above.

To run multiple SPDK instances per node, you must specify the amount of
DPDK memory (in MB) that each instance will use, to make sure each
instance uses its own DPDK memory.

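One way to do this is with the ``bluestore_spdk_mem`` option, which sets the
amount of DPDK memory (in MB) that an OSD's SPDK instance claims; for
example, in that OSD's ``ceph.conf`` section (the value shown is
illustrative)::

  bluestore_spdk_mem = 512
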
In most cases, a single device can be used for data, DB, and WAL. We describe
this strategy as *colocating* these components. Be sure to enter the below
settings to ensure that all IOs are issued through SPDK::

  bluestore_block_db_path = ""
  bluestore_block_db_size = 0
  bluestore_block_wal_path = ""
  bluestore_block_wal_size = 0

Otherwise, the current implementation will populate the SPDK map files with
kernel file system symbols and will use the kernel driver to issue DB/WAL IO.